Big Data Text Automation on Small Machines

Dealing with Big Data may require powerful machines and vast storage. By focusing on a specific type of data, however, it is possible to use commodity computers in a small lab environment to handle Big Data generated from large, complex graphs such as online social networks (OSNs) and other types of Internet applications, and to produce useful information. We are focusing on text data in…

Read More

Distributed or Network-Based Big Data?

Introduction: Big Data is a term closely associated with the development of the Internet. Because such a global network exists, it is possible to share and collect this huge amount of data, as well as to provide Big Data processing and analytics as a service to a broader audience throughout the entire network. Big Data means data that’s too big, too fast, or too hard for existing tools…

Read More

MapReduce/Hadoop solution to secondary sort

This section provides a complete MapReduce implementation of the secondary sort problem using the Hadoop framework.

Input: a set of files, where each record (line) has the following format:
Format: <year><,><month><,><day><,><temperature>
Example:
2012, 01, 01, 35
2011, 12, 23, -4

Expected output:
Format: <year><-><month>: <temperature1><,><temperature2><,> …
where temperature1 <= temperature2 <= …
Example: 2012-01: 5,…
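As a hedged sketch of what the full Hadoop job computes, the transformation above can be simulated in a single process. The function name and in-memory approach here are illustrative only, not the actual Hadoop implementation:

```python
from collections import defaultdict

def secondary_sort(lines):
    """Group temperatures by year-month and emit each group in ascending order."""
    groups = defaultdict(list)
    for line in lines:
        year, month, _day, temp = (field.strip() for field in line.split(","))
        groups[f"{year}-{month}"].append(int(temp))
    # In Hadoop, this sort happens during the shuffle via a composite key;
    # here we simply sort each group's value list in memory.
    return {key: sorted(temps) for key, temps in sorted(groups.items())}

records = [
    "2012, 01, 01, 35",
    "2011, 12, 23, -4",
    "2012, 01, 02, 5",
]
print(secondary_sort(records))
# {'2011-12': [-4], '2012-01': [5, 35]}
```

The in-memory sort is exactly what secondary sort avoids at scale, which is why the real job pushes the ordering into the framework's shuffle phase.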

Read More

Secondary sort: Introduction

A secondary sort problem relates to sorting values associated with a key in the reduce phase. Sometimes, it is called value-to-key conversion. The secondary sorting technique will enable us to sort the values (in ascending or descending order) passed to each reducer. Concrete examples will be provided of how to achieve secondary sorting in ascending or descending order. The goal is to implement the Secondary Sort design pattern in MapReduce/Hadoop and Spark….
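The essence of value-to-key conversion can be sketched in a few lines: the value (temperature) is promoted into a composite key, so the framework's shuffle sort orders the values for free, while a custom partitioner keeps all records for one natural key on the same reducer. The names and record format below are illustrative assumptions:

```python
def to_composite(record):
    # Promote the value (temperature) into the key: (natural_key, value).
    year, month, _day, temp = (f.strip() for f in record.split(","))
    return ((f"{year}-{month}", int(temp)), int(temp))

def partition(composite_key, num_reducers):
    # Partition on the natural key only, so every temperature for a given
    # year-month reaches the same reducer despite differing composite keys.
    natural_key, _value = composite_key
    return hash(natural_key) % num_reducers

pairs = sorted(to_composite(r) for r in ["2012, 01, 01, 35", "2012, 01, 02, 5"])
print([value for _key, value in pairs])  # [5, 35]
```

Sorting the composite keys orders the values in ascending order; reversing the comparison on the value component would yield descending order instead.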

Read More

Hadoop and Spark

Hadoop is the de facto standard for implementation of MapReduce applications. It is composed of one or more master nodes and any number of slave nodes. Hadoop simplifies distributed applications with the philosophy that “the data center is the computer,” and by providing map() and reduce() functions (defined by the programmer) that allow application developers to utilize those data centers. Hadoop implements the MapReduce paradigm efficiently and is quite…
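The map()/reduce() contract described above can be illustrated with a minimal single-machine simulation, assuming a word-count job for concreteness. The driver below stands in for the framework's shuffle and sort; only map_fn and reduce_fn mirror what the programmer would write:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    # Programmer-defined map(): emit (key, value) pairs per input record.
    for word in line.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Programmer-defined reduce(): combine all values for one key.
    yield (key, sum(values))

def run_job(lines):
    # Map phase
    mapped = [pair for line in lines for pair in map_fn(line)]
    # Shuffle/sort phase: group intermediate pairs by key (done by Hadoop)
    mapped.sort(key=itemgetter(0))
    out = {}
    for key, group in groupby(mapped, key=itemgetter(0)):
        for k, v in reduce_fn(key, (value for _, value in group)):
            out[k] = v
    return out

print(run_job(["big data", "big cluster"]))
# {'big': 2, 'cluster': 1, 'data': 1}
```

In a real cluster, the map and reduce phases run in parallel across the slave nodes, and the shuffle moves intermediate pairs over the network rather than through an in-memory sort.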

Read More

What is MapReduce?

MapReduce is a programming paradigm that allows for massive scalability across hundreds or thousands of servers in a cluster environment. The term MapReduce originated from functional programming and was introduced by Google in a paper called “MapReduce: Simplified Data Processing on Large Clusters.” Google’s MapReduce implementation is a proprietary solution and has not yet been released to the public. A simple view of the MapReduce process is illustrated in the figure…
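The functional-programming roots mentioned above are visible in Python's built-in map and reduce, which is a small illustration rather than the distributed system itself: map transforms each element independently (hence trivially parallelizable), and reduce folds the results into a single value:

```python
from functools import reduce

# map: apply a function to every element independently.
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))
# reduce: fold the mapped results into one value with an accumulator.
total = reduce(lambda acc, x: acc + x, squares, 0)
print(squares, total)  # [1, 4, 9, 16] 30
```

Google's contribution was not these two functions but the runtime around them: automatic partitioning, scheduling, fault tolerance, and the shuffle between the two phases.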

Read More

Resource estimation and optimization

So far, the cardinalities and distributions that characterize the data have been discussed. Here, we assess the task at hand in terms of the computational workload relative to the resources at our disposal. To estimate resource requirements, let’s start with some measurements. First, consider the resources available. So far, we’ve been using a single m4.2xlarge Amazon EC2 instance. Let’s decode that quickly. EC2 is Amazon’s…
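A back-of-the-envelope estimate of how the workload fits the instance might look like the sketch below. The m4.2xlarge figures (8 vCPUs, 32 GiB RAM) match AWS's published specs; the per-record size is an assumed number purely for illustration:

```python
# Instance resources (per AWS's m4.2xlarge specifications)
VCPUS = 8
RAM_GIB = 32

# Workload measurements (record size is an assumption for illustration)
observations = 9_000_000        # sample size used in the next section
bytes_per_observation = 200     # assumed average in-memory record size

sample_gib = observations * bytes_per_observation / 2**30
print(f"Sample needs ~{sample_gib:.2f} GiB of {RAM_GIB} GiB RAM")
# Sample needs ~1.68 GiB of 32 GiB RAM
```

The point of such arithmetic is to decide early whether the sample fits in memory on one machine, or whether the job must be split across instances or spilled to disk.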

Read More

Size and shape of the data

We’ll start with a sample of 9 million observations, a small-enough sample to fit into memory in order to do some quick calculations of cardinality and distributions. Fortunately, most users never visit most of the domains, so the user/item matrix is sparsely populated, and there are tools at our disposal for dealing with large, sparse matrices. And nobody said that users and domains must be the rows and columns of…
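One way to exploit that sparsity is to store only the nonzero cells of the user/domain matrix, sketched below with a plain dict keyed by (user, domain). The sample data and resulting density are illustrative, not drawn from the real dataset:

```python
# Sparse user/domain visit counts: only observed (user, domain) cells
# are stored, instead of a dense n_users x n_domains array.
visits = {}

def record_visit(user, domain):
    visits[(user, domain)] = visits.get((user, domain), 0) + 1

record_visit("u1", "example.com")
record_visit("u1", "example.com")
record_visit("u2", "news.org")

n_users = len({u for u, _ in visits})
n_domains = len({d for _, d in visits})
density = len(visits) / (n_users * n_domains)
print(len(visits), density)  # 2 0.5
```

With millions of users and domains, the dense matrix would be astronomically large while the stored entries grow only with the number of observed visits, which is exactly the trade-off sparse-matrix libraries formalize.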

Read More