This white paper by Dave Jaffe demonstrates the analysis of large datasets using three different tools from the Hadoop ecosystem: MapReduce, Hive and Pig. The application is a geographic and temporal analysis of Apache web logs. The problem is explained in depth, and solutions are then shown for each of the three tools. Complete code is included in the Appendices, along with descriptions of the GeoWeb Apache Log Generator tool and of the R methods used to analyze and plot the results. Results are shown for all three tools on both a 1TB and a 10TB set of log files.
Three popular tools for analyzing data resident in the Hadoop Distributed File System (HDFS) are MapReduce, Hive and Pig. MapReduce requires a computer program (often written in Java) to read, analyze and output the data. Hive provides a SQL-like front end for those with a database background. Pig provides a high-level language for data processing that also enables the user to exploit the parallelism inherent in a Hadoop cluster. Hive and Pig generate MapReduce jobs to perform the actual analysis.
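Since all three tools ultimately execute the same map-and-reduce pattern, it can help to see that pattern in isolation. The following is a minimal, self-contained sketch in Python (mimicking the mapper/reducer split used by Hadoop Streaming); the log lines and the choice of counting by status code are illustrative assumptions, not the paper's actual workload.

```python
from collections import defaultdict

# Sample Apache-style log lines (hypothetical data for illustration).
log_lines = [
    '10.0.0.1 - - [01/Jan/2013:10:00:01 -0800] "GET /index.html HTTP/1.1" 200 1043',
    '10.0.0.2 - - [01/Jan/2013:10:00:02 -0800] "GET /missing HTTP/1.1" 404 503',
    '10.0.0.1 - - [01/Jan/2013:10:00:03 -0800] "GET /about.html HTTP/1.1" 200 2211',
]

def mapper(line):
    # Like a Streaming mapper: emit a (key, 1) pair per input record.
    fields = line.split()
    yield fields[-2], 1  # HTTP status code is the second-to-last field

def reducer(pairs):
    # Like a Streaming reducer: sum the counts for each key.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

counts = reducer(kv for line in log_lines for kv in mapper(line))
print(counts)  # {'200': 2, '404': 1}
```

On a real cluster the mapper and reducer run in parallel across HDFS blocks; Hive and Pig spare the user from writing this plumbing by generating equivalent jobs from a query or script.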
In this paper the three approaches are contrasted using a popular use case for Hadoop: Apache web log analysis.
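Whatever the tool, a web log analysis starts by parsing each entry into usable fields. The following sketch parses one line of the Apache combined log format and extracts the client IP and hour of day, the inputs a geographic and temporal analysis would need; the regular expression and field names are assumptions for illustration, not the paper's actual code (which is in the Appendices).

```python
import re
from datetime import datetime

# Regex for the leading fields of an Apache combined/common log entry.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+)'
)

def parse_line(line):
    # Return the fields needed for geographic/temporal analysis,
    # or None if the line does not match the expected format.
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    ts = datetime.strptime(m.group('ts'), '%d/%b/%Y:%H:%M:%S %z')
    return {'ip': m.group('ip'), 'hour': ts.hour, 'status': int(m.group('status'))}

rec = parse_line('93.184.216.34 - - [01/Jan/2013:14:30:05 -0800] '
                 '"GET /products/item7 HTTP/1.1" 200 4512')
print(rec)  # {'ip': '93.184.216.34', 'hour': 14, 'status': 200}
```

The geographic side of the analysis would then map each IP to a location (e.g. via a GeoIP lookup), while the temporal side aggregates on the extracted hour.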
The code used in this white paper is now available on github.com/.../BigDataDemos