Apache Pig makes running operations against data in Hadoop far easier than coding them in Java, which is the most common way to work with Hadoop data without Pig. Hadoop itself is written in Java. Pig provides a simple syntax, similar to SQL, and turns it into MapReduce jobs for you. These are operations that parse data sets, filter them, change their format, join sets of data, and so on. This kind of work is also called ETL (extract, transform, load).

Here we show some examples of how to use Pig to do MapReduce.

Run ETL on Weather Data

Suppose we have this weather data. These are temperature readings for early 2017 from weather stations around San Francisco. The comma-delimited data looks like this:

STATION,STATION_NAME,DATE,TAVG,TMAX,TMIN,TOBS
GHCND:USC00041967,CONCORD WASTEWATER PLANT CA US,20170101,-9999,55,42,-9999

We download this to the file /root/Downloads/888069.csv.

Now we start Pig. You need to have installed Hadoop first in order to use Pig. We run it with the “local” option in this example to run it in local mode rather than cluster mode. That way we can read local files without first copying them to HDFS (the Hadoop Distributed File System).

pig -x local

Then we load the weather data into a relation named weather using this syntax. PigStorage(',') splits each line on commas, and the AS clause assigns a name and a type to each field. We load the date as a chararray. Pig versions from 0.11 on do have a datetime type, but a plain string is all we need for the pattern matching we do later.

weather = LOAD '/root/Downloads/888069.csv' USING PigStorage(',') AS (STATION:chararray,STATION_NAME:chararray,DATE:chararray,TAVG:int,TMAX:int,TMIN:int,TOBS:int);

Then to look at the data we do the following:

dump weather;

(GHCND:USC00047414,RICHMOND CA US,20170201,-9999,73,43,60)
(GHCND:USC00047414,RICHMOND CA US,20170202,-9999,59,47,57)

It has this structure, which we obtain using the describe command:

describe weather;

weather: {STATION: chararray,STATION_NAME: chararray,DATE: chararray,TAVG: int,TMAX: int,TMIN: int,TOBS: int}

Each record is a tuple, meaning an ordered set of fields, shown separated by commas. Note that Pig is said to be “lazy.” That means it does not retrieve the data in the LOAD step. It only does that when you dump it or otherwise use it. So this is the point at which you would see any errors from the previous step, such as a misspelled input filename.

Now we can filter this data, pulling out January and February weather using a regular expression.

January = FILTER weather BY (DATE matches '201701.*');
February = FILTER weather BY (DATE matches '201702.*');

The January data then looks like this:

(GHCND:USC00047414,RICHMOND CA US,20170123,-9999,-9999,42,48)
(GHCND:USC00047414,RICHMOND CA US,20170124,-9999,-9999,43,53)

Now we pull out just the 4 fields we are interested in.

janTemp = FOREACH January GENERATE (STATION_NAME,DATE,TMAX,TMIN);
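febTemp = FOREACH February GENERATE (STATION_NAME,DATE,TMAX,TMIN);

Notice that each record is no longer a flat tuple. We now have a tuple within a tuple, because the parentheses around the field list tell Pig to build a tuple. You can see the nesting by the two parentheses on either end. (This is not a bag. A bag is a collection of tuples, which Pig displays in curly braces.)

((LAS TRAMPAS CALIFORNIA CA US,20170122,50,38))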
((LAS TRAMPAS CALIFORNIA CA US,20170123,49,36))

You can also confirm this by looking at the structure, where the generated field name contains totuple.

describe janTemp;

janTemp: {org.apache.pig.builtin.totuple_DATE_669: (STATION_NAME: chararray,DATE: chararray,TMAX: int,TMIN: int)}

We cannot run the filter operation on these fields by name as they stand, because they are nested one level down. A map structure, which has keys and values, would let us look up a value by its key instead. So we flatten it back out, which works like the flatten operation in other languages: it takes the object down one level in nesting. So instead of a tuple with a tuple inside it, we just have a tuple of elements. We use $0 to refer to the first element, which here is the inner tuple.

flatJan = FOREACH janTemp GENERATE flatten($0);
flatFeb = FOREACH febTemp GENERATE flatten($0);

Now it looks like a regular tuple.

(RICHMOND CA US,20170201,73,43)
(RICHMOND CA US,20170202,59,47)
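
The steps above can also be collected into a single script. The following is a minimal sketch rather than the only way to do it. It reuses the path and schema from this walkthrough and leaves out the parentheses in GENERATE, which should keep each record as a flat tuple so that no flatten step is needed. The janValid relation and the reading of -9999 as "missing value" are assumptions added for illustration.

```pig
-- Load the CSV with the same path and schema used in the walkthrough.
weather = LOAD '/root/Downloads/888069.csv' USING PigStorage(',')
    AS (STATION:chararray, STATION_NAME:chararray, DATE:chararray,
        TAVG:int, TMAX:int, TMIN:int, TOBS:int);

-- Keep only the January 2017 rows. The header row does not match
-- the pattern, so it is dropped here as well.
January = FILTER weather BY (DATE matches '201701.*');

-- No parentheses around the field list, so Pig does not wrap the
-- fields in an extra tuple and each record stays flat.
janTemp = FOREACH January GENERATE STATION_NAME, DATE, TMAX, TMIN;

-- Assumption: -9999 marks a missing reading, so we drop those rows.
janValid = FILTER janTemp BY TMAX != -9999 AND TMIN != -9999;

-- Nothing is read from disk until this statement runs.
DUMP janValid;
```

If the script is saved under a name such as weather.pig (a hypothetical filename), it can be run in local mode with pig -x local weather.pig instead of typing each statement at the prompt.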