Machine Learning - Fall 2016 - Class 21
admin
- assignment 9
- Office hours next week
- No office hours Tuesday and possibly Wednesday
- Tuesday (11/15)
- Assignment 9 workday
MapReduce framework
- http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
- (this link also has some more nice MapReduce examples)
Within the code you write, this is how the program flows:
- start in the main method (like all java programs)
- this still runs on whatever machine you ran "hadoop jar" on, until it reaches a "runJob" call
- this part should be your "driver"
- runJob = start running on the hadoop cluster
- run the map phase
- the mapper is instantiated using the constructor
- needs to have a zero-parameter constructor! (if you don't declare any constructor at all, java provides one by default)
- what if your mapper needs a parameter?
- the configure method
- called before any of the calls to the map function
- it is passed the JobConf configuration object you constructed in your driver
- the JobConf object allows you to set arbitrary attributes
- the general "set" method, sets a string
- other "set" methods exist, though, for setting other types:
- setBoolean
- setInt
- setFloat
- setLong
- you can then grab these "set" attributes in the configure method using get
- good practice to store the name of this configuration parameter in a single shared constant (e.g. a static final String), so the driver and the mapper use the same name
- finally, for each item in the input, the map function is called
- the combiner and reducer are then run in similar fashion: each is instantiated, then configure is called, then the reduce method
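The flow above can be sketched in plain Java. This is only an illustration under stated assumptions: it does not use the Hadoop classes, a Map<String, String> stands in for the JobConf object, and LengthFilterMapper, runMapPhase and the parameter name are hypothetical names made up for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for the flow described above; no Hadoop classes are used.
class LifecycleSketch {
    // Name of the configuration parameter, kept in one constant so the
    // driver and the mapper cannot disagree on it (hypothetical name).
    static final String MIN_LENGTH_PARAM = "sketch.min.length";

    // Hypothetical mapper: keeps only words with at least minLength characters.
    static class LengthFilterMapper {
        private int minLength;

        // Zero-parameter constructor: the framework creates the mapper itself,
        // so it has no way of passing constructor arguments.
        public LengthFilterMapper() {}

        // Called once, before any call to map; reads what the driver set.
        public void configure(Map<String, String> conf) {
            minLength = Integer.parseInt(conf.getOrDefault(MIN_LENGTH_PARAM, "0"));
        }

        // Called once for each item in the input.
        public void map(String word, List<String> output) {
            if (word.length() >= minLength) {
                output.add(word);
            }
        }
    }

    // Stand-in for what the framework does once the driver submits the job.
    static List<String> runMapPhase(Map<String, String> conf, List<String> input) {
        LengthFilterMapper mapper;
        try {
            // Instantiate through the zero-parameter constructor.
            mapper = LengthFilterMapper.class.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("mapper needs a zero-parameter constructor", e);
        }
        mapper.configure(conf);
        List<String> output = new ArrayList<>();
        for (String item : input) {
            mapper.map(item, output);
        }
        return output;
    }

    // The "driver": builds the configuration, then hands off to the framework.
    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(MIN_LENGTH_PARAM, "4");
        System.out.println(runMapPhase(conf, List.of("map", "reduce", "job", "combiner")));
    }
}
```

The point of the sketch is the order of calls: constructor, then configure once, then map once per input item. The parameter travels through the configuration because the constructor cannot take it.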
Combiners
- Combiners are sometimes called mini-reducers
- They run on the same machine as the map call and only reduce locally
- They are often used to speed up processing by doing an initial reduction before having to sort and redistribute the data
- In many cases, can use the reducer as the combiner (though not always)
- Look at using a combiner for the WordCount code
- In this case, we can just use the reducer as the combiner
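To see why reusing the reducer as the combiner works for word counting, here is a small plain-Java simulation. It is a sketch, not the course's WordCount code: each string in the input list plays the role of the data handled by one map machine, and the names map, sum and pairsSentToReducer are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of WordCount with and without a combiner.
class CombinerSketch {
    // Map: emit (word, 1) for every word in one split of the input.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Sum the values per key. The same function serves as combiner and reducer.
    static Map<String, Integer> sum(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    // Collect the pairs that would have to be sorted and sent to the reducer.
    static List<Map.Entry<String, Integer>> pairsSentToReducer(List<String> splits, boolean useCombiner) {
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>();
        for (String split : splits) { // one split per simulated map machine
            List<Map.Entry<String, Integer>> local = map(split);
            if (useCombiner) {
                // Local reduction: at most one pair per word leaves this machine.
                local = new ArrayList<>(sum(local).entrySet());
            }
            shuffled.addAll(local);
        }
        return shuffled;
    }

    public static void main(String[] args) {
        List<String> splits = List.of("a b a a", "b a b");
        List<Map.Entry<String, Integer>> plain = pairsSentToReducer(splits, false);
        List<Map.Entry<String, Integer>> combined = pairsSentToReducer(splits, true);
        System.out.println("pairs without combiner: " + plain.size());
        System.out.println("pairs with combiner:    " + combined.size());
        System.out.println("same final counts:      " + sum(plain).equals(sum(combined)));
    }
}
```

Summing partial sums gives the same total as summing everything at once, which is why the reducer can double as the combiner here. A reducer that computes, say, an average could not be reused this way, since an average of averages is generally not the overall average.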
Write grep (search for occurrences of text in a file)
- Key highlight for this example: how we can pass some data that is shared across all map/reduce calls
- Map:
- Input
- key: LongWritable (byte offset of the line)
- value: Text (the line itself)
- Output
- key: LongWritable (byte offset)
- value: Text (line with an occurrence of the word)
- Check if the word being searched for is in the input text. If so, output key/value pair.
- Reduce: NoOpReducer... we're already done!
- How do we get the word to each of the map method calls?
- Use the configure method and an attribute we set in the JobConf
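A minimal plain-Java sketch of this grep design follows. It is not the course's Grep code and it does not use Hadoop: a Map<String, String> stands in for the JobConf, a long and a String stand in for LongWritable and Text, and the parameter name and the grep helper are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java sketch of grep as a map-only job.
class GrepSketch {
    // Hypothetical parameter name, shared by the driver and the mapper.
    static final String SEARCH_WORD_PARAM = "grep.search.word";

    static class GrepMapper {
        private String searchWord;

        // Runs once before any map call; picks up the word set by the driver.
        public void configure(Map<String, String> conf) {
            searchWord = conf.get(SEARCH_WORD_PARAM);
        }

        // key: byte offset of the line, value: the line.
        // Emits the pair unchanged when the line contains the search word.
        public void map(long offset, String line, Map<Long, String> output) {
            if (line.contains(searchWord)) {
                output.put(offset, line);
            }
        }
    }

    // Driver plus simulated framework: set the attribute, then feed every line
    // to the mapper. No reduce step is needed.
    static Map<Long, String> grep(String text, String word) {
        Map<String, String> conf = new HashMap<>();
        conf.put(SEARCH_WORD_PARAM, word);

        GrepMapper mapper = new GrepMapper();
        mapper.configure(conf);

        Map<Long, String> output = new LinkedHashMap<>();
        long offset = 0;
        for (String line : text.split("\n")) {
            mapper.map(offset, line, output);
            // Advance past this line and its newline character.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(grep("the cat\na dog\ncat nap\n", "cat"));
    }
}
```

The search word is read once in configure and stored in a field, so every map call on that mapper instance sees it without it being part of the input data.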
Look at the Grep code
Look at the SimpleWordFreqHistogram code
- Takes the output from WordCount and counts how many words occur with each frequency
- Note the use of separate files/classes for the mapper, reducer and driver
- Note also the use of the generic "SumReducer" reducer
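The histogram step can be sketched in plain Java as a map function plus a generic summing reducer. This is an illustration, not the course's SimpleWordFreqHistogram code; it assumes the WordCount output consists of lines of the form word, tab, count, and the method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of a word-frequency histogram over WordCount output.
class WordFreqHistogramSketch {
    // Map: parse "word<TAB>count" and emit (count, 1).
    // The word itself is dropped; only its frequency matters here.
    static Map.Entry<Integer, Integer> map(String wordCountLine) {
        String[] fields = wordCountLine.split("\t");
        return Map.entry(Integer.parseInt(fields[1].trim()), 1);
    }

    // Generic sum reducer: works for any key type, which is why the same
    // reducer can serve both WordCount and the histogram.
    static <K extends Comparable<K>> Map<K, Integer> sumReduce(List<Map.Entry<K, Integer>> pairs) {
        Map<K, Integer> totals = new TreeMap<>();
        for (Map.Entry<K, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    // Simulated job: map every line, then reduce all emitted pairs.
    static Map<Integer, Integer> histogram(List<String> wordCountLines) {
        List<Map.Entry<Integer, Integer>> pairs = new ArrayList<>();
        for (String line : wordCountLines) {
            pairs.add(map(line));
        }
        return sumReduce(pairs);
    }

    public static void main(String[] args) {
        List<String> wordCounts = List.of("the\t3", "cat\t1", "dog\t1", "a\t3", "nap\t2");
        // Each entry reads: frequency -> number of words with that frequency.
        System.out.println(histogram(wordCounts));
    }
}
```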
Look at the WordFreqHistogram code for the full pipeline
- We can run multiple map reduce jobs by calling their driver methods in series, in this case WordCount and then SimpleWordFreqHistogram
- We use an intermediate directory (working) to connect the first job to the second: it is the output directory of the first job and the input directory of the second.
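Chaining two jobs through a working directory can be sketched with ordinary files. The example below is plain Java with no Hadoop involved; the two "jobs" are ordinary methods, and the directory layout, the file name part-00000 and the tab-separated line format are assumptions made for the illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch: two jobs run in series, connected by a working directory.
class PipelineSketch {
    // Read every line of every file in a directory (a job's input).
    static List<String> readLines(Path dir) throws IOException {
        List<String> lines = new ArrayList<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path file : files) {
                lines.addAll(Files.readAllLines(file));
            }
        }
        return lines;
    }

    // Write key<TAB>value lines into a directory (a job's output).
    static void writeCounts(Path dir, Map<?, Integer> counts) throws IOException {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<?, Integer> entry : counts.entrySet()) {
            lines.add(entry.getKey() + "\t" + entry.getValue());
        }
        Files.createDirectories(dir);
        Files.write(dir.resolve("part-00000"), lines);
    }

    // Job 1: count how often each word occurs.
    static void wordCountJob(Path in, Path out) throws IOException {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : readLines(in)) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        writeCounts(out, counts);
    }

    // Job 2: count how many words share each frequency.
    static void histogramJob(Path in, Path out) throws IOException {
        Map<Integer, Integer> histogram = new TreeMap<>();
        for (String line : readLines(in)) {
            histogram.merge(Integer.parseInt(line.split("\t")[1]), 1, Integer::sum);
        }
        writeCounts(out, histogram);
    }

    // Driver for the whole pipeline: job 1 writes to "working", job 2 reads it.
    static List<String> runPipeline(List<String> inputLines) {
        try {
            Path root = Files.createTempDirectory("pipeline-sketch");
            Path input = root.resolve("input");
            Path working = root.resolve("working");
            Path output = root.resolve("output");
            Files.createDirectories(input);
            Files.write(input.resolve("doc.txt"), inputLines);

            wordCountJob(input, working);
            histogramJob(working, output);
            return readLines(output);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        runPipeline(List.of("a b a", "b c")).forEach(System.out::println);
    }
}
```

The second job never sees the original text. Its only input is whatever the first job left in the working directory, so the two jobs must agree on that directory and on the line format.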
Look at the LineIndexer code
- Key: use the reporter to get information about the file being processed
- Sometimes we can't avoid using a data structure (the alternative would be to remove duplicates in another MapReduce phase)
- This builds an inverted index, which is a key structure for how search engines work
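An inverted index maps each word to the files that contain it. The plain-Java sketch below illustrates the idea under stated assumptions: it is not the course's LineIndexer code, the file name is passed to map directly (in the notes it comes from the reporter), and the set used for deduplication is placed in the reduce step as one possible choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Plain-Java sketch of building an inverted index: word -> files containing it.
class LineIndexerSketch {
    // Map: for one line of one file, emit (word, fileName) for every word.
    static List<Map.Entry<String, String>> map(String fileName, String line) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, fileName));
            }
        }
        return pairs;
    }

    // Reduce: a word may be emitted many times for the same file, so the file
    // names are collected in a set to drop the duplicates.
    static Set<String> reduce(List<String> fileNames) {
        return new TreeSet<>(fileNames);
    }

    // Simulated job over files given as fileName -> lines.
    static Map<String, Set<String>> buildIndex(Map<String, List<String>> files) {
        // Simulated shuffle: group the emitted file names by word.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, List<String>> file : files.entrySet()) {
            for (String line : file.getValue()) {
                for (Map.Entry<String, String> pair : map(file.getKey(), line)) {
                    grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
                }
            }
        }
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : grouped.entrySet()) {
            index.put(entry.getKey(), reduce(entry.getValue()));
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, List<String>> files = new TreeMap<>();
        files.put("a.txt", List.of("the cat the"));
        files.put("b.txt", List.of("the dog"));
        System.out.println(buildIndex(files));
    }
}
```

Looking up a query word in such an index returns the matching files directly, without scanning any file contents.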