Machine Learning - Spring 2022 - Class 22
All hadoop demos can be found in the hadoop examples directory
admin
- grading update
- Tuesday's class
- Get together in project groups and discuss project ideas for 20 minutes
- Each group will take 1-2 minutes to present and 1-2 minutes for discussion
- Midterm 2 will be posted on Monday
- Covers everything from large margin classifiers up through mapreduce
- won't ask you to write mapreduce code
- open book, open notes
- Final project
Mapreduce framework
- http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/ (this link also has some more nice MapReduce examples)
Look at the NoOpReducer
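Roughly what a no-op (identity) reducer looks like, assuming LongWritable/Text pairs and the old mapred API; the version in the examples directory may be typed differently:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Identity reducer: passes every key/value pair through unchanged.
public class NoOpReducer extends MapReduceBase
        implements Reducer<LongWritable, Text, LongWritable, Text> {

    public void reduce(LongWritable key, Iterator<Text> values,
                       OutputCollector<LongWritable, Text> output,
                       Reporter reporter) throws IOException {
        while (values.hasNext()) {
            output.collect(key, values.next());
        }
    }
}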
Within the code you write, this is how the program flows:
- start in the main method (like all java programs)
- this is still run on whatever machine you ran "hadoop jar" on until it reaches a "runJob" call
- this should be your "driver" (see the driver sketch after this list)
- runJob = start running on the hadoop cluster
- run the map phase
- the mapper is instantiated using the constructor
- needs to have a zero-parameter constructor! (if you don't provide one, java does this by default)
- what if your mapper needs a parameter?
- the configure method
- called before any of the calls to the map function
- it is passed the JobConf configuration object you constructed in your driver
- the JobConf object allows you to set arbitrary attributes
- the general "set" method sets a string value
- other "set" methods exist for setting other types:
- setBoolean
- setInt
- setFloat
- setLong
- you can then grab these "set" attributes in the configure method using get
- good practice to use a global variable for the name of this configuration parameter
- finally, for each item in the input, the map function is called
- the combiner and reducer are then run in a similar fashion: each is instantiated, configure is called, and then the reduce method is called
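A minimal driver sketch showing this flow. The attribute names are made up, and hadoop's built-in IdentityMapper/IdentityReducer stand in for whatever mapper/reducer you write, so this compiles on its own:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

// Hypothetical driver illustrating the flow described above.
public class ExampleDriver {

    // good practice: constants for the attribute names, shared with the mapper/reducer
    public static final String SEARCH_WORD = "example.search.word";
    public static final String MIN_COUNT = "example.min.count";

    public static void main(String[] args) throws Exception {
        // everything up to runJob executes on whatever machine ran "hadoop jar"
        JobConf conf = new JobConf(ExampleDriver.class);
        conf.setJobName("example");

        conf.setMapperClass(IdentityMapper.class);      // stand-in for your own mapper
        conf.setReducerClass(IdentityReducer.class);    // stand-in for your own reducer

        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        // arbitrary attributes your mapper/reducer can read back in configure()
        conf.set(SEARCH_WORD, args[2]);
        conf.setInt(MIN_COUNT, 5);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // runJob hands the job off to the hadoop cluster: map, combine, reduce
        JobClient.runJob(conf);
    }
}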
Combiners
- Combiners are sometimes called mini-reducers
- They run on the same machine as the map call and only reduce locally
- They are often used to speed up processing by doing an initial reduction before having to sort and redistribute the data
- In many cases, we can use the reducer as the combiner (though not always)
- Look at using a combiner for the WordCount code
- In this case, we can just use the reducer as the combiner
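Registering the reducer as the combiner is one line in the driver. A sketch of a word count driver with a combiner; hadoop's built-in TokenCountMapper and LongSumReducer stand in for the WordCount mapper/reducer from class so the sketch is self-contained:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.LongSumReducer;
import org.apache.hadoop.mapred.lib.TokenCountMapper;

// Word count with a combiner.
public class WordCountWithCombiner {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountWithCombiner.class);
        conf.setJobName("wordcount");

        conf.setMapperClass(TokenCountMapper.class);
        // the reducer doubles as the combiner: counts are partially summed on each
        // map machine before the data is sorted and shipped to the reducers
        conf.setCombinerClass(LongSumReducer.class);
        conf.setReducerClass(LongSumReducer.class);

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}

Summing counts is associative and commutative, which is why the same class works in both roles here.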
Write grep (search for occurrences of text in a file)
- Key highlight for this example: how we can pass some data that is shared across all map/reduce calls
- Map:
- Input
- key: LongWritable
- value: Text
- Output
- key: LongWritable (byte offset)
- value: Text (line with an occurrence of the word)
- Check if the word being searched for is in the input text. If so, output key/value pair.
- Reduce: NoOpReducer... we're already done!
- How do we get the word to each of the map method calls?
- Use the configure method and an attribute we set in the JobConf
Look at the Grep code
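A sketch of what the grep mapper might look like; the class and attribute names here are made up and the real Grep code may differ:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Grep mapper: keeps only the lines that contain the search word.
public class GrepMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, LongWritable, Text> {

    // good practice: a constant for the attribute name, shared with the driver,
    // which would call conf.set(GrepMapper.SEARCH_WORD, word)
    public static final String SEARCH_WORD = "grep.search.word";

    private String searchWord;

    // called once before any map calls; passed the JobConf built in the driver
    public void configure(JobConf job) {
        searchWord = job.get(SEARCH_WORD);
    }

    // key = byte offset of the line, value = the line itself
    public void map(LongWritable key, Text value,
                    OutputCollector<LongWritable, Text> output,
                    Reporter reporter) throws IOException {
        if (value.toString().contains(searchWord)) {
            output.collect(key, value);   // emit the line with an occurrence
        }
    }
}

Note that no explicit constructor is needed: java's default zero-parameter constructor is enough, and the search word arrives through configure rather than a constructor argument.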
Look at the SimpleWordFreqHistogram code
- Takes the output from WordCount and counts how often each frequency occurs
- Note the use of separate files/classes for the mapper, reducer and driver
- Note also the use of the generic "SumReducer" reducer
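A sketch of a generic sum reducer, assuming Text keys and IntWritable counts; the actual SumReducer in the examples may be typed differently (hadoop also ships a similar built-in LongSumReducer):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Generic reducer that adds up all the integer counts for a key.
// Reusable anywhere the reduce step is "sum the values".
public class SumReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}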
Look at the WordFreqHistogram code for the full pipeline
- We can run multiple map reduce jobs by calling their driver methods in series, in this case WordCount and then SimpleWordFreqHistogram
- We use an intermediate directory (working) to connect the output of the first job to the input of the second.
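A sketch of that pipeline pattern. The class and method names are made up, and hadoop's built-in TokenCountMapper/LongSumReducer stand in for the WordCount mapper and SumReducer from class; the point is two runJob calls in series, connected by the working directory:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.LongSumReducer;
import org.apache.hadoop.mapred.lib.TokenCountMapper;

// Hypothetical two-job pipeline: word count, then a histogram of the counts.
public class WordFreqHistogramPipeline {

    // job 2's mapper: reads "word<TAB>count" lines from job 1 and emits (count, 1)
    public static class FrequencyMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, LongWritable> output,
                        Reporter reporter) throws IOException {
            String[] parts = value.toString().split("\t");
            if (parts.length == 2) {
                output.collect(new Text(parts[1]), ONE);   // key = the word's frequency
            }
        }
    }

    // job 1: word count -- writes word<TAB>count pairs into the working directory
    static void runWordCount(String input, String working) throws IOException {
        JobConf conf = new JobConf(WordFreqHistogramPipeline.class);
        conf.setJobName("wordcount");
        conf.setMapperClass(TokenCountMapper.class);
        conf.setCombinerClass(LongSumReducer.class);
        conf.setReducerClass(LongSumReducer.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);
        FileInputFormat.setInputPaths(conf, new Path(input));
        FileOutputFormat.setOutputPath(conf, new Path(working));
        JobClient.runJob(conf);   // blocks until job 1 finishes
    }

    // job 2: histogram of the counts -- reads job 1's output from the working directory
    static void runHistogram(String working, String output) throws IOException {
        JobConf conf = new JobConf(WordFreqHistogramPipeline.class);
        conf.setJobName("wordfreqhistogram");
        conf.setMapperClass(FrequencyMapper.class);
        conf.setReducerClass(LongSumReducer.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);
        FileInputFormat.setInputPaths(conf, new Path(working));
        FileOutputFormat.setOutputPath(conf, new Path(output));
        JobClient.runJob(conf);
    }

    public static void main(String[] args) throws Exception {
        runWordCount(args[0], args[1]);   // args[1] is the intermediate working directory
        runHistogram(args[1], args[2]);
    }
}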
Look at the LineIndexer code
- Key: use the reporter to get information about the file being processed
- Sometimes we can't avoid using an in-memory data structure in the reducer (or we could do the deduplication as another mapreduce phase)
- This builds an inverted index, which is a key structure for how search engines work
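A rough sketch of the idea; the class names and details here are assumptions, not the actual LineIndexer code:

import java.io.IOException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Inverted index ("line indexer") sketch.
public class LineIndexSketch {

    // Map: for each word in the line, emit (word, filename)
    public static class IndexMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable key, Text value,
                        OutputCollector<Text, Text> output,
                        Reporter reporter) throws IOException {
            // the reporter tells us which input split (and therefore which file) we're in
            FileSplit split = (FileSplit) reporter.getInputSplit();
            Text filename = new Text(split.getPath().getName());

            for (String word : value.toString().split("\\s+")) {
                if (word.length() > 0) {
                    output.collect(new Text(word), filename);
                }
            }
        }
    }

    // Reduce: collect the set of files containing each word.
    // The HashSet is the in-memory data structure mentioned above; the
    // deduplication could instead be done as another mapreduce phase.
    public static class IndexReducer extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterator<Text> values,
                           OutputCollector<Text, Text> output,
                           Reporter reporter) throws IOException {
            Set<String> files = new HashSet<String>();
            while (values.hasNext()) {
                files.add(values.next().toString());
            }
            output.collect(key, new Text(files.toString()));
        }
    }
}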