Machine Learning - Fall 2016

Machine Learning - Fall 2016 - Class 21

admin
   - assignment 9

   - Office hours next week
      - No office hours Tuesday and possibly Wednesday

   - Tuesday (11/15)
      - Assignment 9 workday

MapReduce framework
   - http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
   - (this link also has some more nice MapReduce examples)

Within the code you write, this is how the program flows:
   - start in the main method (like all Java programs)
      - this is still run on whatever machine you ran "hadoop jar" on until it reaches a "runJob" call
      - this should be your "driver"
   - runJob = start running on the hadoop cluster
   - run the map phase
      - the mapper is instantiated using the constructor
         - needs to have a zero-parameter constructor! (if you don't define any constructor, Java supplies one by default)
         - what if your mapper needs a parameter?
      - the configure method
         - called before any of the calls to the map function
         - it is passed the JobConf configuration object you constructed in your driver
         - the JobConf object allows you to set arbitrary attributes
            - the general "set" method, sets a string
            - other "set" methods exist, though, for setting other types:
               - setBoolean
               - setInt
               - setFloat
               - setLong
         - you can then grab these "set" attributes in the configure method using get
            - good practice to use a shared constant (e.g., a static final String) for the name of this configuration parameter
      - finally, for each item in the input, the map function is called
   - the combiner and reducer are then run in similar fashion: instantiated, then configure is called, then the reduce method
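
The flow above can be sketched in plain Java, without Hadoop on the classpath. This is only a stand-in: Conf, LongWordMapper, and the parameter name "sketch.min.length" are made up for illustration and are not Hadoop classes or settings. It shows the order of calls (construct, configure, then map per item) and how a parameter set in the driver reaches the mapper.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java stand-in for the lifecycle; none of these types are Hadoop classes.
class LifecycleSketch {
    // Hypothetical name of the configuration parameter, kept in one constant.
    static final String MIN_LENGTH_PARAM = "sketch.min.length";

    // Stand-in for JobConf: string attributes plus a typed helper.
    static class Conf {
        private final Map<String, String> attrs = new HashMap<>();
        void set(String name, String value) { attrs.put(name, value); }
        void setInt(String name, int value) { attrs.put(name, Integer.toString(value)); }
        String get(String name) { return attrs.get(name); }
        int getInt(String name, int defaultValue) {
            String v = attrs.get(name);
            return v == null ? defaultValue : Integer.parseInt(v);
        }
    }

    // Mapper with only the implicit zero-parameter constructor.
    static class LongWordMapper {
        private int minLength;
        final List<String> output = new ArrayList<>();

        // Called once, before any call to map.
        void configure(Conf conf) { minLength = conf.getInt(MIN_LENGTH_PARAM, 0); }

        // Called once per input item; keeps words of at least minLength characters.
        void map(String word) { if (word.length() >= minLength) output.add(word); }
    }

    // Plays the role of the framework: instantiate, configure, then map each item.
    static List<String> run(Conf conf, List<String> input) {
        LongWordMapper mapper = new LongWordMapper(); // zero-parameter constructor
        mapper.configure(conf);
        for (String item : input) mapper.map(item);
        return mapper.output;
    }

    public static void main(String[] args) {
        Conf conf = new Conf(); // built in the "driver"
        conf.setInt(MIN_LENGTH_PARAM, 4);
        // prints [reduce, hadoop]
        System.out.println(run(conf, Arrays.asList("map", "reduce", "job", "hadoop")));
    }
}
```

Since the mapper can't take constructor arguments, the configuration object is the only channel from the driver to the mapper in this sketch.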

Combiners
   - Combiners are sometimes called mini-reducers
   - They run on the same machine as the map call and only reduce locally
   - They are often used to speed up processing by doing an initial reduction before having to sort and redistribute the data
   - In many cases, you can use the reducer as the combiner (though not always, e.g., it works for sums but not for averages)
   - Look at using a combiner for WordCount code
      - In this case, we can just use the reducer as the combiner
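
To see why a combiner helps for WordCount, here is a small plain-Java simulation (not Hadoop code; CombinerSketch and its methods are hypothetical names). Each string plays the role of one input split handled by one map machine, and the same sum function is used as both combiner and reducer. The final counts are identical either way, but fewer pairs need to be sorted and redistributed when the combiner runs.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simulation only: counts how many (word, count) pairs would leave the map machines.
class CombinerSketch {
    static final class Result {
        final Map<String, Integer> totals;
        final int shuffledPairs;
        Result(Map<String, Integer> totals, int shuffledPairs) {
            this.totals = totals;
            this.shuffledPairs = shuffledPairs;
        }
    }

    // Map phase for one split: emit (word, 1) for every word.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : split.split("\\s+")) out.add(new AbstractMap.SimpleEntry<>(w, 1));
        return out;
    }

    // WordCount reduce: sum the counts per word. Also usable as the combiner.
    static Map<String, Integer> sum(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) totals.merge(p.getKey(), p.getValue(), Integer::sum);
        return totals;
    }

    static Result run(List<String> splits, boolean useCombiner) {
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>();
        for (String split : splits) {
            List<Map.Entry<String, Integer>> mapped = map(split);
            if (useCombiner) {
                // Local reduction: one pair per distinct word in this split.
                shuffled.addAll(sum(mapped).entrySet());
            } else {
                shuffled.addAll(mapped);
            }
        }
        return new Result(sum(shuffled), shuffled.size());
    }

    public static void main(String[] args) {
        List<String> splits = Arrays.asList("a b a a", "b b c");
        Result plain = run(splits, false);
        Result combined = run(splits, true);
        // prints {a=3, b=3, c=1} 7 4
        System.out.println(combined.totals + " " + plain.shuffledPairs + " " + combined.shuffledPairs);
    }
}
```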

Write grep (search for occurrences of text in a file)
   - Key highlight for this example: how we can pass some data that is shared across all map/reduce calls
   - Map:
      - Input
         - key: LongWritable
         - value: Text
      - Output
         - key: LongWritable (byte offset)
         - value: Text (line with an occurrence of the word)
      - Check if the word being searched for is in the input text. If so, output key/value pair.
   - Reduce: NoOpReducer... we're already done!

   - How do we get the word to each of the map method calls?
      - Use the configure method and an attribute we set in the JobConf
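
A minimal plain-Java sketch of this idea follows; it is not the class's Grep code. The attribute name "grep.sketch.word" and the class names are hypothetical, a HashMap stands in for the JobConf, and the loop in grep plays the role of the framework feeding (byte offset, line) pairs to map. No reduce step is needed.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Simulation only: the search word travels to the mapper through a config attribute.
class GrepSketch {
    static final String WORD_PARAM = "grep.sketch.word"; // hypothetical attribute name

    static class GrepMapper {
        private String word;

        // Called once before any map call; reads the shared search word.
        void configure(Map<String, String> conf) { word = conf.get(WORD_PARAM); }

        // Emits (offset, line) only when the line contains the word.
        void map(long offset, String line, Map<Long, String> output) {
            if (line.contains(word)) output.put(offset, line);
        }
    }

    static Map<Long, String> grep(String text, String word) {
        Map<String, String> conf = new HashMap<>(); // stand-in for JobConf
        conf.put(WORD_PARAM, word);

        GrepMapper mapper = new GrepMapper();
        mapper.configure(conf);

        Map<Long, String> output = new LinkedHashMap<>();
        long offset = 0;
        for (String line : text.split("\n")) {
            mapper.map(offset, line, output);
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for the newline
        }
        return output;
    }

    public static void main(String[] args) {
        // prints {0=the cat, 14=cat nap}
        System.out.println(grep("the cat\na dog\ncat nap\n", "cat"));
    }
}
```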

Look at Grep code

Look at SimpleWordFreqHistogram code
   - Takes the output from WordCount and counts how many words occur with each frequency
   - Note the use of separate files/classes for the mapper, reducer and driver
   - Note also the use of the generic "SumReducer" reducer

Look at WordFreqHistogram code for full pipeline
   - We can run multiple MapReduce jobs by calling their driver methods in series, in this case WordCount and then SimpleWordFreqHistogram
   - We use an intermediate directory (working) to connect the two jobs: it is the output directory of the first job and the input directory of the second.
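
The two-job pipeline can be sketched in plain Java as two functions called in series. This is an in-memory simulation under simplifying assumptions, not the class's WordFreqHistogram code: the working variable plays the role of the working directory, and all names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simulation only: job 1 output becomes job 2 input.
class PipelineSketch {
    // Job 1 (WordCount stand-in): word -> number of occurrences.
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String w : line.trim().split("\\s+")) {
                if (!w.isEmpty()) counts.merge(w, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Job 2 (histogram stand-in): frequency -> number of distinct words with that frequency.
    static Map<Integer, Integer> histogram(Map<String, Integer> wordCounts) {
        Map<Integer, Integer> hist = new TreeMap<>();
        for (int freq : wordCounts.values()) hist.merge(freq, 1, Integer::sum);
        return hist;
    }

    // "Driver": runs the jobs in series, connected through the intermediate result.
    static Map<Integer, Integer> run(List<String> input) {
        Map<String, Integer> working = wordCount(input); // plays the role of the working directory
        return histogram(working);
    }

    public static void main(String[] args) {
        // a and b occur twice, c and d once: prints {1=2, 2=2}
        System.out.println(run(Arrays.asList("a b a", "b c d")));
    }
}
```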

Look at LineIndexer code
   - Key: use the reporter to get information about the file being processed
   - Sometimes we can't avoid using a data structure, here to remove duplicates (or we could do the dedup as another MapReduce phase)
   - This builds an inverted index, which is a key structure for how search engines work
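
As a rough illustration of what an inverted index is, here is a plain-Java sketch (not the class's LineIndexer code; all names are hypothetical). The file name is passed in explicitly here, whereas in the real job it comes from the reporter. Conceptually the map step emits (word, file name) pairs and the reduce step collects the file names for each word; a set is the data structure that drops duplicates.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Simulation only: builds word -> set of files containing that word.
class InvertedIndexSketch {
    static Map<String, Set<String>> index(Map<String, List<String>> files) {
        Map<String, Set<String>> idx = new TreeMap<>();
        for (Map.Entry<String, List<String>> file : files.entrySet()) {
            for (String line : file.getValue()) {
                for (String word : line.trim().split("\\s+")) {
                    if (word.isEmpty()) continue;
                    // "Map" emits (word, fileName); the set dedups repeats, as the "reduce" would.
                    idx.computeIfAbsent(word, k -> new TreeSet<>()).add(file.getKey());
                }
            }
        }
        return idx;
    }

    public static void main(String[] args) {
        Map<String, List<String>> files = new LinkedHashMap<>();
        files.put("doc1", Arrays.asList("hadoop map reduce", "map again"));
        files.put("doc2", Arrays.asList("map side join"));
        Map<String, Set<String>> idx = index(files);
        System.out.println(idx.get("map"));    // prints [doc1, doc2]
        System.out.println(idx.get("hadoop")); // prints [doc1]
    }
}
```

A search for a word is then a single lookup in the index instead of a scan over every file.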