Machine Learning - Spring 2022 - Class 22

  • All hadoop demos can be found in the hadoop examples directory
  • admin
       - grading update
       
       - Tuesday's class
          - Get together in project groups and discuss project ideas for 20 minutes
          - Each group will take 1-2 minutes to present and 1-2 minutes for discussion

       - Midterm 2 will be posted on Monday
          - Covers everything from large margin classifiers up through MapReduce
             - won't ask you to write mapreduce code
          - open book, open notes
       
       - Final project

  • MapReduce framework
       - http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
          - (this link also has some more nice MapReduce examples)

  • Look at the NoOpReducer
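       - It just passes every key/value pair through unchanged, which is useful when the map phase alone already produces the final output
       - A rough sketch of such an identity reducer in the old mapred API (the LongWritable/Text types here are only an example and may not match the class in the examples directory):

            import java.io.IOException;
            import java.util.Iterator;

            import org.apache.hadoop.io.LongWritable;
            import org.apache.hadoop.io.Text;
            import org.apache.hadoop.mapred.MapReduceBase;
            import org.apache.hadoop.mapred.OutputCollector;
            import org.apache.hadoop.mapred.Reducer;
            import org.apache.hadoop.mapred.Reporter;

            // identity reducer: emits every value for a key unchanged
            public class NoOpReducer extends MapReduceBase
                    implements Reducer<LongWritable, Text, LongWritable, Text> {

                public void reduce(LongWritable key, Iterator<Text> values,
                                   OutputCollector<LongWritable, Text> output,
                                   Reporter reporter) throws IOException {
                    while (values.hasNext()) {
                        output.collect(key, values.next());
                    }
                }
            }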

  • Within the code you write, this is how the program flows:
       - start in the main method (like all java programs)
          - this is still run on whatever machine you ran "hadoop jar" on until it reaches a "runJob" call
          - this should be your "driver"
       - runJob = start running on the Hadoop cluster
       - run the map phase
          - the mapper is instantiated using the constructor
             - needs to have a zero-parameter constructor! (if you don't define any constructor, Java provides one by default)
             - what if your mapper needs a parameter?
          - the configure method
             - called before any of the calls to the map function
             - it is passed the JobConf configuration object you constructed in your driver
             - the JobConf object allows you to set arbitrary attributes
                - the general "set" method, sets a string
                - other "set" method exist, though for setting other types:
                   - setBoolean
                   - setInt
                   - setFloat
                   - setLong
             - you can then grab these "set" attributes in the configure method using get
                - good practice to use a shared constant (e.g., a static final String) for the name of this configuration parameter
          - finally, for each item in the input, the map function is called
       - the combiner and reducer are then run in similar fashion: instantiate, then call configure, then call the reduce method (a sketch of the driver/configure pattern follows below)
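       - a rough sketch of this flow, passing a parameter to the mapper through the JobConf; all class, attribute, and argument names here are hypothetical, not the course code:

            import java.io.IOException;

            import org.apache.hadoop.fs.Path;
            import org.apache.hadoop.io.IntWritable;
            import org.apache.hadoop.io.LongWritable;
            import org.apache.hadoop.io.Text;
            import org.apache.hadoop.mapred.FileInputFormat;
            import org.apache.hadoop.mapred.FileOutputFormat;
            import org.apache.hadoop.mapred.JobClient;
            import org.apache.hadoop.mapred.JobConf;
            import org.apache.hadoop.mapred.MapReduceBase;
            import org.apache.hadoop.mapred.Mapper;
            import org.apache.hadoop.mapred.OutputCollector;
            import org.apache.hadoop.mapred.Reporter;

            public class ParamDemo {
                // good practice: one shared constant for the attribute name,
                // used by both the driver and the mapper's configure method
                public static final String MIN_LENGTH = "paramdemo.min.length";

                // emits (word, 1) only for words of at least a minimum length
                // (a made-up task, just to show a mapper that needs a parameter)
                public static class LengthMapper extends MapReduceBase
                        implements Mapper<LongWritable, Text, Text, IntWritable> {
                    private int minLength;

                    // called once per mapper, before any map() calls,
                    // with the JobConf built in the driver
                    public void configure(JobConf conf) {
                        minLength = conf.getInt(MIN_LENGTH, 0);
                    }

                    public void map(LongWritable key, Text value,
                                    OutputCollector<Text, IntWritable> output,
                                    Reporter reporter) throws IOException {
                        for (String w : value.toString().split("\\s+")) {
                            if (w.length() >= minLength) {
                                output.collect(new Text(w), new IntWritable(1));
                            }
                        }
                    }
                }

                // the driver: runs on whatever machine "hadoop jar" was launched on
                public static void main(String[] args) throws IOException {
                    JobConf conf = new JobConf(ParamDemo.class);
                    conf.setJobName("param demo");
                    conf.setInt(MIN_LENGTH, Integer.parseInt(args[2]));  // setBoolean/setFloat/setLong/set also exist
                    conf.setOutputKeyClass(Text.class);
                    conf.setOutputValueClass(IntWritable.class);
                    conf.setMapperClass(LengthMapper.class);
                    FileInputFormat.setInputPaths(conf, new Path(args[0]));
                    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
                    JobClient.runJob(conf);  // execution moves to the Hadoop cluster here
                }
            }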

  • Combiners
       - Combiners are sometimes called mini-reducers
       - They run on the same machine as the map call and only reduce locally
       - They are often used to speed up processing by doing an initial reduction before having to sort and redistribute the data
       - In many cases, you can use the reducer as the combiner (though not always: the operation has to be associative and commutative, and the reducer's input and output types have to match)
       - Look at using a combiner for WordCount code
          - In this case, we can just use the reducer as the combiner
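       - Roughly, a summing reducer that can double as the combiner might look like this (a sketch; the course's SumReducer may differ in its exact name and types):

            import java.io.IOException;
            import java.util.Iterator;

            import org.apache.hadoop.io.IntWritable;
            import org.apache.hadoop.io.Text;
            import org.apache.hadoop.mapred.MapReduceBase;
            import org.apache.hadoop.mapred.OutputCollector;
            import org.apache.hadoop.mapred.Reducer;
            import org.apache.hadoop.mapred.Reporter;

            // sums the counts for a word; because summing is associative and commutative,
            // the same class can serve as the combiner (local partial sums on each map node)
            // and as the final reducer after the shuffle
            public class SumReducer extends MapReduceBase
                    implements Reducer<Text, IntWritable, Text, IntWritable> {

                public void reduce(Text key, Iterator<IntWritable> values,
                                   OutputCollector<Text, IntWritable> output,
                                   Reporter reporter) throws IOException {
                    int sum = 0;
                    while (values.hasNext()) {
                        sum += values.next().get();
                    }
                    output.collect(key, new IntWritable(sum));
                }
            }

       - In the driver, it gets registered twice: conf.setCombinerClass(SumReducer.class) and conf.setReducerClass(SumReducer.class)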

  • Write grep (search for occurrences of text in a file)
       - Key highlight for this example: how we can pass some data that is shared across all map/reduce calls
       - Map:
          - Input
             - key: LongWritable
             - value: Text
          - Output
             - key: LongWritable (byte offset)
             - value: Text (line with an occurrence of the word)
          - Check if the word being searched for is in the input text. If so, output key/value pair.
       - Reduce: NoOpReducer... we're already done!

       - How do we get the word to each of the map method calls?
          - Use the configure method and an attribute we set in the JobConf
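          - Roughly, the Grep mapper might look like this (a sketch with a hypothetical attribute name, not the actual Grep code from the examples directory):

               import java.io.IOException;

               import org.apache.hadoop.io.LongWritable;
               import org.apache.hadoop.io.Text;
               import org.apache.hadoop.mapred.JobConf;
               import org.apache.hadoop.mapred.MapReduceBase;
               import org.apache.hadoop.mapred.Mapper;
               import org.apache.hadoop.mapred.OutputCollector;
               import org.apache.hadoop.mapred.Reporter;

               public class GrepMapper extends MapReduceBase
                       implements Mapper<LongWritable, Text, LongWritable, Text> {

                   // hypothetical attribute name; the driver does conf.set(SEARCH_WORD, word)
                   public static final String SEARCH_WORD = "grep.search.word";

                   private String word;

                   // runs once per mapper, before any of the map() calls
                   public void configure(JobConf conf) {
                       word = conf.get(SEARCH_WORD);
                   }

                   public void map(LongWritable key, Text value,
                                   OutputCollector<LongWritable, Text> output,
                                   Reporter reporter) throws IOException {
                       // emit the (byte offset, line) pair only if the line contains the word
                       if (value.toString().contains(word)) {
                           output.collect(key, value);
                       }
                   }
               }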

  • Look at Grep code

  • Look at SimpleWordFreqHistogram code
       - Takes the output from WordCount and counts how many words occur with each frequency
       - Note the use of separate files/classes for the mapper, reducer and driver
       - Note also the use of the generic "SumReducer" reducer
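       - Roughly, the histogram mapper might parse the count off each "word<TAB>count" line and emit (count, 1); a sketch (class name and types are guesses, not the course code):

            import java.io.IOException;

            import org.apache.hadoop.io.IntWritable;
            import org.apache.hadoop.io.LongWritable;
            import org.apache.hadoop.io.Text;
            import org.apache.hadoop.mapred.MapReduceBase;
            import org.apache.hadoop.mapred.Mapper;
            import org.apache.hadoop.mapred.OutputCollector;
            import org.apache.hadoop.mapred.Reporter;

            // input lines are WordCount output, i.e. "word<TAB>count";
            // emitting (count, 1) lets a SumReducer-style reducer tally how many
            // words occur with each frequency
            public class HistogramMapper extends MapReduceBase
                    implements Mapper<LongWritable, Text, Text, IntWritable> {

                private static final IntWritable ONE = new IntWritable(1);

                public void map(LongWritable key, Text value,
                                OutputCollector<Text, IntWritable> output,
                                Reporter reporter) throws IOException {
                    String[] parts = value.toString().split("\t");
                    if (parts.length == 2) {
                        // the frequency (as text) becomes the key; the reducer adds up the 1s
                        output.collect(new Text(parts[1]), ONE);
                    }
                }
            }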

  • Look at WordFreqHistogram code for full pipeline
       - We can run multiple MapReduce jobs by calling their driver methods in series; in this case, WordCount and then SimpleWordFreqHistogram
       - We use an intermediate directory (working) to connect the first job's output to the second job's input.
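       - A rough sketch of what that chaining looks like; the run(...) helpers below are hypothetical stand-ins for the WordCount and SimpleWordFreqHistogram drivers:

            // sketch only: run(...) stands in for whatever driver method each job exposes
            public class WordFreqHistogram {
                public static void main(String[] args) throws Exception {
                    String input = args[0];
                    String working = args[1];   // intermediate directory connecting the two jobs
                    String output = args[2];

                    // JobClient.runJob blocks until a job finishes, so the second job only
                    // starts once the first job's output is sitting in the working directory
                    WordCount.run(input, working);                   // job 1: word -> count
                    SimpleWordFreqHistogram.run(working, output);    // job 2: count -> number of words with that count
                }
            }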

  • Look at LineIndexer code
       - Key: use the reporter to get information about the file being processed
       - Sometimes we can't avoid using a data structure in the reducer (or we could handle duplicates with another MapReduce dedup phase)
       - This builds an inverted index, which is a key structure for how search engines work
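       - A rough sketch of the mapper side (class name and the "file@offset" location format are guesses, not the actual LineIndexer code):

            import java.io.IOException;

            import org.apache.hadoop.io.LongWritable;
            import org.apache.hadoop.io.Text;
            import org.apache.hadoop.mapred.FileSplit;
            import org.apache.hadoop.mapred.MapReduceBase;
            import org.apache.hadoop.mapred.Mapper;
            import org.apache.hadoop.mapred.OutputCollector;
            import org.apache.hadoop.mapred.Reporter;

            // for every word, emit (word, "file@offset") so the reducer can gather all
            // the locations where the word occurs -- the inverted index
            public class LineIndexMapper extends MapReduceBase
                    implements Mapper<LongWritable, Text, Text, Text> {

                public void map(LongWritable key, Text value,
                                OutputCollector<Text, Text> output,
                                Reporter reporter) throws IOException {
                    // the reporter tells us which input split (and therefore which file) we're in
                    FileSplit split = (FileSplit) reporter.getInputSplit();
                    String location = split.getPath().getName() + "@" + key.get();

                    for (String word : value.toString().split("\\s+")) {
                        if (word.length() > 0) {
                            output.collect(new Text(word), new Text(location));
                        }
                    }
                }
            }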