CS150 - Fall 2011 - Class 13

  • admin
       - midterm
          - out of 54
          - average 46
          - high of 52
       - office hours 2:30-3:30 tomorrow (instead of 3-4)

  • what is a data structure?
       - a way of storing and organizing data
       - no free lunch:
          - different data structures are optimized to make different operations better (e.g. faster or more memory efficient)
          - there is no single best data structure
          - depending on the application, you will have to decide how to store your data

  • sets
       - what is a set, i.e. a set of data points?
          - an unordered collection of data

          - how does this differ from a list?
             - a list has a sequential order to it
       - what operations/methods might we want from a set?
          - create new/construct
          - add things to the set
          - ask if something belongs in the set
          - intersect
          - union
          - remove things from the set

  • classes
       - a "class" is the blueprint describing what data and methods an object will have
       - an object is an instance of a class
          - for example, we could define a class people
             - people have attributes
             - people will have methods
             - when we define a particular person, it is an object that is an instance of the class people
       - classes define types
          - in Python, since all things are objects, every value is an instance of some class
          - though in other languages, you could have a type that is not defined by a class
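       - a minimal sketch of the people example, assuming made-up attributes (name, age) and methods (greeting, have_birthday); none of these names come from a real library, they are only for illustration:

```python
class Person(object):
    """Blueprint for people: every Person has a name and an age."""

    def __init__(self, name, age):
        # Runs when we call Person(...) to build a new object.
        self.name = name   # attribute: data stored on the object
        self.age = age

    def greeting(self):
        # Method that reads the object's data without changing it.
        return "Hi, I'm " + self.name

    def have_birthday(self):
        # Method that changes the object's data.
        self.age = self.age + 1


# Two objects, both instances of the same class.
alice = Person("Alice", 20)
bob = Person("Bob", 31)

bob.have_birthday()           # only bob changes; alice is a separate object
print(alice.greeting())       # Hi, I'm Alice
print(bob.age)                # 32
print(type(alice) is Person)  # True: the class is the object's type
```

       - the class is written once; each call to Person(...) produces a new object with its own copy of the data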

       - since everything we've seen is an object, all the types we've seen are classes
          - for any class, we can type help(class_name) to get information about the class (methods, etc.)

          >>> help(int)
          >>> help(list)

       - by the end of this class, you're going to be able to understand almost all of the information that comes back from calling help

  • set class
       >>> help(set)
          
       - the first thing we see is how to create new sets
       - these are called the constructors for a class
          - they define how we "construct" (or create) new objects (instances of that class)
          - we can construct a new set using a constructor

             >>> s = set()
             >>> s
             set([])
             >>> s = set([4, 3, 2, 1])
             >>> s
             set([1, 2, 3, 4])
             >>> s = set("abcd")
             >>> s
             set(['a', 'c', 'b', 'd'])

          - notice that there were two constructors
             - the empty constructor (set()), which created an empty set
             - and a constructor that took a single parameter
                - a list
                - a string
                - in general, anything that we can iterate over in a for loop (we'll get back to this later)
          - when we print out the value of s it explicitly states that it is a set
             "set([1, 2, 3, 4])"
          - notice that even though we may give it something that has an ordering, the ordering is not preserved
       - we've used constructors before
          >>> s = str(10)
          >>> x = int("1234")
       - some of the more common classes like int, float, list, etc. have special syntax (sometimes called "syntactic sugar") for creating the objects in a special way
             >>> 10
             10
             >>> [1, 2, 3, 4]
             [1, 2, 3, 4]
             >>> "abcd"
             'abcd'

          - but these are still just constructor calls
             >>> int(10)
             10
             >>> list([1, 2, 3, 4])
             [1, 2, 3, 4]
             >>> str("abcd")
             'abcd'
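       - a short sketch pulling the constructor ideas together; the variable names are made up for illustration, and the duplicate-handling line relies on the fact that a set stores each value only once:

```python
# Literal syntax and an explicit constructor call build equal objects.
from_literal = [1, 2, 3, 4]
from_constructor = list((1, 2, 3, 4))   # list built from a tuple
same_list = (from_literal == from_constructor)

# type() reports which class an object is an instance of.
literal_is_int = (type(10) is int)

# The one-parameter set constructor accepts anything we can loop over.
from_list = set([4, 3, 2, 1])
from_string = set("abcd")
from_range = set(range(1, 5))

# Duplicates collapse, because a set stores each value only once.
no_dups = set([1, 1, 2, 2, 3])

print(same_list)                # True
print(literal_is_int)           # True
print(from_list == from_range)  # True: same elements, order is irrelevant
print(len(no_dups))             # 3
```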

       - set methods
          - the methods of a class can be broken down into two types
             - mutator methods that change the underlying object
             - accessor methods that do NOT change the underlying object, but ask some question about the data and give us some information back
          - from the help output, which of the following are mutator vs. accessor?
             - add
             - clear
             - difference
             - difference_update
             - intersection
             - intersection_update
             - ...
          - mutators: add, clear, difference_update, intersection_update
             - all of these will change the object
          - accessors: difference, intersection
             - these will NOT change the object
          - other interesting methods
             - pop
             - remove
             - isdisjoint
             - issubset
             - issuperset
             - union
             - update
          - supports most of the methods you'd want for a set
             >>> s = set([1,2,3,4])
             >>> s.add(5)
             >>> s
             set([1, 2, 3, 4, 5])
             >>> s2 = set([4, 5, 6, 7])
             >>> s2
             set([4, 5, 6, 7])
             >>> s.difference(s2)
             set([1, 2, 3])
             >>> s
             set([1, 2, 3, 4, 5])
             >>> s2
             set([4, 5, 6, 7])
             >>> s.union(s2)
             set([1, 2, 3, 4, 5, 6, 7])
             >>> s.intersection(s2)
             set([4, 5])
             >>> s
             set([1, 2, 3, 4, 5])
             >>> s2
             set([4, 5, 6, 7])
             >>> s.intersection_update(s2)
             >>> s
             set([4, 5])
             >>> s2
             set([4, 5, 6, 7])
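       - one way to check the mutator/accessor split experimentally: copy the set, call the method, and see whether the set still equals the copy. This is only a sketch; changes_object is a helper invented for these notes, not a set method, and a mutator can leave the set unchanged for some arguments (e.g. adding an item that is already there):

```python
def changes_object(method_name, *args):
    """Return True if calling the named set method modifies the set."""
    s = set([1, 2, 3, 4, 5])
    before = set(s)                  # copy to compare against afterwards
    getattr(s, method_name)(*args)   # look the method up by name and call it
    return s != before


other = set([4, 5, 6, 7])
for name in ["difference", "difference_update",
             "intersection", "intersection_update"]:
    print("%s: %s" % (name, changes_object(name, other)))

print("add: %s" % changes_object("add", 6))
print("clear: %s" % changes_object("clear"))
```

       - the two _update methods, add, and clear report True (mutators); difference and intersection report False (accessors)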

          - we can also ask if an item is in a set
             >>> 1 in s2
             False
             >>> 5 in s2
             True
             >>> "abc" in s2
             False
             >>> s2 in s2
             False

          - notice that you CANNOT index a set (there is no order)
             >>> s[0]
             Traceback (most recent call last):
              File "<string>", line 1, in <fragment>
             TypeError: 'set' object does not support indexing   
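       - a small sketch of how to get at the elements of a set without indexing: loop over it, or build a list from it when an order is needed (the variable names are illustrative only):

```python
s = set([4, 3, 2, 1])

# Loop over the elements; the visiting order is not something to rely on.
total = 0
for item in s:
    total += item

# If an order is needed, build a list from the set first.
ordered = sorted(s)     # sorted() returns a new list
smallest = ordered[0]   # lists can be indexed

# Indexing the set itself raises a TypeError, as in the traceback above.
try:
    s[0]
    indexing_worked = True
except TypeError:
    indexing_worked = False

print(total)            # 10
print(ordered)          # [1, 2, 3, 4]
print(smallest)         # 1
print(indexing_worked)  # False
```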

  • why sets?
       - seems like we could do all of these things and more with lists?
          - lists already have comparable operations for adding (append), removing (pop, remove), and finding (index) items
          - sets do have some nice extra operations like union and intersection, but we could add these to the list class
          - in fact, lists also support the "in" notation
             >>> some_list = [1, 2, 3, 4]
             >>> 4 in some_list
             True
             >>> "abc" in some_list
             False

       - why have a separate class for sets?
          - performance!

       - write the following function:
          - contains(list, item)
             - returns True if the item is in the list
             - returns False otherwise
             - don't use "in" or "find"

          def contains(list, item):
             for thing in list:
                if thing == item:
                   return True
             
             return False

          - If we're searching for an item and we double the size of the list, how much longer (on average) do you think it would take to run this function?
             - twice as long
             - we're looping through each item in the list
             - computers are fast, but there still is a cost to each operation
          - what if we quadrupled the size of the list?
             - four times as long
          - the contains function above is called a "linear" runtime function
             - its runtime varies linearly with respect to the size of the input
          - can we do better than linear for finding an item?
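       - to make "linear" concrete, here is a sketch that runs the same loop as contains but counts the == checks instead of returning True/False. count_comparisons is a name made up for this illustration, and it measures the worst case (item not present) rather than the average:

```python
def count_comparisons(items, target):
    """Same loop as contains, but return how many == checks were made."""
    comparisons = 0
    for thing in items:
        comparisons += 1
        if thing == target:
            break
    return comparisons


for size in [1000, 2000, 4000]:
    data = list(range(size))
    # Worst case: -1 is never in data, so every item gets checked.
    print("%d items -> %d comparisons" % (size, count_comparisons(data, -1)))
```

       - doubling the list doubles the count (1000, 2000, 4000); when the item is present at a random position we stop about halfway on average, which still doubles when the list doubles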

  • look at lists_vs_sets.py code
       - two functions for generating data
          - generate_set: generates random points and puts them into a set
          - generate_list: generates random points and puts them into a list
       - query_data
          - generates num_queries random numbers
          - uses "in" to see if they are in the data set
          - times how long it takes to do num_queries
       - speed_test
          - generates equal sized data sets in both list and set form
          - then calls query_data to see how long it takes to query each one
          
             >>> speed_test(1000, 100)
             List creation took 0.003422 seconds
             Set creation took 0.003589 seconds
             --
             List querying took 0.002917 seconds
             Set querying took 0.000194 seconds

          - for small sizes, both are fast in absolute terms, although the set is already quicker to query
          - as we increase the size of the data and the number of queries, however, the gap widens considerably
       
             >>> speed_test(10000, 100)
             List creation took 0.023313 seconds
             Set creation took 0.021885 seconds
             --
             List querying took 0.021288 seconds
             Set querying took 0.000179 seconds

             >>> speed_test(10000, 1000)
             List creation took 0.020332 seconds
             Set creation took 0.021198 seconds
             --
             List querying took 0.213577 seconds
             Set querying took 0.001833 seconds

             >>> speed_test(100000, 1000)
             List creation took 0.186876 seconds
             Set creation took 0.220910 seconds
             --
             List querying took 2.148366 seconds
             Set querying took 0.001881 seconds

          - we can better understand these by generating these points and then plotting them
             >>> speed_data(10000, 1000, 10000, 500)
             1000   0.267589   0.001807
             1500   0.400795   0.002836
             2000   0.529208   0.003752
             2500   0.669099   0.004868
             3000   0.810630   0.005472
             3500   0.948448   0.006437
             4000   1.072060   0.007765
             4500   1.189531   0.008158
             5000   1.329393   0.009244
             5500   1.486140   0.010427
             6000   1.632416   0.011127
             6500   1.756407   0.012571
             7000   1.892501   0.013503
             7500   2.043325   0.014383
             8000   2.127294   0.014682
             8500   2.287051   0.015652
             9000   2.407338   0.016526
             9500   2.530725   0.017603

          - we can copy and paste this in Excel and plot it
             - we'll look later at how to plot within Python
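       - lists_vs_sets.py itself is not reproduced in these notes, so the following is only a guess at what the functions described above might look like. The parameter lists, the MAX_VALUE range for the random numbers, and the returned timings are all assumptions; the real file may differ:

```python
import random
import time

MAX_VALUE = 1000000   # assumed range for the random numbers


def generate_list(size):
    """Return a list of `size` random integers."""
    data = []
    for _ in range(size):
        data.append(random.randint(0, MAX_VALUE))
    return data


def generate_set(size):
    """Return a set built from `size` random integers.

    Repeated random values collapse, so the set can be slightly smaller.
    """
    data = set()
    for _ in range(size):
        data.add(random.randint(0, MAX_VALUE))
    return data


def query_data(data, num_queries):
    """Time num_queries membership checks against data; return seconds."""
    start = time.time()
    for _ in range(num_queries):
        found = random.randint(0, MAX_VALUE) in data
    return time.time() - start


def speed_test(size, num_queries):
    """Compare list and set on creation time and on query time."""
    start = time.time()
    list_data = generate_list(size)
    print("List creation took %f seconds" % (time.time() - start))

    start = time.time()
    set_data = generate_set(size)
    print("Set creation took %f seconds" % (time.time() - start))

    print("--")
    list_time = query_data(list_data, num_queries)
    set_time = query_data(set_data, num_queries)
    print("List querying took %f seconds" % list_time)
    print("Set querying took %f seconds" % set_time)
    return list_time, set_time


speed_test(10000, 1000)
```

       - exact timings will vary from machine to machine and from run to run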

  • when to use a set vs. a list?
       - lists have an ordering
          - if you need indexing, use a list
       - sets are faster for checking membership
          - if you don't care about the order, use a set!
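       - the two can also be combined. As a sketch (unique_in_order is a made-up helper, not from the class code), this keeps a list for the ordering and a set for the fast membership checks:

```python
def unique_in_order(items):
    """Return the distinct items, in the order they first appear."""
    seen = set()    # used only for membership checks; order is irrelevant
    result = []     # keeps first-seen order and supports indexing
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


words = ["to", "be", "or", "not", "to", "be"]
print(unique_in_order(words))      # ['to', 'be', 'or', 'not']
print(unique_in_order(words)[0])   # to
```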