CS51A - Fall 2019 - Class 24

Example code in this lecture

   url_extractor_improved.py
   url_basics_ssl.py
   exceptions.py
   lists_vs_sets.py

Lecture notes

look at the get_note_urls_improved function in url_extractor_improved.py code
   - read
      - rather than reading a line at a time, we can read the entire contents all at once
      - this also works on files
   - we then decode this so that page_text has all of the webpage text as a string
   - what does "begin_index = page_text.find(search_line)" do?
      - searches for the index of the first occurrence of "lectures/"
      - will the code enter the while loop?
         - if it finds an occurrence
   - what does "end_index = page_text.find('"', begin_index)" do?
      - searches for the end of the link
   - we can then extract the url
   - what does "begin_index = page_text.find(search_line, end_index)" do?
      - searches *again*, but now starting at end_index, the end of the last link found
   - if we run the improved version, we now get the pptx links too
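
The find-based loop walked through above can be sketched on an in-memory string, so nothing has to be downloaded. This is only a sketch: the function name extract_urls and the sample HTML are assumptions for illustration, and the real get_note_urls_improved in url_extractor_improved.py may be organized differently.

```python
def extract_urls(page_text, search_line="lectures/"):
    """Return every quoted link in page_text that starts with search_line."""
    urls = []
    # find returns -1 when there is no match, so the loop is skipped entirely
    begin_index = page_text.find(search_line)
    while begin_index != -1:
        # the link ends at the next double quote after its start
        end_index = page_text.find('"', begin_index)
        if end_index == -1:
            break  # unterminated link: stop rather than loop forever
        urls.append(page_text[begin_index:end_index])
        # search again, starting where the previous link ended
        begin_index = page_text.find(search_line, end_index)
    return urls


# hypothetical page text, made up for this example
sample = ('<a href="lectures/lecture1-intro.html">notes</a> '
          '<a href="lectures/lecture1-intro.pptx">slides</a>')
print(extract_urls(sample))
```

Because the second and later searches start at end_index, each link is found exactly once.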

how could we change our code to just extract the name of the file (e.g., lecture1-intro.html)?
   - look at get_note_files_only function in url_extractor_improved.py code
   - key change: we want to skip the "lectures/" part when extracting the page
      - rather than using begin_index, we want to skip the length of "lectures/" forward when extracting
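
A minimal sketch of that key change, assuming the same find-based loop as in the lecture code. The name extract_file_names is hypothetical and the actual get_note_files_only function may differ in its details.

```python
def extract_file_names(page_text, search_line="lectures/"):
    """Return just the file names of links that start with search_line."""
    names = []
    begin_index = page_text.find(search_line)
    while begin_index != -1:
        end_index = page_text.find('"', begin_index)
        if end_index == -1:
            break
        # skip forward by the length of "lectures/" so only the name remains
        names.append(page_text[begin_index + len(search_line):end_index])
        begin_index = page_text.find(search_line, end_index)
    return names


print(extract_file_names('<a href="lectures/lecture1-intro.html">notes</a>'))
```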

what's the difference between http and https?
   - the 's' stands for secure
   - when you communicate with an https website:
      - you get some reassurance that you're actually communicating with the website (rather than someone pretending to be the website)
      - your communications are encrypted so it's difficult to see what information you're sending back and forth
   - there is a bit of overhead in setting up this communication properly
      - the right way is to install SSL certificates for python
   - for simplicity, however, you can also tell python to simply ignore the SSL certificates and connect to an https site without checking.
   - look at url_basics_ssl.py code
      - urlopen has an optional parameter that you can specify that will allow you to connect to an https webpage without checking ssl certificates
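
One way to do this with the standard library is to pass an SSL context with verification switched off through urlopen's context parameter. This is a sketch under assumptions: the helper names make_unverified_context and read_page are made up here, and url_basics_ssl.py may build its context differently.

```python
import ssl
from urllib.request import urlopen


def make_unverified_context():
    """Build an SSL context that does NOT check certificates (insecure)."""
    context = ssl.create_default_context()
    # hostname checking has to be turned off before verify_mode can be CERT_NONE
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return context


def read_page(url):
    """Read an https page without certificate checks and return its text."""
    with urlopen(url, context=make_unverified_context()) as page:
        return page.read().decode()
```

Skipping verification gives up the reassurance about who you are talking to, so it is only reasonable for experiments like the ones in this class.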

web crawler
   - how does google know what's on the web?

reading web pages: ethics
   - you are reading a file on a remote server
      - you shouldn't be doing this repeatedly
      - if you're trying to debug some code, copy the source into a file and debug that way before running live
   - there are some restrictions about what content a web site owner may want you looking at
      - see http://www.robotstxt.org/
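
Python's standard library includes urllib.robotparser for checking these restrictions. The sketch below parses a made-up set of rules from a list of strings rather than downloading a real robots.txt; the rules and the example.com URLs are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# hypothetical robots.txt contents: everyone is asked to stay out of /private/
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(rules)

# can_fetch answers: may this user agent read this url?
allowed = parser.can_fetch("*", "http://example.com/lectures/lecture1-intro.html")
blocked = parser.can_fetch("*", "http://example.com/private/grades.html")
print(allowed, blocked)
```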

exceptions
   - look at the list_max function in exceptions.py code
      - are there any inputs that would give an error?
         - non-numerical
         - empty lists
      - how could we fix this?
         - check if it's equal to the empty list
            - print an error message
            - return ???
   - a better way to fix this is to raise an exception (like you've probably seen for other problems)
   - Exceptions indicate unrecoverable errors, often programmer errors
      - Within the function where the error occurs, we don't have enough information to fix it
   - Exceptions are another way of communicating information from a function/expression:
      >>> 1/0
      Traceback (most recent call last):
       Python Shell, prompt 3, line 1
      builtins.ZeroDivisionError: division by zero
   - they allow us to give information back from a function besides return
   - if we don't do anything about them, exceptions will cause the program to terminate

raising exceptions
   - look at the list_max_better function in exceptions.py code
   - to raise an exception, you use the keyword "raise" and then create a new Exception object
      >>> list_max_better([1, 2, 3])
      3
      >>> list_max_better([])
      Traceback (most recent call last):
       Python Shell, prompt 3, line 1
       # Used internally for debug sandbox under external interpreter
       File "/Users/drk04747/classes/cs51a/examples/exceptions.py", line 12, in <module>
       raise Exception("list must be non-empty")
      builtins.Exception: list must be non-empty
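
A function that behaves like the transcript above might look roughly like this. It is a sketch, not the code from exceptions.py: the name list_max_sketch is made up, and only the error message is taken from the traceback.

```python
def list_max_sketch(numbers):
    """Return the largest value in numbers; raise if the list is empty."""
    if len(numbers) == 0:
        # there is no sensible value to return, so signal the problem instead
        raise Exception("list must be non-empty")

    largest = numbers[0]
    for value in numbers:
        if value > largest:
            largest = value
    return largest


print(list_max_sketch([1, 2, 3]))
```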

look at the get_scores function in exceptions.py code
   - are there any inputs that the user could enter that would cause a problem? Specifically, cause the function to exit early?
      >>> get_scores()
      Enter the scores one at a time. Blank score finishes.
      Enter score: 1
      Enter score: banana
      Traceback (most recent call last):
       Python Shell, prompt 2, line 1
       # Used internally for debug sandbox under external interpreter
       File "/Users/drk04747/classes/cs51a/examples/exceptions.py", line 29, in <module>
       scores.append(float(line))
      builtins.ValueError: could not convert string to float: 'banana'

      - if we enter a non-numerical value, we get a "ValueError"
   - what would you like to do?
      - better to prompt the user to enter a number and try again
   - how can we do this?
      - one way would be to check that the string is a valid number
         - kind of a pain
            - decimal numbers
            - positive/negative
            - even scientific notation is fair game, e.g. 1.3e10
      - better way: handle the exception and deal with it

try/except
   - we can catch an exception and deal with it using a try/except block (an "exception handler"):
   - syntax:
      try:
         some code that could raise an exception

      except ExceptionName:
         what to do if exception occurs

      - the code in the try block is executed
      - if no exception is raised
         - the code finishes
         - the code in the "except" block is skipped and the code keeps running
      - if an exception occurs
         - the code in the try block is immediately exited
         - if it's of the type in the except block
            - the code in the except block executes
            - then the code keeps running after that
         - if it's a different type of exception, it is not caught here and keeps propagating (ending the program if nothing else handles it)
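
The control flow described above can be traced with a small example. The function safe_divide is made up for illustration and is not part of exceptions.py.

```python
def safe_divide(numerator, denominator):
    """Return numerator / denominator, or None when dividing by zero."""
    try:
        result = numerator / denominator
        # only reached if the division above did not raise
        print("division succeeded")
    except ZeroDivisionError:
        # runs only when the try block raised a ZeroDivisionError
        print("cannot divide by zero")
        result = None
    # execution continues here in both cases
    return result


print(safe_divide(6, 3))
print(safe_divide(1, 0))
```

Calling safe_divide("a", 2) raises a TypeError, which this handler does not name, so that exception leaves the function uncaught.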

   - how does this help us for the get_scores function?

   - look at the get_scores_better function in exceptions.py code
      - we can handle the ValueError exception and print out an error message, but keep going
      >>> get_scores_better()
      Enter the scores one at a time. Blank score finishes.
      Enter score: 1
      Enter score: banana
      Enter numbers only!
      Enter score: 2
      Enter score:
      [1.0, 2.0]
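
A sketch of a function that would produce a session like the one above. The name get_scores_sketch and the read_line parameter are assumptions: the parameter defaults to input, and exists only so the function can be exercised without typing at the keyboard.

```python
def get_scores_sketch(read_line=input):
    """Read scores until a blank line, re-prompting on non-numeric input."""
    print("Enter the scores one at a time. Blank score finishes.")
    scores = []
    line = read_line("Enter score: ")
    while line != "":
        try:
            # float raises ValueError for strings like 'banana'
            scores.append(float(line))
        except ValueError:
            print("Enter numbers only!")
        line = read_line("Enter score: ")
    return scores


# simulate a user typing 1, banana, 2 and then a blank line
answers = iter(["1", "banana", "2", ""])
print(get_scores_sketch(lambda prompt: next(answers)))
```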

look at print_file_stats in exceptions.py code
   - where could we get exceptions from this code?
      - file doesn't exist!
      - if the file is empty, then we could also get a divide by zero error

look at the print_file_stats_better function in exceptions.py code
   - if we have multiple exceptions, we can have multiple except blocks
      - each block will only be executed if an exception of that type is raised
   - in the case of the divide by zero error, we'll already have printed out some information (number of words, longest word, shortest word). All we want to do is not have an error raised.
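
A sketch with one except block per exception type. The name print_file_stats_sketch and the exact statistics printed are assumptions (for brevity it tracks only the longest word), so the real print_file_stats_better may differ.

```python
def print_file_stats_sketch(filename):
    """Print simple word statistics for a file, handling two kinds of errors."""
    try:
        with open(filename) as in_file:
            words = in_file.read().split()

        longest = ""
        total_length = 0
        for word in words:
            total_length += len(word)
            if len(word) > len(longest):
                longest = word

        print("Number of words:", len(words))
        print("Longest word:", longest)
        # raises ZeroDivisionError when the file has no words
        print("Average word length:", total_length / len(words))
    except FileNotFoundError:
        print("File not found:", filename)
    except ZeroDivisionError:
        # the counts were already printed; there is nothing more to do
        pass


print_file_stats_sketch("no_such_file_for_this_sketch.txt")
```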

pass
   - certain control statements expect code to be there (e.g., if/else, try/except)
   - pass can be used as a non-operation: it is code, but it doesn't do anything

sets
   - what is a set, e.g. a set of data points?
      - an unordered collection of data
      - how does this differ from a list?
         - a list has a sequential order to it
         - lists might have duplicates
   - what operations/methods might we want from a set?
      - create new/construct
      - add things to the set
      - ask if something belongs in the set
      - intersect
      - union
      - remove things from the set

set class
   >>> help(set)

   - the first thing we see is how to create new sets
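      - we can construct a new set using the set constructor or using {} with elements inside (kind of like dictionaries); note that an empty {} creates a dictionary, so use set() for an empty set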

         >>> s = set()
         >>> s
         set()
         >>> s = set([4, 3, 2, 1])
         >>> s
         {1, 2, 3, 4}
         >>> s = {4, 3, 2, 1}
         >>> s
         {1, 2, 3, 4}
         >>> s = set("abcd")
         >>> s
         {'a', 'c', 'b', 'd'}
         >>> s = {1, 1, 1, 1, 2, 2}
         >>> s
         {1, 2}

      - notice that there were two constructors
         - the empty constructor (set()), which created an empty set
         - and a constructor that took a single parameter
            - a list
            - a string
            - in general, any thing that we can iterate over in a for loop
      - when we print out the value of s, a non-empty set is shown in curly braces, e.g. {1, 2, 3, 4}, while an empty set is shown as set()
      - notice that even though we may give it something where there is ordering, the ordering is NOT preserved

   - set methods
      - class methods can be broken down into two types of methods
         - mutator methods that change the underlying object
         - accessor methods that do NOT change the underlying object, but ask some question about the data and give us some information back
      - from the help output, which of the following are mutator vs. accessor?
         - add
         - clear
         - difference
         - difference_update
         - intersection
         - intersection_update
         - ...
      - mutators: add, clear, difference_update, intersection_update
         - all of these will change the object
      - accessor: difference, intersection
         - these will NOT change the object
      - other interesting methods
         - pop
         - remove
         - isdisjoint
         - issubset
         - issuperset
         - union
         - update
      - supports most of the methods you'd want for a set
         >>> s = {1,2,3,4}
         >>> s.add(5)
         >>> s
         {1, 2, 3, 4, 5}
         >>> s2 = set([4, 5, 6, 7])
         >>> s2
         {4, 5, 6, 7}
         >>> s.difference(s2)
         {1, 2, 3}
         >>> s
         {1, 2, 3, 4, 5}
         >>> s2
         {4, 5, 6, 7}
         >>> s.union(s2)
         {1, 2, 3, 4, 5, 6, 7}
         >>> s.intersection(s2)
         {4, 5}
         >>> s
         {1, 2, 3, 4, 5}
         >>> s2
         {4, 5, 6, 7}

      - we can also ask if an item is in a set
         >>> 1 in s2
         False
         >>> 5 in s2
         True
         >>> "abc" in s2
         False
         >>> s2 in s2
         False

      - notice that you CANNOT index into a set (there is no order)
         >>> s[0]
         Traceback (most recent call last):
          File "<string>", line 1, in <fragment>
         TypeError: 'set' object does not support indexing
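
To make the mutator/accessor distinction from the set methods above concrete, here is a short sketch comparing intersection (an accessor) with intersection_update (a mutator) on the same two sets used in the examples.

```python
s = {1, 2, 3, 4, 5}
s2 = {4, 5, 6, 7}

# accessor: builds and returns a new set, s is left alone
common = s.intersection(s2)
print(common)
print(s)

# mutator: changes s in place and returns None
result = s.intersection_update(s2)
print(result)
print(s)
```

A quick rule of thumb: a mutator usually returns None, so assigning its result to a variable is almost always a mistake.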

why sets?
   - seems like we could do all of these things and more with lists?
      - lists have operations similar to those of sets, e.g., append (like add), pop, and index for finding an item
      - some nice operations like union and intersection, but we could put these in the list class
      - in fact, lists also support the "in" notation
         >>> some_list = [1, 2, 3, 4]
         >>> 4 in some_list
         True
         >>> "abc" in some_list
         False

   - why have the separate class for set?
      - performance!

   - write the following function:
      - contains(list, item)
         - returns True if the item is in the list
         - returns False otherwise
         - don't use "in" or "find"

      def contains(list, item):
         for thing in list:
            if thing == item:
               return True

         return False

      - If we're searching for an item and we double the size of the list, how much longer (on average) do you think it would take to run this function?
         - twice as long
         - we're looping through each item in the list ( O(n)! )
         - computers are fast, but there still is a cost to each operation
      - what if we quadrupled the size of the list?
         - four times as long
      - the contains function above is called a "linear" runtime function
         - its runtime varies linearly with respect to the input
      - can we do better than linear for finding an item?
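
One way to see the linear growth without relying on a stopwatch is to count comparisons instead of measuring time. The variant below, contains_counting, is a made-up helper for illustration: it searches for an item that is not in the list (the worst case), so every element has to be checked.

```python
def contains_counting(items, item):
    """Like contains, but also report how many comparisons were made."""
    comparisons = 0
    for thing in items:
        comparisons += 1
        if thing == item:
            return True, comparisons
    return False, comparisons


# -1 is never in range(size), so the whole list is scanned each time
for size in [1000, 2000, 4000]:
    found, comparisons = contains_counting(list(range(size)), -1)
    print(size, found, comparisons)
```

Doubling the list doubles the number of comparisons, and quadrupling it quadruples them.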

look at lists_vs_sets.py code
   - two functions for generating data
      - generate_set: generates random points and puts them into a set
      - generate_list: generates random points and puts them into a list
   - query_data
      - generates num_queries random numbers
      - uses "in" to see if they are in the data set
      - times how long it takes to do num_queries
   - speed_test
      - generates equal sized data sets in both list and set form
      - then calls query_data to see how long it takes to query each one

         >>> speed_test(1000, 100)
         List creation took 0.003422 seconds
         Set creation took 0.003589 seconds
         --
         List querying took 0.002917 seconds
         Set querying took 0.000194 seconds

      - for small sizes, they behave fairly similarly
      - as we increase the size of the set and the number of queries, however, we start to see some differences

         >>> speed_test(10000, 100)
         List creation took 0.023313 seconds
         Set creation took 0.021885 seconds
         --
         List querying took 0.021288 seconds
         Set querying took 0.000179 seconds

         >>> speed_test(10000, 1000)
         List creation took 0.020332 seconds
         Set creation took 0.021198 seconds
         --
         List querying took 0.213577 seconds
         Set querying took 0.001833 seconds

         >>> speed_test(100000, 1000)
         List creation took 0.186876 seconds
         Set creation took 0.220910 seconds
         --
         List querying took 2.148366 seconds
         Set querying took 0.001881 seconds

      - we can better understand these by generating points as we increase the size of the set/list and then plotting them
         >>> speed_data(5000, 10000, 100000, 5000)
         size   list   set
         10000   0.237790   0.001881
         15000   0.358325   0.001999
         20000   0.469743   0.001956
         25000   0.602107   0.001916
         30000   0.687776   0.001889
         35000   0.824027   0.001903
         40000   0.921235   0.001952
         45000   1.009843   0.001912
         50000   1.156059   0.001927
         55000   1.386080   0.001913
         60000   1.566058   0.001984
         65000   1.722870   0.001936
         70000   2.025138   0.001966
         75000   2.363384   0.001962
         80000   2.619580   0.002030
         85000   2.897005   0.002054
         90000   2.975576   0.001946
         95000   3.418256   0.002082

      - we can copy and paste this in Excel and plot it
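
The functions described above for lists_vs_sets.py could look roughly like the sketch below. The signatures, the max_value parameter, and the speed_test_sketch name are all assumptions, so the timings it prints will not match the numbers in these notes.

```python
import random
import time


def generate_list(size, max_value):
    """Return a list of size random integers between 0 and max_value."""
    return [random.randint(0, max_value) for _ in range(size)]


def generate_set(size, max_value):
    """Return a set of size distinct random integers (needs max_value >= size)."""
    data = set()
    while len(data) < size:
        data.add(random.randint(0, max_value))
    return data


def query_data(data, num_queries, max_value):
    """Return the seconds taken by num_queries random membership tests."""
    queries = [random.randint(0, max_value) for _ in range(num_queries)]
    hits = 0
    start = time.perf_counter()
    for query in queries:
        if query in data:
            hits += 1
    return time.perf_counter() - start


def speed_test_sketch(size, num_queries):
    """Time the same number of queries against a list and a set."""
    max_value = size * 10
    list_time = query_data(generate_list(size, max_value), num_queries, max_value)
    set_time = query_data(generate_set(size, max_value), num_queries, max_value)
    print("List querying took %f seconds" % list_time)
    print("Set querying took %f seconds" % set_time)
    return list_time, set_time


speed_test_sketch(2000, 200)
```

The list query time grows with the size of the data because "in" scans a list element by element, while a set uses hashing, so its query time stays roughly flat, as the table above shows.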

when to use a set vs. a list?
   - lists have an ordering and allow duplicates
      - if you need indexing, use a list
   - sets are faster for asking membership
      - if you don't care about the order, use a set!