CS51A - Spring 2019 - Class 34

Example code in this lecture

   url_extractor.py
   url_extractor_improved.py
   url_basics_ssl.py
   exceptions.py

Lecture notes

  • Revisiting url_extractor.py code
       - look at the webpage
       - look at the output: do we get *all* of the lecture note links?
          - No! We miss the .pptx links. Why?
          - the code assumes one lecture link per line, but some lines contain more than one

       - how do we fix this?
          - rather than searching per line, treat the entire webpage as a long string
          - search for the first occurrence of lecture
          - extract it
          - then search again starting at the end of that occurrence

  • look at the get_note_urls_improved function in url_extractor_improved.py code
       - read
          - rather than reading a line at a time, we can read the entire contents all at once
          - this also works on files
       - we then decode this so that page_text has all of the webpage text as a string

       - what does "begin_index = page_text.find(search_line)" do?
          - searches for the index of the first occurrence of "lectures/"
          - will the code enter the while loop?
             - if it finds an occurrence

       - what does "end_index = page_text.find('"', begin_index)" do?
          - searches for the end of the link
       - we can then extract the url
       - what does "begin_index = page_text.find(search_line, end_index)" do?
          - searches *again*, but now starting at end_index, the end of the last link found

       - if we run the improved version, we now get the pptx links too
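
       - the search loop described above can be sketched on a plain string. This is a minimal sketch, not the actual code in url_extractor_improved.py: the function name extract_links, the parameter name search_text, and the sample string are made up for illustration.

```python
def extract_links(page_text, search_text="lectures/"):
    """Collect every link that starts with search_text and ends at the next double quote."""
    urls = []

    # find returns -1 when there is no occurrence, so the loop is skipped entirely
    begin_index = page_text.find(search_text)

    while begin_index != -1:
        # the link ends at the closing quote of the href attribute
        end_index = page_text.find('"', begin_index)
        if end_index == -1:
            break  # malformed page: no closing quote, so stop searching

        urls.append(page_text[begin_index:end_index])

        # search again, starting from the end of the link we just extracted
        begin_index = page_text.find(search_text, end_index)

    return urls


# hypothetical sample with two links on the same line
sample = '<a href="lectures/lecture1-intro.html">notes</a> <a href="lectures/lecture1-intro.pptx">slides</a>'
print(extract_links(sample))
```

       - because the whole page is one string, it no longer matters how many links share a line.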

  • how could we change our code to just extract the name of the file (e.g., lecture1-intro.html)?
       - look at get_note_files_only function in url_extractor_improved.py code
       
       - key change: we want to skip the "lectures/" part when extracting the file name
          - rather than using begin_index, we want to skip the length of "lectures/" forward when extracting
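
       - a minimal sketch of that change, assuming the same search loop as before. The name extract_file_names is hypothetical and the body of get_note_files_only may differ; the point is only the slice that starts len(search_text) characters past begin_index.

```python
def extract_file_names(page_text, search_text="lectures/"):
    """Collect only the file-name part of each link, dropping the search_text prefix."""
    names = []
    begin_index = page_text.find(search_text)

    while begin_index != -1:
        end_index = page_text.find('"', begin_index)
        if end_index == -1:
            break  # no closing quote, nothing more to extract

        # skip forward by the length of "lectures/" so the prefix is not included
        names.append(page_text[begin_index + len(search_text):end_index])

        begin_index = page_text.find(search_text, end_index)

    return names


print(extract_file_names('<a href="lectures/lecture1-intro.html">notes</a>'))
```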

  • what's the difference between http and https?
       - the 's' stands for secure
       - when you communicate with an https website:
          - you get some reassurance that you're actually communicating with the website (rather than someone pretending to be the website)
          - your communications are encrypted so it's difficult to see what information you're sending back and forth
       - there is a bit of overhead in setting up this communication properly
          - the right way is to install SSL certificates for python
       - for simplicity, however, you can also tell python to simply ignore the SSL certificates and connect to an https site without checking.
       - look at url_basics_ssl.py code
          - urlopen has an optional parameter that you can specify that will allow you to connect to an https webpage without checking SSL certificates
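
       - one way to do this with the standard library is to pass an ssl context through urlopen's context parameter. This is a sketch under assumptions: whether url_basics_ssl.py builds its context this way is not shown in these notes, and the helper names make_unverified_context and read_page are made up. Skipping the check gives up the reassurance described above, so it is only appropriate for experimenting.

```python
import ssl
from urllib.request import urlopen


def make_unverified_context():
    """Build an SSL context that does not check the server's certificate."""
    context = ssl.create_default_context()
    # check_hostname must be turned off before verify_mode can be set to CERT_NONE
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return context


def read_page(url):
    """Read an https page without certificate checking and return its text."""
    with urlopen(url, context=make_unverified_context()) as response:
        # read() returns bytes; decode turns them into a string
        return response.read().decode("utf-8", errors="replace")
```

       - calling read_page with an https address would then return the page text as a string.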

  • web crawler
       - how does google know what's on the web?

  • reading web pages: ethics
       - you are reading a file on a remote server
          - you shouldn't be doing this repeatedly
          - if you're trying to debug some code, copy the source into a file and debug that way before running live
       - a web site owner may also restrict which content programs are allowed to read
          - see http://www.robotstxt.org/
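
       - Python's standard library includes urllib.robotparser for checking these rules. The sketch below parses a made-up robots.txt from a string so it does not touch any server; the rules, the example.com addresses, and the agent name "cs51a-student" are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# hypothetical robots.txt contents: everyone is asked to stay out of /private/
rules = """User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# can_fetch answers: may this user agent read this url?
allowed_lecture = parser.can_fetch("cs51a-student", "http://example.com/lectures/lecture1-intro.html")
allowed_private = parser.can_fetch("cs51a-student", "http://example.com/private/grades.html")

print(allowed_lecture, allowed_private)
```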


  • exceptions
       - look at the list_max function in exceptions.py code
          - are there any inputs that would give an error?
             - non-numerical
             - empty lists
          - how could we fix this?
             - check if it's equal to the empty list
                - print an error message
                - return ???
       - a better way to fix this is to raise an exception (like you've probably seen for other problems)

       - Exceptions are another way of communicating information from a function/expression:
          >>> 1/0
          Traceback (most recent call last):
           Python Shell, prompt 3, line 1
          builtins.ZeroDivisionError: division by zero


       - they allow us to give information back from a function besides return
       
       - if we don't do anything about them, exceptions will cause the program to terminate

  • raising exceptions
       - look at the list_max_better function in exceptions.py code
       - to raise an exception, you use the keyword "raise" and then create a new Exception object
       
          >>> list_max_better([1, 2, 3])
          3
          >>> list_max_better([])
          Traceback (most recent call last):
           Python Shell, prompt 3, line 1
           # Used internally for debug sandbox under external interpreter
           File "/Users/drk04747/classes/cs51a/examples/exceptions.py", line 12, in <module>
           raise Exception("list must be non-empty")
          builtins.Exception: list must be non-empty
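
       - a minimal sketch of what such a function could look like. The actual body in exceptions.py is not reproduced in these notes, so the loop and the parameter name numbers are assumptions; only the raise line and its message come from the traceback above.

```python
def list_max_better(numbers):
    """Return the largest value in numbers, raising an exception for an empty list."""
    if len(numbers) == 0:
        # there is no sensible value to return, so signal the problem instead
        raise Exception("list must be non-empty")

    largest = numbers[0]
    for value in numbers[1:]:
        if value > largest:
            largest = value

    return largest


print(list_max_better([1, 2, 3]))
```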

  • look at the get_scores function in exceptions.py code
       - are there any inputs that the user could enter that would cause a problem? Specifically, cause the function to exit early?
          >>> get_scores()
          Enter the scores one at a time. Blank score finishes.
          Enter score: 1
          Enter score: banana
          Traceback (most recent call last):
           Python Shell, prompt 2, line 1
           # Used internally for debug sandbox under external interpreter
           File "/Users/drk04747/classes/cs51a/examples/exceptions.py", line 29, in <module>
           scores.append(float(line))
          builtins.ValueError: could not convert string to float: 'banana'

          - if we enter a non-numerical value, we get a "ValueError"
       - what would you like to do?
          - better to prompt the user to enter a number and try again
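
       - one possible way to do that is to catch the ValueError with try/except and ask again. This is a sketch, not the version in exceptions.py: the read_line parameter is a made-up addition so the function can be tried without typing, and the retry message is invented.

```python
def get_scores(read_line=input):
    """Read scores until a blank line, re-prompting when the entry is not a number."""
    print("Enter the scores one at a time. Blank score finishes.")
    scores = []

    line = read_line("Enter score: ")
    while line != "":
        try:
            # float raises ValueError for text such as 'banana'
            scores.append(float(line))
        except ValueError:
            # instead of terminating, tell the user and keep going
            print("That is not a number. Please try again.")

        line = read_line("Enter score: ")

    return scores
```

       - calling get_scores() with no argument reads from the keyboard; entering "banana" now prints the retry message instead of ending the function.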