CS51A - Spring 2019

CS51A - Spring 2019 - Class 34

Example code in this lecture

   url_extractor.py
   url_extractor_improved.py
   url_basics_ssl.py
   exceptions.py

Lecture notes

Revisiting url_extractor.py code
   - look at the webpage
   - look at the output: do we get *all* of the lecture note links?
      - No! We miss those with the .pptx links. Why?
      - the code assumes one lecture per line, but that's not true

   - how do we fix this?
      - rather than searching per line, treat the entire webpage as a long string
      - search for the first occurrence of lecture
      - extract it
      - then search again, starting at the end of that occurrence

look at the get_note_urls_improved function in url_extractor_improved.py code
   - read
      - rather than reading a line at a time, we can read the entire contents all at once
      - this also works on files
   - we then decode this so that page_text has all of the webpage text as a string
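
   - a minimal sketch of the read-then-decode step. To avoid hitting a live server, an io.BytesIO object stands in for what urlopen returns (both have a read method); the fake page contents and the variable names other than page_text are made up for illustration, and UTF-8 is assumed as the encoding:

```python
import io

# stand-in for the object returned by urlopen(...); it also has a read() method
fake_page = io.BytesIO(b'<html>\n<a href="lectures/lecture1-intro.html">\n</html>\n')

raw_bytes = fake_page.read()     # the entire contents at once, as bytes
page_text = raw_bytes.decode()   # bytes -> str (decode() assumes UTF-8 by default)

print(type(raw_bytes))
print(type(page_text))
print(page_text.count("\n"), "lines, all in one string")
```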

   - what does "begin_index = page_text.find(search_line)" do?
      - searches for the index of the first occurrence of "lectures/"
      - will the code enter the while loop?
         - if it finds an occurrence

   - what does "end_index = page_text.find('"', begin_index)" do?
      - searches for the end of the link
   - we can then extract the url
   - what does "begin_index = page_text.find(search_line, end_index)" do?
      - searches *again*, but now starting at end_index, the end of the last link found

   - if we run the improved version, we now get the pptx links too
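
   - the steps above can be sketched as follows. This is not the actual code from url_extractor_improved.py: the function name extract_urls and the sample page text are invented for illustration, and the check for a missing closing quote is an extra safeguard that the original may not have:

```python
def extract_urls(page_text, search_line="lectures/"):
    """Return every link in page_text that starts with search_line."""
    urls = []
    begin_index = page_text.find(search_line)

    # find returns -1 when there are no (more) occurrences
    while begin_index != -1:
        # the link ends at the next double quote
        end_index = page_text.find('"', begin_index)
        if end_index == -1:
            break  # malformed page: no closing quote, so stop

        urls.append(page_text[begin_index:end_index])

        # search again, starting at the end of the link we just found
        begin_index = page_text.find(search_line, end_index)

    return urls


# two links on the same line, which the per-line version would have missed
sample_page = ('<a href="lectures/lecture1-intro.html">notes</a> '
               '<a href="lectures/lecture1-intro.pptx">pptx</a>')
found = extract_urls(sample_page)
print(found)
```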

how could we change our code to just extract the name of the file (e.g., lecture1-intro.html)?
   - look at get_note_files_only function in url_extractor_improved.py code

   - key change: we want to skip the "lectures/" part when extracting the file name
      - rather than using begin_index, we want to skip the length of "lectures/" forward when extracting
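
   - a small sketch of that slicing change on a single link (the sample text and the variable names url and file_name are made up for illustration; get_note_files_only itself may be written differently):

```python
search_line = "lectures/"
page_text = '<a href="lectures/lecture1-intro.html">'

begin_index = page_text.find(search_line)
end_index = page_text.find('"', begin_index)

# starting the slice at begin_index keeps the "lectures/" prefix
url = page_text[begin_index:end_index]

# moving the start forward by len(search_line) skips the prefix
file_name = page_text[begin_index + len(search_line):end_index]

print(url)
print(file_name)
```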

what's the difference between http and https?
   - the 's' stands for secure
   - when you communicate with an https website:
      - you get some reassurance that you're actually communicating with the website (rather than someone pretending to be the website)
      - your communications are encrypted so it's difficult to see what information you're sending back and forth
   - there is a bit of overhead in setting up this communication properly
      - the right way is to install SSL certificates for python
   - for simplicity, however, you can also tell python to simply ignore the SSL certificates and connect to an https site without checking.
   - look at url_basics_ssl.py code
      - urlopen has an optional parameter that you can specify that will allow you to connect to an https webpage without checking SSL certificates
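
   - one way this might look, as a sketch only: url_basics_ssl.py may do it differently. The helper names make_unverified_context and open_without_cert_check are made up here; the context parameter of urlopen and the ssl settings are part of the standard library. Skipping the check gives up the reassurance described above, so it is only reasonable for class exercises:

```python
import ssl
from urllib.request import urlopen


def make_unverified_context():
    """Build an SSL context that does NOT verify the server's certificate."""
    context = ssl.create_default_context()
    # check_hostname must be turned off before verify_mode can be CERT_NONE
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return context


def open_without_cert_check(url):
    """Open an https url without checking its SSL certificate."""
    return urlopen(url, context=make_unverified_context())


# Building the context does not touch the network; calling
# open_without_cert_check(...) would.
context = make_unverified_context()
print(context.verify_mode == ssl.CERT_NONE)
```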

web crawler
   - how does Google know what's on the web?

reading web pages: ethics
   - you are reading a file on a remote server
      - you shouldn't be doing this repeatedly
      - if you're trying to debug some code, copy the source into a file and debug that way before running live
   - a web site owner may place restrictions on what content they want programs like yours looking at
      - see http://www.robotstxt.org/
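
   - a sketch of how a program could check such restrictions using the standard library's urllib.robotparser. The rules, the example.com urls, and the agent name "my-crawler" are all made up for illustration; a real crawler would fetch the site's own robots.txt file instead of using a hard-coded string:

```python
from urllib.robotparser import RobotFileParser

# made-up robots.txt contents: every program is asked to stay out of /private/
rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)   # parse from a list of lines, so no network access is needed

allowed = parser.can_fetch("my-crawler", "http://example.com/lectures/lecture1.html")
blocked = parser.can_fetch("my-crawler", "http://example.com/private/grades.html")

print(allowed)
print(blocked)
```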

exceptions
   - look at the list_max function in exceptions.py code
      - are there any inputs that would give an error?
         - non-numerical
         - empty lists
      - how could we fix this?
         - check if it's equal to the empty list
            - print an error message
            - return ???
   - a better way to fix this is to raise an exception (like you've probably seen for other problems)

   - Exceptions are another way of communicating information from a function/expression:
      >>> 1/0
      Traceback (most recent call last):
       Python Shell, prompt 3, line 1
      builtins.ZeroDivisionError: division by zero

   - they allow us to give information back from a function besides return

   - if we don't do anything about them, exceptions will cause the program to terminate

raising exceptions
   - look at the list_max_better function in exceptions.py code
   - to raise an exception, you use the keyword "raise" and then create a new Exception object

      >>> list_max_better([1, 2, 3])
      3
      >>> list_max_better([])
      Traceback (most recent call last):
       Python Shell, prompt 3, line 1
       # Used internally for debug sandbox under external interpreter
       File "/Users/drk04747/classes/cs51a/examples/exceptions.py", line 12, in <module>
       raise Exception("list must be non-empty")
      builtins.Exception: list must be non-empty
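
   - a sketch of what a function like list_max_better could look like. Only the function name and the exception message come from the transcript above; the parameter name and the loop are assumptions, and the version in exceptions.py may differ:

```python
def list_max_better(numbers):
    """Return the largest value in numbers; raise an exception if it is empty."""
    if numbers == []:
        # there is no sensible value to return, so signal the problem instead
        raise Exception("list must be non-empty")

    largest = numbers[0]
    for value in numbers[1:]:
        if value > largest:
            largest = value

    return largest


print(list_max_better([1, 2, 3]))
```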

look at the get_scores function in exceptions.py code
   - are there any inputs that the user could enter that would cause a problem? Specifically, cause the function to exit early?
      >>> get_scores()
      Enter the scores one at a time. Blank score finishes.
      Enter score: 1
      Enter score: banana
      Traceback (most recent call last):
       Python Shell, prompt 2, line 1
       # Used internally for debug sandbox under external interpreter
       File "/Users/drk04747/classes/cs51a/examples/exceptions.py", line 29, in <module>
       scores.append(float(line))
      builtins.ValueError: could not convert string to float: 'banana'

      - if we enter a non-numerical value, we get a "ValueError"
   - what would you like to do?
      - better to prompt the user to enter a number and try again
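
   - one possible way to get that behavior is to catch the ValueError with try/except and ask again. This is a sketch, not the code in exceptions.py: the read_line parameter is an addition made here so the function can be run with canned answers instead of the keyboard, and the wording of the retry message is made up:

```python
def get_scores(read_line=input):
    """Read scores until a blank line, asking again after non-numerical input."""
    print("Enter the scores one at a time. Blank score finishes.")
    scores = []

    line = read_line("Enter score: ")
    while line != "":
        try:
            scores.append(float(line))
        except ValueError:
            # float() could not convert the text, so nothing was appended;
            # tell the user and let the loop prompt again
            print("That was not a number, please try again.")
        line = read_line("Enter score: ")

    return scores


# canned answers standing in for what a user might type
canned = iter(["1", "banana", "2.5", ""])
result = get_scores(lambda prompt: next(canned))
print(result)
```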