CS51A - Spring 2019 - Class 34
Example code in this lecture
url_extractor.py
url_extractor_improved.py
url_basics_ssl.py
exceptions.py
Lecture notes
Revisiting
url_extractor.py code
- look at the webpage
- look at the output: do we get *all* of the lecture note links?
- No! We miss those with the .pptx links. Why?
- the code assumes one lecture per line, but that's not true
- how do we fix this?
- rather than searching per line, treat the entire webpage as a long string
- search for the first occurrence of lecture
- extract it
- then search again, starting at the end of that occurrence
look at the get_note_urls_improved function in
url_extractor_improved.py code
- read
- rather than reading a line at a time, we can read the entire contents all at once
- this also works on files
- we then decode this so that page_text has all of the webpage text as a string
- what does "begin_index = page_text.find(search_line)" do?
- searches for the index of the first occurrence of "lectures/"
- will the code enter the while loop?
- only if it finds an occurrence (find returns -1 when there is no match)
- what does "end_index = page_text.find('"', begin_index)" do?
- searches for the end of the link
- we can then extract the url
- what does "begin_index = page_text.find(search_line, end_index)" do?
- searches *again*, but now starting at end_index, the end of the last link found
- if we run the improved version, we now get the pptx links too
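The whole-page search described above can be sketched as follows. This is a minimal sketch rather than the code in url_extractor_improved.py: it assumes the page text has already been read and decoded into a string, and the function name get_note_urls and the sample string are made up for illustration.

```python
def get_note_urls(page_text, search_text="lectures/"):
    """Return every link in page_text that starts with search_text."""
    urls = []
    begin_index = page_text.find(search_text)

    # find returns -1 when there are no more occurrences
    while begin_index != -1:
        # the link ends at the next double quote
        end_index = page_text.find('"', begin_index)
        if end_index == -1:
            break  # no closing quote, so stop rather than loop forever
        urls.append(page_text[begin_index:end_index])

        # search again, starting where the last link ended
        begin_index = page_text.find(search_text, end_index)

    return urls


# hypothetical one-line page with two links on the same line
sample = ('<a href="lectures/lecture1-intro.html">notes</a> '
          '<a href="lectures/lecture1-intro.pptx">slides</a>')
print(get_note_urls(sample))
```

Because the search never depends on line breaks, both links are found even though they sit on the same line.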
how could we change our code to just extract the name of the file (e.g., lecture1-intro.html)?
- look at get_note_files_only function in
url_extractor_improved.py code
- key change: we want to skip the "lectures/" part when extracting the file name
- rather than using begin_index, we want to skip the length of "lectures/" forward when extracting
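A minimal sketch of that change, assuming the same find-based loop as before. The body is an illustration, not necessarily the code in url_extractor_improved.py, and the sample string is made up.

```python
def get_note_files_only(page_text, search_text="lectures/"):
    """Return just the file names of links that start with search_text."""
    files = []
    begin_index = page_text.find(search_text)

    while begin_index != -1:
        end_index = page_text.find('"', begin_index)
        if end_index == -1:
            break

        # start the slice after "lectures/" so only the file name is kept
        name_start = begin_index + len(search_text)
        files.append(page_text[name_start:end_index])

        begin_index = page_text.find(search_text, end_index)

    return files


# hypothetical page text
sample = ('<a href="lectures/lecture1-intro.html">notes</a> '
          '<a href="lectures/lecture1-intro.pptx">slides</a>')
print(get_note_files_only(sample))
```

The only difference from the URL version is the starting point of the slice.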
what's the difference between http and https?
- the 's' stands for secure
- when you communicate with an https website:
- you get some reassurance that you're actually communicating with the website (rather than someone pretending to be the website)
- your communications are encrypted so it's difficult to see what information you're sending back and forth
- there is a bit of overhead in setting up this communication properly
- the right way is to install SSL certificates for python
- for simplicity, however, you can also tell python to simply ignore the SSL certificates and connect to an https site without checking.
- look at
url_basics_ssl.py code
- urlopen has an optional parameter that you can specify that will allow you to connect to an https webpage without checking SSL certificates
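One way this might look, as a sketch that may differ from url_basics_ssl.py: urlopen accepts a context argument, and an ssl context can be configured to skip certificate checks. The helper names make_unverified_context and read_page are made up for illustration. Skipping the checks gives up the reassurance that you are talking to the real site, so this is for class exercises only.

```python
import ssl
from urllib.request import urlopen


def make_unverified_context():
    """Build an SSL context that does NOT check certificates."""
    context = ssl.create_default_context()
    # check_hostname has to be turned off before verify_mode can be CERT_NONE
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return context


def read_page(url):
    """Read an https page as a string without verifying its certificate."""
    with urlopen(url, context=make_unverified_context()) as web_page:
        # read() gives bytes, so decode to get a string
        return web_page.read().decode("utf-8", errors="replace")
```

Calling read_page with an https URL would then return the page text even if Python's SSL certificates are not installed.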
web crawler
- how does google know what's on the web?
reading web pages: ethics
- you are reading a file on a remote server
- you shouldn't be doing this repeatedly
- if you're trying to debug some code, copy the source into a file and debug that way before running live
- a web site owner may restrict which content they want programs to read
- see
http://www.robotstxt.org/
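Python's standard library includes urllib.robotparser for reading these rules. The sketch below parses a made-up set of robots.txt lines rather than fetching a real file; the helper name allowed_to_fetch, the rules, and the example.com URLs are all assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def allowed_to_fetch(robots_lines, url, user_agent="*"):
    """Check a URL against robots.txt rules given as a list of lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


# hypothetical rules: every program is asked to stay out of /private/
rules = ["User-agent: *", "Disallow: /private/"]

print(allowed_to_fetch(rules, "http://example.com/lectures/lecture1.html"))
print(allowed_to_fetch(rules, "http://example.com/private/grades.html"))
```

With these rules the first check is allowed and the second is not. For a live site, RobotFileParser can also download the robots.txt file itself with set_url and read.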
exceptions
- look at the list_max function in
exceptions.py code
- are there any inputs that would give an error?
- non-numerical
- empty lists
- how could we fix this?
- check if it's equal to the empty list
- print an error message
- return ???
- a better way to fix this is to raise an exception (like you've probably seen for other problems)
- Exceptions are another way of communicating information from a function/expression:
>>> 1/0
Traceback (most recent call last):
Python Shell, prompt 3, line 1
builtins.ZeroDivisionError: division by zero
- they allow us to give information back from a function besides return
- if we don't do anything about them, exceptions will cause the program to terminate
raising exceptions
- look at the list_max_better function in
exceptions.py code
- to raise an exception, you use the keyword "raise" and then create a new Exception object
>>> list_max_better([1, 2, 3])
3
>>> list_max_better([])
Traceback (most recent call last):
Python Shell, prompt 3, line 1
# Used internally for debug sandbox under external interpreter
File "/Users/drk04747/classes/cs51a/examples/exceptions.py", line 12, in <module>
raise Exception("list must be non-empty")
builtins.Exception: list must be non-empty
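A function that behaves like the transcript above might look like the following. Only the raise line and its message come from the traceback; the rest of the body is an assumption about how list_max_better is written in exceptions.py.

```python
def list_max_better(numbers):
    """Return the largest value in numbers; raise if the list is empty."""
    if numbers == []:
        # there is no sensible value to return, so signal the problem instead
        raise Exception("list must be non-empty")

    largest = numbers[0]
    for number in numbers:
        if number > largest:
            largest = number

    return largest


print(list_max_better([1, 2, 3]))
```

Raising avoids having to pick some special return value for the empty list that a caller could mistake for a real maximum.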
look at the get_scores function in
exceptions.py code
- are there any inputs that the user could enter that would cause a problem? Specifically, cause the function to exit early?
>>> get_scores()
Enter the scores one at a time. Blank score finishes.
Enter score: 1
Enter score: banana
Traceback (most recent call last):
Python Shell, prompt 2, line 1
# Used internally for debug sandbox under external interpreter
File "/Users/drk04747/classes/cs51a/examples/exceptions.py", line 29, in <module>
scores.append(float(line))
builtins.ValueError: could not convert string to float: 'banana'
- if we enter a non-numerical value, we get a "ValueError"
- what would you like to do?
- better to prompt the user to enter a number and try again
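One way to prompt again is to catch the ValueError with Python's try/except. This is a sketch, not the code in exceptions.py: the prompts are copied from the transcript above, but the read_line parameter is a made-up addition so the function can be tried without typing at the keyboard.

```python
def get_scores(read_line=input):
    """Collect scores until a blank line, re-prompting on bad input."""
    print("Enter the scores one at a time. Blank score finishes.")
    scores = []

    line = read_line("Enter score: ")
    while line != "":
        try:
            scores.append(float(line))
        except ValueError:
            # float() could not convert the text, so ask again
            print("Please enter a number.")
        line = read_line("Enter score: ")

    return scores
```

Calling get_scores() reads from the keyboard as before. Entering "banana" now prints the reminder and asks for another score instead of ending the function early.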