CS51A - Fall 2019 - Class 24
Example code in this lecture
url_extractor_improved.py
url_basics_ssl.py
exceptions.py
lists_vs_sets.py
Lecture notes
look at the get_note_urls_improved function in
url_extractor_improved.py code
- read
- rather than reading a line at a time, we can read the entire contents all at once
- this also works on files
- we then decode this so that page_text has all of the webpage text as a string
- what does "begin_index = page_text.find(search_line)" do?
- searches for the index of the first occurrence of "lectures/"
- will the code enter the while loop?
- if it finds an occurrence
- what does "end_index = page_text.find('"', begin_index)" do?
- searches for the end of the link
- we can then extract the url
- what does "begin_index = page_text.find(search_line, end_index)" do?
- searches *again*, but now starting at end_index, the end of the last link found
- if we run the improved version, we now get the pptx links too
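The find-based loop described above can be sketched as follows. This is a minimal illustration that works on a string already in memory rather than a live web page; the function name extract_links and the sample text are assumptions for the example, not the contents of url_extractor_improved.py.

```python
def extract_links(page_text, search_line="lectures/"):
    """Return every substring that starts at search_line and runs up to the next double quote."""
    urls = []
    begin_index = page_text.find(search_line)  # -1 if there is no occurrence
    while begin_index != -1:
        end_index = page_text.find('"', begin_index)  # the quote that closes the link
        if end_index == -1:
            break  # no closing quote: stop rather than slice incorrectly
        urls.append(page_text[begin_index:end_index])
        # search again, starting from the end of the link we just found
        begin_index = page_text.find(search_line, end_index)
    return urls


# hypothetical page text, used only for illustration
sample = ('<a href="lectures/lecture1-intro.html">notes</a> '
          '<a href="lectures/lecture1-intro.pptx">slides</a>')
print(extract_links(sample))
```

Because find returns -1 when nothing is found, the same call decides both whether the loop starts and when it stops.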
how could we change our code to just extract the name of the file (e.g., lecture1-intro.html)?
- look at get_note_files_only function in
url_extractor_improved.py code
- key change: we want to skip the "lectures/" part when extracting the file name
- rather than starting the extraction at begin_index, we start len("lectures/") characters after begin_index
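A minimal sketch of that key change, assuming the search string is "lectures/" and the page text is already a string. The name extract_file_names is hypothetical; the actual get_note_files_only function may differ in its details.

```python
def extract_file_names(page_text, search_line="lectures/"):
    """Return only the part of each link that follows search_line."""
    names = []
    begin_index = page_text.find(search_line)
    while begin_index != -1:
        end_index = page_text.find('"', begin_index)
        if end_index == -1:
            break  # no closing quote found
        # start the slice just past "lectures/" instead of at begin_index
        names.append(page_text[begin_index + len(search_line):end_index])
        begin_index = page_text.find(search_line, end_index)
    return names


# hypothetical page text, used only for illustration
sample_page = '<a href="lectures/lecture1-intro.html">notes</a>'
print(extract_file_names(sample_page))
```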
what's the difference between http and https?
- the 's' stands for secure
- when you communicate with an https website:
- you get some reassurance that you're actually communicating with the website (rather than someone pretending to be the website)
- your communications are encrypted so it's difficult to see what information you're sending back and forth
- there is a bit of overhead in setting up this communication properly
- the right way is to install SSL certificates for python
- for simplicity, however, you can also tell python to simply ignore the SSL certificates and connect to an https site without checking.
- look at
url_basics_ssl.py code
- urlopen has an optional parameter that you can specify that will allow you to connect to an https webpage without checking ssl certificates
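One way this could look, as a sketch: urlopen accepts a context argument, and an ssl context can be configured to skip certificate checks. The helper names below are assumptions for illustration, url_basics_ssl.py may set this up differently, and the utf-8 decoding is also an assumption about the page. Skipping verification gives up the reassurance described above, so it is only appropriate for simple experiments.

```python
import ssl
from urllib.request import urlopen


def make_unverified_context():
    """Build an SSL context that does not check the server's certificate."""
    context = ssl.create_default_context()
    context.check_hostname = False       # must be turned off before changing verify_mode
    context.verify_mode = ssl.CERT_NONE  # do not verify the certificate
    return context


def read_page(url):
    """Read the whole page at url as a string, without certificate checks."""
    with urlopen(url, context=make_unverified_context()) as response:
        return response.read().decode("utf-8")  # assumes the page is utf-8 encoded
```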
web crawler
- how does google know what's on the web?
reading web pages: ethics
- you are reading a file on a remote server
- you shouldn't be doing this repeatedly
- if you're trying to debug some code, copy the source into a file and debug that way before running live
- web site owners may restrict which content they want programs to read
- see
http://www.robotstxt.org/
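As a sketch of how a program could respect these restrictions, Python's standard library includes urllib.robotparser. The rules and the example.com URLs below are made up for illustration; a real crawler would typically load the site's own robots.txt with set_url and read instead of parsing a hard-coded list.

```python
from urllib.robotparser import RobotFileParser

# hypothetical robots.txt contents, one rule per line
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(rules)

# ask whether any user agent ("*") may fetch each URL
allowed_public = parser.can_fetch("*", "http://example.com/lectures/index.html")
allowed_private = parser.can_fetch("*", "http://example.com/private/grades.html")
print(allowed_public, allowed_private)
```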
exceptions
- look at the list_max function in
exceptions.py code
- are there any inputs that would give an error?
- non-numerical
- empty lists
- how could we fix this?
- check if it's equal to the empty list
- print an error message
- return ???
- a better way to fix this is to raise an exception (like you've probably seen for other problems)
- Exceptions indicate unrecoverable errors, often programmer errors
- Within the function where the error occurs, we don't have enough information to fix it
- Exceptions are another way of communicating information from a function/expression:
>>> 1/0
Traceback (most recent call last):
Python Shell, prompt 3, line 1
builtins.ZeroDivisionError: division by zero
- they allow us to give information back from a function besides return
- if we don't do anything about them, exceptions will cause the program to terminate
raising exceptions
- look at the list_max_better function in
exceptions.py code
- to raise an exception, you use the keyword "raise" and then create a new Exception object
>>> list_max_better([1, 2, 3])
3
>>> list_max_better([])
Traceback (most recent call last):
Python Shell, prompt 3, line 1
# Used internally for debug sandbox under external interpreter
File "/Users/drk04747/classes/cs51a/examples/exceptions.py", line 12, in <module>
raise Exception("list must be non-empty")
builtins.Exception: list must be non-empty
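A function along these lines would produce the behavior shown in the transcript. This is a sketch; only the raise statement is taken from the traceback, and the rest of the body is an assumption about how list_max_better might be written.

```python
def list_max_better(numbers):
    """Return the largest value in numbers, raising an exception if the list is empty."""
    if len(numbers) == 0:
        raise Exception("list must be non-empty")

    largest = numbers[0]
    for value in numbers:
        if value > largest:
            largest = value
    return largest


print(list_max_better([1, 2, 3]))
```

The caller now gets a clear signal for the empty list instead of an arbitrary return value.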
look at the get_scores function in
exceptions.py code
- are there any inputs that the user could enter that would cause a problem? Specifically, cause the function to exit early?
>>> get_scores()
Enter the scores one at a time. Blank score finishes.
Enter score: 1
Enter score: banana
Traceback (most recent call last):
Python Shell, prompt 2, line 1
# Used internally for debug sandbox under external interpreter
File "/Users/drk04747/classes/cs51a/examples/exceptions.py", line 29, in <module>
scores.append(float(line))
builtins.ValueError: could not convert string to float: 'banana'
- if we enter a non-numerical value, we get a "ValueError"
- what would you like to do?
- better to prompt the user to enter a number and try again
- how can we do this?
- one way would be to check that the string is a valid number
- kind of a pain
- decimal numbers
- positive/negative
- even scientific notation is fair game, e.g. 1.3e10
- better way: handle the exception and deal with it
try/except
- we can catch an exception and deal with it using a try/except block (an "exception handler"):
- syntax:
try:
    some code that could raise an exception
except ExceptionName:
    what to do if exception occurs
- the code in the try block is executed
- if no exception is raised
- the code finishes
- the code in the "except" block is skipped and the code keeps running
- if an exception occurs
- the code in the try block is immediately exited
- if it's of the type in the except block
- the code in the except block executes
- then the code keeps running after that
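- if it's another type of exception, the except block is skipped and the exception keeps propagating (ending the program if nothing else handles it)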
- how does this help us for the get_scores function?
- look at the get_scores_better function in
exceptions.py code
- we can handle the ValueError exception and print out an error message, but keep going
>>> get_scores_better()
Enter the scores one at a time. Blank score finishes.
Enter score: 1
Enter score: banana
Enter numbers only!
Enter score: 2
Enter score:
[1.0, 2.0]
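A sketch of a function that behaves like the transcript above. The read_line parameter is an assumption added here so the function can be tried without typing at a prompt; by default it is the built-in input.

```python
def get_scores_better(read_line=input):
    """Read scores until a blank line, skipping entries that are not numbers."""
    print("Enter the scores one at a time. Blank score finishes.")
    scores = []
    line = read_line("Enter score: ")
    while line != "":
        try:
            scores.append(float(line))    # raises ValueError for input like "banana"
        except ValueError:
            print("Enter numbers only!")  # report the problem, then keep looping
        line = read_line("Enter score: ")
    return scores
```

Only the float conversion sits inside the try block, so other kinds of errors are not accidentally hidden.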
look at print_file_stats in
exceptions.py code
- where could we get exceptions from this code?
- file doesn't exist!
- if the file is empty, then we could also get a divide by zero error
look at the print_file_stats_better function in
exceptions.py code
- if we have multiple exceptions, we can have multiple except blocks
- each block will only be executed if an exception of that type is raised
- in the case of the divide by zero error, we'll already have printed out some information (number of words, longest word, shortest word). All we want to do is not have an error raised.
pass
- certain control statements expect code to be there (e.g., if/then, try/except)
- pass can be used as a non-operation: it is code, but it doesn't do anything
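Multiple except blocks and pass can be combined as in the sketch below. The function name and the statistics it prints are assumptions for illustration, not the contents of print_file_stats_better.

```python
def print_file_stats_sketch(filename):
    """Print simple word statistics for a file, handling two kinds of errors."""
    try:
        with open(filename) as file:
            words = file.read().split()
        print("Number of words:", len(words))
        total_length = sum(len(word) for word in words)
        # dividing by len(words) raises ZeroDivisionError for an empty file
        print("Average word length:", total_length / len(words))
    except FileNotFoundError:
        print("Could not find file:", filename)
    except ZeroDivisionError:
        pass  # the word count was already printed; nothing more to do


print_file_stats_sketch("a_file_that_probably_does_not_exist.txt")
```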
sets
- what is a set, e.g. a set of data points?
- an unordered collection of data
- how does this differ from a list?
- a list has a sequential order to it
- lists might have duplicates
- what operations/methods might we want from a set?
- create new/construct
- add things to the set
- ask if something belongs in the set
- intersect
- union
- remove things from the set
set class
>>> help(set)
- the first thing we see is how to create new sets
- we can construct a new set using a constructor or using {} (kind of like dictionaries); note that {} by itself creates an empty dictionary, so an empty set needs set()
>>> s = set()
>>> s
set()
>>> s = set([4, 3, 2, 1])
>>> s
{1, 2, 3, 4}
>>> s = {4, 3, 2, 1}
>>> s
{1, 2, 3, 4}
>>> s = set("abcd")
>>> s
{'a', 'c', 'b', 'd'}
>>> s = {1, 1, 1, 1, 2, 2}
>>> s
{1, 2}
- notice that there were two constructors
- the empty constructor (set()), which created an empty set
- and a constructor that took a single parameter
- a list
- a string
- in general, any thing that we can iterate over in a for loop
- when we print out the value of s, it is displayed in set notation with curly braces, e.g.
"{1, 2, 3, 4}"
- notice that even though we may give it something where there is ordering, the ordering is NOT preserved
- set methods
- class methods can be broken down into two types of methods
- mutator methods that change the underlying object
- accessor methods that do NOT change the underlying object, but ask some question about the data and give us some information back
- from the help output, which of the following are mutator vs. accessor?
- add
- clear
- difference
- difference_update
- intersection
- intersection_update
- ...
- mutators: add, clear, difference_update, intersection_update
- all of these will change the object
- accessor: difference, intersection
- these will NOT change the object
- other interesting methods
- pop
- remove
- isdisjoint
- issubset
- issuperset
- union
- update
- supports most of the methods you'd want for a set
>>> s = {1,2,3,4}
>>> s.add(5)
>>> s
{1, 2, 3, 4, 5}
>>> s2 = set([4, 5, 6, 7])
>>> s2
{4, 5, 6, 7}
>>> s.difference(s2)
{1, 2, 3}
>>> s
{1, 2, 3, 4, 5}
>>> s2
{4, 5, 6, 7}
>>> s.union(s2)
{1, 2, 3, 4, 5, 6, 7}
>>> s.intersection(s2)
{4, 5}
>>> s
{1, 2, 3, 4, 5}
>>> s2
{4, 5, 6, 7}
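The transcript above mostly exercises accessor methods. As a small sketch of the contrast, the code below pairs difference (accessor) with difference_update (mutator) on the same two sets; the variable names result and returned are just for this example.

```python
s = {1, 2, 3, 4, 5}
s2 = {4, 5, 6, 7}

result = s.difference(s2)           # accessor: builds and returns a new set
print(result == {1, 2, 3})          # True
print(s == {1, 2, 3, 4, 5})         # True: s itself is unchanged

returned = s.difference_update(s2)  # mutator: removes the shared items from s
print(returned)                     # None: mutators typically return nothing
print(s == {1, 2, 3})               # True: s was modified in place
```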
- we can also ask if an item is in a set
>>> 1 in s2
False
>>> 5 in s2
True
>>> "abc" in s2
False
>>> s2 in s2
False
- notice that you CANNOT index into a set (there is no order)
>>> s[0]
Traceback (most recent call last):
File "<string>", line 1, in <fragment>
TypeError: 'set' object does not support indexing
why sets?
- seems like we could do all of these things and more with lists?
- lists have operations (e.g., append, pop, index) comparable to the add, pop, and lookup operations that sets have
- some nice operations like union and intersection, but we could put these in the list class
- in fact, lists also support the "in" notation
>>> some_list = [1, 2, 3, 4]
>>> 4 in some_list
True
>>> "abc" in some_list
False
- why have the separate class for set?
- performance!
- write the following function:
- contains(list, item)
- returns True if the item is in the list
- False otherwise
- don't use "in" or "find"
def contains(list, item):
    for thing in list:
        if thing == item:
            return True
    return False
- If we're searching for an item and we double the size of the list, how much longer (on average) do you think it would take to run this function?
- twice as long
- we're looping through each item in the list ( O(n)! )
- computers are fast, but there still is a cost to each operation
- what if we quadrupled the size of the list?
- four times as long
- the contains function above is called a "linear" runtime function
- its runtime varies linearly with respect to the input
- can we do better than linear for finding an item?
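To make the linear growth concrete without relying on timing, the sketch below counts comparisons instead of seconds. contains_counting is a hypothetical variant of contains written for this illustration, and it measures the worst case (the item is not in the list) rather than the average case.

```python
def contains_counting(items, item):
    """Like contains, but also report how many comparisons were made."""
    comparisons = 0
    for thing in items:
        comparisons += 1
        if thing == item:
            return True, comparisons
    return False, comparisons


for size in [1000, 2000, 4000]:
    data = list(range(size))
    # -1 is never in data, so every element gets compared
    found, comparisons = contains_counting(data, -1)
    print(size, found, comparisons)
```

Doubling the list size doubles the number of comparisons, which is what "linear" means here.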
look at
lists_vs_sets.py code
- two functions for generating data
- generate_set: generates random points and puts them into a set
- generate_list: generates random points and puts them into a list
- query_data
- generates num_queries random numbers
- uses "in" to see if they are in the data set
- times how long it takes to do num_queries
- speed_test
- generates equal sized data sets in both list and set form
- then calls query_data to see how long it takes to query each one
>>> speed_test(1000, 100)
List creation took 0.003422 seconds
Set creation took 0.003589 seconds
--
List querying took 0.002917 seconds
Set querying took 0.000194 seconds
- for small sizes, they behave fairly similarly
- as we increase the size of the set and the number of queries, however, we start to see some differences
>>> speed_test(10000, 100)
List creation took 0.023313 seconds
Set creation took 0.021885 seconds
--
List querying took 0.021288 seconds
Set querying took 0.000179 seconds
>>> speed_test(10000, 1000)
List creation took 0.020332 seconds
Set creation took 0.021198 seconds
--
List querying took 0.213577 seconds
Set querying took 0.001833 seconds
>>> speed_test(100000, 1000)
List creation took 0.186876 seconds
Set creation took 0.220910 seconds
--
List querying took 2.148366 seconds
Set querying took 0.001881 seconds
- we can better understand these by generating points as we increase the size of the set/list and then plotting them
>>> speed_data(5000, 10000, 100000, 5000)
size list set
10000 0.237790 0.001881
15000 0.358325 0.001999
20000 0.469743 0.001956
25000 0.602107 0.001916
30000 0.687776 0.001889
35000 0.824027 0.001903
40000 0.921235 0.001952
45000 1.009843 0.001912
50000 1.156059 0.001927
55000 1.386080 0.001913
60000 1.566058 0.001984
65000 1.722870 0.001936
70000 2.025138 0.001966
75000 2.363384 0.001962
80000 2.619580 0.002030
85000 2.897005 0.002054
90000 2.975576 0.001946
95000 3.418256 0.002082
- we can copy and paste this in Excel and plot it
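An experiment of this kind could be sketched as follows. The function names, the range of the random values, and building the set from the list are all assumptions made for this example; lists_vs_sets.py may organize things differently, and the exact timings will vary from machine to machine.

```python
import random
import time


def generate_list(size, max_value):
    """Generate size random integers between 0 and max_value."""
    return [random.randint(0, max_value) for _ in range(size)]


def query_data(data, num_queries, max_value):
    """Return how many seconds num_queries random membership checks take."""
    start = time.perf_counter()
    for _ in range(num_queries):
        hit = random.randint(0, max_value) in data  # the operation being timed
    return time.perf_counter() - start


def speed_test_sketch(size, num_queries):
    """Time the same number of queries against a list and a set of the same data."""
    max_value = size * 10
    as_list = generate_list(size, max_value)
    as_set = set(as_list)  # same values, stored as a set
    list_time = query_data(as_list, num_queries, max_value)
    set_time = query_data(as_set, num_queries, max_value)
    return list_time, set_time


list_time, set_time = speed_test_sketch(10000, 100)
print("List querying took %f seconds" % list_time)
print("Set querying took %f seconds" % set_time)
```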
when to use a set vs. a list?
- lists have an ordering and allow duplicates
- if you need indexing, use a list
- sets are faster for asking membership
- if you don't care about the order, use a set!
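The two structures can also be used together. The sketch below removes duplicates while keeping the original order, using a list for the ordering and a set for fast membership checks; unique_in_order is a hypothetical helper for this example.

```python
def unique_in_order(items):
    """Return the items with duplicates removed, keeping first-seen order."""
    seen = set()   # fast membership checks
    result = []    # remembers the order
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result


print(unique_in_order([3, 1, 3, 2, 1]))
```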