CS51A - Spring 2022 - Class 22

Example code in this lecture

   url_basics.py
   url_extractor.py
   url_extractor_improved.py
   url_basics_ssl.py

Lecture notes

  • administrative
       - nim!

  • web pages
       - what is a web page or more specifically what's in a web page?
           - just a text file containing text, formatting information, commands, etc.
           - generally made up of three things:
             1) html (hypertext markup language): this is the main backbone of the page
             2) css: contains style and formatting information
              3) javascript: for handling dynamic content and other non-static functionality
          - this text is then parsed by the web browser to display the content
       - you can view the html source of a web page from your browser
          - in Safari: View->View Source
          - in Firefox: View->Page Source
          - in Chrome: View->Developer->View Source
       - html content
          - html consists of tags (a tag starts with a '<' and ends with a '>')
          - generally tags come in pairs, with an opening tag and closing tag, e.g. <html> ... </html>
          - lots of documentation online for html

  • if we look at the course webpage (http://www.cs.pomona.edu/classes/cs51a/) we can see the html that generates it

  • reading from web pages using urllib.request
       - look at url_basics.py code : what does the print_data function do?
          - looks very similar to other functions we've seen before for reading data
          - key difference: we're reading from a webpage!
       
       - to read from a webpage, we need to open a connection to it (like opening a file)
           - there is a module urllib.request that supports various web functionality
             - the main function we'll use is urlopen

             from urllib.request import urlopen

          - once you have a connection open, you can read it a line at a time, like from a file, etc.
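           - putting these pieces together, a minimal sketch of a print_data-style function
             might look like this (the actual url_basics.py code may differ slightly):

              from urllib.request import urlopen

              def print_data(url):
                  """Open a connection to url and print it a line at a time."""
                  connection = urlopen(url)

                  # the connection can be looped over a line at a time, just like a file
                  for line in connection:
                      print(line)

                  connection.close()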

       - if we run this on the course webpage we see the following output:
          >>> print_data("http://www.cs.pomona.edu/classes/cs51a/")
          b'...'

           - which roughly mirrors the text we saw through our browser

          - anything different?
             - b!
          - these aren't actually strings. We can check the type by adding an extra print statement
             print(type(line))

       - if we run again with the type information printed out we see:
          <class 'bytes'>

          - bytes is another class that represents raw data
          - webpages can contain a wide range of characters (e.g., Chinese characters)
          - we need to know how to interpret the raw data to turn it into characters

  • A best guess: look at the print_url_data function from url_basics.py code
       - often web pages will have as metadata the character encoding to use
        - for our purposes, we'll just make a best guess at a common encoding scheme, ISO-8859-1, which handles a fair number of web pages
       - the byte class has a 'decode' method that will turn the bytes into a string
       - if we run print_url_data, we'll see that we get the same output, but now as strings:
           >>> print_url_data("http://www.cs.pomona.edu/classes/cs51a/")
          '...'
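
        - a rough sketch of what a print_url_data-style function might look like (assuming
          the ISO-8859-1 best guess; the actual code may differ slightly):

           from urllib.request import urlopen

           def print_url_data(url):
               """Open a connection to url and print each line decoded as a string."""
               connection = urlopen(url)

               for line in connection:
                   # decode turns the raw bytes into a string using our best-guess encoding
                   print(line.decode("ISO-8859-1"))

               connection.close()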

  • look at url_extractor.py code
       - what does the get_note_urls function do?
          - opens up the course web page
          - reads a line at a time
              - checks each line to see if it contains a lecture note link
             - if so, keeps track of it in a list

          - str.find(some_string):
              - returns the index in str where some_string first occurs, or -1 if it doesn't occur
             - starts searching from the beginning of the string
          - str.find(some_string, start_index)
             - rather than starting at the beginning, start searching at start_index

             >>> "banana".find("ana")
             1
             >>> "banana".find("ana",2)
             3

          - what does "begin_index = line.find(search_line)" do?
              - finds where the lecture link string starts
          - what does "end_index = line.find('"', begin_index)" do?
              - searches for the end of the link (the closing quote)
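
        - a sketch of the general get_note_urls approach (assuming the search string is
          "lectures/"; the actual code may differ slightly):

           from urllib.request import urlopen

           def get_note_urls(url):
               """Return a list of lecture note links found on the page, one per line."""
               search_line = "lectures/"
               urls = []

               connection = urlopen(url)

               for line in connection:
                   line = line.decode("ISO-8859-1")
                   begin_index = line.find(search_line)

                   if begin_index != -1:
                       # the link ends at the closing quote of the href
                       end_index = line.find('"', begin_index)
                       urls.append(line[begin_index:end_index])

               connection.close()
               return urls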

       - what does write_list_to_file do?
          - opens a file, this time with "w" instead of "r"
             - "w" stands for write
             - if the file doesn't exist it will create it
              - if the file does exist, it will erase the current contents and overwrite them (be careful!)
          - we can also write to a file without overwriting the contents, but instead appending to the end
             - "a" stands for append
           - just like with reading from a file, we get a file object from open
           - the "write" method writes a string to the file
           - why is a "\n" appended onto the end of each item?
             - write does NOT put a line return after the end of it
             - if you want one, you need to put it in yourself
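        - a sketch of what write_list_to_file might look like (the actual code may differ slightly):

           def write_list_to_file(some_list, filename):
               """Write each item of some_list to filename, one item per line."""
               out_file = open(filename, "w")

               for item in some_list:
                   # write does not add a line return, so append one ourselves
                   out_file.write(str(item) + "\n")

               out_file.close()
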
       - what does write_lectures function do?
          - gets the lecture urls from the course web page
           - writes them to an output file
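           - a sketch of how write_lectures might combine the two functions above (assuming
             the sketched versions; the actual code may differ):

              def write_lectures(url, filename):
                  """Get the lecture note urls from url and write them to filename."""
                  urls = get_note_urls(url)
                  write_list_to_file(urls, filename)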

  • Revisiting url_extractor.py code
       - look at the webpage
       - look at the output: do we get *all* of the lecture note links?
           - No! We miss the .pptx links. Why?
           - the code assumes at most one lecture link per line, but some lines contain more than one

       - how do we fix this?
          - rather than searching per line, treat the entire webpage as a long string
           - search for the first occurrence of the lecture link string
           - extract it
           - then search again, starting at the end of that occurrence

  • look at the get_note_urls_improved function in url_extractor_improved.py code
       - read
          - rather than reading a line at a time, we can read the entire contents all at once
          - this also works on files
       - we then decode this so that page_text has all of the webpage text as a string

       - what does "begin_index = page_text.find(search_line)" do?
          - searches for the index of the first occurrence of "lectures/"
           - will the code enter the while loop?
              - only if it finds an occurrence (i.e., begin_index is not -1)

       - what does "end_index = page_text.find('"', begin_index)" do?
          - searches for the end of the link
       - we can then extract the url
       - what does "begin_index = page_text.find(search_line, end_index)" do?
           - searches *again*, but this time starting at end_index, the end of the last link found

       - if we run the improved version, we now get the pptx links too
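
        - putting the pieces together, a sketch of a get_note_urls_improved-style function
          (again assuming the "lectures/" search string; the actual code may differ slightly):

           from urllib.request import urlopen

           def get_note_urls_improved(url):
               """Return all lecture note links, even when several appear on the same line."""
               search_line = "lectures/"
               urls = []

               connection = urlopen(url)
               # read the entire page at once and decode it into one long string
               page_text = connection.read().decode("ISO-8859-1")
               connection.close()

               begin_index = page_text.find(search_line)

               while begin_index != -1:
                   end_index = page_text.find('"', begin_index)
                   urls.append(page_text[begin_index:end_index])
                   # search again, starting at the end of the link we just found
                   begin_index = page_text.find(search_line, end_index)

               return urls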

  • how could we change our code to just extract the name of the file (e.g., lecture1-intro.html)?
       - look at get_note_files_only function in url_extractor_improved.py code
       
        - key change: we want to skip the "lectures/" part when extracting the link text
           - rather than starting the slice at begin_index, we want to skip forward by the length of "lectures/" when extracting (see the snippet below)
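           - for example, in the sketch above the extraction line might become:

              # skip past "lectures/" so we keep just the file name
              urls.append(page_text[begin_index + len(search_line):end_index])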

  • what's the difference between http and https
       - the 's' stands for secure
       - when you communicate with an https website:
          - you get some reassurance that you're actually communicating with the website (rather than someone pretending to be the website)
          - your communications are encrypted so it's difficult to see what information you're sending back and forth
       - there is a bit of overhead in setting up this communication properly
           - the right way is to install the SSL certificates that python needs to verify websites
        - for simplicity, however, you can also tell python to skip the SSL certificate check and connect to an https site without verifying it
       - look at url_basics_ssl.py code
           - urlopen has an optional context parameter; passing it an SSL context that skips verification lets you connect to an https webpage without checking SSL certificates
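           - a sketch of one way to do this (the function name print_https_data is just for
             illustration; url_basics_ssl.py may do it differently):

              import ssl
              from urllib.request import urlopen

              def print_https_data(url):
                  """Print an https page a line at a time, skipping certificate checks."""
                  # build a context that does not verify the server's SSL certificate
                  # (convenient for class examples, not a good idea in real applications)
                  context = ssl.create_default_context()
                  context.check_hostname = False
                  context.verify_mode = ssl.CERT_NONE

                  connection = urlopen(url, context=context)

                  for line in connection:
                      print(line.decode("ISO-8859-1"))

                  connection.close()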