CS51A - Fall 2019

CS51A - Fall 2019 - Class 23

Example code in this lecture

   url_basics.py
   url_extractor.py
   extractor_inclass.py

Lecture notes

web pages
   - what is a web page or more specifically what's in a web page?
   - just a text file with a list of text, formatting information, commands, etc.
   - generally made up from three things:
      1) html (hypertext markup language): this is the main backbone of the page
      2) css: contains style and formatting information
      3) javascript: for handling dynamic content and other non-static functionalities
   - this text is then parsed by the web browser to display the content
   - you can view the html source of a web page from your browser
   - in Safari: View->View Source
   - in Firefox: View->Page Source
   - in Chrome: View->Developer->View Source
   - html content
   - html consists of tags (a tag starts with a '<' and ends with a '>')
   - generally tags come in pairs, with an opening tag and closing tag, e.g. <html> ... </html>
   - lots of documentation online for html
   - Four views of web pages:
      1) String of HTML bytes
      2) Tree structured document (e.g. tree of tags, list of lists of lists of...)
      3) Dynamic document/object graph
      4) Pixels
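
To make the first two views concrete, here is a minimal sketch that takes a string of HTML (view 1) and lists its opening and closing tags in order, a first step toward the tree in view 2. It uses HTMLParser from Python's standard html.parser module. The TagLister class, the list_tags function, and the sample HTML string are made up for illustration and are not part of the lecture code.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Records every opening and closing tag in the order it appears."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # called once for each opening tag, e.g. <p>
        self.tags.append("<" + tag + ">")

    def handle_endtag(self, tag):
        # called once for each closing tag, e.g. </p>
        self.tags.append("</" + tag + ">")


def list_tags(html_text):
    """Return the list of tags found in html_text (hypothetical helper)."""
    parser = TagLister()
    parser.feed(html_text)
    parser.close()
    return parser.tags
```

For example, list_tags("<html><body><p>Hi</p></body></html>") returns the six tags in order, with each opening tag matched by a later closing tag.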

If we look at the course webpage (http://cs.pomona.edu/classes/cs51a/) we can see the html that generates it

reading from web pages using urllib.request
   - look at url_basics.py code : what does the print_data function do?
   - looks very similar to other functions we've seen before for reading data
   - key difference: we're reading from a webpage!

   - to read from a webpage, we need to open a connection to it (like opening a file)
   - there is a module, urllib.request, that supports various web functionality
      - the main function we'll use is urlopen

      from urllib.request import urlopen

   - once you have a connection open, you can read it a line at a time, like from a file, etc.

   - if we run this on the course webpage we see the following output:
   >>> print_data("http://cs.pomona.edu/classes/cs51a/")
   b'...'

   - which mirrors roughly the same text we saw through our browser

   - anything different?
      - b!
   - these aren't actually strings. We can check the type by adding an extra print statement
      print(type(line))

   - if we run again with the type information printed out we see:
   <class 'bytes'>

   - bytes is another class that represents raw data
   - webpages can contain a wide range of characters (e.g., Chinese characters)
   - we need to know how to interpret the raw data to turn it into characters
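
The steps above can be sketched as follows. This is a guess at what a function like print_data might look like, not the actual contents of url_basics.py. The helper name get_raw_lines is an assumption added here so that reading and printing are kept separate.

```python
from urllib.request import urlopen


def get_raw_lines(url):
    """Open a connection to url and return its lines as a list of bytes objects."""
    lines = []
    # urlopen returns a file-like object, so we can loop over it line by line
    with urlopen(url) as page:
        for line in page:
            lines.append(line)
    return lines


def print_data(url):
    """Print each raw line of the page along with its type."""
    for line in get_raw_lines(url):
        print(line)        # printed with a b prefix because it is bytes
        print(type(line))  # <class 'bytes'>
```

Calling print_data on the course webpage URL needs a network connection. Every line it prints is a bytes object, not a str.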

A best guess: look at the print_url_data function from url_basics.py code
   - often web pages will have as metadata the character encoding to use
   - for our purposes, we'll just make a best guess at a common encoding scheme, UTF-8, which handles a large share of web pages (another common one is ISO-8859-1)
   - the bytes class has a 'decode' method that will turn the bytes into a string
   - if we run print_url_data, we'll see that we get the same output, but now as strings:
   >>> print_url_data("http://cs.pomona.edu/classes/cs51a")
   '...'
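
A minimal sketch of the best-guess decoding, assuming we try UTF-8 first and fall back to ISO-8859-1 if the bytes are not valid UTF-8. The function name decode_line and the fallback step are assumptions for illustration. The lecture's print_url_data may simply call decode with UTF-8.

```python
def decode_line(raw, encoding="utf-8"):
    """Turn a bytes object into a str, guessing the encoding (hypothetical helper)."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        # ISO-8859-1 maps every possible byte to a character, so this
        # fallback never raises, though the guess may still be wrong
        return raw.decode("iso-8859-1")
```

The same character can be stored as different bytes depending on the encoding: decode_line(b"caf\xc3\xa9") and decode_line(b"caf\xe9") both give the string "café", the first through UTF-8 and the second through the fallback.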

look at url_extractor.py code
   - what does the get_note_urls function do?
   - how is that different from the extractor_inclass.py code?

   - get_note_urls opens up the course web page
   - reads a line at a time
      - checks each line to see if it contains any lecture notes
      - if so, keeps track of it in a list

   - str.find(some_string):
      - returns the index in str where some_string first occurs, or -1 if it doesn't occur
      - starts searching from the beginning of the string
   - str.find(some_string, start_index)
      - rather than starting at the beginning, start searching at start_index

      >>> "banana".find("ana")
      1
      >>> "banana".find("ana",2)
      3

   - what does "begin_index = line.find(search_line)" do?
      - finds where the lecture string starts
   - what does "end_index = line.find('"', begin_index)" do?
      - searching for the end of the link
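
The two find calls can be combined into a small extraction routine. This is a sketch under assumptions, not the code from url_extractor.py: the names extract_link and collect_links, the sample line, and the search text "lectures/" are all made up for illustration, and the lines are assumed to be already decoded strings.

```python
def extract_link(line, search_text):
    """Return the text from search_text up to the next double quote, or None."""
    begin_index = line.find(search_text)
    if begin_index == -1:
        return None  # search_text does not occur in this line
    # start looking for the closing quote at begin_index, not at 0
    end_index = line.find('"', begin_index)
    if end_index == -1:
        return None  # no closing quote, so we can't tell where the link ends
    return line[begin_index:end_index]


def collect_links(lines, search_text):
    """Check each line and keep track of the links found in a list."""
    links = []
    for line in lines:
        link = extract_link(line, search_text)
        if link is not None:
            links.append(link)
    return links
```

With the hypothetical line '<a href="lectures/lecture23.html">notes</a>' and search text "lectures/", extract_link returns "lectures/lecture23.html". Passing begin_index as the second argument to find matters because the quote that opens the href comes before the link text.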

   - what does write_list_to_file do?
   - opens a file, this time with "w" instead of "r"
      - "w" stands for write
      - if the file doesn't exist it will create it
      - if the file does exist, it will erase the current contents and overwrite it (be careful!)
   - we can also write to a file without overwriting the contents, but instead appending to the end
      - "a" stands for append
   - just like with reading from a file, we get a file object from open
   - the "write" method writes a string to the file (anything that isn't already a string must be converted first, e.g. with str)
   - why do I have the "\n" appended on to the end of item?
      - write does NOT put a line return (newline) at the end of what it writes
      - if you want one, you need to put it in yourself
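
A minimal sketch of a function along the lines of write_list_to_file, assuming it takes a list and a filename. The parameter names and the optional mode argument are assumptions for illustration, not the signature used in url_extractor.py.

```python
def write_list_to_file(items, filename, mode="w"):
    """Write each item on its own line; mode "w" overwrites, "a" appends."""
    with open(filename, mode) as out:
        for item in items:
            # write expects a string and does not add a newline itself
            out.write(str(item) + "\n")
```

Calling it twice with the default mode leaves only the second list in the file. Passing "a" as the mode on the second call keeps the first list and adds the new items after it.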
   - what does write_lectures function do?
   - gets the lecture urls from the course web page
   - writes them to an output file