CS51A - Spring 2019 - Class 12

Example code in this lecture

   file_basics.py
   word-stats.py

Lecture notes

  • admin
       - Midterm 1
          - Actual score: (score + 3)/23
          - Q1: 13.25 (71%)
          - Q2: 16 (83%)
          - Q3: 18.375 (93%)
          - Ave: 82%

  • files
       - what is a file?
          - a chunk of data stored on the hard disk
       - why do we need files?
          - hard-drives persist state regardless of whether the power is on or not
          - when a program is running, all the data it is generating/processing is in main memory (e.g. RAM)
             - main memory is faster, but doesn't persist when the power goes off

  • reading files
       - to read a file in Python we first need to open it
       
          - If we just want to hard-code the name and the name of the file is "some_file_name" then:

             file = open("some_file_name", "r")

          - or, if the name of the file is in a variable, then:

             name_of_file = "some_file_name"
             file = open(name_of_file, "r")

          - open is another function that has two parameters
          - the first parameter is a *string* identifying the filename
             - be careful about the path/directory. Python looks for the file in the current working directory (usually the same directory as the program's .py file when you run it from your editor) unless you tell it to look elsewhere
          - the second parameter is another string telling Python what you want to do with the file
             - "r" stands for "read", that is, we're going to read some data from the file
          - open returns a 'file' object that we can use later on for reading purposes
             - above, I've saved that in a variable called "file", but I could have called it anything else

       - once we have a file open, we can read a line at a time from the file using a for loop:

          for <variable> in <file_variable>:
             # do something

          - for each line in the file, the loop will get run
          - each time the variable will get assigned to the next line in the file
             - the line will be of type string
             - the line will also have an end-of-line character at the end of it, which you'll often want to get rid of (the string's strip() method is often good for this)
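
       - A minimal sketch of opening a file and looping over its lines. The file name and its contents are made up for illustration, and the setup step that writes the file (mode "w") is an assumption that isn't covered in these notes; it's only there so the example runs on its own:

```python
import os
import tempfile

# Setup only: create a small throwaway file so the example runs anywhere.
# (Writing with "w" is not covered in these notes.)
path = os.path.join(tempfile.mkdtemp(), "some_file_name")
setup = open(path, "w")
setup.write("first line\nsecond line\n")
setup.close()

file = open(path, "r")           # "r" means we intend to read
lines = []
for line in file:                # the loop body runs once per line in the file
    lines.append(line.strip())   # strip() removes the trailing end-of-line character
file.close()

print(lines)
```

          - each entry in lines is a string with the end-of-line character removed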

  • Look at line_count function in file_basics.py code
       - This is a common pattern for reading from files:

          1. open the file
             file = open(filename, "r")

          2. iterate through the file a line at a time

             for line in file:
                ...

             What you want to do as you read the file is the "..."

          3. close the file
             file.close()

       - In this case, we're just incrementing the counter, line_count, each time through the loop. The result is a count of the number of lines in the file.

       - For example, I've put some text in a file called basic.txt:
          This is a file
          It has some text in it
          It's not very EXCITING

       - If I run the function on the file:
          >>> line_count("basic.txt")
          3
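
       - The three-step pattern above can be sketched as follows. This is not the code from file_basics.py; line_count_sketch is a hypothetical stand-in, and the temporary copy of the example file is only there so the sketch runs on its own:

```python
import os
import tempfile

def line_count_sketch(filename):
    """Return the number of lines in the file (hypothetical stand-in for line_count)."""
    file = open(filename, "r")   # 1. open the file
    count = 0
    for line in file:            # 2. iterate through the file a line at a time
        count += 1               #    here the "..." is just counting
    file.close()                 # 3. close the file
    return count

# Setup only: recreate the three-line example file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "basic.txt")
setup = open(path, "w")
setup.write("This is a file\nIt has some text in it\nIt's not very EXCITING")
setup.close()

print(line_count_sketch(path))   # 3
```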

  • Look at print_file_almost function in file_basics.py code
       - Again, very similar structure
       - Only difference, in this case we're printing out what each line is.
       - If we run it on our basic text file:
          >>> print_file_almost("basic.txt")
          This is a file
             
          It has some text in it

          It's not very EXCITING

       - Anything funny about this?
          - there are extra blank lines between the output!

       - To try and understand this, let's add a debugging statement, specifically, add "print(len(line))" in the for loop and run again:

          This is a file

          15
          It has some text in it

          23
          It's not very EXCITING
          22
       
          - If you count the characters, there's one extra!
          
       - what's the problem?
          - when you read a line from the file, you also get the end of line character
          - what's really in this file is:

          This is a file\nIt has some text in it\nIt's not very EXCITING

       - to fix this, we want to "strip" (i.e. remove) the end of line character: look at the print_file function in file_basics.py code
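
       - A small sketch of what strip() buys us, without needing a file on disk. As an assumption for illustration, the file contents are held in a string and splitlines(True) is used to mimic the lines (with their end of line characters) that the for loop over a file would hand us:

```python
# The file contents as one string, as described above.
contents = "This is a file\nIt has some text in it\nIt's not very EXCITING"

# splitlines(True) keeps the end-of-line characters on each piece.
raw_lines = contents.splitlines(True)

lengths_raw = [len(line) for line in raw_lines]
lengths_stripped = [len(line.strip()) for line in raw_lines]

print(lengths_raw)        # [15, 23, 22]
print(lengths_stripped)   # [14, 22, 22]

for line in raw_lines:
    print(line.strip())   # no extra blank lines in the output
```

          - note that strip() removes whitespace from both ends of the string, so leading spaces on a line would be removed too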

  • An aside: split()
       - split is a method called on a string that splits up the string
          >>> "this is a sentence with words".split()
          ['this', 'is', 'a', 'sentence', 'with', 'words']
          >>> s = "this is a sentence with words"
          >>> s.split()
          ['this', 'is', 'a', 'sentence', 'with', 'words']

          - splits based on one or more whitespace characters (spaces, tabs, end of line characters)
       - You can also optionally specify what string you want to split on instead of whitespace
          >>> s.split("s")
          ['thi', ' i', ' a ', 'entence with word', '']
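
       - One difference worth seeing side by side: with no argument, a run of whitespace counts as a single separator, but with an explicit separator every occurrence counts. A short sketch (the example string is made up):

```python
s = "two  spaces\tand a tab\n"

whitespace_split = s.split()     # runs of spaces, tabs, end of lines act as one separator
space_split = s.split(" ")       # every single space is its own separator

print(whitespace_split)   # ['two', 'spaces', 'and', 'a', 'tab']
print(space_split)        # ['two', '', 'spaces\tand', 'a', 'tab\n']
```

          - this is why plain split() is usually what we want when breaking a line from a file into words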
          

  • What does the file_word_count function do in file_basics.py code?
       - similar to line counting
       - instead of adding 1 to the counter each time through the loop, we add "len(words)"
          - words is just a list of the words in the line (assuming words are separated by whitespace)
       
          >>> file_word_count("basic.txt")
          14

       - I've put together a file of Wikipedia sentences. I can also run my functions on that:
          >>> line_count("wikipedia.txt")
          3856679
          >>> file_word_count("wikipedia.txt")
          97912818

          The file contains almost 4M sentences (I've put one sentence per line in the file) and 98M words!
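
       - A sketch of the word-counting idea described above. file_word_count_sketch is a hypothetical stand-in, not the code from file_basics.py, and the temporary copy of the example file is setup so the sketch runs on its own:

```python
import os
import tempfile

def file_word_count_sketch(filename):
    """Return the total number of whitespace-separated words in the file."""
    file = open(filename, "r")
    count = 0
    for line in file:
        words = line.split()   # list of the words on this line
        count += len(words)    # add the number of words, not 1
    file.close()
    return count

# Setup only: recreate the three-line example file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "basic.txt")
setup = open(path, "w")
setup.write("This is a file\nIt has some text in it\nIt's not very EXCITING")
setup.close()

print(file_word_count_sketch(path))   # 14
```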

  • What does the capitalized_count function do in file_basics.py code?
       - Relies on the "isupper" method:
          >>> "banana".isupper()
          False
          >>> "Banana".isupper()
          False
          >>> "BANANA".isupper()
          True

          - Returns True if every letter in the string is uppercase (and the string contains at least one letter)
       - what does "word[0].isupper()" ask then, assuming word is a string?
          - checks to see if the word starts with an uppercase letter

          >>> "Banana"[0]
          'B'
          >>> "Banana"[0].isupper()
          True
          
       - The whole function:
          - iterates through the words a word at a time
          - checks to see if the word is capitalized (well, begins with a capital letter). If so, increments the count
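
       - A sketch of this kind of function. capitalized_count_sketch is a hypothetical stand-in, and it assumes words is a list of non-empty strings (which is what split() with no argument produces); word[0] would raise an error on an empty string:

```python
def capitalized_count_sketch(words):
    """Return how many words in the list begin with an uppercase letter."""
    count = 0
    for word in words:
        if word[0].isupper():   # only looks at the first character
            count += 1
    return count

example = "It's not very EXCITING".split()
print(capitalized_count_sketch(example))   # 2
```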

  • What does the file_capitalized_count function do in file_basics.py code?
       - Very similar to file_word_count
       - only difference is that instead of adding len(words), we add capitalized_count(words), i.e. the number of capitalized words in the line
       - Counts the total number of capitalized words in the file
          >>> file_capitalized_count("basic.txt")
          4
          >>> file_capitalized_count("wikipedia.txt")
          16248184
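
       - Putting the two pieces together, a sketch under the same assumptions as before: both function names are hypothetical stand-ins rather than the code from file_basics.py, and the temporary file is setup only:

```python
import os
import tempfile

def capitalized_count_sketch(words):
    """Return how many words in the list begin with an uppercase letter."""
    count = 0
    for word in words:
        if word[0].isupper():
            count += 1
    return count

def file_capitalized_count_sketch(filename):
    """Return the number of capitalized words in the whole file."""
    file = open(filename, "r")
    count = 0
    for line in file:
        words = line.split()
        count += capitalized_count_sketch(words)   # per-line count, summed up
    file.close()
    return count

# Setup only: recreate the three-line example file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "basic.txt")
setup = open(path, "w")
setup.write("This is a file\nIt has some text in it\nIt's not very EXCITING")
setup.close()

print(file_capitalized_count_sketch(path))   # 4
```

          - the four capitalized words are "This", "It", "It's" and "EXCITING"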

  • look at file_stats in word-stats.py code
       - It iterates over each item in the file and keeps track of:
          - longest string found
          - shortest string found
          - total length of the strings iterated over
          - the total number of strings/items

       - how does it keep track of the longest?
          - starts with ""
          - compares every word to the longest so far
          - if longer, updates longest

       - what does 'shortest == "" or' do? Why don't we have it for the longest condition?
          - for longest, we started with the shortest possible string, so any string will be longer
          - hard to start with the longest possible string
          - instead we add a special case for the first time through the loop
             - could have initialized shortest to be a really long string, but this is a more robust solution
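
       - A sketch of the longest/shortest bookkeeping described above. file_stats_sketch is a hypothetical stand-in that returns its results rather than printing them, the four-word file is made up, and it assumes one word per line and a non-empty file (an empty file would cause a division by zero):

```python
import os
import tempfile

def file_stats_sketch(filename):
    """Return (longest, shortest, average length) for a one-word-per-line file."""
    longest = ""     # shortest possible string, so the first word replaces it
    shortest = ""    # no "longest possible" string exists, so "" means "none yet"
    total_length = 0
    count = 0

    file = open(filename, "r")
    for line in file:
        word = line.strip()
        if len(word) > len(longest):
            longest = word
        # special case: the first word always becomes the shortest so far
        if shortest == "" or len(word) < len(shortest):
            shortest = word
        total_length += len(word)
        count += 1
    file.close()

    return longest, shortest, total_length / count

# Setup only: a tiny made-up word list, one word per line.
path = os.path.join(tempfile.mkdtemp(), "words.txt")
setup = open(path, "w")
setup.write("apple\nHz\nbanana\nkiwi\n")
setup.close()

print(file_stats_sketch(path))   # ('banana', 'Hz', 4.25)
```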

       - I have a file called "english.txt" which contains a list of ~47K English words. I can use this to understand some basic stats about English:
          - again, the file called "english.txt" needs to be in the same directory as the .py file

             >>> file_stats("english.txt")
             Number of words: 47158
             Longest word: antidisestablishmentarianism
             Shortest word: Hz
             Avg. word length: 8.37891768099

  • what does this tell us about English? Average word length is about 8.4? Does that sound right?
       - seems long!
       - the problem is that it doesn't take into account word frequency. This is just a dictionary of words
       - How might we measure the actual average word length in language use?
          - try and find a corpus/sample of dialogue
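
       - To see why frequency matters, here is a small sketch comparing the two averages on a made-up sentence (the sentence is an assumption for illustration, not real corpus data). Counting every occurrence weights short, common words like "a" more heavily than counting each distinct word once, as a dictionary does:

```python
text = "a dog and a cat saw a remarkable elephant"
tokens = text.split()    # every occurrence, as in actual language use
unique = set(tokens)     # each distinct word once, as in a dictionary

token_average = sum(len(word) for word in tokens) / len(tokens)
unique_average = sum(len(word) for word in unique) / len(unique)

print(len(tokens), len(unique))   # 9 7
print(token_average)              # 33 / 9, about 3.67
print(unique_average)             # 31 / 7, about 4.43
```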