CS51A - Fall 2019 - Class 9

Example code in this lecture

   file_basics.py
   word-stats.py
   dictionaries.py

Lecture notes

  • admin
       - Mentor hours

  • files
       - what is a file?
          - a chunk of data stored on the hard disk
       - why do we need files?
          - hard drives persist data whether or not the power is on
          - when a program is running, all the data it is generating/processing is in main memory (e.g. RAM)
             - main memory is faster, but doesn't persist when the power goes off

  • reading files
       - to read a file in Python we first need to open it
       
          - If we just want to hard-code the name and the name of the file is "some_file_name" then:

             file = open("some_file_name", "r")

          - or, if the name of the file is in a variable, then:

             name_of_file = "some_file_name"
             file = open(name_of_file, "r")

          - open is another function that has two parameters, both strings
          - the first parameter is a *string* identifying the filename
             - be careful about the path/directory. A bare filename is looked up relative to the current working directory, which is usually the directory of the program (.py file) when you run it from your editor, unless you tell Python to look elsewhere
          - the second parameter is another string telling Python what you want to do with the file
             - "r" stands for "read", that is, we're going to read some data from the file
          - open returns a 'file' object that we can use later on for reading purposes
             - above, I've saved that in a variable called "file", but I could have called it anything else

       - once we have a file open, we can read a line at a time from the file using a for loop:

          for <variable> in <file_variable>:
             # do something

          - for each line in the file, the loop will get run
          - each time the variable will get assigned to the next line in the file
             - the line will be of type string
             - the line will also have an end-of-line character at the end of it, which you'll often want to get rid of (the string's strip() method is often good for this)
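
       - A minimal sketch of the steps above, put together. The file name "sample.txt" and its contents are made up for illustration, and the file is written to a temporary directory only so that the example runs on its own:

```python
import os
import tempfile

# Hypothetical sample file, created here only so the sketch is runnable.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
out = open(path, "w")
out.write("first line\nsecond line\n")
out.close()

file = open(path, "r")              # "r" means we intend to read
lines = []
for line in file:                   # each line is a string ending in "\n"
    lines.append(line.strip())      # strip() removes the end-of-line character
file.close()

print(lines)  # ['first line', 'second line']
```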

  • Look at line_count function in file_basics.py code
       - This is a common pattern for reading from files:

          1. open the file
             file = open(filename, "r")

          2. iterate through the file a line at a time

             for line in file:
                ...

             What you want to do as you read the file is the "..."

          3. close the file
             file.close()

       - In this case, we're just incrementing the counter, line_count, each time through the loop. The result is a count of the number of lines in the file.

       - For example, I've put some text in a file called basic.txt:
          This is a file
          It has some text in it
          It's not very EXCITING

       - If I run the function on the file:
          >>> line_count("basic.txt")
          3
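
       - file_basics.py is not reproduced in these notes, so the following is only a sketch of what line_count might look like, following the three-step pattern above. The local counter is named count here to avoid confusion with the function name, and basic.txt is recreated in a temporary directory (an assumption made so the example is self-contained):

```python
import os
import tempfile


def line_count(filename):
    """Return the number of lines in the file called filename."""
    file = open(filename, "r")      # 1. open the file
    count = 0
    for line in file:               # 2. iterate a line at a time
        count += 1
    file.close()                    # 3. close the file
    return count


# Recreate the three-line example file so the sketch runs on its own.
path = os.path.join(tempfile.mkdtemp(), "basic.txt")
out = open(path, "w")
out.write("This is a file\nIt has some text in it\nIt's not very EXCITING")
out.close()

print(line_count(path))  # 3
```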

  • Look at print_file_almost function in file_basics.py code
       - Again, very similar structure
       - Only difference, in this case we're printing out what each line is.
       - If we run it on our basic text file:
          >>> print_file_almost("basic.txt")
          This is a file
             
          It has some text in it

          It's not very EXCITING

       - Anything funny about this?
          - there are extra blank lines between the output!

       - To try and understand this, let's add a debugging statement: add "print(len(line))" inside the for loop and run again:

          This is a file

          15
          It has some text in it

          23
          It's not very EXCITING
          22
       
          - If you count the characters, there's one extra!
          
       - what's the problem?
          - when you read a line from the file, you also get the end of line character
          - what's really in this file is:

          This is a file\nIt has some text in it\nIt's not very EXCITING

       - to fix this, we want to "strip" (i.e. remove) the end of line character: look at the print_file function in file_basics.py code
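
       - A sketch of what the fixed function might look like (an assumption, since file_basics.py is not shown here). The only change from print_file_almost is the call to strip() before printing. The example file is recreated in a temporary directory so the code runs on its own:

```python
import os
import tempfile


def print_file(filename):
    """Print each line of the file without the extra blank lines."""
    file = open(filename, "r")
    for line in file:
        # strip() removes the trailing "\n", so print() adds the only newline
        print(line.strip())
    file.close()


# Recreate the example file so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "basic.txt")
out = open(path, "w")
out.write("This is a file\nIt has some text in it\nIt's not very EXCITING")
out.close()

print_file(path)
```

         Note that strip() also removes leading whitespace. If indentation in the file matters, line.rstrip("\n") removes only the end of line character.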

  • An aside: split()
       - split is a method called on a string that splits up the string
          >>> "this is a sentence with words".split()
          ['this', 'is', 'a', 'sentence', 'with', 'words']
          >>> s = "this is a sentence with words"
          >>> s.split()
          ['this', 'is', 'a', 'sentence', 'with', 'words']

          - splits based on one or more whitespace characters (spaces, tabs, end of line characters)
       - You can also optionally specify what string you want to split on instead of whitespace
          >>> s.split("s")
          ['thi', ' i', ' a ', 'entence with word', '']
          

  • What does the file_word_count function do in file_basics.py code?
       - similar to line counting
       - instead of adding 1 to the counter each time through the loop, we add "len(words)"
          - words is just a list of the words in the line (assuming words are separated by whitespace)
       
          >>> file_word_count("basic.txt")
          14

       - I've put together a file of Wikipedia sentences. I can also run my functions on that:
          >>> line_count("wikipedia.txt")
          3856679
          >>> file_word_count("wikipedia.txt")
          97912818

          The file contains almost 4M sentences (I've put one sentence per line in the file) and 98M words!
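
       - A sketch of how file_word_count could be written, assuming it follows the description above (split each line on whitespace, add the number of words to a counter). The real file_basics.py may differ in its details:

```python
import os
import tempfile


def file_word_count(filename):
    """Return the number of whitespace-separated words in the file."""
    file = open(filename, "r")
    count = 0
    for line in file:
        words = line.split()        # list of the words on this line
        count += len(words)
    file.close()
    return count


# Recreate the example file so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "basic.txt")
out = open(path, "w")
out.write("This is a file\nIt has some text in it\nIt's not very EXCITING")
out.close()

print(file_word_count(path))  # 14
```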

  • What does the capitalized_count function do in file_basics.py code?
       - Relies on the "isupper" method:
          >>> "banana".isupper()
          False
          >>> "Banana".isupper()
          False
          >>> "BANANA".isupper()
          True

          - Returns True if every letter in the string is uppercase (and the string contains at least one letter)
       - what does "word[0].isupper()" ask then, assuming word is a string?
          - checks to see if the word starts with an uppercase letter

          >>> "Banana"[0]
          'B'
          >>> "Banana"[0].isupper()
          True
          
       - The whole function:
          - iterates through the words a word at a time
          - checks to see if the word is capitalized (well, begins with a capital letter). If so, increments the count
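
       - A sketch of the function as described, assuming it takes a list of words and is named capitalized_count (the name used where file_capitalized_count calls it):

```python
def capitalized_count(words):
    """Return how many strings in the list words start with an uppercase letter."""
    count = 0
    for word in words:
        # word[0] is the first character; this assumes word is not empty,
        # which holds for anything produced by split()
        if word[0].isupper():
            count += 1
    return count


print(capitalized_count(["This", "is", "a", "file"]))          # 1
print(capitalized_count("It's not very EXCITING".split()))     # 2
```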

  • What does the file_capitalized_count function do in file_basics.py code?
       - Very similar to file_word_count
       - only difference is that instead of adding len(words), we add capitalized_count(words), i.e. the number of capitalized words in the line
       - Counts the total number of capitalized words in the file
          >>> file_capitalized_count("basic.txt")
          4
          >>> file_capitalized_count("wikipedia.txt")
          16248184
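
       - A sketch under the same assumptions as before (the actual file_basics.py is not shown in these notes). The helper capitalized_count is repeated so that the example runs on its own:

```python
import os
import tempfile


def capitalized_count(words):
    """Return how many strings in words start with an uppercase letter."""
    count = 0
    for word in words:
        if word[0].isupper():
            count += 1
    return count


def file_capitalized_count(filename):
    """Return the total number of capitalized words in the file."""
    file = open(filename, "r")
    count = 0
    for line in file:
        words = line.split()
        count += capitalized_count(words)   # instead of len(words)
    file.close()
    return count


# Recreate the example file so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "basic.txt")
out = open(path, "w")
out.write("This is a file\nIt has some text in it\nIt's not very EXCITING")
out.close()

print(file_capitalized_count(path))  # 4
```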

  • look at file_stats in word-stats.py code
       - It iterates over each item in the file and keeps track of:
          - longest string found
          - shortest string found
          - total length of the strings iterated over
          - the total number of strings/items

       - how does it keep track of the longest?
          - starts with ""
          - compares every word to the longest so far
          - if longer, updates longest

       - what does 'shortest == "" or' do? Why don't we have it for the longest condition?
          - for longest, we started with the shortest possible string, so any string will be longer
          - hard to start with the longest possible string
          - instead we add a special case for the first time through the loop
             - could have initialized shortest to be a really long string, but this is a more robust solution

       - I have a file called "english.txt" which contains a list of ~47K English words. I can use this to understand some basic stats about English:
          - again, the file called "english.txt" needs to be in the same directory as the .py file

             >>> file_stats("english.txt")
             Number of words: 47158
             Longest word: antidisestablishmentarianism
             Shortest word: Hz
             Avg. word length: 8.37891768099
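
       - A sketch of how file_stats might be organized, based only on the description above (one word per line, the 'shortest == "" or' special case). The variable names and the small word list are assumptions made for illustration, and the sketch assumes the file has at least one line:

```python
import os
import tempfile


def file_stats(filename):
    """Print basic statistics about a file containing one word per line."""
    file = open(filename, "r")
    longest = ""        # any word is longer than the empty string
    shortest = ""       # placeholder until we see the first word
    total_length = 0
    count = 0
    for line in file:
        word = line.strip()
        if len(word) > len(longest):
            longest = word
        # special case: the first word always replaces the "" placeholder
        if shortest == "" or len(word) < len(shortest):
            shortest = word
        total_length += len(word)
        count += 1
    file.close()

    print("Number of words: " + str(count))
    print("Longest word: " + longest)
    print("Shortest word: " + shortest)
    print("Avg. word length: " + str(total_length / count))


# Hypothetical four-word file, created so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "words.txt")
out = open(path, "w")
out.write("cat\nbanana\nHz\ntree\n")
out.close()

file_stats(path)
```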

  • what does this tell us about English? The average word length is about 8.4. Does that sound right?
       - seems long!
       - the problem is that it doesn't take into account word frequency. This is just a dictionary of words
       - How might we measure actual word average length in language use?
          - try and find a corpus/sample of dialogue

  • write a function called read_numbers that takes a file of numbers (one per line) and generates a list consisting of the numbers in that file
       - look at read_numbers function in dictionaries.py code
          - if you're reading numbers, don't forget to turn them into ints using "int"

          >>> data = read_numbers('numbers.txt')
          >>> data
          [1, 2, 3, 2, 1, 1, 2, 6, 7, 8, 10, 1, 5, 5, 5, 3, 8, 6, 7, 6, 4, 1, 1, 2, 3, 1, 2, 3]
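
       - One possible way to write read_numbers, as a sketch (dictionaries.py is not reproduced here, so the details are an assumption). It assumes every line holds exactly one integer. The small numbers file is created in a temporary directory only so the example runs on its own:

```python
import os
import tempfile


def read_numbers(filename):
    """Return a list of the integers in the file, one per line."""
    file = open(filename, "r")
    numbers = []
    for line in file:
        # each line is a string like "3\n"; convert it to an int
        numbers.append(int(line.strip()))
    file.close()
    return numbers


# Hypothetical numbers file, created so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "numbers.txt")
out = open(path, "w")
out.write("1\n2\n3\n2\n")
out.close()

print(read_numbers(path))  # [1, 2, 3, 2]
```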

  • what if we wanted to find the most frequent value in this data?
       - how would you do it?
       - do it on paper: [1, 2, 3, 2, 3, 2, 1, 1, 5, 4, 4, 5]
          - how did you do it?
             - kept a tally for each number
             - each time you saw a new number, added it to your list with a count of 1
             - if it was something you'd seen already, add another tally/count
          - key idea, keeping track of two things:
             - a key, which is the thing you're looking up
             - a value, which is associated with each key
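
       - The paper tally can be sketched in code with a Python dictionary, which stores exactly this kind of key/value pairing. The function name most_frequent is made up for illustration and is not taken from the course code:

```python
def most_frequent(numbers):
    """Return (most frequent value, dictionary of counts) for a list."""
    counts = {}                         # key: a number, value: its tally
    for number in numbers:
        if number in counts:
            counts[number] += 1         # seen before: add another tally
        else:
            counts[number] = 1          # new number: start its count at 1

    best = None
    for number in counts:
        if best is None or counts[number] > counts[best]:
            best = number
    return best, counts


data = [1, 2, 3, 2, 3, 2, 1, 1, 5, 4, 4, 5]
best, counts = most_frequent(data)
print(counts)
print(best)
```

         In this data, 1 and 2 both occur three times. Because the comparison uses a strict ">", the sketch reports whichever of the tied values was seen first, here 1.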