CS51A - Fall 2019 - Class 9

Example code in this lecture

   file_basics.py
   word-stats.py
   dictionaries.py

Lecture notes

admin
- Mentor hours

files
   - what is a file?
      - a chunk of data stored on the hard disk
   - why do we need files?
      - hard drives persist state regardless of whether the power is on or not
      - when a program is running, all the data it is generating/processing is in main memory (e.g. RAM)
         - main memory is faster, but doesn't persist when the power goes off

reading files
   - to read a file in Python we first need to open it

      - If we just want to hard-code the name and the name of the file is "some_file_name" then:

         file = open("some_file_name", "r")

      - Or, if the name of the file is in a variable:

         name_of_file = "some_file_name"
         file = open(name_of_file, "r")

      - open is another function; here we call it with two arguments, both strings
      - the first parameter is a *string* identifying the filename
         - be careful about the path/directory. Python looks for a plain file name in the current working directory, which is typically the directory containing the program (.py file) when you run it from your editor, unless you tell it to look elsewhere
      - the second parameter is another string telling Python what you want to do with the file
         - "r" stands for "read", that is, we're going to read some data from the file
      - open returns a 'file' object that we can use later on for reading purposes
         - above, I've saved that in a variable called "file", but I could have called it anything else
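
      - If open can't find your file, one way to check where Python is looking is to print the current working directory. A small sketch; path_in_directory is a hypothetical helper for illustration, not part of the course code:

```python
import os

# Plain (relative) file names are resolved against the current working
# directory, which os.getcwd() reports.
print(os.getcwd())


def path_in_directory(directory, filename):
    # hypothetical helper: build the path to a file inside a given directory
    return os.path.join(directory, filename)
```

      - The string returned by path_in_directory can be passed to open as the first argument to look somewhere other than the current directory.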

   - once we have a file open, we can read a line at a time from the file using a for loop:

      for <variable> in <file_variable>:
         # do something

      - for each line in the file, the loop will get run
      - each time the variable will get assigned to the next line in the file
         - the line will be of type string
         - the line will also have an end of line character at the end of it, which you'll often want to get rid of (the string strip() method is often good for this)
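
      - Putting the pieces above together, a minimal sketch of the loop. The function name read_lines is made up for illustration and is not in file_basics.py:

```python
def read_lines(filename):
    """Return the lines of the file as a list of strings."""
    lines = []
    file = open(filename, "r")
    for line in file:
        # line is a string; strip() removes whitespace from both ends,
        # including the end of line character
        lines.append(line.strip())
    file.close()
    return lines
```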

Look at line_count function in file_basics.py code
   - This is a common pattern for reading from files:

      1. open the file
         file = open(filename, "r")

      2. iterate through the file a line at a time

         for line in file:
            ...

         What you want to do as you read the file is the "..."

      3. close the file
         file.close()

   - In this case, we're just incrementing the counter, line_count, each time through the loop. The result is a count of the number of lines in the file.

   - For example, I've put some text in a file called "basic.txt":
      This is a file
      It has some text in it
      It's not very EXCITING

   - If I run the function on the file:
      >>> line_count("basic.txt")
      3
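
   - A minimal sketch of what line_count might look like, following the three steps above. The actual code in file_basics.py may differ; the variable name count is an assumption:

```python
def line_count(filename):
    # 1. open the file
    file = open(filename, "r")

    # 2. iterate through the file a line at a time
    count = 0
    for line in file:
        count += 1

    # 3. close the file
    file.close()
    return count
```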

Look at print_file_almost function in file_basics.py code
   - Again, very similar structure
   - The only difference is that in this case we're printing out each line.
   - If we run it on our basic text file:
      >>> print_file_almost("basic.txt")
      This is a file

      It has some text in it

      It's not very EXCITING

   - Anything funny about this?
      - there are extra blank lines between the lines of output!

   - To try and understand this, let's add a debugging statement, specifically, add "print(len(line))" in the for loop and run again:

      This is a file

      15
      It has some text in it

      23
      It's not very EXCITING
      22

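      - If you count the characters, the first two lines each have one extra! (The last line doesn't, because the file doesn't end with an end of line character.)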

   - what's the problem?
      - when you read a line from the file, you also get the end of line character
      - what's really in this file is:

      This is a file\nIt has some text in it\nIt's not very EXCITING

   - to fix this, we want to "strip" (i.e. remove) the end of line character: look at the print_file function in file_basics.py code
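
   - A sketch of how print_file might do this (the version in file_basics.py may differ in its details):

```python
def print_file(filename):
    file = open(filename, "r")
    for line in file:
        # strip() removes the end of line character (and any other
        # whitespace at either end), so print adds the only newline
        print(line.strip())
    file.close()
```

   - If leading spaces in a line matter, line.rstrip("\n") removes only the end of line character.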

An aside: split()
   - split is a method called on a string that splits up the string
      >>> "this is a sentence with words".split()
      ['this', 'is', 'a', 'sentence', 'with', 'words']
      >>> s = "this is a sentence with words"
      >>> s.split()
      ['this', 'is', 'a', 'sentence', 'with', 'words']

      - splits based on one or more whitespace characters (spaces, tabs, end of line characters)
   - You can also optionally specify what string you want to split on instead of whitespace
      >>> s.split("s")
      ['thi', ' i', ' a ', 'entence with word', '']
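
   - One more illustration of the difference between the two forms, using made-up strings:

```python
line = "It has\tsome   text in it\n"

# With no argument, split() treats any run of whitespace as one separator
# and ignores whitespace at the ends, so the "\n" disappears too.
words = line.split()
print(words)   # ['It', 'has', 'some', 'text', 'in', 'it']

# With an explicit separator, every occurrence counts, so two adjacent
# spaces produce an empty string in the result.
pieces = "a  b".split(" ")
print(pieces)  # ['a', '', 'b']
```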

What does the file_word_count function do in file_basics.py code?
   - similar to line counting
   - instead of adding 1 to the counter each time through the loop, we add "len(words)"
      - words is just a list of the words in the line (assuming words are separated by whitespace)

      >>> file_word_count("basic.txt")
      14

   - I've put together a file of Wikipedia sentences. I can also run my functions on that:
      >>> line_count("wikipedia.txt")
      3856679
      >>> file_word_count("wikipedia.txt")
      97912818

      The file contains almost 4M sentences (I've put one sentence per line in the file) and 98M words!
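
   - A sketch of file_word_count consistent with the description above (the code in file_basics.py may differ):

```python
def file_word_count(filename):
    file = open(filename, "r")
    count = 0
    for line in file:
        # words are whatever is separated by whitespace on this line
        words = line.split()
        count += len(words)
    file.close()
    return count
```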

What does the capitalized_count function do in file_basics.py code?
   - Relies on the "isupper" method:
      >>> "banana".isupper()
      False
      >>> "Banana".isupper()
      False
      >>> "BANANA".isupper()
      True

      - Returns True if all the letters in the string are uppercase (and there is at least one letter)
   - what does "word[0].isupper()" ask then, assuming word is a string?
      - checks to see if the word starts with an uppercase letter

      >>> "Banana"[0]
      'B'
      >>> "Banana"[0].isupper()
      True

   - The whole function:
      - iterates through the words a word at a time
      - checks to see if the word is capitalized (well, begins with a capital letter). If so, increments the count
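
   - A sketch of the function as described. It assumes words is a list of non-empty strings, which is what split() produces; word[0] would fail on an empty string:

```python
def capitalized_count(words):
    count = 0
    for word in words:
        # only the first character matters here
        if word[0].isupper():
            count += 1
    return count
```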

What does the file_capitalized_count function do in file_basics.py code?
   - Very similar to file_word_count
   - only difference is that instead of adding len(words), we add capitalized_count(words), i.e. the number of capitalized words in the line
   - Counts the total number of capitalized words in the file
      >>> file_capitalized_count("basic.txt")
      4
      >>> file_capitalized_count("wikipedia.txt")
      16248184
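
   - A sketch of how the two functions might fit together. The helper is repeated here only so the example runs on its own; the actual code in file_basics.py may differ:

```python
def capitalized_count(words):
    # number of words in the list that begin with an uppercase letter
    count = 0
    for word in words:
        if word[0].isupper():
            count += 1
    return count


def file_capitalized_count(filename):
    file = open(filename, "r")
    count = 0
    for line in file:
        words = line.split()
        # add the number of capitalized words on this line
        count += capitalized_count(words)
    file.close()
    return count
```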

look at file_stats in word-stats.py code
   - It iterates over each item in the file and keeps track of:
      - longest string found
      - shortest string found
      - total length of the strings iterated over
      - the total number of strings/items

   - how does it keep track of the longest?
      - starts with ""
      - compares every word to the longest so far
      - if longer, updates longest

   - what does 'shortest == "" or' do? Why don't we have it for the longest condition?
      - for longest, we started with the shortest possible string, so any string will be longer
      - hard to start with the longest possible string
      - instead we add a special case for the first time through the loop
         - could have initialized shortest to be a really long string, but this is a more robust solution
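
   - A sketch of file_stats based on the description above. It assumes the file is non-empty and has one word per line with no blank lines; the code in word-stats.py may differ:

```python
def file_stats(filename):
    longest = ""
    shortest = ""
    total_length = 0
    count = 0

    file = open(filename, "r")
    for line in file:
        word = line.strip()

        # any word is longer than "", so no special case is needed
        if len(word) > len(longest):
            longest = word

        # special case: the first word always becomes the shortest so far
        if shortest == "" or len(word) < len(shortest):
            shortest = word

        total_length += len(word)
        count += 1
    file.close()

    print("Number of words: " + str(count))
    print("Longest word: " + longest)
    print("Shortest word: " + shortest)
    print("Avg. word length: " + str(total_length / count))
```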

   - I have a file called "english.txt" which contains a list of ~47K English words. I can use this to understand some basic stats about English:
      - again, the file called "english.txt" needs to be in the same directory as the .py file

         >>> file_stats("english.txt")
         Number of words: 47158
         Longest word: antidisestablishmentarianism
         Shortest word: Hz
         Avg. word length: 8.37891768099

what does this tell us about English? Average word length is about 8.4? Does that sound right?
   - seems long!
   - the problem is that it doesn't take into account word frequency. This is just a dictionary of words
   - How might we measure actual word average length in language use?
      - try and find a corpus/sample of dialogue
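
   - One possible way to measure this, given a text file of actual language use: average over every word occurrence, so frequent words count more. The function name average_word_length is made up for illustration, and punctuation attached to a word is counted as part of it:

```python
def average_word_length(filename):
    total_length = 0
    word_count = 0
    file = open(filename, "r")
    for line in file:
        # every occurrence of a word contributes to the average
        for word in line.split():
            total_length += len(word)
            word_count += 1
    file.close()
    # assumes the file contains at least one word
    return total_length / word_count
```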

write a function called read_numbers that takes a file of numbers (one per line) and generates a list consisting of the numbers in that file
   - look at read_numbers function in dictionaries.py code
      - if you're reading numbers, don't forget to turn them into ints using "int"

      >>> data = read_numbers('numbers.txt')
      >>> data
      [1, 2, 3, 2, 1, 1, 2, 6, 7, 8, 10, 1, 5, 5, 5, 3, 8, 6, 7, 6, 4, 1, 1, 2, 3, 1, 2, 3]
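
      - A sketch of read_numbers, assuming every line holds exactly one integer and there are no blank lines (the code in dictionaries.py may differ):

```python
def read_numbers(filename):
    numbers = []
    file = open(filename, "r")
    for line in file:
        # int() ignores surrounding whitespace, including the
        # end of line character, so no strip() is needed here
        numbers.append(int(line))
    file.close()
    return numbers
```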

what if we wanted to find the most frequent value in this data?
   - how would you do it?
   - do it on paper: [1, 2, 3, 2, 3, 2, 1, 1, 5, 4, 4, 5]
      - how did you do it?
         - kept a tally for each number
         - each time you saw a new number, added it to your list with a count of 1
         - if it was something you'd seen already, add another tally/count
      - key idea, keeping track of two things:
         - a key, which is the thing you're looking up
         - a value, which is associated with each key
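
   - The paper procedure can be sketched in code with a Python dictionary, which stores exactly this kind of key/value pairing. Using a dictionary here is an assumption about where these notes are heading, and the function names tally and most_frequent are made up for illustration:

```python
def tally(numbers):
    counts = {}
    for number in numbers:
        if number in counts:
            # seen before: add another tally mark
            counts[number] += 1
        else:
            # new number: start its count at 1
            counts[number] = 1
    return counts


def most_frequent(numbers):
    counts = tally(numbers)
    best = None
    for key in counts:
        # on a tie, the key that was seen first is kept
        if best is None or counts[key] > counts[best]:
            best = key
    return best
```

   - On the paper example, tally gives a count of 3 for both 1 and 2, so there the most frequent value is a tie.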