Lecture 8: Debugging and Files

Topics

Debugging

  • Debugging: Removing bugs from programs
    • A "bug" is a behavior in the code that is not intended. Debugging is the practice of trying to find and fix bugs.
    • It's usually better not to introduce bugs into your program in the first place
    • (Use for elt in list, list comprehensions, etc. The less code you write, the fewer places a bug can hide!)
  • The quadruple_input function in debugging.py attempts to quadruple the input's value by adding it four times
    • However, if you run it, say with 5:

      >>> quadruple_input(5)
      6
      
      • we get the wrong answer
    • (In fact, no matter what you run it with, you always get the same answer!)

      >>> quadruple_input(2)
      6
      >>> quadruple_input(10)
      6
      
    • You might be able to look at the code and find the bug in this example. However, if you can't, you can try and add more information to your program to figure out what the problem is.
  • Adding print calls is one good way to figure out what your function is doing
    • Look at debugging_with_prints.py code
      • If we run this version, we start to see what the problem is:

        >>> quadruple_input(5)
         A: T 0 I 5
         --
         B: T 0 I 0
         C: T 0 I 0
         --
         B: T 0 I 1
         C: T 1 I 1
         --
         B: T 1 I 2
         C: T 3 I 2
         --
         B: T 3 I 3
         C: T 6 I 3
         D: T 6 I 3
         6
        
    • The problem is that we've used the input parameter as the variable in the for loop and the value is getting lost!
    • The fix is to use a different variable name here (e.g., i)
    • When you're all done debugging and your code works, make sure to remove the print statements!
    • It's worth taking ten seconds to make your print formatting nice
      • In loop vs out of loop
      • Iteration boundaries
      • Labels for positions
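
debugging.py itself isn't reproduced in these notes, so the sketch below is only a reconstruction that is consistent with the trace above (T is total, I is input). The names quadruple_input_buggy and quadruple_input_fixed are made up for this illustration.

```python
def quadruple_input_buggy(input):
    """Tries to return 4 * input, but reuses `input` as the loop variable."""
    total = 0
    for input in range(4):      # bug: overwrites the parameter with 0, 1, 2, 3
        total = total + input   # adds the loop counter, not the original value
    return total


def quadruple_input_fixed(input):
    """Same idea with a separate loop variable, so `input` keeps its value."""
    total = 0
    for i in range(4):
        total = total + input
    return total


print(quadruple_input_buggy(5))   # 6, whatever argument is passed
print(quadruple_input_fixed(5))   # 20
```

The buggy version always returns 0 + 1 + 2 + 3 = 6, which matches the three calls shown earlier.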

The Debugger

  • Use the little bug icon to run a special debugging program
    • It runs Python code in a special way that is under the control of another program
    • When the program breaks (because you asked it to with a "break point" by clicking in the gutter near the line numbers, or by calling the breakpoint() function in Python >= 3.7, or when your code hits an error), you can…
    • Step in/step out/step over the next line of code (if it's not stuck at an error, of course)
    • View call stack
    • View stack frame, i.e. variables in local scope

Submitting HW

  • Make sure you submit a file called assignN.py!
  • Make sure Gradescope accepts your submission! You can see the result right after you submit, and you can resubmit to fix any problem.
    • So check after you submit! Often it's just a matter of renaming a file, see the previous point
  • Check Gradescope for assignments 1 and 2 and make sure the grades make sense
    • We always try to get grades out within a week of the due date.

Office and Mentor Hours

Even more string methods

  • a few that might be useful (though there are many more)
    • lower

      >>> s = "Banana"
      >>> s.lower()
      'banana'
      
      >>> s
      'Banana'
      
      • Remember, strings are immutable!
    • replace

      >>> s.replace("a", "o")
      'Bonono'
      
      >>> s
      'Banana'
      
      • remember that strings are immutable
      • if you want to update a variable after the method has been applied, you can do the following:

        >>> s
        'Banana'
        >>> s = s.replace("a", "o")
        >>> s
        'Bonono'
        
    • find

      >>> s.find("a")
      1
      >>> s.find("q")
      -1
      >>> s.find("a", 2)
      3
      
    • count

      >>> s.count("a")
      3
      >>> s.count("b")
      0
      
  • write a function find_letter(string, letter) which returns the index of the first occurrence of letter in string
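
One possible solution to the find_letter exercise, as a minimal sketch. Returning -1 when the letter is missing is an assumption, chosen to mirror what s.find does above; the exercise statement does not say what should happen in that case.

```python
def find_letter(string, letter):
    """Return the index of the first occurrence of letter, or -1 if absent."""
    for i in range(len(string)):
        if string[i] == letter:
            return i    # stop at the first match
    return -1           # assumed convention, same as str.find


print(find_letter("Banana", "a"))   # 1
print(find_letter("Banana", "q"))   # -1
```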

Files

  • What is a file?
    • A chunk of data stored on the hard disk
  • Why do we need files?
    • hard-drives persist state regardless of whether the power is on or not
    • when a program is running, all the data it is generating/processing is in main memory (e.g. RAM)
    • main memory is faster, but doesn't persist when the power goes off

Reading Files

  • to read a file in Python we first need to open it
    • If we just want to hard-code the name and the name of the file is "some_file_name" then: file = open("some_file_name", "r").
      • or if the name of the file is in a variable, then

        name_of_file = "some_file_name"
        file = open(name_of_file, "r")
        
      • open is another function that has two parameters, both strings
      • the first parameter is a string identifying the filename
        • be careful about the path/directory. Python looks for the file relative to the current working directory, which is normally the directory containing the program (.py file) when you run it from there, unless you give a path that says otherwise
      • the second parameter is another string telling Python what you want to do with the file
        • "r" stands for "read", that is, we're going to read some data from the file
      • open returns a 'file' object that we can use later on for reading purposes
        • above, I've saved that in a variable called "file", but I could have called it anything else of course
  • once we have a file open, we can read a line at a time from the file using a for loop:

    for variable in file_variable:
       # do something
    
    • for each line in the file, the loop will get run
    • each time the variable will get assigned to the next line in the file
      • the line will be of type string
      • the line will also have an end-of-line character at the end of it, which you'll often want to get rid of (the string's strip() method is often good for this)
    • But! Bug warning! Note that this "uses up" the file_variable which now points at the end of the file. This code will only print out every line once—we'll never enter the second loop!

      for line in my_file:
          print(line)
      for line in my_file:
          # This line is actually never executed!
          print(line)
      
      • To go through the file twice, open it a second time (or rewind it to the beginning with my_file.seek(0)).
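
To see the "used up" behavior concretely, the sketch below writes a small throwaway file and loops over it three times. The file name demo.txt and the variable names are made up for this illustration; seek(0) is a standard file method that moves back to the start of the file.

```python
import os
import tempfile

# Create a small throwaway file so the example runs anywhere.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
out = open(path, "w")
out.write("first\nsecond\n")
out.close()

my_file = open(path, "r")
first_pass = [line.strip() for line in my_file]
second_pass = [line.strip() for line in my_file]   # nothing left to read
my_file.seek(0)                                    # rewind to the beginning
third_pass = [line.strip() for line in my_file]
my_file.close()

print(first_pass)    # ['first', 'second']
print(second_pass)   # []
print(third_pass)    # ['first', 'second']
```
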
  • Look at the line_count function in file_basics.py
    • This is a common pattern for reading from files:

      # 1. Open the file
      file = open(filename, "r")
      # 2. Iterate through the file one line at a time
      for line in file:
          ...
      # 3. close the file --- don't forget this part!
      file.close()
      
    • In this case, we're just incrementing the counter, line_count, each time through the loop. The result is a count of the number of lines in the file.
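      • What happens if you don't close the file? Usually nothing noticeable for a file you only read. But every open file uses up an operating system resource, so a program that opens many files without closing them can eventually fail to open any more.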
    • For example, I've put some text in a file called basic.txt:

      This is a file
      It has some text in it
      It's not very EXCITING
      
    • If I run the function on the file:

      >>> line_count("basic.txt")
      3
      
  • Look at the print_file_almost function
    • Again, very similar structure
    • Only difference, in this case we're printing out what each line is.
    • If we run it on our basic text file:

      >>> print_file_almost("basic.txt")
      This is a file
      
      It has some text in it
      
      It's not very EXCITING
      
      
    • Anything funny about this?
      • there are extra blank lines between the output!
    • To try and understand this, let's add a debugging statement, specifically, add "print(len(line))" inside the for loop and run again:

      This is a file
      
      15
      It has some text in it
      
      23
      It's not very EXCITING
      22
      
      
    • If you count the characters, there's one extra!
      • what's the problem?
      • when you read a line from the file, you also get the end-of-line character
      • what's really in this file is:

        This is a file\nIt has some text in it\nIt's not very EXCITING
        
    • to fix this, we want to "strip" (i.e. remove) the end of line character: look at the print_file function
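
file_basics.py is not reproduced in these notes, so the following is only a guess at what line_count and print_file might look like, built from the open / loop / close pattern above. It recreates basic.txt in a temporary directory so that it runs on its own.

```python
import os
import tempfile


def line_count(filename):
    """Count the lines in a file using the open / loop / close pattern."""
    file = open(filename, "r")
    count = 0
    for line in file:
        count += 1
    file.close()
    return count


def print_file(filename):
    """Print each line without its trailing end-of-line character."""
    file = open(filename, "r")
    for line in file:
        # strip() also removes leading/trailing spaces and tabs;
        # line.rstrip("\n") would remove only the newline.
        print(line.strip())
    file.close()


# Recreate basic.txt (no newline after the last line, as in the notes).
path = os.path.join(tempfile.mkdtemp(), "basic.txt")
out = open(path, "w")
out.write("This is a file\nIt has some text in it\nIt's not very EXCITING")
out.close()

print(line_count(path))   # 3
print_file(path)          # the three lines, with no blank lines in between
```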

split()

  • split is a method called on a string that splits up the string

    >>> "this is a sentence with words".split()
    ['this', 'is', 'a', 'sentence', 'with', 'words']
    >>> s = "this is a sentence with words"
    >>> s.split()
    ['this', 'is', 'a', 'sentence', 'with', 'words']
    
  • splits based on one or more whitespace characters (spaces, tabs, end of line characters)
  • You can also optionally specify what string you want to split on instead of whitespace

    >>> s.split("s")
    ['thi', ' i', ' a ', 'entence with word', '']
    
  • What does the file_word_count function do?
    • similar to line counting
    • instead of adding 1 to the counter each time through the loop, we add len(words)
      • words is just a list of the words in the line (assuming words are separated by whitespace)

        >>> file_word_count("basic.txt")
        14
        
  • I've put together a file of Wikipedia sentences. I can also run my functions on that:

    >>> line_count("wikipedia.txt")
    3856679
    >>> file_word_count("wikipedia.txt")
    97912818
    
    • The file contains almost 4M sentences (I've put one sentence per line in the file) and 98M words!
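
A minimal sketch of what file_word_count might look like, assuming it follows the same pattern as line counting with split() applied to each line. The actual function in the course files may differ; the temporary copy of basic.txt is there only to make the example self-contained.

```python
import os
import tempfile


def file_word_count(filename):
    """Count whitespace-separated words over all lines of a file."""
    file = open(filename, "r")
    count = 0
    for line in file:
        words = line.split()    # split on runs of whitespace, newline included
        count += len(words)
    file.close()
    return count


# Recreate basic.txt in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "basic.txt")
out = open(path, "w")
out.write("This is a file\nIt has some text in it\nIt's not very EXCITING")
out.close()

print(file_word_count(path))   # 14  (4 + 6 + 4)
```

Because split() with no argument already discards the end-of-line character, no separate strip() call is needed here.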

More text statistics

  • What does the capitalize_count function do?
    • Relies on the "isupper" method:

      >>> "banana".isupper()
      False
      >>> "Banana".isupper()
      False
      >>> "BANANA".isupper()
      True
      
      • Returns True if every letter in the string is uppercase (and the string contains at least one letter)
    • what does word[0].isupper() ask then, assuming word is a string?

      • checks to see if the word starts with an uppercase letter

        >>> "Banana"[0]
        'B'
        >>> "Banana"[0].isupper()
        True
        
      • The whole function:
        • iterates through the words a word at a time
        • checks to see if the word is capitalized (well, begins with a capital letter). If so, increments the count
  • What does the file_capitalized_count function do?
    • Very similar to file_word_count
    • only difference is that instead of adding len(words), we add capitalized_count(words), i.e. the number of capitalized words in the line
    • Counts the total number of capitalized words in the file

      >>> file_capitalized_count("basic.txt")
      4
      >>> file_capitalized_count("wikipedia.txt")
      16248184
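
The per-line counting step described above could be sketched as follows. This is an assumed implementation of capitalized_count based on the description, not the course's actual code.

```python
def capitalized_count(words):
    """Count the strings in words whose first character is uppercase."""
    count = 0
    for word in words:
        # word[0] is safe here because split() never produces empty strings.
        if word[0].isupper():
            count += 1
    return count


line = "It's not very EXCITING"
print(capitalized_count(line.split()))   # 2  ("It's" and "EXCITING")
```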
      
  • look at file_stats in word_stats.py
    • It iterates over each item in the file and keeps track of:
      • longest string found
      • shortest string found
      • total length of the strings iterated over
      • the total number of strings/items
    • how does it keep track of the longest?
      • starts with ""
      • compares every word to the longest so far
      • if longer, updates longest
    • what does the shortest == "" or check do? Why don't we have it for the longest condition?
      • for longest, we started with the shortest possible string, so any string will be longer
      • hard to start with the longest possible string
      • instead we add a special case for the first time through the loop
        • could have initialized shortest to be a really long string, but this is a more robust solution
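
The bookkeeping described above can be sketched on a plain list of words. The function name word_stats and the decision to return the values rather than print them are assumptions made for this illustration; file_stats in word_stats.py reads the words from a file instead.

```python
def word_stats(words):
    """Return (longest, shortest, average length) for a non-empty list."""
    longest = ""
    shortest = ""
    total_length = 0
    count = 0
    for word in words:
        if len(word) > len(longest):
            longest = word
        # Special case: the first time through, shortest is still ""
        # and no real word is shorter than that, so replace it outright.
        if shortest == "" or len(word) < len(shortest):
            shortest = word
        total_length += len(word)
        count += 1
    return longest, shortest, total_length / count


longest, shortest, average = word_stats(["banana", "Hz", "apple"])
print(longest)    # banana
print(shortest)   # Hz
```

Note that this sketch divides by count, so it would fail on an empty list.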
  • I have a file called english.txt which contains a list of ~47K English words. I can use this to understand some basic stats about English:
    • again, the file called english.txt needs to be in the same directory as the .py file

      >>> file_stats("english.txt")
      Number of words: 47158
      Longest word: antidisestablishmentarianism
      Shortest word: Hz
      Avg. word length: 8.37891768099
      
    • what does this tell us about English? Average word length is about 8.4? Does that sound right?
      • seems long!
    • the problem is that it doesn't take into account word frequency. This is just a dictionary of words
    • How might we measure actual word average length in language use?
      • try and find a corpus/sample of dialogue

Measuring Frequencies

  • write a function called read_numbers that takes a file of numbers (one per line) and generates a list consisting of the numbers in that file
    • look at read_numbers in dictionaries.py
      • if you're reading numbers, don't forget to turn them into ints using "int"

        >>> data = read_numbers('numbers.txt')
        >>> data
        [1, 2, 3, 2, 1, 1, 2, 6, 7, 8, 10, 1, 5, 5, 5, 3, 8, 6, 7, 6, 4, 1, 1, 2, 3, 1, 2, 3]
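
A minimal sketch of read_numbers, assuming one integer per line and no blank lines. The version in dictionaries.py may differ, and the short demo file is made up so the example runs on its own.

```python
import os
import tempfile


def read_numbers(filename):
    """Read a file with one integer per line into a list of ints."""
    file = open(filename, "r")
    numbers = []
    for line in file:
        # int() ignores surrounding whitespace, so the trailing newline
        # is fine; a blank line would raise a ValueError.
        numbers.append(int(line))
    file.close()
    return numbers


# A short stand-in for numbers.txt.
path = os.path.join(tempfile.mkdtemp(), "numbers.txt")
out = open(path, "w")
out.write("1\n2\n3\n2\n1\n")
out.close()

data = read_numbers(path)
print(data)   # [1, 2, 3, 2, 1]
```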
        
  • what if we wanted to find the most frequent value in this data?
    • how would you do it?
    • do it on paper: [1, 2, 3, 2, 3, 2, 1, 1, 5, 4, 4, 5]
      • how did you do it?
        • kept a tally of each number
        • each time you saw a new number, added it to your list with a count of 1
        • if it was something you'd seen already, add another tally/count
      • key idea, keeping track of two things:
        • a key, which is the thing you're looking up
        • a value, which is associated with each key
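
The paper-and-pencil tally maps naturally onto a Python dictionary, where each number is a key and its count is the value. The sketch below is one way to write it; the function name tally is made up for this illustration.

```python
def tally(data):
    """Return a dictionary mapping each value in data to how often it occurs."""
    counts = {}
    for value in data:
        if value in counts:
            counts[value] += 1    # seen before: add another tally mark
        else:
            counts[value] = 1     # new number: start its count at 1
    return counts


counts = tally([1, 2, 3, 2, 3, 2, 1, 1, 5, 4, 4, 5])
print(counts)   # {1: 3, 2: 3, 3: 2, 5: 2, 4: 2}
```

In this example list, 1 and 2 are tied for most frequent with three occurrences each.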

Author: Joseph C. Osborn

Created: 2020-04-21 Tue 10:44
