Lecture 8: Debugging and Files
Topics
Debugging
- Debugging: Removing bugs from programs
- A "bug" is a behavior in the code that is not intended. Debugging is the practice of trying to find and fix bugs.
- It's usually better not to introduce bugs into your program in the first place
- (Use "for elt in list" or list comprehensions, etc. Write less code!)
- The quadruple_input function in debugging.py attempts to quadruple the input's value by adding it four times. However, if you run it, say with 5:
>>> quadruple_input(5)
6
- we get the wrong answer
(In fact, no matter what you run it with, you always get the same answer!)
>>> quadruple_input(2)
6
>>> quadruple_input(10)
6
- You might be able to look at the code and find the bug in this example. However, if you can't, you can try and add more information to your program to figure out what the problem is.
- Adding print calls is one good way to figure out what your function is doing
- Look at debugging_with_prints.py code
If we run this version, we start to see what the problem is:
>>> quadruple_input(5)
A: T 0 I 5
--
B: T 0 I 0
C: T 0 I 0
--
B: T 0 I 1
C: T 1 I 1
--
B: T 1 I 2
C: T 3 I 2
--
B: T 3 I 3
C: T 6 I 3
D: T 6 I 3
6
- The problem is that we've used the input parameter as the loop variable in the for loop, so the value that was passed in gets overwritten!
- The fix is to use a different variable name for the loop (e.g., i)
- When you're all done debugging and your code works, make sure to remove the print statements!
- It's worth taking ten seconds to make your print formatting nice
- In loop vs out of loop
- Iteration boundaries
- Labels for positions
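debugging.py itself is not reproduced in these notes, so the following is only a reconstruction that is consistent with the printed trace above. It shows a buggy version that reuses the parameter name as the loop variable, next to a fixed version that uses i. The name quadruple_input_fixed is made up for this sketch.

```python
# Hypothetical reconstruction: the real debugging.py may differ in detail.

def quadruple_input(input):
    """Buggy: the loop variable reuses (and overwrites) the parameter."""
    total = 0
    for input in range(4):   # input is now 0, 1, 2, 3; the argument is lost
        total += input
    return total             # always 0 + 1 + 2 + 3 == 6


def quadruple_input_fixed(input):
    """Fixed: a separate loop variable leaves the parameter alone."""
    total = 0
    for i in range(4):       # i only counts the four additions
        total += input
    return total


print(quadruple_input(5))        # 6, regardless of the argument
print(quadruple_input_fixed(5))  # 20
```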
The Debugger
- Use the little bug icon to run a special debugging program
- It runs Python code in a special way that is under the control of another program
- When the program breaks (because you asked it to with a "break point" by clicking in the gutter near the line numbers, or by calling the breakpoint() function in Python >= 3.7, or when your code hits an error), you can…
- Step in/step out/step over the next line of code (if it's not stuck at an error, of course)
- View call stack
- View stack frame, i.e. variables in local scope
Submitting HW
- Make sure you submit a file called assignN.py!
- Make sure Gradescope likes your submission! You can see the result right after submitting, and you can resubmit to fix it.
- So check after you submit! Often it's just a matter of renaming a file, see the previous point
- Check Gradescope for assignments 1 and 2 and make sure the grades make sense
- We always try to get grades out within a week of the due date.
Office and Mentor Hours
Even more string methods
- a few that might be useful (though there are many more)
lower
>>> s = "Banana"
>>> s.lower()
'banana'
>>> s
'Banana'
- Remember, strings are immutable!
replace
>>> s.replace("a", "o")
'Bonono'
>>> s
'Banana'
- remember that strings are immutable
if you want to update a variable after the method has been applied, you can do the following:
>>> s
'Banana'
>>> s = s.replace("a", "o")
>>> s
'Bonono'
find
>>> s = "Banana"
>>> s.find("a")
1
>>> s.find("q")
-1
>>> s.find("a", 2)
3
count
>>> s.count("a")
3
>>> s.count("b")
0
- write a function find_letter(string, letter) which returns the index of the first occurrence of letter in string
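One possible solution to this exercise is sketched below. Returning -1 when the letter is absent is an assumption made here to mirror the find method shown above. The exercise does not say what should happen in that case.

```python
def find_letter(string, letter):
    """Return the index of the first occurrence of letter in string, or -1."""
    for index in range(len(string)):
        if string[index] == letter:
            return index
    return -1   # assumption: mirror str.find when the letter is absent


print(find_letter("Banana", "a"))  # 1
print(find_letter("Banana", "q"))  # -1
```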
Files
- What is a file?
- A chunk of data stored on the hard disk
- Why do we need files?
- hard drives keep their data regardless of whether the power is on or not
- when a program is running, all the data it is generating/processing is in main memory (e.g. RAM)
- main memory is faster, but doesn't persist when the power goes off
Reading Files
- to read a file in Python we first need to open it
- If we just want to hard-code the name and the name of the file is "some_file_name" then:
file = open("some_file_name", "r")
or, if the name of the file is in a variable, then:
name_of_file = "some_file_name"
file = open(name_of_file, "r")
- open is another function; here we call it with two arguments, both strings
- the first parameter is a string identifying the filename
- be careful about the path/directory. Python looks for a plain filename in the current working directory, which is typically the directory containing the program (.py file) when you run it from there, unless you tell it to look elsewhere
- the second parameter is another string telling Python what you want to do with the file
- "r" stands for "read", that is, we're going to read some data from the file
- open returns a 'file' object that we can use later on for reading purposes
- above, I've saved that in a variable called "file", but I could have called it anything else of course
Once we have a file open, we can read a line at a time from the file using a for loop:
for variable in file_variable:
    # do something
- for each line in the file, the loop will get run
- each time the variable will get assigned to the next line in the file
- the line will be of type string
- the line will also have an end-of-line character at the end of it which you'll often want to get rid of (the string's strip() method is often good for this)
But! Bug warning! Note that this "uses up" the file_variable, which now points at the end of the file. The code below only prints each line once, because we never enter the body of the second loop!
for line in my_file:
    print(line)
for line in my_file:
    # This line is actually never executed!
    print(line)
- To go through the file twice, open it a second time (or rewind the file object with my_file.seek(0)).
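A small self-contained demonstration of the "used up" file object follows. The file name example.txt and the use of "w" mode to create it are assumptions made only to set up the example; writing files is not part of this section.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.txt")   # hypothetical file name
    out = open(path, "w")
    out.write("first\nsecond\n")
    out.close()

    my_file = open(path, "r")
    first_pass = [line for line in my_file]    # reads both lines
    second_pass = [line for line in my_file]   # already at the end: empty

    my_file.seek(0)                            # rewind to the beginning
    third_pass = [line for line in my_file]    # reads both lines again
    my_file.close()

print(len(first_pass), len(second_pass), len(third_pass))  # 2 0 2
```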
- Look at the line_count function in file_basics.py
This is a common pattern for reading from files:
# 1. Open the file
file = open(filename, "r")
# 2. Iterate through the file one line at a time
for line in file:
    ...
# 3. Close the file (don't forget this part!)
file.close()
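- In this case, we're just incrementing the counter, line_count, each time through the loop. The result is a count of the number of lines in the file.
- What happens if you don't close the file? Well… usually nothing you'd notice, but every open file uses an operating system resource, so a program that opens a lot of files without closing them can eventually fail to open any more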
For example, I've put some text in a file called basic.txt:
This is a file
It has some text in it
It's not very EXCITING
If I run the function on the file:
>>> line_count("basic.txt")
3
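The version in file_basics.py is not shown in these notes. A minimal sketch that follows the open / loop / close pattern above might look like this. The local counter is named count here to avoid clashing with the function name, and basic.txt is recreated in a temporary directory so the example can run on its own.

```python
import os
import tempfile


def line_count(filename):
    """Count the lines in a file using the open / loop / close pattern."""
    file = open(filename, "r")
    count = 0            # the notes call this counter line_count
    for line in file:
        count += 1
    file.close()
    return count


# Recreate basic.txt in a temporary directory for the demonstration.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "basic.txt")
    setup = open(path, "w")
    setup.write("This is a file\nIt has some text in it\nIt's not very EXCITING")
    setup.close()
    result = line_count(path)

print(result)  # 3
```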
- Look at the print_file_almost function
- Again, very similar structure
- Only difference, in this case we're printing out what each line is.
If we run it on our basic text file:
>>> print_file_almost("basic.txt")
This is a file

It has some text in it

It's not very EXCITING
- Anything funny about this?
- there are extra blank lines between the output!
To try and understand this, let's add some debugging statements. Specifically, add "print(len(line))" in the for loop and run again:
This is a file

15
It has some text in it

23
It's not very EXCITING
22
- If you count the characters, there's one extra on every line except the last!
- what's the problem?
- when you read a line from the file, you also get the end-of-line character
what's really in this file is:
This is a file\nIt has some text in it\nIt's not very EXCITING
- to fix this, we want to "strip" (i.e. remove) the end of line character: look at the print_file function
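The print_file function itself is not reproduced here. A plausible sketch, assuming it simply strips each line before printing, is shown below together with the line lengths before and after stripping.

```python
import os
import tempfile


def print_file(filename):
    """Print each line of the file without its trailing end-of-line character."""
    file = open(filename, "r")
    for line in file:
        print(line.strip())   # strip() removes leading and trailing whitespace
    file.close()


# Recreate basic.txt in a temporary directory for the demonstration.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "basic.txt")
    setup = open(path, "w")
    setup.write("This is a file\nIt has some text in it\nIt's not very EXCITING")
    setup.close()

    file = open(path, "r")
    lines = [line for line in file]
    file.close()

    print_file(path)   # three lines, with no blank lines in between

raw_lengths = [len(line) for line in lines]
stripped_lengths = [len(line.strip()) for line in lines]
print(raw_lengths)       # [15, 23, 22]
print(stripped_lengths)  # [14, 22, 22]
```

Note that strip() also removes leading whitespace, so indentation would be lost. If that matters, line.rstrip("\n") removes only the end-of-line character.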
split()
split is a method called on a string that splits up the string
>>> "this is a sentence with words".split()
['this', 'is', 'a', 'sentence', 'with', 'words']
>>> s = "this is a sentence with words"
>>> s.split()
['this', 'is', 'a', 'sentence', 'with', 'words']
- splits based on one or more whitespace characters (spaces, tabs, end of line characters)
You can also optionally specify what string you want to split on instead of whitespace
>>> s.split("s")
['thi', ' i', ' a ', 'entence with word', '']
- What does the file_word_count function do?
- similar to line counting
- instead of adding 1 to the counter each time through the loop, we add "len(words)"
- words is just a list of the words in the line (assuming words are separated by whitespace)
>>> file_word_count("basic.txt")
14
I've put together a file of Wikipedia sentences. I can also run my functions on that:
>>> line_count("wikipedia.txt")
3856679
>>> file_word_count("wikipedia.txt")
97912818
- The file contains almost 4M sentences (I've put one sentence per line in the file) and 98M words!
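As a rough sketch of what file_word_count might look like (the original source file is not shown, so details such as variable names are assumptions), the only change from line counting is that each line is split and the length of the resulting list is added:

```python
import os
import tempfile


def file_word_count(filename):
    """Count whitespace-separated words across every line of a file."""
    file = open(filename, "r")
    count = 0
    for line in file:
        words = line.split()   # list of the words on this line
        count += len(words)
    file.close()
    return count


# Recreate basic.txt in a temporary directory for the demonstration.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "basic.txt")
    setup = open(path, "w")
    setup.write("This is a file\nIt has some text in it\nIt's not very EXCITING")
    setup.close()
    word_total = file_word_count(path)

print(word_total)  # 14
```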
More text statistics
- What does the capitalize_count function do?
Relies on the "isupper" method:
>>> "banana".isupper()
False
>>> "Banana".isupper()
False
>>> "BANANA".isupper()
True
- Returns True if all of the letters in the string are uppercase (and the string contains at least one letter)
- what does word[0].isupper() ask then, assuming word is a string?
- checks to see if the word starts with an uppercase letter
>>> "Banana"[0]
'B'
>>> "Banana"[0].isupper()
True
- The whole function:
- iterates through the words a word at a time
- checks to see if the word is capitalized (well, begins with a capital letter). If so, increments the count
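The description above can be sketched as follows. This is a guess at the function's shape rather than the actual code, and it assumes the words come from split() with no arguments, which never produces empty strings (an empty string would make word[0] fail).

```python
def capitalize_count(words):
    """Count the words in a list that begin with an uppercase letter."""
    count = 0
    for word in words:
        if word[0].isupper():   # assumes word is not the empty string
            count += 1
    return count


line = "It's not very EXCITING"
print(capitalize_count(line.split()))  # 2
```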
- What does the file_capitalized_count function do?
- Very similar to file_word_count
- only difference is that instead of adding len(words), we add capitalized_count(words), i.e. the number of capitalized words in the line
- Counts the total number of capitalized words in the file
>>> file_capitalized_count("basic.txt")
4
>>> file_capitalized_count("wikipedia.txt")
16248184
- look at file_stats in word_stats.py
- It iterates over each item in the file and keeps track of:
- longest string found
- shortest string found
- total length of the strings iterated over
- the total number of strings/items
- how does it keep track of the longest?
- starts with ""
- compares every word to the longest so far
- if longer, updates longest
- what does the shortest == "" or part of the condition do? Why don't we have it for the longest condition?
- for longest, we started with the shortest possible string, so any string will be longer
- hard to start with the longest possible string
- instead we add a special case for the first time through the loop
- could have initialized shortest to be a really long string, but this is a more robust solution
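The longest/shortest bookkeeping described above might be sketched like this. It is a variation for illustration, not the code in word_stats.py: it returns the four values instead of printing them, and it assumes a non-empty file with one word per line and no blank lines. The sample words are made up.

```python
import os
import tempfile


def file_stats(filename):
    """Return (count, longest, shortest, average length) for a word-per-line file."""
    file = open(filename, "r")
    longest = ""
    shortest = ""
    total_length = 0
    count = 0
    for line in file:
        word = line.strip()
        if len(word) > len(longest):
            longest = word
        # shortest == "" is only true before the first word has been seen
        if shortest == "" or len(word) < len(shortest):
            shortest = word
        total_length += len(word)
        count += 1
    file.close()
    return count, longest, shortest, total_length / count


# A tiny made-up word list stands in for english.txt.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "words.txt")
    setup = open(path, "w")
    setup.write("banana\nHz\nantidisestablishmentarianism\ncat\n")
    setup.close()
    count, longest, shortest, average = file_stats(path)

print(count, longest, shortest, average)
```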
- I have a file called english.txt which contains a list of ~47K English words. I can use this to understand some basic stats about English:
again, the file called english.txt needs to be in the same directory as the .py file
>>> file_stats("english.txt")
Number of words: 47158
Longest word: antidisestablishmentarianism
Shortest word: Hz
Avg. word length: 8.37891768099
- what does this tell us about English? Average word length is 8.3? Does that sound right?
- seems long!
- the problem is that it doesn't take into account word frequency. This is just a dictionary of words
- How might we measure actual word average length in language use?
- try and find a corpus/sample of dialogue
Measuring Frequencies
- write a function called read_numbers that takes a file of numbers (one per line) and generates a list consisting of the numbers in that file
- look at read_numbers in dictionaries.py
- if you're reading numbers, don't forget to turn them into ints using "int"
>>> data = read_numbers('numbers.txt')
>>> data
[1, 2, 3, 2, 1, 1, 2, 6, 7, 8, 10, 1, 5, 5, 5, 3, 8, 6, 7, 6, 4, 1, 1, 2, 3, 1, 2, 3]
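A minimal sketch of read_numbers, assuming every line holds exactly one integer (a blank line would make int fail), could be written as below. The short sample file is made up for the demonstration and is not the numbers.txt used above.

```python
import os
import tempfile


def read_numbers(filename):
    """Read a file with one integer per line and return the numbers as a list."""
    file = open(filename, "r")
    numbers = []
    for line in file:
        numbers.append(int(line))   # int() ignores the trailing "\n"
    file.close()
    return numbers


# A short made-up file for the demonstration.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "numbers.txt")
    setup = open(path, "w")
    setup.write("1\n2\n3\n2\n1\n")
    setup.close()
    data = read_numbers(path)

print(data)  # [1, 2, 3, 2, 1]
```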
- what if we wanted to find the most frequent value in this data?
- how would you do it?
- do it on paper: [1, 2, 3, 2, 3, 2, 1, 1, 5, 4, 4, 5]
- how did you do it?
- kept a tally for each number
- each time you saw a new number, added it to your list with a count of 1
- if it was something you'd seen already, add another tally/count
- key idea, keeping track of two things:
- a key, which is the thing you're looking up
- a value, which is associated with each key
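The paper-and-pencil tally can be sketched in Python with a dictionary, where each number is a key and its tally is the value. Using a dict here is an assumption about where the lecture is heading, and the name count_values is made up for this sketch.

```python
def count_values(data):
    """Build a tally: each distinct value (key) maps to how often it occurs."""
    counts = {}
    for value in data:
        if value in counts:
            counts[value] += 1   # seen before: add another tally mark
        else:
            counts[value] = 1    # new number: start its tally at 1
    return counts


data = [1, 2, 3, 2, 3, 2, 1, 1, 5, 4, 4, 5]
counts = count_values(data)
print(counts)  # {1: 3, 2: 3, 3: 2, 5: 2, 4: 2}

# There can be ties, so collect every value that reaches the highest tally.
highest = max(counts.values())
most_frequent = [value for value in counts if counts[value] == highest]
print(most_frequent)  # [1, 2]
```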