For this assignment, you will:
ArrayList
and Association
classesIn this assignment, we will use a basic technique in artificial intelligence to automatically generate text. You will write a program that reads in a piece of text and then uses that text as a basis to generate new text. The method for generating new text uses simple probability.
First, we read in a piece of text word by word (we consider punctuation symbols to be words) keeping track of how often each three-word sequence (trigram) appears. For example, consider the following excerpt from Rudyard Kipling’s poem “If”:
If you can keep your head when all about you
Are losing theirs and blaming it on you,
If you can trust yourself when all men doubt you,
But make allowance for their doubting too;
If you can wait and not be tired by waiting
In this excerpt, the trigrams are: “if you can”, “you can keep”, “can keep your”, “keep your head”, “your head when”, “head when all”, etc.
Once we have counted all of the trigrams, we can compute the probability that a word w 3 will immediately follow two other words (w 1 w 2 ) using the following equation:
p(w3|w1,w2) = n123n12
where n123 is the number of times we have observed the sequence w1w2w3 and n12 is the number of times we have observed w1w2.
For example, let’s compute the probability that the word “can” will immediately follow the words “if you.” In the excerpt above, the sequence “if you can” occurs three times. The sequence “if you” also occurs three times. So the probability is given by,
p(can|if you) = 3/3 = 1
The probability that any other word comes immediately after “if you” is 0. Consider another example. In the excerpt above, the words “you can” appear 3 times. The first time followed by “keep”, the second time by “trust” and the last time by “wait.” Thus,
p(keep|you can) = 1/3
p(trust|you can) = 1/3
p(wait|you can) = 1/3
Again, the probability that any other word besides “keep,” “trust,” or “wait” appears after “you can” is zero.
Once we have the text processed, and stored in a data structure that allows us to compute probabilities, we can generate new text. We first pick two words to start with (for example, the first two words in the input text). Given these two words, and the probabilities computed above, we use a random number generator to choose the next word. Continue this process until the new text is 400 words long (including punctuation). When printing the new text, please generate a new line after every 20 words so that you don’t just generate one very long, unreadable, line.
Note: There may be two words that were seen in the text but were never followed by any word. For example, in the excerpt of the Kipling poem above, the words “by waiting” occur once but are never followed by a word (since they are the last two words in the excerpt). In this case, you should think up a creative solution so that the text continues.
Warning: This assignment is significantly harder than the last one and will take a lot of time. Please start early to save yourself lots of grief! The correctness of the assignments in this class will be automatically verified. For this reason, you must follow all naming conventions specified in this assignment.
FreqList
classWe suggest that you write the FreqList
class first. FreqList
should contain an array list of Associations
. ArrayList
is part of the java.util
package with documentation found in the Java 8 Javadoc pages linked to from the handout page on the course website. The Association
class is part of Bailey’s structure5
package. Javadoc for this class is also available from a different link on the handout page.
An Association
has both a key and a value. The key is a word and the value is the number of times the word occurs. When a word is added, if it already occurs in the array list then its value (i.e. its frequency) is incremented by 1. If it doesn’t exist, add the word to the array list with a value (i.e. frequency) of 1.
Your FreqList
class should have an instance variable that keep tracks of the number of words added. This is equivalent to the sum of all the values (i.e. frequencies) of all the Associations
in the array list.
The FreqList
class should also have a method with the following definition:
public String get(double p)
This method takes a double p
as an input and returns a word from the array list. The input p
must be between 0.0 and 1.0, otherwise the method throws an IllegalArgumentException. How can we use p
to generate a word? In our example above, the FreqList
for the words “you can” would look like: {<“keep”, 1>, <“trust”, 1>, <“wait”, 1>}. The sum of all of the frequencies is 3. Thus, we will return “keep” whenever 0 ≤ p
< 1/3, “trust” whenever 1/3 ≤ p
< 2/3 and “wait” whenever 2/3 ≤ p
≤ 1. If the frequency list is empty, this method should return an empty string.
Write this class and test it thoroughly by adding a main method to make sure all methods work correctly. I suggest printing out a representation of the frequency list first to make sure that the table is correct before attempting to write or test the probabilistic get
method.
TextGenerator
classThe TextGenerator
class should also contain an array list of Associations
. Again, an Association
has a key and a value. In this case, the key is a sequence of two words (e.g., “you can”). The value is an object of type FreqList
. We have included a class StringPair
that can be used to represent a sequence of two words. Thus, the array list in TextGenerator
should have interface type:
List<Assocation<StringPair, FreqList>>
We have provided you with startup code in the main method of class TextGenerator
that will pop up a dialog box to allow the user to choose a file containing the input text.
We have included some text files (ending with suffix “.txt”) in the assignment folder that you can use to test your program. We’ve tried to pick files with sufficient repetition of triples.
After the input has been processed you should generate new text. You may start with a fixed pair of words or choose two words randomly. Generate and print a string of at least 400 words so that we can see how your program works. When printing the text, please generate a new line after every 20 words so that you don’t just generate one very long, unreadable, line.
Finally, we’d like you to print the table of frequencies in some reasonable fashion so we can check to see if your table is correct on our test input. Here are some lines of output based on the lyrics for Donovan’s “Blowin’ in the wind”:
The table size is 122
<Association: <how,many>=Frequency List: <Association: roads=1><Association: seas=1>
<Association: times=3><Association: years=2><Association: ears=1><Association: deaths=1>>
<Association: <many,roads>=Frequency List: <Association: must=1>>
<Association: <roads,must>=Frequency List: <Association: a=1>>
<Association: <must,a>=Frequency List: <Association: man=2><Association: white=1>>
...
Note that I cheated and manually wrapped the first line of output so that it is readable. Your output need not wrap. Note that most of this output comes from the toString
methods of Association
and ArrayList
, though I did some extra work to print each entry in the array list on a separate line.
This output indicates that after the word pair, “how many,” the words “roads,” “seas,” “times,” “years,” “ears,” and “deaths” each occurred once. Whereas after “many roads”, only “must” occurred and it only occurred once.
Warning: As a general rule it is good to be as specific as possible with import statements. Thus, when you import stuff from both java.util
and structure5
use:
import java.util.Random;
import java.util.ArrayList;
import structure5.Association;
If your program included import java.util.*;
along with import structure5.*
the program might get confused as to which version of classes it should use as there are several classes in java.util
that have the same names as those in structure5
. It is best to be safe even when we don’t have conflicts.
WordStream
classThis class is already implemented for you. It processes a text file and breaks up the input into separate words that can be requested using the method nextToken
. In the main method of the TextGenerator
class, there is an object of type WordStream
called ws. You can get successive words from ws
by calling ws.nextToken()
repeatedly. Before getting a new word, always check that there is a word available by evaluating ws.hasMoreTokens()
, which will return true
if there are more words available. If you call nextToken()
when there are no more words available (i.e., you’ve exhausted the input), then it will throw an IndexOutOfBoundsException
.
StringPair
classThis class is already implemented for you. This class provides objects that consist of a pair of strings. These should be used as the keys to your Associations
.
Pair
classThis class is already implemented for you. This is a generic class that StringPair
extends by specializing the type variables to both be Strings
. You should not directly use this class in your program. Instead use the more compact name StringPair
from the more specialized class defined above. We provided this class only to show how easy it is to define a generic class that can be instantiated in many ways.
Association
classThis class is part of Bailey’s structure5
package. There is one piece of information about the Association
class that we did not highlight in class that you will likely find very useful. The equals
method in Association
only compares key values. That is if two objects from class Association
are “equal” according to the method, then their keys are the same, but their values may be different. Thus, to find out if an association with a given key is in an arraylist
(and where it is), just create a new association with that key and use the indexOf
method to determine its index in the list. If -1 is returned, then it is not there, whereas if a non-negative integer is returned then you get the index of an association with that key in the arraylist.
FreqList
class need? For the get method in the FreqList
class, take a piece of paper and create different examples of frequency lists. For each example, work out what the get method would return for different values of p
(e.g. if the frequency list contained 4 words that each occurred once and p
= 0.45, what word would be returned?)src
directory. The starter code can be found at /common/cs/cs062/assignments/assignment02
. This directory contains a subdirectory named wordsGeneric
which is the package containing the Java starter code for this project. Copy over this entire directory.FreqList
class. Again, try and develop incrementally. That is, get one small piece working and then move on to another piece. Add a main method to the FreqList
class in order to test and ensure the correctness of your code.You will be graded based on the following criteria:
Criterion | Points |
FreqList
|
4 |
TextGenerator
|
3 |
Prints FreqList
|
2 |
Prints at least 400 words of generated text | 3 |
Appropriate comments (including JavaDoc) | 3 |
Appropriate use of data structures | 2 |
Style and formatting | 2 |
General correctness | 2 |
Submitted correctly | 1 |
Extra credit | 1 |
NOTE: Code that does not compile will not be accepted! Make sure that your code compiles before you submit it.
src
and bin
subdirectories with the appropriate .java
and .class
files.Assignment02_LastNameFirstName
.asg02.json
file from the /common/cs/cs062/assignments/assignment02
folder, where the starter code was, and fill it out with your username in the collaborators list.
asg02.json
file into your submission folder at the top level.There are many ways in which you might extend such a program. For example, as described, word pairs which never appeared in the input will never appear in the output. Is there a way you could introduce a bit of “mutation” to allow new pairs to appear?
In case you do attempt an extra credit extension, please rename your exported directory as Assignment02 LastNameFirstName ExtraCredit so that we know you did it. You should also indicate in comments wherever you attempted an extra credit.
These “trigrams” of words are used in natural language processing in order to categorize kinds of articles and their authors. Think about how you might use tables like those given here to help determine if two articles are written by the same author.