CS150 - Fall 2012 - Class 20
exercises
admin
- no office hours on Wednesday
- Final test project out soon
- start now!
- it will take a non-trivial amount of time
- 15% of your grade
- What you CAN use
- the book
- your notes
- your previous labs
- any of the online notes on the course web page
- any of the examples from the course web page
- any of the resources linked off of the course web page (e.g. Python documentation)
- What you CANNOT use
- the tutors (except for logistical issues like copying files, etc.)
- anyone else in the class
- the web
- anything not in the list above
- If you need clarification
- come talk to me
- if you get really stuck, I can give you a hint, but it will cost you points
- honor code
- please remember that this is an exam
- I take the honor code very seriously and expect you to as well
- copying, collaborating, sharing, etc. are NOT permitted
R
- www.r-project.org
- "R is a free software environment for statistical computing and graphics."
- R is "open source". What does that mean?
- the source code is publicly (i.e. freely) available
- the code is developed (worked on) by a variety of people, often done for free
- what are some other open source software projects?
- linux
- apache (web servers)
- lucene (search engine)
- java
- mysql
- languages have their uses: R is good for
- statistical analysis
- vector/matrix calculations
- numerical analysis/processing
- data visualization
- being free :)
- math/statistics
- why look at R?
- it's still used in many disciplines
- to show you how easy it is to use another language now that you've learned the basics of programming
Using R
- you can run R from the command-line, like we've done with Python
- can also run it as an IDE
- it's installed on all the macs in the lab
- either look under applications
- or search for it with Command+Space
- you can install it for free on your own laptop from www.r-project.org
- when R starts, you start at the interactive shell
>
- just like python, we can type statements and interact directly with R this way
> 4
[1] 4
- R operates on vectors and matrices
- 4 is just a vector with one item
- R starts counting vectors at 1 (not 0)
- [1] indicates that the first thing printed from the vector on this line is at index [1]
> 4/6
[1] 0.6666667
> 2 * 15
[1] 30
- R supports the standard mathematical operations
- notice that it does the "right" thing for division: 4/6 gives a decimal answer instead of being truncated to 0
working directory
- unlike Wing, which makes any files we have open available to us in the shell, R does not do this
- R has a working directory (similar to the working directory or current directory in the Terminal)
- if you want to run a program within R you need to change the working directory to the directory that file is in
- we can ask what the current directory is (similar to pwd):
> getwd()
[1] "/Users/dkauchak/classes/cs150/examples"
- we can list the contents of the directory (similar to ls):
> dir()
[1] "basic_plotting.py" "bbq-function.py" "bbq-functions.py"
[4] "bbq.py" "conditional-turtle.py" "conditionals.py"
[7] "english.txt" "hangman.py" "histogram.py"
[10] "lectures.txt" "lists_vs_sets_improved.py" "lists_vs_sets.py"
[13] "lists_vs_sets.pyc" "lists.py" "memory.py"
[16] "Modules" "multiple_returns.py" "mystery-list.py"
[19] "optional_parameters_wrong.py" "optional_parameters.py" "pi-calculator.py"
[22] "plotting_speeds.py" "print_vs_return.py" "R-examples"
[25] "recursion.py" "recusion_other.py" "scores-lists.py"
[28] "simple-functions.py" "string_basics.py" "string_basics.pyc"
[31] "sys_args.py" "turtle_recursion.py" "turtle-examples.py"
[34] "url_basics.py" "url_extractor.py" "while.py"
[37] "word-stats.py"
- notice that each new line is prepended by the index in the vector that the first thing in the line is at
- "bbq.py" is the 4th thing in the vector
- we can change the working directory (similar to cd):
> setwd("R-examples")
> getwd()
[1] "/Users/dkauchak/classes/cs150/examples/R-examples"
running "programs"
- once you've navigated to the directory with your program in it, you can run it by typing "source":
> dir()
[1] "simple_function.R" "simple-functions.R" "test.txt"
> source("simple_function.R")
[1] "simple_functions.R program was run... yay!"
- runs the program starting at the top
- if there are function definitions, then those functions get defined
> dave_add(1, 2)
[1] 3
- just like Python, though, we can type anything we could in a program at the shell
- Be careful: R will complain if the last line of the file doesn't end with a newline (i.e. leave a blank line at the end of the file)
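The steps above (change directory, list files, source a program, call a function it defines) can be sketched as follows. This is only an illustration: the file demo_add.R and the function demo_add are made up and are not part of the course examples, and a temporary directory is used so nothing in your own folders is touched.

```r
# Hypothetical example: demo_add.R and demo_add are made up for illustration.
old_dir <- getwd()                 # remember where we started (like pwd)
work_dir <- tempdir()              # a scratch directory that R provides
setwd(work_dir)                    # like cd

# Write a tiny "program" to a file; writeLines ends every line with a newline.
writeLines(c(
  "demo_add <- function(a, b) {",
  "  return(a + b)",
  "}",
  "print(\"demo_add.R was run\")"
), "demo_add.R")

print("demo_add.R" %in% dir())     # like ls: check the file is in the working directory
source("demo_add.R")               # runs the file from the top and defines demo_add
result <- demo_add(1, 2)
print(result)

setwd(old_dir)                     # go back to the original directory
```

Saving the old directory first and restoring it at the end is a design choice here so the sketch leaves the session where it started.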
Understanding R
- R has many of the same constructs and functionality as Python
- look at the examples in python_vs_r.R in the R-examples code
- What do they do?
- you can use "help(...)" to get documentation on the code
- What are some differences you notice between Python and R
- mystery1
- prints out the first num primes
- '<-' is used for assignment (equivalent to '=' in Python)
- function header:
<function_name> <- function(parameter1, parameter2, ...)
- this literally is creating a function and assigning it to a variable
- blocks of code are indicated by braces {}
- don't use ':' to indicate the beginning of a new block
- we still indent to make the code look nice
- BUT, it's not required!
- while loops
- the boolean expression goes inside parentheses
- again, blocks are indicated by {}
- if statement
- same idea as Python
- print statement
- it's a function, so we call it with parentheses
- mystery2
- checks to see if the number is prime
- for loops
- the loop header goes in parentheses
- uses the "in" terminology, like Python
- don't need the range function
- : can be used to generate sequences
> 4:10
[1] 4 5 6 7 8 9 10
- there is a function called seq that plays the role of range for more complicated ranges (unlike range, it starts at 1 by default and includes the end value)
> seq(10)
[1] 1 2 3 4 5 6 7 8 9 10
> seq(4, 10)
[1] 4 5 6 7 8 9 10
> seq(2, 20, 2)
[1] 2 4 6 8 10 12 14 16 18 20
- %% is mod
- TRUE for True and FALSE for False
- return is a function, so we call it with parentheses
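To make the syntax notes for mystery1 and mystery2 concrete, here is a minimal sketch of the same two ideas. It is not the code from python_vs_r.R: the function names is_prime and first_primes are made up for illustration, and the course versions may be written differently.

```r
# Hypothetical sketch (not the course's mystery1/mystery2 code).

# Returns TRUE if n is prime, FALSE otherwise.
is_prime <- function(n) {
  if (n < 2) {
    return(FALSE)
  }
  if (n < 4) {
    return(TRUE)          # 2 and 3 are prime; also avoids 2:1, which counts DOWN
  }
  for (i in 2:(n - 1)) {  # ':' generates the sequence 2, 3, ..., n-1
    if (n %% i == 0) {    # %% is mod
      return(FALSE)
    }
  }
  return(TRUE)
}

# Returns a vector holding the first num primes, using a while loop.
first_primes <- function(num) {
  primes <- c()
  candidate <- 2
  while (length(primes) < num) {
    if (is_prime(candidate)) {
      primes <- c(primes, candidate)  # append to the vector
    }
    candidate <- candidate + 1
  }
  return(primes)
}

print(first_primes(5))
```

Note the early return for small n: in R, 2:1 is the vector 2 1 rather than an empty sequence, so a loop over 2:(n - 1) needs care when n is 2.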
- mystery3
- calculates pi based on random sampling
- runif
- randomly sample from a uniform distribution
- three parameters: num, min, max
- generates num random numbers between min and max (in practice the exact endpoint values are not generated)
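One way the random-sampling estimate of pi could be written with a loop is sketched below. This is an assumption about the approach, not the course's mystery3 code: the name estimate_pi_loop and the seed value are made up for illustration.

```r
# Hypothetical sketch (not the course's mystery3 code): estimate pi by sampling.
estimate_pi_loop <- function(num) {
  inside <- 0
  for (i in 1:num) {
    x <- runif(1, 0, 1)      # one random number between 0 and 1
    y <- runif(1, 0, 1)
    if (sqrt(x^2 + y^2) < 1) {
      inside <- inside + 1   # the point landed inside the quarter circle
    }
  }
  # (quarter-circle area) / (unit-square area) = pi / 4
  return(4 * inside / num)
}

set.seed(150)                # assumption: a fixed seed so the run is repeatable
estimate <- estimate_pi_loop(10000)
print(estimate)
```

The estimate gets closer to pi as num grows, but the result changes from run to run unless a seed is set.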
- mystery4
- reads data from a file and prints out some statistics about the data in the file
- scan
- reads all of the data in the file and returns it
- returned as a vector
- each thing in the file separated by some space is an entry in the vector
- lots of optional arguments!
- paste
- there is no built-in operator for concatenating strings ('+' does not work on strings)
- paste concatenates each of the strings together with a space in between (by default)
- sum(data != 0)
- all of the operations in R work on vectors
- data != 0
- checks each entry in the vector and puts TRUE or FALSE depending on whether that entry is not equal to 0
- almost all functionality works on vectors!
- sum(data > 0)
- TRUE counts as 1
- FALSE counts as 0
- counts all of the things in data that are greater than 0
- comments with #
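The pieces described for mystery4 (scan, paste, and summing a vector of TRUE/FALSE values) fit together roughly as in the sketch below. It is not the course's mystery4 code: the temporary file and its contents are made up for illustration.

```r
# Hypothetical sketch (not the course's mystery4 code); the file and its
# contents are made up for illustration.
data_file <- tempfile(fileext = ".txt")
writeLines(c("3 0 -2 7", "0 5"), data_file)

# scan reads every whitespace-separated number in the file into one vector;
# quiet = TRUE suppresses the "Read N items" message.
data <- scan(data_file, quiet = TRUE)

nonzero <- sum(data != 0)    # TRUE counts as 1, FALSE counts as 0
positive <- sum(data > 0)

# paste joins its arguments into one string with spaces in between
print(paste("entries:", length(data)))
print(paste("nonzero:", nonzero, "positive:", positive))
print(paste("average:", mean(data)))
```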
vector operations
- Since a vector is a basic component, most of R's functionality works on vectors
- to create new vectors use c( , , ...)
> x <- c(1, 2, 3)
> x
[1] 1 2 3
- all the math operations work element-wise
> x + x
[1] 2 4 6
> x * x
[1] 1 4 9
> x ^ 3
[1] 1 8 27
- slicing
> x <- c(1, 2, 3, 4, 5)
> x[1:3]
[1] 1 2 3
- indexes start at 1
- includes the final index
- vector operations are very efficient in R (much more efficient than looping)
- look at mystery3b
- generate all of the random points up front
x <- runif(num, 0, 1)
y <- runif(num, 0, 1)
- z <- sqrt(x^2 + y^2)
- for each entry of x and y
- square it
- add the x and y squared values
- take the sqrt of this
- z is then all of the distances from 0,0 to each x,y
- sum(z < 1)
- z < 1
- checks each entry in the vector and puts TRUE or FALSE depending on whether the value is less than 1
- sum
- counts all of the things in the data that are TRUE (i.e. less than 1)
- this version is much faster!
> mystery3(1000000)
[1] 3.140624
> mystery3b(1000000)
[1] 3.141776
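The vectorized steps described for mystery3b can be put together as in this sketch. It is not the course's code: the name estimate_pi_vector and the seed are made up for illustration.

```r
# Hypothetical sketch (not the course's mystery3b code).
estimate_pi_vector <- function(num) {
  x <- runif(num, 0, 1)      # all of the x coordinates at once
  y <- runif(num, 0, 1)      # all of the y coordinates at once
  z <- sqrt(x^2 + y^2)       # distance from (0, 0) to each (x, y)
  # z < 1 is a vector of TRUE/FALSE; sum counts the TRUEs
  return(4 * sum(z < 1) / num)
}

set.seed(150)                # assumption: a fixed seed so the run is repeatable
estimate <- estimate_pi_vector(1000000)
print(estimate)
```

To compare speeds yourself, wrap each call in system.time(...), for example system.time(estimate_pi_vector(1000000)).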
linear regression
- what is linear regression?
- Given x, y points we can find the line that is the "best fit" to the data
- "best fit" is defined as the line that minimizes the sum of squared errors (differences) between the line's predicted values and the actual data points
- think of it like trying to draw a line through the data that looks best
- why linear regression?
- visualization
- prediction/extrapolation
- understanding the relationship
- look at linear_regression.R in the R-examples code
- get some data
- the "plot" command plots data
- generates a nice graph
- like plotting in Python, many ways to add functionality
- unlike Python, there is no "show()" function, though we can keep adding stuff to the plot if we want
- we can either use R's built-in functions to do this or do it ourselves
- built-in
- the "lm" function calculates a linear model
- we can print out the model and see the coefficients
- a line is defined by the slope and the intercept
- abline function adds the line to the plot
- ourselves
- the equations for calculating the best fit line are fairly straightforward
- even easier to calculate in R since we can do vector calculations
- two functions
- slope calculates the slope of the best fit line
- intercept calculates the intercept of the best fit line
- as before, we can use abline to add the line to the plot
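A minimal sketch of both approaches is below. It is not the course's linear_regression.R: the data is made up, and the bodies of slope and intercept use the standard least-squares formulas, which may be arranged differently from the course versions.

```r
# Hypothetical sketch (not the course's linear_regression.R); the data is made up.

# slope = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
slope <- function(x, y) {
  return(sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2))
}

# The best fit line passes through (mean(x), mean(y)).
intercept <- function(x, y) {
  return(mean(y) - slope(x, y) * mean(x))
}

x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Built-in: lm fits the linear model; coef returns (intercept, slope).
model <- lm(y ~ x)
print(coef(model))

# Ourselves: the vector calculations should give the same two numbers.
print(c(intercept(x, y), slope(x, y)))
```

After plot(x, y), either abline(model) or abline(intercept(x, y), slope(x, y)) adds the line to the plot; abline takes the intercept first and the slope second.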