CS150 - Fall 2011 - Class 20

quiz

admin
   - no tutor hours Tuesday night
   - my office hours
      - Mon: 2:45-5pm
      - Tue: 1-2pm
   - Final test project out soon
      - start now!
         - it will take a non-trivial amount of time
         - 15% of your grade
      - What you CAN use
         - the book
         - your notes
         - your previous labs
         - any of the online notes on the course web page
         - any of the examples from the course web page
         - any of the resources linked off of the course web page (e.g. Python documentation)
      - What you CANNOT use
         - the tutors (except for logistical issues like copying files, etc.)
         - anyone else in the class
         - the web
         - anything not in the list above
      - If you need clarification
         - come talk to me
         - if you get really stuck, I can give you a hint, but it will cost you points
      - honor code
         - please remember that this is an exam
         - I take the honor code very seriously and expect you to as well
         - copying, collaborating, sharing, etc. are NOT permitted

R
   - www.r-project.org
      - "R is a free software environment for statistical computing and graphics."
   - R is "open source". What does that mean?
      - the source code is publicly (i.e. freely) available
      - the code is developed (worked on) by a variety of people, often done for free
   - what are some other open source software projects?
      - linux
      - apache (web servers)
      - lucene (search engine)
      - java
      - mysql
   - languages have their uses: R is good for
      - statistical analysis
      - vector/matrix calculations
      - numerical analysis/processing
      - data visualization
      - being free :)
      - math/statistics
   - why look at R?
      - it's still used in many disciplines
      - to show you how easy it is to use another language now that you've learned the basics of programming

Using R
   - you can run R from the command-line, like we've done with Python
   - can also run it as an IDE
      - it's installed on all the macs in the lab
         - either look under applications
         - or search for it with Spotlight (Command+Space)
      - you can install it for free on your own laptop from www.r-project.org
   - when R starts, you start at the interactive shell
      >

   - just like python, we can type statements and interact directly with R this way

      > 4
      [1] 4

   - R operates on vectors and matrices
      - 4 is just a vector with one item
      - R starts indexing vectors at 1 (not 0)
      - [1] indicates that the first thing printed from the vector on this line is at index [1]

      > 4/6
      [1] 0.6666667
      > 2 * 15
      [1] 30

   - R supports the standard mathematical operations
      - notice that it does the "right" thing for division: 4/6 gives 0.6666667 rather than being truncated to 0 by integer division
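
   - a few more of the math operators, written as a small script rather than typed at the shell (a minimal sketch; %/% does not appear elsewhere in these notes and is included only to contrast it with /)

```r
quotient  <- 7 / 2     # ordinary division keeps the fraction: 3.5
whole     <- 7 %/% 2   # integer division drops the fraction: 3
remainder <- 7 %% 2    # mod (remainder): 1
power     <- 2 ^ 10    # exponentiation: 1024

# c(...) collects the four results into one vector so one print shows them all
print(c(quotient, whole, remainder, power))
```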

working directory
   - unlike Wing, which makes any files we have open available to us in the shell, R does not do this
   - R has a working directory (similar to the working directory or current directory in the Terminal)
      - if you want to run a program within R you need to change the working directory to the directory that file is in
   - we can ask what the current directory is (similar to pwd):

      > getwd()
      [1] "/Users/dkauchak/classes/cs150/examples"

   - we can list the contents of the directory (similar to ls):
      > dir()
       [1] "basic_plotting.py" "bbq-function.py" "bbq-functions.py"
       [4] "bbq.py" "conditional-turtle.py" "conditionals.py"
       [7] "english.txt" "hangman.py" "histogram.py"
      [10] "lectures.txt" "lists_vs_sets_improved.py" "lists_vs_sets.py"
      [13] "lists_vs_sets.pyc" "lists.py" "memory.py"
      [16] "Modules" "multiple_returns.py" "mystery-list.py"
      [19] "optional_parameters_wrong.py" "optional_parameters.py" "pi-calculator.py"
      [22] "plotting_speeds.py" "print_vs_return.py" "R-examples"
      [25] "recursion.py" "recusion_other.py" "scores-lists.py"
      [28] "simple-functions.py" "string_basics.py" "string_basics.pyc"
      [31] "sys_args.py" "turtle_recursion.py" "turtle-examples.py"
      [34] "url_basics.py" "url_extractor.py" "while.py"
      [37] "word-stats.py"

      - notice that each new line is prefixed with the vector index of the first thing on that line
         - "bbq.py" is the 4th thing in the vector

   - we can change the working directory (similar to cd):
      > setwd("R-examples")
      > getwd()
      [1] "/Users/dkauchak/classes/cs150/examples/R-examples/"

running "programs"
   - once you've navigated to the directory with your program in it, you can run it by typing "source":
      > dir()
      [1] "simple_function.R" "simple-functions.R" "test.txt"
      > source("simple_function.R")
      [1] "simple_functions.R program was run... yay!"

   - runs the program starting at the top
   - if there are function definitions, then those functions get defined
      > dave_add(1, 2)
      [1] 3

   - just like Python, though, we can type anything we could in a program at the shell
   - Be careful: R will complain if the file doesn't end in a blank line
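
   - for reference, a file like the one sourced above might look roughly like this (a hypothetical sketch: the body of dave_add and the printed message are guesses based on the shell output, not the actual course file)

```r
# Hypothetical contents of a small R program file.
# The function is only defined here; it does not run until it is called.
dave_add <- function(a, b) {
    return(a + b)
}

# Statements outside of any function run from top to bottom when the file is sourced
print("program was run... yay!")
```

   - after source(...) finishes, dave_add stays defined in the shell, which is why dave_add(1, 2) can be called afterwards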

Understanding R
   - R has many of the same constructs and functionality as Python
   - look at the examples in python_vs_r.R in R-examples code
      - What do they do?
         - you can use "help(...)" to get documentation on the code
      - What are some differences you notice between Python and R?

   - mystery1
      - prints out the first num primes
      - '<-' is used for assignment (equivalent to '=' in Python)
      - function header:
         <function_name> <- function(parameter1, parameter2, ...)

         - this literally is creating a function and assigning it to a variable
      - blocks of code are indicated by braces {}
         - don't use ':' to indicate the beginning of a new block
         - we still indent to make the code look nice
         - BUT, it's not required!
      - while loops
         - the boolean expression goes inside parentheses
         - again, blocks are indicated by {}
      - if statement
         - same idea as Python
      - print
         - it's a function (not a statement), so we call it with parentheses
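
   - the pieces above fit together roughly like this (a sketch only, not the actual mystery1 code: the name first_primes and the way it tests for primes are assumptions; it also uses %% and TRUE/FALSE, which come up under mystery2)

```r
# Sketch: print the first num primes using <-, function, while, if and print
first_primes <- function(num) {
    found <- 0        # how many primes have been printed so far
    candidate <- 2    # the next number to test
    while (found < num) {
        # try every divisor from 2 up to the square root of candidate
        is_prime <- TRUE
        divisor <- 2
        while (divisor * divisor <= candidate) {
            if (candidate %% divisor == 0) {
                is_prime <- FALSE
            }
            divisor <- divisor + 1
        }
        if (is_prime) {
            print(candidate)
            found <- found + 1
        }
        candidate <- candidate + 1
    }
}

first_primes(5)
```

   - the indentation here is only for readability; the braces are what define each block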

   - mystery2
      - checks to see if the number is prime
      - for loops
         - the loop header goes in parentheses
         - uses the "in" terminology, like Python
         - don't need the range function
            - : can be used to generate sequences
            > 4:10
            [1] 4 5 6 7 8 9 10

            - there is a function called seq that plays the role of range for more complicated sequences (unlike range, it starts at 1 by default and includes the end value)
               > seq(10)
               [1] 1 2 3 4 5 6 7 8 9 10
               > seq(4, 10)
               [1] 4 5 6 7 8 9 10
               > seq(2, 20, 2)
               [1] 2 4 6 8 10 12 14 16 18 20
      - %% is mod
      - TRUE for True and FALSE for False
      - return is a function, so we call it with parentheses
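
   - a prime check built from these pieces might look like this (a sketch, not the actual mystery2 code; the name is_prime and the handling of small numbers are assumptions)

```r
# Sketch: TRUE if number is prime, FALSE otherwise
is_prime <- function(number) {
    if (number < 2) {
        return(FALSE)
    }
    if (number < 4) {
        return(TRUE)      # 2 and 3 are prime
    }
    # 2:(number - 1) generates 2, 3, ..., number - 1
    for (divisor in 2:(number - 1)) {
        if (number %% divisor == 0) {
            return(FALSE)
        }
    }
    return(TRUE)
}

print(is_prime(7))
print(is_prime(9))
```

   - the small cases are handled first because : counts down when the start is bigger than the end (2:1 gives 2 1), so the loop would not simply be skipped the way range(2, 2) is in Python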

   - mystery3
      - calculates pi based on random sampling
      - runif
         - randomly sample from a uniform distribution
         - three parameters: num, min, max
         - generates num random numbers between min and max (in practice the endpoints themselves are not generated)
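
   - one way to estimate pi with runif, one point at a time (a sketch, not the actual mystery3 code; the name estimate_pi and the seed value are assumptions)

```r
# Sketch: sample points in the unit square and count how many land
# inside the quarter circle of radius 1 centered at (0, 0).
# The fraction inside approximates pi/4.
estimate_pi <- function(num) {
    inside <- 0
    for (i in 1:num) {
        x <- runif(1, 0, 1)   # one random x coordinate
        y <- runif(1, 0, 1)   # one random y coordinate
        if (sqrt(x^2 + y^2) < 1) {
            inside <- inside + 1
        }
    }
    return(4 * inside / num)
}

set.seed(150)   # arbitrary seed so repeated runs give the same estimate
print(estimate_pi(10000))
```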

   - mystery4
      - reads data from a file and prints out some statistics about the data in the file
      - scan
         - reads all of the data in the file and returns it
         - returned as a vector
         - each thing in the file separated by some space is an entry in the vector
         - lots of optional arguments!
      - paste
         - there is no operator for concatenating strings (no '+' as in Python)
         - paste concatenates the strings together with a space in between (by default)
      - sum(data > 0)
         - all of the operations in R work on vectors
         - data > 0
            - checks each entry in the vector and puts TRUE or FALSE depending on whether it's true
            - almost all functionality works on vectors!
         - sum(data > 0)
            - TRUE counts as 1
            - FALSE counts as 0
            - counts all of the things in data that are greater than 0
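
   - scan, paste and sum(data > 0) used together (a sketch, not the actual mystery4 code; the file contents are made up and written to a temporary file so the example runs on its own)

```r
# Write a small made-up data file
data_file <- tempfile(fileext = ".txt")
writeLines(c("3 -1 4", "0 -5 9"), data_file)

# scan returns every whitespace-separated number in the file as one vector;
# quiet = TRUE is one of its optional arguments and hides the "Read 6 items" message
data <- scan(data_file, quiet = TRUE)

print(paste("number of entries:", length(data)))
print(paste("average:", mean(data)))
# data > 0 is a vector of TRUE/FALSE values; sum counts the TRUEs
print(paste("entries greater than 0:", sum(data > 0)))
```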

   - comments with #

vector operations
   - Since a vector is a basic component, most of R's functionality works on vectors
   - to create new vectors use c( , , ...)
      > x <- c(1, 2, 3)
      > x
      [1] 1 2 3

   - all the math operations work element-wise
      > x + x
      [1] 2 4 6
      > x * x
      [1] 1 4 9
      > x ^ 3
      [1] 1 8 27

   - slicing
      > x <- c(1, 2, 3, 4, 5)
      > x[1:3]
      [1] 1 2 3

      - indexes start at 1
      - includes the final index

   - vector operations are very efficient in R (much more efficient than looping)
   - look at mystery3b
      - generate all of the random points up front
         x <- runif(num, 0, 1)
         y <- runif(num, 0, 1)
      - z <- sqrt(x^2 + y^2)
         - for each entry of x and y
            - square it
            - add the x and y squared values
            - take the sqrt of this
         - z is then a vector of all of the distances from (0, 0) to each (x, y)
      - sum(z < 1)
         - z < 1
            - checks each entry in the vector and puts TRUE or FALSE depending on whether the value is less than 1
         - sum
            - counts all of the things in the data that are TRUE (i.e. less than 1)
   - this version is much faster!
      > mystery3(1000000)
      [1] 3.140624
      > mystery3b(1000000)
      [1] 3.141776
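
   - putting the vector steps above into one function (a sketch assembled from the lines quoted above; the function name, the final scaling by 4 / num and the seed are assumptions)

```r
# Sketch: vectorized pi estimate with no explicit loop
estimate_pi_vectorized <- function(num) {
    x <- runif(num, 0, 1)      # all of the x coordinates at once
    y <- runif(num, 0, 1)      # all of the y coordinates at once
    z <- sqrt(x^2 + y^2)       # distance from (0, 0) to each point
    # sum(z < 1) counts the points inside the quarter circle
    return(4 * sum(z < 1) / num)
}

set.seed(150)   # arbitrary seed so repeated runs give the same estimate
print(estimate_pi_vectorized(1000000))
```

   - wrapping a call in system.time(...) is one way to compare the running time against a looping version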

linear regression
   - what is linear regression?
      - Given x, y points we can find the line that is the "best fit" to the data
         - "best fit" is defined as the line that minimizes the sum of squared errors (differences) between the line's predicted values and the actual data points
         - think of it like trying to draw a line through the data that looks best
   - why linear regression?
      - visualization
      - prediction/extrapolation
      - understanding the relationship
   - look at linear_regression.R in R-examples code
      - get some data
      - the "plot" command plots data
         - generates a nice graph
         - like plotting in Python, many ways to add functionality
         - unlike Python, there is no "show()" function, though we can keep adding stuff to the plot if we want
      - we can either use R's built-in functions to do this or do it ourselves
      - built-in
         - the "lm" function calculates a linear model
         - we can print out the model and see the coefficients
            - a line is defined by the slope and the intercept
         - abline function adds the line to the plot
      - ourselves
         - the equations for calculating the best fit line are fairly straightforward
            - even easier to calculate in R since we can do vector calculations
         - two functions
            - slope calculates the slope of the best fit line
            - intercept calculates the intercept of the best fit line
         - as before, we can use abline to add the line to the plot
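
   - both approaches side by side (a sketch, not the actual linear_regression.R: the data is made up, and the bodies of slope and intercept are the standard least-squares formulas, which may differ from how the course file writes them)

```r
# Made-up data that roughly follows y = 2x + 1
x <- c(1, 2, 3, 4, 5, 6)
y <- c(3.1, 4.9, 7.2, 8.8, 11.1, 13.0)

# slope = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
slope <- function(x, y) {
    return(sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2))
}

# the best fit line passes through (mean(x), mean(y))
intercept <- function(x, y) {
    return(mean(y) - slope(x, y) * mean(x))
}

# built-in: y ~ x means "model y as a linear function of x"
model <- lm(y ~ x)
print(coef(model))                        # intercept first, then slope
print(c(intercept(x, y), slope(x, y)))    # should match the line above

plot(x, y)                                       # scatter plot of the data
abline(model)                                    # line from the built-in model
abline(intercept(x, y), slope(x, y), lty = 2)    # our line, dashed, drawn on top
```

   - abline accepts either a fitted model or an intercept followed by a slope, so the two lines should land on top of each other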