CS150 - Fall 2013 - Class 22

  • Quiz!

  • admin
       - no office hours this afternoon

  • R
       - www.r-project.org
          - "R is a free software environment for statistical computing and graphics."
       - R is "open source". What does that mean?
          - the source code is publicly (i.e., freely) available
          - the code is developed (worked on) by a variety of people, often for free
       - what are some other open source software projects?
          - linux
          - apache (web servers)
          - lucene (search engine)
          - mysql
       - languages have their uses: R is good for
          - statistical analysis
          - vector/matrix calculations
          - numerical analysis/processing
          - data visualization
          - being free :)
          - math/statistics
       - why look at R?
          - it's still used in many disciplines
          - to show you how easy it is to use another language now that you've learned the basics of programming

  • Using R
       - you can run R from the command-line, like we've done with Python
       - can also run it as an IDE
          - it's installed on all the macs in the lab
             - either look under applications
             - or search for it with Command+space_bar
          - you can install it for free on your own laptop from www.r-project.org
       - when R starts, you start at the interactive shell
          >
       
       - just like python, we can type statements and interact directly with R this way

          > 4
          [1] 4
          
       - R operates on vectors and matrices
          - 4 is just a vector with one item
          - R starts counting vectors at 1 (not 0)
          - [1] indicates that the first thing printed from the vector on this line is at index [1]
       
          > 4/6
          [1] 0.6666667
          > 2 * 15
          [1] 30

       - R supports the standard mathematical operations
          - notice that it does the "right" thing for division: 4/6 gives 0.6666667 rather than being truncated to 0

  • working directory
       - unlike Wing, which makes any files we have open available to us in the shell, R does not do this
       - R has a working directory (similar to the working directory or current directory in the Terminal)
          - if you want to run a program within R you need to change the working directory to the directory that file is in
       - we can ask what the current directory is (similar to pwd):
          
          > getwd()
          [1] "/Users/dkauchak/classes/cs150/examples"

       - we can list the contents of the directory (similar to ls):
          > dir()
           [1] "basic_plotting.py" "bbq-function.py" "bbq-functions.py"
           [4] "bbq.py" "conditional-turtle.py" "conditionals.py"
           [7] "english.txt" "hangman.py" "histogram.py"
          [10] "lectures.txt" "lists_vs_sets_improved.py" "lists_vs_sets.py"
          [13] "lists_vs_sets.pyc" "lists.py" "memory.py"
          [16] "Modules" "multiple_returns.py" "mystery-list.py"
          [19] "optional_parameters_wrong.py" "optional_parameters.py" "pi-calculator.py"
          [22] "plotting_speeds.py" "print_vs_return.py" "R-examples"
          [25] "recursion.py" "recusion_other.py" "scores-lists.py"
          [28] "simple-functions.py" "string_basics.py" "string_basics.pyc"
          [31] "sys_args.py" "turtle_recursion.py" "turtle-examples.py"
          [34] "url_basics.py" "url_extractor.py" "while.py"
          [37] "word-stats.py"

          - notice that each new line is prepended by the index in the vector that the first thing in the line is at
             - "bbq.py" is the 4th thing in the vector

       - we can change the working directory (similar to cd):
          > setwd("R-examples")
          > getwd()
          [1] "/Users/dkauchak/classes/cs150/examples/R-examples/"
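
       - a minimal sketch (not from the class examples) of a common pattern: remember the working directory, move somewhere else, then move back. tempdir() and the file name example.txt are assumptions, used only so the example runs on any machine:

          ```r
          # Remember where we started so we can return later.
          old_dir <- getwd()

          # tempdir() returns a per-session scratch directory that always exists.
          setwd(tempdir())
          print(getwd())

          # Create a file here and confirm that dir() lists it.
          writeLines("hello", "example.txt")
          print("example.txt" %in% dir())

          # Go back to the original working directory.
          setwd(old_dir)
          ```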

  • running "programs"
       - once you've navigated to the directory with your program in it, you can run it by typing "source":
          > dir()
          [1] "simple_function.R" "simple-functions.R" "test.txt"
          > source("simple_function.R")
          [1] "simple_functions.R program was run... yay!"

       - runs the program starting at the top
       - if there are function definitions, then those functions get defined
          > dave_add(1, 2)
          [1] 3

       - just like Python, though, we can type anything we could in a program at the shell
       - Be careful: R may warn about an incomplete final line if the file doesn't end with a newline, so end the file with a blank line
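
       - a small self-contained sketch of what source does. The file contents and the function name add_two are made up for illustration (they are not the class's simple_function.R):

          ```r
          # Write a tiny program to a temporary file (hypothetical contents).
          # The final "" makes the file end with a blank line.
          program_file <- tempfile(fileext = ".R")
          writeLines(c(
            "add_two <- function(a, b) {",
            "  return(a + b)",
            "}",
            "print(\"program was run\")",
            ""
          ), program_file)

          # source() runs the file from the top: the function gets defined
          # and the print statement at the bottom is executed.
          source(program_file)

          # The function defined in the file is now available in the shell.
          print(add_two(1, 2))
          ```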

  • Understanding R
       - R has many of the same constructs and functionality as Python
       - look at the examples in python_vs_r.R in R-examples code
          - What do they do?
             - you can use "help(...)" to get documentation on the code
          - What are some differences you notice between Python and R?
       
       - mystery1
          - prints out the first num primes
          - '<-' is used for assignment (equivalent to '=' in Python)
          - function header:
             <function_name> <- function(parameter1, parameter2, ...)

             - this literally is creating a function and assigning it to a variable
          - blocks of code are indicated by braces {}
             - don't use ':' to indicate the beginning of a new block
             - we still indent to make the code look nice
             - BUT, it's not required!
          - while loops
             - the boolean expression goes inside parentheses
             - again, blocks are indicated by {}
          - if statement
             - same idea as Python
          - print statement
             - it's a function, so we call it with parentheses
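
          - the constructs above can be seen together in the following sketch. It is a hypothetical stand-in, not the actual mystery1 from python_vs_r.R; the name first_primes and its variables are assumptions:

             ```r
             # Print the first num primes, using only while, if and print.
             first_primes <- function(num) {
               found <- 0       # how many primes we have printed so far
               candidate <- 2   # the next number to test

               while (found < num) {
                 # Look for a divisor of candidate between 2 and candidate - 1.
                 is_prime <- TRUE
                 divisor <- 2
                 while (divisor < candidate) {
                   if (candidate %% divisor == 0) {   # %% is mod
                     is_prime <- FALSE
                   }
                   divisor <- divisor + 1
                 }

                 if (is_prime) {
                   print(candidate)
                   found <- found + 1
                 }
                 candidate <- candidate + 1
               }
             }

             first_primes(5)   # prints 2, 3, 5, 7 and 11, one per line
             ```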
       
       - mystery2
          - checks to see if the number is prime
          - for loops
             - the loop header goes inside parentheses
             - uses the "in" terminology, like Python
             - don't need the range function
                - : can be used to generate sequences
                > 4:10
                [1] 4 5 6 7 8 9 10
       
                - there is a function called seq that has the same functionality as range for more complicated ranges
                   > seq(10)
                   [1] 1 2 3 4 5 6 7 8 9 10
                   > seq(4, 10)
                   [1] 4 5 6 7 8 9 10
                   > seq(2, 20, 2)
                   [1] 2 4 6 8 10 12 14 16 18 20
          - %% is mod
          - TRUE for True and FALSE for False
          - return is a function, so we call it with parentheses
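
          - a sketch that puts these pieces together. It is a hypothetical stand-in, not the actual mystery2; the name is_prime is an assumption. Note the guard for small numbers: a:b counts down when a > b, so 2:1 is the vector 2 1, not an empty sequence

             ```r
             # Check whether num is prime using a for loop, %% and return().
             is_prime <- function(num) {
               if (num < 2) {
                 return(FALSE)
               }
               if (num < 4) {
                 return(TRUE)   # 2 and 3 are prime
               }
               # num is at least 4 here, so 2:(num - 1) counts upward.
               for (divisor in 2:(num - 1)) {
                 if (num %% divisor == 0) {
                   return(FALSE)
                 }
               }
               return(TRUE)
             }

             print(is_prime(7))   # [1] TRUE
             print(is_prime(9))   # [1] FALSE
             ```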
       
       - mystery3
          - calculates pi based on random sampling
          - runif
             - randomly sample from a uniform distribution
             - three parameters: num, min, max
             - generates num random numbers between min and max (in normal use the endpoints themselves are never returned)
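
          - a sketch of the random-sampling idea, written with a loop. It is a hypothetical stand-in, not the actual mystery3; the name estimate_pi_loop and the seed value are assumptions. Points are sampled in the unit square, and the fraction that lands within distance 1 of (0, 0) approximates pi/4

             ```r
             # Estimate pi by sampling num random points, one at a time.
             estimate_pi_loop <- function(num) {
               inside <- 0
               for (i in 1:num) {
                 # runif(1, 0, 1) gives one random number between 0 and 1.
                 x <- runif(1, 0, 1)
                 y <- runif(1, 0, 1)
                 if (sqrt(x^2 + y^2) < 1) {
                   inside <- inside + 1
                 }
               }
               # The quarter circle covers pi/4 of the unit square.
               return(4 * inside / num)
             }

             set.seed(150)   # makes the random numbers repeatable
             print(estimate_pi_loop(10000))
             ```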

       - mystery4
          - reads data from a file and prints out some statistics about the data in the file
          - scan
             - reads all of the data in the file and returns it
             - returned as a vector
             - each thing in the file separated by some space is an entry in the vector
             - lots of optional arguments!
          - paste
             - there is no string concatenation operator (no '+' for strings like in Python)
             - paste concatenates the strings together, by default with a space in between (the optional sep argument changes this)
          - sum(data != 0)
             - most of the operations in R work on vectors
             - data != 0
                - checks each entry in the vector and puts TRUE or FALSE depending on whether it's true
                - almost all functionality works on vectors!
             - sum
                - TRUE counts as 1
                - FALSE counts as 0
                - counts all of the things in data that are not equal to 0
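
          - a self-contained sketch of scan, paste and sum working together. It is not the actual mystery4: the numbers and the temporary file are made up so the example runs anywhere

             ```r
             # Write some made-up numbers to a temporary file.
             data_file <- tempfile(fileext = ".txt")
             writeLines(c("3 0 -2", "7 0 5"), data_file)

             # scan() reads whitespace-separated numbers into one vector.
             # quiet = TRUE suppresses the "Read 6 items" message.
             data <- scan(data_file, quiet = TRUE)

             # data != 0 is a vector of TRUE/FALSE; sum() counts the TRUEs.
             nonzero <- sum(data != 0)

             # paste() joins its arguments with a space in between.
             print(paste("entries:", length(data)))
             print(paste("non-zero entries:", nonzero))
             print(paste("average:", mean(data)))
             ```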

       - comments with #

  • vector operations
       - Since a vector is a basic component, most of R's functionality works on vectors
       - to create new vectors use c( , , ...)
          > x <- c(1, 2, 3)
          > x
          [1] 1 2 3
          
       - all the math operations work element-wise
          > x + x
          [1] 2 4 6
          > x * x
          [1] 1 4 9
          > x ^ 3
          [1] 1 8 27

       - slicing
          > x <- c(1, 2, 3, 4, 5)
          > x[1:3]
          [1] 1 2 3

          - indexes start at 1
          - includes the final index
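
          - a short sketch contrasting this with Python (the values in x are made up):

             ```r
             x <- c(10, 20, 30, 40, 50)

             print(x[1])        # first entry (Python would use x[0])
             print(x[2:4])      # entries 2, 3 and 4: both ends are included
             print(length(x))   # number of entries, like len(x) in Python
             ```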

       - vector operations are very efficient in R (much more efficient than looping)
       - look at mystery3b
          - generate all of the random points up front
             x <- runif(num, 0, 1)
             y <- runif(num, 0, 1)
          - z <- sqrt(x^2 + y^2)
             - for each entry of x and y
                - square it
                - add the x and y squared values
                - take the sqrt of this
             - z is then all of the distances from 0,0 to each x,y
          - sum(z < 1)
             - z < 1
                - checks each entry in the vector and puts TRUE or FALSE depending on whether the value is less than 1
             - sum
                - counts all of the things in the data that are TRUE (i.e. less than 1)
       - this version is much faster!
          > mystery3(1000000)
          [1] 3.140624
          > mystery3b(1000000)
          [1] 3.141776
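
       - the steps above can be sketched as one short function. It is a hypothetical stand-in, not the actual mystery3b; the name estimate_pi_vector and the seed value are assumptions. system.time is used to see how long the call takes on your machine

          ```r
          # Estimate pi with vector operations instead of a loop.
          estimate_pi_vector <- function(num) {
            x <- runif(num, 0, 1)
            y <- runif(num, 0, 1)
            z <- sqrt(x^2 + y^2)           # distance of every point from (0, 0)
            return(4 * sum(z < 1) / num)   # fraction inside the quarter circle, times 4
          }

          set.seed(150)   # makes the random numbers repeatable
          # system.time() reports how long the expression took to run.
          timing <- system.time(estimate <- estimate_pi_vector(1000000))
          print(estimate)
          print(timing)
          ```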

  • linear regression (we didn't get to this example, but thought I'd leave it around in case you're interested)
       - what is linear regression?
          - Given x, y points we can find the line that is the "best fit" to the data
             - "best fit" is defined as the line that minimizes the sum of squared errors (differences) between the line's predicted values and the actual data points
             - think of it like trying to draw a line through the data that looks best
       - why linear regression?
          - visualization
          - prediction/extrapolation
          - understanding the relationship
       - look at linear_regression.R in R-examples code
          - get some data
          - the "plot" command plots data
             - generates a nice graph
             - like plotting in Python, many ways to add functionality
             - unlike Python, there is no "show()" function, though we can keep adding stuff to the plot if we want
          - we can either use R's built-in functions to do this or do it ourselves
          - built-in
             - the "lm" function calculates a linear model
             - we can print out the model and see the coefficients
                - a line is defined by the slope and the intercept
             - abline function adds the line to the plot
          - ourselves
             - the equations for calculating the best fit line are fairly straightforward
                - even easier to calculate in R since we can do vector calculations
             - two functions
                - slope calculates the slope of the best fit line
                - intercept calculates the intercept of the best fit line
             - as before, we can use abline to add the line to the plot
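
          - a minimal sketch of the "ourselves" version next to the built-in one. The data is made up and the function bodies are one standard way to write the least-squares formulas; they are not necessarily what linear_regression.R contains

             ```r
             # Made-up data that lies close to the line y = 2x + 1.
             x <- c(1, 2, 3, 4, 5)
             y <- c(3.1, 4.9, 7.2, 8.8, 11.0)

             # Least-squares slope: how x and y vary together, divided by
             # how much x varies on its own. All of this is vector arithmetic.
             slope <- function(x, y) {
               return(sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2))
             }

             # The best fit line passes through the point (mean(x), mean(y)).
             intercept <- function(x, y) {
               return(mean(y) - slope(x, y) * mean(x))
             }

             print(slope(x, y))
             print(intercept(x, y))

             # Compare with the built-in: coef() returns the intercept, then the slope.
             model <- lm(y ~ x)
             print(coef(model))

             # plot(x, y) followed by abline(model) would draw the points and the line.
             ```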