CS312 - Spring 2012

CS312 - Spring 2012 - Class 8

administrative
- CS talks tomorrow starting at 12:35
- go/codegolf

databases
   - what is a database?
      - data
         - a collection of data
         - stored on disk
         - in a structured manner
      - interface
         - a well-defined way for inserting, deleting and querying the data
   - why are they useful?
      - structured data shows up in many places
      - highly optimized to answer queries on the structured data
      - deals with data that is too big to fit into memory
      - centralized location for data
      - often allow multiple users or processes to access the data at once while maintaining data consistency
   - what aren't they good at?
      - free-form querying
      - still on disk, so can be slow for some applications
   - many different kinds of databases
      - MySQL
      - Oracle
      - DB2
      - SQLite

SQL
   - structured query language
   - programming language for interacting with/querying databases
   - supported by most common databases
      - there is a standard
      - however, some slight variation from database to database

SQLite
   - open source
      - you can grab your own copy: http://www.sqlite.org/
   - operates via a single file on disk
   - comes installed with many linux/unix/mac distributions
   - upsides
      - low barrier to entry
      - easy setup
   - downsides
      - doesn't have all the performance optimizations of MySQL and others
      - may not fare as well with multiple processes
      - no server model
   - for now, this is what we'll use, though most of the commands we'll examine are portable across databases
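   - a minimal sketch of the "single file on disk, no server" idea, using Python's built-in sqlite3 module (the file name example.db and the notes table are made up for illustration)

```python
import os
import sqlite3
import tempfile

# Hypothetical location for the database file; any writable path works.
db_dir = tempfile.mkdtemp()
db_path = os.path.join(db_dir, "example.db")

# No server is started: we just open (and, if needed, create) a file.
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")
conn.execute("INSERT INTO notes (body) VALUES ('hello')")
conn.commit()
conn.close()

# The whole database now lives in that one file.
file_exists = os.path.isfile(db_path)

# Reopening the same file gives the data back.
conn = sqlite3.connect(db_path)
rows = conn.execute("SELECT id, body FROM notes").fetchall()
conn.close()

print(file_exists, rows)
```

   - copying or deleting that one file copies or deletes the entire database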

tables
   - a database consists of one or more tables
   - a table contains a collection of information with shared attributes
   - a table contains columns
      - columns indicate data types
      - for example, we might have a table for students with columns
         - id
         - first name
         - last name
         - address
         - city
         - gpa
         - ...
   - the columns dictate the form of the data that will be stored in this table
      - columns have names
      - columns have types (i.e. the type of data that is stored)
      - may also have other properties associated with them
         - what the default value is
         - whether or not a value is required
   - tables contain data
      - an entry in a table is called a record
      - each record consists of a value for each column in the table
   - look at friends.db in SQL code
      - we can start sqlite by calling it with a database name on the command-line
         $ sqlite3 friends.db

      - just like the command-line prompt and the irb shell, we can issue commands interactively from there

         sqlite>

      - we can enter queries/commands at this prompt
      - inside this database are multiple tables, one of which is called friends
      - the friends table consists of the following columns
         - friend_id (an id)
         - last_name
         - first_name
         - email
         - and prof_id, a profession id (more on this soon)
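   - to make the column ideas above concrete (names, types, default values, required values), here is a minimal Python sketch; the students schema and the sample record are assumptions made up for illustration, not taken from friends.db

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Hypothetical schema for the students example.
conn.execute("""
    CREATE TABLE students (
        id INTEGER PRIMARY KEY,
        first_name TEXT NOT NULL,        -- a value is required
        last_name TEXT NOT NULL,
        city TEXT DEFAULT 'unknown',     -- used when no value is given
        gpa REAL
    )
""")

# One record; id and city are not supplied, so sqlite fills them in.
conn.execute(
    "INSERT INTO students (first_name, last_name, gpa) VALUES (?, ?, ?)",
    ("Ada", "Lovelace", 3.9),
)

cursor = conn.execute("SELECT * FROM students")
column_names = [d[0] for d in cursor.description]
record = cursor.fetchone()
conn.close()

print(column_names)
print(record)
```

   - the record comes back with one value per column, in the order the columns were declared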

examining the table entries
   - all examples were run on friends.db from SQL code
   - there are many "special" commands for sqlite that are non-standard for SQL
      - in sqlite they all start with a period, e.g. ".exit" exits
      - http://www.sqlite.org/sqlite.html
   - we can find out what tables are associated with a database using the .tables command
      sqlite> .tables
      friends professions

      - this example has two different tables
   - if we want to know how a given table is structured, we can use the .schema command
      - called by itself, we get the schema (i.e. table definition) for all tables

      sqlite> .schema
      CREATE TABLE friends (
       friend_id INTEGER PRIMARY KEY AUTOINCREMENT,
       last_name TEXT NOT NULL,
       first_name TEXT NOT NULL,
       email TEXT NOT NULL DEFAULT '',
       prof_id INTEGER /* foreign key */
      );
      CREATE TABLE professions (
       prof_id INTEGER PRIMARY KEY AUTOINCREMENT,
       name TEXT NOT NULL default ''
      );

      - or you can call it and pass it a particular table name

      sqlite> .schema friends
      CREATE TABLE friends (
       friend_id INTEGER PRIMARY KEY AUTOINCREMENT,
       last_name TEXT NOT NULL,
       first_name TEXT NOT NULL,
       email TEXT NOT NULL DEFAULT '',
       prof_id INTEGER /* foreign key */
      );
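
   - the dot commands only work inside the sqlite shell; from a program, roughly the same information can be read from the built-in sqlite_master table. A minimal Python sketch, assuming an in-memory database rebuilt from the schema shown above rather than the real friends.db:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema copied from the .schema output above.
conn.executescript("""
    CREATE TABLE friends (
     friend_id INTEGER PRIMARY KEY AUTOINCREMENT,
     last_name TEXT NOT NULL,
     first_name TEXT NOT NULL,
     email TEXT NOT NULL DEFAULT '',
     prof_id INTEGER /* foreign key */
    );
    CREATE TABLE professions (
     prof_id INTEGER PRIMARY KEY AUTOINCREMENT,
     name TEXT NOT NULL default ''
    );
""")

# Roughly what .tables reports: the names of the user tables
# (sqlite's own bookkeeping tables start with "sqlite_").
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master "
    "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
    "ORDER BY name"
)]

# Roughly what .schema friends reports: the stored CREATE statement.
friends_sql = conn.execute(
    "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = 'friends'"
).fetchone()[0]
conn.close()

print(tables)
print(friends_sql)
```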

   - the schema command gives a lot of information about a table
      - it tells us the columns by name:
         friend_id
         last_name
         first_name
         email
         prof_id

      - it also tells us information about what type of data is stored in that column
         - sqlite has a limited number of "data types" compared to other databases
         - http://sqlite.org/datatype3.html
            - integer
            - text
            - numeric
            - real
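            - none (called "blob" in current sqlite documentation)
         - in fact, these types are only preferences ("affinities"): sqlite is dynamically typed, so in most cases a column can hold a value of any type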
         - other databases allow you much more control over the types
            - this can be useful for type checking
            - and can result in improved performance
            - but it takes more work and finesse
      - it also tells us some other information
         - NOT NULL indicates that we must specify a value (instead of allowing it to be NULL)
         - DEFAULT allows us to specify a default value
         - PRIMARY KEY
            - requires uniqueness (though you can also use UNIQUE)
               - a unique key is backed by an index, which makes lookups by that key much faster
            - cannot be NULL
            - indicates to other people looking at your schema that this is the main thing to index off of
         - AUTOINCREMENT
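            - if a NULL value (or no value) is specified, the next integer in a sequence will be used
            - in sqlite, an INTEGER PRIMARY KEY column already behaves this way; AUTOINCREMENT additionally guarantees that ids are never reused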
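
   - a minimal Python sketch of how NOT NULL, DEFAULT, PRIMARY KEY and AUTOINCREMENT behave on insert; it uses the friends schema from above in an in-memory database, and the first names are made up for illustration

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE friends (
     friend_id INTEGER PRIMARY KEY AUTOINCREMENT,
     last_name TEXT NOT NULL,
     first_name TEXT NOT NULL,
     email TEXT NOT NULL DEFAULT '',
     prof_id INTEGER
    )
""")

# AUTOINCREMENT and DEFAULT: friend_id and email are filled in for us.
conn.execute("INSERT INTO friends (last_name, first_name) VALUES ('Smith', 'Jo')")
conn.execute("INSERT INTO friends (last_name, first_name) VALUES ('Lee', 'Sam')")
rows = conn.execute(
    "SELECT friend_id, last_name, email FROM friends ORDER BY friend_id"
).fetchall()

# NOT NULL: leaving out first_name (which has no default) is rejected.
try:
    conn.execute("INSERT INTO friends (last_name) VALUES ('Nameless')")
    not_null_enforced = False
except sqlite3.IntegrityError:
    not_null_enforced = True

# PRIMARY KEY: reusing an existing friend_id is rejected.
try:
    conn.execute(
        "INSERT INTO friends (friend_id, last_name, first_name) "
        "VALUES (1, 'Dup', 'Id')"
    )
    unique_enforced = False
except sqlite3.IntegrityError:
    unique_enforced = True

conn.close()
print(rows, not_null_enforced, unique_enforced)
```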

interacting with multiple tables
   - so far, all the queries that we've looked at have only dealt with a single table
   - often we split up the data across multiple tables
   - in our friends.db, we also have a professions table

      sqlite> .header on
      sqlite> select * from professions;
      prof_id|name
      1|Software Developer
      2|Medical Doctor
      3|Financial Analyst
      4|Chef
      5|Professor

   - the two tables are linked to each other via the prof_id column
      - the friends table contains a prof_id column whose values correspond to the prof_id column in the professions table
   - why separate this data across multiple tables? why not just put this all in one table?
      - more space efficient
         - just have to store an id rather than all the data in the friends table
      - easier to manage
         - in this case, we only have the name of a profession
         - but we may have more data associated with a profession (e.g. salary, degree required, etc.)
         - we don't want to add all this information to the friends table
            - this information is associated with the profession
      - easier/faster to interact with
         - we can query each of these tables independently if we want
      - shared information
         - there may be other tables besides friends that also index into the professions table

DISTINCT
   - sometimes we only want those entries that are unique
   - the DISTINCT keyword allows us to specify that
      sqlite> select last_name from friends;
      last_name
      Smith
      Lee
      Speaker
      Johnson
      Z
      Z

      sqlite> select distinct last_name from friends;
      last_name
      Johnson
      Lee
      Smith
      Speaker
      Z
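
   - the same DISTINCT query run from Python, as a minimal sketch: the one-column friends table here is a simplified stand-in loaded with the last names shown above, and ORDER BY is added because SQL does not promise any particular row order without it

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Simplified stand-in for the friends table: just the last_name column.
conn.execute("CREATE TABLE friends (last_name TEXT NOT NULL)")
last_names = ["Smith", "Lee", "Speaker", "Johnson", "Z", "Z"]
conn.executemany(
    "INSERT INTO friends (last_name) VALUES (?)",
    [(name,) for name in last_names],
)

# Without DISTINCT, every row is returned, duplicates included.
all_names = [row[0] for row in conn.execute("SELECT last_name FROM friends")]

# With DISTINCT, each value appears once.
distinct_names = [row[0] for row in conn.execute(
    "SELECT DISTINCT last_name FROM friends ORDER BY last_name"
)]
conn.close()

print(len(all_names), distinct_names)
```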