# Lecture 10 — 2015-10-05

## Parsing

This lecture is written in literate Haskell; you can download the raw source.

I opened class by mentioning the possibility of getting a scholarship to attend PLMW at POPL, a mentoring workshop at one of the top PL conferences. Email me if you’re interested.

I also mentioned my upcoming elective, 181N Software Foundations, where we’ll we using the Coq theorem prover to build verified software. It should be fun!

We spent the rest of class looking at lexers and parsers.

Lexing

We recapped the arithmetic lexer from last time.

``````import Data.Char

data Token =
TNum Int
| TPlus
| TMinus
| TTimes
| TLParen
| TRParen
deriving (Show, Eq)

lexer :: String -> [Token]
lexer [] = []
lexer (w:s) | isSpace w = lexer (dropWhile isSpace s)
lexer ('+':s) = TPlus:lexer s
lexer ('-':s) = TMinus:lexer s
lexer ('*':s) = TTimes:lexer s
lexer ('(':s) = TLParen:lexer s
lexer (')':s) = TRParen:lexer s
lexer s | isDigit (head s) =
let (n,s') = span isDigit s in
TNum (read n :: Int):lexer s'
lexer (n:_) = error \$ "Lexer error: unexpected character " ++ [n]``````

Parsing

We also covered parsing in general, doing an overview of the parser for arithmetic expressions.

``````data ArithExp =
Num Int
| Plus ArithExp ArithExp
| Times ArithExp ArithExp
| Neg ArithExp
deriving (Eq,Show)

parse :: String -> ArithExp
parse s =
case parseTerm \$ lexer s of
Right (e,[]) -> e
Right (_,ts) -> error \$ "Parse error: expected EOF, found: " ++ show ts
Left e -> error \$ "Parser error: " ++ e``````

I took particular care to note the loops in play: `parseTerm` calls `parseTerm'`, which is mutually recursive with `parseTerm''`; there’s a similar setup with `parseFactor`, `parseFactor'`, and `parseFactor''`.

``````parseTerm ts =
case parseFactor ts of
Right (f,ts) -> parseTerm' f ts
Left e -> Left e

parseTerm' lhs [] = Right (lhs, [])
parseTerm' lhs (TPlus:ts) = parseTerm'' (Plus lhs) ts
parseTerm' lhs (TMinus:ts) = parseTerm'' (Plus lhs . Neg) ts
parseTerm' lhs ts = parseFactor' lhs ts

parseTerm'' mk [] = Left \$ "expected term after +/-"
parseTerm'' mk ts =
case parseFactor ts of
Right (e,ts) -> parseTerm' (mk e) ts
Left e -> Left e

parseFactor ts =
case parseNum ts of
Right (lhs,ts) -> parseFactor' lhs ts
Left e -> Left e

parseFactor' lhs (TTimes:ts) = parseFactor'' (Times lhs) ts
parseFactor' lhs ts = Right (lhs, ts)

parseFactor'' mk [] = Left \$ "expected term after *"
parseFactor'' mk ts =
case parseNum ts of
Right (e,ts) -> parseFactor' (mk e) ts
Left e -> Left e

parseNum (TNum n:ts) = Right (Num n, ts)
parseNum (TLParen:ts) =
case parseTerm ts of
Right (e,TRParen:ts') -> Right (e,ts')
Right (_,ts) -> Left \$ "expected right paren, found: " ++ show ts
Left e -> Left e
parseNum ts = Left \$ "expected numbers or parens, found: " ++ show ts``````

Using alex

We also went over generated lexers and parsers. Suppose the following is the contents of a file `Lexer.x`:

``````{
module Lexer where
}

%wrapper "basic"

\$digit = 0-9               -- digits
\$alpha = [a-zA-Z]          -- alphabetic characters

tokens :-

\$white+                         ;
\(                              { const TLParen }
\)                              { const TRParen }
\\                              { const TBackslash }
\.                              { const TDot }
\$alpha [\$alpha \$digit \_ \']*   { \s -> TId s }

{
-- Each action above has type :: String -> LCToken

-- The token type:
data LCToken =
TBackslash
| TLParen
| TRParen
| TDot
| TId String
deriving (Eq,Show)
}``````

Running alex on a file like the above will generate a new file, `Lexer.hs`, which has a function `alexScanTokens :: String -> [Token]`.

In terms of the general file structure:

• Things between `{` and `}` are Haskell code.
• The `\$digit = [0-9]` sets up an alex variable `\$digit` that holds a regular expression matching decimal digits.
• The code following `tokens :-` sets up a mapping from regular expressions to tokens.
• The `LCToken` datatype establishes the set of tokens.

Just for contrast, here’s a manually written lexer that parses the same tokens:

``````data LCToken =
LCBackslash
| LCLParen
| LCRParen
| LCDot
| LCId String
deriving (Eq,Show)

lexer' :: String -> [LCToken]
lexer' [] = []
lexer' (w:s) | isSpace w = lexer' (dropWhile isSpace s)
lexer' ('\\':s) = LCBackslash:lexer' s
lexer' ('.':s) = LCDot:lexer' s
lexer' ('(':s) = LCLParen:lexer' s
lexer' (')':s) = LCRParen:lexer' s
lexer' s | isAlpha (head s) =
let (id,s') = span isAlpha s in
LCId id:lexer' s'
lexer' (n:_) = error \$ "Lexer error: unexpected character " ++ [n]``````

Using happy

We saw two happy grammars: one for arithmetic and one for the lambda calculus. Suppose we have the following file, `Parser.y`.

``````{
module Parser where

import Data.Char
}

%name parseTokens
%tokentype { Token }
%error { parseError }

%token
num  { TNum \$\$ }
"+"  { TPlus }
"*"  { TTimes }
"-"  { TMinus }
"("  { TLParen }
")"  { TRParen }

%%

term :
term "+" factor { Plus \$1 \$3 }
| term "-" factor { Plus \$1 (Neg \$3) }
| factor          { \$1 }

factor :
factor "*" neg { Times \$1 \$3 }
| neg            { \$1 }

neg :
"-" atom { Neg \$2 }
| atom     { \$1 }

atom :
num          { Num \$1 }
| "(" term ")" { \$2 }

{

data ArithExp =
Num Int
| Plus ArithExp ArithExp
| Times ArithExp ArithExp
| Neg ArithExp
deriving (Eq,Show)

parse :: String -> ArithExp
parse = parseTokens . lexer

parseError tokens = error \$ "Parser error: " ++ show tokens

}``````

Running `happy Parser.y` will generate `Parser.hs` with the function `parseTokens :: [Token] -> ArithExp`. A few notes:

• Again, those parts between `{` and `}` are Haskell syntax. The rest is happy syntax.
• The part after `%token` sets up the terminals we’ll use in the grammar. The `\$\$` variable identifies which part of a token is relevant for positional variables. These are patterns, where `\$\$` identifies the part you’d like bound to the corresponding positional variable.
• The part after `%%` defines the grammar’s nonterminals, of the form `name : list of terminals and nonterminals { code for an AST } | ...`
• The code for generating ASTs in each production uses positional variables to refer to tokens (and their values, as established by \$\$).
• The `parserError` function is necessary for happy to work.

The driver

Finally, I mentioned the `Main.hs` and `Makefile` I distributed with the code. These put let you test your code—use them!