Lecture 10 — 2015-10-05
Parsing
This lecture is written in literate Haskell; you can download the raw source.
I opened class by mentioning the possibility of getting a scholarship to attend PLMW at POPL, a mentoring workshop at one of the top PL conferences. Email me if you’re interested.
I also mentioned my upcoming elective, 181N Software Foundations, where we’ll we using the Coq theorem prover to build verified software. It should be fun!
We spent the rest of class looking at lexers and parsers.
Lexing
We recapped the arithmetic lexer from last time.
import Data.Char
data Token =
TNum Int
| TPlus
| TMinus
| TTimes
| TLParen
| TRParen
deriving (Show, Eq)
lexer :: String -> [Token]
lexer [] = []
lexer (w:s) | isSpace w = lexer (dropWhile isSpace s)
lexer ('+':s) = TPlus:lexer s
lexer ('-':s) = TMinus:lexer s
lexer ('*':s) = TTimes:lexer s
lexer ('(':s) = TLParen:lexer s
lexer (')':s) = TRParen:lexer s
lexer s | isDigit (head s) =
let (n,s') = span isDigit s in
TNum (read n :: Int):lexer s'
lexer (n:_) = error $ "Lexer error: unexpected character " ++ [n]
Parsing
We also covered parsing in general, doing an overview of the parser for arithmetic expressions.
data ArithExp =
Num Int
| Plus ArithExp ArithExp
| Times ArithExp ArithExp
| Neg ArithExp
deriving (Eq,Show)
parse :: String -> ArithExp
parse s =
case parseTerm $ lexer s of
Right (e,[]) -> e
Right (_,ts) -> error $ "Parse error: expected EOF, found: " ++ show ts
Left e -> error $ "Parser error: " ++ e
I took particular care to note the loops in play: parseTerm
calls parseTerm'
, which is mutually recursive with parseTerm''
; there’s a similar setup with parseFactor
, parseFactor'
, and parseFactor''
.
parseTerm ts =
case parseFactor ts of
Right (f,ts) -> parseTerm' f ts
Left e -> Left e
parseTerm' lhs [] = Right (lhs, [])
parseTerm' lhs (TPlus:ts) = parseTerm'' (Plus lhs) ts
parseTerm' lhs (TMinus:ts) = parseTerm'' (Plus lhs . Neg) ts
parseTerm' lhs ts = parseFactor' lhs ts
parseTerm'' mk [] = Left $ "expected term after +/-"
parseTerm'' mk ts =
case parseFactor ts of
Right (e,ts) -> parseTerm' (mk e) ts
Left e -> Left e
parseFactor ts =
case parseNum ts of
Right (lhs,ts) -> parseFactor' lhs ts
Left e -> Left e
parseFactor' lhs (TTimes:ts) = parseFactor'' (Times lhs) ts
parseFactor' lhs ts = Right (lhs, ts)
parseFactor'' mk [] = Left $ "expected term after *"
parseFactor'' mk ts =
case parseNum ts of
Right (e,ts) -> parseFactor' (mk e) ts
Left e -> Left e
parseNum (TNum n:ts) = Right (Num n, ts)
parseNum (TLParen:ts) =
case parseTerm ts of
Right (e,TRParen:ts') -> Right (e,ts')
Right (_,ts) -> Left $ "expected right paren, found: " ++ show ts
Left e -> Left e
parseNum ts = Left $ "expected numbers or parens, found: " ++ show ts
Using alex
We also went over generated lexers and parsers. Suppose the following is the contents of a file Lexer.x
:
{
module Lexer where
}
%wrapper "basic"
$digit = 0-9 -- digits
$alpha = [a-zA-Z] -- alphabetic characters
tokens :-
$white+ ;
\( { const TLParen }
\) { const TRParen }
\\ { const TBackslash }
\. { const TDot }
$alpha [$alpha $digit \_ \']* { \s -> TId s }
{
-- Each action above has type :: String -> LCToken
-- The token type:
data LCToken =
TBackslash
| TLParen
| TRParen
| TDot
| TId String
deriving (Eq,Show)
}
Running alex on a file like the above will generate a new file, Lexer.hs
, which has a function alexScanTokens :: String -> [Token]
.
In terms of the general file structure:
- Things between
{
and}
are Haskell code. - The
$digit = [0-9]
sets up an alex variable$digit
that holds a regular expression matching decimal digits. - The code following
tokens :-
sets up a mapping from regular expressions to tokens. - The
LCToken
datatype establishes the set of tokens.
Just for contrast, here’s a manually written lexer that parses the same tokens:
data LCToken =
LCBackslash
| LCLParen
| LCRParen
| LCDot
| LCId String
deriving (Eq,Show)
lexer' :: String -> [LCToken]
lexer' [] = []
lexer' (w:s) | isSpace w = lexer' (dropWhile isSpace s)
lexer' ('\\':s) = LCBackslash:lexer' s
lexer' ('.':s) = LCDot:lexer' s
lexer' ('(':s) = LCLParen:lexer' s
lexer' (')':s) = LCRParen:lexer' s
lexer' s | isAlpha (head s) =
let (id,s') = span isAlpha s in
LCId id:lexer' s'
lexer' (n:_) = error $ "Lexer error: unexpected character " ++ [n]
Using happy
We saw two happy grammars: one for arithmetic and one for the lambda calculus. Suppose we have the following file, Parser.y
.
{
module Parser where
import Data.Char
}
%name parseTokens
%tokentype { Token }
%error { parseError }
%token
num { TNum $$ }
"+" { TPlus }
"*" { TTimes }
"-" { TMinus }
"(" { TLParen }
")" { TRParen }
%%
term :
term "+" factor { Plus $1 $3 }
| term "-" factor { Plus $1 (Neg $3) }
| factor { $1 }
factor :
factor "*" neg { Times $1 $3 }
| neg { $1 }
neg :
"-" atom { Neg $2 }
| atom { $1 }
atom :
num { Num $1 }
| "(" term ")" { $2 }
{
data ArithExp =
Num Int
| Plus ArithExp ArithExp
| Times ArithExp ArithExp
| Neg ArithExp
deriving (Eq,Show)
parse :: String -> ArithExp
parse = parseTokens . lexer
parseError tokens = error $ "Parser error: " ++ show tokens
}
Running happy Parser.y
will generate Parser.hs
with the function parseTokens :: [Token] -> ArithExp
. A few notes:
- Again, those parts between
{
and}
are Haskell syntax. The rest is happy syntax. - The part after
%token
sets up the terminals we’ll use in the grammar. The$$
variable identifies which part of a token is relevant for positional variables. These are patterns, where$$
identifies the part you’d like bound to the corresponding positional variable. - The part after
%%
defines the grammar’s nonterminals, of the formname : list of terminals and nonterminals { code for an AST } | ...
- The code for generating ASTs in each production uses positional variables to refer to tokens (and their values, as established by $$).
- The
parserError
function is necessary for happy to work.
The driver
Finally, I mentioned the Main.hs
and Makefile
I distributed with the code. These put let you test your code—use them!