Lecture 10.0 — 2016-09-29
Parsing with Applicative
This lecture is written in literate Haskell; you can download the raw source.
import Data.Char
import Control.Applicative
newtype Parser a = Parser { parse :: String -> Maybe (a,String) }
A parser is a function that takes a string and either (a) fails or (b) returns a value and an unparsed suffix of the string.
We wrote some simple parsers by hand, as in:
letter :: Char -> Parser Char
letter c = Parser $ \s ->
case s of
c':s' | c == c' -> Just (c,s')
_ -> Nothing
letters :: String -> Parser String
letters str = Parser $ \s ->
if take (length str) s == str
then Just (str,drop (length str) s)
else Nothing
Then we can run parse (letter 'c') "chocolate"
to get Just ('c',"hocolate")
.
Fine—we’re just dressing up some Prelude functions. But it turns out that Parsers can be treated as Functors and applicatives, so we can write more interesting parsers using building blocks we already know:
instance Functor Parser where
fmap f p = Parser $ \s ->
case parse p s of
Nothing -> Nothing
Just (v,s') -> Just (f v,s')
instance Applicative Parser where
pure a = Parser $ \s -> Just (a,s)
f <*> a = Parser $ \s ->
case parse f s of
Nothing -> Nothing
Just (g,s') -> parse (fmap g a) s'
So parse ((\x -> [x,x,x]) <$> parse (letter 'c')) "chocolate"
yields Just ("ccc","hocolate")
. Neat!
We wrote some fancier parsers this way:
letterC = letter 'c'
strCH = (\c h -> [c,h]) <$> letter 'c' <*> letter 'h'
Note that parse strCH "chocolate"
yields Just ("ch","ocolate")
. We can use Applicative to build up a notion of sequence. More generally, we can write a function that parses strings just using the parts we have so far:
string :: String -> Parser String
string [] = pure ""
string (c:s) = (:) <$> (letter c) <*> string s
So parse (string "choco") "chocolate"
is Just ("choco","late")
but parse (string "vanilla") "chocolate"
is `Nothing.
We wrote another primitive parser (i.e., using the Parser
constructor) to detect the end of the file:
eof :: Parser ()
eof = Parser $ \s -> if null s then Just ((),"") else Nothing
So, for example:
strCH' = (\c h _ -> [c,h]) <$> letter 'c' <*> letter 'h' <*> eof
Gives us that parse strCH' "ch"
is Just ("ch",[])
but parse strCH' "chocolate"
yields Nothing
.
Notice how we ignored a value for eof
. We can use (<*)
and (*>)
to save ourselves some trouble, writing, e.g.,
strCH'' = (\c h -> [c,h]) <$> letter 'c' <*> letter 'h' <* eof
Next we defined some interesting, general parsers.
satisfy :: (Char -> Bool) -> Parser Char
satisfy p = Parser f
where f [] = Nothing
f (x:xs) = if p x then Just (x,xs) else Nothing
ensure :: (a -> Bool) -> Parser a -> Parser a
ensure pred p = Parser $ \s ->
case parse p s of
Nothing -> Nothing
Just (a,s') -> if pred a then Just (a,s') else Nothing
lookahead :: Parser (Maybe Char)
lookahead = Parser f
where f [] = Just (Nothing,[])
f (c:s) = Just (Just c,c:s)
Alternative
We wanted to write a parser for numbers. It was easy enough to write a primitive parser:
integer :: Parser Int
integer = Parser $ \s ->
let (digits,rest) = span isDigit s in
if null digits then Nothing else Just (read digits,rest)
But can we build one up out of parts? To do so, we introduced yet another type class, Alternative
.
instance Alternative Parser where
empty = Parser $ \s -> Nothing
p1 <|> p2 = Parser $ \s ->
case parse p1 s of
Just (a,s') -> Just (a,s')
Nothing -> parse p2 s
To understand empty
and (<|>)
, you should of course look at the docs in Control.Applicative
. But perhaps it helps to see the instance for Maybe
:
instance Alternative Maybe where
-- empty :: f a
empty = Nothing
-- (<|>) :: f a -> f a -> f a
Just x <|> _ = Just x
Nothing <|> r = r
-- empty <|> f == f
-- f <|> empty == f
So we really could have written our instance much more concisely:
instance Alternative Parser where
empty = Parser $ const empty
l <|> r = Parser $ \s -> parse l s <|> parse r s
integer' :: Parser String
integer' = read <$> someDigits
where someDigits = (:) <$> satisfy isDigit <*> moreDigits
moreDigits = someDigits <|> pure []
We can abstract it still further, writing:
many, some :: Alternative f => f a -> f [a]
some p = (:) <$> p <*> many p
many p = some p <|> pure []
int :: Parser Int
int = read <$> some (satisfy isDigit)
The parser many p
parses zero or more occurrences of p
; the parser some p
parses one or more occurrences of p
. (NB that many
and some
are actually part of the Alternative
type class—you don’t need to define them.)
So, for example, we can parse a comma-delimited list of numbers where the last is greater than 0 using:
threeInts = (\n1 n2 n3 -> [n1,n2,n3]) <$> (int <* char ',') <*> (int <* char ',') <*> (ensure (>0) int)
We have parse threeInts "1,2,3"
yielding Just ([1,2,3],"")
but parse threeInts "1,2,0"
is Nothing
.
Parsing arithmetic expressions
Let’s build a parser for arithmetic expressions. We’ll keep it as an invariant that we parse spaces up before each actual phrase, so “2 + 2” and “2+2” and " 2 +2" all yield Plus (Num 2) (Num 2)
.
spaces :: Parser ()
spaces = many (satisfy isSpace) *> pure ()
char :: Char -> Parser Char
char c = spaces *> satisfy (==c)
So we can write parses for our arithmetic tokens as:
plus, minus, times :: Parser Char
plus = char '+'
minus = char '-'
times = char '*'
We’ll be careful (unlike in class) to allow spaces before numbers, too.
num :: Parser Int
num = spaces *> int
data Arith =
Num Int
| Plus Arith Arith
| Times Arith Arith deriving Show
term, factor, atom :: Parser Arith
term = Plus <$> factor <* plus <*> term <|> factor
factor = Times <$> atom <* times <*> factor <|> atom
atom = Num <$> num <|> (char '(' *> term <* char ')')
Compare this with the CFG:
Term ::= Factor + Term | Factor
Factor ::= Atom * Factor | Atom
Atom ::= n | ( Term )
Note that we’ve left out whitespace concerns. Try to appreciate the way our code mirrors the grammar.
We can extend to negation easily:
data Arith' =
Num' Int
| Plus' Arith' Arith'
| Times' Arith' Arith'
| Neg' Arith' deriving Show
term', factor', neg', atom' :: Parser Arith'
term' = Plus' <$> factor' <* plus <*> term'
<|> (\a b -> Plus' a (Neg' b)) <$> factor' <* minus <*> term'
<|> factor'
factor' = Times' <$> neg' <* times <*> factor' <|> neg'
neg' = Neg' <$> (minus *> atom') <|> atom'
atom' = Num' <$> num <|> (char '(' *> term' <* char ')')
Note how we’ve had our parser interpret subtraction as addition and negation—that’s exactly the sort of translation you did in HW02! These kinds of tricks are called “syntactic sugar”.