Lecture 10.0 — 2016-09-29

Parsing with Applicative

This lecture is written in literate Haskell; you can download the raw source.

import Data.Char
import Control.Applicative

newtype Parser a = Parser { parse :: String -> Maybe (a,String) }

A parser is a function that takes a string and either (a) fails or (b) returns a value and an unparsed suffix of the string.

We wrote some simple parsers by hand, as in:

letter :: Char -> Parser Char
letter c = Parser $ \s ->
  case s of
    c':s' | c == c' -> Just (c,s')
    _ -> Nothing

letters :: String -> Parser String
letters str = Parser $ \s ->
  if take (length str) s == str
  then Just (str,drop (length str) s)
  else Nothing

Then we can run parse (letter 'c') "chocolate" to get Just ('c',"hocolate").

Fine—we’re just dressing up some Prelude functions. But it turns out that Parsers can be treated as Functors and applicatives, so we can write more interesting parsers using building blocks we already know:

instance Functor Parser where
  fmap f p = Parser $ \s ->
    case parse p s of
      Nothing -> Nothing
      Just (v,s') -> Just (f v,s')

instance Applicative Parser where
  pure a = Parser $ \s -> Just (a,s)
  f <*> a = Parser $ \s ->
    case parse f s of
      Nothing -> Nothing
      Just (g,s') -> parse (fmap g a) s'

So parse ((\x -> [x,x,x]) <$> parse (letter 'c')) "chocolate" yields Just ("ccc","hocolate"). Neat!

We wrote some fancier parsers this way:

letterC = letter 'c'
strCH = (\c h -> [c,h]) <$> letter 'c' <*> letter 'h'

Note that parse strCH "chocolate" yields Just ("ch","ocolate"). We can use Applicative to build up a notion of sequence. More generally, we can write a function that parses strings just using the parts we have so far:

string :: String -> Parser String
string [] = pure ""
string (c:s) = (:) <$> (letter c) <*> string s

So parse (string "choco") "chocolate" is Just ("choco","late") but parse (string "vanilla") "chocolate" is `Nothing.

We wrote another primitive parser (i.e., using the Parser constructor) to detect the end of the file:

eof :: Parser ()
eof = Parser $ \s -> if null s then Just ((),"") else Nothing

So, for example:

strCH' = (\c h _ -> [c,h]) <$> letter 'c' <*> letter 'h' <*> eof

Gives us that parse strCH' "ch" is Just ("ch",[]) but parse strCH' "chocolate" yields Nothing.

Notice how we ignored a value for eof. We can use (<*) and (*>) to save ourselves some trouble, writing, e.g.,

strCH'' = (\c h -> [c,h]) <$> letter 'c' <*> letter 'h' <* eof

Next we defined some interesting, general parsers.

satisfy :: (Char -> Bool) -> Parser Char
satisfy p = Parser f
  where f [] = Nothing
        f (x:xs) = if p x then Just (x,xs) else Nothing

ensure :: (a -> Bool) -> Parser a -> Parser a
ensure pred p = Parser $ \s ->
  case parse p s of
    Nothing -> Nothing
    Just (a,s') -> if pred a then Just (a,s') else Nothing

lookahead :: Parser (Maybe Char)
lookahead = Parser f
  where f [] = Just (Nothing,[])
        f (c:s) = Just (Just c,c:s)

Alternative

We wanted to write a parser for numbers. It was easy enough to write a primitive parser:

integer :: Parser Int
integer = Parser $ \s ->
  let (digits,rest) = span isDigit s in
  if null digits then Nothing else Just (read digits,rest)

But can we build one up out of parts? To do so, we introduced yet another type class, Alternative.

instance Alternative Parser where
  empty = Parser $ \s -> Nothing
  p1 <|> p2 = Parser $ \s ->
    case parse p1 s of
      Just (a,s') -> Just (a,s')
      Nothing -> parse p2 s

To understand empty and (<|>), you should of course look at the docs in Control.Applicative. But perhaps it helps to see the instance for Maybe:

instance Alternative Maybe where
  -- empty :: f a
  empty = Nothing

  -- (<|>) :: f a -> f a -> f a
  Just x <|> _ = Just x
  Nothing <|> r = r

  -- empty <|> f == f
  -- f <|> empty == f

So we really could have written our instance much more concisely:

instance Alternative Parser where
  empty = Parser $ const empty
  l <|> r = Parser $ \s -> parse l s <|> parse r s
integer' :: Parser String
integer' = read <$> someDigits
  where someDigits = (:) <$> satisfy isDigit <*> moreDigits
        moreDigits = someDigits <|> pure []

We can abstract it still further, writing:

many, some :: Alternative f => f a -> f [a]
some p = (:) <$> p <*> many p
many p = some p <|> pure []
int :: Parser Int
int = read <$> some (satisfy isDigit)

The parser many p parses zero or more occurrences of p; the parser some p parses one or more occurrences of p. (NB that many and some are actually part of the Alternative type class—you don’t need to define them.)

So, for example, we can parse a comma-delimited list of numbers where the last is greater than 0 using:

threeInts = (\n1 n2 n3 -> [n1,n2,n3]) <$> (int <* char ',') <*> (int <* char ',') <*> (ensure (>0) int)

We have parse threeInts "1,2,3" yielding Just ([1,2,3],"") but parse threeInts "1,2,0" is Nothing.

Parsing arithmetic expressions

Let’s build a parser for arithmetic expressions. We’ll keep it as an invariant that we parse spaces up before each actual phrase, so “2 + 2” and “2+2” and " 2 +2" all yield Plus (Num 2) (Num 2).

spaces :: Parser ()
spaces = many (satisfy isSpace) *> pure ()
char :: Char -> Parser Char
char c = spaces *> satisfy (==c)

So we can write parses for our arithmetic tokens as:

plus, minus, times :: Parser Char
plus = char '+'
minus = char '-'
times = char '*'

We’ll be careful (unlike in class) to allow spaces before numbers, too.

num :: Parser Int
num = spaces *> int
data Arith =
   Num Int
 | Plus Arith Arith
 | Times Arith Arith deriving Show

term, factor, atom :: Parser Arith
term = Plus <$> factor <* plus <*> term <|> factor
factor = Times <$> atom <* times <*> factor <|> atom
atom = Num <$> num <|> (char '(' *> term <* char ')')

Compare this with the CFG:

Term ::= Factor + Term | Factor
Factor ::= Atom * Factor | Atom
Atom ::= n | ( Term )

Note that we’ve left out whitespace concerns. Try to appreciate the way our code mirrors the grammar.

We can extend to negation easily:

data Arith' =
   Num' Int
 | Plus' Arith' Arith'
 | Times' Arith' Arith'
 | Neg' Arith' deriving Show

term', factor', neg', atom' :: Parser Arith'
term' =     Plus' <$> factor' <* plus <*> term'
        <|> (\a b -> Plus' a (Neg' b)) <$> factor' <* minus <*> term'
        <|> factor'
factor' = Times' <$> neg' <* times <*> factor' <|> neg'
neg' = Neg' <$> (minus *> atom') <|> atom'
atom' = Num' <$> num <|> (char '(' *> term' <* char ')')

Note how we’ve had our parser interpret subtraction as addition and negation—that’s exactly the sort of translation you did in HW02! These kinds of tricks are called “syntactic sugar”.