Lecture 10 — 2020-02-25

Applicative parsing, in depth

This lecture is written in literate Haskell; you can download the raw source.

import Data.Char
import Control.Applicative

We wrote up a straightforward instance for Maybe and a more interesting instance for Either e:

instance Applicative (Either e) where
  pure x = Right x -- because Left x would be ill typed!
  (Right f) <*> (Right v) = Right $ f v
  err@(Left e) <*> _ = err
  _ <*> err@(Left e) = err

Then we went over the Applicative definitions for lists. There were two possibilities: cartesian product…

instance Applicative [] where
  pure x = [x]

  []     <*>  _ = []
  _      <*> [] = []
  (f:fs) <*> xs = map f xs ++ fs <*> xs

…and zipping:

newtype ZipList a = ZipList { getZipList :: [a] }
  deriving (Eq, Show, Functor)

instance Applicative ZipList where
  pure = ZipList . repeat
  ZipList fs <*> ZipList xs = ZipList (zipWith ($) fs xs)

Obey the laws

Like Functor, the Applicative type class is governed by laws.

Identity: pure id <*> v = v Composition: pure (.) <*> u <*> v <*> w = u <*> (v <*> w) Homomorphism: pure f <*> pure x = pure (f x) Interchange: u <*> pure y = pure ($ y) <*> u

Note that identity is a generalization of id <$> v = v from Functor, since f <$> x = pure f <*> x.

Parsers

The parser we defined for arithmetic in lecture 9 was fine, but it had a problem: it was parsing right associatively, where x + y + z was interpreted as x + (y + z). It’s not what we want—especially if we’re going to introduce an operator like subtraction, which isn’t commutative!

The solution is something called chainl1. First, let’s recapitulate our parsers.

newtype Parser a = Parser { parse :: String -> Maybe (a,String) }

instance Functor Parser where
  fmap f p = Parser $ \s ->
    case parse p s of
      Nothing -> Nothing
      Just (v,s') -> Just (f v,s')

instance Applicative Parser where
  pure a = Parser $ \s -> Just (a,s)
  f <*> a = Parser $ \s ->  -- f :: Parser (a -> b), a :: Parser a
    case parse f s of
      Nothing -> Nothing
      Just (g,s') -> parse (fmap g a) s' -- g :: a -> b, fmap g a :: Parser b

instance Alternative Parser where
  empty = Parser $ \s -> Nothing
  p1 <|> p2 = Parser $ \s ->
    case parse p1 s of
      Just (a,s') -> Just (a,s')
      Nothing -> parse p2 s

ensure :: (a -> Bool) -> Parser a -> Parser a
ensure pred p = Parser $ \s ->
  case parse p s of
    Nothing -> Nothing
    Just (a,s') -> if pred a then Just (a,s') else Nothing

satisfy :: (Char -> Bool) -> Parser Char
satisfy p = Parser f
  where f [] = Nothing
        f (x:xs) = if p x then Just (x,xs) else Nothing

lookahead :: Parser (Maybe Char)
lookahead = Parser f
  where f [] = Just (Nothing,[])
        f (c:s) = Just (Just c,c:s)

int :: Parser Int
int = read <$> some (satisfy isDigit)

spaces :: Parser ()
spaces = many (satisfy isSpace) *> pure ()

char :: Char -> Parser Char
char c = spaces *> satisfy (==c)

str :: String -> Parser [Char]
str s = spaces *> string s
  where string []     = pure []
        string (c:s') = (:) <$> satisfy (==c) <*> string s'

num :: Parser Int
num = spaces *> int

As a warmup, we can define a parser that gives us ps separated by some number of seps, i.e., the langauge p (sep p)*.

sepBy1 :: Parser a -> Parser sep -> Parser [a]
sepBy1 p sep = (:) <$> p <*> (many (sep *> p))

For example, parse (intsepBy1char ',') "1,2,3,4" yields Just ([1,2,3,4],"").

The sepBy1 parser doesn’t look at what sep produces, but for arithmetic, the separator will be the operator we care about. To define chainl1, the parser that gives left chaining, we need our separator to not return some arbitrary, thrown away value—we want it to tell us which operator to use! Once we’ve collected all the operators, we can apply them left-associatively by using a fold:

chainl1 :: Parser a -> Parser (a -> a -> a) -> Parser a
chainl1 p sep = foldl (\acc (op,v) -> op acc v) <$> 
                p <*> many ((\op v -> (op,v)) <$> sep <*> p)

Refactored, we can verify that our grammar still works… try it out!

data Arith =
   Num Int
 | Plus Arith Arith
 | Times Arith Arith
 | Neg Arith
 deriving Show

sub :: Arith -> Arith -> Arith
sub e1 e2 = Plus e1 (Neg e2)

term, factor, neg, atom :: Parser Arith
term = factor `chainl1` addop
  where addop =     (char '+' *> pure Plus)
                <|> (char '-' *> pure sub)
factor = neg `chainl1` mulop
  where mulop =     (char '*' *> pure Times)
neg = Neg <$> (char '-'  *> atom) <|> atom
atom = Num <$> num <|> (char '(' *> term <* char ')')

Parsing WhileNZ

Once we have a parser for terms, parsing WhileNZ is a matter of being careful about keywords and identifiers. Here’s a working parser for the basic, Pascal-like syntax.

type VarName = String

keywords :: [String]
keywords = ["WHILE", "DO", "END", "SKIP"]

kw :: String -> Parser ()
kw s = pure () <* spaces <* (sequenceA $ map (satisfy . (==)) s) -- subtly incorrect on, e.g., DOINK, which isn't a kw

var :: Parser VarName
var = ensure (not . (`elem` keywords)) (spaces *> id)
  where id = (:) <$> satisfy isAlpha <*> many (satisfy isAlphaNum)

data WhileNZ =
    Skip
  | Assign VarName Arith
  | Seq WhileNZ WhileNZ
  | WhileNZ Arith WhileNZ
  deriving Show

statement, statements :: Parser WhileNZ
statements = foldl1 Seq <$> (statement `sepBy1` char ';')
statement =     const Skip <$> kw "SKIP"
            <|> Assign <$> (var <* str ":=") <*> term
            <|> WhileNZ <$>
                  (kw "WHILE" *> (term <* kw "DO")) <*>
                  (statements <* kw "END")
            <|> (\c b -> Seq c (WhileNZ b c)) <$>
                  (kw "DO" *> (statements <* kw "WHILE")) <*>
                  (term <* kw "END")

prog1 = fst <$> parse statements "x := 0"
prog2 = fst <$> parse statements "x := 0;\nWHILE x - 5 DO x := x + 1 END; SKIP"

Digression on chainl1

Okay, but how did chainl1 work? Let’s go over it:

chainl1 :: Parser a -> Parser (a -> a -> a) -> Parser a
chainl1 p sep = 
  foldl (\acc (op,v) -> op acc v) <$> p <*> many ((\op v -> (op,v)) <$> sep <*> p)

First, let’s refactor a bit—anonymous functions are great to write when you’re on top of things, but named functions are easier to talk about when you’re not sure what’s going on. (There’s a name for this phenomenon: the Kingdom of Nouns.)

chainl1 :: Parser a -> Parser (a -> a -> a) -> Parser a
chainl1 p sep = 
  foldl applyOp <$> p <*> many (collectOp <$> sep <*> p)
  where applyOp acc (op,v) = op acc v
        collectOp op v = (op,v)

What’s this code do? If we put on our Applicative parsing glasses, we’re going to parse p then then zero or more seps followed by a p, i.e., something of the form p (sep p)*. In terms of what’s parsed, this code is just like sepBy1, which is hopefully easier to understand:

sepBy1 :: Parser a -> Parser sep -> Parser [a]
sepBy1 p sep = (:) <$> p <*> (many (sep *> p))

If we turn our attention to types (put on your typechecking hat; who knew it was going to be a costume party), we can see that parsing with p produces a value of type a, but parsing with sep produces a function of type a -> a -> a. That is, our separator produces an operator. Our function collectOp makes it so that when we’re parsing zero or more sep ps, we collect a pair of things: the operation and the right-hand value. So many (collectOp <$> sep <*> p) will parse a list of pairs of operations and values.

After parsing with p <*> many (collectOp <$> sep <*> p), we find ourselves holding two things: a value of type a, from parsing with p, and a list of type [(a->a->a,a)] from the many .... The first thing, the result from p, is our leftmost thing we’ve parsed. The list is a left-to-right listing of (i) the operation between the thing to our left and (ii) another value of type a parsed by p. We then flatten everything out by folding—using foldl, because we’re trying to be left associative.

That is, we have…

p0 [(op1,p1),(op2,p2),(op3,p3),...,(opn,pn)]

and we produce…

(opn (... (op3 (op2 (op1 p0 p1) p2) p3) ...) pn)

…which looks like the right answer to me!