Day18_levenshtein

Require Export DMFP.Day17_sorting.

Getting Levenshtein right

On day 9, we looked at edits as a way to talk about how related two strings were---in particular, strands of DNA bases.
We made a variety of assertions: that our various edit-producing functions were "correct", i.e., produced valid edits; that sub_edit produced edits no more expensive than naive_sub_edit; that levenshtein produces edits that were "optimal", often much better than even sub_edit.
Let's substantiate those claims!

Edit-generating functions

First, let's look at our edit-producing functions.

Definition delete_edit (src : strand) : list edit
  := map (fun _delete) src.

Definition add_edit (tgt : strand) : list edit
  := map (fun badd b) tgt.
The delete_add_edit approach is the simplest way to find edits from one strand to another: erase the first one entirely, then write in the second.
Definition delete_add_edit (src : strand) (tgt : strand) : list edit
  := delete_edit src ++ add_edit tgt.
This one's new! It produces edits from a strand to itself. We'll need it in the proof of correctness for levenshtein.
Definition copy_edit (src: strand) : list edit :=
  map (fun _copy) src.
The naive_sub_edit approach is a bit better than delete_add_edit, using substitute instead of deletes and adds.
Fixpoint naive_sub_edit (src : strand) (tgt : strand) : list edit :=
  match (src, tgt) with
  | ([], tgt)add_edit tgt
  | (src, [])delete_edit src
  | (_::src', b2::tgt')
    substitute b2 :: naive_sub_edit src' tgt'
  end.
And sub_edit is still better, using copy when it can.
Fixpoint sub_edit (src : strand) (tgt : strand) : list edit :=
  match (src, tgt) with
  | ([], tgt)add_edit tgt
  | (src, [])delete_edit src
  | (b1::src', b2::tgt')
    let edit :=
        if eq_base b1 b2
        then copy
        else substitute b2
    in
    edit :: sub_edit src' tgt'
  end.
But levenshtein's "choose the best" approach will be optimal. Here's our definition.
Fixpoint levenshtein (src tgt : strand) : list edit :=
  let fix inner tgt :=
      match (src,tgt) with
      | ([], _)add_edit tgt
      | (_, [])delete_edit src
      | (b1::src', b2::tgt')
        let copy_sub_edits := (if eq_base b1 b2 then copy else substitute b2)
                                :: levenshtein src' tgt' in
        let delete_edits := delete :: levenshtein src' tgt in
        let add_edits := add b2 :: inner tgt' in
        
        argmin3 total_cost copy_sub_edits delete_edits add_edits
      end
  in

  inner tgt.

Proving validity

Let's prove each of them valid. But what does that mean? Some edits are valid for a given src and tgt strand when applying the edits to src with apply_edits yields the tgt strand.
Definition valid_edit (src : strand) (tgt : strand) (edits : list edit) :=
  eq_strand (apply_edits src edits) tgt = true.
We could have worked propositionally here, saying:
Definition valid_edit' (src : strand) (tgt : strand) (edits : list edit) :=
  apply_edits src edits = tgt.
The eq_strand_iff lemma lets us know that valid_edit and valid_edit' are equivalent. The proofs turn out ot be marginally easier using valid_edit's approach, so let's stick with that.

Exercise: 1 star, standard (delete_edit_valid)

Lemma delete_edit_valid : (src : strand),
    valid_edit src [] (delete_edit src).
Proof.
  (* FILL IN HERE *) Admitted.

Exercise: 1 star, standard (add_edit_valid)

Lemma add_edit_valid : (tgt : strand),
    valid_edit [] tgt (add_edit tgt).
Proof.
  (* FILL IN HERE *) Admitted.

Exercise: 1 star, standard (copy_edit_valid)

Lemma copy_edit_valid : src,
    valid_edit src src (copy_edit src).
Proof.
  (* FILL IN HERE *) Admitted.

Exercise: 3 stars, standard (delete_add_valid)

This one is the hardest so far. You ought to be able to simply combine delete_edit_valid and add_edit_valid, but it doesn't quite work. (Try it and see!) What helper lemma do you need to prove about how delete_edit and ++ interact? Your lemma might need to state a property more general than what you need just for you to be able to prove it!
Before you get help, spend at least twenty minutes in front of a whiteboard, without your computer, trying to figure it out. Hint: if your lemma should mention delete_edit.

Lemma delete_add_valid : (src tgt : strand),
    valid_edit src tgt (delete_add_edit src tgt).
Proof.
  (* FILL IN HERE *) Admitted.
So much for those simpler edit-generating functions. Let's work on some more complex ones.

Exercise: 2 stars, standard (naive_sub_edit_valid)

Lemma naive_sub_edit_valid : src tgt,
    valid_edit src tgt (naive_sub_edit src tgt).
Proof.
  (* FILL IN HERE *) Admitted.

Exercise: 2 stars, standard (sub_edit_valid)

Prove that sub_edit produces valid edits.
Lemma sub_edit_valid : src tgt,
    valid_edit src tgt (sub_edit src tgt).
Proof.
  (* FILL IN HERE *) Admitted.

Proving Levenshtein valid

This next lemma will be handy for proving levenshtein_valid.

Exercise: 2 stars, standard, optional (argmin3_valid)

Lemma argmin3_valid : (src tgt : strand) (cost : list edit nat)
                             (edits1 edits2 edits3 : list edit),
    valid_edit src tgt edits1
    valid_edit src tgt edits2
    valid_edit src tgt edits3
    valid_edit src tgt (argmin3 cost edits1 edits2 edits3).
Proof.
  (* FILL IN HERE *) Admitted.
We're finally equipped to prove that Levenshtein's edit distance generates valid edits! You'll want to use Search to find relevant lemmas from before.

Exercise: 3 stars, standard (levenshtein_valid)

A word of warning: you're going to get some pretty complicated terms in your goal! Using unfold/fold in combination will help a bit, but nothing will help quite so much as patient attention. A lot of what's in your goal is noise resulting from the let fix inner, and you can often pay attention only to the front of your goal.
Lemma levenshtein_valid : (src tgt : strand),
    valid_edit src tgt (levenshtein src tgt).
Proof.
  (* FILL IN HERE *) Admitted.

Going the distance

Exercise: 3 stars, standard, optional (costs)

Prove lemmas characterizing the costs of each of our basic edit-generating functions.

Lemma add_cost : tgt,
    total_cost (add_edit tgt) = length tgt.
Proof.
  (* FILL IN HERE *) Admitted.

Lemma delete_cost : src,
    total_cost (delete_edit src) = length src.
Proof.
  (* FILL IN HERE *) Admitted.

Lemma copy_cost_0 : (src : strand),
    total_cost (copy_edit src) = 0.
Proof.
  (* FILL IN HERE *) Admitted.

Exercise: 3 stars, standard (levenshtein_distance_correct)

Let's define a levenshtein_distance that only computes the edit distance without computing edits themselves--this cost is the real thing we care about, not edits... but it'd be nice to know that the cost that levenshtein_distance produces the right cost, i.e., that it agrees with levenshtein. Let's prove it!
Fixpoint levenshtein_distance (src tgt : strand) : nat :=
  let fix inner tgt :=
      match (src,tgt) with
      | ([], _)length tgt
      | (_, [])length src
      | (b1::src', b2::tgt')
        let copy_sub_edits := (if eq_base b1 b2 then 0 else 1) + levenshtein_distance src' tgt' in
        let delete_edits := 1 + levenshtein_distance src' tgt in
        let add_edits := 1 + inner tgt' in
        
        min3 copy_sub_edits delete_edits add_edits
      end
  in
  inner tgt.

Lemma levenshtein_distance_correct : src tgt,
    total_cost (levenshtein src tgt) = levenshtein_distance src tgt.
Proof.
  (* FILL IN HERE *) Admitted.

Optimality for Levenshtein edit distance

We've proven that our various edit-finding algorithms are correct. But which is best (for our notion of edit)?
Levenshtein edit distance is optimal: it always produces the lowest cost edits possible. Now that we have a notion like , we can prove that!

Exercise: 2 stars, standard (naive_sub_worse)

As a warm up, let's show that sub_edit is better than naive_sub_edit. The intuition here is that sub_edit and naive_sub_edit are more or less identical, except sub_edit sometimes produces a copy edit while naive_sub_edit always produces substitute. Since cost copy = 0 and cost (substitute b) = 1, using sub_edit should never be worse than naive_sub_edit.
Lemma naive_sub_worse : src tgt,
    total_cost (sub_edit src tgt) total_cost (naive_sub_edit src tgt).
Proof.
  (* FILL IN HERE *) Admitted.

Exercise: 2 stars, standard (add_empty_optimal)

For any edits from the empty strand to a given tgt strand, add_edit is optimal. Why? Well, you need to build the whole strand!
Hint: the proof should go by induction on edits. You'll need to do some case analysis on the other edit and on tgt.
Lemma add_empty_optimal : (tgt : strand) (edits : list edit),
    valid_edit [] tgt edits
    total_cost (add_edit tgt) total_cost edits.
Proof.
  (* FILL IN HERE *) Admitted.

Exercise: 2 stars, standard, optional (delete_empty_optimal)

Similarly, delete_edit is optimal when editing a strand to empty.
Lemma delete_empty_optimal : (src : strand) (edits : list edit),
    valid_edit src [] edits
    total_cost (delete_edit src) total_cost edits.
Proof.
  (* FILL IN HERE *) Admitted.

Exercise: 3 stars, standard (levenshtein_optimal)

Prove that levenshtein is optimal. The following lemma will be helpful.
Lemma levenshtein_refl_copy : (src : strand),
    levenshtein src src = copy_edit src.
Proof.
  induction src as [|b src' IH].
  - reflexivity.
  - simpl. rewrite eq_base_refl. rewrite IH.
    assert (total_cost (copy :: copy_edit src') = 0). { apply copy_cost_0. }
    unfold argmin3. rewrite H. simpl.
    reflexivity.
Qed.
We've given you the structure of this (fairly challenging!) proof. Try not to be intimidated by the large formulae you might be given. Go slow and read everything. If you get confused, use a whiteboard.

Lemma levenshtein_optimal : (src tgt : strand) (edits : list edit),
    valid_edit src tgt edits
    total_cost (levenshtein src tgt) total_cost edits.
Proof.
  induction src as [|b1 src' IH1].
  - intros [|b2 tgt'] edits; unfold valid_edit; intros Hvalid.
    + (* FILL IN HERE *) admit.
    + (* FILL IN HERE *) admit.
  - induction tgt as [|b2 tgt' IH2]; unfold valid_edit; intros [|edit edits'] H.
    + (* FILL IN HERE *) admit.
    + (* FILL IN HERE *) admit.
    + (* FILL IN HERE *) admit.
    + simpl.
      destruct edit as [| | b3 | b3]; simpl in H.
      × replace (cost copy + total_cost edits') with (total_cost (copy :: edits')) by reflexivity.
        (* replace e1 with e2 by tactic is a handy way of saying:
           assert (e1 = e2) as H. { tactic } rewrite H.

           Feel free to use the replace tactic if you like it---it's
           a handy way of controlling rewriting/simplification.  *)

        apply argmin3_leb. left.
        (* FILL IN HERE *) admit.
      × replace (cost delete + total_cost edits') with (total_cost (delete :: edits')) by reflexivity.
        apply argmin3_leb. right. left.
        (* FILL IN HERE *) admit.
      × replace (cost (add b3) + total_cost edits') with (total_cost (add b3 :: edits')) by reflexivity.
        apply argmin3_leb. right. right.
        (* FILL IN HERE *) admit.
      × replace (cost (substitute b3) + total_cost edits') with (total_cost (substitute b3 :: edits')) by reflexivity.
        apply argmin3_leb. left.
        (* FILL IN HERE *) admit.
(* FILL IN HERE *) Admitted.

(* Fri Oct 16 23:37:11 PDT 2020 *)