Text simplification data sets

If you use this data, please send me (David Kauchak) an e-mail and let me know what project you're working on.

Wikipedia Data Sets

Two different versions of the data set now exist. Both were generated by aligning Simple English Wikipedia and English Wikipedia. A complete description of the extraction process can be found in "Simple English Wikipedia: A New Simplification Task", William Coster and David Kauchak (2011). In Proceedings of ACL (short paper).

For questions regarding either version of the data set, contact David Kauchak at Middlebury College.

Version 1.0

The original version of the data set, created from Wikipedia pages downloaded in May 2010. The data set contains 137K aligned sentence pairs, generated from sentences with a TF-IDF similarity above 0.50. Higher-precision alignments may be obtained by thresholding the TF-IDF similarity at a higher level.
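The thresholding idea can be sketched in a few lines. This is a minimal illustration, not the actual alignment code used to build the data set: the tokenization (lowercased whitespace split) and TF-IDF weighting (raw term frequency times log inverse document frequency) are simplifying assumptions, and the sample sentences are made up.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """Compute sparse TF-IDF vectors (word -> weight dicts) for each sentence."""
    docs = [Counter(s.lower().split()) for s in sentences]
    n = len(docs)
    df = Counter()
    for d in docs:
        df.update(d.keys())                      # document frequency per word
    idf = {w: math.log(n / df[w]) for w in df}
    return [{w: tf * idf[w] for w, tf in d.items()} for d in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors represented as dicts."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def align_filter(pairs, threshold=0.5):
    """Keep only (normal, simple) pairs whose TF-IDF cosine similarity
    meets the threshold; raising the threshold trades recall for precision."""
    corpus = [s for pair in pairs for s in pair]
    vecs = tfidf_vectors(corpus)
    return [pair for i, pair in enumerate(pairs)
            if cosine(vecs[2 * i], vecs[2 * i + 1]) >= threshold]
```

Raising `threshold` above 0.50 discards more marginal pairs, which is the precision/recall trade-off mentioned above.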

Version 1.0 data    (split into train, tune and test)

Version 2.0

An updated version of the data set, created from Wikipedia pages downloaded in May 2011. This data set was used in "Improving Text Simplification Language Modeling Using Unsimplified Text Data", David Kauchak (2013). In Proceedings of ACL. The versions below are slightly different from (larger than) the data used in the paper. For the exact data used in the paper, please contact me.

This data set contains two parts: a sentence-aligned part and a document-aligned part.
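A loader for the sentence-aligned part might look like the sketch below. The file layout here (one pair per line, normal sentence and simple sentence separated by a tab) is an assumption for illustration only; consult the documentation included with the download for the actual release format.

```python
def load_sentence_pairs(path):
    """Load aligned (normal, simple) sentence pairs.

    Assumes one tab-separated pair per line -- a guess at the layout,
    not the documented release format.
    """
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 2:
                pairs.append((parts[0], parts[1]))
    return pairs
```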

Mechanical Turk Lexical Simplification Data Set

Available here.

For more information and to cite, see: Colby Horn, Katie Manduca and David Kauchak (2014). Learning a Lexical Simplifier Using Wikipedia. In Proceedings of ACL (short paper).