EF Education First Research Lab - University of Cambridge

EFCAMDAT Corpus

The EF-Cambridge Open Language Database (EFCAMDAT) is the largest open-access corpus of English learner essays. It comprises submissions from students worldwide who attend an online EF school. Learners are assigned to proficiency levels based on their initial placement test results or through successful course progression. The 16 proficiency levels, aligned with the Common European Framework of Reference for Languages (CEFR), each consist of eight lessons designed to enhance reading, listening, speaking, and writing skills. EFCAMDAT includes scripts from writing tasks at the end of each lesson, covering topics like "writing a resume" and "giving budgeting advice."

In its first release, the corpus contained 551,036 scripts from 84,864 learners. The second release expanded to 1,180,310 texts from 174,743 learners. A cleaned subcorpus was also created, containing only texts from levels 1 to 15 by learners from the 11 most represented nationalities.

Academic researchers can request access to the second release of the corpus (in XML format), the cleaned subcorpus with error annotations (in XLSX format), and the list of task prompts.

User agreement

Use the link below to download the user agreement as a PDF file.
User Agreement (PDF file)

Request access

Follow the link below to submit an application to access the corpus. Please note that an academic affiliation and access to Google Drive are necessary to use the corpus. To open the corpus request form, you therefore need to sign in with a Google account linked to your university email address.
Corpus Access Request Form

If you need to set up a Google account with your academic email address, you may refer to the instructions here. Alternatively, check the official instructions here (See the section titled "Can I use an existing email address?").

Download corpus

Follow the link below to download the EFCAMDAT Corpus files. Note that your application (above) will need to be approved by administrators before you can access the Google Drive. In the unlikely event that you think your request has been missed, please resubmit the Corpus Access Request Form.

Corpus Files (Google Drive)

Inside the folder, you can find the XML file of the original corpus, a cleaned sub-corpus by Shatz (2020), a cleaned error-coded sub-corpus by Öksüz et al. (2025), and diagnostic models (proficiency and linguistic profiling) trained using the corpus (Stearns et al., 2024).

No longer using the corpus?

Follow the link below to submit a request for the administrator to remove your data.
Corpus Withdrawal Request Form

Get in touch

If you have any difficulty accessing the corpus or have any questions, please email the EFCAMDAT corpus administrator Rory Leung.

Citations

Please cite the following when using the EFCAMDAT data:

Geertzen, J., Alexopoulou, T., & Korhonen, A. (2014). Automatic linguistic annotation of large scale L2 databases: The EF-Cambridge Open Language Database (EFCamDat). In R.T. Millar, K.I. Martin, C.M. Eddington, A. Henery, N.M. Miguel, & A. Tseng (Eds.), Selected proceedings of the 2012 Second Language Research Forum (pp. 240–254). Somerville, MA: Cascadilla Proceedings Project.

Huang, Y., Geertzen, J., Baker, R., Korhonen, A., & Alexopoulou, T. (2017). The EF Cambridge Open Language Database (EFCAMDAT): Information for users (pp. 1–18). Retrieved from https://ef-lab.mmll.cam.ac.uk/EFCAMDAT.html

Please cite the following if you are using the cleaned sub-corpus:

Shatz, I. (2020). Refining and modifying the EFCAMDAT: Lessons from creating a new corpus from an existing large-scale English learner language database. International Journal of Learner Corpus Research, 6(2), 220–236. doi:10.1075/ijlcr.20009.sha

Please cite the following if you are using the cleaned, part-of-speech-tagged and error-coded sub-corpus:

Öksüz, D., Derkach, K., Alexopoulou, T., & Tsimpli, I. M. (2025). The influence of L1 typology on the acquisition of the L2 English article: A large-scale corpus study. Second Language Research.

Please cite the following if you are using the diagnostic models created with the corpus data:

Stearns, B., Ballier, N., Gaillat, T., Simpkin, A., & McCrae, J. (2024). Evaluating the Generalisation of an Artificial Learner. Proceedings of the 13th Workshop on Natural Language Processing for Computer Assisted Language Learning, 199–208. https://aclanthology.org/2024.nlp4call-1.15.pdf

Please cite the following if you are using the collocation dataset created with the corpus data:

Wolter, B., Cooper, C. R., & Nicklin, C. (under review). The effect of node word properties and proficiency on L2 verb-noun collocation production.