EMNLP 2014: Conference on Empirical Methods in Natural Language Processing — October 25–29, 2014 — Doha, Qatar.


Shared Task: Automatic Arabic Error Correction

Subscribe to the shared task discussion group

For Frequently Asked Questions, please visit the FAQ page for the shared task.

Deadline for system output collection extended to July 25, 2014.

Deadline for paper submission extended to July 28, 2014.

Submission instructions are available here.

NEW: List of Accepted Papers

As part of the Arabic Natural Language Processing Workshop at EMNLP 2014 (to be held in Doha, Qatar), we will conduct a shared task on Automatic Arabic Error Correction. We designed this task in the tradition of high-profile shared tasks in natural language processing, such as CoNLL's grammar/error detection and correction shared tasks in 2011-2013 and numerous machine translation campaigns by NIST/WMT/MEDAR, among others. The task relies on resources created under the Qatar Arabic Language Bank (QALB) project (currently over 1M words of manually corrected Arabic text).

A participating system in this shared task will be given Modern Standard Arabic texts, which are to be automatically corrected. The input will be provided in Arabic script and in a standard Romanization scheme, and will be annotated for part-of-speech (at three different granularities), inflectional features, clitics (which appear in 20% of Arabic words), lemmas, and English glosses. All of the input text will be preprocessed in a common way so that all participants have access to all of these features at no additional overhead. The task focuses on correction as opposed to identification; there will be no error identification task per se.

Participants need to register; the registration link is on the Shared Task Website (see below). Once registered, all participating teams will be provided with a common training data set, which includes common preprocessed input and corrected output. A common development set will also be provided. A blind test data set will be used to evaluate the output of the participating teams. An evaluation script will be provided to all the teams. Each participating team can submit up to three systems.

Participants are welcome to use additional resources and tools that are not part of the released data set. However, all such additions must be fully disclosed. Participants are expected to author a short paper (4 pages + 2 for references) describing their approach, resources, and experiments. The paper needs to follow the standard format of the EMNLP conference.

Important Dates

Shared task registration period: April 8, 2014 through July 1, 2014

Shared task test release: July 7, 2014

Shared task system output collection: July 25, 2014

Submission deadline (Workshop and shared task papers): July 28, 2014

Author notification: August 26, 2014

Camera Ready: September 15, 2014

Workshop: October 25, 2014

Registering to acquire the QALB Corpus

Please complete the QALB corpus release form in order to receive a link to the training data.

Shared Task Committee

Behrang Mohit (co-chair), Carnegie Mellon University Qatar
Alla Rozovskaya (co-chair), Columbia University
Wajdi Zaghouani, Carnegie Mellon University Qatar
Ossama Obeid, Carnegie Mellon University Qatar
Nizar Habash (advisor), New York University Abu Dhabi