First Workshop on Computational Approaches to Code Switching


Code-switching (CS) is the phenomenon by which multilingual
speakers switch back and forth between their common languages in
written or spoken communication. CS is typically present at the
inter-sentential, intra-sentential (mixing of words from multiple
languages in the same utterance) and even morphological (mixing of
morphemes) levels. CS presents serious challenges for language
technologies, including parsing, Machine Translation (MT), automatic
speech recognition (ASR), information retrieval (IR) and extraction
(IE), and semantic processing. Traditional techniques trained for
one language quickly break down when there is input mixed in from
another. Even for problems that are considered solved, such as
language identification or part-of-speech tagging, performance
degrades at a rate proportional to the amount and level of
mixed-language content present.

CS is pervasive in informal text communications such as
newsgroups, tweets, blogs, and other social media of multilingual
communities. Such genres are increasingly being studied as rich
sources of social, commercial and political information. Apart from
the informal genre challenge associated with such data within a
single language processing scenario, the CS phenomenon adds another
significant layer of complexity to the processing of the data.
Efficiently and robustly processing CS data presents a new frontier
for our NLP algorithms on all levels. This workshop aims to bring
together researchers interested in solving the problem and to
raise awareness in the community at large of viable solutions
that reduce the complexity of the phenomenon.

The workshop invites contributions from researchers working on
NLP approaches for the analysis and/or processing of mixed-language
data, especially with a focus on intra-sentential code switching.
Topics of relevance to the workshop include the following:

  • Development of linguistic resources to support research on
    code switched data
  • NLP approaches for language identification in code switched
    data
  • NLP techniques for the syntactic analysis of code switched
    data
  • Domain/dialect/genre adaptation techniques applied to code
    switched data processing
  • Language modeling approaches to code switched data processing
  • Crowdsourcing approaches for the annotation of code
    switched data
  • Machine translation approaches for code switched data
  • Position papers discussing the challenges of code switched
    data to NLP techniques
  • Methods for improving ASR in code switched data
  • Survey papers of NLP research for code switched data
  • Sociolinguistic aspects of code switching
  • Sociopragmatic aspects of code switching

Shared Task: Language Identification in Code-Switched (CS) Data

You thought language identification was a solved problem?
Think again! Recent research has shown that fine-grained language
identification is still a challenge, and is particularly error-prone
when the spans of text are smaller. Now imagine you have more than
one language in those small text spans! We are organizing a shared
task on language identification of CS data. The goal is to allow
participants to explore the use of unsupervised and supervised
approaches to detection of language at the word level in
code-switching data. We will release a small gold standard dataset for
tuning systems in four language pairs: Spanish-English, Modern
Standard Arabic and Arabic dialects, Chinese-English, and Nepali-English.

Task Definition

For each word in the Source, identify whether it is Lang1, Lang2,
Mixed, Other, Ambiguous, or NE (for named entities, which are proper
names that represent names of people, places, organizations,
locations, movie titles, and song titles). For more details, please
see the annotation guidelines. The focus of the task is on microblog data, so we
will use Twitter as the source of data, although each language
combination will have data from a “surprise genre” as additional
test data as well.
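
The label inventory in the task definition can be captured as a small enumeration. The sketch below is illustrative only: the lowercase string values, the `parse_label` helper, and the example sentence (with English taken as Lang1 and Spanish as Lang2) are assumptions, not the official spellings or data used in the released files.

```python
from enum import Enum


class Label(Enum):
    """Word-level labels from the task definition (string values are assumed)."""
    LANG1 = "lang1"
    LANG2 = "lang2"
    MIXED = "mixed"
    OTHER = "other"
    AMBIGUOUS = "ambiguous"
    NE = "ne"


def parse_label(raw: str) -> Label:
    """Map a raw label string to a Label, ignoring case and surrounding whitespace."""
    return Label(raw.strip().lower())


# Hypothetical Spanish-English example: English as Lang1, Spanish as Lang2.
tokens = ["I", "love", "Madrid", "pero", "hace", "calor", ":)"]
labels = ["lang1", "lang1", "ne", "lang2", "lang2", "lang2", "other"]
tagged = [(tok, parse_label(lab)) for tok, lab in zip(tokens, labels)]
```

Parsing labels through one function makes an unknown label fail early with a `ValueError` rather than silently entering a system's output.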

Participants for this shared task will be required to submit
output of their systems following the schedule proposed below in
order to qualify for evaluation under the shared task. They will
also be required to submit a paper describing their system.

Since we are using Twitter data, we follow the now-standard
procedure that other researchers have used to release labeled data.
Participants can use their own scripts or download our Python script
to collect the data directly from Twitter, and we will release
character offsets with the label information.
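
As a rough illustration of how character offsets pair with downloaded tweet text, the sketch below slices tokens out of a tweet. The offset convention (0-based, inclusive end), the tuple layout, and the example tweet are assumptions; check the released files for the actual format.

```python
from typing import List, Tuple

# (start, end, label): 0-based character offsets with an inclusive end (assumed).
Annotation = Tuple[int, int, str]


def tokens_from_offsets(text: str, annotations: List[Annotation]) -> List[Tuple[str, str]]:
    """Return (token, label) pairs by slicing the tweet text at each offset span."""
    pairs = []
    for start, end, label in annotations:
        if not (0 <= start <= end < len(text)):
            raise ValueError(f"offset span ({start}, {end}) outside tweet text")
        pairs.append((text[start:end + 1], label))
    return pairs


# Hypothetical tweet and spans, for illustration only.
tweet = "Vamos to the beach"
spans = [(0, 4, "lang2"), (6, 7, "lang1"), (9, 11, "lang1"), (13, 17, "lang1")]
print(tokens_from_offsets(tweet, spans))
```

The bounds check matters in practice: if a tweet has been edited or only partly retrieved, the released offsets may no longer line up with the text.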

Please join our Google group to receive announcements and other
relevant information for the workshop: [email protected]

To register your team please follow this link: Registration Form

Data Release

The script to crawl Twitter data is this one: twitter. You
will need to have Beautiful Soup installed for this Python script
to work.

A second method to crawl Twitter data using the Twitter API is also
available: Twitter via API. You will need to have the Launchy gem
for Ruby installed, which can be done via ‘gem install launchy’ on
the command line. You
will also need a Twitter account to authenticate with the API.

For the Arabic and English-Spanish tweets, there are packages
available that retrieve, tokenize, and synchronize the tags for
the training data: Arabic Tweets Token Assigner and English-Spanish
Tweets Token Assigner. Instructions on how to use the packages are

The Spanish-English tweets were tokenized using the CMU ARK Twitter
Part-of-Speech Tagger v0.3 (ignoring the parts of speech) with some
later adjustments. These adjustments were made using the TweetTokenizer
Perl module. The ARK Twitter tokenizer takes an entire tweet on one
line, so first run the onelineFile() subroutine on your file, then
feed the output into the tokenizeFile() subroutine, which runs the
tokenizer and makes adjustments. You will need to change the
tokenizer location global variable in the module to your file

The task will be evaluated using the script and calculation library
given here.
The script is run using the produced offset file and the test offset
file and produces a variety of evaluation metrics at the tweet and
token level. See the documentation inside of the script for more
details. Keep the directory structure within the Evaluation file the
same for the script to work properly.
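
For a quick sanity check before running the official script, token-level scores can be approximated as follows. This is a minimal sketch, not the official evaluation: the function names are hypothetical, the label strings are assumed, and only token-level accuracy and per-label precision, recall, and F1 are covered (no tweet-level metrics).

```python
from collections import Counter
from typing import Dict, List, Tuple


def token_accuracy(gold: List[str], pred: List[str]) -> float:
    """Fraction of tokens whose predicted label matches the gold label."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sequences differ in length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def per_label_prf(gold: List[str], pred: List[str]) -> Dict[str, Tuple[float, float, float]]:
    """Precision, recall, and F1 for every label seen in gold or pred."""
    tp, gold_n, pred_n = Counter(), Counter(gold), Counter(pred)
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
    scores = {}
    for label in set(gold_n) | set(pred_n):
        precision = tp[label] / pred_n[label] if pred_n[label] else 0.0
        recall = tp[label] / gold_n[label] if gold_n[label] else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Toy label sequences, for illustration only.
gold = ["lang1", "lang1", "lang2", "ne"]
pred = ["lang1", "lang2", "lang2", "ne"]
print(token_accuracy(gold, pred))
print(per_label_prf(gold, pred))
```

Numbers reported for the task should still come from the provided script, since its exact metric definitions may differ from this sketch.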

The training and test data have been run through two benchmark
systems to give a better idea of performance goals. The systems are
a simple lexical ID approach using the training data and an
off-the-shelf system, LangID, using large amounts of monolingual
tweet data (Ben King and Steven Abney. Labeling the languages of
words in mixed-language documents. In Proceedings of the North
American Chapter of the Association for Computational Linguistics,
2013). The results for these benchmark systems (obtained using the
evaluation script) are provided below.
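
The lexical benchmark is described here only as a simple lexical ID approach using the training data. One possible reading, sketched below, assigns each word the label it most often carries in training and falls back to a default for unseen words. The function names, the lowercasing, the default label, and the toy data are all assumptions, not the organizers' implementation.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def train_lexicon(tagged_tokens: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map each lowercased word to the label it most often carries in training."""
    counts = defaultdict(Counter)
    for word, label in tagged_tokens:
        counts[word.lower()][label] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def predict(words: List[str], lexicon: Dict[str, str], default: str = "lang1") -> List[str]:
    """Label each word by lexicon lookup, falling back to a default for unseen words."""
    return [lexicon.get(w.lower(), default) for w in words]


# Toy training data; real training pairs would come from the released offsets.
train = [("the", "lang1"), ("la", "lang2"), ("casa", "lang2"),
         ("the", "lang1"), ("house", "lang1"), ("la", "lang2")]
lexicon = train_lexicon(train)
print(predict(["La", "house", "bonita"], lexicon))
```

A lookup of this kind ignores context entirely, which is why it serves as a floor to compare against rather than a competitive system.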

The shared task has now begun. The test data may be found
below. Remember that the task window closes on July 27th.

For Spanish-English, Nepali-English, and Modern Standard
Arabic-Arabic dialects, “surprise genre” datasets have been provided.
The “surprise genre” datasets consist of data from Facebook,
blogs, and Arabic commentaries. Because the data comes from
different social media sources, the ID format varies from file to
file. Unlike Twitter, you will not be given a way to crawl the data
for the raw posts. Instead, each file contains the token referenced
by the offsets.

Additional “surprise genre” data has been added for Spanish-English
and Nepali-English as of 8/10/14.
**UPDATED 8/10/14**

To submit your results, please add the label, separated by a tab, at
the end of each row of the provided test data file and submit it to
[email protected]. Please do not change the order of the rows
and do not add extra newlines.
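
The submission format described above (one label appended to each row after a tab, row order unchanged) could be produced along these lines. The sketch is hedged: `append_labels` is a hypothetical helper, and the column layout of the example rows is assumed rather than taken from the released test files.

```python
from typing import Iterable, List


def append_labels(rows: Iterable[str], labels: List[str]) -> List[str]:
    """Append one tab-separated label to each row, preserving row order."""
    rows = [row.rstrip("\r\n") for row in rows]
    if len(rows) != len(labels):
        raise ValueError("need exactly one label per row")
    return [f"{row}\t{label}" for row, label in zip(rows, labels)]


# Hypothetical test rows: the column layout (tweet ID, user ID, start, end) is assumed.
test_rows = ["123\t456\t0\t4\n", "123\t456\t6\t7\n"]
submission = append_labels(test_rows, ["lang2", "lang1"])
print("\n".join(submission))
```

Stripping each row's line ending before appending keeps the label on the same line and avoids introducing the extra newlines that the instructions warn against.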

Important Dates

  • Trial data release: March 12, 2014
  • Training data release: April 30, 2014
  • Task window: July 21-27, 2014
  • Results posted: August 8, 2014
  • Second task window: August 13-17, 2014
  • Second task results posted: August 18, 2014
  • Workshop paper: July 29, 2014
  • Task papers: September 1, 2014
  • Notification for workshop papers: August 26, 2014
  • Notification for task papers: September 5, 2014
  • Camera-ready submission deadline (workshop and task papers):
    September 12, 2014
  • Workshop day: October 25, 2014


The papers should be nine pages in length with an additional two
pages for references. Please refer to the ACL format.
You can also download the style files below:

  1. Latex
    1. acl2014.tex
    2. acl2014.sty
    3. acl2014.pdf
    4. acl.bst
  2. MS-Word
    2. acl2014.pdf

Please follow this link to make a new submission:

Shared Task Paper Submission

Authors of participant systems are expected to submit a shared
task paper describing their system. The task papers should be 4
pages long + 1 page for references. If your team participates in
more than one language, and the systems are different, then you may
add up to 2 extra pages of content per system up to a maximum length
of 8 pages of content + up to 2 pages for references.

Submission system: We will use the same softconf submission
system used for the workshop papers. Please follow the link above
and log in with your START account.


To view the results please follow these links: Results of Twitter data, Results of surprise data.

Organizing Committee

  • Mona
  • Associate Professor
  • Department of Computer Science
  • George Washington University
  • [email protected]

Program Committee