EMNLP 2014: Conference on Empirical Methods in Natural Language Processing — October 25–29, 2014 — Doha, Qatar.


Modeling Large Scale Social Interaction in Massively Open Online Courses Workshop (EMNLP 2014)


Research on Massively Open Online Courses (MOOCs) is an emerging area for real-world impact of technology for analysis of social media at a large scale. The goal of this workshop is to explore what the language technologies community has to offer this endeavor. At this one-day workshop, organized around a shared task related to analysis of large-scale social interaction in MOOCs, we will evaluate the competing images of the inner workings of large-scale learning communities provided by alternative computational approaches. The workshop will address leading research through keynote talks, presentations of research papers, posters, and demos, and especially a shared task.

With the recent press given to online education and increasing enrollment in online courses, the need for scaling up quality educational experiences online has never been more urgent. Current offerings provide excellent materials including video lectures, exercises, and some forms of discussion. One important hurdle that prevents MOOCs from reaching their full potential is that they fail to provide the kind of social environment that is conducive to sustained engagement and learning. While limited, current affordances for social interaction in MOOCs have already shown some value for providing students with connection to others. These connections provide needed motivation, support, and encouragement while increasing persistence. With technology developed for social media analysis, can we as language technologies researchers offer insights that would inform improved design of such environments?

MOOCs are especially interesting as a source of large-scale social data. The unique developmental history of MOOCs creates challenges that require insight into the inner workings of massive-scale social interaction. In particular, rather than evolving gradually as better-understood forms of online communities do, MOOCs spring up rapidly once they are launched and then expand as new cohorts of students arrive from week to week to begin the course. These massive gatherings of strangers lack shared practices that would enable them to form supportive bonds of interaction or community. While some students may successfully find affinity with small groups, others who arrive later may find that an overwhelming amount of communication has already been posted, which can leave them feeling isolated. Still others may find themselves somewhere in between these two extremes. They may begin to form weak bonds with some other students when they join. However, massive attrition may create challenges as members who have begun to form bonds with fellow students soon find their virtual cohort dwindling. Early attempts to organize the community into smaller study groups may be thwarted by periodic growth spurts paired with attrition, as groups that initially had an appropriate critical mass soon fall below that level and then are unable to support the needs of remaining students. Can our models serve as useful lenses to offer insights into these social processes?


As a resource, a data set from a particularly innovative form of MOOC referred to as a cMOOC will be made publicly available along with some additional resources. The additional resources and the Intent to Participate/Data Request form are available at the following webpage: http://www.cs.cmu.edu/~cprose/MOOC-Resources.html. The supplementary data set will not formally be part of the shared task. However, we invite research contributions analyzing this data under Research papers, posters, and demos. What makes the data set particularly interesting is that it consists of communication between 2,000 students in a variety of different social media channels. The data has already been analyzed qualitatively. We offer workshop participants the opportunity to add insight into what the fields of natural language processing and machine learning have to offer this work.

Research papers, posters, and demos

Research papers, posters, and demos should focus on analysis of MOOC data or demonstration of interventions that were deployed or could be deployed in a MOOC context. Submissions will be of a technical nature, but they will be reviewed in combination by researchers from the technical side as well as the behavioral research side. The purpose is to engender a meaningful exchange between communities.

Submissions should describe original, unpublished work. Each long paper submission consists of a paper of up to nine (9) pages of content and any number of additional pages containing references only. Short papers will be presented orally or as a poster (at the discretion of the program chairs), and will be given four (4) pages plus 2 pages for references in the proceedings. Demo submissions should follow the same format as short papers. Each paper or demo submission will be reviewed by at least two program committee members. Both long and short papers should follow the two-column format of ACL 2014 proceedings. Please use the official ACL 2014 style files for the paper (and ensure that your paper is A4 size). We reserve the right to reject submissions if the paper does not conform to these styles, including letter size and font size restrictions. Reviewing will not be blind.

Shared Task

The workshop will be organized around a shared task, which will be analysis of data extracted from 6 Coursera MOOCs. Data from one MOOC with approximately 30K students will be distributed as training data.

Important update! Some people have asked for clarification of the human subjects training requirement for gaining access to the data. Clarification has been added to the Intent to Participate form. Note that the course is short and you do not have to pay for it.

Due to Institutional Review Board restrictions, we are not able to distribute the data from the 5 test MOOCs. Instead, we will run the predictive models participants provide on the 5 test MOOCs and report the results at the workshop. The prediction task will be Predicting Attrition Along the Way. Based on behavioral data from a week's worth of activity in a MOOC for a student, predict whether the student will cease to actively participate after that week. Performance will be computed based on Percent Accuracy and Cohen's Kappa.
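As an illustration of the two performance measures named above, the following is a minimal sketch in Python of percent accuracy and Cohen's Kappa for a set of predictions. The label encoding (1 for a student who drops out after the week, 0 otherwise) and the function names are assumptions made for this example; this is not the organizers' evaluation script.

```python
from collections import Counter


def percent_accuracy(gold, pred):
    """Percentage of predictions that match the gold labels."""
    if len(gold) != len(pred) or len(gold) == 0:
        raise ValueError("gold and pred must be non-empty and of equal length")
    matches = sum(1 for g, p in zip(gold, pred) if g == p)
    return 100.0 * matches / len(gold)


def cohens_kappa(gold, pred):
    """Cohen's Kappa: agreement between gold and pred, corrected for chance."""
    n = float(len(gold))
    observed = percent_accuracy(gold, pred) / 100.0
    gold_counts = Counter(gold)
    pred_counts = Counter(pred)
    labels = set(gold_counts) | set(pred_counts)
    # Chance agreement: probability that both label sources pick the same
    # label if each drew independently from its own label distribution.
    expected = sum(gold_counts[l] * pred_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        # Kappa is undefined when only one label ever occurs; returning 0.0
        # is a convention chosen for this sketch.
        return 0.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels: 1 = student drops out after the week, 0 = student stays.
gold = [1, 1, 0, 0]
pred = [1, 0, 0, 0]
print(percent_accuracy(gold, pred))  # 75.0
print(cohens_kappa(gold, pred))      # 0.5
```

Kappa is reported alongside accuracy because attrition labels tend to be imbalanced: a model that always predicts the majority class can reach high accuracy while its Kappa stays at zero.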

In a typical MOOC, between 5% and 10% of students actively participate in the threaded discussion forums. Previously published research demonstrates that characteristics of posting behavior are predictive of dropout along the way (Rosé et al., 2014; Wen et al., 2014a; Wen et al., 2014b; Yang et al., 2013; Yang et al., 2014). However, ideally, we would like to make predictions for the other 90% to 95% of students who don't post. Thus, in this shared task, we challenge participants to use the text from the minority of students who participate in the discussion forums to make meaning from the clickstream data so that a more meaningful prediction can also be made about the students who do not post to the discussion forums. We recommend that participants use the text data to bootstrap effective models that use only clickstream data. However, participants are welcome to leverage either type of data in the models they submit. They should be aware that three different evaluations will be conducted over the test data from the training MOOC as well as the 5 test MOOCs. First, an evaluation will be conducted on data from students who actively participate in the discussion forums. Second, an evaluation will be conducted on data from students who never participated in the discussion forums. Finally, an evaluation will be conducted on the set of students that includes both types of students.
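The three evaluation subsets described above can be sketched as follows. This is only an illustration under stated assumptions: the record layout (a flag for forum participation, a gold label, and a predicted label) and the function names are hypothetical, since the actual test data format is not specified here.

```python
def accuracy(pairs):
    """Fraction of (gold, predicted) pairs that agree, or None for an empty group."""
    if not pairs:
        return None
    correct = sum(1 for gold, pred in pairs if gold == pred)
    return correct / float(len(pairs))


def evaluate_by_group(records):
    """Score predictions separately for forum posters, non-posters, and everyone.

    records: iterable of (posted_in_forum, gold_label, predicted_label) tuples.
    This layout is an assumption made for the sketch.
    """
    records = list(records)
    posters = [(g, p) for posted, g, p in records if posted]
    non_posters = [(g, p) for posted, g, p in records if not posted]
    return {
        "forum_participants": accuracy(posters),
        "non_participants": accuracy(non_posters),
        "all_students": accuracy(posters + non_posters),
    }


# Hypothetical records: (posted_in_forum, gold_label, predicted_label)
example = [
    (True, 1, 1),
    (True, 0, 1),
    (False, 0, 0),
    (False, 1, 1),
    (False, 1, 0),
    (False, 0, 0),
]
print(evaluate_by_group(example))
```

Because most students never post, the combined score is dominated by the non-participant group, so a model that relies on forum text alone would be expected to do well only on the first subset.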

Each submission will consist of a write-up describing the technical approach and a link to a downloadable zip file containing the trained model and code and/or a script for using the trained model to make predictions about the test sets. The code must be runnable by launching a single script in Ubuntu 12.04. The following programming languages are acceptable: R 3.1, C++ 4.7, Java 1.6, or Python 2.7. The script must run within 24 hours on a machine with 6 cores. Some exceptions will be made by special request to the workshop organizers. Write-up submissions should fully describe the technical approach used as well as any evaluation of the technical approach conducted by the authors, and should be in camera-ready condition. Each write-up consists of a paper of up to six (6) pages of content and any number of additional pages containing references only. The paper should follow the two-column format of ACL 2014 proceedings. Please use the official ACL 2014 style files for the paper (and ensure that your paper is A4 size). We reserve the right to reject submissions if the paper does not conform to these styles, including letter size and font size restrictions. Reviewing will not be blind.

Important Dates

All communication with the organizers should be sent to emnlpmooc@gmail.com.




Carolyn Penstein Rosé

Language Technologies Institute / Human-Computer Interaction Institute, School of Computer Science, Carnegie Mellon University


George Siemens

Associate Director of Technology Enhanced Research Institute at Athabasca University