COST Action CA16105 - enetCollect : European Network for Combining Language Learning with Crowdsourcing Techniques


This wiki page contains the relevant information for the Crowdfest held in January 2018.



The first enetCollect Crowdfest was quite a success!

More than 40 people got together to address 6 different tasks over two and a half days (Brussels, COST Action offices, 23-25 January) with the aim of creating and developing real prototypes, learning resources, or frameworks that could be useful for language learning practitioners in general and enetCollect members in particular. As a result, we hope the Crowdfest will be a good start for forging new collaborations, research ideas, and experiments within enetCollect.

In the following, you can see the materials generated for each of the tasks as well as some photos of the event.

Task 1. Quest Game or "Katana and Grand Guru: A Game of the Lost Words"

- Slides:

Task 2. Question Generation for Reading Comprehension

- Slides:

Task 3. Prototyping a vocabulary trainer with implicit crowdsourcing of Language Resources (LRs)

- Slides:
- Game (Telegram bot version):
- Game (Web interface):

Task 4. Exploration of scenarios of implicit crowdsourcing for language learning

- Slides:
- Prototype Prezi output:

Task 5. Crowdsourcing corpus cleaning for language learning

- Slides:
- Diagram:
- Example of corpus cleaning:

Task 6. Business models in language learning platforms

- Slides:

Call for Interest

A Crowdfest is being organized within the action. The idea is to create, develop, or adapt a real prototype, learning resource, platform for creating learning resources, etc., that could be useful for language learning practitioners (mostly teachers and learners). In 48 hours :) In one place, in teams consisting of people with the various profiles in the action: language learning experts, crowdsourcing experts, linguists, programmers, NLP experts, and so on. Everybody is welcome.

The overall (and deliberately vague) structure of an idea or use case could be framed as: "We are going to crowdsource some data X using some platform, and then the data will be used to semi-automatically create a tool Y that will help teachers to do a task Z which is so far done manually."

Of course, you can be original, not follow this template, and propose whatever you *really* would like to see done (or try to do).

Come on! It will be fun!

Organizing Committee: Rodrigo Agerri, Branislav Bédi, Karën Fort, Verena Lyding, Lionel Nicolas, Toma Tasovac.


Registration or expression of interest via online form:

Deadline: 7th of December

There is funding for around 20 people to attend (travel and expenses covered). 

Selection criteria:
1. First come, first served (primary).
2. Profiles required for the tasks (secondary).


Tentative structure:

Duration: 2 1/2 days
Location: Brussels
Dates: 23-25 January 2018

Day 1 (starting around 9am):
  • Presentation of each of the tasks to be developed during the Crowdfest. Main objective, data, tools and personnel required. (10 minutes each)
  • Setting up of groups (balanced in terms of expertise) (colour badges per expertise).
  • Hack/crowdsource away (lunch in place if possible)
  • End of the day: collective informal dinner/drinks
Day 2 (all day):
  • 5 min presentation of progress and problems encountered during the previous day's work.
  • Hack/crowdsource away (lunch in place)
  • End of the day: collective informal dinner/drinks
Day 3 (closing by 1pm):
  • Prototype presentations, demos, problems, conclusions (30 minutes each, questions included).
  • Discussion: future work. Where do we go from here? Collaborative research...
  • Voting (per person)
  • Final act, pictures, and so forth
  • Lunch/drinks

Ideas for Tasks

1. Quest Game

Specifying (with drawings, diagrams, whatever) a quest game to foster inter-generational exchanges in/on a language between grandparents and children (especially for non-standardized or disappearing languages). Tips are given in the language, and the kids need to ask their parents/grandparents about them to be able to progress in the game (and rescue the prince). Crowdsourcing part: some questions imply written/spoken answers. They can include asking for synonyms or finding elements which do not belong in a set (like a noun in a set of verbs). Proposed by Karën Fort

2. Question Generation and its Evaluation via Crowdsourcing

The main goal of our proposal is to create an interdisciplinary community around the generation of pedagogical questions relevant to reading comprehension, grammatical constructions, etc. To that end, we propose a task focused on question generation. Participants will have the opportunity to focus on different aspects of the generation task, for which we will provide one or two automatic question generation (QG) systems that, given a text, produce questions automatically. We will also distribute the input texts required to do so. Participants could:
  • Work on the input text in order to select relevant sentence(s) to generate the questions.
  • Work on the generated questions and on defining and applying some type of post-processing in order to tune the generation process by choosing the best ones.
  • Work on the improvement of the QG system. We will distribute the code of (at least) one of the QG systems.
  • Evaluate questions via Crowdsourcing: given a pool of questions, evaluate them according to guidelines provided by the organizers. The aim will be to measure relevance, appropriateness etc. 
  • Profile of participants: language learners, teachers, NLP experts, programmers.
Proposed by Itziar Aldabe, Andrea Horbach, Oier Lopez de la Calle, Montse Maritxalar.
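
To make the generation task concrete, here is a deliberately naive, rule-based sketch (not one of the QG systems that will be distributed): it turns a simple declarative sentence into a "Who ...?" question. The regex pattern and the example sentence are illustrative only; real QG systems rely on full syntactic parsing.

```python
import re

def generate_wh_question(sentence):
    """Toy rule: turn 'X <verb>ed Y.' into 'Who <verb>ed Y?'.

    Assumes a simple subject-verb-object declarative whose subject is a
    capitalized name; anything else is rejected.
    """
    match = re.match(r"([A-Z][\w .]*?) (\w+ed) (.+)\.$", sentence)
    if not match:
        return None  # sentence does not fit the toy pattern
    subject, verb, rest = match.groups()
    return {"question": f"Who {verb} {rest}?", "answer": subject}

qa = generate_wh_question("Marie Curie discovered radium.")
print(qa)  # {'question': 'Who discovered radium?', 'answer': 'Marie Curie'}
```

Generated questions of this kind would then feed the crowdsourced evaluation step described above.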

3. Vocabulary trainer with implicit crowdsourcing of Language Resources (LRs)

The overall aim is to design and implement a vocabulary trainer for language learners, which provides interactive vocabulary exercises like "fill the gap", "select all verbs among the given words", while crowdsourcing the learners’ answers. This vocabulary trainer will build on existing mono- and bilingual corpora or lexicons to generate exercise content. For example, a corpus can be used to create fill-the-gap exercises or a lexicon can be used to create exercises related to grammatical categories of words (e.g. gender, plural forms, etc.). The collection of learners’ answers will be used to extend or improve the language resources that the exercises are built from (e.g. by asking the user to choose the grammatical category of any word within a corpus that is not part of the lexicon, new lexicon entries can be created).

Specific sub-tasks to be developed:

Subtask 3.1: Generating exercises from language resources
Develop a formal procedure for automatically generating vocabulary exercises from corpora or lexica. The exercise types to be generated will be defined prior to the crowdfest.
Profile of participants: Members with experience and expertise in generating language-related datasets
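
As a rough illustration of such a procedure, the sketch below generates a fill-the-gap exercise from a toy corpus and POS lexicon; the resources, function name, and exercise format are hypothetical stand-ins for the real datasets and exercise types to be defined before the crowdfest.

```python
import random

# Hypothetical toy resources; a real setup would load a corpus and a POS lexicon.
CORPUS = [
    "the quick fox jumps over the lazy dog",
    "she reads a long book every evening",
]
LEXICON = {"fox": "noun", "dog": "noun", "book": "noun",
           "jumps": "verb", "reads": "verb",
           "quick": "adj", "lazy": "adj", "long": "adj"}

def make_gap_exercise(sentence, target_pos, rng=random):
    """Blank out one word of the target POS; offer distractors of other POS."""
    words = sentence.split()
    candidates = [w for w in words if LEXICON.get(w) == target_pos]
    if not candidates:
        return None  # this sentence cannot yield an exercise of that type
    answer = rng.choice(candidates)
    gapped = " ".join("___" if w == answer else w for w in words)
    distractors = [w for w, pos in LEXICON.items() if pos != target_pos]
    options = sorted(set(rng.sample(distractors, 2) + [answer]))
    return {"prompt": gapped, "options": options, "answer": answer}

ex = make_gap_exercise(CORPUS[0], "verb", rng=random.Random(0))
print(ex["prompt"])  # the quick fox ___ over the lazy dog
```

The same pairing of corpus and lexicon can drive the other exercise types mentioned above (e.g. grammatical-category exercises) by varying the selection rule.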

Subtask 3.2: Design the user interface for vocabulary learning exercises
This subtask requires thinking about how information should be placed in the interface, which interactions should be possible for the user, what feedback should be provided to the user, etc. (This could be done with pen and paper or with a platform for interface design.)
Profile of participants: Members with experience and expertise in language learning interfaces and members with web programming skills

Subtask 3.3: Develop a validation mechanism to cross-match learners' answers in order to create new resource entries (this task is strongly linked to WG4 objectives and could be led by WG4). This subtask requires experimenting with different approaches for weighting answers and finding the best answer (considering variables like 'size of the set of answers', 'diversity of proficiency of learners', etc.).
Profile of participants: Members with expertise in crowdsourcing technologies and programming
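
A minimal sketch of one such validation approach, assuming a simplistic model where each answer is weighted by an estimated learner proficiency; both the thresholds and the weighting scheme are illustrative assumptions, precisely the kind of parameters this subtask would experiment with.

```python
from collections import Counter

def validate_answers(answers, min_votes=5, min_agreement=0.8):
    """Cross-match learner answers for one exercise item.

    `answers` is a list of (answer, learner_proficiency) pairs, where
    proficiency in [0, 1] weights the vote (an assumed, simplistic model).
    Returns the winning answer only when enough weighted agreement exists.
    """
    if len(answers) < min_votes:
        return None  # not enough evidence yet
    weights = Counter()
    for answer, proficiency in answers:
        weights[answer] += proficiency
    best, best_weight = weights.most_common(1)[0]
    if best_weight / sum(weights.values()) >= min_agreement:
        return best
    return None  # answers too diverse to create a resource entry

votes = [("noun", 0.9), ("noun", 0.8), ("noun", 0.7), ("verb", 0.3), ("noun", 0.6)]
print(validate_answers(votes))  # noun
```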

Subtask 3.4: Implement a web service for the interaction between the language resources/validation algorithm and the user interface
This subtask requires setting up a web service that sends automatically generated exercise content from the back-end to the interface, sends the learners' answers from the interface to the validation algorithm/language resource, and sends the feedback of the evaluation back to the user interface.
Profile of participants: Members with experience and expertise in software architecture and web services.
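
The round-trip described above can be sketched as three JSON message types; the field names below are illustrative assumptions, not a fixed API, and a real implementation would wrap them in HTTP endpoints.

```python
import json

# Hypothetical message formats for the back-end/interface round-trip;
# the field names are illustrative, not a fixed API.
def exercise_message(exercise_id, prompt, options):
    """Back-end -> interface: one automatically generated exercise."""
    return json.dumps({"id": exercise_id, "prompt": prompt, "options": options})

def answer_message(exercise_id, learner_id, answer):
    """Interface -> validation algorithm: one learner answer."""
    return json.dumps({"id": exercise_id, "learner": learner_id, "answer": answer})

def feedback_message(exercise_id, correct):
    """Validation algorithm -> interface: feedback shown to the learner."""
    return json.dumps({"id": exercise_id, "correct": correct})

msg = answer_message(42, "learner-7", "noun")
print(json.loads(msg)["answer"])  # noun
```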

Subtask 3.5: Commonsense knowledge from popular ontologies
This subtask requires experimenting with crowdsourced commonsense knowledge databases and ontologies, such as ConceptNet, Wikidata, DBpedia or YAGO, to create exercises for learners. For example, we could use ConceptNet to search for terms that are "related to" or "located at" [or other types of relations] a term and create short sentences that are missing that word, e.g., "The following things [chair, pencil, stapler] are located at ___X__. What could __X__ be?" [X is a desk]. When the learner finds the correct answer, we can ask what else is located at X (or a similar type of relation). We will capture these answers and repopulate our knowledge base and, of course, the corpus we used. Other ideas include expanding to more languages, since these commonsense knowledge bases also contain translations and synonyms for other languages, or combining them with NLP to get POS, plurals, etc. (The possibilities are endless.)
People interested can have a look at the following links:
Profile of participants:  Members with expertise in crowdsourcing technologies, language experts, experts in APIs and programming
Proposed by Verena Lyding, Lionel Nicolas and Alexander König (subtask 3.5 proposed by Christos Rodosthenous and Nikos Isaak)
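
As a small offline illustration of subtask 3.5, the sketch below builds the "located at" exercise from a hand-made relation table standing in for ConceptNet; in practice the relations would come from the ConceptNet API or a database dump.

```python
# A tiny offline stand-in for ConceptNet-style 'located at' relations; in
# practice they would come from the ConceptNet API or a database dump.
LOCATED_AT = {
    "chair": "desk",
    "pencil": "desk",
    "stapler": "desk",
    "pillow": "bed",
    "blanket": "bed",
}

def located_at_exercise(location):
    """Build the 'things located at ___' gap exercise from the relations."""
    things = sorted(t for t, loc in LOCATED_AT.items() if loc == location)
    prompt = (f"The following things {things} are located at ___. "
              "What could ___ be?")
    return {"prompt": prompt, "answer": location}

ex = located_at_exercise("desk")
print(ex["answer"])  # desk
```

Learner answers to the follow-up question ("what else is located at X?") would then flow back into the relation table, repopulating the knowledge base as described.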

4. Exploration of scenarios of implicit crowdsourcing for language learning

This task aims to create a first set of blueprints explaining how to enhance language resources (e.g. NLP resources) by implicitly crowdsourcing learner answers given while performing exercises. It is a brainstorming/exploration task with the aim of thinking through and specifying a set of implicit crowdsourcing blueprints of relevance to the different WG2 members.

The overall objective of the task relies on an implicit crowdsourcing concept of WG2, which starts from the idea that:
-> IF a language resource can be used to generate exercise content,
-> THEN some of the language learners' answers can be cross-matched and used to correct and extend the language resource used to generate the exercise content.

In fact, the scenarios we will work on will consist of (1) pairing a type of language resource (e.g. lexicons, thesauri, bilingual corpora) with a type of existing language learning exercise and explaining, for each pair of language resource and exercise, (2) why we can generate content for the exercise from the language resource and (3) how we can use the learners' answers to enhance the language resource. We also intend to identify, on the one hand, stakeholders (if not ourselves) curating language-related datasets and, on the other hand, language learning platforms offering these exercises for some languages.

If things go as expected, we would come up with scenarios where:
- a given type of language-related dataset is paired with a set of exercises,
- (optional) we have identified language learning platforms offering these exercises for some languages,
- (optional) we have identified stakeholders curating language-related datasets that could be interested in implementing the scenarios.

Example of scenarios
Lexica with part-of-speech (POS) information describing the morpho-syntactic categories of words are widely used in NLP and available for many languages. Such NLP resources contain the information needed to generate questions for classic language learning exercises, such as the ones asking to categorize verbs/adjectives/nouns/etc. (click here for an example) or exercises that provide a grid of letters and ask learners to find words of a certain syntactic category in it (click here for an example).

For the first type of exercise mentioned, the ones asking to categorize verbs/adjectives/nouns/etc., if learners tend to categorize a word with a POS that is not described in the lexicon used to generate the content of the exercise, then one possibility is that they are simply making mistakes and do not know this word well. However, if they keep on making mistakes for this word more than for the rest of the words, it could also mean that the lexicon is missing a homonym with the POS they indicate for that word (i.e. the lexicon is missing an entry for this word). We could thus define a scenario where such a combination of language learning exercises and POS lexica allows us to generate content for the exercises on the one hand and to crowdsource corrections for the NLP lexica on the other.
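
The reasoning in this scenario can be sketched as a simple per-word disagreement-rate check; the answer statistics, thresholds, and the example word are invented for illustration.

```python
def flag_missing_homonyms(stats, baseline_error=0.15, factor=2.0, min_count=20):
    """Flag words whose answers disagree with the lexicon unusually often.

    `stats` maps word -> (answers matching the lexicon POS, answers with
    another POS). Thresholds are illustrative: a word whose disagreement
    rate is well above the exercise's baseline error rate may be missing
    a homonym entry rather than reflecting learner mistakes.
    """
    flagged = []
    for word, (matching, other) in stats.items():
        total = matching + other
        if total < min_count:
            continue  # too few answers to judge this word
        if other / total > baseline_error * factor:
            flagged.append(word)
    return flagged

# Invented answer counts: 'run' draws many noun-vs-verb disagreements.
stats = {"run": (40, 25), "table": (48, 4), "fast": (10, 3)}
print(flag_missing_homonyms(stats))  # ['run']
```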

For the second type of exercise, the ones that provide a grid of letters and ask learners to find words of a certain syntactic category in it, we could generate a grid that contains words belonging to a syntactic category and ask learners to find them in the grid (e.g. five adjectives). At the same time, we could add other words that are not yet in the lexica we are using to generate the grid (e.g. a neologism appearing in a newspaper feed, "Flabbergasted") and see if learners select them. If they do so recurrently enough, this would be a good indication that the word needs to be added to the lexica, with the POS the learners keep selecting when instructed to find words of this syntactic category (e.g. adjective). We could thus define a scenario where such a combination of language learning exercises and POS lexica allows us to generate content for the exercises on the one hand and to crowdsource new entries for the NLP lexica on the other.
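
A toy version of such a grid generator, which plants the target words (including one word absent from the lexicon) into rows of random letters; the one-word-per-row layout and the word list are illustrative simplifications of a real word-search generator.

```python
import random

def build_grid(words, size=8, rng=random):
    """Place each word horizontally on its own row; pad with random letters.

    A deliberately simple layout (one word per row, no crossings); real
    word-search generators also place words vertically and diagonally.
    """
    rows = []
    for i in range(size):
        word = words[i] if i < len(words) else ""
        start = rng.randrange(size - len(word) + 1)
        row = [rng.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(size)]
        for j, ch in enumerate(word):
            row[start + j] = ch
        rows.append("".join(row))
    return rows

# Known adjectives plus 'flabby', a planted word absent from the lexicon:
grid = build_grid(["lazy", "quick", "flabby"], rng=random.Random(1))
for row in grid:
    print(row)
```

Recurrent learner selections of the planted word would then feed the same flagging logic used in the first scenario.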

Potential preparation prior to the crowdfest
Prior to the crowdfest, a listing of classic language learning exercises could be gathered. We could also reuse the summaries of the state-of-the-art effort made by WG1 members via the Zotero tool to identify papers that performed a crowdsourcing experiment via Amazon Mechanical Turk (or equivalent) and detect those that could also have been performed via a scenario combining implicit crowdsourcing and language learning.

Profile of participants
The team for this task should be composed of 3-5 enetCollect members, including 2-3 WG2 members and 1-2 WG3 members.

Proposed by Lionel Nicolas and Verena Lyding.

5. Spin-off ideas from the WG1 hands-on meeting

Corpora are a great source for the development of language learning resources. However, teachers usually prefer that the data obtained be checked and pre-approved/filtered before actual pedagogical use, so that it does not contain "problematic" issues (both from the perspective of correctness and from the perspective of content). One automatically-created language learning resource based on "checked" corpora is SkELL (Sketch Engine for Language Learning), where texts containing the so-called PARSNIPs (Pork - Alcohol - Racism - Sex - Narcotics - Isms - Politics) have been excluded from the corpora. In previous projects, the filtering of such sensitive words (and, by extension, content) was done automatically with the use of predefined seedwords. However, this approach removes a lot of data in a somewhat uncontrolled way, while on the other hand many problems remain untackled (or even unidentified).

One of the spin-off ideas from the WG1 hands-on meeting in Gothenburg was thus to use crowdsourcing to support the cleaning up of a corpus so that it can be used for language learning purposes. The pilot case is the cleaning up of a 3.8-billion-word corpus of Portuguese in order to develop SkELL for Portuguese (Tanara Zingano Kuhn). The proposed model is language-independent, and we already know that other languages in the network have interest and, more importantly, the right conditions to apply it. In Gothenburg, we prepared a workflow for the task that includes developing the gold standard for the Portuguese corpus, followed by the use of Pybossa with the general public, which involves collecting evaluations on the suitability of sentences for pedagogical purposes. Furthermore, the crowdsourced data can be used to train a system for automatic filtering of examples, while additional research on how "the crowd" separated acceptable examples from non-acceptable ones will help us gain better insight into the criteria applied (which have to be further addressed, as the desired "corpus censorship" needs to be well understood and applied with care).
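
For contrast with the crowdsourced approach, here is a sketch of the kind of coarse seedword filtering used in previous projects; the seedword list and sentences are illustrative, and the example shows how whole sentences are dropped wholesale once any seedword appears.

```python
# Hypothetical seedword filter of the kind used in earlier projects: any
# sentence containing a word from a predefined sensitive list is dropped,
# which is fast but removes data in a rather uncontrolled way.
SEEDWORDS = {"pork", "alcohol", "politics"}  # illustrative PARSNIP terms

def filter_sentences(sentences):
    """Split sentences into (kept, dropped) based on the seedword list."""
    kept, dropped = [], []
    for sentence in sentences:
        tokens = {t.strip(".,;!?").lower() for t in sentence.split()}
        (dropped if tokens & SEEDWORDS else kept).append(sentence)
    return kept, dropped

corpus = [
    "The recipe calls for pork and rice.",
    "The students read the chapter quietly.",
]
kept, dropped = filter_sentences(corpus)
print(len(kept), len(dropped))  # 1 1
```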

Link to the WG1 meeting:

Proposed by Tanara Zingano Kuhn (teacher, language learning, corpus linguist) and Peter Dekker (NLP) (with the great contribution of Špela Arhar and Margaret Bielenia).

6. Choice of a business model for the language-learning crowdsourcing platform
The main goal is to analyze existing business models used by successful crowdsourcing platforms and to discuss their advantages and disadvantages, funding, start-up history, and dynamic characteristics such as growth in the number of users. An outcome of the discussion should be a definition/description of a business model that would be suitable for a language-learning crowdsourcing platform, considering its funding prospects and commercialization opportunities.
Participants: anyone interested in start-ups with some entrepreneurship and business experience.

Proposed by Nina Gorovaia


  1. Minutes of the First Core Group meeting:

Last edited: 04. Feb 2019, 12:33, Nicolas, Lionel [lnicolas]
