Exploring Reusability and Reproducibility for a Research Infrastructure for L1 and L2 Learner Corpora
Abstract
1. Introduction
- provide harmonized L1 and L2 corpora with a clear focus on reusability of the data and
- ensure reproducible research results by integrating research workflows in L1 and L2 corpus research.
2. Materials and Methods
2.1. Data
2.1.1. Kolipsi-1 (L2)
2.1.2. Kolipsi-1 (L1)
2.1.3. Kolipsi-2
2.1.4. KoKo
2.1.5. Merlin
2.1.6. LEONIDE
2.2. Making L1 and L2 Corpus Data Available to the Research Community
2.3. Ensuring the Provision of FAIR Data
2.3.1. Findability
2.3.2. Accessibility
2.3.3. Interoperability
2.3.4. Reusability
- the background of the writers (e.g., L1, age, proficiency, other L2s)
- the writing task itself (target language, genre, writing prompts etc.)
3. Results
3.1. Base Components and Implementation of the Infrastructure
3.1.1. Porta—A Central Access Point for SSH Researchers
- Human-readable and unified documentation. The documentation presents:
  - the background of the corpus and its corpus design
  - statistics on the corpus, such as the number of texts and tokens, the number of writers, the represented languages, and the language backgrounds of the writers
  - transcription and annotation guidelines
  - the annotation schemes used and a description of the corpus creation procedures
- Links to the FAIR data resources in a long-term archive (see Section 3.1.2 for details)
- A unified search interface to query the corpora directly (see Section 3.1.3 for details)
3.1.2. CLARIN-DSpace
3.1.3. ANNIS
3.2. Integrating L1 and L2 Learner Corpus Research Workflows for Reproducibility
3.2.1. Versioning of Corpora Using Git—The Case of Merlin
3.2.2. Processing Pipelines and Reproducibility—KoKo
3.3. Ensuring Comparability and Reusability through FAIRness of the Integrated L1 and L2 Corpora
3.3.1. Findability
3.3.2. Accessibility
3.3.3. Interoperability
3.3.4. Reusability
4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Corpus | Size in Texts | Text Language | Data Collection Time |
---|---|---|---|
Kolipsi-1_L2 | ca. 2 500 | German, Italian (L2) | 2007 |
Kolipsi-1_L1 | ca. 500 | German, Italian (L1) | 2010 |
Kolipsi-2 | ca. 2 500 | German, Italian (L2) | 2014 |
KoKo | ca. 1 500 | German (L1) | 2011 |
LEONIDE | ca. 2 500 | German, Italian, English (L1, L2, L3) | 2015-2018 |
Merlin | ca. 2 300 | German, Italian, Czech (L2) | 2012 |
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
König, A.; Frey, J.-C.; Stemle, E.W. Exploring Reusability and Reproducibility for a Research Infrastructure for L1 and L2 Learner Corpora. Information 2021, 12, 199. https://doi.org/10.3390/info12050199