Documentation associated with the CorCenCC project

CorCenCC/documentation


CorCenCC project overview and how to cite

CorCenCC (Corpws Cenedlaethol Cymraeg Cyfoes – the National Corpus of Contemporary Welsh) is an interdisciplinary and multi-institutional project that has created a large-scale, open-source corpus of contemporary Welsh. The CorCenCC corpus contains over 11 million words (circa 14 million tokens) from written, spoken and electronic (online and digital text) Welsh-language sources, drawn from a range of genres, language varieties (regional and social) and contexts. The contributors to CorCenCC are representative of the more than half a million Welsh speakers in the country.

Information on how to request access to the CorCenCC dataset is available here: www.corcencc.org/download

The CorCenCC corpus and associated software tools are licensed under Creative Commons CC-BY-SA v4 and are therefore freely available for use by professional communities and individuals with an interest in language. Bespoke applications and instructions are provided for each tool. When reporting information derived from the CorCenCC corpus data and/or tools, CorCenCC should be acknowledged as follows:

  • CorCenCC corpus: Knight, D., Morris, S., Fitzpatrick, T., Rayson, P., Spasić, I., Thomas, E-M., Lovell, A., Morris, J., Evas, J., Stonelake, M., Arman, L., Davies, J., Ezeani, I., Neale, S., Needs, J., Piao, S., Rees, M., Watkins, G., Williams, L., Muralidaran, V., Tovey-Walsh, B., Anthony, L., Cobb, T., Deuchar, M., Donnelly, K., McCarthy, M. and Scannell, K. (2020). CorCenCC: Corpws Cenedlaethol Cymraeg Cyfoes – the National Corpus of Contemporary Welsh. Cardiff University. http://doi.org/10.17035/d.2020.0119878310

  • Report: Knight, D., Morris, S., Fitzpatrick, T., Rayson, P., Spasić, I. and Thomas, E. M. (2020). The National Corpus of Contemporary Welsh: Project Report | Y Corpws Cenedlaethol Cymraeg Cyfoes: Adroddiad y Prosiect. arXiv:2010.05542, October 2020. Available online at: https://arxiv.org/abs/2010.05542 (also see www.corcencc.org/outputs for a PDF version of this report)

  • CorCenCC’s infrastructure and crowdsourcing app: Knight, D., Loizides, F., Neale, S., Anthony, L. and Spasić, I. (2020). Developing computational infrastructure for the CorCenCC corpus – the National Corpus of Contemporary Welsh. Language Resources and Evaluation (LREV). https://doi.org/10.1007/s10579-020-09501-9

  • CorCenCC’s part-of-speech (POS) tagger ‘CyTag’: Neale, S., Donnelly, K., Watkins, G. and Knight, D. (2018). Leveraging Lexical Resources and Constraint Grammar for Rule-Based Part-of-Speech Tagging in Welsh. Poster presented at the LREC (Language Resources Evaluation) 2018 Conference, May 2018, Miyazaki, Japan. https://www.aclweb.org/anthology/L18-1623/

  • CorCenCC’s semantic tagger ‘CySemTagger’: Piao, S., Rayson, P., Knight, D. and Watkins, G. (2018). Towards a Welsh Semantic Annotation System. Proceedings of the LREC (Language Resources Evaluation) 2018 Conference, May 2018, Miyazaki, Japan. https://www.aclweb.org/anthology/L18-1158/

  • CorCenCC’s pedagogic toolkit ‘Y Tiwtiadur’: Davies, J., Thomas, E-M., Fitzpatrick, T., Needs, J., Anthony, L., Cobb, T. and Knight, D. (2020). Y Tiwtiadur. [Digital Resource]. Available at: https://www.corcencc.org/Y-Tiwtiadur

  • CorCenCC’s word frequency lists ‘Yr Amliadur’: Details coming soon

Acknowledgements

This work was carried out as part of the UK Economic and Social Research Council (ESRC) and Arts and Humanities Research Council (AHRC) funded Corpws Cenedlaethol Cymraeg Cyfoes (The National Corpus of Contemporary Welsh): A community driven approach to linguistic corpus construction project (Grant Number ES/M011348/1). For more information, go to www.corcencc.org | www.corcencc.cymru

Documentation provided
