Corpus Linguistics | SSH Training Discovery Toolkit

Item
Title	Body
Computer-mediated communication corpora	This is a list of computer-mediated communication corpora that are available as part of the CLARIN Resource Families initiative. Computer-mediated communication (CMC) covers public and private communication online, such as posts on blogs and forums, comments on online news sites, social media and networking sites such as Twitter and Facebook, instant chat rooms, mobile phone applications such as WhatsApp, and e-mail. Because corpora of computer-mediated communication often include very informal styles of writing, they are of interest to a wide range of research fields, including language variation, pragmatics, and media and communication studies. They are also important for the development of robust NLP tools that can deal with non-standard spelling, vocabulary and grammar. The compilation and dissemination of such corpora are hindered by the unclear legal status of CMC data when distributed as a resource to the scientific community, a problem further exacerbated by content providers' rapidly changing terms of service.

Source
Title	Body
CLARIN Legal Information Platform	The platform aims to introduce researchers to basic notions of the legislative and licensing framework in Europe for copyright and data protection. It covers: Introduction to Copyright and Related Rights; Licensing Practice; Personal Data Protection. It also includes proposals for: Further reading/Bibliography on Legal and Ethical Issues; Useful links on Legal and Ethical Issues.
CLARIN Depositing Services	One of the fundamental services of the CLARIN infrastructure is making sure that language resources can be archived and made available to the community in a reliable manner. To help researchers store their resources (e.g. corpora, lexica, audio and video recordings, annotations, grammars, etc.) in a sustainable way, many of the CLARIN centres offer a depositing service. They are willing to store the resources in their repository and assist with the technical and organisational details. This has a wide range of advantages: (1) long-term archiving: a storage guarantee can be given for a long period (up to 50 years in some cases); (2) resources can be cited easily with a persistent identifier; (3) the resources and their metadata are integrated into the infrastructure, making it possible to search them efficiently; (4) password-protected resources can be made available via an institutional login; (5) once resources are integrated in the CLARIN infrastructure, they can be analysed and enriched more easily with various linguistic tools (e.g. automated part-of-speech tagging, phonetic alignment or audio/video analysis).
CLARIN Resource Families	The aim of the CLARIN Resource Families initiative is to provide a user-friendly overview of the available language resources in the CLARIN infrastructure for researchers from digital humanities, social sciences and human language technologies. The overviews are organized according to the types of data in the resources and include listings sorted by language. The listings include brief descriptions and the most important metadata, such as resource size, text sources, time periods, annotations and licences, as well as links to download pages and concordancers, whenever available. In addition to the resources found in the CLARIN infrastructure, CLARIN Resource Families provides an overview of other valuable language resources that have not yet been integrated into the infrastructure. CLARIN Resource Families also provides hyperlinks to other relevant materials, such as the thematic CLARIN workshops and tutorials and their accompanying video lectures, as well as a list of key publications on the resources surveyed.