The SSH Training Discovery Toolkit provides an inventory of training materials relevant for the Social Sciences and Humanities.

Use the search bar to discover materials or browse through the collections. The filters will help you identify your area of interest.

 

Manually curated language resource overviews, organized by discipline

Item
Title Body
Parallel corpora

This is a list of parallel corpora that are available as part of the CLARIN Resource Families initiative.

Parallel corpora are central to translation studies and contrastive linguistics. Many of the parallel corpora are accessible through easy-to-use concordancers, which considerably facilitates the study of interlinguistic phenomena. Such corpora are also a rich source of materials for language teaching. Furthermore, parallel corpora serve as training data for statistical machine translation systems.

Newspaper corpora

This is a list of newspaper corpora that are available as part of the CLARIN Resource Families initiative.

Collections of newspapers in digital form are a rich source of information for researchers in a number of disciplines in the Humanities and Social Sciences. They are especially valuable for synchronic as well as diachronic studies in fields ranging from history and media and communication studies to lexicography, for which newspapers are a rich source of neologisms and other lexicographic phenomena.

Literary corpora

This is a list of literary corpora that are available as part of the CLARIN Resource Families initiative.

Literary corpora comprise poetry and fictional prose texts, such as novels, short stories and plays. They bring together the collected works of a single author or works representative of a specific literary period. Since literary corpora are often available through powerful concordancers, they are especially well suited to quantitative and qualitative approaches to comparative literary analysis, within or across different genres and historical periods.

L2 learner corpora

This is a list of L2 learner corpora that are available as part of the CLARIN Resource Families initiative.

L2 learner corpora play a crucial role in second language research and pedagogy, allowing for a systematic study of how a learner of a second language acquires the new language on the lexical as well as the syntactic level, and how this process is influenced by the learner's native language. A special characteristic of this type of corpus is the markup of the learners' errors and prosodic features.

Historical corpora

This is a list of historical corpora that are available as part of the CLARIN Resource Families initiative.

The CLARIN ERIC infrastructure offers access to historical corpora that cover almost all of the languages spoken in countries that are either members or observers of CLARIN ERIC. In the vast majority of cases, the corpora can be directly downloaded from the national repositories or queried through easy-to-use online search environments. They are also richly tagged and mostly available under public licences.

 

Corpora of academic texts

This is a list of academic corpora that are available as part of the CLARIN Resource Families initiative.

Corpora of academic texts contain scholarly writing, which includes research papers, essays and abstracts published in academic journals, conference proceedings and edited volumes; theses written by students at the undergraduate and graduate levels; and scientific monographs.

 

Computer-mediated communication corpora

This is a list of computer-mediated communication corpora that are available as part of the CLARIN Resource Families initiative.

Computer-mediated communication (CMC) comprises public and private online communication, such as posts on blogs and forums, comments on online news sites, social media and networking sites such as Twitter and Facebook, instant chat rooms, mobile phone applications such as WhatsApp, and e-mail. Because corpora that compile computer-mediated communication often include very informal styles of writing, they are interesting for a wide range of research fields, such as language variation, pragmatics, and media and communication studies. They are also very important for the development of robust NLP tools that can deal with non-standard spelling, vocabulary and grammar. Compilation and dissemination of such corpora are hindered by the unclear legal status of CMC data when distributed as a resource to the scientific community, which is further exacerbated by the rapidly changing terms of service of content providers.

Source
Title Body
CLARIN Resource Families

The aim of the CLARIN Resource Families initiative is to provide a user-friendly overview of the available language resources in the CLARIN infrastructure for researchers from digital humanities, social sciences and human language technologies. The overviews are organized according to the types of data in the resources and include listings sorted by language.

The listings include the most important metadata and brief descriptions, such as resource size, text sources, time periods, annotations and licences as well as links to download pages and concordancers, whenever available. In addition to the resources found in the CLARIN infrastructure, CLARIN Resource Families provides an overview of other existing valuable language resources which have not yet been integrated in the infrastructure.

CLARIN Resource Families also provides hyperlinks to other relevant materials, such as the thematic CLARIN workshops and tutorials and their accompanying video lectures, as well as a list of key publications on the resources surveyed.