The SSH Training Discovery Toolkit provides an inventory of training materials relevant for the Social Sciences and Humanities.

Use the search bar to discover materials or browse through the collections. The filters will help you identify your area of interest.

 

Text encoding

Glossaries

This is a list of glossaries that are available as part of the CLARIN Resource Families initiative.

Glossaries are specialised dictionaries that contain domain-specific terminology and/or expressions. In the vast majority of cases, the glossaries can be downloaded directly from CLARIN national repositories or queried through easy-to-use online search environments.

Conceptual resources

This is a list of conceptual resources that are available as part of the CLARIN Resource Families initiative.

Concept-based resources include onomasiological lexical resources such as wordnets, framenets, thesauri and ontologies. The entries in such resources are typically interlinked through semantic relations (e.g. hypernymy, hyponymy). In the vast majority of cases, the conceptual resources can be downloaded directly from the national repositories or queried through easy-to-use online search environments.
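To illustrate how semantic relations interlink the entries of a concept-based resource, here is a minimal Python sketch. The `HYPERNYMS` table and the helpers `hypernym_chain` and `hyponyms` are hypothetical names invented for this example; the entries are made up and are not taken from any CLARIN resource.

```python
# Toy concept network: each concept maps to its direct hypernym
# (the more general concept). The entries are invented for illustration
# and are not drawn from a real wordnet.
HYPERNYMS = {
    "poodle": "dog",
    "dog": "mammal",
    "cat": "mammal",
    "mammal": "animal",
}


def hypernym_chain(concept):
    """Follow hypernymy links upward from a concept to the most general one."""
    chain = []
    while concept in HYPERNYMS:
        concept = HYPERNYMS[concept]
        chain.append(concept)
    return chain


def hyponyms(concept):
    """Return the direct hyponyms (more specific concepts), sorted by name."""
    return sorted(child for child, parent in HYPERNYMS.items() if parent == concept)


print(hypernym_chain("poodle"))  # ['dog', 'mammal', 'animal']
print(hyponyms("mammal"))        # ['cat', 'dog']
```

Real wordnets store many more relation types and allow a concept to have several hypernyms; the single-parent table here is a simplification chosen to keep the example short.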

Dictionaries

This is a list of dictionaries that are available as part of the CLARIN Resource Families initiative.

Dictionaries were primarily created for human use (e.g., language learning/teaching, translation, lexicology) and are typically semasiological, which means that they are organized around words and contain information on their meanings, definitions, pronunciation, etc. 

Lexica

This is a list of lexica that are available as part of the CLARIN Resource Families initiative.

Lexica are primarily used in NLP applications. They typically contain an extensive lexical inventory with specific linguistic information (e.g., morphosyntax, sentiment). In the vast majority of cases, the lexica can be downloaded directly from CLARIN repositories or queried through easy-to-use online search environments.

Spoken corpora

This is a list of spoken corpora that are available as part of the CLARIN Resource Families initiative.

Corpora of spoken language contain transcriptions of spontaneous or planned speech, such as broadcast news or elicited narratives and dialogues. They are often aligned with the accompanying recordings. They are an invaluable resource for various kinds of linguistic research, such as phonology, conversation analysis, and dialectology. Such corpora are carefully sampled and rich in sociodemographic metadata.

Parliamentary corpora

This is a list of parliamentary corpora that are available as part of the CLARIN Resource Families initiative.

Parliamentary corpora are an important multidisciplinary language resource that can be approached from many research perspectives, including political science, sociology, history, psychology, and applied approaches to linguistics such as critical discourse analysis. The wide availability of parliamentary proceedings in digitized form, together with the access rights to public information granted in EU countries, has motivated a number of national and international initiatives to compile, process and analyse parliamentary corpora.

Manually annotated corpora

This is a list of manually annotated corpora that are available as part of the CLARIN Resource Families initiative.

Manually annotated corpora are collections of texts containing manually validated or manually assigned linguistic information, such as morphosyntactic tags, lemmas, syntactic parses, and named entities. These corpora can be used to train new language annotation tools as well as to test the accuracy of existing annotation tools.
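As a rough illustration of how manual annotations can serve as a reference when testing an annotation tool, the Python sketch below compares a tool's output with manually assigned tags, token by token. The function name `tagging_accuracy`, the tag labels, and the sample sentence are all assumptions made for this example.

```python
def tagging_accuracy(gold, predicted):
    """Share of tokens whose predicted tag matches the manually assigned tag.

    Both arguments are lists of (token, tag) pairs over the same tokens.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        return 0.0
    correct = sum(1 for (_, g), (_, p) in zip(gold, predicted) if g == p)
    return correct / len(gold)


# Hypothetical manually annotated sentence and a tool's output for it.
gold = [("The", "DET"), ("dogs", "NOUN"), ("bark", "VERB"), ("loudly", "ADV")]
predicted = [("The", "DET"), ("dogs", "NOUN"), ("bark", "NOUN"), ("loudly", "ADV")]

print(tagging_accuracy(gold, predicted))  # 0.75
```

Token-level accuracy is only one possible measure; tasks such as named entity recognition are usually evaluated with precision, recall and F-score instead.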

The corpora and corpus collections are classified into 6 categories based on the type of manual annotation:

Parallel corpora

This is a list of parallel corpora that are available as part of the CLARIN Resource Families initiative.

Parallel corpora are central to translation studies and contrastive linguistics. Many of the parallel corpora are accessible through easy-to-use concordancers, which considerably facilitates the study of interlinguistic phenomena. Such corpora are also a rich source of materials for language teaching. Furthermore, parallel corpora serve as training data for statistical machine translation systems.
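The basic idea behind a parallel concordancer can be sketched in a few lines of Python: given sentence-aligned text pairs, return every pair whose source side contains the search term. The `ALIGNED` sample data and the `concordance` function are hypothetical and serve only as an illustration; the concordancers offered for the listed corpora are far more capable.

```python
import re

# Hypothetical sentence-aligned English-German pairs, invented for illustration.
ALIGNED = [
    ("The house is small.", "Das Haus ist klein."),
    ("I like the garden.", "Ich mag den Garten."),
    ("The house has a garden.", "Das Haus hat einen Garten."),
]


def concordance(pairs, term):
    """Return the aligned pairs whose source sentence contains the term as a word."""
    term = term.lower()
    hits = []
    for source, target in pairs:
        # Split the source sentence into lowercase word tokens, ignoring punctuation.
        tokens = re.findall(r"\w+", source.lower())
        if term in tokens:
            hits.append((source, target))
    return hits


for source, target in concordance(ALIGNED, "house"):
    print(source, "|", target)
```

Showing each hit next to its aligned translation is what makes it easy to compare how a word or construction is rendered in the other language.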

Newspaper corpora

This is a list of newspaper corpora that are available as part of the CLARIN Resource Families initiative.

Collections of newspapers in digital form are a rich source of information for researchers in a number of disciplines in the Humanities and Social Sciences. They are especially valuable for synchronic as well as diachronic studies in fields ranging from history and media and communication studies to lexicography, for which newspapers are a rich source of neologisms and other lexicographic phenomena.

L2 learner corpora

This is a list of L2 learner corpora that are available as part of the CLARIN Resource Families initiative.

L2 learner corpora play a crucial role in second language research and pedagogy, allowing for a systematic study of how a learner of a second language acquires the new language on a lexical as well as syntactic level, and how this acquisition is influenced by the learner's native language. A special characteristic of this type of corpus is the markup of learners' errors and prosodic features.