The SSH Training Discovery Toolkit provides an inventory of training materials relevant for the Social Sciences and Humanities.

Use the search bar to discover materials or browse through the collections. The filters will help you identify your area of interest.

 

Computational Linguistics

Computational Morphology with HFST

The course demonstrates how HFST tools can be used to build finite-state morphologies. Through practical exercises, students learn how to use finite-state methods to develop a morphology for a language. This online course is suitable as a complement to a more theory- or linguistics-oriented course on morphology.

After successfully completing the course:

- you can explain the basic theory of finite-state automata and transducers,

- you can design morphological lexica using finite-state technology,

- you know how to write morpho-phonological rules in a finite-state framework,

- you understand the diversity of morphological structure in different languages, and you know how to take these differences into account when designing computational models of morphology.

 

Taken from Teaching with CLARIN: https://www.clarin.eu/content/computational-morphology-hfst 
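
To illustrate the finite-state idea behind the course described above, here is a minimal sketch of a toy transducer in plain Python. It does not use HFST or its file formats; the states, arcs, tags (`+N`, `+Sg`, `+Pl`) and the `generate` function are all invented for this example.

```python
# Toy finite-state transducer (hypothetical, not HFST).
# Each arc maps (state, input symbol) to (output string, next state).
ARCS = {
    (0, "cat"): ("cat", 1),
    (0, "dog"): ("dog", 1),
    (1, "+N"): ("", 2),    # the part-of-speech tag has no surface realisation
    (2, "+Sg"): ("", 3),   # singular: no suffix
    (2, "+Pl"): ("s", 3),  # plural: suffix -s
}
FINAL_STATES = {3}


def generate(symbols):
    """Return the surface form for a sequence of analysis symbols, or None if rejected."""
    state, output = 0, []
    for symbol in symbols:
        arc = ARCS.get((state, symbol))
        if arc is None:
            return None  # no matching arc: the transducer rejects the input
        out, state = arc
        output.append(out)
    return "".join(output) if state in FINAL_STATES else None


print(generate(["cat", "+N", "+Pl"]))
```

A real morphology would also need morpho-phonological rules (for example, for plurals such as "buses"), which this sketch leaves out.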

Copyright & Related rights

This section is an introduction to copyright notions and related rights:

Tools for named entity recognition

This is a list of tools for named entity recognition that are available as part of the CLARIN Resource Families initiative.

Named entity recognition (NER) is an information extraction task which identifies mentions of named entities in unstructured text and classifies them into predetermined categories, such as person names, organisations, locations, dates and times, and monetary values. NER tools can, for example, help with the classification of news content, content recommendations and search algorithms.
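
As a rough illustration of the task, the following sketch tags entities with a tiny hand-written gazetteer and one regular expression. It is only a toy: the tools in the list typically rely on trained models, and the names `GAZETTEER`, `MONEY` and `tag_entities` as well as the labels are assumptions made for this example.

```python
import re

# Hypothetical gazetteer: surface form -> entity category.
GAZETTEER = {"Ada Lovelace": "PER", "CLARIN": "ORG", "Utrecht": "LOC"}
# Very simplified pattern for monetary values such as $20 or $3.50.
MONEY = re.compile(r"\$\d+(?:\.\d+)?")


def tag_entities(text):
    """Return (mention, category) pairs in order of appearance in the text."""
    found = []
    for name, label in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            found.append((match.start(), match.group(), label))
    for match in MONEY.finditer(text):
        found.append((match.start(), match.group(), "MONEY"))
    # Sort by character offset so mentions come out in reading order.
    return [(mention, label) for _, mention, label in sorted(found)]


print(tag_entities("Ada Lovelace visited CLARIN in Utrecht and paid $20."))
```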

Tools for normalization

This is a list of tools for text normalization that are available as part of the CLARIN Resource Families initiative.

Text normalization is the process of transforming parts of a text into a single canonical form. It is one of the key stages of linguistic processing for texts in which spelling varies widely or deviates from the contemporary norm, such as historical documents or social media posts. After text normalization, standard tools can be used for all further stages of text processing. Another important advantage of text normalization is improved search: a query for a single, standard variant also retrieves all of its spelling variants, whether historical, dialectal, colloquial or slang.
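
The search benefit can be sketched with a simple lookup table. This is a minimal example under strong assumptions: the `VARIANTS` table and both functions are hypothetical, and real normalization tools usually go well beyond word-by-word lookup.

```python
# Hypothetical table mapping spelling variants to a canonical form.
VARIANTS = {
    "vniuersitie": "university",  # historical spelling
    "universitie": "university",
    "u": "you",                   # social media spelling
    "thru": "through",
}


def normalize(tokens):
    """Lowercase each token and replace known variants with their canonical form."""
    return [VARIANTS.get(token.lower(), token.lower()) for token in tokens]


def search(documents, query):
    """Return the indices of documents whose normalized tokens contain the query."""
    return [
        index
        for index, document in enumerate(documents)
        if query in normalize(document.split())
    ]


docs = ["the vniuersitie of Oxford", "c u at the universitie", "no match here"]
print(search(docs, "university"))
```

Querying the standard form "university" finds both documents even though neither contains that exact spelling.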

Wordlists

This is a list of wordlists that are available as part of the CLARIN Resource Families initiative.

Wordlists are lexical resources which provide only alphabetical or frequency-based lexical inventories. In the vast majority of cases, the wordlists can be downloaded directly from CLARIN national repositories or queried through easy-to-use online search environments.
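
For readers who want to see what such inventories look like, here is a small sketch that derives both an alphabetical and a frequency-based wordlist from raw text. The function name `build_wordlists` and the simple letter-based tokenization are assumptions for illustration, not how any particular CLARIN wordlist was produced.

```python
import re
from collections import Counter


def build_wordlists(text):
    """Return (alphabetical wordlist, frequency list) for the given text."""
    # Tokens are runs of letters; digits and punctuation are discarded.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(tokens)
    alphabetical = sorted(counts)
    # Most frequent first; ties are broken alphabetically.
    by_frequency = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return alphabetical, by_frequency


alphabetical, by_frequency = build_wordlists("The cat and the dog and the bird.")
print(alphabetical)
print(by_frequency)
```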

Conceptual resources

This is a list of conceptual resources that are available as part of the CLARIN Resource Families initiative.

Concept-based resources include onomasiological lexical resources such as wordnets, framenets, thesauri and ontologies. Entries in such resources are typically interlinked through semantic relations (e.g. hypernymy, hyponymy). In the vast majority of cases, the conceptual resources can be downloaded directly from the national repositories or queried through easy-to-use online search environments.
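
The way semantic relations link entries can be sketched with a tiny hand-made hierarchy. The `HYPERNYM` table below is invented for this example and is far simpler than a real wordnet, where a concept may have several hypernyms.

```python
# Hypothetical mini-hierarchy: each word points to its direct hypernym.
HYPERNYM = {
    "poodle": "dog",
    "dog": "canine",
    "canine": "mammal",
    "mammal": "animal",
}


def hypernym_chain(word):
    """Return all hypernyms of a word, from most specific to most general."""
    chain = []
    while word in HYPERNYM:
        word = HYPERNYM[word]
        chain.append(word)
    return chain


def is_hyponym_of(word, ancestor):
    """True if `ancestor` appears somewhere above `word` in the hierarchy."""
    return ancestor in hypernym_chain(word)


print(hypernym_chain("dog"))
print(is_hyponym_of("poodle", "mammal"))
```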

Lexica

This is a list of lexica that are available as part of the CLARIN Resource Families initiative.

Lexica are primarily used in NLP applications. They typically contain an extensive lexical inventory with specific linguistic information (e.g., morphosyntax, sentiment). In the vast majority of cases, the lexica can be downloaded directly from CLARIN repositories or queried through easy-to-use online search environments.

Parliamentary corpora

This is a list of parliamentary corpora that are available as part of the CLARIN Resource Families initiative.

Parliamentary corpora are an important multidisciplinary language resource that can be approached from many research perspectives, including not only political science but also sociology, history, psychology, and applied approaches to linguistics such as critical discourse analysis. The wide availability of parliamentary proceedings in digitized form, together with rights of access to public information in EU countries, has motivated a number of national and international initiatives to compile, process and analyse parliamentary corpora.

Manually annotated corpora

This is a list of manually annotated corpora that are available as part of the CLARIN Resource Families initiative.

Manually annotated corpora are collections of texts containing manually validated or manually assigned linguistic information, such as morphosyntactic tags, lemmas, syntactic parses and named entities. These corpora can be used to train new language annotation tools as well as to test the accuracy of existing annotation tools.
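
Testing a tool against manual annotation usually comes down to comparing its output with the gold-standard labels token by token. The following is a minimal sketch of token-level accuracy; the function name and the example tags are assumptions, and real evaluations often report further measures such as precision and recall.

```python
def tagging_accuracy(gold, predicted):
    """Return the share of tokens whose predicted tag matches the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


gold_tags = ["DET", "NOUN", "VERB", "NOUN"]  # manually assigned tags
tool_tags = ["DET", "NOUN", "NOUN", "NOUN"]  # hypothetical tool output
print(tagging_accuracy(gold_tags, tool_tags))
```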

The corpora and corpus collections are classified into 6 categories based on the type of manual annotation:

Parallel corpora

This is a list of parallel corpora that are available as part of the CLARIN Resource Families initiative.

Parallel corpora are central to translation studies and contrastive linguistics. Many of the parallel corpora are accessible through easy-to-use concordancers, which considerably facilitates the study of interlinguistic phenomena. Such corpora are also a rich source of materials for language teaching. Furthermore, parallel corpora serve as training data for statistical machine translation systems.
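
What a parallel concordancer does can be sketched in a few lines: look up a keyword in one language and return each matching sentence together with its aligned translation. The sentence pairs and the `concordance` function below are invented for illustration and assume the corpus is already sentence-aligned.

```python
import re

# Hypothetical sentence-aligned English-German pairs.
PAIRS = [
    ("The house is old.", "Das Haus ist alt."),
    ("She reads a book.", "Sie liest ein Buch."),
    ("My house has a garden.", "Mein Haus hat einen Garten."),
]


def concordance(pairs, keyword):
    """Return the aligned pairs whose source sentence contains the keyword as a word."""
    keyword = keyword.lower()
    return [
        (source, target)
        for source, target in pairs
        if keyword in re.findall(r"\w+", source.lower())
    ]


for source, target in concordance(PAIRS, "house"):
    print(source, "|", target)
```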