Linguistics

Item
Linguistic annotation of corpora

This scenario explains the steps to take to annotate a corpus in order to conduct linguistic and statistical analysis based on it. It is intended as general information for people starting out with linguistic annotation. The aim is to provide a generic scenario with no specific tool in mind: we refer to tools but do not specify how to use them. Various tools and frameworks can perform the steps in this scenario, depending on the language(s) you are working with and your programming environment. A number of toolboxes for natural language processing (NLP) exist that can perform several of the annotation steps in an integrated way; these resources are listed below under "Using an existing NLP pipeline". The next step after performing the procedures described in this scenario is usually to load the annotated corpus into a corpus query engine in order to query and analyze it based on its annotations. Some popular query engines already provide a built-in pipeline that performs the basic processing steps in one go, taking most of the burden off the user.
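To make the annotation steps more concrete, the following minimal sketch shows how an off-the-shelf NLP pipeline can tokenize, lemmatize and part-of-speech tag a short text in one go. The choice of spaCy and the model name "en_core_web_sm" are illustrative assumptions only; any comparable toolkit listed under "Using an existing NLP pipeline" can perform the same steps.

    # Illustrative sketch: running an integrated NLP pipeline over a short text.
    # spaCy and the model "en_core_web_sm" are example choices, not a recommendation.
    import spacy

    nlp = spacy.load("en_core_web_sm")  # tokenizer, tagger, lemmatizer and parser in one pipeline

    doc = nlp("The quick brown fox jumps over the lazy dog.")

    # One annotated token per line: surface form, lemma, part of speech, dependency label.
    for token in doc:
        print(f"{token.text}\t{token.lemma_}\t{token.pos_}\t{token.dep_}")

The resulting token table (one token per line with its annotations) is the kind of output that can then be loaded into a corpus query engine.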

Creating interoperable TEI text resources with the DTA 'Base Format' (DTABf)

Currently, initiatives for the digitization of textual resources and their provision to the interested community are manifold. Hence, scholars who want to base their research on digitized texts, especially when working with popular works, may find a considerable amount of resources already digitized and ready to use. However, with these resources originating from various data providers (individual scholars, individual research projects, large infrastructures, ...), scholars usually face a great variety of digitization guidelines and formats. Gathering resources from different sources will hence almost always require their harmonization on different levels of processing. The CLARIN-D center at the Berlin-Brandenburg Academy of Sciences and Humanities provides an infrastructure for this task based on the DTA 'Base Format' (DTABf). The DTABf is a TEI-P5 format for homogeneous text annotation, including a set of extensive guidelines for text transcription. This scenario describes the steps to take in order to create a homogeneous, DTABf-based text corpus.
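As a small illustration of the harmonization step, the sketch below checks an incoming TEI file against the DTABf schema. It assumes a local copy of the DTABf RELAX NG schema; the file names used here are placeholders.

    # Illustrative sketch: validating a TEI file against the DTABf RELAX NG schema with lxml.
    # "basisformat.rng" and "document.tei.xml" are placeholder file names; obtain the actual
    # DTABf schema from the DTA and adjust the paths accordingly.
    from lxml import etree

    schema = etree.RelaxNG(etree.parse("basisformat.rng"))
    doc = etree.parse("document.tei.xml")

    if schema.validate(doc):
        print("Document conforms to the DTABf schema.")
    else:
        for error in schema.error_log:
            print(f"line {error.line}: {error.message}")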
Create a dictionary in TEI

This scenario sets out the best practices for creating a born-digital dictionary, especially with the TEI (Text Encoding Initiative). However, building a standardized lexicographical dataset is not only a data-format problem; it is also an intellectual and technical process in which one has to choose how to model the data, and which tools to use, in order to create an easy-to-use and sustainable resource.
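By way of illustration, the following sketch serializes a single, invented dictionary entry using elements from the TEI Dictionaries module (entry, form, orth, gramGrp, pos, sense, def). It shows one possible way of modelling an entry, not a prescribed structure.

    # Illustrative sketch: building one TEI dictionary entry with lxml.
    # The entry content is invented; the element choice follows the TEI Dictionaries module.
    from lxml import etree

    TEI_NS = "http://www.tei-c.org/ns/1.0"

    def tei(tag):
        return f"{{{TEI_NS}}}{tag}"

    entry = etree.Element(tei("entry"), nsmap={None: TEI_NS})

    form = etree.SubElement(entry, tei("form"), type="lemma")
    etree.SubElement(form, tei("orth")).text = "corpus"

    gram_grp = etree.SubElement(entry, tei("gramGrp"))
    etree.SubElement(gram_grp, tei("pos")).text = "noun"

    sense = etree.SubElement(entry, tei("sense"))
    etree.SubElement(sense, tei("def")).text = "A collection of texts assembled for linguistic analysis."

    print(etree.tostring(entry, pretty_print=True, encoding="unicode"))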

Creation of a TEI-based corpus

This scenario explains the steps to take in order to create a corpus based on the TEI tagset. The TEI guidelines have become a de facto standard for text annotation, providing solutions for a great variety of text and phrase structures, information on content types, linguistic information on words or phrases, and more. Annotation in many digital text collections and digital edition projects has been based on the TEI. Linguistic corpora based on TEI may thus be re-used in projects of other disciplines as well, or may themselves benefit from the wide range of already existing resources.
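To show how such TEI annotation can be re-used, the short sketch below reads word-level linguistic information from a TEI corpus file. It assumes that tokens are encoded as <w> elements carrying @lemma and @pos attributes, which is one common (but not the only) TEI encoding for linguistic corpora; the file name is a placeholder.

    # Illustrative sketch: extracting word-level annotation from a TEI-encoded corpus file.
    # Assumes <w> elements with @lemma and @pos attributes; "corpus.tei.xml" is a placeholder.
    from lxml import etree

    NS = {"tei": "http://www.tei-c.org/ns/1.0"}

    doc = etree.parse("corpus.tei.xml")

    # Print one token per line: surface form, lemma, part-of-speech tag.
    for w in doc.xpath("//tei:w", namespaces=NS):
        print(w.text, w.get("lemma"), w.get("pos"))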

Source
DigiLex – Legacy Dictionaries Reloaded

DigiLex is a platform for sharing tips, raising questions and discussing methods for the creation, application and dissemination of born-digital and retro-digitized lexical resources (dictionaries, lexicons, thesauri, word lists, etc.).

TeLeMaCo

TeLeMaCo stands for Teaching and Learning Materials Collection. It is a collaborative portal for all kinds of training and teaching materials relevant to linguistics and the digital humanities.

The range of described materials includes quickstarts, FAQs, technical documentation and user documentation for tools, small teaching modules and even entire courses. They can be of any kind of digital media, including text, image, audio, video, and interactive training.

CLARIN Knowledge Sharing

The aim of the CLARIN Knowledge Sharing Initiative is to ensure that the available knowledge and expertise provided by CLARIN consortia does not exist as a fragmented collection of unconnected bits and pieces, but is made accessible in an organized way to the CLARIN community and to the Social Sciences and Humanities research community at large.

One central step in building the Knowledge Sharing Infrastructure is the establishment of Knowledge Centres. Most existing CLARIN centres are able to obtain the status of a Knowledge Centre right away; the K-Centres rather formalize and centrally register existing expertise and usually do not require much additional effort from an institute, except that the knowledge-sharing services have to be reliable and their scope has to be made explicit on a dedicated web page of the respective institute(s).

The list of CLARIN Knowledge Centres is available at https://www.clarin.eu/content/knowledge-centres