Search and Browse – ELRA Catalogue

Catalan; Valencian

ID: ELRA-W0327

The AnCora Catalan Corpus 2.0.0 is a corpus of 500,000 words annotated at different levels: - Lemma and Part of Speech, - Syntactic constituents and functions, - Argument structure and thematic roles, - Semantic classes of the verb, - Denotative type of deverbal nouns, - Nouns related to W...

MEMBER	academic	commercial
Licence: Attribution, Commercial Use - GPL	0.00 €	0.00 €

NON MEMBER	academic	commercial
Licence: Attribution, Commercial Use - GPL	0.00 €	0.00 €

AnCora Spanish 2.0.0 text

Spanish; Castilian

ID: ELRA-W0326

ISLRN: 252-495-813-736-1

The AnCora Spanish Corpus 2.0.0 is a corpus of 500,000 words annotated at different levels: - Lemma and Part of Speech, - Syntactic constituents and functions, - Argument structure and thematic roles, - Semantic classes of the verb, - Denotative type of deverbal nouns, - Nouns related to W...

MEMBER	academic	commercial
Licence: Attribution, Commercial Use - GPL	0.00 €	0.00 €

NON MEMBER	academic	commercial
Licence: Attribution, Commercial Use - GPL	0.00 €	0.00 €

Bulgarian Event Corpus text

Bulgarian

ID: ELRA-W0329

ISLRN: 832-960-876-604-2

The Bulgarian Event Corpus is composed of 324,905 tokens appropriate for training Named Entity Recognition (NER), Named Entity Linking (NEL) and Event Recognition models for Bulgarian in a multidomain context within Humanities. The texts are domain related. They include documents from the area of So...

MEMBER	academic	commercial
Licence: Attribution, Share Alike - CC-BY-SA-3.0	0.00 €	0.00 €

NON MEMBER	academic	commercial
Licence: Attribution, Share Alike - CC-BY-SA-3.0	0.00 €	0.00 €

Bulgarian Treebank Corpus text

Bulgarian

ID: ELRA-W0328

ISLRN: 761-430-854-533-2

The Bulgarian Treebank Corpus is composed of 156,149 tokens (11,138 sentences) coming from three main sources in the domain of Grammar Notebooks (1,391 sentences), News (6,698 sentences), Other (3,049 sentences). It is available with syntactical and morphological annotation on a sentence basis in...

MEMBER	academic	commercial
Licence: Attribution, Share Alike - CC-BY-SA-3.0	0.00 €	0.00 €

NON MEMBER	academic	commercial
Licence: Attribution, Share Alike - CC-BY-SA-3.0	0.00 €	0.00 €

Corpus of Icelandic texts from the Central Bank of Iceland (Processed) text

Icelandic

ID: ELRA-W0298

ISLRN: 420-670-865-427-1

This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Europe Facility - Automated Translation (CEF.AT) action. For further information on the project: http://lr-coordination.eu. Corpus of Icelandic texts from the Central Bank of Icela...

MEMBER	academic	commercial
Licence: Attribution, Other - Open Under-PSI	0.00 €	0.00 €

NON MEMBER	academic	commercial
Licence: Attribution, Other - Open Under-PSI	0.00 €	0.00 €

Danish Gigaword Corpus text

Danish

ID: ELRA-W0318

ISLRN: 024-504-318-388-3

The Danish Gigaword Project (DAGW) maintains a corpus for Danish with over a billion words. The general goals are to create a dataset that is: 1. representative; 2. accessible; 3. a suitable common starting point for Danish NLP models. The present version 1.0 was collected from various webs...

MEMBER	academic	commercial
Licence: Attribution - CC-BY-4.0	0.00 €	0.00 €

NON MEMBER	academic	commercial
Licence: Attribution - CC-BY-4.0	0.00 €	0.00 €

Ema-lon Manipuri Corpus (including word embedding and language model) text

English
Manipuri

ID: ELRA-W0316

ISLRN: 588-170-827-016-7

The Ema-lon Manipuri Corpus consists of a set of resources for the Manipuri language (locally known as Meiteilon) for the purpose of machine translation. The main source for these resources is the Sangai Express news website. The resources that constitute the present corpus are listed below: 1. EM C...

MEMBER	academic	commercial
Licence: Attribution, Non Commercial Use - CC-BY-NC-4.0	0.00 €	0.00 €

NON MEMBER	academic	commercial
Licence: Attribution, Non Commercial Use - CC-BY-NC-4.0	0.00 €	0.00 €

German Political Speeches Corpus text

German

ID: ELRA-W0330

ISLRN: 381-445-879-769-5

This corpus consists of a collection of political speeches in German crawled from the online archive of the German Presidency (Bundespräsident) and the Chancellery (Bundesregierung). For the German Presidency the speeches are available from July 1, 1984 to February 17, 2012 and the corpus con...

MEMBER	academic	commercial
Licence: Attribution, Share Alike - CC-BY-SA	0.00 €	0.00 €

NON MEMBER	academic	commercial
Licence: Attribution, Share Alike - CC-BY-SA	0.00 €	0.00 €

How2Sign Dataset text

American Sign Language
English

ID: ELRA-S0416

ISLRN: 583-408-694-292-6

The How2Sign dataset consists of a parallel corpus of speech and transcriptions of instructional videos and their corresponding American Sign Language (ASL) translation videos and annotations. It has been produced by recording 11 persons (6 males and 5 females) with various hearing statuses (5 self...

MEMBER	academic	commercial
Licence: Attribution, Non Commercial Use - CC-BY-NC-4.0

NON MEMBER	academic	commercial
Licence: Attribution, Non Commercial Use - CC-BY-NC-4.0

Monolingual documents from the Government of Lithuania (Processed) text

Lithuanian

ID: ELRA-W0299

ISLRN: 268-109-862-136-1

This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Europe Facility - Automated Translation (CEF.AT) action. For further information on the project: http://lr-coordination.eu. Monolingual documents received from the Government of th...

MEMBER	academic	commercial
Licence: Attribution - CC-BY-4.0	0.00 €	0.00 €

NON MEMBER	academic	commercial
Licence: Attribution - CC-BY-4.0	0.00 €	0.00 €

10 Language Resources