15 Language Resources

 AnCora Catalan 2.0.0    
  • Catalan; Valencian

ID: ELRA-W0327

ISLRN: 186-654-762-852-8

The AnCora Catalan Corpus 2.0.0 is a corpus of 500,000 words annotated at different levels: lemma and part of speech; syntactic constituents and functions; argument structure and thematic roles; semantic classes of the verb; denotative type of deverbal nouns; nouns related to W...

MEMBER: academic / commercial
Licence: Attribution, Commercial Use - GPL
0.00 € / 0.00 €
NON MEMBER: academic / commercial
Licence: Attribution, Commercial Use - GPL
0.00 € / 0.00 €
 AnCora Spanish 2.0.0    
  • Spanish; Castilian

ID: ELRA-W0326

ISLRN: 252-495-813-736-1

The AnCora Spanish Corpus 2.0.0 is a corpus of 500,000 words annotated at different levels: lemma and part of speech; syntactic constituents and functions; argument structure and thematic roles; semantic classes of the verb; denotative type of deverbal nouns; nouns related to W...

MEMBER: academic / commercial
Licence: Attribution, Commercial Use - GPL
0.00 € / 0.00 €
NON MEMBER: academic / commercial
Licence: Attribution, Commercial Use - GPL
0.00 € / 0.00 €
 Arabic Speech Corpus    
  • Arabic

ID: ELRA-S0384

ISLRN: 866-568-447-697-8

This speech corpus has been developed as part of a PhD work carried out by Nawar Halabi at the University of Southampton. The corpus was recorded through a Neumann TLM 103 Studio Microphone by one male speaker in South Levantine Arabic (Damascene accent) in a professional studio. The transcript w...

MEMBER: academic / commercial
Licence: Commercial Use - ELRA VAR
9000.00 €
Licence: Attribution - CC-BY
0.00 € / 0.00 €
NON MEMBER: academic / commercial
Licence: Commercial Use - ELRA VAR
11200.00 €
Licence: Attribution - CC-BY
0.00 € / 0.00 €
 Bulgarian Event Corpus    
  • Bulgarian

ID: ELRA-W0329

ISLRN: 832-960-876-604-2

The Bulgarian Event Corpus is composed of 324,905 tokens appropriate for training Named Entity Recognition (NER), Named Entity Linking (NEL) and Event Recognition models for Bulgarian in a multidomain context within Humanities. The texts are domain related. They include documents from the area of So...

MEMBER: academic / commercial
Licence: Attribution, Share Alike - CC-BY-SA-3.0
0.00 € / 0.00 €
NON MEMBER: academic / commercial
Licence: Attribution, Share Alike - CC-BY-SA-3.0
0.00 €
Licence: Attribution, Share Alike - CC-BY-SA-3.0
0.00 €
 Bulgarian Treebank Corpus    
  • Bulgarian

ID: ELRA-W0328

ISLRN: 761-430-854-533-2

The Bulgarian Treebank Corpus is composed of 156,149 tokens (11,138 sentences) drawn from three main sources: Grammar Notebooks (1,391 sentences), News (6,698 sentences), and Other (3,049 sentences). It is available with syntactical and morphological annotation on a sentence basis in...

MEMBER: academic / commercial
Licence: Attribution, Share Alike - CC-BY-SA-3.0
0.00 € / 0.00 €
NON MEMBER: academic / commercial
Licence: Attribution, Share Alike - CC-BY-SA-3.0
0.00 € / 0.00 €
 Corpus of Icelandic texts from the Central Bank of Iceland (Processed)    
  • Icelandic

ID: ELRA-W0298

ISLRN: 420-670-865-427-1

This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Europe Facility - Automated Translation (CEF.AT) action. For further information on the project: http://lr-coordination.eu. Corpus of Icelandic texts from the Central Bank of Icela...

MEMBER: academic / commercial
Licence: Attribution, Other - Open Under-PSI
0.00 € / 0.00 €
NON MEMBER: academic / commercial
Licence: Attribution, Other - Open Under-PSI
0.00 € / 0.00 €
 Danish Gigaword Corpus    
  • Danish

ID: ELRA-W0318

ISLRN: 024-504-318-388-3

The Danish Gigaword Project (DAGW) maintains a corpus for Danish with over a billion words. The general goals are to create a dataset that is: 1. representative; 2. accessible; 3. a suitable common starting point for Danish NLP models. The present version 1.0 was collected from various webs...

MEMBER: academic / commercial
Licence: Attribution - CC-BY-4.0
0.00 € / 0.00 €
NON MEMBER: academic / commercial
Licence: Attribution - CC-BY-4.0
0.00 € / 0.00 €
 Ema-lon Manipuri Corpus (including word embedding and language model)    
  • English
  • Manipuri

ID: ELRA-W0316

ISLRN: 588-170-827-016-7

The Ema-lon Manipuri Corpus consists of a set of resources for the Manipuri language (locally known as Meiteilon) for the purpose of machine translation. The main source for these resources is the Sangai Express news website. The resources that constitute the present corpus are listed below: 1. EM C...

MEMBER: academic / commercial
Licence: Attribution, Non Commercial Use - CC-BY-NC-4.0
0.00 € / 0.00 €
NON MEMBER: academic / commercial
Licence: Attribution, Non Commercial Use - CC-BY-NC-4.0
0.00 € / 0.00 €
 German Political Speeches Corpus    
  • German

ID: ELRA-W0330

ISLRN: 381-445-879-769-5

This corpus consists of a collection of political speeches in German crawled from the online archive of the German Presidency (Bundespräsident) and the Chancellery (Bundesregierung). For the German Presidency the speeches are available from July 1, 1984 to February 17, 2012 and the corpus con...

MEMBER: academic / commercial
Licence: Attribution, Share Alike - CC-BY-SA
0.00 € / 0.00 €
NON MEMBER: academic / commercial
Licence: Attribution, Share Alike - CC-BY-SA
0.00 € / 0.00 €
 Glissando-ca    
  • Catalan; Valencian

ID: ELRA-S0407

ISLRN: 780-617-066-913-1

Glissando-ca includes more than 12 hours of speech in Catalan, recorded under optimal acoustic conditions, orthographically transcribed, phonetically aligned and annotated with prosodic information (location of the stressed syllables and prosodic phrasing). The corpus was recorded by 8 profession...

MEMBER: academic / commercial
Licence: Attribution, Non Commercial Use, Share Alike - CC-BY-NC-SA
0.00 € / 0.00 €
NON MEMBER: academic / commercial
Licence: Attribution, Non Commercial Use, Share Alike - CC-BY-NC-SA
0.00 € / 0.00 €
 Glissando-sp    
  • Spanish; Castilian

ID: ELRA-S0406

ISLRN: 024-286-962-247-6

Glissando-sp includes more than 12 hours of speech in Spanish, recorded under optimal acoustic conditions, orthographically transcribed, phonetically aligned and annotated with prosodic information (location of the stressed syllables and prosodic phrasing). The corpus was recorded by 8 profession...

MEMBER: academic / commercial
Licence: Attribution, Non Commercial Use, Share Alike - CC-BY-NC-SA
0.00 € / 0.00 €
NON MEMBER: academic / commercial
Licence: Attribution, Non Commercial Use, Share Alike - CC-BY-NC-SA
0.00 € / 0.00 €
 How2Sign Dataset      
  • American Sign Language
  • English

ID: ELRA-S0416

ISLRN: 583-408-694-292-6

The How2Sign dataset consists of a parallel corpus of speech and transcriptions of instructional videos and their corresponding American Sign Language (ASL) translation videos and annotations. It has been produced by recording 11 persons (6 males and 5 females) with varying hearing status (5 self...

MEMBER: academic / commercial
Licence: Attribution, Non Commercial Use - CC-BY-NC-4.0
NON MEMBER: academic / commercial
Licence: Attribution, Non Commercial Use - CC-BY-NC-4.0
 JV_TDM Corpus    
  • French

ID: ELRA-S0379

ISLRN: 371-240-320-910-4

The JV_TDM corpus provides a phonetic annotation of 37 chapters of the original French version of “Around the World in 80 Days” by Jules Verne read by a single speaker. Each chapter has been annotated in a separate .TextGrid file. The audio files are not included in this release. They are availab...

MEMBER: academic / commercial
Licence: Attribution, Non Commercial Use, Share Alike - CC-BY-NC-SA
0.00 € / 0.00 €
NON MEMBER: academic / commercial
Licence: Attribution, Non Commercial Use, Share Alike - CC-BY-NC-SA
0.00 € / 0.00 €
 Monolingual documents from the Government of Lithuania (Processed)    
  • Lithuanian

ID: ELRA-W0299

ISLRN: 268-109-862-136-1

This dataset has been created within the framework of the European Language Resource Coordination (ELRC) Connecting Europe Facility - Automated Translation (CEF.AT) action. For further information on the project: http://lr-coordination.eu. Monolingual documents received from the Government of th...

MEMBER: academic / commercial
Licence: Attribution - CC-BY-4.0
0.00 € / 0.00 €
NON MEMBER: academic / commercial
Licence: Attribution - CC-BY-4.0
0.00 € / 0.00 €
 Persian Speech Corpus    
  • Persian

ID: ELRA-S0393

ISLRN: 068-845-898-304-0

This single-speaker speech corpus of about 2.5 hours has been developed using the same methodologies as the PhD work carried out by Nawar Halabi at the University of Southampton. The corpus was recorded in Persian (Tehrani accent) by one male speaker in a professional studio, through a "Blubb...

MEMBER: academic / commercial
Licence: Attribution, Non Commercial Use, Share Alike - CC-BY-NC-SA
0.00 € / 0.00 €
Licence: Commercial Use - ELRA VAR
4000.00 € / 4000.00 €
NON MEMBER: academic / commercial
Licence: Attribution, Non Commercial Use, Share Alike - CC-BY-NC-SA
0.00 € / 0.00 €
Licence: Commercial Use - ELRA VAR
5000.00 € / 5000.00 €