Search and Browse – ELRA Catalogue

Portuguese

ID: ELRA-W0055

The CINTIL-TreeBank is a corpus of syntactic constituency trees of Portuguese texts composed of 10,039 sentences and 110,166 tokens taken from different sources and domains: news (8,861 sentences; 101,430 tokens), novels (399 sentences; 3,082 tokens). In addition, there are 779 sentences (5,654 t...

MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	0.00 €	3000.00 €
Licence: Commercial Use - ELRA VAR	3000.00 €	3000.00 €

NON MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	0.00 €	3000.00 €
Licence: Commercial Use - ELRA VAR	3000.00 €	3000.00 €

Corpus for fine-grained analysis and automatic detection of irony on Twitter text

English

ID: ELRA-W0337

ISLRN: 478-366-550-085-8

The Corpus for fine-grained analysis and automatic detection of irony on Twitter was carefully annotated by trained annotators (Master’s students in Linguistics) using a detailed annotation scheme for irony categorization, which describes four labels: ‘ironic by means of a polarity contrast’, ‘si...

MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	0.00 €	100.00 €
Licence: Commercial Use - ELRA VAR	100.00 €	100.00 €

NON MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	0.00 €	200.00 €
Licence: Commercial Use - ELRA VAR	200.00 €	200.00 €

Corpus of Contemporaneous Spanish Novels text

Spanish; Castilian

ID: ELRA-W0041

ISLRN: 837-873-214-287-0

This corpus consists of 11 novels written in Castilian Spanish by Inmaculada Ferrer-Vidal Turull, a contemporary author. The list of novels consists of: - La búsqueda: 113,639 words - Tristeza: 41,125 words - Cuarto menguante: 42,419 words - Recuerdos: 55,694 words - Sucedió en Abril: 46,040 w...

MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	400.00 €	800.00 €
Licence: Commercial Use - ELRA VAR	800.00 €	800.00 €

NON MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	500.00 €	1000.00 €
Licence: Commercial Use - ELRA VAR	1000.00 €	1000.00 €

CRATER 2 Corpus text

English
French
Spanish; Castilian

ID: ELRA-W0033

ISLRN: 052-466-219-226-4

The CRATER corpus was built upon the foundations of an earlier project, ET10/63, which was funded in the final phase of the Eurotra programme. The Corpus Resources and Terminology Extraction project (MLAP-93 20) extended the bilingual annotated English-French International Telecommunications Unio...

MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	0.00 €	25.00 €
Licence: Commercial Use - ELRA VAR	25.00 €	25.00 €

NON MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	0.00 €	125.00 €
Licence: Commercial Use - ELRA VAR	125.00 €	125.00 €

CRATER corpus text

English
French
Spanish; Castilian

ID: ELRA-W0003

ISLRN: 645-721-607-031-5

The Corpus Resources and Terminology Extraction project (MLAP-93 20) has extended the bilingual annotated English-French International Telecommunications Union corpus to include Spanish, and has also debugged the existing corpus. The offer consists of a multi-lingual aligned corpus of 1,000,000 t...

MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	0.00 €	20.00 €
Licence: Commercial Use - ELRA VAR	20.00 €	20.00 €

NON MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	0.00 €	100.00 €
Licence: Commercial Use - ELRA VAR	100.00 €	100.00 €

Danish Propbank text

Danish

ID: ELRA-W0117

ISLRN: 213-212-351-142-5

The Danish Propbank (DPB) is a multi-layer treebank, annotated not only with morphosyntactic, but also with semantic information, in particular propositions/frames with VerbNet classes and semantic roles for both arguments and satellites. In addition, the corpus has been annotated with 20 Named E...

MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	150.00 €	3000.00 €
Licence: Commercial Use - ELRA VAR	5000.00 €	5000.00 €

NON MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	800.00 €	7000.00 €
Licence: Commercial Use - ELRA VAR	7000.00 €	7000.00 €

deL1L2IM corpus text

German

ID: ELRA-W0083

ISLRN: 339-799-085-669-8

The deL1L2IM corpus, created between May and August 2012 and last updated in August 2014, was collected as part of a PhD project on the development of a learning method involving conversations with an artificial companion. This PhD work is presented as a qualitative investigation...

MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	0.00 €	0.00 €
Licence: Commercial Use - ELRA VAR	0.00 €	0.00 €

NON MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	0.00 €	0.00 €
Licence: Commercial Use - ELRA VAR	0.00 €	0.00 €

Dutch PAROLE Distributable Corpus text

Dutch; Flemish

ID: ELRA-W0019

ISLRN: 440-290-917-102-7

The Dutch PAROLE Distributable Corpus is a 3-million-word selection from the 20-million-word Dutch PAROLE Reference corpus. The Dutch corpus was annotated and checked according to the common core PAROLE tagset. The Dutch data were also checked for type. The Dutch PAROLE Distributable...

MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	270.00 €	800.00 €
Licence: Commercial Use - ELRA VAR	1600.00 €	1600.00 €

NON MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	300.00 €	1300.00 €
Licence: Commercial Use - ELRA VAR	2500.00 €	2500.00 €

English-Chinese-Vietnamese Trilingual Parallel Corpus text

Chinese
English
Vietnamese

ID: ELRA-W0314

ISLRN: 637-630-726-817-9

The English-Chinese-Vietnamese Trilingual Parallel Corpus consists of 20,046 trilingual sets of sentence pairs. The corpus is provided in XML format and is annotated according to TEI-encoding guidelines.

MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	150.00 €	500.00 €
Licence: Commercial Use - ELRA VAR	1000.00 €	1000.00 €

NON MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	225.00 €	750.00 €
Licence: Commercial Use - ELRA VAR	1500.00 €	1500.00 €

English-Persian parallel corpus text

English
Persian

ID: ELRA-W0118

ISLRN: 074-825-114-781-7

The English-Persian parallel corpus contains more than 200,000 aligned sentences across a variety of text types from the domains of art, law, culture, science, religion, literature, medicine, idioms, politics and others. It is an extension of the English-Persian parallel corpus already distribute...

MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	1000.00 €	5000.00 €
Licence: Commercial Use - ELRA VAR	5000.00 €	5000.00 €

NON MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	1200.00 €	6000.00 €
Licence: Commercial Use - ELRA VAR	6000.00 €	6000.00 €

English-Persian parallel Corpus text

English
Persian

ID: ELRA-W0051

ISLRN: 671-618-321-687-7

Please refer to ELRA-W0118 for the latest version of this corpus. This version consists of about 3,500,000 English and Persian (Farsi) words aligned at sentence level (about 100,000 sentences, distributed over 50,021 entries). The files are in Unicode. It was originally created wi...

MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	500.00 €	2500.00 €
Licence: Commercial Use - ELRA VAR	2500.00 €	2500.00 €

NON MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	600.00 €	3000.00 €
Licence: Commercial Use - ELRA VAR	3000.00 €	3000.00 €

English-Punjabi Code-Mixed Social Media Content text

English
Panjabi; Punjabi

ID: ELRA-W0319

ISLRN: 695-759-706-170-8

The English-Punjabi Code-Mixed Social Media Content corpus is composed of 893,615 parallel sentences of English-Punjabi distributed over the following domains: - 82,341 parallel sentences of English-Punjabi code-mixed Agriculture Domain Data, - 59,158 parallel sentences of English-P...

MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	0.00 €	0.00 €
Licence: Commercial Use - ELRA VAR	0.00 €	0.00 €

NON MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	0.00 €	0.00 €
Licence: Commercial Use - ELRA VAR	0.00 €	0.00 €

English-Vietnamese Parallel Corpus text

English
Vietnamese

ID: ELRA-W0124

ISLRN: 838-483-738-912-8

This is a corpus of 500,000 English-Vietnamese sentence pairs, built to develop SMT (Statistical Machine Translation) systems. The parallel corpus contains English documents translated by professional translators into Vietnamese. The source texts include books, dictionaries, newspapers, online ne...

MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	600.00 €	1200.00 €
Licence: Commercial Use - ELRA VAR	6000.00 €	6000.00 €

NON MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	1000.00 €	2000.00 €
Licence: Commercial Use - ELRA VAR	8000.00 €	8000.00 €

English-Vietnamese Parallel Corpus text

English
Vietnamese

ID: ELRA-W0311

ISLRN: 893-470-491-825-6

The English-Vietnamese Parallel Corpus consists of 1,000,000 sentence pairs, with an average length of 20 words per sentence. The corpus is provided in XML format and is annotated according to TEI-encoding guidelines.

MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	900.00 €	1800.00 €
Licence: Commercial Use - ELRA VAR	9000.00 €	9000.00 €

NON MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	1500.00 €	3000.00 €
Licence: Commercial Use - ELRA VAR	12000.00 €	12000.00 €

EUROPARL Corpus Parallel Corpora: Portuguese-English text

English
Portuguese

ID: ELRA-W0090

ISLRN: 435-502-922-727-2

The EUROPARL Corpus (Portuguese-English subpart of the parallel corpora) was extracted from the proceedings of the European Parliament. It contains transcriptions of sessions from 1996 to 2011, with a total of approximately 58,324,562 tokens of European Portuguese (L1) and 49,216,896...

MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	0.00 €	0.00 €
Licence: Commercial Use - ELRA VAR	0.00 €	0.00 €

NON MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	0.00 €	0.00 €
Licence: Commercial Use - ELRA VAR	0.00 €	0.00 €

Helsinki Corpus of Swahili text

Swahili (macrolanguage)

ID: ELRA-W0119

ISLRN: 941-187-059-145-7

This is a 25-million-word text corpus of Swahili, annotated for part-of-speech, morphology and syntax. The corpus contains prose text from the fiction, news media and government document domains, covering the period between 1953 and 2016. This package contains: - the Helsinki Corpus of Swa...

MEMBER	academic	commercial
Licence: Commercial Use - ELRA VAR	7500.00 €	7500.00 €

NON MEMBER	academic	commercial
Licence: Commercial Use - ELRA VAR	15000.00 €	15000.00 €

Italian Syntactic-Semantic Treebank (ISST) text

Italian

ID: ELRA-W0044

ISLRN: 927-246-660-947-9

ISST comprises 89,941 tokens for the financial-domain part and 215,606 tokens for the general part. It is formatted in XML. ISST has a five-level structure covering orthographic, morpho-syntactic, syntactic and semantic levels of linguistic description. Syntactic annotation is distributed over t...

MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	100.00 €	1500.00 €
Licence: Commercial Use - ELRA VAR	1500.00 €	1500.00 €

NON MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	150.00 €	2500.00 €
Licence: Commercial Use - ELRA VAR	2500.00 €	2500.00 €

Karl May Korpus (KMK) text

German

ID: ELRA-W0016

ISLRN: 628-817-117-400-1

The "Karl-May-Korpus" is a monolingual German corpus, available in an SGML-tagged ASCII text format. It contains the works of the German author Karl May (1842-1912) and consists of around 1.6 million words (divided into 9 subcorpora of about 180,000 words each). The corpus was created between 199...

MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	400.00 €	2500.00 €
Licence: Commercial Use - ELRA VAR	2500.00 €	2500.00 €

NON MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	800.00 €	3500.00 €
Licence: Commercial Use - ELRA VAR	3500.00 €	3500.00 €

Khresmoi manually annotated reference corpus text

English

ID: ELRA-W0081

ISLRN: 764-036-829-417-7

The Manually Annotated Reference Corpus is a collection of English web documents annotated with key entities (such as disease, drug), built in the framework of the Khresmoi project, funded by the European Commission. It has been constructed by first annotating these entities with an imperfect aut...

MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	0.00 €	0.00 €
Licence: Commercial Use - ELRA VAR	0.00 €	0.00 €

NON MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	0.00 €	0.00 €
Licence: Commercial Use - ELRA VAR	0.00 €	0.00 €

Korean-Vietnamese Parallel Corpus text

Korean
Vietnamese

ID: ELRA-W0313

ISLRN: 365-128-449-700-7

The Korean-Vietnamese Parallel Corpus consists of 200,000 sentence pairs, with an average length of 15 words per sentence. The corpus is provided in XML format and is annotated according to TEI-encoding guidelines.

MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	200.00 €	400.00 €
Licence: Commercial Use - ELRA VAR	1400.00 €	1400.00 €

NON MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	300.00 €	600.00 €
Licence: Commercial Use - ELRA VAR	2100.00 €	2100.00 €

114 Language Resources (Page 2 of 6)