Search and Browse – ELRA Catalogue

Collins Multilingual database (MLD) - WordBank (text)

Arabic
Bengali
Chinese
Croatian
Czech
Danish
Dutch; Flemish
English
Finnish
French
German
Hindi
Italian
Japanese
Korean
Malayalam
Modern Greek (1453-)
Norwegian
Polish
Portuguese
Romanian; Moldavian; Moldovan
Russian
Spanish; Castilian
Swedish
Tamil
Thai
Turkish
Ukrainian
Vietnamese

ID: ELRA-T0376

The Collins Multilingual database covers real-life daily vocabulary. It is composed of a multilingual lexicon in 32 languages (the WordBank) and a multilingual set of sentences in 28 languages (the PhraseBank, distributed separately under reference ELRA-T0377). The WordBank contains 10,000 words...

MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	2400.00 €

NON MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	3600.00 €

GlobalPhone 2000 Speaker Package (audio)

Arabic
Bulgarian
Chinese
Croatian
Czech
French
German
Hausa
Japanese
Korean
Polish
Portuguese
Russian
Spanish; Castilian
Swahili (macrolanguage)
Swedish
Tamil
Thai
Turkish
Ukrainian
Vietnamese

ID: ELRA-S0400

ISLRN: 331-592-378-424-7

The GlobalPhone 2000 Speaker Package contains transcribed read speech spoken by 2000 native speakers in 22 languages. The data are sampled from the GlobalPhone Speech and Text Data available in the ELRA Catalogue, i.e.: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), C...

MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	1200.00 €	6000.00 €
Licence: Commercial Use - ELRA VAR	6000.00 €	6000.00 €

NON MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	1400.00 €	7200.00 €
Licence: Commercial Use - ELRA VAR	7200.00 €	7200.00 €

GlobalPhone Multilingual Model Package (audio)

Arabic
Bulgarian
Chinese
Croatian
Czech
French
German
Hausa
Japanese
Korean
Polish
Portuguese
Russian
Spanish; Castilian
Swahili (macrolanguage)
Swedish
Tamil
Thai
Turkish
Ukrainian
Vietnamese

ID: ELRA-S0399

ISLRN: 204-945-263-927-6

The GlobalPhone Multilingual Model Package contains about 22 hours of transcribed read speech spoken by native speakers in 22 languages. The data are sampled from the GlobalPhone Speech and Text Data available in the ELRA Catalogue, i.e.: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Manda...

MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	1200.00 €	6000.00 €
Licence: Commercial Use - ELRA VAR	6000.00 €	6000.00 €

NON MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	1400.00 €	7200.00 €
Licence: Commercial Use - ELRA VAR	7200.00 €	7200.00 €

GlobalPhone Tamil (audio)

Tamil

ID: ELRA-S0205

ISLRN: 269-930-371-035-1

The GlobalPhone corpus, developed in collaboration with the Karlsruhe Institute of Technology (KIT), was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, mu...

MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	100.00 €	500.00 €
Licence: Commercial Use - ELRA VAR	500.00 €	500.00 €

NON MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	125.00 €	600.00 €
Licence: Commercial Use - ELRA VAR	600.00 €	600.00 €

Special offers are also available.

Parallel Corpora for 6 Indian Languages (text)

Bengali
English
Hindi
Malayalam
Tamil
Telugu
Urdu

ID: ELRA-W0320

ISLRN: 657-350-757-058-6

The Parallel Corpora for 6 Indian Languages contains data sets for Bengali (540,000 words – 20,000 parallel sentences), Hindi (1,200,000 words – 37,000 parallel sentences), Malayalam (660,000 words – 29,000 parallel sentences), Tamil (747,000 words – 35,000 parallel sentences), Telugu (951,000 wo...

MEMBER	academic	commercial
Licence: Attribution, Share Alike - CC-BY-SA-3.0	0.00 €	0.00 €

NON MEMBER	academic	commercial
Licence: Attribution, Share Alike - CC-BY-SA-3.0	0.00 €	0.00 €

The EMILLE/CIIL Corpus (text)

Assamese
Bengali
English
Gujarati
Hindi
Kannada
Kashmiri
Malayalam
Marathi
Oriya (macrolanguage)
Panjabi; Punjabi
Sinhala; Sinhalese
Tamil
Telugu
Urdu

ID: ELRA-W0037

ISLRN: 039-846-040-604-0

The EMILLE/CIIL Corpus consists of three components: monolingual, parallel and annotated corpora. There are fourteen monolingual corpora, including both written and (for some languages) spoken data for fourteen South Asian languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri, Malayala...

MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	0.00 €

NON MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	0.00 €

The EMILLE Lancaster Corpus (text)

Bengali
English
Gujarati
Hindi
Panjabi; Punjabi
Sinhala; Sinhalese
Tamil
Urdu

ID: ELRA-W0038

ISLRN: 438-045-014-925-0

The EMILLE Lancaster Corpus consists of three components: monolingual, parallel and annotated corpora. There are monolingual corpora for seven South Asian languages: Bengali, Gujarati, Hindi, Punjabi, Sinhala, Tamil, Urdu. The EMILLE monolingual corpora contain approximately 58,880,000 words (i...

MEMBER	academic	commercial
Licence: Commercial Use - ELRA VAR		7500.00 €

NON MEMBER	academic	commercial
Licence: Commercial Use - ELRA VAR		12000.00 €

7 Language Resources