ELRA CATALOGUE

1,629 language resources at your disposal

A growing number of language resources (LRs) across the fields of Human Language Technology (HLT) are distributed on behalf of ELRA by its operational body, ELDA, thanks to contributions from many members of the HLT community.
Through this repository, we aim to help researchers and developers identify and access existing Language Resources, so that they do not spend effort rebuilding resources that already exist.

Latest Resources

Slovak Autistic and Non-Autistic Child Speech Corpus (SANACS)
Slovak Autistic and Non-Autistic Child Speech Corpus (SANACS) contains 67 recorded sessions of interactions between two native Slovak speakers. In 37 sessions an autistic child interacts with a neurotypical adult experimenter, and in 30 control sessions a neurotypical child interacts with the same neurotypical adult experimenter. The children were 6-12 ...
DiaLEX – Emirati (DiaLEX-UA)
The Emirati Arabic Full-Form Lexicon (DiaLEX-UA) is a comprehensive computational lexicon covering the Emirati Arabic dialect. Featuring over 28,000,000 forms for 29,000 lemmas, this full-form lexicon provides exhaustive treatment of all inflected forms. DiaLEX-UA has several features that make it ideally suited to support natural language processing applications for Emirati ...
DiaLEX – Saudi Arabian Hijazi (DiaLEX-HA)
The Hijazi Arabic Full-Form Lexicon (DiaLEX-HA) is a comprehensive computational lexicon covering the Hijazi Arabic dialect. Featuring over 21,000,000 forms for 30,000 lemmas, this full-form lexicon provides exhaustive treatment of all inflected forms. DiaLEX-HA has several features that make it ideally suited to support natural language processing applications for Hijazi ...
DiaLEX – Egyptian (DiaLEX-EA)
The Egyptian Arabic Full-Form Lexicon (DiaLEX-EA) is a comprehensive computational lexicon covering the Egyptian Arabic dialect. Featuring over 78,000,000 forms for 31,000 lemmas, this full-form lexicon provides exhaustive treatment of all inflected forms. DiaLEX-EA has several features that make it ideally suited to support natural language processing applications for Egyptian ...
Corpus for fine-grained analysis and automatic detection of irony on Twitter
The Corpus for fine-grained analysis and automatic detection of irony on Twitter was carefully annotated by trained annotators (Master’s students in Linguistics) using a detailed annotation scheme for irony categorization, which describes four labels: ‘ironic by means of a polarity contrast’, ‘situational irony’, ‘other verbal irony’ and ‘not ironic’. The ...
AUDIO Human Voice Pronunciations - Polish
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Dutch
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Catalan
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Greek
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - English
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Arabic
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Korean
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Russian
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Spanish
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Chinese (Simplified)
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Italian
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Turkish
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Swedish
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Portuguese (Portugal)
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Danish
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Czech
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Japanese
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Hebrew
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Thai
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Portuguese (Brazil)
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Norwegian
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
English BIO Biographical Names (Multilingual)
This dataset consists of 4,200 dictionary entries regarding prominent persons worldwide. A similarly designed dataset for geographical locations is available as a separate package (ELRA-L0204-02).
GEOLINGUAL Multilingual Geographical Entity Tables
A table of over 200 countries and other major geographical names worldwide – including their adjectives, persons, and main languages – in the following languages: Arabic, Chinese Simplified, Danish, Dutch, English, French, German, Greek, Hebrew, Japanese, Korean, Polish, Portuguese, Russian, Spanish, and Turkish.
English GEO Geographical Names (Multilingual)
This dataset consists of 7,200 dictionary entries regarding major locations worldwide. A similarly designed dataset for prominent persons (biographical names) is available as a separate package (ELRA-L0204-01).
Morphological lexicon - Slovak
Morphological lists linking inflected forms to their lemmas, distributed as follows (catalogue references from ELRA-L0203-01 to ELRA-L0203-15), given as language (code): lemmas / word forms. Dutch (nl): 157,000 / 205,603; English (en): 69,308 / 160,441; French (fr): 79,843 / 442,085; German (de): 95,282 / 456,244; Hebrew (he): 25,351 / 862,260; Italian (it): 28,722 / 303,025; Japanese (ja): 265,565 / 398,508; ...
Morphological lexicon - Hebrew
Morphological lists linking inflected forms to their lemmas, distributed as follows (catalogue references from ELRA-L0203-01 to ELRA-L0203-15), given as language (code): lemmas / word forms. Dutch (nl): 157,000 / 205,603; English (en): 69,308 / 160,441; French (fr): 79,843 / 442,085; German (de): 95,282 / 456,244; Hebrew (he): 25,351 / 862,260; Italian (it): 28,722 / 303,025; Japanese (ja): 265,565 / 398,508; ...
Morphological lexicon - Russian
Morphological lists linking inflected forms to their lemmas, distributed as follows (catalogue references from ELRA-L0203-01 to ELRA-L0203-15), given as language (code): lemmas / word forms. Dutch (nl): 157,000 / 205,603; English (en): 69,308 / 160,441; French (fr): 79,843 / 442,085; German (de): 95,282 / 456,244; Hebrew (he): 25,351 / 862,260; Italian (it): 28,722 / 303,025; Japanese (ja): 265,565 / 398,508; ...
Morphological lexicon - Norwegian Nynorsk
Morphological lists linking inflected forms to their lemmas, distributed as follows (catalogue references from ELRA-L0203-01 to ELRA-L0203-15), given as language (code): lemmas / word forms. Dutch (nl): 157,000 / 205,603; English (en): 69,308 / 160,441; French (fr): 79,843 / 442,085; German (de): 95,282 / 456,244; Hebrew (he): 25,351 / 862,260; Italian (it): 28,722 / 303,025; Japanese (ja): 265,565 / 398,508; ...
Morphological lexicon - Japanese
Morphological lists linking inflected forms to their lemmas, distributed as follows (catalogue references from ELRA-L0203-01 to ELRA-L0203-15), given as language (code): lemmas / word forms. Dutch (nl): 157,000 / 205,603; English (en): 69,308 / 160,441; French (fr): 79,843 / 442,085; German (de): 95,282 / 456,244; Hebrew (he): 25,351 / 862,260; Italian (it): 28,722 / 303,025; Japanese (ja): 265,565 / 398,508; ...
MULTIGLOSS Multilingual Glossaries - L1-English pair
A series of innovative multilingual word-to-sense glossaries, based on a human-edited word-to-sense bilingual index of each language to English, which is linked automatically to the translation equivalents in 45 target languages. Each word and expression in every language is translated via its corresponding sense in English into 44 of these ...
MULTIGLOSS Multilingual Glossaries - L1-English pair + 1 language
A series of innovative multilingual word-to-sense glossaries, based on a human-edited word-to-sense bilingual index of each language to English, which is linked automatically to the translation equivalents in 45 target languages. Each word and expression in every language is translated via its corresponding sense in English into 44 of these ...
Morphological lexicon - Portuguese
Morphological lists linking inflected forms to their lemmas, distributed as follows (catalogue references from ELRA-L0203-01 to ELRA-L0203-15), given as language (code): lemmas / word forms. Dutch (nl): 157,000 / 205,603; English (en): 69,308 / 160,441; French (fr): 79,843 / 442,085; German (de): 95,282 / 456,244; Hebrew (he): 25,351 / 862,260; Italian (it): 28,722 / 303,025; Japanese (ja): 265,565 / 398,508; ...
Morphological lexicon - German
Morphological lists linking inflected forms to their lemmas, distributed as follows (catalogue references from ELRA-L0203-01 to ELRA-L0203-15), given as language (code): lemmas / word forms. Dutch (nl): 157,000 / 205,603; English (en): 69,308 / 160,441; French (fr): 79,843 / 442,085; German (de): 95,282 / 456,244; Hebrew (he): 25,351 / 862,260; Italian (it): 28,722 / 303,025; Japanese (ja): 265,565 / 398,508; ...
Morphological lexicon - Norwegian Bokmål
Morphological lists linking inflected forms to their lemmas, distributed as follows (catalogue references from ELRA-L0203-01 to ELRA-L0203-15), given as language (code): lemmas / word forms. Dutch (nl): 157,000 / 205,603; English (en): 69,308 / 160,441; French (fr): 79,843 / 442,085; German (de): 95,282 / 456,244; Hebrew (he): 25,351 / 862,260; Italian (it): 28,722 / 303,025; Japanese (ja): 265,565 / 398,508; ...
Morphological lexicon - Italian
Morphological lists linking inflected forms to their lemmas, distributed as follows (catalogue references from ELRA-L0203-01 to ELRA-L0203-15), given as language (code): lemmas / word forms. Dutch (nl): 157,000 / 205,603; English (en): 69,308 / 160,441; French (fr): 79,843 / 442,085; German (de): 95,282 / 456,244; Hebrew (he): 25,351 / 862,260; Italian (it): 28,722 / 303,025; Japanese (ja): 265,565 / 398,508; ...
Morphological lexicon - Dutch
Morphological lists linking inflected forms to their lemmas, distributed as follows (catalogue references from ELRA-L0203-01 to ELRA-L0203-15), given as language (code): lemmas / word forms. Dutch (nl): 157,000 / 205,603; English (en): 69,308 / 160,441; French (fr): 79,843 / 442,085; German (de): 95,282 / 456,244; Hebrew (he): 25,351 / 862,260; Italian (it): 28,722 / 303,025; Japanese (ja): 265,565 / 398,508; ...
Morphological lexicon - French
Morphological lists linking inflected forms to their lemmas, distributed as follows (catalogue references from ELRA-L0203-01 to ELRA-L0203-15), given as language (code): lemmas / word forms. Dutch (nl): 157,000 / 205,603; English (en): 69,308 / 160,441; French (fr): 79,843 / 442,085; German (de): 95,282 / 456,244; Hebrew (he): 25,351 / 862,260; Italian (it): 28,722 / 303,025; Japanese (ja): 265,565 / 398,508; ...
Morphological lexicon - English
Morphological lists linking inflected forms to their lemmas, distributed as follows (catalogue references from ELRA-L0203-01 to ELRA-L0203-15), given as language (code): lemmas / word forms. Dutch (nl): 157,000 / 205,603; English (en): 69,308 / 160,441; French (fr): 79,843 / 442,085; German (de): 95,282 / 456,244; Hebrew (he): 25,351 / 862,260; Italian (it): 28,722 / 303,025; Japanese (ja): 265,565 / 398,508; ...
Morphological lexicon - Swedish
Morphological lists linking inflected forms to their lemmas, distributed as follows (catalogue references from ELRA-L0203-01 to ELRA-L0203-15), given as language (code): lemmas / word forms. Dutch (nl): 157,000 / 205,603; English (en): 69,308 / 160,441; French (fr): 79,843 / 442,085; German (de): 95,282 / 456,244; Hebrew (he): 25,351 / 862,260; Italian (it): 28,722 / 303,025; Japanese (ja): 265,565 / 398,508; ...
Parallel Corpora & Domains (bilingual and multilingual)
Parallel corpora for nearly 400 language pairs and numerous multilingual combinations, including 10 million bilingual segments and 90 million tokens in 20 languages: Arabic, Chinese (Simplified), Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Italian, Japanese, Korean, North Sami, Norwegian, Polish, Portuguese (Brazilian and European), Russian, Spanish, Swedish, and Turkish. ...
Morphological lexicon - Korean
Morphological lists linking inflected forms to their lemmas, distributed as follows (catalogue references from ELRA-L0203-01 to ELRA-L0203-15), given as language (code): lemmas / word forms. Dutch (nl): 157,000 / 205,603; English (en): 69,308 / 160,441; French (fr): 79,843 / 442,085; German (de): 95,282 / 456,244; Hebrew (he): 25,351 / 862,260; Italian (it): 28,722 / 303,025; Japanese (ja): 265,565 / 398,508; ...
EWA-DB – Early Warning of Alzheimer speech database
EWA-DB is a speech database that contains data from 3 clinical groups: Alzheimer's disease, Parkinson's disease, mild cognitive impairment, and a control group of healthy subjects. Speech samples of each clinical group were obtained using the EWA smartphone application, which contains 4 different language tasks: sustained vowel phonation, diadochokinesis, object ...
Morphological lexicon - Spanish
Morphological lists linking inflected forms to their lemmas, distributed as follows (catalogue references from ELRA-L0203-01 to ELRA-L0203-15), given as language (code): lemmas / word forms. Dutch (nl): 157,000 / 205,603; English (en): 69,308 / 160,441; French (fr): 79,843 / 442,085; German (de): 95,282 / 456,244; Hebrew (he): 25,351 / 862,260; Italian (it): 28,722 / 303,025; Japanese (ja): 265,565 / 398,508; ...
GLOBAL Multilingual Lexical Data - Monolingual - Level 1
The GLOBAL Multilingual Lexical Data (references ELRA-M0111-01 to ELRA-M0111-06 in the ELRA Catalogue) consists of a network of lexicographic cores for major world languages, comprising diverse monolingual, bilingual and multilingual combinations, in different sizes, originally built for language learning and translation. They are available in XML, JSON or JSON-LD (RDF) ...
GLOBAL Multilingual Lexical Data - Bilingual - Level 3
The GLOBAL Multilingual Lexical Data (references ELRA-M0111-01 to ELRA-M0111-06 in the ELRA Catalogue) consists of a network of lexicographic cores for major world languages, comprising diverse monolingual, bilingual and multilingual combinations, in different sizes, originally built for language learning and translation. They are available in XML, JSON or JSON-LD (RDF) ...
GLOBAL Multilingual Lexical Data - Monolingual - Level 3
The GLOBAL Multilingual Lexical Data (references ELRA-M0111-01 to ELRA-M0111-06 in the ELRA Catalogue) consists of a network of lexicographic cores for major world languages, comprising diverse monolingual, bilingual and multilingual combinations, in different sizes, originally built for language learning and translation. They are available in XML, JSON or JSON-LD (RDF) ...
GLOBAL Multilingual Lexical Data - Monolingual - Level 2
The GLOBAL Multilingual Lexical Data (references ELRA-M0111-01 to ELRA-M0111-06 in the ELRA Catalogue) consists of a network of lexicographic cores for major world languages, comprising diverse monolingual, bilingual and multilingual combinations, in different sizes, originally built for language learning and translation. They are available in XML, JSON or JSON-LD (RDF) ...
GLOBAL Multilingual Lexical Data - Bilingual - Level 2
The GLOBAL Multilingual Lexical Data (references ELRA-M0111-01 to ELRA-M0111-06 in the ELRA Catalogue) consists of a network of lexicographic cores for major world languages, comprising diverse monolingual, bilingual and multilingual combinations, in different sizes, originally built for language learning and translation. They are available in XML, JSON or JSON-LD (RDF) ...
GLOBAL Multilingual Lexical Data - Bilingual - Level 1
The GLOBAL Multilingual Lexical Data (references ELRA-M0111-01 to ELRA-M0111-06 in the ELRA Catalogue) consists of a network of lexicographic cores for major world languages, comprising diverse monolingual, bilingual and multilingual combinations, in different sizes, originally built for language learning and translation. They are available in XML, JSON or JSON-LD (RDF) ...
