ELRA CATALOGUE

1,594 language resources at your disposal

A growing number of language resources (LRs) in the various fields of Human Language Technology (HLT; see image on the left-hand side) are distributed on behalf of ELRA via its operational body, ELDA, thanks to the contributions of various players in the HLT community.
Our aim in providing Language Resources through this repository is to spare researchers and developers the effort of rebuilding resources that already exist, and to help them identify and access those resources.

Latest Resources

ReSSInt-EMG (Spanish EMG and Speech Database)
ReSSInt-EMG (Spanish EMG and Speech Database) has been generated in the framework of the ReSSInt project (Voice restoration with silent EMG speech interfaces) and its continuation project DeepRestore (Deep learning approaches for speech restoration from face movement biosignals), coordinated research projects funded by the Spanish Ministry of Science and Innovation, ...
Arab Full Names Database
The Arab Full Names Database covers over six million Arab Full Names. Optionally, if heteronyms (same spelling, different pronunciations, like Muhammad and Muhammid) are included, the number of entries is approximately 43.9 million. These are names of real people, not names generated by algorithm. Phonological data such as romanization and ...
MiLQ: Mixed-Language Query Test Set for Bilingual Web Search – Evaluation Package
MiLQ is a benchmark of mixed-language (code-switched) search queries created by bilingual speakers for evaluating Information Retrieval with mixed-language queries. It provides query versions where English expressions are embedded within native-language structures. This work is derived from The CLEF Test Suite for the CLEF 2000-2003 Campaigns available in the ELRA ...
Chinese Kids Speech database (Upper Grade)
The Chinese Kids Speech database (Upper Grade) contains recordings of 161 Chinese child speakers (71 males and 90 females), aged 10 to 12 years, recorded in quiet rooms using a smartphone. This database may be combined with the Chinese Kids Speech database (Lower Grade) also available in the ...
Chinese Kids Speech database (Lower Grade)
The Chinese Kids Speech database (Lower Grade) contains recordings of 184 Chinese child speakers (98 males and 86 females), aged 6 to 10 years, recorded in quiet rooms using a smartphone. This database may be combined with the Chinese Kids Speech database (Upper Grade) also available in the ...
EthioSpeech
EthioSpeech Corpora comprises over 391 hours of recorded read speech in six different Ethiopian languages by ca. 200 speakers per language: Amharic (68 hours), Tigrigna (62 hours), Oromo (70 hours), Somali (56 hours), Afar (68 hours), and Sidama (68 hours). The dominant domain is media (mainly newspapers), but ...
Comprehensive Arabic Phonetic Database
The Comprehensive Arabic Phonetic Database is a robust and detailed linguistic resource offering both phonemic and phonetic transcriptions, precisely reflecting how Modern Standard Arabic words are realized in actual speech. This database is ideally suited for speech technology applications. This is a highly comprehensive and accurate Arabic phonetic/phonemic database, covering ...
British English Speech Recognition Corpus (Mobile)
This corpus was recorded in a quiet office/home environment over 3 channels and collected from a total of 302 speakers, including 149 males and 153 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts come from news and tweets. Speech samples ...
Argentina Spanish Speech Recognition Corpus (Mobile)
This corpus was recorded in a quiet office environment over 3 channels and collected from a total of 300 speakers, including 132 males and 168 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as news, daily dialogues and ...
Italian Speech Recognition Corpus (Desktop+Mobile)
This corpus was recorded in a quiet office/home environment over 2 channels and collected from a total of 201 speakers, including 101 males and 100 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as keywords. Speech samples ...
German Speech Recognition Corpus (Desktop+Mobile)
This corpus was recorded in a quiet office/home environment over 2 channels and collected from a total of 203 speakers, including 110 males and 93 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as keywords. Speech samples ...
Mexican Spanish Speech Recognition Corpus (Mobile)
This corpus was recorded in a quiet office environment over 3 channels and collected from a total of 826 speakers, including 408 males and 418 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as news. Speech samples ...
Urdu Speech Recognition Corpus (Desktop)
This corpus was recorded in a quiet office environment over 4 channels and collected from a total of 203 speakers, including 109 males and 194 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as news and daily ...
Italian English Speech Recognition Corpus (Mobile)
This corpus was recorded in a quiet office/home environment over 3 channels and collected from a total of 213 speakers, including 103 males and 110 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as news and daily ...
German English Speech Recognition Corpus (Mobile)
This corpus was recorded in a quiet office/home environment over 3 channels and collected from a total of 196 speakers, including 88 males and 108 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts come from news and tweets. Speech samples ...
Hong Kong English Speech Recognition Corpus (Mobile)
This corpus was recorded in a quiet office/home environment over 3 channels and collected from a total of 200 speakers, including 99 males and 101 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as news, forums, text messages ...
Hindi Speech Recognition Corpus (Desktop)
This corpus was recorded in a quiet office environment over 4 channels and collected from a total of 196 speakers, including 95 males and 101 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as news and daily ...
Malaysian Speech Recognition Corpus (Mobile)
This corpus was recorded in a quiet office/home environment and collected from a total of 131 speakers, including 65 males and 66 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as news and daily dialogues. Speech samples ...
Chilean Spanish Speech Recognition Corpus (Desktop)
This corpus was recorded in a quiet office environment over 4 channels and collected from a total of 200 speakers, including 101 males and 99 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as news. Speech samples ...
French English Speech Recognition Corpus (Mobile)
This corpus was recorded in a quiet office/home environment over 3 channels and collected from a total of 225 speakers, including 107 males and 118 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as news and daily ...
Australian English Speech Recognition Corpus (Desktop)
This corpus was recorded in a quiet office/home environment over 4 channels and collected from a total of 198 speakers, including 85 males and 113 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as news and daily ...
Spain English Speech Recognition Corpus (Mobile)
This corpus was recorded in a quiet office/home environment over 3 channels and collected from a total of 200 speakers, including 99 males and 101 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts come from news, daily dialogues and tweets. ...
UAE Arabic Speech Recognition Corpus (Mobile)
This corpus was recorded in a quiet office/home environment over 2 channels and collected from a total of 168 speakers, including 94 males and 74 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as news and daily dialogues. ...
Japanese Speech Recognition Corpus (Telephone)
This corpus was recorded in a quiet office/home environment and collected from a total of 201 speakers, including 96 males and 105 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as news and daily dialogues. Speech samples ...
Argentina Spanish Speech Recognition Corpus (Desktop)
This corpus was recorded in a quiet office environment over 4 channels and collected from a total of 200 speakers, including 81 males and 119 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as news. Speech samples ...
American English Speech Recognition Corpus (Desktop)
This corpus was recorded in both quiet and noisy environments over 2 channels and collected from a total of 50 speakers, including 24 males and 26 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as text messages ...
Portuguese Speech Recognition Corpus (Desktop+Mobile)
This corpus was recorded in a quiet office environment over 2 channels and collected from a total of 200 speakers, including 102 males and 98 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as keywords. Speech samples ...
Korean Speech Recognition corpus (Mobile)
This corpus was recorded in a quiet office/home environment and collected from a total of 500 speakers, including 246 males and 254 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as news. Speech samples are stored as ...
Portugal English Speech Recognition Corpus (Mobile)
This corpus was recorded in a quiet office/home environment over 3 channels and collected from a total of 201 speakers, including 90 males and 111 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as news and daily ...
Chilean Spanish Speech Recognition Corpus (Mobile)
This corpus was recorded in a quiet office/home environment over 3 channels and collected from a total of 300 speakers, including 138 males and 162 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as news, daily dialogues and ...
Hindi Speech Recognition Corpus (Mobile)
This corpus was recorded in both quiet and noisy environments over 3 channels and collected from a total of 180 speakers, including 99 males and 81 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as news. Speech ...
Telugu Speech Recognition corpus (Mobile)
This corpus was recorded in a quiet office/home environment and collected from a total of 130 speakers, including 67 males and 63 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as news, daily dialogues and tweets. Speech ...
Indonesian Speech Recognition Corpus (Desktop)
This corpus was recorded in a quiet office environment over 4 channels and collected from a total of 200 speakers, including 97 males and 103 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as news and daily ...
Thai Speech Recognition Corpus (Desktop)
This corpus was recorded in a quiet office/home environment over 4 channels and collected from a total of 205 speakers, including 101 males and 104 females, all of whom have been carefully screened to ensure their standard and clear pronunciation. The audio scripts cover information such as news and daily ...
ÌròyìnSpeech
A modern, high-fidelity, multi-speaker, Yorùbá read speech corpus suitable for Speech Synthesis, Automatic Speech Recognition and Computational Linguistics research. The subject matter is drawn from the Broadcast News domain as well as fictional texts, delivering a multi-purpose, contemporary speech dataset. This corpus consists of 34,000 read sentences, 42 hours of ...
Slovak Autistic and Non-Autistic Child Speech Corpus (SANACS)
Slovak Autistic and Non-Autistic Child Speech Corpus (SANACS) contains 67 recorded sessions of interactions between two native Slovak speakers. In 37 sessions an autistic child interacts with a neurotypical adult experimenter, and in 30 control sessions a neurotypical child interacts with the same neurotypical adult experimenter. The children were 6-12 ...
DiaLEX – Emirati (DiaLEX-UA)
The Emirati Arabic Full-Form Lexicon (DiaLEX-UA) is a comprehensive computational lexicon covering the Emirati Arabic dialect. Featuring over 37,000,000 forms for 29,000 lemmas, this full-form lexicon provides exhaustive treatment of all inflected forms. DiaLEX-UA has several features that make it ideally suited to support natural language processing applications for Emirati ...
DiaLEX – Saudi Arabian Hijazi (DiaLEX-HA)
The Hijazi Arabic Full-Form Lexicon (DiaLEX-HA) is a comprehensive computational lexicon covering the Hijazi Arabic dialect. Featuring over 25,000,000 forms for 30,000 lemmas, this full-form lexicon provides exhaustive treatment of all inflected forms. DiaLEX-HA has several features that make it ideally suited to support natural language processing applications for Hijazi ...
DiaLEX – Egyptian (DiaLEX-EA)
The Egyptian Arabic Full-Form Lexicon (DiaLEX-EA) is a comprehensive computational lexicon covering the Egyptian Arabic dialect. Featuring over 93,000,000 forms for 33,000 lemmas, this full-form lexicon provides exhaustive treatment of all inflected forms. DiaLEX-EA has several features that make it ideally suited to support natural language processing applications for Egyptian ...
Corpus for fine-grained analysis and automatic detection of irony on Twitter
The Corpus for fine-grained analysis and automatic detection of irony on Twitter was carefully annotated by trained annotators (Master’s students in Linguistics) using a detailed annotation scheme for irony categorization, which describes four labels: ‘ironic by means of a polarity contrast’, ‘situational irony’, ‘other verbal irony’ and ‘not ironic’. The ...
AUDIO Human Voice Pronunciations - Catalan
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Thai
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Dutch
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Spanish
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Danish
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Portuguese (Brazil)
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Portuguese (Portugal)
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Italian
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Greek
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Korean
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Japanese
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Russian
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Czech
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
AUDIO Human Voice Pronunciations - Norwegian
Human voice recordings of single-word lemmas and multiword expressions, besides IPA (International Phonetic Alphabet) and alternative scripts (Japanese – Romaji/Kanji/Hiragana; Chinese – Pinyin; Arabic and Hebrew – w/out diacritics), distributed as distinct sets (from ELRA-S0490-01 to ELRA-S0490-21) as follows: • Arabic: 8,119 entries • Catalan: 2,247 entries • Chinese (Simplified): ...
