1,506 language resources at your disposal
An increasing number of LRs in the various fields of Human Language Technology (see image on the left-hand side) are distributed on behalf of ELRA via its operational body ELDA, thanks to the contribution of various players of the HLT community.
Our aim is to provide Language Resources through this repository, so that researchers and developers do not have to spend effort rebuilding resources that already exist, and to help them identify and access those resources.
Latest Resources
Archives of "El Mundo" Newspaper – Year 2020
This corpus consists of 15,073 articles in Spanish from the electronic archives of "El Mundo" Newspaper published in the year 2020. A few articles also come from publications from other related media: El Mundo Alicante, El Mundo Andalucía, El Mundo Baleares, El Mundo Catalunya, El Mundo Valéncia and Expansión. All articles ...
Archives of "El Mundo" Newspaper – Year 2022
This corpus consists of 16,124 articles in Spanish from the electronic archives of "El Mundo" Newspaper published in the year 2022. A few articles also come from publications from other related media: El Mundo Alicante, El Mundo Andalucía, El Mundo Baleares, El Mundo Catalunya, El Mundo Valéncia and Expansión. All articles ...
Archives of "El Mundo" Newspaper – Years 2020-2022
This corpus consists of 45,658 articles in Spanish from the electronic archives of "El Mundo" Newspaper between 2020 and 2022. A few articles also come from publications from other related media: El Mundo Alicante, El Mundo Andalucía, El Mundo Baleares, El Mundo Catalunya, El Mundo Valéncia and Expansión. The number of ...
Archives of "El Mundo" Newspaper – Year 2021
This corpus consists of 14,461 articles in Spanish from the electronic archives of "El Mundo" Newspaper published in the year 2021. A few articles also come from publications from other related media: El Mundo Alicante, El Mundo Andalucía, El Mundo Baleares, El Mundo Catalunya, El Mundo Valéncia and Expansión. All articles ...
Vietnamese WordNet
Manual translation of version 2.1 of the English WordNet into Vietnamese, containing 211,000 entries, in Excel format.
Idioms French-Vietnamese Dictionary
French-Vietnamese idioms dictionary of 448 entries, with French terms translated into Vietnamese and one idiomatic sentence per Vietnamese word, in XML format.
Chinese-Vietnamese - PhraseBank with audio files
Chinese-Vietnamese - PhraseBank with audio files of daily conversations spoken by native speakers containing 4002 sentence pairs. Scripts with Pinyin, Topic, Cat, Vietnamese translation with corresponding audio in Chinese and Vietnamese. Corpus in XML and WAV formats.
Vietnamese Etymology Dictionary
Vietnamese Etymology Dictionary containing Vietnamese terms with correspondence in Kanji + Exp with meaning and examples of 3100 entries, provided in XML format.
MGB-5 Moroccan Dialect
The MGB-5 Moroccan Dialect comprises 14 hours of Moroccan Arabic speech extracted from 93 YouTube videos distributed across seven genres: comedy, cooking, family/children, fashion, drama, sports, and science clips. Given that dialectal Arabic does not have a clearly defined orthography, different people tend to write the same word in slightly ...
CroaTPAS
CroaTPAS (Croatian Typed Predicate Argument Structures) is a bi-lingual lexicon in Croatian and English. It was created by manual annotation from the Croatian Web as Corpus and pattern creation using the Skema editor on the Sketch Engine platform. CroaTPAS is tailor-made to represent verb polysemy and currently contains a total ...
T-PAS
T-PAS (Typed Predicate Argument Structures) is a digital lexicon consisting of a corpus-derived collection of Italian verb argument structures, whose arguments have been manually annotated with a set of hierarchically organised semantic labels called Semantic Types. T-PAS is primarily tailored for investigating verbal polysemy, since each semantically typed verb argument ...
CALEM (Comprehensive Arabic LEMmas)
Comprehensive Arabic LEMmas is a lexicon covering a large list of Arabic lemmas and their corresponding inflected word forms (stems) with details (POS + Root). Each lexical entry represents a lemma followed by all its possible stems and each stem is enriched by its morphological features, especially the root and ...
MADED (Moroccan Arabic Dialect Electronic Dictionary)
Moroccan Arabic Dialect Electronic Dictionary (MADED) is an electronic lexicon containing almost 11,500 entries. They are written in Arabic script wherein each Modern Standard Arabic (MSA) lemma is provided with its corresponding Moroccan Arabic equivalent. In addition, MADED entries are annotated with useful metadata such as part-of-speech (POS), root and ...
MORV (Moroccan Morphological vocabulary)
The Moroccan Morphological vocabulary is a lexicon containing more than 4.6 M entries describing a given Moroccan Arabic word with fourteen (14) morphological and semantic features: the word orthographic form, the segmentation (prefix and suffix), part-of-speech (POS), gender, number, tense and transitivity (for verbs), its origin, dialectal lemma, Arabic lemma, ...
Learner Corpus of Portuguese L2 – COPLE2
The Learner Corpus of Portuguese as Second/Foreign Language (COPLE2) is a corpus of written and oral texts produced by students of Portuguese as Foreign/Second Language courses in the Instituto de Cultura e Língua Portuguesa (the Institute of Portuguese Language and Culture) (ICLP – FLUL) and by applicants for examinations in ...
German Political Speeches Corpus
This corpus consists of a collection of political speeches in German crawled from the online archive of the German presidency (Bundespräsident) and the Chancellery (Bundesregierung). For the German Presidency the speeches are available from July 1, 1984 to February 17, 2012 and the corpus contains a total of 1,442 texts ...
ATCO2 Project Data
The ATCO2 project aims to develop a unique platform for collecting, organizing and pre-processing air-traffic control (voice communication) data from air space. This project has received funding from the Clean Sky 2 Joint Undertaking (JU) under grant agreement No 864702. The JU receives support from the European Union’s Horizon 2020 ...
French Speech Data by Mobile Phone_Reading - 231 Hours
The data volume is 231 hours, recorded by 406 speakers (from France, Canada, and Africa). The recording was made in a quiet environment and is rich in content. It covers various fields such as economics, entertainment, news, and spoken language. All texts are manually transcribed. The sentence accuracy rate is 95%. Format:16kHz, ...
Spanish Speech Data by Mobile Phone - 338 Hours
The 338 hours of Spanish speech data were recorded by 800 native Spanish speakers from Spain, Mexico and Argentina. The recording environment is quiet. All texts are manually transcribed. The sentence accuracy rate is 95%. It can be applied to speech recognition, machine translation, voiceprint recognition and so on. Format:16kHz, 16bit, uncompressed wav, ...
Hindi Speech Data by Mobile Phone - 759 Hours
The data is 759 hours long and was recorded by 1,425 Indian native speakers. The accent is authentic. The recording text is designed by language experts and covers general, interactive, car, home and other categories. The text is manually proofread, and the accuracy is high. Recording devices are mainstream Android ...
Indonesian Speech Data by Mobile Phone - 639 Hours
1,285 native Indonesian speakers with authentic accents participated in the recording. The recorded script is designed by linguists and covers a wide range of topics including generic, interactive, on-board and home. The text is manually proofread with high accuracy. It matches mainstream Android and Apple system phones. The data ...
Singaporean Speaking English Speech Data by Mobile Phone - 201 Hours
This dataset is recorded by 452 native Singaporean speakers with a balanced gender. It is rich in content and it covers generic command and control, human-machine interaction, smart home command and control, in-car command and control categories. The transcription corpus has been manually proofread to ensure high accuracy. Format:16kHz, 16bit, ...
Mixed Speech with Chinese and English Data by Mobile Phone - 1,535 Hours
The data is recorded by 3972 Chinese native speakers with accents covering seven major dialect areas. The recorded text is a mixture of Chinese and English sentences, covering general scenes and human-computer interaction scenes. It is rich in content and accurate in transcription. It can be used for improving the ...
Wuhan Dialect Speech Data by Mobile Phone - 997 Hours
Mobile phone captured audio data of Wuhan dialect, 997 hours in total, recorded by more than 2,000 Wuhan dialect native speakers. The recorded text covers generic, interactive, on-board, home and other categories, with rich contents. Wuhan locals participate in quality check and proofreading. Sentence accuracy rate reaches 95 %; this ...
German Speech Data by Mobile Phone - 1,796 Hours
German audio data captured by mobile phone, consisting of 1,796 hours in total, recorded by 3,442 German native speakers. The recorded text is designed by linguistic experts, covering generic, interactive, on-board, home and other categories. The text has been proofread manually with high accuracy; this data can be used for ...
Spanish Speech Data by Mobile Phone - 435 Hours
The data volume is 435 hours, recorded by 989 native Spanish speakers. The recording text is designed by linguistic experts and covers general, interactive, in-car and home categories. The texts are manually proofread with high accuracy. Recording devices are mainstream Android phones and iPhones. Format:16kHz, 16bit, uncompressed wav, ...
Sichuan Dialect Conversational Speech Data by Mobile Phone - 800 Hours
1,730 native Sichuan speakers took part in the recording, talking face to face freely and naturally across a wide range of fields, with no topic specified. The speech is natural and fluent, in line with actual dialogue scenes. The speech was transcribed into text manually to ensure high accuracy. Format:16kHz, ...
Italian Speech Data by Mobile Phone_Reading - 215 Hours
Italian speech data (reading) is collected from 325 native Italian speakers and is recorded in a quiet environment. The recording is rich in content, covering multiple categories such as economics, entertainment, news, and spoken language. Each sentence contains 9.2 words on average and is repeated 2.7 times on average. All texts ...
Chinese Speaking English Speech Data by Mobile Phone - 502 Hours
1,279 Chinese speakers from major dialect regions participated in the recording. It is in line with the specific accent of Chinese English speakers. The recorded script covers many categories such as spoken English, speech, and human-computer interaction; it is rich in content, extensive in fields, and balanced in phonemes. It can be ...
Chinese Speaking English Speech Data by Mobile Phone - 593 Hours
This dataset consists of 100,000 colloquial English sentences recorded by 3,691 Chinese speakers, covering many domestic dialect zones such as Jiangsu, Shandong, Beijing and Henan, and reflects the specific accent of Chinese speakers of English. The recording texts contain commonly used sentences with rich content, broad fields, and balanced phonemes. It can be used in ...
Mandarin Mobile Telephony Conversational Speech Collection Data - 2,657 Hours
4,491 speakers participated in the recording and conducted face-to-face communication in a natural way. No topics are specified, with a wide range of fields; the voice was natural and fluent, in line with the actual dialogue scene. Text is transcribed manually, with high accuracy. Format:16kHz, 16bit, uncompressed wav, mono channel ...
Japanese Speech Data by Mobile Phone - 474 Hours
The data were recorded by 1,245 native Japanese speakers. The recorded content covers a wide range of categories such as general purpose, interactive, in car commands, home commands, etc. The recorded text is designed by a language expert, and the text is manually proofread with high accuracy. Match mainstream Android, ...
Japanese Speech Data by Mobile Phone - 261 Hours
1,006 native Japanese speakers participated in the recording, coming from the eastern, western, and Kyushu regions, with the eastern region accounting for the largest proportion. The recording content is rich and all texts have been manually transcribed with high accuracy. Format:16kHz, 16bit, uncompressed wav, mono channel Recording environment:quiet indoor environment, without ...
British English Speech Data by Mobile Phone_Reading - 199 Hours
The data set contains speech data from 346 British English speakers, all of whom are English locals. Each speaker recorded around 392 sentences. The valid data is 199 hours. The recording environment is quiet. Recording contents cover various categories such as economics, news, entertainment, commonly used spoken language, letters, figures, etc. Format:16kHz, 16bit, ...
Brazilian Portuguese Speech Data by Mobile Phone - 1,044 Hours
The data volume is 1,044 hours, recorded by 2,038 native Brazilian speakers. The recording text is designed by linguistic experts and covers general, interactive, in-car and home categories. The texts are manually proofread with high accuracy. Recording devices are mainstream Android phones and iPhones. Format:16kHz, 16bit, uncompressed wav, ...
Mandarin Heavy Accent Speech Data by Mobile Phone - 662 Hours
It collects speech from 2,034 local Chinese speakers from 26 provinces, including Henan, Shanxi, Sichuan, Hunan and Fujian. It is Mandarin speech data with heavy accents. The recording contents are finance and economics, entertainment, policy, news, TV, and movies. Format:16kHz, 16bit, uncompressed wav, mono channel. Recording environment:1,288 people complete the recording in relatively ...
Cantonese Dialect Speech Data by Mobile Phone - 1,652 Hours
It collects 4,888 speakers from Guangdong Province and is recorded in quiet indoor environment. The recorded content covers 500,000 commonly used spoken sentences, including high-frequency words in weico and daily used expressions. The average number of repetitions is 1.5 and the average sentence length is 12.5 words. Recording devices are ...
Spanish Speech Data by Mobile Phone_R - 227 Hours
The data volume is 227 hours. It is recorded by native Spanish speakers from Spain, Mexico and Venezuela in a quiet environment. The recording contents cover various fields such as economy, entertainment, news and spoken language. All texts are manually transcribed. The sentence accuracy is 95%. Format:16kHz, 16bit, uncompressed ...
Italian Speech Data by Mobile Phone - 1,441 Hours
The data were recorded by 3,109 native Italian speakers with authentic Italian accents. The recorded content covers a wide range of categories such as general purpose, interactive, in car commands, home commands, etc. The recorded text is designed by a language expert, and the text is manually proofread with high ...
Japanese Speaking English Speech Data by Mobile Phone - 207 Hours
400 native Japanese speakers involved, balanced for gender. The recording corpus is rich in content, and it covers a wide domain such as generic command and control category, human-machine interaction category, smart home category, in-car category. The transcription corpus has been manually proofread to ensure high accuracy. Format:16kHz, 16bit, uncompressed ...
Indian English Speech Data by Mobile Phone - 1,012 Hours
Indian English audio data captured by mobile phones, 1,012 hours in total, recorded by 2,100 Indian native speakers. The recorded text is designed by linguistic experts, covering generic, interactive, on-board, home and other categories. The text has been proofread manually with high accuracy; this data set can be used for ...
French Speech Data by Mobile Phone - 769 Hours
The data volume is 769 hours, recorded by 1,623 native French speakers. The recording text is designed by linguistic experts and covers general, interactive, in-car and home categories. The texts are manually proofread with high accuracy. Recording devices are mainstream Android phones and iPhones. Format:16kHz, 16bit, uncompressed wav, ...
Mandarin Conversational Speech Data by Mobile Phone and Voice Recorder - 1,351 Hours
1950 speakers participated in the recording, and conducted face-to-face communication in a natural way. They had free discussion on a number of given topics, with a wide range of fields. The voice was natural and fluent, in line with the actual dialogue scene. Text is transcribed manually, with high accuracy. ...
Spanish Speaking English Speech Data by Mobile Phone - 388 Hours
891 native Spanish speakers with authentic accents participated in the recording. The recorded script is designed by linguists and covers a wide range of topics including generic, interactive, on-board and home. The text is manually proofread with high accuracy. It matches mainstream Android and Apple system phones. The data ...
Latin American Speaking English Speech Data by Mobile Phone - 117 Hours
281 Latin American people recorded in a relatively quiet environment in authentic English. The recorded script is designed by linguists and covers a wide range of topics including generic, interactive, on-board and home. The text is manually proofread with high accuracy. It matches with mainstream Android and Apple system phones. ...
Korean Speech Data by Mobile Phone_Reading - 197 Hours
It collects 291 Korean locals and is recorded in quiet indoor environment. The recordings include economics, entertainment, news, oral, figure, letter. 400 sentences for each speaker. Recording devices are mainstream Android phones and iPhones. Format:16kHz, 16bit, uncompressed wav, mono channel Recording environment:quiet indoor environment, without echo Recording content:economy, entertainment, news, ...
Changsha Dialect Speech Data by Mobile Phone - 997 Hours
2,000 Changsha natives participated in the recording, covering multiple age groups, with a balanced gender distribution and authentic accents. The recorded text is rich in content, covering general, interactive, car, home and other categories. Local people in Changsha check and proofread the text. The sentence accuracy is 95%. It is mainly ...
Chinese Digital Speech Data by Mobile Phone - 11,010 People
11,010 native Chinese speakers participated in the recording, with a balanced gender distribution. Each speaker reads 30 sentences consisting of 4- to 8-digit numbers. Format:16kHz, 16bit, uncompressed wav, mono channel Recording environment:quiet indoor environment, without echo Recording content (read speech):four- to eight-digit strings Speaker:11,010 Chinese, 58% of which are female Device:Android mobile ...
Chinese Children Speaking English Speech Data by Mobile Phone - 464 Hours
Audio data of children reading English, covering ages from preschool (3-5 years old) to post-school (6-12 years old), with children's speech features. Content accurately matches children's actual scenes of speaking English. It provides data support for children's smart home, automatic speech recognition and oral assessment in intelligent education scenes. Format:16kHz,16bit, ...
Malay Speech Data by Mobile Phone - 370 Hours
675 native Malaysian speakers with authentic accents participated in the recording. The recorded script is designed by linguists and covers a wide range of topics including generic, interactive, on-board and home. The text is manually proofread with high accuracy. It matches mainstream Android and Apple system phones. The data ...
Shanghai Dialect Speech Data by Mobile Phone - 1,030 Hours
It collects speech from 2,956 speakers from Shanghai and is recorded in a quiet indoor environment. The recorded content includes multi-domain customer consultation, short messages, numbers, Shanghai POI, etc. The corpus has no repetition and the average sentence length is 12.68 words. Recording devices are mainstream Android phones and iPhones. Format:16kHz, 16bit, uncompressed ...
Kunming Dialect Speech Data by Mobile Phone - 1,002 Hours
2,284 native speakers of Kunming dialect participated in the recording, with authentic accents and from multiple age groups. The recorded script covers a wide range of topics such as generic, interactive, on-board, and home. Local people in Kunming participated in quality check and proofreading, and the text was transcribed accurately. ...
Mandarin Speech Data by Mobile Phone - 2,028 Hours
4,787 native Chinese speakers participated in the recording, with a balanced gender distribution. Speakers are from various provinces of China. The recording content is rich, covering mobile phone voice assistant interaction, smart home command and control, in-car command and control, numbers, and other fields, which accurately matches the smart home, intelligent ...
Indonesian Speech Data by Mobile Phone_R - 359 Hours
Indonesian speech data (reading) is collected from 496 native Indonesian speakers and is recorded in a quiet environment. The recording is rich in content, covering multiple categories such as economics, entertainment, news, figures, letters, and spoken language. Each speaker recorded around 400 sentences. The valid data volume is 360 hours. All texts ...