1,413 language resources at your disposal
An increasing number of language resources (LRs) in the various fields of Human Language Technology (HLT) (see image on the left-hand side) are distributed on behalf of ELRA via its operational body ELDA, thanks to the contributions of various players in the HLT community.
Through this repository, we aim to provide language resources so that researchers and developers do not have to spend effort rebuilding resources that already exist, and to help them identify and access those resources.
Latest Resources
Corpus of Interactions between Seniors and an Empathic Virtual Coach in Spanish, French and Norwegian
The Corpus of Interactions between Seniors and an Empathic Virtual Coach in Spanish, French and Norwegian was built within the EMPATHIC project (Empathic, Expressive, Advanced Virtual Coach to Improve Independent Healthy-Life-Years of the Elderly), funded within the European Union’s Horizon 2020 Research and Innovation program. It consists of video recordings, ...
A Bilingual English-Ukrainian Lexicon of Named Entities Extracted from Wikipedia
The bilingual English-Ukrainian lexicon of named entities uses Wikipedia metadata as a source. The extracted named entity pairs are classified into five classes: PERSON, ORGANIZATION, LOCATION, PRODUCT, and MISC (miscellaneous). The lexicon consists of 624,168 pairs and comes in two formats: CSV and XML.
Annotated tweet corpus in Arabizi, French and English
The annotated tweet corpus in Arabizi, French and English was built by ELDA on behalf of INSA Rouen Normandie (Normandie Université, LITIS team), in the framework of the SAPhIRS project (System for the Analysis of Information Propagation in Social Networks), funded by the DGE (Direction Générale des Entreprises, France) through ...
"La Dépêche de Kabylie" Corpus
"La Dépêche de Kabylie" Corpus consists of about 1,570,000 words in Amazigh language collected from the Algerian newspaper entitled “La Dépêche de Kabylie”. It was collected thanks to HTTrack Website Copier and contains about 90% of all entries of the Amazigh language. All articles are gathered under one plain text ...
English-Vietnamese Special Dictionary: Real Estate
English-Vietnamese Special Dictionary: Real Estate consists of 2,585 entries in the real estate domain. It is provided in XML format.
English-Vietnamese Special Dictionary: Law
English-Vietnamese Special Dictionary: Law consists of 3,011 entries in the law domain. It is provided in XML format.
English-Vietnamese Special Dictionary: Medical
English-Vietnamese Special Dictionary: Medical consists of 8,073 entries in the medical domain. It is provided in XML format.
English-Vietnamese Special Dictionary: Finance
English-Vietnamese Special Dictionary: Finance consists of 9,039 entries in the finance domain. It is provided in XML format.
English-Vietnamese Special Dictionary: Math
English-Vietnamese Special Dictionary: Math consists of 15,004 entries in the math domain. It is provided in XML format.
English-Vietnamese Special Dictionary: Navigation
English-Vietnamese Special Dictionary: Navigation consists of 19,393 entries in the navigation domain. It is provided in XML format.
English-Vietnamese Special Dictionary: Stocks
English-Vietnamese Special Dictionary: Stocks consists of 1,094 entries in the stocks domain. It is provided in XML format.
English-Vietnamese Special Dictionary: Economics
English-Vietnamese Special Dictionary: Economics consists of 16,255 entries in the economics domain. It is provided in XML format.
English-Vietnamese Special Dictionary: Physics
English-Vietnamese Special Dictionary: Physics consists of 23,584 entries in the physics domain. It is provided in XML format.
English-Vietnamese Special Dictionary: Tourism
English-Vietnamese Special Dictionary: Tourism consists of 2,235 entries in the tourism domain. It is provided in XML format.
English-Vietnamese Special Dictionary: Aesthetic
English-Vietnamese Special Dictionary: Aesthetic consists of 836 entries in the aesthetic domain. It is provided in XML format.
English-Vietnamese Special Dictionary: Mechanical
English-Vietnamese Special Dictionary: Mechanical consists of 3,482 entries in the mechanical domain. It is provided in XML format.
English-Vietnamese Special Dictionary: Informatics
English-Vietnamese Special Dictionary: Informatics consists of 3,835 entries in the informatics domain. It is provided in XML format.
English-Vietnamese Special Dictionary: Architecture
English-Vietnamese Special Dictionary: Architecture consists of 18,213 entries in the architecture domain. It is provided in XML format.
Tham Khasi annotated corpus
The Tham Khasi annotated corpus is a corpus of Khasi, an Austro-Asiatic language, comprising Khasi sentences extracted from textbooks prescribed for students at the secondary, higher secondary, graduate, and postgraduate levels in the year 2015-2016. In the corpus, each word is separated by a space and each sentence is marked with an ...
Parallel Corpora for 6 Indian Languages
The Parallel Corpora for 6 Indian Languages contains data sets for Bengali (540,000 words – 20,000 parallel sentences), Hindi (1,200,000 words – 37,000 parallel sentences), Malayalam (660,000 words – 29,000 parallel sentences), Tamil (747,000 words – 35,000 parallel sentences), Telugu (951,000 words – 43,000 parallel sentences), and Urdu (1,200,000 ...
English-Punjabi Code-Mixed Social Media Content
The English-Punjabi Code-Mixed Social Media Content corpus is composed of 893,615 parallel sentences of English-Punjabi distributed over the following domains: - 82,341 parallel sentences of English-Punjabi code-mixed Agriculture Domain Data, - 59,158 parallel sentences of English-Punjabi code-mixed Culture Domain Data, - 101,732 parallel sentences of English-Punjabi code-mixed Entertainment ...
Danish Gigaword Corpus
The Danish Gigaword Project (DAGW) maintains a corpus for Danish with over a billion words. The general goals are to create a dataset that is: 1. representative; 2. accessible; 3. a suitable common starting point for Danish NLP models. The present version 1.0 was collected from various websites. Domains are ...
German-Vietnamese Dictionary
The German-Vietnamese Dictionary consists of 32,511 entries containing the following information: phonetics (using IPA), morphology, grammar, semantics, pragmatics and examples available only for the source language. Headwords (in Vietnamese) are pronounced in a real human voice by native speakers.
Vietnamese-French Dictionary
The Vietnamese-French Dictionary consists of 43,296 entries containing the following information: phonetics (using IPA), morphology, grammar, semantics, pragmatics and examples for the source language only. The dictionary is provided in XML format.
Vietnamese-German Dictionary
The Vietnamese-German Dictionary consists of 42,793 entries containing the following information: phonetics (using IPA), morphology, grammar, semantics, pragmatics and examples available only for the source language.
French-Vietnamese Dictionary
The French-Vietnamese Dictionary consists of 82,768 entries containing the following information: phonetics (using IPA), morphology, grammar, semantics, pragmatics and examples. All headwords are pronounced in a real human voice by native speakers. The dictionary is provided in XML format.
Ema-lon Manipuri Corpus (including word embedding and language model)
The Ema-lon Manipuri Corpus consists of a set of resources for the Manipuri language (locally known as Meiteilon) for the purpose of machine translation. The main source for these resources is the Sangai Express news website. The resources that constitute the present corpus are listed below: 1. EM Corpus, abbreviation of ...
NRC Emotion Lexicon - Revised version
The NRC Emotion Lexicon was originally built by Saif M. Mohammad and Peter D. Turney through crowdsourcing. The lexicon was created to assist with emotion analysis, as other emotion lexicons were smaller at the time. To address this problem, Saif crowdsourced a huge ...
Persian Ezafe Construction Dataset
The Persian Ezafe Construction Dataset includes gold Ezafe tags in almost 30 thousand Persian sentences. The sentences were manually annotated by six annotators who were all native Persian speakers and linguists. The inter-annotator agreement on a small portion of the data (one thousand sentences) is 99.6%. Ezafe is an unstressed ...
English-Chinese-Vietnamese Trilingual Parallel Corpus
The English-Chinese-Vietnamese Trilingual Parallel Corpus consists of 20,046 trilingual sets of sentence pairs. The corpus is provided in XML format and is annotated according to TEI-encoding guidelines.
Vietnamese-Korean Dictionary
The Vietnamese-Korean Dictionary consists of 27,449 entries containing the following information: phonetics (using IPA), morphology, grammar, semantics, pragmatics and examples available only for the source language. The dictionary is provided in XML format.
Korean-Vietnamese Parallel Corpus
The Korean-Vietnamese Parallel Corpus consists of 200,000 sentence pairs, with an average length of 15 words per sentence. The corpus is provided in XML format and is annotated according to TEI-encoding guidelines.
Vietnamese-English Dictionary
The Vietnamese-English Dictionary consists of 156,000 entries containing the following information: phonetics (using IPA), morphology, grammar, semantics, pragmatics and examples for the source language only. The dictionary is provided in XML format.
English-Vietnamese Parallel Corpus
The English-Vietnamese Parallel Corpus consists of 1,000,000 sentence pairs, with an average length of 20 words per sentence. The corpus is provided in XML format and is annotated according to TEI-encoding guidelines.
Japanese-Vietnamese Dictionary
The Japanese-Vietnamese Dictionary consists of 59,369 entries containing the following information: phonetics (using IPA), morphology, grammar, semantics, pragmatics and examples for the source language only. The dictionary is provided in XML format.
Chinese-Vietnamese Parallel Corpus
The Chinese-Vietnamese Parallel Corpus consists of 200,000 sentence pairs, with an average length of 15 words per sentence. The corpus is provided in XML format and is annotated according to TEI-encoding guidelines.
English-Vietnamese Dictionary
The English-Vietnamese Dictionary consists of 125,000 entries containing the following information: phonetics (using IPA), morphology, grammar, semantics, pragmatics and examples for the source language only. The dictionary is provided in XML format.
Monolingual Vietnamese Annotated Corpus
The Monolingual Vietnamese Annotated Corpus consists of 100,000 sentences, manually annotated with word boundaries, POS, named entities, with an average length of 20 words per sentence. The corpus is provided in XML format and is annotated according to TEI-encoding guidelines.
Korean-Vietnamese Dictionary
The Korean-Vietnamese Dictionary consists of 37,678 entries containing the following information: phonetics (using IPA), morphology, grammar, semantics, pragmatics and examples available only for the source language. The dictionary is provided in XML format.
Chinese-Vietnamese Dictionary
The Chinese-Vietnamese Dictionary consists of 52,470 entries containing the following information: phonetics (using IPA), morphology, grammar, semantics, pragmatics and examples. The dictionary is provided in XML format.
Vietnamese-Japanese Dictionary
The Vietnamese-Japanese Dictionary consists of 65,000 entries containing the following information: phonetics (using IPA), morphology, grammar, semantics, pragmatics and examples available for source language only. The dictionary is provided in XML format.
Vietnamese-Chinese Dictionary
The Vietnamese-Chinese Dictionary consists of 50,911 entries containing the following information: phonetics (using IPA), morphology, grammar, semantics, pragmatics and examples for the source language only. The dictionary is provided in XML format.
Arabic Speech Corpus
This speech corpus was developed as part of PhD work carried out by Nawar Halabi at the University of Southampton. The corpus was recorded through a Neumann TLM 103 Studio Microphone by one male speaker in South Levantine Arabic (Damascene accent) in a professional studio. The transcript was ...
Ahoslabi - esophageal speech database
Ahoslabi was built within the frame of the RESTORE project (“Restauración, almacenamiento y rehabilitación de la voz”) (restrictions apply) and has received funding from the Spanish Ministry of Economy and Competitiveness with FEDER support (RESTORE project, TEC2015-67163-C2-1-R), the Basque Government (PIBA-018-0035) and the European Union's H2020 research and innovation ...
Japanese Kids Speech database (Lower Grade)
The Japanese Kids Speech database (Lower Grade) contains recordings of 179 Japanese child speakers (71 males and 108 females), aged 6 to 9 years (first, second and third graders in elementary school), recorded in quiet rooms using smartphones. This database may be combined with the Japanese Kids ...
Japanese Kids Speech database (Upper Grade)
The Japanese Kids Speech database (Upper Grade) contains recordings of 232 Japanese child speakers (104 males and 128 females), aged 9 to 13 years (fourth, fifth and sixth graders in elementary school), recorded in quiet rooms using smartphones. This database may be combined with the Japanese Kids ...
CAREGIVER Corpus
CAREGIVER, a multilingual speech corpus for modeling language acquisition, was designed and recorded within the framework of the EU-funded Acquisition of Communication and Recognition Skills (ACORNS) project. The motivation behind the corpus and its design relies on current knowledge regarding infant language acquisition. Instead of recording ...
MDT Mandarin Chinese Conversational Recognition Corpus – 2 channels
This dataset consists of 4.98 hours of transcribed conversational speech in Mandarin Chinese, where 30 conversations are uttered by 32 speakers (16 males and 16 females). The audio is sampled at 16 kHz and quantized at 16 bits. For each conversation, there are two close-talking channels recorded via the microphones, ...
MDT Mandarin Chinese Conversational Recognition Corpus – 1 channel
This dataset consists of 4.98 hours of transcribed conversational speech in Mandarin Chinese, where 30 conversations are uttered by 32 speakers (16 males and 16 females). The audio is sampled at 16 kHz and quantized at 16 bits. For each conversation, there are two close-talking channels recorded via the microphones, ...
MDT Mandarin Chinese Conversational Recognition Corpus – 3 channels
This dataset consists of 4.98 hours of transcribed conversational speech in Mandarin Chinese, where 30 conversations are uttered by 32 speakers (16 males and 16 females). The audio is sampled at 16 kHz and quantized at 16 bits. For each conversation, there are two close-talking channels recorded via the microphones, ...
MDT Mandarin Chinese Conversational Recognition Corpus – Complete set
This dataset consists of 4.98 hours of transcribed conversational speech in Mandarin Chinese, where 30 conversations are uttered by 32 speakers (16 males and 16 females). The audio is sampled at 16 kHz and quantized at 16 bits. For each conversation, there are two close-talking channels recorded via the microphones, ...
Italian Speech Recognition Corpus (Desktop)
This corpus comprises 49,994 entries uttered by 50 speakers (23 males and 27 females), recorded over 2 channels (desktop in quiet office). Speech samples are stored as 16-bit audio at 48 kHz, for a total of 24.21 hours of speech per channel.
Australian English Kids Speech Recognition Corpus (Desktop)
This corpus comprises 9,596 entries uttered by 30 speakers (15 males and 15 females), recorded over 2 channels (desktop in quiet office). Speech samples are stored as 16-bit audio at 44.1 kHz, for a total of 5 hours of speech per channel.
Canadian English Speech Recognition Corpus (Telephone) - sentences
This corpus comprises 1,500 entries uttered by 150 speakers (106 males and 44 females), recorded over the telephone network. Speech samples are stored as 16-bit audio at 8 kHz, for a total of 2.09 hours of speech.
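As a rough guide to what the audio figures quoted for the three speech recognition corpora above imply for storage, the sketch below estimates the size of the raw audio from bit depth, sampling rate and duration. It assumes uncompressed linear PCM with no file headers, one channel for the telephone corpus, and two channels for the desktop corpora. These are illustrative assumptions, not details stated in the catalogue, and the function name `pcm_bytes` is hypothetical.

```python
def pcm_bytes(bit_depth: int, sample_rate_hz: int, hours: float, channels: int = 1) -> int:
    """Approximate size in bytes of uncompressed linear PCM audio (headers ignored)."""
    bytes_per_sample = bit_depth // 8
    seconds = hours * 3600
    return round(bytes_per_sample * sample_rate_hz * seconds * channels)


# (bit depth, sampling rate in Hz, hours per channel, channels).
# Channel counts for storage purposes are an assumption made for this sketch.
corpora = {
    "Italian Speech Recognition Corpus (Desktop)": (16, 48_000, 24.21, 2),
    "Australian English Kids Speech Recognition Corpus (Desktop)": (16, 44_100, 5, 2),
    "Canadian English Speech Recognition Corpus (Telephone)": (16, 8_000, 2.09, 1),
}

for name, (bits, rate, hours, channels) in corpora.items():
    size_gb = pcm_bytes(bits, rate, hours, channels) / 1e9
    print(f"{name}: about {size_gb:.2f} GB of raw audio")
```

Actual download sizes may differ, since the distributed packages may use a different audio encoding or include transcriptions and documentation.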