ELRA
Products meeting the search criteria
    ELRA-AURORA-CD0002 AURORA Project Database 2.0 - Evaluation Package
    The Aurora Project Database 2.0 is a revised version of the Noisy TI digits database, produced to follow on from the work of ETSI. This CD set replaces the previous one (version 1.0 consisted of 2 CDs, while version 2.0 consists of 4 CDs). This database is intended for the evaluation of front-end feature extraction algorithms in background noise, but may also be used more widely by speech researchers to evaluate and compare the performance of noise-robust speech recognition algorithms.

    ELRA-AURORA-CD0003-01 AURORA Project database - Subset of SpeechDat-Car - Finnish database - Evaluation Package
    This database is a subset of the Finnish SpeechDat-Car database, which was collected as part of the European Union-funded SpeechDat-Car project. It contains isolated and connected Finnish digits spoken in different driving conditions inside a car.

    ELRA-AURORA-CD0003-02 AURORA Project database - Subset of SpeechDat-Car - Spanish database - Evaluation Package
    This database is a subset of the Spanish SpeechDat-Car database, which was collected as part of the European Union-funded SpeechDat-Car project. It contains isolated and connected Spanish digits spoken in different noise and driving conditions inside a car.

    ELRA-AURORA-CD0003-03 AURORA Project database - Subset of SpeechDat-Car - German database - Evaluation Package
    This database is a subset of the German SpeechDat-Car database, which was collected as part of the European Union-funded SpeechDat-Car project. It contains isolated and connected German digits spoken in different noise and driving conditions inside a car.

    ELRA-AURORA-CD0003-04 AURORA Project database - Subset of SpeechDat-Car - Danish database - Evaluation Package
    This database is a subset of the Danish SpeechDat-Car database, which was collected as part of the European Union-funded SpeechDat-Car project. It contains isolated and connected Danish digits spoken in different noise and driving conditions inside a car.

    ELRA-AURORA-CD0003-05 AURORA Project database - Subset of SpeechDat-Car - Italian database - Evaluation Package
    This database is a subset of the Italian SpeechDat-Car database, which was collected as part of the European Union-funded SpeechDat-Car project. It contains 2200 Italian connected digit utterances, divided into training and testing utterances, recorded in different noise and driving conditions inside a car.

    ELRA-AURORA-CD0004-01 AURORA Project Database - Aurora 4a - Evaluation Package
    The Aurora project has released a number of list files for performing training and testing on the Wall Street Journal (WSJ0) data at two sampling rates: 8 kHz and 16 kHz. The Aurora 4a database is based on WSJ0 with noise artificially added over a range of signal-to-noise ratios. It contains both clean and multicondition training sets and 14 evaluation sets with different noise types and microphones.

    ELRA-AURORA-CD0004-02 AURORA Project Database - Aurora 4b - Evaluation Package
    The Aurora project has released a number of list files for performing training and testing on the Wall Street Journal (WSJ0) data at two sampling rates: 8 kHz and 16 kHz. The Aurora 4b database contains noisy versions of the Nov'92 WSJ0 development set.

    ELRA-AURORA-CD0005 AURORA-5
    The AURORA-5 database has been developed mainly to investigate the influence of hands-free speech input in noisy room environments on the performance of automatic speech recognition. Furthermore, two test conditions are included to study the influence of transmitting the speech in a mobile communication system.
    It contains artificially distorted versions of the recordings from adult speakers in the TI-Digits speech database, downsampled to a sampling frequency of 8000 Hz; a set of recordings containing sequences of digits uttered by different speakers in hands-free mode in a meeting room; and a set of scripts for running recognition experiments on those speech data. The experiments rely on the freely available software package HTK, which is not part of this resource.

    ELRA-B0002 LusoLEX European Portuguese Lexicon
    LusoLEX: Multifunctional monolingual lexicon of the European variety of Portuguese, consisting of about 61,000 entries (lemmas) and 1,600 corresponding inflexion paradigms. The set of entries includes compound words, and the inflexion paradigms include information regarding enclitics, augmentatives and diminutives. Morphological information is encoded with maximum granularity and conforms to the EAGLES recommendations.

    ELRA-B0002 BrasiLEX Brazilian Portuguese lexicon
    BrasiLEX: Multifunctional monolingual lexicon of the Brazilian variety of Portuguese, consisting of about 65,000 entries (lemmas) and 1,600 corresponding inflexion paradigms. The set of entries includes compound words, and the inflexion paradigms include information regarding enclitics and augmentative/diminutive degree. Morphological information is encoded with maximum granularity and conforms to the EAGLES recommendations.

    ELRA-B0003 Austrian SpeechDat(AT) FDB-1000 database
    This speech database contains the recordings of 1,000 Austrian speakers recorded over the fixed telephone network. Each speaker uttered around 60 read and spontaneous items.

    ELRA-B0003 Austrian SpeechDat(AT) MDB-1000 database
    This speech database contains the recordings of 1,000 Austrian speakers recorded over the Austrian mobile telephone network. Each speaker uttered around 60 read and spontaneous items.

    ELRA-B0004 OrienTel French as spoken in Morocco database
    This speech database contains the recordings of 530 Moroccan speakers of French recorded over the Moroccan fixed and mobile telephone network. Each speaker uttered around 47 read and spontaneous items.

    ELRA-B0004 OrienTel Morocco MSA (Modern Standard Arabic) database
    This speech database contains the recordings of 530 Moroccan speakers recorded over the Moroccan fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.

    ELRA-B0004 OrienTel Morocco MCA (Modern Colloquial Arabic) database
    This speech database contains the recordings of 772 Moroccan speakers recorded over the Moroccan fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.

    ELRA-B0005 OrienTel Tunisia MCA (Modern Colloquial Arabic) database
    This speech database contains the recordings of 792 Tunisian speakers recorded over the Tunisian fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.

    ELRA-B0005 OrienTel Tunisia MSA (Modern Standard Arabic) database
    This speech database contains the recordings of 598 Tunisian speakers recorded over the Tunisian fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.

    ELRA-B0005 OrienTel French as spoken in Tunisia database
    This speech database contains the recordings of 576 Tunisian speakers of French recorded over the Tunisian fixed and mobile telephone network. Each speaker uttered around 47 read and spontaneous items.

    ELRA-B0006 OrienTel Egypt MCA (Modern Colloquial Arabic) database
    This speech database contains the recordings of 750 Egyptian speakers recorded over the Egyptian fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.

    ELRA-B0006 OrienTel Egypt MSA (Modern Standard Arabic) database
    This speech database contains the recordings of 500 Egyptian speakers recorded over the Egyptian fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.

    ELRA-B0006 OrienTel English as spoken in Egypt database
    This speech database contains the recordings of 500 Egyptian speakers of English recorded over the Egyptian fixed and mobile telephone network. Each speaker uttered around 47 read and spontaneous items.

    ELRA-B0007 IDIOLOGOS 1 “Bootstrap” (NEOLOGOS Project)
    The IDIOLOGOS 1 “Bootstrap” database was produced within the French national project NEOLOGOS, as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT). It comprises 1000 adult French speakers (470 males, 530 females) recorded over the French fixed telephone network.

    ELRA-B0007 IDIOLOGOS 2 “Eingenspeakers” (NEOLOGOS Project)
    The IDIOLOGOS 2 “Eingenspeakers” database was produced within the French national project NEOLOGOS, as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT). It comprises 200 adult French speakers (97 males, 103 females) recorded over the French fixed telephone network.

    ELRA-B0008 LC-STAR Spanish phonetic lexicon
    The LC-STAR Spanish phonetic lexicon comprises more than 100,000 words, including a set of 55,854 common words, a set of 45,403 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 7,498 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.

    ELRA-B0008 LC-STAR Catalan phonetic lexicon
    The LC-STAR Catalan phonetic lexicon comprises more than 100,000 words, including a set of 53,225 common words, a set of 45,306 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 7,498 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.

    ELRA-B0009 TC-STAR English Training Corpora for ASR: Transcriptions of EPPS Speech
    This corpus consists of transcriptions from 92 hours of EPPS (European Parliament Plenary Sessions) speeches held or interpreted in European English (a mixture of native and non-native English). The transcription files are stored in Transcriber XML file format.

    For corresponding recordings, see ELRA-S0251.

    ELRA-B0009 TC-STAR English Training Corpora for ASR: Recordings of EPPS Speech
    This corpus consists of around 290 hours of recordings from EPPS (European Parliament Plenary Sessions) speeches held or interpreted in European English, 92 hours of which were annotated (transcribed); the transcriptions are not provided in the present package. Each file contains a single channel with 16-bit resolution at a sample rate of 16 kHz.

    For corresponding transcriptions, see ELRA-S0249.

    ELRA-B0011 OrienTel Jordan MCA (Modern Colloquial Arabic) database
    This speech database contains the recordings of 757 Jordanian speakers recorded over the Jordanian fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.

    ELRA-B0011 OrienTel Jordan MSA (Modern Standard Arabic) database
    This speech database contains the recordings of 556 Jordanian speakers recorded over the Jordanian fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.

    ELRA-B0011 OrienTel English as spoken in Jordan database
    This speech database contains the recordings of 578 Jordanian speakers of English recorded over the Jordanian fixed and mobile telephone network. Each speaker uttered around 47 read and spontaneous items.

    ELRA-B0012 CHIL 2004 Evaluation Package
    The CHIL Seminars are scientific presentations given by students, faculty members or invited speakers in the field of multimodal interfaces and speech processing. The language is European English spoken by non-native speakers. The recordings comprise the following: videos of the speaker and the audience from 4 fixed cameras, frontal close-ups of the speaker, and close-talking and far-field microphone data of the speaker’s voice and background sounds.

    The database consists of:
    1) Audio and video recordings of 10 seminars.
    2) Video annotations made on 1 out of every 10 pictures in sequence, for the 4 cameras.
    3) Transcriptions in both TRS and STMUID formats.

    ELRA-B0012 CHIL 2005 Evaluation Package
    The CHIL Seminars are scientific presentations given by students, faculty members or invited speakers in the field of multimodal interfaces and speech processing. The language is European English spoken by non-native speakers. The recordings comprise the following: videos of the speaker and the audience from 4 fixed cameras, frontal close-ups of the speaker, and close-talking and far-field microphone data of the speaker’s voice and background sounds.

    The database consists of:
    1) Contents of the CHIL 2004 Evaluation Package (see catalogue reference ELRA-E0009 for description).
    2) Audio and video recordings: 5 seminars (recorded in November 2004).
    3) Stereo video recordings of 10 subjects who move in the camera’s field of view while performing pointing gestures.
    4) Video annotations.
    5) Transcriptions.

    ELRA-B0012 CHIL 2006 Evaluation Package
    The CHIL Seminars are scientific presentations given by students, faculty members or invited speakers in the field of multimodal interfaces and speech processing. The language is European English spoken by non-native speakers. The recordings comprise the following: videos of the speaker and the audience from 4 fixed cameras, frontal close-ups of the speaker, and close-talking and far-field microphone data of the speaker’s voice and background sounds.

    The CHIL 2006 Evaluation Package consists of:
    1) A set of audiovisual recordings of seminars, called non-interactive seminars, and of highly interactive small working-group seminars, called interactive seminars. The recordings were made between 2004 and 2005 according to the “CHIL Room Setup” specification.
    2) Video annotations.
    3) Orthographic transcriptions.

    ELRA-B0013 TC-STAR Spanish Baseline Female Speech Database
    This database contains the recordings of one female Spanish speaker, recorded in a noise-reduced room simultaneously through a close-talk microphone, a mid-distance microphone and a laryngograph. It consists of the recordings and annotations of read text material, amounting to approximately 10 hours of speech, for baseline applications (Text-to-Speech systems).

    The TC-STAR Spanish Baseline Male Speech Database is also available via ELRA under reference ELRA-S0310.

    ELRA-B0013 Spanish Festival voice female
    This database contains a unit-selection voice (clunits technology) for use in the Festival Synthesis System (tested on version 2.0.95:beta, April 2010). The voice was built using a subset of speech derived from the TC-STAR Spanish Baseline Female Speech Database: mid-distance microphone, 4h25m, 16 kHz, 16 bits. The database was created within the scope of the METANET4U project funded by the European Commission.

    ELRA-B0014 TC-STAR Spanish Baseline Male Speech Database
    This database contains the recordings of one male Spanish speaker, made in a noise-reduced room simultaneously through a close-talk microphone, a mid-distance microphone and a laryngograph. It consists of the recordings and annotations of read text material, approximately 10 hours of speech, for baseline applications (Text-to-Speech systems).

    The TC-STAR Spanish Baseline Female Speech Database is also available via ELRA under reference ELRA-S0309.

    ELRA-B0014 Spanish Festival voice male
    This database contains a unit-selection voice (clunits technology) for use in the Festival Synthesis System (tested on version 2.0.95:beta, April 2010). The voice was built using a subset of speech derived from the TC-STAR Spanish Baseline Male Speech Database: mid-distance microphone, 2h26m, 16 kHz, 16 bits. The database was created within the scope of the METANET4U project funded by the European Commission.

    ELRA-B0015 MEDIA Evaluation Package
    The MEDIA Evaluation Package was produced within the French national project MEDIA (Automatic evaluation of man-machine dialogue systems), as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT). The MEDIA project made it possible to carry out a campaign for the evaluation of man-machine dialogue systems for French.
    This package includes the material that was used for the MEDIA evaluation campaign. It includes resources, protocols, scoring tools, results of the campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.
    The campaign comprises two actions: an evaluation that takes the dialogue context into account and an evaluation that does not.

    ELRA-B0015 PortMedia French and Italian corpus
    This corpus contains 700 transcribed dialogues from about 140 French speakers and 604 transcribed dialogues from about 150 Italian speakers (several dialogues per speaker). The corpus was built using a ‘Wizard of Oz’ (WoZ) system, in which a natural-language man-machine dialogue is simulated. The scenario is set in the domain of tourist information and reservation. Manual transcriptions and semantic annotations of the corpus are provided with the corresponding wave files.

    ELRA-B0016 Macedonian Morphological Lexicon (MACPLEX)
    MACPLEX comprises two dictionaries: a dictionary of lemmas (89,026 entries) and a dictionary of word forms (1,480,201 entries). Morphological information (PoS; gender, case, definiteness and number for nouns; tense, person, etc. for verbs) is available for each entry. The 89,026 lemmas break down into 40,671 nouns, 12,235 adjectives, 20,874 verbs, 14,317 adverbs, 153 interjections, 64 conjunctions, 65 prepositions, 132 numerals, 66 pronouns, 63 particles and 386 residuals. The lexicon is available in Unicode.

    ELRA-B0016 Macedonian lexicon of toponyms (MACPLEX_TOPO)
    The MACPLEX_TOPO lexicon contains 1,398 lemmas and 40,246 word forms (787 places, 428 regions, 68 waters, 47 peoples, 45 mountains, 27 lands). Words derived from toponyms (names of inhabitants and related adjectives) are also included. The lexicon is available in Unicode.

    ELRA-B0016 Macedonian lexicon of proper nouns (MACPLEX_PROPERS)
    MACPLEX_PROPERS contains 15,422 lemmas and 157,321 word forms (2,516 first names, 12,322 last names, 148 other human names, 426 companies and 22 brands). Adjectives related to proper nouns are derived. The lexicon is available in Unicode.

    ELRA-B0016 Macedonian lexicon of derived adjectives (MACPLEX_ADJDERV)
    This lexicon contains 12,073 lemmas (10,233 with suffix –чки, 1,840 with suffix –билен) and 281,488 word forms. The lexicon is available in Unicode.

    ELRA-B0016 Macedonian lexicon of participles (MACPLEX_ADJPARTIC)
    This lexicon contains 19,552 lemmas and 1,251,328 word forms. The lemmas are derived from verbs. The lexicon is available in Unicode.

    ELRA-B0016 Macedonian lexicon of compound words (MACPLEX_COMP)
    This lexicon contains 784 lemmas (576 nouns, 25 adjectives, 73 adverbs, 66 interjections, 17 numerals, 15 pronouns and 12 residuals) and 6,289 word forms. The lexicon is available in Unicode.

    ELRA-E0002 TC-STAR 2005 Evaluation Package - ASR English
    This package includes the material used for the TC-STAR 2005 Automatic Speech Recognition (ASR) first evaluation campaign for the English language.
    It includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.

    ELRA-E0003 TC-STAR 2005 Evaluation Package - ASR Spanish
    This package includes the material used for the TC-STAR 2005 Automatic Speech Recognition (ASR) first evaluation campaign for the Spanish language.
    It includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.

    ELRA-E0004 TC-STAR 2005 Evaluation Package - ASR Mandarin Chinese
    This package includes the material used for the TC-STAR 2005 Automatic Speech Recognition (ASR) first evaluation campaign for the Mandarin Chinese language.
    It includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.

    ELRA-E0005 TC-STAR 2005 Evaluation Package - SLT English-to-Spanish
    This package includes the material used for the TC-STAR 2005 Spoken Language Translation (SLT) first evaluation campaign for English-to-Spanish translation.
    It includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.

    ELRA-E0006 TC-STAR 2005 Evaluation Package - SLT Spanish-to-English
    This package includes the material used for the TC-STAR 2005 Spoken Language Translation (SLT) first evaluation campaign for Spanish-to-English translation.
    It includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.

    ELRA-E0007 TC-STAR 2005 Evaluation Package - SLT Chinese-to-English
    This package includes the material used for the TC-STAR 2005 Spoken Language Translation (SLT) first evaluation campaign for Chinese-to-English translation.
    It includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.

    ELRA-E0008The CLEF Test Suite for the CLEF 2000-2003 Campaigns – Evaluation Package
    The CLEF Test Suite contains the data used for the main tracks of the CLEF campaigns carried out from 2000 to 2003: Multilingual text retrieval, Bilingual text retrieval, Monolingual text retrieval, and Domain-specific text retrieval. It contains multilingual corpora in English, French, German, Italian, Spanish, Dutch, Swedish, Finnish, Russian, and Portuguese.

    ELRA-E0009CHIL 2004 Evaluation Package
    The CHIL Seminars are scientific presentations given by students, faculty members or invited speakers in the field of multimodal interfaces and speech processing. The language is European English spoken by non-native speakers. The recordings comprise the following: videos of the speaker and the audience from 4 fixed cameras, frontal close-ups of the speaker, and close-talking and far-field microphone data of the speaker’s voice and background sounds.

    The database consists of:
    1) Audio and Video Recordings of 10 seminars
    2) Video annotations made on 1 out of every 10 pictures in sequence, for each of the 4 cameras.
    3) Transcriptions using both TRS and STMUID formats.

    ELRA-E0010CHIL 2005 Evaluation Package
    The CHIL Seminars are scientific presentations given by students, faculty members or invited speakers in the field of multimodal interfaces and speech processing. The language is European English spoken by non-native speakers. The recordings comprise the following: videos of the speaker and the audience from 4 fixed cameras, frontal close-ups of the speaker, and close-talking and far-field microphone data of the speaker’s voice and background sounds.

    The database consists of:
    1) Contents of the CHIL 2004 Evaluation Package (see catalogue reference ELRA-E0009 for description).
    2) Audio and Video Recordings: 5 seminars recorded in November 2004.
    3) Stereo Video Recordings of 10 subjects that move in the camera’s field of view while performing pointing gestures.
    4) Video annotations.
    5) Transcriptions.

    ELRA-E0011TC-STAR 2006 Evaluation Package - ASR English
    This package includes the material used for the TC-STAR 2006 Automatic Speech Recognition (ASR) second evaluation campaign for the English language.
    It includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.

    ELRA-E0012TC-STAR 2006 Evaluation Package - ASR Spanish - CORTES
    This package includes the material used for the TC-STAR 2006 Automatic Speech Recognition (ASR) second evaluation campaign for the Spanish language within the CORTES task.
    It includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.

    ELRA-E0012-01TC-STAR 2006 Evaluation Package - ASR Spanish - CORTES
    This package includes the material used for the TC-STAR 2006 Automatic Speech Recognition (ASR) second evaluation campaign for the Spanish language within the CORTES task.
    It includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.

    ELRA-E0012-02TC-STAR 2006 Evaluation Package - ASR Spanish - EPPS
    This package includes the material used for the TC-STAR 2006 Automatic Speech Recognition (ASR) second evaluation campaign for the Spanish language within the EPPS task.
    It includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.

    ELRA-E0013TC-STAR 2006 Evaluation Package - ASR Mandarin Chinese
    This package includes the material used for the TC-STAR 2006 Automatic Speech Recognition (ASR) second evaluation campaign for the Mandarin Chinese language.
    It includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.

    ELRA-E0014TC-STAR 2006 Evaluation Package - SLT English-to-Spanish
    This package includes the material used for the TC-STAR 2006 Spoken Language Translation (SLT) second evaluation campaign for English-to-Spanish translation. It includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.

    ELRA-E0015TC-STAR 2006 Evaluation Package - SLT Spanish-to-English - CORTES
    This package includes the material used for the TC-STAR 2006 Spoken Language Translation (SLT) second evaluation campaign for Spanish-to-English translation within the CORTES task.
    It includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.

    ELRA-E0015-01TC-STAR 2006 Evaluation Package - SLT Spanish-to-English - CORTES
    This package includes the material used for the TC-STAR 2006 Spoken Language Translation (SLT) second evaluation campaign for Spanish-to-English translation within the CORTES task.
    It includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.

    ELRA-E0015-02TC-STAR 2006 Evaluation Package - SLT Spanish-to-English - EPPS
    This package includes the material used for the TC-STAR 2006 Spoken Language Translation (SLT) second evaluation campaign for Spanish-to-English translation within the EPPS task.
    It includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.

    ELRA-E0016TC-STAR 2006 Evaluation Package - SLT Chinese-to-English
    This package includes the material used for the TC-STAR 2006 Spoken Language Translation (SLT) second evaluation campaign for Chinese-to-English translation.
    It includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.

    ELRA-E0017CHIL 2006 Evaluation Package
    The CHIL Seminars are scientific presentations given by students, faculty members or invited speakers in the field of multimodal interfaces and speech processing. The language is European English spoken by non-native speakers. The recordings comprise the following: videos of the speaker and the audience from 4 fixed cameras, frontal close-ups of the speaker, and close-talking and far-field microphone data of the speaker’s voice and background sounds.

    The CHIL 2006 Evaluation Package consists of:
    1) A set of audiovisual recordings of seminars (called non-interactive seminars) and of highly interactive small working-group seminars (called interactive seminars). The recordings were made between 2004 and 2005 according to the “CHIL Room Setup” specification.
    2) Video annotations.
    3) Orthographic transcriptions.

    ELRA-E0018ARCADE II Evaluation Package
    The ARCADE II Evaluation Package was produced within the French national project ARCADE II (Evaluation of parallel text alignment systems), as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT). The ARCADE II project made it possible to carry out an evaluation campaign in the field of multilingual alignment.
    This package includes the material that was used for the ARCADE II evaluation campaign. It includes resources, protocols, scoring tools, results of the campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system.
    The campaign is distributed over two actions: sentence alignment and translation of named entities.

    ELRA-E0019CESART Evaluation Package
    The CESART Evaluation Package was produced within the French national project CESART (Evaluation of terminology extraction tools), as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT). The CESART project made it possible to carry out a campaign for the evaluation of terminological resource acquisition tools.
    This package includes the material that was used for the CESART evaluation campaign. It includes resources, protocols, scoring tools, results of the campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system.
    The campaign is distributed over two actions: term extraction and relation extraction.

    ELRA-E0020CESTA Evaluation Package
    The CESTA Evaluation Package was produced within the French national project CESTA (Evaluation of MT systems), as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT). The CESTA project made it possible to carry out a campaign for the evaluation of machine translation technologies.
    This package includes the material that was used for the CESTA evaluation campaign. It includes resources, protocols, scoring tools, results of the campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system.
    The campaign is distributed over two actions: evaluation on an unrestricted vocabulary and evaluation on a specialised domain (evaluation after terminology enrichment).

    ELRA-E0021ESTER Evaluation Package
    The ESTER Evaluation Package was produced within the French national project ESTER (Evaluation of Broadcast News enriched transcription systems), as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT). The ESTER project made it possible to carry out a campaign for the evaluation of Broadcast News enriched transcription systems for French.
    This package includes the material that was used for the ESTER evaluation campaign. It includes resources, protocols, scoring tools, results of the campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.
    The campaign is distributed over three actions: orthographic transcription, segmentation and information extraction (named entity tracking).
    For research or commercial use, please refer to ELRA-S0241 ESTER Corpus.

    ELRA-E0022EQueR Evaluation Package
    The EQueR Evaluation Package was produced within the French national project EQueR (Evaluation campaign for Question-Answering systems), as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT). The EQueR project made it possible to carry out a campaign for the evaluation of Question-Answering systems in French.
    This package includes the material that was used for the EQueR evaluation campaign. It includes resources, protocols, scoring tools, results of the campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.
    The campaign is distributed over two actions: one generic task and one specialised task (medical domain).

    ELRA-E0023EvaSy Evaluation Package
    The EvaSy Evaluation Package was produced within the French national project EvaSy (Evaluation of speech synthesis systems), as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT). The EvaSy project made it possible to carry out a campaign for the evaluation of speech synthesis systems using French text data.
    This package includes the material that was used for the EvaSy evaluation campaign. It includes resources, protocols, scoring tools, results of the campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.
    The campaign is distributed over three actions: evaluation of grapheme-to-phoneme conversion, evaluation of prosody, and global evaluation of the quality of speech synthesis systems.

    ELRA-E0024MEDIA Evaluation Package
    The MEDIA Evaluation Package was produced within the French national project MEDIA (Automatic evaluation of man-machine dialogue systems), as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT). The MEDIA project made it possible to carry out a campaign for the evaluation of man-machine dialogue systems for French.
    This package includes the material that was used for the MEDIA evaluation campaign. It includes resources, protocols, scoring tools, results of the campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.
    The campaign is distributed over two actions: an evaluation taking into account the dialogue context and an evaluation not taking into account the dialogue context.

    ELRA-E0025TC-STAR 2007 Evaluation Package - ASR English
    This package includes the material used for the TC-STAR 2007 Automatic Speech Recognition (ASR) third evaluation campaign for the English language.
    It includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.

    ELRA-E0026-01TC-STAR 2007 Evaluation Package - ASR Spanish - CORTES
    This package includes the material used for the TC-STAR 2007 Automatic Speech Recognition (ASR) third evaluation campaign for the Spanish language within the CORTES task.
    It includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.

    ELRA-E0026-02TC-STAR 2007 Evaluation Package - ASR Spanish - EPPS
    This package includes the material used for the TC-STAR 2007 Automatic Speech Recognition (ASR) third evaluation campaign for the Spanish language within the EPPS task.
    It includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.

    ELRA-E0027TC-STAR 2007 Evaluation Package - ASR Mandarin Chinese
    This package includes the material used for the TC-STAR 2007 Automatic Speech Recognition (ASR) third evaluation campaign for the Mandarin Chinese language.
    It includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.

    ELRA-E0028TC-STAR 2007 Evaluation Package - SLT English-to-Spanish
    This package includes the material used for the TC-STAR 2007 Spoken Language Translation (SLT) third evaluation campaign for English-to-Spanish translation. It includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.

    ELRA-E0029-01TC-STAR 2007 Evaluation Package - SLT Spanish-to-English - CORTES
    This package includes the material used for the TC-STAR 2007 Spoken Language Translation (SLT) third evaluation campaign for Spanish-to-English translation within the CORTES task.
    It includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.

    ELRA-E0029-02TC-STAR 2007 Evaluation Package - SLT Spanish-to-English - EPPS
    This package includes the material used for the TC-STAR 2007 Spoken Language Translation (SLT) third evaluation campaign for Spanish-to-English translation within the EPPS task.
    It includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.

    ELRA-E0030TC-STAR 2007 Evaluation Package - SLT Chinese-to-English
    This package includes the material used for the TC-STAR 2007 Spoken Language Translation (SLT) third evaluation campaign for Chinese-to-English translation.
    It includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.

    ELRA-E0031TC-STAR 2006 Evaluation Package – End-to-End
    This package includes the material used for the TC-STAR 2006 evaluation campaign within the end-to-end task.
    It includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.

    ELRA-E0032TC-STAR 2007 Evaluation Package – End-to-End
    This package includes the material used for the TC-STAR 2007 evaluation campaign within the end-to-end task.
    It includes resources, protocols, scoring tools, results of the official campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.

    ELRA-E0033CHIL 2007 Evaluation Package
    The CHIL Seminars are scientific presentations given by students, faculty members or invited speakers in the field of multimodal interfaces and speech processing. The language is European English spoken by non-native speakers. The recordings comprise the following: videos of the speaker and the audience from 4 fixed cameras, frontal close-ups of the speaker, and close-talking and far-field microphone data of the speaker’s voice and background sounds.

    The CHIL 2007 Evaluation Package consists of:
    1) A set of audiovisual recordings of interactive seminars. The recordings were made between June and September 2006 according to the “CHIL Room Setup” specification.
    2) Video annotations.
    3) Orthographic transcriptions.

    ELRA-E0034EASy Evaluation Package
    The EASy Evaluation Package was produced within the French national project EASy (Evaluation of syntactic parsers of French), as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT). The project made it possible to carry out a campaign for the evaluation of syntactic parsers of French.
    This package includes the material that was used for the EASy evaluation campaign. It includes resources, protocols, scoring tools, results of the campaign, etc., that were used or produced during the campaign. The aim of these evaluation packages is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.
    The campaign is distributed over two actions: evaluation of constituent annotations and evaluation of dependency relation annotations.

    ELRA-E0035DEFT'08 Evaluation Package
    DEFT (DEfi Fouille de Texte – Text Mining Challenge) organizes evaluation campaigns in the field of text mining. The DEFT 2008 edition focuses on the classification of texts by topic and genre. The DEFT’08 Evaluation Package makes it possible to compare two corpora of different genres (a corpus of newspaper articles extracted from Le Monde and a corpus of encyclopaedic articles extracted from the free online encyclopaedia Wikipedia) on the basis of the same set of pre-defined categories.

    ELRA-E0036CLEF AdHoc-News Test Suites (2004-2008) – Evaluation Package
    The CLEF AdHoc-News Test Suites (2004-2008) contain the data used for the main AdHoc track of the CLEF campaigns carried out from 2004 to 2008. This track tested the performance of monolingual, bilingual and multilingual Information Retrieval (IR) systems on multilingual news collections.

    ELRA-E0037CLEF Domain Specific Test Suites (2004-2008) – Evaluation Package
    The CLEF Domain Specific Test Suites (2004-2008) contain the data used for the Domain Specific track of the CLEF campaigns carried out from 2004 to 2008. This track tested the performance of monolingual, bilingual and multilingual Information Retrieval (IR) systems on multilingual collections of scientific articles.

    ELRA-E0038CLEF Question Answering Test Suites (2003-2008) – Evaluation Package
    The CLEF Question Answering Test Suites (2003-2008) contain the data used for the Question Answering (QA) track of the CLEF campaigns carried out from 2003 to 2008. This track tested the performance of monolingual, bilingual and multilingual Question Answering systems on multilingual collections of news documents.

    ELRA-E0039CLEF QAST (2007-2009) – Evaluation Package
    The CLEF QAST (2007-2009) contains the data used for the Question Answering on Speech Transcripts tracks of the CLEF campaigns carried out from 2007 to 2009. These tracks tested the performance of monolingual Question Answering systems on collections of audio transcriptions.

    ELRA-E0040MEDAR Evaluation Package
    The MEDAR Evaluation Package was produced within the project MEDAR (MEDiterranean ARabic language and speech technology), supported by the European Commission's ICT programme. It aims to enable the evaluation of SLT/MT (Machine Translation) systems on English-to-Arabic translation tasks.

    ELRA-E0041CHIL 2007+ Evaluation Package
    The CHIL Seminars are scientific presentations given by students, faculty members or invited speakers in the field of multimodal interfaces and speech processing. The language is European English spoken by non-native speakers. The recordings comprise the following: videos of the speaker and the audience from 4 fixed cameras, frontal close-ups of the speaker, and close-talking and far-field microphone data of the speaker’s voice and background sounds.

    The CHIL 2007+ Evaluation Package includes: 1) CHIL 2007 Evaluation Package (see ELRA-E0033) and 2) additional annotations which have been created within the scope of the Metanet4u Project (ICT PSP No 270893), sponsored by the European Commission.

    ELRA-E0042CLEFeHealth 2013 Task 3 Evaluation Package
    The CLEFeHealth 2013 Task 3 Evaluation Package contains data used for the User-centred health information retrieval Shared task at the CLEFeHealth Lab conducted in 2013. Task 3 aimed at evaluating information retrieval to address questions patients may have when reading clinical reports.

    ELRA-E0043CLEFeHealth 2014 Task 3 Evaluation Package
    The CLEFeHealth 2014 Task 3 Evaluation Package contains data used for the User-centred health information retrieval Shared task at the CLEFeHealth Lab conducted in 2014. Task 3 aimed at evaluating information retrieval to address questions patients may have when reading clinical reports.

    ELRA-E0044REPERE Evaluation Package
    The REPERE Evaluation Package contains the visual annotation of 60 hours of French news TV shows, for the purpose of person recognition within TV programs. This annotation concerns both persons and written information appearing on screen. Provided data consists of:
    - video files with indexes and with manual transcriptions in XGTF format (Viper),
    - audio files compressed in WAV format with transcriptions in TRS format (Transcriber).

    ELRA-E0045MAURDOR Evaluation Package
    The MAURDOR project consists of evaluating systems for the automatic processing of written documents. The collected documents are scanned documents (printed, typewritten or handwritten). This package contains 8,129 documents. Once collected, the documents were manually annotated. This package contains the material provided to the evaluation campaign participants:
    - Consistent development and test data corresponding to the application concerned;
    - Tools for the automatic measurement of system performances;
    - A common assessment protocol applicable to each processing stage, along with a complete automatic processing chain for written documents.
    The documents are provided in TIFF format and the annotations are provided in XML format.

    ELRA-E0046ETAPE Evaluation Package
    The ETAPE Evaluation Package consists of ca. 30 hours of radio and TV data, selected to include mostly unplanned speech and a reasonable proportion of multi-speaker data. All data were carefully transcribed, including named entity annotation.
    This package includes the material that was used for the ETAPE evaluation campaign. It includes resources, scoring tools, results of the campaign, etc., that were used or produced during the campaign. The aim of this evaluation package is to enable external players to evaluate their own system and compare their results with those obtained during the campaign itself.

    ELRA-L0000DST German lexicon

    ELRA-L0001DICO-MORPH_Lemme
    Lexicon for morphological works of over 400,000 French entries divided into 55,000 nouns, 8,000 verbs, 16,850 adjectives, 2,000 adverbs.

    ELRA-L0002DICO-MORPH_Collocation
    Collocation lexicon. Up to 35,000 entries in French. An addition to ELRA-L0001.

    ELRA-L0003DICO-SYNT
    90,000 French inflectional forms divided into 25,000 nouns, 8,000 verbs that generate 25,000 model verbs, 1,000 adjectives, 1,500 adverbs. Morphosyntactical information in addition to L0001.

    ELRA-L0004Dutch Lexicon
    64,000 entries from general vocabulary divided into 50,000 nouns, 7,000 verbs, 6,000 adjectives, 1,000 adverbs. Morphological, syntactical & semantic information.

    ELRA-L0005French Lexicon
    50,000 entries from general vocabulary divided into 36,000 nouns, 6,000 verbs, 7,000 adjectives, 1,000 adverbs. Morphological, syntactical & semantic information.

    ELRA-L0006ILC Italian Morphological Lexicon
    Set of lemmas/lexical entries (about 60,000) with the corresponding inflected word-forms, and a morphological engine for morphological analysis and generation.

    ELRA-L0007LexIn 2:e Swedish Lexicon
    Lexicon with 28,000 headwords and 21,000 senses.

    ELRA-L0008Monolingual Danish Lexicon
    25,000 entries. Each lexeme contains the word class, inflection, semantic features, syntactical frames (for verbs), and complement (for nouns & adj.).

    ELRA-L0009Monolingual Portuguese Lexicon
    60,000 entries with morphological information, plus a software engine for generating inflected forms.

    ELRA-L0010MULTEXT Lexicons
    This CD-ROM contains a set of lexicons developed in the MULTEXT project financed by the European Commission (LRE 62-050). The set contains the following languages:
    English: 66,214 Word forms
    French: 306,795 Word forms
    German: 233,861 Word forms
    Italian: 145,530 Word forms
    Spanish: 510,710 Word forms

    ELRA-L0012Spanish gilcUB-M Dictionary
    60,000 lemmas of general vocabulary with morphosyntactical information (9,700 verbs, 35,500 nouns, 14,300 adjectives & 120 adverbs) plus 10,000 full-form adverbs.

    ELRA-L0013-01THAMUS Generic Italian Dictionary - canonical forms
    A Generic monolingual Italian dictionary of 87,000 canonical forms. Multi-word terms contain morphological coding for the headword.

    ELRA-L0013-02THAMUS. Generic Italian Dictionary - inflected forms
    A Generic monolingual Italian dictionary of 612,000 inflected forms. Multi-word terms contain morphological coding for the headword.

    ELRA-L0013-03THAMUS. Generic Italian Dictionary - canonical forms - technical domain
    A Generic monolingual Italian dictionary of 48,000 canonical forms (Technical). Multi-word terms contain morphological coding for the headword.

    ELRA-L0013-04THAMUS. Generic Italian Dictionary - inflected forms - technical domain
    A Generic monolingual Italian dictionary of 96,000 inflected forms (Technical). Multi-word terms contain morphological coding for the headword.

    ELRA-L0014Adverbial Equivalence Dictionary
    1,200 entries of simplified equivalents for French fixed expressions (e.g. “laid comme un crapaud” has the equivalent “très laid”).

    ELRA-L0015Nominalisation Dictionary
    2,300 entries consisting of nouns derived from French verbs.

    ELRA-L0016Tri-, quadri-, pentagrams dictionaries
    The dictionaries consist of a list of 5,487 sequences of 3, 4 or 5 characters that occur consecutively in French words. In particular, they make it possible to locate misspelt sequences.

    ELRA-L0017"N de N" Dictionary
    Generic dictionary. 21,000 entries of French uninflected noun phrases classified in 1,000 human entries, 4,200 concrete entries, 6,000 abstract entries.

    ELRA-L0018German lexicon
    466,300 entries with a list of inflected words (97,000 nouns, 236,200 verbs, 130,500 adjectives/adverbs, 1,700 grammatical words, 40 punctuation marks, 400 prefixes, 370 suffixes).

    ELRA-L0019English lexicon
    160,000 entries with a list of inflected words derived from 93,500 nouns, 35,800 verbs, 46,600 adjectives, 8,865 grammatical words.

    ELRA-L0020-01DST Dictionary - String Dictionary
    DST:  550,000 inflected forms in French (43,000 common nouns, 10,938 proper nouns, 19,500 adjectives, 8,150 nouns-adjectives, 6,800 verbs, 6,200 compound nouns, etc.). Syntactical, semantic, lexicological information.
    The DST is distributed in different sub-sets:
    L0020-01 String dictionary
    L0020-02 Part of speech (optional)
    L0020-03 Gender, number, conjugation (optional)
    L0020-04 Lemma (optional)
    L0020-05 Semantical information (optional)
    L0020-06 Syntactical information (optional)
    L0020-07 Prep/adv. phrases (optional)
    L0020-08 Compound nouns (optional)
    L0020-09 The whole dictionary

    ELRA-L0020-02DST Dictionary - Part of Speech (optional)
    DST:  550,000 inflected forms in French (43,000 common nouns, 10,938 proper nouns, 19,500 adjectives, 8,150 nouns-adjectives, 6,800 verbs, 6,200 compound nouns, etc.). Syntactical, semantic, lexicological information.
    The DST is distributed in different sub-sets:
    L0020-01 String dictionary
    L0020-02 Part of speech (optional)
    L0020-03 Gender, number, conjugation (optional)
    L0020-04 Lemma (optional)
    L0020-05 Semantical information (optional)
    L0020-06 Syntactical information (optional)
    L0020-07 Prep/adv. phrases (optional)
    L0020-08 Compound nouns (optional)
    L0020-09 The whole dictionary

    ELRA-L0020-03DST Dictionary - Gender, number, conjugation (optional)
    DST:  550,000 inflected forms in French (43,000 common nouns, 10,938 proper nouns, 19,500 adjectives, 8,150 nouns-adjectives, 6,800 verbs, 6,200 compound nouns, etc.). Syntactical, semantic, lexicological information.
    The DST is distributed in different sub-sets:
    L0020-01 String dictionary
    L0020-02 Part of speech (optional)
    L0020-03 Gender, number, conjugation (optional)
    L0020-04 Lemma (optional)
    L0020-05 Semantical information (optional)
    L0020-06 Syntactical information (optional)
    L0020-07 Prep/adv. phrases (optional)
    L0020-08 Compound nouns (optional)
    L0020-09 The whole dictionary

    ELRA-L0020-04DST Dictionary - Lemma (optional)
    DST:  550,000 inflected forms in French (43,000 common nouns, 10,938 proper nouns, 19,500 adjectives, 8,150 nouns-adjectives, 6,800 verbs, 6,200 compound nouns, etc.). Syntactical, semantic, lexicological information.
    The DST is distributed in different sub-sets:
    L0020-01 String dictionary
    L0020-02 Part of speech (optional)
    L0020-03 Gender, number, conjugation (optional)
    L0020-04 Lemma (optional)
    L0020-05 Semantical information (optional)
    L0020-06 Syntactical information (optional)
    L0020-07 Prep/adv. phrases (optional)
    L0020-08 Compound nouns (optional)
    L0020-09 The whole dictionary

    ELRA-L0020-05DST Dictionary - Semantical information (optional)
    DST:  550,000 inflected forms in French (43,000 common nouns, 10,938 proper nouns, 19,500 adjectives, 8,150 nouns-adjectives, 6,800 verbs, 6,200 compound nouns, etc.). Syntactical, semantic, lexicological information.
    The DST is distributed in different sub-sets:
    L0020-01 String dictionary
    L0020-02 Part of speech (optional)
    L0020-03 Gender, number, conjugation (optional)
    L0020-04 Lemma (optional)
    L0020-05 Semantical information (optional)
    L0020-06 Syntactical information (optional)
    L0020-07 Prep/adv. phrases (optional)
    L0020-08 Compound nouns (optional)
    L0020-09 The whole dictionary

    ELRA-L0020-06DST Dictionary - Syntactical information (optional)
    DST:  550,000 inflected forms in French (43,000 common nouns, 10,938 proper nouns, 19,500 adjectives, 8,150 nouns-adjectives, 6,800 verbs, 6,200 compound nouns, etc.). Syntactical, semantic, lexicological information.
    The DST is distributed in different sub-sets:
    L0020-01 String dictionary
    L0020-02 Part of speech (optional)
    L0020-03 Gender, number, conjugation (optional)
    L0020-04 Lemma (optional)
    L0020-05 Semantical information (optional)
    L0020-06 Syntactical information (optional)
    L0020-07 Prep/adv. phrases (optional)
    L0020-08 Compound nouns (optional)
    L0020-09 The whole dictionary

    ELRA-L0020-07DST Dictionary - Prep./Adv. phrases (optional)
    DST:  550,000 inflected forms in French (43,000 common nouns, 10,938 proper nouns, 19,500 adjectives, 8,150 nouns-adjectives, 6,800 verbs, 6,200 compound nouns, etc.). Syntactical, semantic, lexicological information.
    The DST is distributed in different sub-sets:
    L0020-01 String dictionary
    L0020-02 Part of speech (optional)
    L0020-03 Gender, number, conjugation (optional)
    L0020-04 Lemma (optional)
    L0020-05 Semantical information (optional)
    L0020-06 Syntactical information (optional)
    L0020-07 Prep/adv. phrases (optional)
    L0020-08 Compound nouns (optional)
    L0020-09 The whole dictionary

    ELRA-L0020-08DST Dictionary - Compound nouns (optional)
    DST:  550,000 inflected forms in French (43,000 common nouns, 10,938 proper nouns, 19,500 adjectives, 8,150 nouns-adjectives, 6,800 verbs, 6,200 compound nouns, etc.). Syntactical, semantic, lexicological information.
    The DST is distributed in different sub-sets:
    L0020-01 String dictionary
    L0020-02 Part of speech (optional)
    L0020-03 Gender, number, conjugation (optional)
    L0020-04 Lemma (optional)
    L0020-05 Semantical information (optional)
    L0020-06 Syntactical information (optional)
    L0020-07 Prep/adv. phrases (optional)
    L0020-08 Compound nouns (optional)
    L0020-09 The whole dictionary

    ELRA-L0020-09DST Dictionary - The whole dictionary
    DST:  550,000 inflected forms in French (43,000 common nouns, 10,938 proper nouns, 19,500 adjectives, 8,150 nouns-adjectives, 6,800 verbs, 6,200 compound nouns, etc.). Syntactical, semantic, lexicological information.
    The DST is distributed in different sub-sets:
    L0020-01 String dictionary
    L0020-02 Part of speech (optional)
    L0020-03 Gender, number, conjugation (optional)
    L0020-04 Lemma (optional)
    L0020-05 Semantical information (optional)
    L0020-06 Syntactical information (optional)
    L0020-07 Prep/adv. phrases (optional)
    L0020-08 Compound nouns (optional)
    L0020-09 The whole dictionary

    ELRA-L0021Dictionary of French verbs (SINEQUA - Jean Dubois)
    25,610 verbs with usage domains, level of language, conjugation, auxiliary, verbal adjectives in -able, -ant or -e, encoded syntactical constructions, sample phrases, synonyms, operators enabling semantic-syntactic classification, encoding of derived forms in -age, -ment, -tion, -oir, -ure, deverbal nouns, base words from which verbs can be derived, a scale of usage ranging from 1 to 6.

    ELRA-L0022Dictionary of words (SINEQUA - Jean Dubois)
    126,844 French words with usage domains, grammatical category (gender, number, uncountable, collective, adjectival, nominal, verbal, adverbial derived forms).

    ELRA-L0023Dictionary of affixes (SINEQUA - Jean Dubois)
    4,286 suffixes and prefixes, plus information on their verbal, nominal or adjectival bases or on the verbal basis of Greco-Latin items.

    ELRA-L0024Dictionary of verb phrases (SINEQUA - Jean Dubois)
    3,480 entries based on the model of the dictionary of French verbs (ELRA-L0021).

    ELRA-L0025Dictionary of invariable forms and phrases (SINEQUA - Jean Dubois)
    4,783 entries based on the model of the dictionary of words (ELRA-L0022).

    ELRA-L0026Dictionary of exclamatory stereotyped phrases (SINEQUA - Jean Dubois)
    1,901 entries based on the model of the dictionary of invariable forms and phrases (ELRA-L0025).

    ELRA-L0027Dictionary of French local authorities (SINEQUA - Jean Dubois)
    38,965 entries in lower case with accents, checked against the Michelin Guide, without localities.

    ELRA-L0028Dictionary of noun phrases and plural-only words (SINEQUA - Jean Dubois)
    2,138 compound names and 1,397 entries of plural-only words.

    ELRA-L0029-01CELEX Dutch lexical database - Complete set
    Dutch lexical database containing lemmas (124136 entries), wordforms (381292 entries), abbreviations (1622 entries), syllables (31358 entries).
    The database is divided into different subsets:
    L0029-01 Complete set of data;
    L0029-02 Subset Orthography;
    L0029-03 Subset Phonology;
    L0029-04 Subset Morphology Infl.;
    L0029-05 Subset Morphology Der.;
    L0029-06 Subset Syntax;
    L0029-07 Subset Frequency.

    ELRA-L0029-02CELEX Dutch lexical database - Orthography Subset
    Dutch lexical database containing lemmas (124136 entries), wordforms (381292 entries), abbreviations (1622 entries), syllables (31358 entries).
    The database is divided into different subsets:
    L0029-01 Complete set of data;
    L0029-02 Subset Orthography;
    L0029-03 Subset Phonology;
    L0029-04 Subset Morphology Infl.;
    L0029-05 Subset Morphology Der.;
    L0029-06 Subset Syntax;
    L0029-07 Subset Frequency.

    ELRA-L0029-03CELEX Dutch lexical database - Phonology Subset
    Dutch lexical database containing lemmas (124136 entries), wordforms (381292 entries), abbreviations (1622 entries), syllables (31358 entries).
    The database is divided into different subsets:
    L0029-01 Complete set of data;
    L0029-02 Subset Orthography;
    L0029-03 Subset Phonology;
    L0029-04 Subset Morphology Infl.;
    L0029-05 Subset Morphology Der.;
    L0029-06 Subset Syntax;
    L0029-07 Subset Frequency.

    ELRA-L0029-04CELEX Dutch lexical database - Inflectional Morphology Subset
    Dutch lexical database containing lemmas (124136 entries), wordforms (381292 entries), abbreviations (1622 entries), syllables (31358 entries).
    The database is divided into different subsets:
    L0029-01 Complete set of data;
    L0029-02 Subset Orthography;
    L0029-03 Subset Phonology;
    L0029-04 Subset Inflectional Morphology;
    L0029-05 Subset Derivational Morphology;
    L0029-06 Subset Syntax;
    L0029-07 Subset Frequency.

    ELRA-L0029-05CELEX Dutch lexical database - Derivational Morphology Subset
    Dutch lexical database containing lemmas (124136 entries), wordforms (381292 entries), abbreviations (1622 entries), syllables (31358 entries).
    The database is divided into different subsets:
    L0029-01 Complete set of data;
    L0029-02 Subset Orthography;
    L0029-03 Subset Phonology;
    L0029-04 Subset Inflectional Morphology;
    L0029-05 Subset Derivational Morphology;
    L0029-06 Subset Syntax;
    L0029-07 Subset Frequency.

    ELRA-L0029-06CELEX Dutch lexical database - Syntax Subset
    Dutch lexical database containing lemmas (124136 entries), wordforms (381292 entries), abbreviations (1622 entries), syllables (31358 entries).
    The database is divided into different subsets:
    L0029-01 Complete set of data;
    L0029-02 Subset Orthography;
    L0029-03 Subset Phonology;
    L0029-04 Subset Morphology Infl.;
    L0029-05 Subset Morphology Der.;
    L0029-06 Subset Syntax;
    L0029-07 Subset Frequency.

    ELRA-L0029-07CELEX Dutch lexical database - Frequency Subset
    Dutch lexical database containing lemmas (124136 entries), wordforms (381292 entries), abbreviations (1622 entries), syllables (31358 entries).
    The database is divided into different subsets:
    L0029-01 Complete set of data;
    L0029-02 Subset Orthography;
    L0029-03 Subset Phonology;
    L0029-04 Subset Morphology Infl.;
    L0029-05 Subset Morphology Der.;
    L0029-06 Subset Syntax;
    L0029-07 Subset Frequency.

    ELRA-L0030Bulgarian Morphological Dictionary
    67,500 entries divided into 242 inflectional types (including proper nouns), morphosyntactic information for each entry, and a morphological engine (MS-DOS and Windows 95/NT) for morphological analysis and generation.

    ELRA-L0031Dutch PAROLE lexicon
    The entry list of the lexicon consists of about 20,200 entries distributed over 13 parts of speech (POS). The entries have been described along the dimensions of morphosyntax and syntax, according to the specifications of the PAROLE project. The lexicon is set up as an SGML file.

    ELRA-L0032PAROLE Greek Lexicon
    The PAROLE Greek lexicon has two layers, morphological and syntactic. It includes the most frequent words found in a 9 million word corpus, coded according to the PAROLE specifications. The morphological layer contains a total of 20,149 morphological units. The syntactic layer contains 25,092 syntactic units.

    ELRA-L0033LusoLEX European Portuguese Lexicon
    LusoLEX:  Multifunctional monolingual lexicon of the European variety of Portuguese, consisting of about 61,000 entries (lemmas) and 1,600 corresponding inflexion paradigms. The set of entries includes compound words and the inflexion paradigms include information regarding enclitics, augmentatives and diminutives. Morphological information is encoded with maximum granularity and conforms to the EAGLES recommendations.

    ELRA-L0034BrasiLEX Brazilian Portuguese lexicon
    BrasiLEX:  Multifunctional monolingual lexicon of the Brazilian variety of Portuguese, consisting of about 65,000 entries (lemmas) and 1,600 corresponding inflexion paradigms. The set of entries includes compound words and the inflexion paradigms include information regarding enclitics and augmentative/diminutive degree. Morphological information is encoded with maximum granularity and conforms to the EAGLES recommendations.

    ELRA-L0035PAROLE Portuguese Lexicon
    The PAROLE Portuguese Lexicon consists of 20,000 entries encoded morphosyntactically and syntactically according to the PAROLE common encoding standards. The data is in SGML format.

    ELRA-L0036Japanese Word Dictionary
    The Japanese Word Dictionary is composed of 260,000 Japanese word records arranged alphabetically according to the Japanese syllabary.

    ELRA-L0037English Word Dictionary
    The English Word Dictionary is composed of 190,000 English word records arranged alphabetically.

    ELRA-L0038Concept Dictionary
    The Concept Dictionary provides 400,000 concepts that are referred to in the Japanese and English Word Dictionaries (ref. ELRA-L0036 and L0037), the Japanese-English and English-Japanese Bilingual Dictionaries (ref. ELRA-M0023 and M0024) as well as in the Japanese and English Co-occurrence Dictionaries (ref. ELRA-L0039 and L0040). The Concept Dictionary is composed of three separate dictionaries:
    - the Headconcept Dictionary gives a description of each concept in words
    - the Concept Classification Dictionary contains a classification of concepts that have a super-sub relation
    - the Concept Description Dictionary provides all other information regarding the relation between concepts.

    ELRA-L0039Japanese Co-occurrence Dictionary
    The Japanese Co-occurrence Dictionary is composed of 900,000 headphrase notations arranged according to the Japanese syllabary. Appendix to the Japanese Co-occurrence Dictionary: The Japanese Corpus.

    ELRA-L0040English Co-occurrence Dictionary
    The English Co-occurrence Dictionary (ref. ELRA-L0040) is composed of 460,000 alphabetically arranged headphrases. Appendix to the English Co-occurrence Dictionary: The English Corpus.

    ELRA-L0041Technical Terms Dictionary (Information processing)
    The Technical Terms Dictionary (Information processing) contains 80,000 technical terms in English and 120,000 technical terms in Japanese from the field of information processing.

    ELRA-L0042PAROLE Spanish Lexicon
    The PAROLE Spanish lexicon follows standard PAROLE architecture. It contains about 22,000 morphological units, of which 12,209 are common nouns, 3,367 verbs, 4,996 adjectives.

    ELRA-L0043PAROLE English lexicon
    The PAROLE English lexicon consists of 22,000 morphological units extracted from the CRL-LKB and COBUILD dictionaries: 12,998 are common nouns, 40 proper nouns, 4,195 verbs, 3,208 adjectives, 606 adverbs, 71 adpositions, 2 articles, 21 conjunctions, 25 determiners and 53 pronouns.

    ELRA-L0044Korean Lexicon
    This monolingual lexicon produced by KAIST Korterm consists of 31,476 compound nouns in Korean.

    ELRA-L0045New Oxford Dictionary of English, 2nd Edition
    NODE:  The NODE contains 170,000 entries covering all varieties of English worldwide. It has been designed for language engineering and to be used in NLP applications, and is available in XML or in SGML. The NODE data set includes morphological information linked to the lemma, phrases and idioms, subject classification, with over 200 key domains, semantic relationships, etc.

    ELRA-L0046NODE+DIMAP
    The first edition of the DIMAP version of NODE is a machine-tractable version of the machine-readable dictionary files in the DIMAP dictionary maintenance programs, adding syntactic and semantic information in the conversion. Apart from mechanisms which will allow research into representational formalisms and explorations of the use of these representations in extending the lexical database and in processing text for information extraction, text summarization, discourse analysis and other LE applications, DIMAP also includes semantic links between entries, thus making NODE+DIMAP a semantic network of the English language.

    ELRA-L0047New Oxford Thesaurus of English
    NOTE:  This thesaurus contains 628,000 alternative words, including 573,000 synonyms, the rest being antonyms, related terms, combining forms, and hyponyms, and is available in SGML. Nearly 38,000 senses are also presented with a corpus-based example.

    ELRA-L0048Oxford Paperback Thesaurus, 2nd edition
    The Oxford Paperback Thesaurus, available in SGML, contains 15,000 headwords, over 300,000 synonyms, and 29,000 different senses presented with corpus-based examples.

    ELRA-L0049SCIPER-FR-EURADIC French Monolingual Dictionary
    SCIPER-FR-EURADIC:  This French monolingual dictionary was enlarged and improved within the French national project EuRADic (European and Arabic Dictionaries and Corpora), as part of the Technolangue programme funded by the French Ministry of Industry. It contains 112,216 lemmas (694,673 inflected forms), with their part of speech and some information related to their inflexion. The data are presented in a table format, where information related to each entry is separated by ";".

    See also ELRA-L0050, ELRA-L0051, ELRA-L0052, ELRA-L0053, ELRA-M0033, ELRA-M0034, ELRA-M0035, ELRA-M0036, ELRA-M0037, ELRA-M0038.

    ELRA-L0050SCIPER-AN-EURADIC English Monolingual Dictionary
    SCIPER-AN-EURADIC:  This English monolingual dictionary was enlarged and improved within the French national project EuRADic (European and Arabic Dictionaries and Corpora), as part of the Technolangue programme funded by the French Ministry of Industry. It contains 171,713 lemmas (365,823 inflected forms), with their part of speech and some information related to their inflexion. The data are presented in a table format, where information related to each entry is separated by ";".

    See also ELRA-L0049, ELRA-L0051, ELRA-L0052, ELRA-L0053, ELRA-M0033, ELRA-M0034, ELRA-M0035, ELRA-M0036, ELRA-M0037, ELRA-M0038.

    ELRA-L0051SCIPER-AL-EURADIC German Monolingual Dictionary
    SCIPER-AL-EURADIC:  This German monolingual dictionary was developed within the French national project EuRADic (European and Arabic Dictionaries and Corpora), as part of the Technolangue programme funded by the French Ministry of Industry. It contains 157,810 lemmas (17,634,834 inflected forms), with their part of speech and some information related to their inflexion. The data are presented in a table format, where information related to each entry is separated by ";".

    See also ELRA-L0049, ELRA-L0050, ELRA-L0052, ELRA-L0053, ELRA-M0033, ELRA-M0034, ELRA-M0035, ELRA-M0036, ELRA-M0037, ELRA-M0038.

    ELRA-L0052SCIPER-ES-EURADIC Spanish Monolingual Dictionary
    SCIPER-ES-EURADIC:  This Spanish monolingual dictionary was enlarged and improved within the French national project EuRADic (European and Arabic Dictionaries and Corpora), as part of the Technolangue programme funded by the French Ministry of Industry. It contains 83,952 lemmas (838,391 inflected forms), with their part of speech and some information related to their inflexion. The data are presented in a table format, where information related to each entry is separated by ";".

    See also ELRA-L0049, ELRA-L0050, ELRA-L0051, ELRA-L0053, ELRA-M0033, ELRA-M0034, ELRA-M0035, ELRA-M0036, ELRA-M0037, ELRA-M0038.

    ELRA-L0053SCIPER-IT-EURADIC Italian Monolingual Dictionary
    SCIPER-IT-EURADIC:  This Italian monolingual dictionary was developed within the French national project EuRADic (European and Arabic Dictionaries and Corpora), as part of the Technolangue programme funded by the French Ministry of Industry. It contains 70,951 lemmas (557,204 inflected forms), with their part of speech and some information related to their inflexion. The data are presented in a table format, where information related to each entry is separated by ";".

    See also ELRA-L0049, ELRA-L0050, ELRA-L0051, ELRA-L0052, ELRA-M0033, ELRA-M0034, ELRA-M0035, ELRA-M0036, ELRA-M0037, ELRA-M0038.

    ELRA-L0054LABEL-LEX (MW)
    LABEL-LEX (MW) is a Portuguese formalized lexicon, containing 88,619 inflected multiword lexical units (formally, sequences of simple words).

    ELRA-L0055LABEL-LEX (SW)
    LABEL-LEX (SW) is a Portuguese formalized lexicon, containing 1,545,481 simple inflected words. Each dictionary entry is associated with a lemma; information about POS and morphological attributes - such as gender, number, person, case (for personal pronouns), tense, mood, diminutives, augmentatives, and superlative - is systematically formalized for each lexical entry.

    ELRA-L0056STO SprogTeknologisk Ordbase (Danish Lexicon for NLP/HLT Applications)
    STO:  The STO Lexicon is the most comprehensive computational lexicon of Danish, comprising approx. 81,530 entry words with morphological, syntactic and semantic information. It is well integrated with European activities in the field of lexicon development, building on experience obtained from the PAROLE and SIMPLE projects. The model and descriptive method of the STO lexicon are kept compatible with the architecture and descriptive language of PAROLE/SIMPLE. A number of refinements, adaptations and language-specific extensions to the basic model are implemented in STO.

    ELRA-L0057Euskararen Datu-Base Lexikala (EDBL) – Lexical Database for Basque
    EDBL (Lexical database for Basque) is made up of about 75,000 entries divided into dictionary entries, verb forms and dependent morphemes, all of them with their respective morphological information. It was first developed as a lexical support for the spelling checker and corrector XUXEN, and later for the morphological analyser MORFEUS and the lemmatiser EUSLEM.

    ELRA-L0058British English Source Lexicon (BESL) version 2.2
    BESL consists of over 230,000 lemmas, over 350,000 word forms, 60,000 proper nouns, 3,000 abbreviations, and 58,000 multi-word compound nouns. Each headword is provided with a full listing of all inflected forms and other morphological variation. Every word form is marked for part of speech (using Penn TreeBank notation). Most single-word forms include a representation of IPA pronunciation. BESL covers both British and American English, and other spelling variants, with cross-references between corresponding forms. BESL is provided in XML.

    ELRA-L0059Offensive Word Filter 1
    This list features 4500 words and expressions for UK and US English usage with a grading system describing vocabulary type and offensive strength for each term, plus collocational information to help identify the terms in context. The list is provided in tab-delimited ASCII.

    ELRA-L0060Offensive Word Filter 2
    This list features 2000 words and expressions, classified into 13 categories, for UK and US English usage with a grading system describing vocabulary type and offensive strength for each term, plus collocational information to help identify the terms in context. The list is provided in an Excel spreadsheet.

    ELRA-L0061The Oxford Spanish Dictionary
    This dictionary consists of 300,000 words and phrases, 500,000 translations, for 24 regional varieties of Spanish. It includes thousands of real, authentic example sentences carefully selected to illustrate the full range of meanings and typical contexts. The dictionary is provided in XML or SGML.

    ELRA-L0062French Source Lexicon
    This source lexicon contains morphological and phonetic data for French. It consists of over 90,000 headwords/lemmas, 400,000 wordforms, 1,000 abbreviations, and 35,000 proper nouns. Each headword lemma is provided with a full listing of its possible syntactic forms and spelling variants, along with information on their relationship to the headword form. In addition, a representation of the IPA pronunciation is given for every form. There is also information on domains in which the headwords are used, e.g. Computing, Engineering, Zoology. The lexicon is provided in SGML.

    ELRA-L0063Spanish Source Lexicon
    This source lexicon contains morphological and phonetic data for Spanish. It consists of over 575,000 wordforms, 1,000 abbreviations, and 25,000 proper nouns. Each headword lemma is provided with a full listing of its possible syntactic forms and spelling variants, along with information on their relationship to the headword form. In addition, a representation of the IPA pronunciation is given for every form. There is also information on domains in which the headwords are used, e.g. Computing, Engineering, Zoology. The lexicon is provided in SGML.

    ELRA-L0064Italian Source Lexicon
    This source lexicon contains morphological and phonetic data for Italian. It consists of over 115,000 headwords/lemmas and 925,000 wordforms. Each headword lemma is provided with a full listing of its possible syntactic forms and spelling variants, along with information on their relationship to the headword form. In addition, a representation of the IPA pronunciation is given for every form. There is also information on domains in which the headwords are used, e.g. Computing, Engineering, Zoology. The lexicon is provided in SGML.

    ELRA-L0065KORLEX – Croatian Lexicon
    The KORLEX - Croatian Lexicon provides a list of 118,252 Croatian lemmas (including 52,450 nouns, 8,985 adverbs, 14,937 verbs and 41,161 adjectives, as well as pronouns, determiners, prepositions/postpositions, conjunctions and numerals), i.e., words in canonical form, annotated with part-of-speech (POS) tag and lexical features.
    The lexicon data is compiled with the objective of covering the majority of text circulating in everyday use, such as in the news, in business, technological documentation, legal documentation, and politics. The resource is a flat textual file in which each textual line contains information about one lemma. The resource is encoded using ISO-8859-2 encoding, and sorted according to the standard Croatian lexicographic order.
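
    As a rough illustration of reading a flat text file encoded in ISO-8859-2 with one lemma per line, here is a minimal Python sketch. The function name, the tab-separated field layout and the two sample lines are hypothetical; the actual field structure of the KORLEX file is not described here.

```python
import os
import tempfile


def read_lemma_lines(path):
    """Read non-empty lines from an ISO-8859-2 encoded text file."""
    with open(path, encoding="iso-8859-2") as handle:
        return [line.rstrip("\n") for line in handle if line.strip()]


# Hypothetical two-line sample, written in ISO-8859-2 for demonstration only.
sample = "čaša\tN\nžena\tN\n"
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(sample.encode("iso-8859-2"))

lines = read_lemma_lines(tmp.name)
os.remove(tmp.name)
```

    Naming the encoding explicitly matters here: reading such a file as UTF-8 would typically fail or garble letters such as č, š and ž.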

    ELRA-L0066KORLEX – Serbian Lexicon
    The KORLEX - Serbian Lexicon provides a list of 108,491 Serbian lemmas (including 52,027 nouns, 9,153 adverbs, 15,522 verbs and 31,052 adjectives, as well as pronouns, determiners, prepositions/postpositions, conjunctions and numerals), i.e., words in canonical form, annotated with part-of-speech (POS) tag and lexical features.
    The lexicon data is compiled with the objective of covering the majority of text circulating in everyday use, such as in the news, in business, technological documentation, legal documentation, and politics. The resource is a flat textual file in which each textual line contains information about one lemma. The resource is encoded using ISO-8859-2 encoding, and sorted according to the standard Serbian lexicographic order.

    ELRA-L0067English lexicon with morphological information
    This English lexicon is made up of 174,000 inflected forms corresponding to 68,000 simple word lemmas (including 31,900 nouns, 11,800 verbs, 19,900 adjectives, 4,100 adverbs, 300 pronouns, articles, prepositions/postpositions and conjunctions). Each line in the resource file shows an inflected form, its part of speech, its related lemma and its morphological information.

    ELRA-L0068French lexicon with morphological information
    This French lexicon is made up of 424,000 inflected forms corresponding to 55,000 simple word lemmas (including 34,400 nouns, 7,300 verbs, 11,700 adjectives, 1,400 adverbs, 200 pronouns, articles, prepositions/postpositions and conjunctions). Each line in the resource file shows an inflected form, its part of speech, its related lemma and its morphological information.

    ELRA-L0069Italian lexicon with morphological information
    This Italian lexicon is made up of 862,500 inflected forms corresponding to 112,000 simple word lemmas (including 66,340 nouns, 12,030 verbs, 28,080 adjectives, 4,890 adverbs, 660 pronouns, articles, prepositions/postpositions and conjunctions). Each line in the resource file shows an inflected form, its part of speech, its related lemma and its morphological information.

    ELRA-L0070Italian lexicon with morphological information and clitic verbs
    This Italian lexicon is the same as the one described in ELRA-L0069, but with the addition of clitic verbs, which increases the number of inflected forms to 1,800,000 (still corresponding to 112,000 simple word lemmas). It contains 66,340 nouns, 12,030 verbs, 28,080 adjectives, 4,890 adverbs, 660 pronouns, articles, prepositions/postpositions and conjunctions. Each line in the resource file shows an inflected form, its part of speech, its related lemma and its morphological information.

    ELRA-L0071Spanish lexicon with morphological information
    This Spanish lexicon is made up of 816,000 inflected forms corresponding to 104,000 simple word lemmas (including 52,000 nouns, 9,800 verbs, 21,200 adjectives, 20,500 adverbs, 500 pronouns, articles, prepositions/postpositions and conjunctions). Each line in the resource file shows an inflected form, its part of speech, its related lemma and its morphological information.

    ELRA-L0072-01PAROLE-SIMPLE-CLIPS PISA Italian Lexicon – Full lexicon
    PAROLE-SIMPLE-CLIPS is a four-level, general purpose lexicon that has been elaborated over three different projects. The PAROLE-SIMPLE-CLIPS Pisa Italian Lexicon comprises a total of 387,267 phonetic units, 53,044 morphological units (53,044 lemmas), 37,406 syntactic units (28,111 lemmas) and 28,346 semantic units (19,216 lemmas). The PAROLE-SIMPLE-CLIPS Pisa Italian Lexicon was encoded at the semantic level, in full accordance with the international standards set out in the PAROLE-SIMPLE model and based on EAGLES.

    This lexicon is subdivided into five different subsets:
    L0072-01 Full lexicon
    L0072-02 Phonetic layer
    L0072-03 Morphological layer
    L0072-04 Syntactic layer
    L0072-05 Semantic layer

    ELRA-L0072-02PAROLE-SIMPLE-CLIPS PISA Italian Lexicon – Phonetic layer
    PAROLE-SIMPLE-CLIPS is a four-level, general purpose lexicon that has been elaborated over three different projects. The PAROLE-SIMPLE-CLIPS Pisa Italian Lexicon comprises a total of 387,267 phonetic units, 53,044 morphological units (53,044 lemmas), 37,406 syntactic units (28,111 lemmas) and 28,346 semantic units (19,216 lemmas). The PAROLE-SIMPLE-CLIPS Pisa Italian Lexicon was encoded at the semantic level, in full accordance with the international standards set out in the PAROLE-SIMPLE model and based on EAGLES. Syntactic and semantic encoding were performed jointly with Thamus (Consortium for Multilingual Documentary Engineering).

    This lexicon is subdivided into five different subsets:
    L0072-01 Full lexicon
    L0072-02 Phonetic layer
    L0072-03 Morphological layer
    L0072-04 Syntactic layer
    L0072-05 Semantic layer

    ELRA-L0072-03PAROLE-SIMPLE-CLIPS PISA Italian Lexicon – Morphological layer
    PAROLE-SIMPLE-CLIPS is a four-level, general purpose lexicon that has been elaborated over three different projects. The PAROLE-SIMPLE-CLIPS Pisa Italian Lexicon comprises a total of 387,267 phonetic units, 53,044 morphological units (53,044 lemmas), 37,406 syntactic units (28,111 lemmas) and 28,346 semantic units (19,216 lemmas). The PAROLE-SIMPLE-CLIPS Pisa Italian Lexicon was encoded at the semantic level, in full accordance with the international standards set out in the PAROLE-SIMPLE model and based on EAGLES. Syntactic and semantic encoding were performed jointly with Thamus (Consortium for Multilingual Documentary Engineering).

    This lexicon is subdivided into five different subsets:
    L0072-01 Full lexicon
    L0072-02 Phonetic layer
    L0072-03 Morphological layer
    L0072-04 Syntactic layer
    L0072-05 Semantic layer

    ELRA-L0072-04PAROLE-SIMPLE-CLIPS PISA Italian Lexicon – Syntactic layer
    PAROLE-SIMPLE-CLIPS is a four-level, general purpose lexicon that has been elaborated over three different projects. The PAROLE-SIMPLE-CLIPS Pisa Italian Lexicon comprises a total of 387,267 phonetic units, 53,044 morphological units (53,044 lemmas), 37,406 syntactic units (28,111 lemmas) and 28,346 semantic units (19,216 lemmas). The PAROLE-SIMPLE-CLIPS Pisa Italian Lexicon was encoded at the semantic level, in full accordance with the international standards set out in the PAROLE-SIMPLE model and based on EAGLES. Syntactic and semantic encoding were performed jointly with Thamus (Consortium for Multilingual Documentary Engineering).

    This lexicon is subdivided into five different subsets:
    L0072-01 Full lexicon
    L0072-02 Phonetic layer
    L0072-03 Morphological layer
    L0072-04 Syntactic layer
    L0072-05 Semantic layer

    ELRA-L0072-05PAROLE-SIMPLE-CLIPS PISA Italian Lexicon – Semantic layer
    PAROLE-SIMPLE-CLIPS is a four-level, general purpose lexicon that has been elaborated over three different projects. The PAROLE-SIMPLE-CLIPS Pisa Italian Lexicon comprises a total of 387,267 phonetic units, 53,044 morphological units (53,044 lemmas), 37,406 syntactic units (28,111 lemmas) and 28,346 semantic units (19,216 lemmas). The PAROLE-SIMPLE-CLIPS Pisa Italian Lexicon was encoded at the semantic level, in full accordance with the international standards set out in the PAROLE-SIMPLE model and based on EAGLES. Syntactic and semantic encoding were performed jointly with Thamus (Consortium for Multilingual Documentary Engineering).

    This lexicon is subdivided into five different subsets:
    L0072-01 Full lexicon
    L0072-02 Phonetic layer
    L0072-03 Morphological layer
    L0072-04 Syntactic layer
    L0072-05 Semantic layer

    ELRA-L0073DIINAR.1 - Arabic Lexical Resource
    DIINAR.1 is an Arabic Lexical Resource which includes a total number of 119,693 lemmas, fully vowelled, and distributed as follows: 29,534 nouns and adjectives, 19,457 verbs, 70,702 deverbals (including 23,274 infinitive forms, 17,904 active participles, 13,373 passive participles, 5,781 ‘analogous adjectives’, 10,370 ‘nouns of place & time’). The data is provided in Excel files and was generated with inflected forms. Each entry has been associated with morpho-syntactic specifiers.

    ELRA-L0074POLEX Polish Lexicon
    The POLEX Polish Lexicon is a morphological dictionary of the Polish language. It comprises about 100,000 entries and covers the core Polish vocabulary of general interest. It is based on a precise machine-interpretable formalism (coding system), the same for all categories (parts of speech). The dictionary entries are of the following form:
    BASIC_FORM+LIST_OF_STEMS+PARADIGMATIC_CODE+DISTRIBUTION_OF_STEMS
    It contains more than 42,000 nouns, 12,000 verbs, 15,000 adjectives, 25,000 participles, and about 200 pronouns. A simple lemmatiser (in the form of a PROLOG prototype) is also included.

    ELRA-L0075Bulgarian Linguistic Database
    This database contains 81,647 entries in Bulgarian, together with a linguistic environment tool (for Windows XP). The data may be used for morphological analysis and synthesis, syntactic agreement checking, and phonetic stress determination.

    ELRA-L0076Polderland Dutch Lexicon of Abbreviations and Acronyms
    The lexicon contains 2,180 Dutch abbreviations and acronyms. It complies with the official Dutch Spelling (2005/6). Each entry consists of an ID, word form, lemma and part of speech.

    ELRA-L0077Polderland Dutch General Lexicon
    The lexicon contains 400,463 Dutch words, comprising 236,369 nouns, 90,882 adjectives, 69,744 verbs, 2,120 adverbs, and 1,348 items from other categories (pronouns, determiners, articles, adpositions, conjunctions, numerals, etc.). It complies with the official Dutch Spelling (2005/6). The lexicon contains an ID, word form, lemma and part of speech.

    ELRA-L0078Polderland Dutch Lexicon of Names
    The lexicon contains 24,247 Dutch proper names. Various sorts of proper names are included, such as first names, last names, geographical names etc. Each entry contains an ID, word form, lemma, part of speech and proper name type.

    ELRA-L0079Polderland Dutch Lexicon of Business Terminology
    The lexicon contains 15,987 Dutch words from the business domain, comprising 13,774 nouns, 1,267 adjectives, 895 verbs, 9 adverbs, and 42 items from other categories. The lexicon complies with the official Dutch Spelling (2005). Each entry contains an ID, word form and part of speech.

    ELRA-L0080Polderland Dutch Lexicon of Legal Terminology
    The lexicon contains 6,207 Dutch words from the legal domain, comprising 4,781 nouns, 810 adjectives, 573 verbs, 12 adverbs and 31 items from other categories. It complies with the official Dutch Spelling (2005/6). Each entry contains an ID, word form and part of speech.

    ELRA-L0081Polderland Dutch Lexicon of Medical Terminology
    The lexicon contains 17,115 Dutch words from the medical domain, comprising 12,638 nouns, 3,107 adjectives, 1,273 verbs, 11 adverbs and 86 items from other categories. It complies with the official Dutch Spelling (2005/6). Each entry contains an ID, word form and part of speech.

    ELRA-L0082Polderland Dutch Lexicon of Social Terminology
    The lexicon contains 12,551 Dutch words from the social domain, comprising 9,984 nouns, 1,306 adjectives, 1,161 verbs, 56 adverbs and 44 items from other categories. It complies with the official Dutch Spelling (2005/6). Each entry contains an ID, word form and part of speech.

    ELRA-L0083Polderland Dutch Lexicon of Technical Terminology
    The lexicon contains 9,940 Dutch words from the technical/scientific domain, comprising 8,832 nouns, 950 adjectives, 111 verbs, 2 adverbs and 45 items from other categories. It complies with the official Dutch Spelling (2005/6). Each entry contains an ID, word form and part of speech.

    ELRA-L0084Macedonian Morphological Lexicon (MACPLEX)
    MACPLEX comprises two dictionaries: a dictionary of lemmas (89,026 entries) and a dictionary of word forms (1,480,201 entries). Morphological information (PoS; gender, case, definiteness and number for nouns; tense, person, etc. for verbs) is available for each entry. The 89,026 lemmas comprise 40,671 nouns, 12,235 adjectives, 20,874 verbs, 14,317 adverbs, 153 interjections, 64 conjunctions, 65 prepositions, 132 numerals, 66 pronouns, 63 particles and 386 residuals. The lexicon is available in Unicode.

    ELRA-L0085euLEX (Lexical Database for Basque)
    euLEX is a general lexicon which contains 115,000 entries, divided into 94,000 dictionary entries or lemmas, 12,000 allomorphs, 7,500 verb forms and about 1,200 dependent morphemes. All entries include linguistic information such as morphology and usage. The lexicon is in XML.

    ELRA-L0086Persian Multext-East framework lexicon
    This is a Persian (Farsi) morphosyntactic lexicon derived from the Persian 1984 corpus (Multext-East framework) (see ELRA-W0054). It contains the full inflectional paradigms of a superset of lemmas that appear in the Persian 1984 corpus. Each entry gives the word-form, its lemma and morphosyntactic description. The lexicon contains 13,247 entries.

    ELRA-L0087Persian Lexicon
    This is a Persian (Farsi) lexicon of more than 40,000 entries of non-inflected forms of words. Each word is transliterated based on the proposed framework from MBROLA (Text-To-Speech synthesizer). The database includes a large variety of descriptors for each entry (plural, homograph, ...). The lexicon is provided as an MS Access database.

    ELRA-L0088Arabic Morphological Dictionary
    The Arabic Morphological Dictionary contains 4,912,749 entries, including 3,374,852 nouns, 1,537,699 verbs and 198 grammatical words. All files are provided as plain text in UTF-8 character encoding, which represents about 154 MB of data.

    ELRA-L0089Macedonian lexicon of toponyms (MACPLEX_TOPO)
    MACPLEX_TOPO lexicon contains 1,398 lemmas and 40,246 word forms (787 places, 428 regions, 68 waters, 47 peoples, 45 mountains, 27 lands). New words related to toponyms (their inhabitants and related adjectives) are derived. The lexicon is available in Unicode.

    ELRA-L0090Macedonian lexicon of proper nouns (MACPLEX_PROPERS)
    MACPLEX_PROPERS contains 15,422 lemmas and 157,321 word forms (2,516 first names, 12,322 last names, 148 other human names, 426 companies and 22 brands). Adjectives related to proper nouns are derived. The lexicon is available in Unicode.

    ELRA-L0091Macedonian lexicon of derived adjectives (MACPLEX_ADJDERV)
    This lexicon contains 12,073 lemmas (10,233 with suffix –чки, 1,840 with suffix –билен) and 281,488 word forms. The lexicon is available in Unicode.

    ELRA-L0092Macedonian lexicon of participles (MACPLEX_ADJPARTIC)
    This lexicon contains 19,552 lemmas and 1,251,328 word forms. The lemmas are derived from verbs. The lexicon is available in Unicode.

    ELRA-L0093Macedonian lexicon of compound words (MACPLEX_COMP)
    This lexicon contains 784 lemmas (576 nouns, 25 adjectives, 73 adverbs, 66 interjections, 17 numerals, 15 pronouns and 12 residuals) and 6,289 word forms. The lexicon is available in Unicode.

    ELRA-L0094CEPLEXicon
    CEPLEXicon results from the automatic tagging of two corpora, using a tagger and the POS tag set. The automatic tagging was followed by a partial manual revision. This lexicon covers all the speech produced by seven monolingual Portuguese children aged 1;02.00 to 3;11.12, in a total of 114 files, each corresponding to 40-50 minutes of child-adult interaction in a naturalistic setting. The lexicon is presented in .xls format and includes 2,201 lemmas, the number of occurrences of each lemma in three different age periods, the frequency of the lemma in each period and the age of first occurrence for each child.

    ELRA-L0095-01GLiCom Spanish Wordform list – Regular word-forms + verb-clitic combinations
    GLiCom Spanish Wordform List v.1 is a computational lexicon of inflected wordforms in Spanish. Each entry has the following information: (i) lemma, (ii) morphosyntactic tag, and (iii) word type. The lexicon is distributed in two sublexicons: a list of wordforms which contains 1,152,242 entries, and a list of verb-clitic combinations which contains 4,283,637 entries.

    ELRA-L0095-02GLiCom Spanish Wordform list – Regular word-forms
    GLiCom Spanish Wordform List v.1 is a computational lexicon of inflected wordforms in Spanish. Each entry has the following information: (i) lemma, (ii) morphosyntactic tag, and (iii) word type. This set is a subdivision of the full lexicon and contains the list of word forms, which consists of 1,152,242 entries. For the full lexicon, see ELRA-L0095-01.

    ELRA-L0095-03GLiCom Spanish Wordform list – Verb-clitic combinations
    GLiCom Spanish Wordform List v.1 is a computational lexicon of inflected wordforms in Spanish. Each entry has the following information: (i) lemma, (ii) morphosyntactic tag, and (iii) word type. This set is a subdivision of the full lexicon and contains the list of verb-clitic combinations, which consists of 4,283,637 entries. For the full lexicon, see ELRA-L0095-01.

    ELRA-L0096MCL - Multifunctional Computational Lexicon of Contemporary Portuguese
    MCL is a 26,443 lemma Frequency Lexicon with 140,315 tokens extracted from CORLEX, a contemporary Portuguese corpus (16,210,438 words). In order to extract the lexicon, all the different lexical forms occurring in the corpus were indexed and subsequently tagged morphosyntactically and lemmatised by PALAVROSO. Each lemma in MCL is followed by morphosyntactic and quantitative information.

    ELRA-L0097LEX-MWE-PT - Word Combination in Portuguese
    LEX-MWE-PT is a lexicon of European Portuguese containing multiword expressions (MWE) extracted from a balanced 50.8M-word written corpus. The lexicon covers 1,198 lemmas (composed of single words from different PoS categories: nouns, adjectives, verbs and adverbs); 12,753 MWE lemmas (which include inflectional variants of the MWE lemmas); and 242,233 concordances of those MWE manually verified.

    ELRA-L0098Arabic dictionary of inflected words
    This dictionary consists of a list of 6 million inflected forms, fully vowelized, and tagged with grammatical information which includes POS and grammatical features, including number, gender, case, definiteness, tense, mood and compatibility with clitic agglutination. The data is formatted in conformity with the data formats of Unitex/GramLab.
    This dictionary is also available together with recognition of agglutinated clitics and an inflection system in the ELRA Catalogue under reference ELRA-L0099.

    ELRA-L0099Arabic dictionary of inflected words with recognition of agglutinated clitics and inflection system
    This dictionary consists of 6 million inflected forms, fully vowelized, generated in compliance with the grammatical rules of Arabic and tagged with grammatical information which includes POS and grammatical features, including number, gender, case, definiteness, tense, mood and compatibility with clitic agglutination. It is accompanied by a grammatical resource that recognizes hundreds of millions of valid agglutinated words. In order to be able to update the full-form dictionary, a dictionary of 65,000 lemmas and the data required to inflect them and regenerate the full-form dictionary are also provided. The data is formatted in conformity with the data formats of Unitex/GramLab.

    This dictionary is also available without recognition of agglutinated clitics and without inflection system in the ELRA Catalogue under reference ELRA-L0098.

    ELRA-L0100French dictionary of definitions (SYNAPSE)
    The French dictionary of definitions (SYNAPSE) consists of 216,835 entries (147,378 nouns, 80,552 adjectives, 24,001 verbs, 4,677 adverbs, 1,560 prefixes, 107 prepositions, 614 interjections, 147 pronouns, 42 conjunctions, 27 articles), 309,078 definitions and 7,395 phraseological units (phrases). Grammatical information for each entry consists of: grammatical category, gender, number, inflected forms. This dictionary is provided in XML format together with its DTD.

    ELRA-M0001Basic multilingual lexicon (MEMODATA)
    30,000 entries (associated by meaning) for French, English, Italian, German and Spanish, with lexical categories.

    ELRA-M0002-01Bilingual Spanish-English and English-Spanish lexicons (INCYTA) - Economics, law & business management
    10,642 entries (with morphological information) for Economics, law & business management.

    ELRA-M0002-02Bilingual Spanish-English and English-Spanish lexicons (INCYTA) - Leisure, Tourism, Sports, Food
    3,144 entries (with morphological information) for Leisure, Tourism, Sports, Food.

    ELRA-M0002-03Bilingual Spanish-English and English-Spanish lexicons (INCYTA) - Geography, History, Arts
    4,116 entries (with morphological information) for Geography, History, Arts.

    ELRA-M0002-04Bilingual Spanish-English and English-Spanish lexicons (INCYTA) - Sociology, Psychology, Pedagogy
    4,089 entries (with morphological information) for Sociology, Psychology, Pedagogy.

    ELRA-M0002-05Bilingual Spanish-English and English-Spanish lexicons (INCYTA) - Natural and medical sciences
    10,535 entries (with morphological information) for Natural and medical sciences.

    ELRA-M0002-06Bilingual Spanish-English and English-Spanish lexicons (INCYTA) - Exact sciences, Physics, Chemistry, Geology
    10,616 entries (with morphological information) for Exact sciences, Physics, Chemistry, Geology.

    ELRA-M0002-07Bilingual Spanish-English and English-Spanish lexicons (INCYTA) - Data Processing, Electronics, Telecoms
    4,904 entries (with morphological information) for Data Processing, Electronics, Telecoms.

    ELRA-M0002-08Bilingual Spanish-English and English-Spanish lexicons (INCYTA) - Technology, Engineering & Construction
    11,953 entries (with morphological information) for Technology, Engineering & Construction.

    ELRA-M0002-09Bilingual Spanish-English and English-Spanish lexicons (INCYTA) - Economics
    1,320 entries (with morphological information) for Economics.

    ELRA-M0002-10Bilingual Spanish-English and English-Spanish lexicons (INCYTA) - Data Processing
    3,565 entries (with morphological information) for Data Processing.

    ELRA-M0002-11Bilingual Spanish-English and English-Spanish lexicons (INCYTA) - Telecommunications
    3,733 entries (with morphological information) for Telecommunications.

    ELRA-M0002-12Bilingual Spanish-English and English-Spanish lexicons (INCYTA) - Electrical Engineering
    1,760 entries (with morphological information) for Electrical Engineering.

    ELRA-M0002-13Bilingual Spanish-English and English-Spanish lexicons (INCYTA) - Plastics and Chemistry
    9,022 entries (with morphological information) for Plastics and Chemistry.

    ELRA-M0002-14Bilingual Spanish-English and English-Spanish lexicons (INCYTA) - Aeronautics, Navigation, Mechanical Engineering
    23,170 entries (with morphological information) for Aeronautics, Navigation, Mechanical Engineering.

    ELRA-M0003Danish-German dictionary (Institut for Erhvervsinformatik)
    10,000 entries giving the German lexeme and Danish equivalent with word class, subject area, indication of structural changes, developed for machine translation.

    ELRA-M0004-01Dutch-French Lexicon (LanTmark)
    Vocabularies for transfer: General Vocabulary, 26,000 entries.
    Each entry contains domain information, source language disambiguation, features, target language actions.

    ELRA-M0004-02Dutch-French Lexicon (LanTmark)
    Vocabularies for transfer: Administrative, 32,000 entries.
    Each entry contains domain information, source language disambiguation, features, target language actions.

    ELRA-M0004-03Dutch-French Lexicon (LanTmark)
    Vocabularies for transfer: Data processing, 10,000 entries.
    Each entry contains domain information, source language disambiguation, features, target language actions.

    ELRA-M0005English-French Lexicon (LanTmark)
    General vocabulary for transfer. 33,287 entries consisting of nouns (about 14,000), verbs (about 7,000), adjectives (about 5,000) and adverbs (about 1,000), including domain information, source language disambiguation, features, and target language actions.

    ELRA-M0006-01French-Dutch Lexicon (LanTmark)
    Vocabularies for transfer: General Vocabulary, 34,000 entries.
    Each entry contains source language disambiguation, features, and target language actions, developed for automatic translation.

    ELRA-M0006-02French-Dutch Lexicon (LanTmark)
    Vocabularies for transfer: Administrative, 18,000 entries.
    Each entry contains source language disambiguation, features, and target language actions, developed for automatic translation.

    ELRA-M0006-03French-Dutch Lexicon (LanTmark)
    Vocabularies for transfer: Data processing, 10,000 entries.
    Each entry contains source language disambiguation, features, and target language actions, developed for automatic translation.

    ELRA-M0007French-English Lexicon (LanTmark)
    General vocabulary for transfer. 39,453 entries: nouns (about 21,000), verbs (about 9,000), adjectives (about 3,000), adverbs (about 1,000), including domain information, source language disambiguation, features, and target language actions, developed for automatic translation.

    ELRA-M0008-01German-Danish dictionaries (Institut for Erhvervsinformatik)
    6,800 technical entries giving the German lexeme and Danish equivalent with word class, subject area, indication of structural changes, developed for machine translation.

    ELRA-M0008-02German-Danish dictionaries (Institut for Erhvervsinformatik)
    15,500 general entries giving the German lexeme and Danish equivalent with word class, subject area, indication of structural changes, developed for machine translation.

    ELRA-M0009-01THAMUS Bilingual dictionaries - Computer Science (1)
    Computer Science, canonical forms: 17,800 entries, German=>Italian. Data contain morphological coding.

    ELRA-M0009-02THAMUS Bilingual dictionaries - Computer Science (2)
    Computer Science, canonical forms: 17,800 entries, Italian=>German. Data contain morphological coding.

    ELRA-M0009-03THAMUS Bilingual dictionaries - Computer Science (3)
    Computer science, inflected forms: 35,000 entries, German=>Italian. Data contain morphological coding.

    ELRA-M0009-04THAMUS Bilingual dictionaries - Computer Science (4)
    Computer Science, inflected forms: 35,000 entries, Italian=>German. Data contain morphological coding.

    ELRA-M0010-01THAMUS Bilingual dictionaries - Aeronautics (1)
    Aeronautics: 6,300 entries, English=>Italian. Data contain morphological coding.

    ELRA-M0010-02THAMUS Bilingual dictionaries - Aeronautics (2)
    Aeronautics: 6,300 entries, Italian=>English. Data contain morphological coding.

    ELRA-M0010-03THAMUS Bilingual dictionaries - Law (1)
    Law, canonical forms: 8,900 entries, English=>Italian. Data contain morphological coding.

    ELRA-M0010-04THAMUS Bilingual dictionaries - Law (2)
    Law, canonical forms: 8,900 entries, Italian=>English. Data contain morphological coding.

    ELRA-M0010-05THAMUS Bilingual dictionaries - Law (3)
    Law, inflected forms: 18,000 entries, English=>Italian. Data contain morphological coding.

    ELRA-M0010-06THAMUS Bilingual dictionaries - Law (4)
    Law, inflected forms: 18,000 entries, Italian=>English. Data contain morphological coding.

    ELRA-M0010-07THAMUS Bilingual dictionaries - Computer science (5)
    Computer science, canonical forms: 15,700 entries, English=>Italian. Data contain morphological coding.

    ELRA-M0010-08THAMUS Bilingual dictionaries - Computer science (6)
    Computer science, canonical forms: 15,700 entries, Italian=>English. Data contain morphological coding.

    ELRA-M0010-09THAMUS Bilingual dictionaries - Computer science (7)
    Computer science, inflected forms: 32,000 entries, English=>Italian. Data contain morphological coding.

    ELRA-M0010-10THAMUS Bilingual dictionaries - Computer science (8)
    Computer science, inflected forms: 32,000 entries, Italian=>English. Data contain morphological coding.

    ELRA-M0010-11THAMUS Bilingual dictionaries - Medicine (1)
    Medicine, canonical forms: 20,000 entries, English=>Italian. Data contain morphological coding.

    ELRA-M0010-12THAMUS Bilingual dictionaries - Medicine (2)
    Medicine, canonical forms: 20,000 entries, Italian=>English. Data contain morphological coding.

    ELRA-M0010-13THAMUS Bilingual dictionaries - Economics (1)
    Economics, canonical forms: 50,000 entries, English=>Italian. Data contain morphological coding.

    ELRA-M0010-14THAMUS Bilingual dictionaries - Economics (2)
    Economics, canonical forms: 50,000 entries, Italian=>English. Data contain morphological coding.

    ELRA-M0010-15THAMUS Bilingual dictionaries - Economics (3)
    Economics, inflected forms: 86,000 entries, English=>Italian. Data contain morphological coding.

    ELRA-M0010-16THAMUS Bilingual dictionaries - Economics (4)
    Economics, inflected forms: 86,000 entries, Italian=>English. Data contain morphological coding.

    ELRA-M0010-17THAMUS Bilingual dictionaries - Engineering (1)
    Engineering, canonical forms: 13,000 entries, English=>Italian. Data contain morphological coding.

    ELRA-M0010-18THAMUS Bilingual dictionaries - Engineering (2)
    Engineering, canonical forms: 13,000 entries, Italian=>English. Data contain morphological coding.

    ELRA-M0010-19THAMUS Bilingual dictionaries - Engineering (3)
    Engineering, inflected forms: 27,000 entries, English=>Italian. Data contain morphological coding.

    ELRA-M0010-20THAMUS Bilingual dictionaries - Engineering (4)
    Engineering, inflected forms: 27,000 entries, Italian=>English. Data contain morphological coding.

    ELRA-M0013Bilingual Collocational Dictionary (Horst Bogatz)
    The bilingual English-German collocational dictionary consists of around 69,000 English headwords, including concepts expressed with more than one word and compounds. It contains 60,285 fixed collocations, 2,141 verbs, 4,662 adjectives, 1,229 adverbs, and the synonyms that collocate with the headwords. The German equivalents add up to the largest collection of fixed German collocations as well.

    ELRA-M0014-01Bilingual Dictionaries - English <=> Spanish I
    25,000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-02Bilingual Dictionaries - English <=> Spanish II
    60,000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-03Bilingual Dictionaries - English <=> Spanish III
    100,000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-04Bilingual Dictionaries - English <=> Spanish IV
    200,000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-05Bilingual Dictionaries - English <=> French I
    40,000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-06Bilingual Dictionaries - English <=> French II
    80,000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-07Bilingual Dictionaries - English <=> French III
    100,000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-08Bilingual Dictionaries - English <=> French IV
    200,000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-09Bilingual Dictionaries - English <=> German I
    40,000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-10Bilingual Dictionaries - English <=> German II
    80,000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-11Bilingual Dictionaries - English <=> German III
    126,000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-12Bilingual Dictionaries - English <=> Italian I
    20,000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-13Bilingual Dictionaries - English <=> Italian II
    40,000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-14Bilingual Dictionaries - English <=> Brazilian Portuguese I
    40,000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-15Bilingual Dictionaries - English <=> Brazilian Portuguese II
    80,000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-16Bilingual Dictionaries - English <=> Brazilian Portuguese III
    400,000+ entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-17Bilingual Dictionaries - English <=> Portuguese I
    40,000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-18Bilingual Dictionaries - English <=> Portuguese II
    80,000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-19Bilingual Dictionaries - English <=> Portuguese III
    110,000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-20Bilingual Dictionaries - English <=> Portuguese IV
    234,000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-21Bilingual Dictionaries - English <=> Dutch I
    40,000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-22Bilingual Dictionaries - English <=> Dutch II
    80,000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-23Bilingual Dictionaries - English <=> Dutch III
    110,000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-24Bilingual Dictionaries - English <=> Danish I
    40,000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-25Bilingual Dictionaries - English <=> Danish II
    80,000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-26Bilingual Dictionaries - English <=> Danish III
    110,000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-27Bilingual Dictionaries - English <=> Swedish I
    40,000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-28Bilingual Dictionaries - English <=> Swedish II
    80,000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-29Bilingual Dictionaries - English <=> Swedish III
    110,000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-30Bilingual Dictionaries - English <=> Finnish I
    30000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-31Bilingual Dictionaries - English <=> Icelandic I
    40000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-32Bilingual Dictionaries - English <=> Icelandic II
    80000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-33Bilingual Dictionaries - English <=> Icelandic III
    100000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-34Bilingual Dictionaries - English <=> Russian I
    40000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-35Bilingual Dictionaries - English <=> Russian II
    72000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-36Bilingual Dictionaries - English <=> Russian III
    120000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-37Bilingual Dictionaries - English <=> Russian Business
    60000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage and semantic features.

    ELRA-M0014-38Bilingual Dictionaries - English <=> Russian Aerospace and Aeronautics
    60000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage and semantic features.

    ELRA-M0014-39Bilingual Dictionaries - English <=> Russian Automotive
    40000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage and semantic features.

    ELRA-M0014-40Bilingual Dictionaries - English <=> Russian Minerals & Mining
    60000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage and semantic features.

    ELRA-M0014-41Bilingual Dictionaries - English <=> Polish I
    30000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-42Bilingual Dictionaries - English <=> Polish II
    80000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-43Bilingual Dictionaries - English <=> Polish III
    124000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-44Bilingual Dictionaries - English <=> Polish IV
    150000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-45Bilingual Dictionaries - English <=> Hungarian I
    30000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-46Bilingual Dictionaries - English <=> Hungarian II
    80000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-47Bilingual Dictionaries - English <=> Hungarian III
    124000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-48Bilingual Dictionaries - English <=> Czech I
    40000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-49Bilingual Dictionaries - English <=> Romanian Starter
    10000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-50Bilingual Dictionaries - English <=> Croatian I
    30000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-51Bilingual Dictionaries - English <=> Bosnian I
    30000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-52Bilingual Dictionaries - English <=> Serbian I (Latin or Cyrillic)
    30000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-53Bilingual Dictionaries - English <=> Japanese I
    40000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0014-54Bilingual Dictionaries - English <=> Greek
    60000 entries.
    Bilingual dictionaries containing local linguistic variants, local spelling variants, word frequency, usage (familiar, old, slang, etc.) and semantic features.

    ELRA-M0015EuroWordNet English Addition to English WordNet
    Each EuroWordNet database is composed of the following:
    - The Inter-Lingual-Index, which is a list of records (ILI-records), in the form of synsets mainly taken from WordNet1.5 or manually created.
    - A top-ontology which consists of an ontology of 63 basic semantic classes based on fundamental distinctions.
    - A domain-ontology which consists of an ontology of subject-domains optionally assigned to ILI-records.
    - A selection of ILI-records, the so-called Base-Concepts, which play a major role in the different wordnets.
    - WordNet1.5 (91591 synsets; 168217 meanings; 126520 entry words) in EuroWordNet format.
    Number of synsets for M0015 = 16361 synsets

    ELRA-M0016EuroWordNet Dutch
    Each EuroWordNet database is composed of the following:
    - The Inter-Lingual-Index, which is a list of records (ILI-records), in the form of synsets mainly taken from WordNet1.5 or manually created.
    - A top-ontology which consists of an ontology of 63 basic semantic classes based on fundamental distinctions.
    - A domain-ontology which consists of an ontology of subject-domains optionally assigned to ILI-records.
    - A selection of ILI-records, the so-called Base-Concepts, which play a major role in the different wordnets.
    - WordNet1.5 (91591 synsets; 168217 meanings; 126520 entry words) in EuroWordNet format.
    Number of synsets for M0016 = 44015 synsets

    ELRA-M0017EuroWordNet Spanish
    Each EuroWordNet database is composed of the following:
    - The Inter-Lingual-Index, which is a list of records (ILI-records), in the form of synsets mainly taken from WordNet1.5 or manually created.
    - A top-ontology which consists of an ontology of 63 basic semantic classes based on fundamental distinctions.
    - A domain-ontology which consists of an ontology of subject-domains optionally assigned to ILI-records.
    - A selection of ILI-records, the so-called Base-Concepts, which play a major role in the different wordnets.
    - WordNet1.5 (91591 synsets; 168217 meanings; 126520 entry words) in EuroWordNet format.
    Number of synsets for M0017 = 23370 synsets

    ELRA-M0018EuroWordNet Italian
    Each EuroWordNet database is composed of the following:
    - The Inter-Lingual-Index, which is a list of records (ILI-records), in the form of synsets mainly taken from WordNet1.5 or manually created.
    - A top-ontology which consists of an ontology of 63 basic semantic classes based on fundamental distinctions.
    - A domain-ontology which consists of an ontology of subject-domains optionally assigned to ILI-records.
    - A selection of ILI-records, the so-called Base-Concepts, which play a major role in the different wordnets.
    - WordNet1.5 (91591 synsets; 168217 meanings; 126520 entry words) in EuroWordNet format.
    Number of synsets for M0018 = 48529 synsets

    ELRA-M0019EuroWordNet German
    Each EuroWordNet database is composed of the following:
    - The Inter-Lingual-Index, which is a list of records (ILI-records), in the form of synsets mainly taken from WordNet1.5 or manually created.
    - A top-ontology which consists of an ontology of 63 basic semantic classes based on fundamental distinctions.
    - A domain-ontology which consists of an ontology of subject-domains optionally assigned to ILI-records.
    - A selection of ILI-records, the so-called Base-Concepts, which play a major role in the different wordnets.
    - WordNet1.5 (91591 synsets; 168217 meanings; 126520 entry words) in EuroWordNet format.
    Number of synsets for M0019 = 15132 synsets

    ELRA-M0020EuroWordNet French
    Each EuroWordNet database is composed of the following:
    - The Inter-Lingual-Index, which is a list of records (ILI-records), in the form of synsets mainly taken from WordNet1.5 or manually created.
    - A top-ontology which consists of an ontology of 63 basic semantic classes based on fundamental distinctions.
    - A domain-ontology which consists of an ontology of subject-domains optionally assigned to ILI-records.
    - A selection of ILI-records, the so-called Base-Concepts, which play a major role in the different wordnets.
    - WordNet1.5 (91591 synsets; 168217 meanings; 126520 entry words) in EuroWordNet format.
    Number of synsets for M0020 = 22745 synsets

    ELRA-M0021EuroWordNet Czech
    Each EuroWordNet database is composed of the following:
    - The Inter-Lingual-Index, which is a list of records (ILI-records), in the form of synsets mainly taken from WordNet1.5 or manually created.
    - A top-ontology which consists of an ontology of 63 basic semantic classes based on fundamental distinctions.
    - A domain-ontology which consists of an ontology of subject-domains optionally assigned to ILI-records.
    - A selection of ILI-records, the so-called Base-Concepts, which play a major role in the different wordnets.
    - WordNet1.5 (91591 synsets; 168217 meanings; 126520 entry words) in EuroWordNet format.
    Number of synsets for M0021 = 12824 synsets

    ELRA-M0022EuroWordNet Estonian
    Each EuroWordNet database is composed of the following:
    - The Inter-Lingual-Index, which is a list of records (ILI-records), in the form of synsets mainly taken from WordNet1.5 or manually created.
    - A top-ontology which consists of an ontology of 63 basic semantic classes based on fundamental distinctions.
    - A domain-ontology which consists of an ontology of subject-domains optionally assigned to ILI-records.
    - A selection of ILI-records, the so-called Base-Concepts, which play a major role in the different wordnets.
    - WordNet1.5 (91591 synsets; 168217 meanings; 126520 entry words) in EuroWordNet format.
    Number of synsets for M0022 = 9317 synsets

    ELRA-M0025Bilingual English-Russian Russian-English Dictionaries
    Produced with funding from ELRA in the framework of the European Commission project LRsP&P (Language Resources Production & Packaging - LE4-8335), these bilingual dictionaries contain more than 350,000 pairs of words (in tabular form) in XML format:
    - Russian-English dictionary - more than 130,000 entries
    - English-Russian dictionary - more than 95,000 entries
    Each entry contains: source word (lemma); part of speech of source word; target word(s) (lemma(s)), grouped by same meaning; part of speech of target word(s); domain(s).

    ELRA-M0026-01MultiWordNet database (included semantic fields) (MultiWordNet)
    MultiWordNet:  MultiWordNet contains information about the following aspects of the English and Italian lexicons: lexical relations between words, semantic relations between lexical concepts, correspondences between Italian and English lexical concepts, semantic fields. Information about 51,000 Italian word meanings and 28,000 synsets (in correspondence with the English equivalents) is included. MultiWordNet can be used for NLP applications such as information retrieval, semantic tagging, disambiguation, terminology, etc.

    ELRA-M0026-02Labelling of WordNet 1.6 with semantic fields (WordNet Domains)
    MultiWordNet:  MultiWordNet contains information about the following aspects of the English and Italian lexicons: lexical relations between words, semantic relations between lexical concepts, correspondences between Italian and English lexical concepts, semantic fields. Information about 51,000 Italian word meanings and 28,000 synsets (in correspondence with the English equivalents) is included. MultiWordNet can be used for NLP applications such as information retrieval, semantic tagging, disambiguation, terminology, etc.

    ELRA-M0027Oxford French Minidictionary
    Over 100,000 words, phrases and translations are included in this bilingual minidictionary, which is available in SGML. Complementary information, such as usage notes, is also provided.

    ELRA-M0028Concise Oxford-Duden German Dictionary
    This bilingual dictionary contains 150,000 words and phrases, and 240,000 translations, and is available in XML and SGML.

    ELRA-M0029Pocket Oxford Italian Dictionary
    This is a mid-sized dictionary to cover essential terms and vocabulary, available in XML and SGML. It contains 80,000 words and phrases, and 115,000 translations.

    ELRA-M0030Concise Oxford Spanish Dictionary
    The coverage of this concise Oxford Spanish dictionary includes 24 varieties of Spanish as it is written and spoken throughout the Spanish-speaking world. This bilingual dictionary contains 170,000 words and phrases and 240,000 translations. It is available in SGML and XML.

    ELRA-M0031Oxford Business French Dictionary
    This dictionary covers the general language of Business across a range of core areas. It contains over 50,000 words and phrases, and is available in SGML.

    ELRA-M0032Oxford Business Spanish Dictionary
    This dictionary covers the general language of Business across a range of core areas. It contains over 50,000 words and phrases, and is available in SGML.

    ELRA-M0033SCI-FRAN-EURADIC French-English Bilingual Dictionary
    SCI-FRAN-EURADIC:  This bilingual dictionary was expanded and improved within the French national project EurRADic (European and Arabic Dictionaries and Corpora), as part of the Technolangue programme funded by the French Ministry of Industry. It contains 243,539 pairs of French-English terms, with their part of speech. The data are presented in a table format, where information related to each entry is separated by ";".

    See also ELRA-L0049, ELRA-L0050, ELRA-L0051, ELRA-L0052, ELRA-L0053, ELRA-M0034, ELRA-M0035, ELRA-M0036, ELRA-M0037, ELRA-M0038.

    ELRA-M0034SCI-FRAL-EURADIC French-German Bilingual Dictionary
    SCI-FRAL-EURADIC:  This bilingual dictionary was developed within the French national project EurRADic (European and Arabic Dictionaries and Corpora), as part of the Technolangue programme funded by the French Ministry of Industry. It contains 170,967 pairs of French-German terms, with their part of speech. The data are presented in a table format, where information related to each entry is separated by ";".

    See also ELRA-L0049, ELRA-L0050, ELRA-L0051, ELRA-L0052, ELRA-L0053, ELRA-M0033, ELRA-M0035, ELRA-M0036, ELRA-M0037, ELRA-M0038.

    ELRA-M0035SCI-FRES-EURADIC French-Spanish Bilingual Dictionary
    SCI-FRES-EURADIC:  This bilingual dictionary was expanded and improved within the French national project EurRADic (European and Arabic Dictionaries and Corpora), as part of the Technolangue programme funded by the French Ministry of Industry. It contains 102,941 pairs of French-Spanish terms, with their part of speech. The data are presented in a table format, where information related to each entry is separated by ";".

    See also ELRA-L0049, ELRA-L0050, ELRA-L0051, ELRA-L0052, ELRA-L0053, ELRA-M0033, ELRA-M0034, ELRA-M0036, ELRA-M0037, ELRA-M0038.

    ELRA-M0036SCI-FRIT-EURADIC French-Italian Bilingual Dictionary
    SCI-FRIT-EURADIC:  This bilingual dictionary was developed within the French national project EurRADic (European and Arabic Dictionaries and Corpora), as part of the Technolangue programme funded by the French Ministry of Industry. It contains 116,587 pairs of French-Italian terms, with their part of speech. The data are presented in a table format, where information related to each entry is separated by ";".

    See also ELRA-L0049, ELRA-L0050, ELRA-L0051, ELRA-L0052, ELRA-L0053, ELRA-M0033, ELRA-M0034, ELRA-M0035, ELRA-M0037, ELRA-M0038.

    ELRA-M0037SCI-ANES English-Spanish Bilingual Dictionary
    SCI-ANES:  This bilingual dictionary contains around 60,000 pairs of English-Spanish terms, with their part of speech. The data are presented in a table format, where information related to each entry is separated by ";".

    See also ELRA-L0049, ELRA-L0050, ELRA-L0051, ELRA-L0052, ELRA-L0053, ELRA-M0033, ELRA-M0034, ELRA-M0035, ELRA-M0036, ELRA-M0038.

    ELRA-M0038SCI-AN-ALL English-German Bilingual Dictionary
    SCI-AN-ALL:  This bilingual dictionary contains 59,758 pairs of English-German terms, with their part of speech. The data are presented in a table format, where information related to each entry is separated by ";".

    See also ELRA-L0049, ELRA-L0050, ELRA-L0051, ELRA-L0052, ELRA-L0053, ELRA-M0033, ELRA-M0034, ELRA-M0035, ELRA-M0036, ELRA-M0037.

    ELRA-M0039SCI-ALRU German-Russian Bilingual Dictionary
    SCI-ALRU:  This bilingual dictionary contains around 80,000 pairs of German-Russian terms, with their part of speech.
    The data are presented in a table format, where information related to each entry is separated by ";".
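
    The ";"-separated table format described for the SCI dictionaries above can be read with a few lines of Python. This is only a minimal sketch: the catalogue does not specify the column order, so the four-column layout (source term; source part of speech; target term; target part of speech), the TermPair type, the parse_pairs function and the sample rows are all assumptions made for illustration, not the documented layout of any of these resources.

```python
import csv
import io
from typing import NamedTuple


class TermPair(NamedTuple):
    """One dictionary row. The column order is an assumption, not documented."""
    source: str
    source_pos: str
    target: str
    target_pos: str


def parse_pairs(text: str) -> list[TermPair]:
    """Parse ';'-separated rows, skipping blank or malformed lines."""
    pairs = []
    for row in csv.reader(io.StringIO(text), delimiter=";"):
        fields = [field.strip() for field in row]
        # Keep only rows with exactly four non-empty fields.
        if len(fields) == 4 and all(fields):
            pairs.append(TermPair(*fields))
    return pairs


# Hypothetical sample rows, invented for this sketch.
sample = "Haus;noun;дом;noun\nlaufen;verb;бежать;verb\nbroken line\n"
pairs = parse_pairs(sample)
for pair in pairs:
    print(f"{pair.source} ({pair.source_pos}) -> {pair.target} ({pair.target_pos})")
```

    Rows that do not have exactly four non-empty fields are silently dropped here; a real loader would more likely log them, and would need the actual column layout from the resource's documentation.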

    ELRA-M0040DixAF (Bilingual Dictionary French Arabic, Arabic French)
    DixAF:  DixAF is a French-Arabic, Arabic-French dictionary, which consists of around 125,000 binary links between ca. 43,000 French entries and ca. 35,000 Arabic entries.

    ELRA-M0041Bulgarian WordNet
    The Bulgarian WordNet is a network of lexical-semantic relations, an electronic thesaurus with a structure modelled on that of the Princeton WordNet and those constructed in the EuroWordNet and BalkaNet projects. The Bulgarian WordNet describes the meaning of a lexical unit by placing it within a network of semantic relations, such as hypernymy, meronymy, antonymy, etc. It contains 38209 synsets, 83493 literals, 89242 relations (including 58095 semantic relations, 4172 extralinguistic relations).

    ELRA-M0042ItalWordNet (Italian WordNet)
    ItalWordNet (Italian WordNet) is an updated version of the EuroWordNet Italian database. The ItalWordNet database was produced within a national Italian programme called SI-TAL. It contains a total of 49,360 synsets. The ItalWordNet is provided in XML format. The original EuroWordNet Italian database is also included in this package.

    ELRA-M0043Russian => English MT optimized lexicon in OLIF XML
    This lexicon is provided as structured XML in OLIF (Open Lexicon Interchange Format). It comprises 99,211 entries in its source language (Russian) and 134,828 entries in its target language (English). The source entries are distributed as follows: 64,487 nouns, 11,470 adjectives, 19,724 verbs, 1,762 adverbs, and 1,768 closed-class elements (interjections, special prefixes, suffixes, etc.). Nouns contain gender and number information and verbs provide details on aspect and reflexivity. The entries contain semantic information in terms of domain specification or style information (e.g., colloquial, regional use, etc.). Moreover, definitions are available for 59,775 entries, as well as collocational information for 39,148 entries.

    ELRA-M0044English => Swahili Bilingual Lexicon
    This lexicon is provided as structured XML in OLIF (Open Lexicon Interchange Format). It comprises 58,247 entries in English and 58,300 in Swahili. The source entries are distributed as follows: 36,046 nouns, 3,013 adjectives, 18,308 verbs and 880 closed-class entries. The entries contain semantic information in terms of domain specification or style information (e.g., colloquial, regional use, etc.). Collocational information is also available for 17,570 entries.

    ELRA-M0045Cebuano => English Bilingual Lexicon
    This lexicon is provided as structured XML in OLIF (Open Lexicon Interchange Format). It comprises 1,988 entries in Cebuano and 1,990 in English. The source entries are distributed as follows: 1,052 nouns, 462 adjectives, 405 verbs and 69 closed-class entries. The entries contain semantic information in terms of domain specification or style information (e.g., colloquial, regional use, etc.). Collocational information is also available for 500 entries.

    ELRA-M0046English => Czech Bilingual Lexicon
    This lexicon is provided as structured XML in OLIF (Open Lexicon Interchange Format). It comprises 31,718 entries in English and 32,125 in Czech. The source entries are distributed as follows: 17,797 nouns, 7,748 adjectives, 6,039 verbs and 134 closed-class entries. The entries contain semantic information in terms of domain specification or style information (e.g., colloquial, regional use, etc.). Collocational information is also available for 3,065 entries.

    ELRA-M0047Czech WordNet
    The Czech WordNet captures nouns, verbs, adjectives, and partly adverbs, and contains 28,201 word senses (synsets). Every synset encodes the equivalence relation between several literals (at least one is present), having a unique meaning, belonging to one and the same part of speech, and expressing the same lexical meaning. Each Czech synset is related to the corresponding synset in the Princeton WordNet 2.0 via its identification number ID. There is at least one language-internal relation between a synset and another synset in the database.

    ELRA-M0048LatinWordNet
    LatinWordNet contains information about the following aspects of the Latin and English lexicon: lexical relations between words, semantic relations between lexical concepts, correspondences between Latin and English lexical concepts. LatinWordNet covers nouns, verbs, adjectives and adverbs, and contains 8,978 synsets in correspondence with the English equivalents (and with all the MultiWordNet-based wordnets).

    ELRA-M0049Basque WordNet
    The Basque WordNet models nouns, verbs and adjectives. Each sense is linked to a so-called synset (for a total of 30,281 synsets). Every synset encodes the synonymy relation between (possibly) several words (synonyms), having a unique meaning, belonging to one and the same part of speech (specified in the POS tag value), and expressing the same lexical meaning. Each synset is related to the corresponding synset in the English WordNet 1.6 via its identification number ID, which includes the synset number and the POS tag. The only exceptions are newly created synsets to account for cultural concepts not present in WordNet 1.6.

    ELRA-M0050The MWN.PT - MultiWordnet of Portuguese
    MWN.PT - MultiWordnet of Portuguese (version 1) spans over 17,200 manually validated concepts/synsets, linked under the semantic relations of hyponymy and hypernymy. These concepts are made of over 21,000 word senses/word forms and 16,000 lemmas from both European and American variants of Portuguese. They are aligned with the translationally equivalent concepts of the English Princeton WordNet and, transitively, of the MultiWordNets of Italian, Spanish, Hebrew, Romanian and Latin.

    ELRA-S0001ACCOR - English
    Acoustic and articulatory multilingual database recorded as part of the ESPRIT-ACCOR project investigating cross-language acoustic-articulatory correlations in coarticulatory processes. Only English is available.

    ELRA-S0003BDLEX 23000
    A phonetically transcribed French lexicon of 23,000 canonical entries (leading to over 270,000 forms) with the corresponding graphemical, phonological and morphosyntactical attributes.

    ELRA-S0004BDLEX
    Lexicon for written and spoken French including 440,000 inflected forms with spelling, pronunciation (phonology) and morphosyntactic attributes.

    ELRA-S0005BDSONS Base de données des sons du français
    BDSONS:  Speech database with two subsets: evaluation (sentences, logatomes, numbers, digits, etc.) & acoustic modelling (sequences of CVCV, various types of sentences, etc.). The corpus consists of 16 male and 16 female speakers.

    ELRA-S0006BREF-80
    BREF Sub-corpus containing training data of 5,330 sentences read by 80 French speakers. Texts were selected from the French newspaper Le Monde (over 20,000 words).

    ELRA-S0007BREF-POLYGLOT
    BREF Sub-corpus containing training data of 3,193 sentences read by 6 French speakers. The sentences were selected to cover a wide range of phonetic contexts.

    ELRA-S0008COLLECT
    500 speakers, half of whom called from Turin and the other half from all over Italy, automatically prompted to utter the 10 Italian digits and 5 command words.

    ELRA-S0009COST232
    Multi-English Speech database - 797 successful calls received in Italy and in the UK, using different types of collecting equipment. Repetition of the same vocabulary, the "TI (Texas Instruments) words" (digits + yes, no, go, etc.).

    ELRA-S0010Dutch Polyphone Database
    Telephone speech from 5,050 Dutch speakers. Approx. 44 items per speaker. Read & spontaneous speech (isolated words, digits, sentences, etc.).

    ELRA-S0011English SpeechDat Polyphone database DB1
    Phonetically rich sentences & application-oriented utterances such as keywords, digits, etc. 1,000 speakers recorded over digital telephone lines using fixed telephone sets.

    ELRA-S0012English SpeechDat(M) Polyphone database DB2
    Phonetically rich sentences sub-set.
    See ELRA-S0011

    ELRA-S0013Erlanger Bahnansage - ERBA
    ERBA:  Over 10,000 utterances read by over 100 German speakers. Domain of train inquiries.

    ELRA-S0014-01EUROM1f French
    The multilingual European speech database. The first truly multilingual speech database produced in Europe. Over 60 speakers per language who pronounced numbers, sentences and isolated words using a close-talking microphone.

    ELRA-S0014-02EUROM1e English
    The multilingual European speech database. The first truly multilingual speech database produced in Europe. Over 60 speakers per language who pronounced numbers, sentences and isolated words using a close-talking microphone.

    ELRA-S0014-03EUROM1g German
    The multilingual European speech database. The first truly multilingual speech database produced in Europe. Over 60 speakers per language who pronounced numbers, sentences and isolated words using a close-talking microphone.

    ELRA-S0015EUROM1i
    The multilingual European speech database. The first truly multilingual speech database produced in Europe. Over 60 speakers per language who pronounced numbers, sentences and isolated words using a close-talking microphone.

    ELRA-S0016FRESCO: French Polyphone Database (SpeechDat(M)) DB1
    Phonetically rich sentences & application-oriented utterances such as keywords, digits, etc. French SpeechDat (Polyphone) database containing 35,000 utterances from 1,000 callers over the telephone in France.

    ELRA-S0017FRESCO: French Polyphone Database (SpeechDat(M)) DB2
    French (SpeechDat(M)) polyphone database.
    Phonetically rich sentences sub-set. See ELRA-S0016

    ELRA-S0018German Polyphone Database (SpeechDat(M)) DB1
    Phonetically rich sentences & application-oriented utterances such as keywords, digits, etc. German read and spontaneous speech from 1,000 speakers.

    ELRA-S0019German Polyphone Database (SpeechDat(M)) DB2
    German Polyphone Database (SpeechDat(M))
    Phonetically rich sentences sub-set. See ELRA-S0018

    ELRA-S0020GRONINGEN
    Over 20 hours of Dutch read speech material (short texts, short sentences, etc.), from 238 speakers.

    ELRA-S0021M2VTS Speaker Verification Database
    Multi Modal Verification for Teleservices and Security applications project. Multilingual database designed to facilitate access control using multimodal identification of human faces (speech & image).

    ELRA-S0022Onomastica
    Onomastica Multi-Language Pronunciation Dictionaries covering city & town names, street names, family names, first names, product names, for 11 European languages. Only German is available now.

    ELRA-S0023PHONDAT 1 - PD1 (2nd edition)
    Read speech from 201 German speakers who read 450 different sentences each. Eight of them read the whole sentence corpus.

    ELRA-S0024PHONDAT 2 - PD2 (2nd edition)
    200 different sentences from a train inquiry task read by 16 German speakers, provided with phonological segmentation by hand plus other labelling.

    ELRA-S0025SIEMENS 100 - SI100
    Approx. 100 sentences extracted from the German newspaper Süddeutsche Zeitung and read by 101 speakers.

    ELRA-S0026SIEMENS 1000 - SI1000
    Approx. 1,000 sentences extracted from the German newspaper Süddeutsche Zeitung and read by 10 speakers.

    ELRA-S0027SieTill (Siemens Tillman)
    Telephone speech database with 730 speakers (338 female, 392 male), and 36,000 utterances (digit sequences, dates, spelled names, ...).

    ELRA-S0028The "SIVA" Speech Database for Speaker Verification and Identification
    Speech Database for Speaker Verification and Identification. Over 2,000 calls in Italian, collected over the fixed telephone network.

    ELRA-S0029Strange Corpus 1 - SC1 (ACCENTS)
    'Nordwind und Sonne' story read by 72 speakers with a foreign accent and 16 native German speakers.

    ELRA-S0030-01Swiss-French Polyphone Database 1000 speakers
    This speech database contains the recordings of 1,000 speakers who answered around 10 questions leading to spontaneous speech, and read about 28 items from a form supplied by IDIAP.

    ELRA-S0030-02Swiss-French Polyphone Database 4000 speakers
    This speech database contains the recordings of 4,000 speakers who answered around 10 questions leading to spontaneous speech, and read about 28 items from a form supplied by IDIAP.

    ELRA-S0031TED Translanguage English Database
    Translanguage English Database.
    Recordings of 188 oral presentations in English, given at Eurospeech'93 in Berlin (high percentage of non-native English speakers).

    ELRA-S0032TEDphone (Polyphone-like Translanguage English Database)
    TEDPhone:  Polyphone/SpeechDat-like recordings of 64 speakers in English and in their native language.

    ELRA-S0033BDBRUIT
    Recordings of French speech, corrupted with perturbations due to noisy environments, especially the Lombard effect. 5 male and 5 female speakers uttered sentences, digits, etc.

    ELRA-S0034-01VERBMOBIL - VM CD 1.0.3 (original edition)
    Spontaneous speech databases recorded in a dialogue task.
    63 Dialogues, 209 Appointments, 1840 Turns.
    1 CDROM.

    ELRA-S0034-02VERBMOBIL - VM CD 1.1 (new edition)
    Spontaneous speech databases recorded in a dialogue task.
    63 Dialogues 209 Appointments, 1840 Turns. This new edition contains the transliterations of all dialogues, signal files with PhonDat 2 Header structure, software and speaker documentations. All files were validated according to BAS guidelines.
    1 CDROM

    ELRA-S0034-03VERBMOBIL - VM CD 2.0 (original edition)
    Spontaneous speech databases recorded in a dialogue task.
    81 Dialogues 227 Appointments, 1538 Turns.
    1 CDROM.

    ELRA-S0034-04VERBMOBIL - VM CD 2.1 (new edition)
    Spontaneous speech databases recorded in a dialogue task.
    81 Dialogues 227 Appointments, 1538 Turns. This new edition contains the transliterations of all dialogues, signal files with PhonDat 2 Header structure, software and speaker documentations. All files were validated according to BAS guidelines.
    1 CDROM.

    ELRA-S0034-05VERBMOBIL - VM CD 3.0 (original edition)
    Spontaneous speech databases recorded in a dialogue task.
    45 Dialogues 184 Appointments, 1214 Turns.
    1 CDROM.

    ELRA-S0034-06VERBMOBIL - VM CD 3.1 (new edition)
    Spontaneous speech databases recorded in a dialogue task.
    45 Dialogues 184 Appointments, 1214 Turns. This new edition contains the transliterations of all dialogues, signal files with PhonDat 2 Header structure, software and speaker documentations. All files were validated according to BAS guidelines.
    1 CDROM.

    ELRA-S0034-07VERBMOBIL - VM CD 4.0 (original edition)
    Spontaneous speech databases recorded in a dialogue task.
    72 Dialogues, 181 Appointments, 1,588 Turns.
    1 CDROM.

    ELRA-S0034-08VERBMOBIL - VM CD 4.1 (new edition)
    Spontaneous speech databases recorded in a dialogue task.
    72 Dialogues, 181 Appointments, 1,588 Turns. This new edition contains the transliterations of all dialogues, signal files with PhonDat 2 Header structure, software and speaker documentation and partitur files. All files were validated according to BAS guidelines.
    1 CDROM.

    ELRA-S0034-09VERBMOBIL - VM CD 5.0 (original edition)
    Spontaneous speech databases recorded in a dialogue task.
    101 Dialogues, 256 Appointments, 2,154 Turns.
    1 CDROM.

    ELRA-S0034-10VERBMOBIL - VM CD 5.1 (new edition)
    Spontaneous speech databases recorded in a dialogue task.
    101 Dialogues, 256 Appointments, 2,154 Turns. This new edition contains the transliterations of all dialogues, signal files with PhonDat 2 Header structure, software and speaker documentation and partitur files. All files were validated according to BAS guidelines.
    1 CDROM.

    ELRA-S0034-11VERBMOBIL - VM CD 6.0 (original edition)
    Spontaneous speech databases recorded in a dialogue task.
    146 Dialogues, 191 Appointments, 1,828 Turns.
    1 CDROM.

    ELRA-S0034-12VERBMOBIL - VM CD 6.1 (new edition)
    Spontaneous speech databases recorded in a dialogue task.
    146 Dialogues, 191 Appointments 1,828 Turns. This new edition contains the transliterations of all dialogues, signal files with PhonDat 1 Header structure, software and speaker documentation. All files were validated according to BAS guidelines.
    1 CDROM.

    ELRA-S0034-13VERBMOBIL - VM CD 7.0 (original edition)
    Spontaneous speech databases recorded in a dialogue task.
    68 Dialogues, 238 Appointments, 1,739 Turns.
    1 CDROM.

    ELRA-S0034-14VERBMOBIL - VM CD 7.1 (new edition)
    Spontaneous speech databases recorded in a dialogue task.
    68 Dialogues, 238 Appointments, 1,739 Turns. This new edition contains the transliterations of all dialogues, signal files with PhonDat 2 Header structure, software and speaker documentation and partitur files. All files were validated according to BAS guidelines.
    1 CDROM.

    ELRA-S0034-16VERBMOBIL - VM CD 8.1 (new edition)
    Spontaneous speech databases recorded in a dialogue task.
    167 Dialogues, 167 Appointments, 1,181 Turns. This new edition contains the transliterations of all dialogues, signal files with PhonDat 1 Header structure, software and speaker documentation. All files were validated according to BAS guidelines.
    1 CDROM.

    ELRA-S0034-17VERBMOBIL - VM CD 12.0 (original edition)
    Spontaneous speech databases recorded in a dialogue task.
    207 Dialogues, 207 Appointments, 2,154 Turns.
    1 CDROM.

    ELRA-S0034-18VERBMOBIL - VM CD 12.1 (new edition)
    Spontaneous speech databases recorded in a dialogue task.
    207 Dialogues, 207 Appointments, 2,154 Turns. This new edition contains the transliterations of all dialogues, signal files with PhonDat 2 Header structure, software and speaker documentation and partitur files. All files were validated according to BAS guidelines.
    1 CDROM.

    ELRA-S0034-20VERBMOBIL - VM CD 13.1 (new edition)
    Spontaneous speech databases recorded in a dialogue task.
    90 speakers, 1714 turns, 200 spontaneous dialogues, transliteration.
    1 CDROM.

    ELRA-S0034-21VERBMOBIL - VM CD 14.0 (original edition)
    Spontaneous speech databases recorded in a dialogue task.
    97 speakers, 1891 turns, 156 spontaneous dialogues, transliteration.
    1 CDROM.

    ELRA-S0034-22VERBMOBIL - VM CD 14.1 (new edition)
    Spontaneous speech databases recorded in a dialogue task.
    97 speakers, 1891 turns, 156 spontaneous dialogues, transliteration, PhonDat 2 headers, partitur files.
    1 CDROM.

    ELRA-S0034-23VERBMOBIL - VM CD 16.0 (new edition)
    Spontaneous speech databases recorded in a dialogue task.
    78 speakers, 3311 turns, 200 spontaneous dialogues, transliteration (Kanji/Kana and Roman/Latin).

    ELRA-S0034-24VERBMOBIL - VM CD 17.0 (new edition)
    Spontaneous speech databases recorded in a dialogue task.
    84 speakers, 2741 turns, 200 spontaneous dialogues, transliteration (Kanji/Kana and Roman/Latin).
    1 CDROM

    ELRA-S0034-25VERBMOBIL - VM CD 18.0 (new edition)
    Spontaneous speech databases recorded in a dialogue task.
    80 speakers, 2345 turns, 200 spontaneous dialogues, transliteration (Kanji/Kana and Roman/Latin).
    1 CDROM

    ELRA-S0034-26VERBMOBIL - VM CD 19.0 (new edition)
    Spontaneous speech databases recorded in a dialogue task.
    82 speakers, 2911 turns, 200 spontaneous dialogues, transliteration (Kanji/Kana and Roman/Latin).
    1 CDROM.

    ELRA-S0034-27VERBMOBIL - VM CD S 1.0 (original edition)
    Spontaneous speech databases recorded in a dialogue task.
    26 Free Dialogues (with overlap, stereo recordings), 2227 Turns.
    1 CDROM.

    ELRA-S0034-28VERBMOBIL II - VM CD15.1 - VM15.1 (new edition)
    Spontaneous speech databases recorded in a dialogue task.
    Verbmobil II - German - 19 spontaneous dialogues (19 close mic, 19 room mic, 19 telephone (fixed network, GSM)), 3117 turns, transliteration (VM II format), NIST headers, partitur files.
    1 CDROM.

    ELRA-S0034-29VERBMOBIL II - VM CD20.1 - VM20.1 (new edition)
    Spontaneous speech databases recorded in a dialogue task.
    Verbmobil II - German - 30 spontaneous dialogues (10 close mic, 27 room mic, 10 phone line (GSM)), 1957 turns, transliteration (VM II format), NIST headers, partitur files.
    1 CDROM.

    ELRA-S0034-30VERBMOBIL II - VM CD21.1 - VM21.1 (new edition)
    Verbmobil II - German - 38 spontaneous dialogues (38 close mic, 2 room mic, 22 phone line (GSM)), 2331 turns, transliteration (VM II format), NIST headers, partitur files.
    1 CDROM.

    ELRA-S0034-31VERBMOBIL II - VM CD 22.1 - VM22.1 (BAS edition)
    Spontaneous speech databases recorded in a dialogue task.
    Verbmobil II - German - 60 spontaneous dialogues (28 close mic, 5 room mic, 27 phone line (GSM) recordings), 2004 turns, transliteration (Verbmobil II Format).
    1 CDROM.

    ELRA-S0034-32VERBMOBIL II - VM CD 23.1 - VM23.1 (BAS edition)
    Spontaneous speech databases recorded in a dialogue task.
    Verbmobil II - American English - 28 spontaneous dialogues (28 close mic, 0 room mic, 0 phone line (fixed network, GSM) recordings), 2727 turns, transliteration (Verbmobil II Format).
    1 CDROM.

    ELRA-S0034-33VERBMOBIL II - VM CD 24.1 - VM24.1 (BAS edition)
    Spontaneous speech databases recorded in a dialogue task.
    Verbmobil II - German - 58 spontaneous dialogues (36 close mic, 0 room mic, 22 phone line (GSM) recordings), 2231 turns, transliteration (Verbmobil II Format).
    1 CDROM.

    ELRA-S0034-34VERBMOBIL II - VM CD 25.1 - VM25.1 (BAS edition)
    Spontaneous speech databases recorded in a dialogue task.
    Verbmobil II - Japanese - 10 spontaneous dialogues (10 close mic, 0 room mic, 0 phone line (GSM) recordings), 1654 turns, transliteration (Verbmobil II Format).
    1 CDROM.

    ELRA-S0034-35VERBMOBIL II - VM CD 26.1 - VM26.1 (BAS edition)
    Spontaneous speech databases recorded in a dialogue task.
    Verbmobil II - Japanese - 16 spontaneous dialogues (16 close mic, 0 room mic, 0 phone line (GSM) recordings), 1319 turns, transliteration (Verbmobil II Format).
    1 CDROM.

    ELRA-S0034-36VERBMOBIL II - VM CD 27.1 - VM27.1 (BAS edition)
    Spontaneous speech databases recorded in a dialogue task.
    Verbmobil II - Japanese - 24 spontaneous dialogues (24 close mic, 0 room mic, 0 phone line (GSM) recordings), 1149 turns, transliteration (Verbmobil II Format).
    1 CDROM.

    ELRA-S0034-37VERBMOBIL II - VM CD 28.1 - VM28.1 (BAS edition)
    Spontaneous speech databases recorded in a dialogue task.
    Verbmobil II - American English - 28 spontaneous dialogues (28 close mic, 0 room mic, 0 phone line (fixed network, GSM) recordings), 2408 turns, transliteration (Verbmobil II Format).
    1 CDROM.

    ELRA-S0034-38VERBMOBIL II - VM CD 30.1 - VM30.1 (BAS edition)
    Spontaneous speech databases recorded in a dialogue task.
    Verbmobil II - German - 33 spontaneous dialogues (33 close mic, 21 room mic, 25 phone line (fixed network, GSM) recordings), 4176 turns, transliteration (Verbmobil II Format).
    1 CDROM.

    ELRA-S0034-39VERBMOBIL II - VM CD 31.1 - VM31.1 (BAS edition)
    Spontaneous speech databases recorded in a dialogue task.
    Verbmobil II - American English - 32 spontaneous dialogues (32 close mic, 0 room mic, 0 phone line (fixed network, GSM) recordings), 2512 turns, transliteration (Verbmobil II Format).
    1 CDROM.

    ELRA-S0034-40VERBMOBIL II - VM CD 32.1 - VM32.1 (BAS edition)
    Spontaneous speech databases recorded in a dialogue task.
    Verbmobil II - Multilingual - 17 spontaneous dialogues (17 close mic, 0 room mic, 0 phone line (fixed network, GSM) recordings), 992 turns, transliteration (Verbmobil II Format).
    1 CDROM.

    ELRA-S0034-41VERBMOBIL II - VM CD 33.1 - VM33.1 (BAS edition)
    Spontaneous speech databases recorded in a dialogue task.
    Verbmobil II - Japanese, 25 spontaneous dialogues (25 close mic, 0 room mic, 0 phone line (fixed network, GSM) recordings), 1050 turns, transliteration (Verbmobil II Format).
    1 CDROM.

    ELRA-S0034-42VERBMOBIL II - VM CD 34.1 - VM34.1 (BAS edition)
    Spontaneous speech databases recorded in a dialogue task.
    Verbmobil II - Japanese, 28 spontaneous dialogues (28 close mic, 0 room mic, 0 phone line (fixed network, GSM) recordings), 1437 turns, transliteration (Verbmobil II Format).
    1 CDROM.

    ELRA-S0034-43VERBMOBIL II - VM CD 35.1 - VM35.1 (BAS edition)
    Spontaneous speech databases recorded in a dialogue task.
    Verbmobil II - Japanese, 27 spontaneous dialogues (27 close mic, 0 room mic, 0 phone line (fixed network, GSM) recordings), 1645 turns, transliteration (Verbmobil II Format).
    1 CDROM.

    ELRA-S0034-44VERBMOBIL II - VM CD 38.1 - VM38.1 (BAS edition)
    Spontaneous speech databases recorded in a dialogue task.
    Verbmobil II - German, 33 spontaneous dialogues (33 close mic, 28 room mic, 28 phone line (fixed network, GSM) recordings), 5115 turns, transliteration (Verbmobil II Format).
    1 CDROM.

    ELRA-S0034-45VERBMOBIL II - VM CD 39.1 - VM39.1 (BAS edition)
    Spontaneous speech databases recorded in a dialogue task.
    Verbmobil II - German, 28 spontaneous dialogues (28 close mic, 17 room mic, 20 phone line (fixed network, GSM) recordings), 3360 turns, transliteration (Verbmobil II Format).
    1 CDROM.

    ELRA-S0034-46VERBMOBIL II - VM CD 29.1 - VM29.1 (BAS edition)
    Spontaneous speech databases recorded in a dialogue task.
    Verbmobil II - German, 25 spontaneous dialogues (25 close mic, 20 room mic, 20 phone line (fixed network, GSM) recordings), 2708 turns, transliteration (Verbmobil II Format).
    1 CDROM.

    ELRA-S0034-47VERBMOBIL II - VM CD 42.1 - VM42.1 (BAS edition)
    Spontaneous speech databases recorded in a dialogue task.
    Verbmobil II - American English, 20 spontaneous dialogues (20 close mic, 0 room mic, 0 phone line (fixed network, GSM) recordings), 1874 turns, transliteration (Verbmobil II Format).
    1 CDROM.

    ELRA-S0034-48VERBMOBIL II - VM CD 43.1 - VM43.1 (BAS edition)
    Spontaneous speech databases recorded in a dialogue task.
    Verbmobil II - Japanese - 24 spontaneous dialogues (24 close mic, 0 room mic, 0 phone line (GSM) recordings), 1149 turns, transliteration (Verbmobil II Format).
    1 CDROM.

    ELRA-S0034-49VERBMOBIL II - VM CD 49.1 - VM49.1 (BAS edition)
    Spontaneous speech databases recorded in a dialogue task.
    Verbmobil II - German, 24 spontaneous dialogues (24 close mic, 12 room mic, 12 phone line (fixed network, GSM) recordings), 2597 turns, transliteration (Verbmobil II Format).
    1 CDROM.

    ELRA-S0034-50VERBMOBIL II - VM CD 50.1 - VM50.1 (BAS edition)
    Spontaneous speech databases recorded in a dialogue task.
    Verbmobil II - American-English, 8 spontaneous dialogues (8 close mic, 0 room mic, 0 phone line (fixed network, GSM) recordings), 679 turns, transliteration (Verbmobil II Format).
    1 CDROM.

    ELRA-S0034-51VERBMOBIL II - VM CD 48.1 - VM48.1 (BAS edition)
    Spontaneous speech databases recorded in a dialogue task.
    Verbmobil II - German, 28 spontaneous dialogues (28 close mic, 23 room mic, 27 phone line (fixed network, GSM) recordings), 4238 turns, transliteration (Verbmobil II Format).
    1 CDROM.

    ELRA-S0034-52VERBMOBIL II - VM CD 44.1 - VM44.1 (BAS edition)
    Spontaneous speech databases recorded in a dialogue task.
    Verbmobil II - Japanese, 19 spontaneous dialogues (19 close mic, 0 room mic, 0 phone line (fixed network, GSM) recordings), 920 turns, transliteration (Verbmobil II Format).
    1 CDROM.

    ELRA-S0034-53VERBMOBIL II - VM CD 45.1 - VM45.1 (BAS edition)
    Spontaneous speech databases recorded in a dialogue task.
    Verbmobil II - Japanese, 21 spontaneous dialogues (21 close mic, 0 room mic, 0 phone line (fixed network, GSM) recordings), 1293 turns, transliteration (Verbmobil II Format).
    1 CDROM.

    ELRA-S0034-54VERBMOBIL II - VM CD 46.1 - VM46.1 (BAS edition)
    Spontaneous speech databases recorded in a dialogue task.
    Verbmobil II - Multilingual Japanese/German, 11 spontaneous dialogues (11 close mic, 0 room mic, 0 phone line (fixed network, GSM) recordings), 607 turns, transliteration (Verbmobil II Format).
    1 CDROM.

    ELRA-S0034-55VERBMOBIL II - VM CD 47.1 - VM47.1 (BAS edition)
    Spontaneous speech databases recorded in a dialogue task.
    Verbmobil II - Multilingual with human interpreter (3 channels) English/German, 17 spontaneous dialogues (17 close mic, 0 room mic, 0 phone line (fixed network, GSM) recordings), 853 turns, transliteration (Verbmobil II Format).
    1 CDROM.

    ELRA-S0034-56VERBMOBIL II - VM Bonus CD - VMBONUS (BAS edition)
    Additional data and documentation that is not included in the regular VM volumes.
    1 CD-ROM.

    ELRA-S0034-57VERBMOBIL II - VM Lexicon database - VMLEX (BAS edition)
    Verbmobil lexicon database of the University of Bielefeld.

    ELRA-S0034-58VERBMOBIL II - VM CD 15.1 - VM15.1 (new edition)
    Spontaneous speech databases recorded in a dialogue task.
    Verbmobil II - German - 19 spontaneous dialogues (19 close mic, 19 room mic, 19 phone line (GSM)), 3117 turns, transliteration (VM II format), NIST headers, partitur files.
    1 CDROM.

    ELRA-S0034-59VERBMOBIL II - VM CD 16.1 - VM16.1 (new edition)
    Spontaneous speech databases recorded in a dialogue task.
    Verbmobil II - Japanese, 200 dialogues, 200 appointment schedulings - 3311 turns.
    1 CDROM.

    ELRA-S0034-60VERBMOBIL II - VM CD 17.1 - VM17.1 (new edition)
    Spontaneous speech databases recorded in a dialogue task.
    Verbmobil II - Japanese, 200 dialogues, 200 appointment schedulings - 2741 turns.
    1 CDROM.

    ELRA-S0034-61VERBMOBIL II - VM CD 18.1 - VM18.1 (new edition)
    Spontaneous speech databases recorded in a dialogue task.
    Japanese, 200 dialogues, 200 appointment schedulings - 2345 turns.
    1 CDROM.

    ELRA-S0034-62VERBMOBIL II - VM CD 19.1 - VM19.1 (new edition)
    Spontaneous speech databases recorded in a dialogue task.
    Japanese, 200 dialogues, 200 appointment schedulings - 2911 turns.
    1 CDROM.

    ELRA-S0034-63VERBMOBIL II - VM CD 53.1 - VM53.1 (BAS edition)
    Spontaneous speech databases recorded in a dialogue task.
    German, 16 spontaneous dialogues (16 close mic, 8 room mic, 8 phone line (GSM) recordings) - 1771 turns, transliteration (VM II Format).
    1 CDROM.

    ELRA-S0034-64VERBMOBIL II - VM CD 60.1 - VM60.1 (BAS edition)
    Spontaneous speech databases recorded in a dialogue task.
    Japanese - 10 spontaneous dialogues (10 close mic, 0 room mic, 0 phone line (GSM) recordings) - 501 turns, transliteration (VM II Format).
    1 CDROM.

    ELRA-S0034-65VERBMOBIL II - VM CD 61.1 - VM61.1 (BAS edition)
    Spontaneous speech databases recorded in a dialogue task.
    Japanese - 19 spontaneous dialogues (19 close mic, 0 room mic, 0 phone line (GSM) recordings) - 946 turns, transliteration (VM II Format).
    1 CDROM.

    ELRA-S0034-66VERBMOBIL II - VM CD 62.1 - VM62.1 (BAS edition)
    Spontaneous speech databases recorded in a dialogue task.
    Japanese - 21 spontaneous dialogues (21 close mic, 0 room mic, 0 phone line (GSM) recordings) - 981 turns, transliteration (VM II Format).
    1 CDROM.

    ELRA-S0034-67VERBMOBIL II - VM CD 51.1 - VM51.1 (BAS edition)
    Spontaneous speech databases recorded in a dialogue task.
    Multilingual German/English with human interpreter (3 channels) - 15 spontaneous dialogues (15 close mic, 0 room mic, 0 phone line (fixed network, GSM) recordings) - 856 turns, transliteration (VM II Format).
    1 CDROM.

    ELRA-S0034-68VERBMOBIL II - VM CD 52.1 - VM52.1 (BAS edition)
    Spontaneous speech databases recorded in a dialogue task.
    Multilingual German/English with human interpreter (3 channels) - 13 spontaneous dialogues (13 close mic, 0 room mic, 0 phone line (fixed network, GSM) recordings) - 728 turns, transliteration (VM II Format).
    1 CDROM.

    ELRA-S0034-69VERBMOBIL II - VM CD 55.1 - VM55.1 (BAS edition)
    Spontaneous speech databases recorded in a dialogue task.
    Multilingual German/English with human interpreter (3 channels) - 11 spontaneous dialogues (11 close mic, 0 room mic, 0 phone line (fixed network, GSM) recordings) - 518 turns, transliteration (VM II Format).
    1 CDROM.

    ELRA-S0034-70VERBMOBIL II - VM CD 56.1 - VM56.1 (BAS edition)
    Spontaneous speech databases recorded in a dialogue task.
    Multilingual German/English with human interpreter (3 channels) - 12 spontaneous dialogues (12 close mic, 0 room mic, 0 phone line (fixed network, GSM) recordings) - 620 turns, transliteration (VM II Format).
    1 CDROM.

    ELRA-S0034-71VERBMOBIL II - VM CD 57.1 - VM57.1 (BAS edition)
    Spontaneous speech databases recorded in a dialogue task.
    Multilingual German/Japanese with 2 human interpreters (4 channels) - 11 spontaneous dialogues (11 close mic, 0 room mic, 0 phone line (fixed network, GSM) recordings) - 702 turns, transliteration (VM II Format).
    1 CDROM.

    ELRA-S0034-72VERBMOBIL II - VM CD 58.1 - VM58.1 (BAS edition)
    Spontaneous speech databases recorded in a dialogue task.
    Multilingual German/Japanese with 2 human interpreters (4 channels) - 7 spontaneous dialogues (7 close mic, 0 room mic, 0 phone line (fixed network, GSM) recordings) - 421 turns, transliteration (VM II Format).
    1 CDROM.

    ELRA-S0034-73VERBMOBIL II - VM CD 59.1 - VM59.1 (BAS edition)
    Spontaneous speech databases recorded in a dialogue task.
    Multilingual German/Japanese with 2 human interpreters (4 channels) - 7 spontaneous dialogues (7 close mic, 0 room mic, 0 phone line (fixed network, GSM) recordings) - 354 turns, transliteration (VM II Format).
    1 CDROM.

    ELRA-S0034-74VERBMOBIL II - VM CD 63.0 - VM63.0 (original edition)
    Spontaneous speech databases recorded in a dialogue task.
    German - 14 WOZ dialogues designed to evoke emotions (mainly anger) - transliteration, emotion labeling.
    1 CDROM.

    ELRA-S0034-75VERBMOBIL II - VM CD 64.0 - VM64.0 (original edition)
    Spontaneous speech databases recorded in a dialogue task.
    German - 13 WOZ dialogues designed to evoke emotions (mainly anger) - transliteration, emotion labeling.
    1 CDROM.

    ELRA-S0034-76VERBMOBIL II - VM CD 65.0 - VM65.0 (original edition)
    Spontaneous speech databases recorded in a dialogue task.
    German - 13 WOZ dialogues designed to evoke emotions (mainly anger) - transliteration, emotion labeling.
    1 CDROM.

    ELRA-S0035PHONOLEX (BAS/DFKI)
    Approx. 1.6 million entries with orthographic forms (capital nouns, old German, spelling, ...), phonetic transcription (by rules and exception list) and other linguistic information (e.g. grammatical categories).

    ELRA-S0038Siemens VoiceMail
    This speech database contains the recordings of 921 American speakers recorded over the fixed telephone network. It consists of read acoustic speech divided into 9.5 hours of transliterated speech and 8 hours of non-transliterated speech. Orthographic transliteration for about 25,000 utterances is included.

    ELRA-S0039APASCI
    Italian acoustic database recorded in an insulated room. It includes ca. 16,090 utterances and digits, 58,924 words (2,191 different words), 641 minutes of speech. The data is uttered by 50 male and 50 female speakers. 42 male and 12 female speakers repeated 20 times 10 isolated digits.

    ELRA-S0040Danish SpeechDat(M) database - DB1
    Phonetically rich sentences & application oriented utterances such as keywords, digits, etc.
    This speech database contains the recordings of 1,523 Danish speakers, recorded over the Danish fixed telephone network. Each speaker uttered around 100 read and spontaneous items.

    ELRA-S0041Danish SpeechDat(M) database - DB2
    Phonetically rich sentences sub-set.
    See ELRA-S0040

    ELRA-S0042POLYCOST
    This large speech corpus of English spoken by foreigners contains the recordings of 133 speakers recorded over the fixed telephone network. A total of 1,285 calls have been recorded (10 sessions per speaker). Each speaker uttered prompted items in English.

    ELRA-S0043ONOMASTICA-COPERNICUS DATABASE
    COP-58 project (EU Copernicus Programme). A collection of 1,783,390 transcriptions of 1,705,653 Eastern and Central European proper and place names.

    ELRA-S0044SPINA Corpus ("Robots Commands")
    10 sentences and 62 commands from the robot control domain spoken by 22 speakers in 5 versions, phonological segmentation (words), word segmentation (sentences).

    ELRA-S0045German Pronunciation Rules Set - PHONRUL 9.0
    This set of computer-readable rules describes the most common known effects in German pronunciation if deviating from the so-called canonical or citation form of words. The knowledge of this rule set was derived from empirical analysis of speech corpora and a multitude of publications about German phonetics.

    ELRA-S0046PolyVar
    PolyVar is a speaker verification database. It consists of 143 speakers with 3600 recorded sessions. Not all speakers recorded the same number of sessions. The format in use is NIST (a-law).
    See also ELRA-S0047.

    ELRA-S0047SpeechDat Speaker Verification database
    This subset of PolyVar (cf. ELRA-S0046) consists of 20 speakers who recorded 50 sessions. The format in use is SAM (a-law).

    ELRA-S0048SIelex (Siemens Phonetic lexicon)
    186,600 entries, including proper names, place names, non-native entries and abbreviations, with phonetic transcriptions, main stress markers and syllable boundary markers, from the political and economic sections of the German newspapers 'Süddeutsche Zeitung' and 'Frankfurter Allgemeine Zeitung'.

    ELRA-S0049SPK
    SPK is an Italian speech database of isolated digits acquired from 100 speakers (30 females and 70 males, from 23 to 50 years old).

    ELRA-S0050Russian Speech Database
    Russian read speech of 89 different speakers (54 male, 45 female), including 70 speakers with 15 sessions or more, 10 speakers with 10 sessions or more, and 9 speakers with fewer than 10 sessions, recorded through a 16-bit Vibra-16 Creative Labs sound card.

    ELRA-S0051German SpeechDat(II) FDB-1000
    988 calls made within the SpeechDat(II) project. Examples of items are: isolated and connected digits, telephone number, credit card number, PIN code, natural numbers, money amounts, spelled words, time of day, time phrase, dates, yes/no questions, common application words. All application words are recorded more than 80 times. These are: 1 application word phrase, 9 phonetically rich sentences (read), 4 phonetically rich words (read), 5 directory assistance names.

    ELRA-S0052FIXED0IT - DB1
    Phonetically rich sentences & application oriented utterances such as keywords, digits, etc. Italian SpeechDat(M) (Polyphone) database from 1,000 callers over the telephone in Italy.

    ELRA-S0053FIXED0IT - DB2
    Phonetically rich sentences sub-set.
    See ELRA-S0052

    ELRA-S0054Siemens Chile Spanish FDB-250
    This speech corpus contains the recordings of 507 speakers. Each speaker uttered a total of 24 utterances in Spanish as spoken in Chile. It consists of read speech, including digits and application words for teleservices. They were recorded over the fixed telephone network, through an ISDN card.

    ELRA-S0055Siemens Russian SpeechDat-like FDB-1000
    Russian read and spontaneous speech, recorded through an ISDN card, and validated according to the SpeechDat(II) database exchange format. The whole database consists of 72 hours of speech, with 49 prompted utterances recorded by 1000 speakers (500 male, 500 female).

    ELRA-S0056Slovenian SpeechDat(II) FDB-1000
    Read and spontaneous speech, recorded through an ISDN card, and validated according to the SpeechDat(II) database exchange format. The corpus includes 1000 speakers (500 male, 500 female) who called over the Slovenian fixed network.

    ELRA-S0057Siemens Shanghai Mandarin FDB-1000
    Mandarin data, as spoken in Shanghai as a first or second language. The corpus consists of read speech, including digits and application words for teleservices, recorded through an ISDN card. A total of 70 utterances was prompted by 1000 speakers (500 male, 500 female).

    ELRA-S0058RVG1 (Regional Variants of German 1, Part 1)
    The corpus consists of single digits, connected digits, phone numbers, phonetically balanced sentences, computer command phrases and spontaneous speech. Each of the 498 speakers has read a subcorpus of 85 items: RVG1, Part 1 contains 498 speakers recorded through low quality microphones. RVG1, Part 2, contains 421 speakers recorded through high quality microphones.

    ELRA-S0059ILE: Italian LExicon
    ILE is a 588,000-entry Italian lexicon transcribed with SAMPA notation. The morpho-lexicon was obtained by processing an Italian dictionary, and adding by hand all possible inflections. The base lexicon is enriched with names and neologisms found in the 65,000 most frequent words of the newspaper "Il Sole 24 Ore", and the most frequent Italian proper names and surnames (from the telephone directory), geographical names, acronyms, company names, commonly used foreign words. A total of about 601,000 different transcriptions are provided for the 588,000-word lexicon.

    ELRA-S0060MULTEXT Prosodic database
    This database comprises one CD-ROM for each of five languages (French, English, Italian, German and Spanish), totalling 4 hours and 20 minutes of speech and involving 50 different speakers (5 male and 5 female per language). The recordings on which the corpus is based consist of passages of about five sentences extracted from the EUROM.1 speech corpus ("Esprit 2589 project Multi-lingual Speech Input/output Assessment, Methodology and Standardisation").

    ELRA-S0061French Speechdat(II) FDB-1000 (Matra Nortel Communications)
    This French telephone speech database is designed for development and assessment of French speech recognizers. It contains 48 utterances (40 mandatory and 8 optional items) for 1,017 different speakers, collected over the fixed telephone network.

    ELRA-S0062Fixed1it Design
    Textual material used within the Italian SpeechDat(M) and SpeechDat(II) Databases. It contains prompted text read by speakers in the supplied sheet, orthographic transcription, statistics, lexicon. The CD-ROM does not contain any recordings.

    ELRA-S0063German SpeechDat(II) FDB-4000
    Read and spontaneous speech, recorded through an ISDN card, and validated according to the SpeechDat(II) database exchange format. The corpus includes 4000 speakers who called over the German fixed network.

    ELRA-S0064Colombian Spanish Speech Database
    This database contains telephone recordings from 1065 speakers (563 male speakers and 502 female speakers) recorded directly over the fixed telephone network.

    ELRA-S0065Spanish SpeechDat(M) - DB1
    Phonetically rich sentences & application oriented utterances such as keywords, digits, etc.
    This database is comprised of telephone recordings from 1002 speakers (508 male speakers and 494 female speakers) recorded directly over the fixed telephone network.

    ELRA-S0066Spanish SpeechDat(M) - DB2
    Phonetically rich sentences.
    Sub-set of ELRA-S0065

    ELRA-S0067BREF-120 - A large corpus of French read speech
    Large read-speech corpus containing over 100 hours of speech material, from 120 speakers (55 males and 65 females). The text materials were selected verbatim from extracts of the French newspaper "Le Monde".

    ELRA-S0068Portuguese SpeechDat(M) database
    This speech database contains the recordings of 1,001 speakers (453 males, 548 females) recorded over the Portuguese fixed telephone network. Each speaker uttered around 40 read and spontaneous items.

    ELRA-S0069Swedish SpeechDat(II) FDB-5000
    This speech database contains the recordings of 5,000 Swedish speakers recorded over the Swedish fixed telephone network. Each speaker uttered around 40 read and spontaneous items, and further items were recorded for speaker verification purposes and dialectal studies.

    ELRA-S0070Swedish SpeechDat(II) FDB-1000
    This speech database contains the recordings of 1,000 Swedish speakers recorded over the Swedish fixed telephone network. Each speaker uttered around 40 read and spontaneous items, and further items were recorded for speaker verification purposes and dialectal studies.

    ELRA-S0071Swedish SpeechDat(II) MDB-1000
    This speech database contains the recordings of 1,000 Swedish speakers recorded over the Swedish mobile telephone network. Each speaker uttered around 40 read and spontaneous items, and further items were recorded for speaker verification purposes and dialectal studies.

    ELRA-S0072Danish SpeechDat(II) FDB-1000
    This speech database contains the recordings of 1,000 Danish speakers recorded over the Danish fixed telephone network. Each speaker uttered around 40 read and spontaneous items.

    ELRA-S0073Danish SpeechDat(II) FDB-4000
    This speech database contains the recordings of 4,000 Danish speakers recorded over the Danish fixed telephone network. Each speaker uttered around 40 read and spontaneous items.

    ELRA-S0074British English SpeechDat(II) MDB-1000
    This speech database contains the recordings of 1,000 British speakers recorded over the British mobile telephone network. Each speaker uttered around 40 read and spontaneous items.

    ELRA-S0075Welsh SpeechDat(II) FDB-2000
    This speech database contains the recordings of 2,000 Welsh speakers recorded over the British fixed telephone network. Each speaker uttered around 40 read and spontaneous items.

    ELRA-S0076French Speechdat(II) FDB-5000 (Matra Nortel Communications)
    This database comprises 5040 French speakers recorded over the French fixed telephone network.

    ELRA-S0076French Speechdat(II) FDB-5000 database
    FIXED1FR:  This speech database contains the recordings of 5,040 French speakers recorded over the French fixed telephone network. Each speaker uttered around 50 read and spontaneous items.

    ELRA-S0077Telephone Speech Data Collection for Czech
    This database comprises telephone recordings from 1227 speakers recorded over the Czech fixed telephone network.

    ELRA-S0078Finnish Speechdat(II) FDB-1000
    This speech database contains the recordings of 1,000 Finnish speakers recorded over the Finnish fixed telephone network. Each speaker uttered around 40 read and spontaneous items.

    ELRA-S0079Finnish Speechdat(II) FDB-4000
    This speech database contains the recordings of 4,000 Finnish speakers recorded over the Finnish fixed telephone network. Each speaker uttered around 40 read and spontaneous items.

    ELRA-S0080Finnish-Swedish Speechdat(II) FDB-1000
    This speech database contains the recordings of 1,000 Finnish speakers recorded over the Finnish fixed telephone network. Each speaker uttered around 40 read and spontaneous items, in the variant of Swedish spoken in Finland.

    ELRA-S0081Norwegian SpeechDat(II) FDB-1000
    This speech database contains the recordings of 1,016 Norwegian speakers recorded over the Norwegian fixed telephone network. Each speaker uttered around 40 read and spontaneous items.

    ELRA-S0082Siemens Synthesis Corpus - SI1000P
    The SI1000P recordings were made to provide material for high quality concatenative speech synthesis. The corpus contains 1000 newspaper sentences (SI1000 newspaper corpus) read by two professional broadcasting announcers in studio quality together with the laryngographic signal and the glottal pulse stream. Parts of the corpus were labeled and segmented phonemically (SAM-PA) and prosodically (boundaries + accents).

    ELRA-S0083ISLE Speech Corpus
    Approx. 20 minutes of speech (per speaker) from 23 German and 23 Italian intermediate learners of English. Each speaker recorded sentences from several blocks of differing types (reading simple sentences, using minimal pairs, giving answers to multiple choice questions). About 2/3 of the data for each speaker was annotated by linguists. The files were corrected first at the word level, and an automatic recognizer was then used to produce phone-level annotations. The annotator then re-annotated each sentence to mark phone and stress errors (e.g., substitutions, insertions, or deletions).

    ELRA-S0084SALA Spanish Colombian Database
    This speech database contains the recordings of 1,000 Colombian speakers recorded over the Colombian fixed telephone network. Each speaker uttered around 40 read and spontaneous items.

    ELRA-S0085BABEL Bulgarian Database
    The BABEL Database is a speech database that was produced by a research consortium funded by the European Union under the COPERNICUS programme (COPERNICUS Project 1304).
    The Bulgarian database consists of:
    - the basic "common" set which contains the Many Talker Set (30 males, 30 females), Few Talker Set (5 males, 5 females), Very Few Talker Set (1 male, 1 female);
    - and the extension part: semi-spontaneous answers to questions: the answers were recorded by the 10 Few Talker Set speakers.

    ELRA-S0086BABEL Estonian Database
    The BABEL Database is a speech database that was produced by a research consortium funded by the European Union under the COPERNICUS programme (COPERNICUS Project 1304).
    The Estonian database consists of:
    - the basic "common" set which contains the Many Talker Set (30 males, 30 females), Few Talker Set (4 males, 4 females), Very Few Talker Set (1 male, 1 female);
    - and the extension part: a short description of the Estonian sound system.

    ELRA-S0087BABEL Hungarian Database
    The BABEL Database is a speech database that was produced by a research consortium funded by the European Union under the COPERNICUS programme (COPERNICUS Project 1304).
    The Hungarian database consists of:
    - the basic "common" set which contains the Many Talker Set (30 males, 30 females), Few Talker Set (4 males, 4 females), Very Few Talker Set (1 male, 1 female);
    - and the extension part: a short description of the Hungarian sound system.

    ELRA-S0088Twin database - TWINDB1
    The Twin database named TWINDB1 contains the recordings of 45 French speakers: 9 pairs of identical twins (8 males and 10 females) with similar voices, and 27 other speakers (13 males and 14 females) including 4 non-twin siblings.

    ELRA-S0089Albayzin corpus
    This corpus consists of 3 sub-corpora of 16 kHz, 16-bit signals, recorded by 304 Castilian speakers: Phonetic corpus, Geographic corpus, "Lombard" corpus.

    ELRA-S0090Polish SpeechDat(E) Database
    This speech database contains the recordings of 1,000 Polish speakers recorded over the Polish fixed telephone network. Each speaker uttered around 50 read and spontaneous items.

    ELRA-S0091Pronunciation lexicon of British place names, surnames and first names
    This pronunciation lexicon, produced with funding from ELRA in the framework of the European Commission project LRsP&P (Language Resources Production & Packaging - LE4-8335), is an SGML-encoded database. It contains 160,000 entries of British place-names, surnames and first names. All phonemic transcriptions in the database are based on the SAMPA phonetic alphabet.

    ELRA-S0092Portuguese SpeechDat(II) FDB-4000
    This speech database contains the recordings of 4,027 Portuguese speakers recorded over the Portuguese fixed telephone network. Each speaker uttered around 40 read and spontaneous items.

    ELRA-S0093IBNC - An Italian Broadcast News Corpus
    Produced with funding from ELRA in the framework of the European Commission project LRsP&P (Language Resources Production & Packaging - LE4-8335), the collection consists of 150 broadcast programs from the RAI, for a total time of about 30 hours, broadcast on 36 different days between 1992 and 1999. The audio is down-sampled to 16 kHz, 16 bit, and encoded into the NIST Sphere PCM format.

    ELRA-S0094Czech SpeechDat(E) Database
    FIXED3CS:  This speech database contains the recordings of 1,052 Czech speakers recorded over the Czech fixed telephone network. Each speaker uttered around 50 read and spontaneous items.

    ELRA-S0094Czech SpeechDat(E) Database (Matra Nortel Communications)
    The Czech SpeechDat(E) database comprises 1052 Czech speakers (526 males, 526 females) recorded over the Czech fixed telephone network.

    ELRA-S0095Slovak SpeechDat(E) Database (Matra Nortel Communications)
    The Slovak SpeechDat(E) database comprises 1000 Slovak speakers (498 males, 502 females) recorded over the Slovak fixed telephone network.

    ELRA-S0095Slovak SpeechDat(E) Database
    FIXED3SK:  This speech database contains the recordings of 1,000 Slovak speakers recorded over the Slovak fixed telephone network. Each speaker uttered around 50 read and spontaneous items.

    ELRA-S0096German SpeechDat(II) MDB-1000
    This speech database contains the recordings of 1,295 German speakers recorded over the German mobile telephone network. Each speaker uttered around 40 read and spontaneous items.

    ELRA-S0097British English SpeechDat(II) FDB-4000
    This speech database contains the recordings of 4,000 British speakers recorded over the British fixed telephone network. Each speaker uttered around 40 read and spontaneous items.

    ELRA-S0098British English SpeechDat(II) SDB-2400
    This speech database contains the recordings of 120 British speakers recorded over the British fixed and mobile telephone network. Each speaker called 20 times and uttered 22 items each time.

    ELRA-S0099Russian SpeechDat(E) Database
    This speech database contains the recordings of 2,500 Russian speakers recorded over the Russian fixed telephone network. Each speaker uttered around 50 read and spontaneous items.

    ELRA-S0100MHATLex
    Lexicon for written and spoken French including 440,000 inflected forms with spelling, contextual variants at morphological, phonological and phonetic levels, and morphosyntactic attributes.

    ELRA-S0101Spanish SpeechDat(II) FDB-1000
    This speech database contains the recordings of 1,000 Castilian Spanish speakers recorded over the Spanish fixed telephone network. Each speaker uttered around 40 read and spontaneous items.
    This database is a subset of the Spanish SpeechDat(II) FDB-4000 (ref. ELRA-S0102).

    ELRA-S0102Spanish SpeechDat(II) FDB-4000
    This speech database contains the recordings of 4,000 Castilian Spanish speakers recorded over the Spanish fixed telephone network. Each speaker uttered around 40 read and spontaneous items.
    This database includes the Spanish SpeechDat(II) FDB-1000 (ref. ELRA-S0101).

    ELRA-S0103Swiss-French SpeechDat(M)
    Phonetically rich sentences & application oriented utterances such as keywords, digits, etc.
    This speech database contains the recordings of Swiss-French speakers recorded over the fixed telephone network. Each speaker uttered around 40 read and spontaneous items.

    ELRA-S0104Swiss-French SpeechDat(II) FDB-3000
    This speech database contains the recordings of 3,000 Swiss-French speakers recorded over the Swiss fixed telephone network. Each speaker uttered around 40 read and spontaneous items.

    ELRA-S0105Swiss-German SpeechDat(II) FDB-2000
    This speech database contains the recordings of 2,000 Swiss-German speakers recorded over the Swiss fixed telephone network. Each speaker uttered around 40 read and spontaneous items.

    ELRA-S0106Dutch SpeechDat(II) MDB-250
    This speech database contains the recordings of 250 speakers recorded over the Dutch mobile telephone network. Each speaker uttered around 50 read and spontaneous items.

    ELRA-S0107Flemish SpeechDat(II) FDB-1000
    FIXED1FL:  This speech database contains the recordings of 1,023 Flemish speakers recorded directly over the Belgian fixed telephone network. Each speaker uttered around 50 items, which he/she repeated 5 times.

    ELRA-S0108Belgian-French SpeechDat(II) FDB-1000
    FIXED1BF:  This speech database contains the recordings of 1,011 Belgian-French speakers recorded over the Belgian fixed telephone network. Each speaker uttered around 50 items, which he/she repeated 2 times.

    ELRA-S0109Luxembourgish-French SpeechDat(II) FDB-500 database
    FIXED1LF:  This speech database contains the recordings of 614 Luxembourgish-French speakers recorded over the Luxembourgish fixed telephone network. Each speaker uttered around 50 items, which he/she repeated 3 times.

    ELRA-S0110Luxembourgish-German SpeechDat(II) FDB-500
    FIXED1LG:  This speech database contains the recordings of 560 Luxembourgish-German speakers recorded over the Luxembourgish fixed telephone network. Each speaker uttered around 50 items.

    ELRA-S0111Eleftherotypia Journal Speech database
    The Eleftherotypia Journal speech database consists of Greek read material. It includes the recordings of 120 speakers, male and female, for about 72 hours of speech material.

    ELRA-S0112Farsdat (Farsi Speech Database)
    The Persian Speech Database comprises the recordings of 300 native speakers, from 10 different dialect regions of Iran. 6000 utterances were segmented and labelled, including 386 phonetically balanced sentences.

    ELRA-S0113Spoken Dutch Corpus
    CGN:  The Spoken Dutch Corpus (CGN) is a database of contemporary Dutch as spoken by adults in the Netherlands and Flanders. The corpus contains approximately 9 million words, two thirds of which originate from the Netherlands and one third from Flanders.

    ELRA-S0114Strange Corpus 10 - SC10 ('Accents II')
    70 speakers (67 non-native, 3 native German speakers) - 1 dialogue, 1 re-telling of a German story - transliteration, orthography, canonical transcription.

    ELRA-S0115American English SpeechDat-Car
    The American English SpeechDat-Car database comprises recordings in a car of 314 speakers (150 males, 164 females), who uttered around 120 read and spontaneous items. Recordings have been made through 5 different channels, of which 4 were in-car microphones (1 close-talk microphone, 3 far-talk microphones) and 1 channel over the GSM network.

    ELRA-S0116Italian SpeechDat(II) MDB-250
    This speech database contains the recordings of 375 Italian speakers recorded over the Italian mobile telephone network. Each speaker uttered around 50 items.

    ELRA-S0117Italian SpeechDat(II) FDB-3000
    This speech database contains the recordings of 3,000 speakers recorded over the Italian fixed telephone network. Each speaker uttered around 40 read and spontaneous items.

    ELRA-S0118Greek SpeechDat(II) FDB-5000
    This speech database contains the recordings of 5,000 speakers recorded over the Greek fixed telephone network. Each speaker uttered around 50 read and spontaneous items.

    ELRA-S0119Spanish SpeechDat Database for the Mobile Telephone Network
    This speech database contains the recordings of 1,066 Spanish speakers who called from GSM telephones and were recorded over the fixed PSTN using an ISDN-BRI interface. Each speaker uttered around 50 read and spontaneous items.

    ELRA-S0120Translanguage English Database (TED) Transcripts database
    The Translanguage English Database (TED) Transcripts corpus contains transcriptions of thirty-nine of the 188 presentations contained in the TED corpus, made at Eurospeech'93 in Berlin. See also ELRA-S0031.

    ELRA-S0121Turkish Continuous and Isolated Word Speech Database
    The Turkish speech database contains 14 hours of read speech (1618 words) from 43 Turkish speakers (adults over 18; 22 males, 21 females).

    ELRA-S0122German SpeechDat-Car
    This speech database contains the recordings in a car of 338 speakers, who uttered around 120 read and spontaneous items. Recordings have been made through 5 different channels, of which 4 were in-car microphones (1 close-talk microphone, 3 far-talk microphones) and 1 channel over the GSM network.

    ELRA-S0123Basque Spoken Corpus, by Jon Aske (Department of Foreign Languages, Salem State College - Salem, Massachusetts, USA)
    The Basque spoken corpus is a collection of 42 narratives by native Basque speakers, who relate a silent movie they have just watched to someone else.

    ELRA-S0124Phonetically Balanced Words (1)
    This large acoustic corpus in Korean produced by Kaist Korterm consists of 452 Korean terms (known as eojeols) read by 70 speakers. Two more announcers read 2000 terms.

    ELRA-S0125Phonetically Balanced Words (2)
    This large acoustic corpus in Korean produced by Kaist Korterm consists of 36 geographical nouns read by Korean speakers.

    ELRA-S0126Phonetically Balanced Words (3)
    This large acoustic corpus in Korean produced by Kaist Korterm consists of a whole paragraph read by 70 Korean speakers. Two more announcers also read the same paragraph.

    ELRA-S0127Phonetically Balanced Words (4)
    This large acoustic corpus in Korean produced by Kaist Korterm consists of 32 cardinal numbers and 9 one-syllable determinatives repeated 4 times by 70 Korean speakers. Two more announcers read these only 2 times.

    ELRA-S0128Phonetically Balanced Words (5)
    This large acoustic corpus in Korean produced by Kaist Korterm consists of 35 cardinal numbers, each composed of 4 single numbers, read 4 times by 70 Korean speakers. Two more announcers read these only 2 times.

    ELRA-S0129Phonetically Balanced Sentences
    This large acoustic corpus in Korean produced by Kaist Korterm consists of 539 sentences and a set of 50 common sentences read by 20 native Korean speakers.

    ELRA-S0130Phonetically Rich Words
    This large acoustic corpus in Korean produced by Kaist Korterm consists of 32 single cardinal numbers, 1620 cardinal numbers each composed of 4 single numbers, and 3813 phonetically rich words uttered by 500 Korean speakers (250 males, 250 females), by telephone (fixed, wireless, mobile).

    ELRA-S0131British-English SpeechDat-Car
    This speech database contains the recordings in a car of 300 speakers, who uttered around 120 read and spontaneous items. Recordings have been made through 5 different channels, of which 4 were in-car microphones (1 close-talk microphone, 3 far-talk microphones) and 1 channel over the GSM network.

    ELRA-S0132-01Danish SpeechDat-Car - Full database
    This speech database contains the recordings in a car of 300 speakers, who uttered around 120 read and spontaneous items. Recordings have been made through 5 different channels, of which 4 were in-car microphones (1 close-talk microphone, 3 far-talk microphones) and 1 channel over the GSM network.

    ELRA-S0132-02Danish SpeechDat-Car - GSM recordings only
    This speech database contains the recordings in a car of 300 speakers, who uttered around 120 read and spontaneous items. Recordings have been made through 5 different channels, of which 4 were in-car microphones (1 close-talk microphone, 3 far-talk microphones) and 1 channel over the GSM network.

    ELRA-S0132-03Danish SpeechDat-Car - In-car recordings
    This speech database contains the recordings in a car of 300 speakers, who uttered around 120 read and spontaneous items. Recordings have been made through 5 different channels, of which 4 were in-car microphones (1 close-talk microphone, 3 far-talk microphones) and 1 channel over the GSM network.

    ELRA-S0133Finnish SpeechDat-Car
    This speech database contains the recordings in a car of 302 speakers, who uttered around 120 read and spontaneous items. Recordings have been made through 5 different channels, of which 4 were in-car microphones (1 close-talk microphone, 3 far-talk microphones) and 1 channel over the GSM network.

    ELRA-S0134Concise Oxford Dictionary - Audio Files
    The "acoustic dictionary" contains 60,000 soundfiles recorded from the Concise Oxford Dictionary, with the British-English pronunciation. The format in use is 22kHz 16-bit WAV.

    ELRA-S0135French SpeechDat-Car
    VEHIC1FR:  This speech database contains the recordings in a car of 313 speakers, who uttered around 120 read and spontaneous items. Recordings have been made through 5 different channels, of which 4 were in-car microphones (1 close-talk microphone, 3 far-talk microphones) and 1 channel over the GSM network.

    ELRA-S0136SmartKom Public
    SKP:  Release SKP 2.0 contains 172 recordings in the technical setup (“scenario”) SmartKom Public which is comparable to a traditional public phone booth but equipped with additional intelligent communication devices. Naive users were asked to test a “prototype” for a market study not knowing that the system was in fact controlled by two human operators. They were asked to solve two tasks in a period of 4.5 minutes while they were left alone with the system.

    ELRA-S0137TAXI - Multilingual telephone dialog database
    TAXI contains 94 recorded dialogues between a cab dispatcher and a client recorded over public phone lines (network and GSM). The dispatcher always spoke German, while the clients always spoke English (spontaneous speech).

    ELRA-S0138Cantonese SpeechDat-like MDB-2000
    This speech database contains the recordings of 2,000 speakers recorded over the mobile telephone network in China and Hong Kong. The database follows the specifications given in the framework of the SpeechDat project.

    ELRA-S0139Flemish/Dutch SpeechDat-Car database
    VEHIC1NV:  The Flemish and Dutch SpeechDat-Car database contains the recordings in a car of 302 speakers, who uttered around 120 read and spontaneous items. Recordings have been made through 5 different channels, of which 4 were in-car microphones (1 close-talk microphone, 3 far-talk microphones) and 1 channel over the GSM network.

    ELRA-S0140Spanish SpeechDat-Car database
    The Spanish SpeechDat-Car database contains the recordings in a car of 306 speakers, who uttered around 120 read and spontaneous items. Recordings have been made through 5 different channels, of which 4 were in-car microphones (1 close-talk microphone, 3 far-talk microphones) and 1 channel over the GSM network.

    ELRA-S0141SALA Spanish Venezuelan Database
    This speech database contains the recordings of 1,000 Venezuelan speakers recorded over the Venezuelan fixed telephone network. Each speaker uttered around 50 read and spontaneous items.

    ELRA-S0142Austrian SpeechDat(AT) FDB-1000 database
    This speech database contains the recordings of 1,000 Austrian speakers recorded over the fixed telephone network. Each speaker uttered around 60 read and spontaneous items.

    ELRA-S0143Austrian SpeechDat(AT) MDB-1000 database
    This speech database contains the recordings of 1,000 Austrian speakers recorded over the Austrian mobile telephone network. Each speaker uttered around 60 read and spontaneous items.

    ELRA-S0144Italian SpeechDat-Car database
    The Italian SpeechDat-Car database contains the recordings in a car of 300 Italian speakers, who uttered around 120 read and spontaneous items. Recordings have been made through 5 different channels, of which 4 were in-car microphones (1 close-talk microphone, 3 far-talk microphones) and 1 channel over the GSM network.

    ELRA-S0145Mandarin-5000 database
    This speech database contains the recordings of 4,752 speakers of Mandarin as first or second language recorded over the fixed and mobile telephone networks in all provinces of mainland China, including Hong Kong. Each speaker uttered around 54 read and spontaneous items.

    ELRA-S0146Greek SpeechDat-Car
    This speech database contains the recordings in a car of 300 speakers (150 males, 150 females), who uttered around 120 read and spontaneous items. Recordings have been made through 5 different channels, of which 4 were in-car microphones (1 close-talk microphone, 3 far-talk microphones) and 1 channel over the GSM network.

    ELRA-S0147Italian Speech Corpus 1 (Appen)
    The Italian Speech Corpus contains the recordings of 202 native Italian speakers recorded in an office and a closed public place, over 4 channels, in a range of low to medium background noise environments.

    ELRA-S0148Italian TTS Speech Corpus (Appen)
    The Italian TTS Speech Corpus contains the recordings of 1 native Italian speaker recorded in a studio over 1 channel.

    ELRA-S0149Spanish Speech Corpus 1 (Appen)
    The Spanish Speech Corpus 1 contains the recordings of 200 native Spanish speakers recorded in an office and a closed public place, over 4 channels, in a range of low to medium background noise environments.

    ELRA-S0150Spanish TTS Speech Corpus (Appen)
    The Spanish TTS Speech Corpus contains the recordings of 1 native Spanish speaker recorded in a studio over 1 channel.

    ELRA-S0151Strange Corpus 2 - SC2 (Noises)
    8000 utterances read by 10 male speakers in two car maintenance halls with a variety of real noise in the background. Noises are manually labelled in the data, which makes this corpus especially suitable for experiments with noise detection, noise cancellation and robust speech recognition.

    ELRA-S0152Basque FDB-1060 database (SpeechDat-like)
    The Basque FDB-1060 database contains the recordings of 1,060 speakers of Basque recorded over the fixed telephone network. Each speaker uttered around 43 read and spontaneous items.

    ELRA-S0153Bizkaifon (Bizkaieraren Fonoteka)
    Bizkaifon contains sound archives and associated information of dialectal varieties of spoken Basque. It consists of 21 hours of spontaneous and read speech, recorded over a microphone in a room, with orthographic transcription.

    ELRA-S0154WEBCOMMAND
    WEBCOMMAND contains recording sessions of 49 native speakers of France and Great Britain, most of whom read 260 items in two different quiet office rooms. Speakers were recorded with a high quality headset and a high quality microphone fixed to a 'webpad' held on the lap. The corpus contains a total of 15,600 two-channel recordings in 120 sessions. The database is conformant with the SpeechDat Exchange Format.

    ELRA-S0155RVG-J (Regional Variants of German J)
    This corpus contains 21,691 recordings in quiet living room acoustics of 182 adolescents (13-20) living in the German state of Bavaria.

    ELRA-S0156ANITA (Audio eNhancement In Telecom Applications)
    ANITA (Audio eNhancement In secured Telecommunication Applications) consists of 41 recordings (17 males and 24 females) stored on 13 CDs. It comprises voice recordings in 4 languages (English, French, German and Spanish), noise recordings (sirens, engines, roadworks, crowds, trains, etc.), and real condition recordings (voices and mixed noises), in English. For each language, the material consists of 60 phonetically rich sentences (in normal, stress and panic conditions), letters and numerals (in normal, stress and panic conditions) and a 10-minute text (in normal conditions).

    ELRA-S0157NetDC Arabic BNSC (Broadcast News Speech Corpus)
    The NetDC Arabic BNSC (Broadcast News Speech Corpus) is a corpus developed by ELDA in the framework of the European-funded project Network of Data Centres (NetDC). The project was done in collaboration with the LDC (Linguistic Data Consortium), which has produced a similar corpus from the news broadcast by Voice of America Arabic in the United States. The database contains ca. 22.5 hours of broadcast news speech recorded from Radio Orient (France) during a 3-month period.

    ELRA-S0158OrienTel Turkish database
    This speech database contains the recordings of 1,700 Turkish speakers recorded over the Turkish fixed and mobile telephone network. Each speaker uttered around 45 read and spontaneous items.

    ELRA-S0159German spoken by Turkish OrienTel database
    This speech database contains the recordings of 332 Turkish speakers of German recorded over the German fixed and mobile telephone network. Each speaker uttered around 53 read and spontaneous items.

    ELRA-S0160Spanish Speecon database
    The Spanish Speecon database comprises the recordings of 561 adult Spanish speakers and 55 child Spanish speakers who uttered respectively over 290 items and 210 items (read and spontaneous).

    ELRA-S0161Russian Speecon database
    The Russian Speecon database comprises the recordings of 550 adult Russian speakers and 50 child Russian speakers who uttered respectively over 290 items and 210 items (read and spontaneous).

    ELRA-S0162Hempel
    This corpus contains 25.5 hours of recordings by 3,909 German speakers with a total of 184,240 spoken words, made via public phone lines (fixed network only). The contents are free monologues answering the question: "Was haben Sie in der letzten Stunde gemacht?" (What did you do within the last hour?). The database is conformant with the SpeechDat Exchange Format.

    ELRA-S0163ILPho phonetic lexicon
    ILPho:  Phonetic lexicon containing 39,000 lemmas (319,318 entries), distributed in 2 formats: Multext (with an extra column for phonetic transcriptions) and XML format.

    ELRA-S0164BAS GEO1
    The BAS GEO1 database is a simple database about the most important location names in Germany, Austria and Switzerland, together with their pronunciation coded in SAMPA. Future updates will be distributed to all users automatically.

    ELRA-S0165MICROAES
    MICROAES is a Spanish microphone database, which comprises the recordings from 300 different speakers (a total of 30 hours of speech). Each speaker recorded a corpus of 450 paragraphs in a quiet environment. The database includes an orthographic and lexical transcription, with a few details that represent audible acoustic events (speech and non speech) present in the corresponding waveform files. The lexicon has more than 7400 words with the corresponding pronunciation information in SAMPA.

    ELRA-S0166Fixed1frDesign
    Textual material used within the French SpeechDat(II) Database. It contains prompted text read by speakers in the supplied sheet, orthographic transcription, statistics, lexicon. The CD-ROM does not contain any recordings.

    ELRA-S0167SALA II Spanish Mobile Network Database collected in Venezuela
    The SALA II Spanish Mobile Network Database collected in Venezuela comprises 1179 Venezuelan speakers (576 males, 603 females) recorded over the Venezuelan mobile telephone network.

    ELRA-S0168French Speecon database
    The French Speecon database comprises the recordings of 550 adult French speakers and 50 child French speakers who uttered respectively over 290 items and 210 items (read and spontaneous).

    ELRA-S0169Hebrew Speecon database
    The Hebrew Speecon database comprises the recordings of 550 adult Hebrew speakers and 50 child Hebrew speakers who uttered respectively over 290 items and 210 items (read and spontaneous).

    ELRA-S0170BABEL Romanian database
    The BABEL Database is a speech database that was produced by a research consortium funded by the European Union under the COPERNICUS programme (COPERNICUS Project 1304). The Romanian database consists of the basic "common" set which contains the Many Talker Set (50 males, 50 females), the Few Talker Set (5 males, 5 females), and the Very Few Talker Set (1 male, 1 female).

    ELRA-S0171SALA II Spanish from Mexico database
    The SALA II Spanish from Mexico database comprises 1075 Mexican speakers (539 males, 536 females) recorded over the Mexican mobile telephone network.

    ELRA-S0172C-ORAL-ROM - Integrated reference corpora for spoken romance languages. Multi-media edition; tools of analysis; standard linguistic measurements for validation in HLT
    C-ORAL-ROM:  C-ORAL-ROM is a multilingual corpus which consists of four comparable recording collections of French, Italian, Portuguese, and Spanish spontaneous speech sessions. It contains around 1,200,000 words (around 300,000 words per language) and provides the acoustic source of each session together with the orthographic transcription, session metadata, and text to speech synchronization, in Win Pitch Corpus format. The multimedia corpus comes with the speech software Win Pitch Corpus.

    ELRA-S0173SALA Spanish Mexican Database
    The SALA Spanish Mexican Database comprises 1260 Mexican speakers (554 males, 706 females) recorded over the Mexican fixed telephone network.

    ELRA-S0174-01FASiL English unimodal “fasil-uk” corpus
    fasil-uk:  This English corpus was collected within the FASiL project. It contains wizard-of-oz sound recordings of 70 subjects. See also S0174-02, S0174-03, S0174-04, and S0174-05.

    ELRA-S0174-02FASiL Portuguese unimodal “fasil-pt” corpus
    fasil-pt:  This Portuguese corpus was collected within the FASiL project. It contains wizard-of-oz sound recordings of 70 subjects. See also S0174-01, S0174-03, S0174-04, and S0174-05.

    ELRA-S0174-03FASiL Swedish unimodal “fasil-sv” corpus
    fasil-sv:  This Swedish corpus was collected within the FASiL project. It contains wizard-of-oz sound recordings of 70 subjects. See also S0174-01, S0174-02, S0174-04, and S0174-05.

    ELRA-S0174-04 FASiL combined unimodal “fasil-all” corpus
    fasil-all:  This corpus was collected within the FASiL project. It contains Wizard-of-Oz sound recordings of 70 subjects per language (English, Portuguese and Swedish). See also S0174-01, S0174-02, S0174-03, and S0174-05.

    ELRA-S0174-05 FASiL multimodal “fasil-mm” corpus
    fasil-mm:  This corpus was collected within the FASiL project. It contains Wizard-of-Oz sound and interaction recordings of 90 subjects (30 per language: English, Portuguese and Swedish). See also S0174-01, S0174-02, S0174-03, and S0174-04.

    ELRA-S0175 Mandarin Chinese Speecon database
    The Mandarin Chinese Speecon database comprises the recordings of 550 adult Chinese speakers and 50 child Chinese speakers, who uttered over 290 items and 210 items respectively (read and spontaneous).

    ELRA-S0176 Finnish Speecon database
    The Finnish Speecon database comprises the recordings of 550 adult Finnish speakers and 50 child Finnish speakers, who uttered over 290 items and 210 items respectively (read and spontaneous).

    ELRA-S0177 Korean Speecon database
    The Korean Speecon database comprises the recordings of 568 adult Korean speakers and 58 child Korean speakers, who uttered over 290 items and 210 items respectively (read and spontaneous).

    ELRA-S0178 Turkish Speecon database
    The Turkish Speecon database comprises the recordings of 550 adult Turkish speakers and 50 child Turkish speakers, who uttered over 290 items and 210 items respectively (read and spontaneous).

    ELRA-S0179 Polish Speecon database
    The Polish Speecon database comprises the recordings of 550 adult Polish speakers and 50 child Polish speakers, who uttered over 290 items and 210 items respectively (read and spontaneous).

    ELRA-S0180 Portuguese Speecon database
    The Portuguese Speecon database comprises the recordings of 553 adult Portuguese speakers and 52 child Portuguese speakers, who uttered over 290 items and 210 items respectively (read and spontaneous).

    ELRA-S0181 SALA II Spanish from Costa Rica database
    The SALA II Spanish from Costa Rica database comprises 1,165 Costa Rican speakers (574 males, 591 females) recorded over the Costa Rican mobile telephone network.

    ELRA-S0182 SALA II Spanish from Argentina database
    The SALA II Spanish from Argentina database comprises 1,076 Argentinian speakers (534 males, 542 females) recorded over the Argentinian mobile telephone network.

    ELRA-S0183 OrienTel Morocco MCA (Modern Colloquial Arabic) database
    This speech database contains the recordings of 772 Moroccan speakers recorded over the Moroccan fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.

    ELRA-S0184 OrienTel Morocco MSA (Modern Standard Arabic) database
    This speech database contains the recordings of 530 Moroccan speakers recorded over the Moroccan fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.

    ELRA-S0185 OrienTel French as spoken in Morocco database
    This speech database contains the recordings of 530 Moroccan speakers of French recorded over the Moroccan fixed and mobile telephone network. Each speaker uttered around 47 read and spontaneous items.

    ELRA-S0186 OrienTel Tunisia MCA (Modern Colloquial Arabic) database
    This speech database contains the recordings of 792 Tunisian speakers recorded over the Tunisian fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.

    ELRA-S0187 OrienTel Tunisia MSA (Modern Standard Arabic) database
    This speech database contains the recordings of 598 Tunisian speakers recorded over the Tunisian fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.

    ELRA-S0188 OrienTel French as spoken in Tunisia database
    This speech database contains the recordings of 576 Tunisian speakers of French recorded over the Tunisian fixed and mobile telephone network. Each speaker uttered around 47 read and spontaneous items.

    ELRA-S0189 OrienTel Hebrew database
    This speech database contains the recordings of 1000 Hebrew speakers recorded over the Israeli fixed and mobile telephone network. Each speaker uttered around 47 read and spontaneous items.

    ELRA-S0190 OrienTel Arabic as spoken in Israel database
    This speech database contains the recordings of 750 Arabic speakers recorded over the Israeli fixed and mobile telephone network. Each speaker uttered around 47 read and spontaneous items.

    ELRA-S0191 ZipTel
    The ZipTel telephone speech database contains 7746 recordings of people applying for a SpeechDat prompt sheet via telephone. The calls were recorded by an automatic telephone server; callers were asked to provide address, ZIP code, city and telephone number.

    ELRA-S0192 GlobalPhone Arabic
    The GlobalPhone corpus was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 20 spoken languages: Arabic, Bulgarian, Chinese-Mandarin, Chinese-Shanghai, Croatian, Czech, French, German, Hausa, Japanese, Korean, Polish, Portuguese (Brazilian), Russian, Spanish (Latin America), Swedish, Tamil, Thai, Turkish, Vietnamese. In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary (up to 65,000 words). The read articles cover national and international political news as well as economic news.

    ELRA-S0193 GlobalPhone Chinese-Mandarin
    The GlobalPhone corpus was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 20 spoken languages: Arabic, Bulgarian, Chinese-Mandarin, Chinese-Shanghai, Croatian, Czech, French, German, Hausa, Japanese, Korean, Polish, Portuguese (Brazilian), Russian, Spanish (Latin America), Swedish, Tamil, Thai, Turkish, Vietnamese. In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary (up to 65,000 words). The read articles cover national and international political news as well as economic news.

    ELRA-S0194 GlobalPhone Chinese-Shanghai
    The GlobalPhone corpus was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 20 spoken languages: Arabic, Bulgarian, Chinese-Mandarin, Chinese-Shanghai, Croatian, Czech, French, German, Hausa, Japanese, Korean, Polish, Portuguese (Brazilian), Russian, Spanish (Latin America), Swedish, Tamil, Thai, Turkish, Vietnamese. In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary (up to 65,000 words). The read articles cover national and international political news as well as economic news.

    ELRA-S0195 GlobalPhone Croatian
    The GlobalPhone corpus was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 20 spoken languages: Arabic, Bulgarian, Chinese-Mandarin, Chinese-Shanghai, Croatian, Czech, French, German, Hausa, Japanese, Korean, Polish, Portuguese (Brazilian), Russian, Spanish (Latin America), Swedish, Tamil, Thai, Turkish, Vietnamese. In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary (up to 65,000 words). The read articles cover national and international political news as well as economic news.

    ELRA-S0196 GlobalPhone Czech
    The GlobalPhone corpus was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 20 spoken languages: Arabic, Bulgarian, Chinese-Mandarin, Chinese-Shanghai, Croatian, Czech, French, German, Hausa, Japanese, Korean, Polish, Portuguese (Brazilian), Russian, Spanish (Latin America), Swedish, Tamil, Thai, Turkish, Vietnamese. In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary (up to 65,000 words). The read articles cover national and international political news as well as economic news.

    ELRA-S0197 GlobalPhone French
    The GlobalPhone corpus was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 20 spoken languages: Arabic, Bulgarian, Chinese-Mandarin, Chinese-Shanghai, Croatian, Czech, French, German, Hausa, Japanese, Korean, Polish, Portuguese (Brazilian), Russian, Spanish (Latin America), Swedish, Tamil, Thai, Turkish, Vietnamese. In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary (up to 65,000 words). The read articles cover national and international political news as well as economic news.

    ELRA-S0198 GlobalPhone German
    The GlobalPhone corpus was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 20 spoken languages: Arabic, Bulgarian, Chinese-Mandarin, Chinese-Shanghai, Croatian, Czech, French, German, Hausa, Japanese, Korean, Polish, Portuguese (Brazilian), Russian, Spanish (Latin America), Swedish, Tamil, Thai, Turkish, Vietnamese. In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary (up to 65,000 words). The read articles cover national and international political news as well as economic news.

    ELRA-S0199 GlobalPhone Japanese
    The GlobalPhone corpus was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 20 spoken languages: Arabic, Bulgarian, Chinese-Mandarin, Chinese-Shanghai, Croatian, Czech, French, German, Hausa, Japanese, Korean, Polish, Portuguese (Brazilian), Russian, Spanish (Latin America), Swedish, Tamil, Thai, Turkish, Vietnamese. In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary (up to 65,000 words). The read articles cover national and international political news as well as economic news.

    ELRA-S0200 GlobalPhone Korean
    The GlobalPhone corpus was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 20 spoken languages: Arabic, Bulgarian, Chinese-Mandarin, Chinese-Shanghai, Croatian, Czech, French, German, Hausa, Japanese, Korean, Polish, Portuguese (Brazilian), Russian, Spanish (Latin America), Swedish, Tamil, Thai, Turkish, Vietnamese. In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary (up to 65,000 words). The read articles cover national and international political news as well as economic news.

    ELRA-S0201 GlobalPhone Portuguese (Brazilian)
    The GlobalPhone corpus was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 20 spoken languages: Arabic, Bulgarian, Chinese-Mandarin, Chinese-Shanghai, Croatian, Czech, French, German, Hausa, Japanese, Korean, Polish, Portuguese (Brazilian), Russian, Spanish (Latin America), Swedish, Tamil, Thai, Turkish, Vietnamese. In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary (up to 65,000 words). The read articles cover national and international political news as well as economic news.

    ELRA-S0202 GlobalPhone Russian
    The GlobalPhone corpus was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 20 spoken languages: Arabic, Bulgarian, Chinese-Mandarin, Chinese-Shanghai, Croatian, Czech, French, German, Hausa, Japanese, Korean, Polish, Portuguese (Brazilian), Russian, Spanish (Latin America), Swedish, Tamil, Thai, Turkish, Vietnamese. In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary (up to 65,000 words). The read articles cover national and international political news as well as economic news.

    ELRA-S0203 GlobalPhone Spanish (Latin American)
    The GlobalPhone corpus was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 20 spoken languages: Arabic, Bulgarian, Chinese-Mandarin, Chinese-Shanghai, Croatian, Czech, French, German, Hausa, Japanese, Korean, Polish, Portuguese (Brazilian), Russian, Spanish (Latin America), Swedish, Tamil, Thai, Turkish, Vietnamese. In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary (up to 65,000 words). The read articles cover national and international political news as well as economic news.

    ELRA-S0204 GlobalPhone Swedish
    The GlobalPhone corpus was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 20 spoken languages: Arabic, Bulgarian, Chinese-Mandarin, Chinese-Shanghai, Croatian, Czech, French, German, Hausa, Japanese, Korean, Polish, Portuguese (Brazilian), Russian, Spanish (Latin America), Swedish, Tamil, Thai, Turkish, Vietnamese. In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary (up to 65,000 words). The read articles cover national and international political news as well as economic news.

    ELRA-S0205 GlobalPhone Tamil
    The GlobalPhone corpus was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 20 spoken languages: Arabic, Bulgarian, Chinese-Mandarin, Chinese-Shanghai, Croatian, Czech, French, German, Hausa, Japanese, Korean, Polish, Portuguese (Brazilian), Russian, Spanish (Latin America), Swedish, Tamil, Thai, Turkish, Vietnamese. In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary (up to 65,000 words). The read articles cover national and international political news as well as economic news.

    ELRA-S0206 GlobalPhone Turkish
    The GlobalPhone corpus was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 20 spoken languages: Arabic, Bulgarian, Chinese-Mandarin, Chinese-Shanghai, Croatian, Czech, French, German, Hausa, Japanese, Korean, Polish, Portuguese (Brazilian), Russian, Spanish (Latin America), Swedish, Tamil, Thai, Turkish, Vietnamese. In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary (up to 65,000 words). The read articles cover national and international political news as well as economic news.

    ELRA-S0207 LC-STAR Catalan phonetic lexicon
    The LC-STAR Catalan phonetic lexicon comprises more than 100,000 words, including a set of 53,225 common words, a set of 45,306 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 7,498 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.

    ELRA-S0208 LC-STAR Spanish phonetic lexicon
    The LC-STAR Spanish phonetic lexicon comprises more than 100,000 words, including a set of 55,854 common words, a set of 45,403 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 7,498 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.

    ELRA-S0209 Oxford English phonetics files
    Derived from a range of Oxford Dictionaries, these files list word forms together with a representation of their IPA pronunciation. They contain 250,000 words. Pronunciation is based on standard British English. Word forms include dictionary lemmas and inflections or other morphological variations, plus a wide range of proper names and encyclopedic material. The data also includes a large number of common multi-word phrases and compound nouns. The files are provided in XML.

    ELRA-S0210 Shorter Oxford English Dictionary - Audio Files
    These are recordings of the headwords of the Shorter Oxford English Dictionary, in British English pronunciation. The set consists of over 95,000 sound files, provided as 11 kHz 8-bit WAV.

    ELRA-S0211 US Spanish Speecon database
    The US Spanish Speecon database comprises the recordings of 550 adult Spanish speakers and 50 child Spanish speakers, recorded in the US, who uttered over 290 items and 210 items respectively (read and spontaneous).

    ELRA-S0212 Taiwan Mandarin Speecon database
    The Taiwan Mandarin Speecon database comprises the recordings of 550 adult Taiwanese speakers and 50 child Taiwanese speakers, who uttered over 290 items and 210 items respectively (read and spontaneous).

    ELRA-S0213 Italian Speecon database
    The Italian Speecon database comprises the recordings of 550 adult Italian speakers and 50 child Italian speakers, who uttered over 290 items and 210 items respectively (read and spontaneous).

    ELRA-S0214 Swedish Speecon database
    The Swedish Speecon database comprises the recordings of 550 adult Swedish speakers and 50 child Swedish speakers, who uttered over 290 items and 210 items respectively (read and spontaneous).

    ELRA-S0215 UK English Speecon database
    The UK English Speecon database comprises the recordings of 606 adult UK English speakers and 51 child UK English speakers, who uttered over 290 items and 210 items respectively (read and spontaneous).

    ELRA-S0216 German Speecon database
    The German Speecon database comprises the recordings of 562 adult German speakers and 50 child German speakers, who uttered over 290 items and 210 items respectively (read and spontaneous).

    ELRA-S0217 BITS Logatome Synthesis Corpus – BITS-LG
    This corpus contains 11,036 recordings of logatomes spoken by 4 professional German speakers covering all German diphone combinations as well as the most prominent combination German - French – English. Each logatome was recorded in three channels: close microphone, large membrane microphone and laryngographic signal. All diphones are segmented and labelled into phonemic units.

    ELRA-S0218 Speecon manually pitch-marked reference database for Spanish
    This database is intended for the development and the evaluation of noise robust pitch marking (PMA) and/or pitch determination (PDA) algorithms. The recordings of 60 speakers were selected from the Speecon Spanish database (ELRA-S0160). The reference database comprises 60 minutes of pitch-marked speech signal.

    ELRA-S0219 NEMLAR Broadcast News Speech Corpus
    The NEMLAR Broadcast News Speech Corpus consists of about 40 hours of Standard Arabic news broadcasts. The broadcasts were recorded from four different radio stations: Medi1, Radio Orient, RMC – Radio Monte Carlo, RTM – Radio Television Maroc. All files were recorded in linear PCM format, 16 kHz, 16 bit.

    ELRA-S0220 NEMLAR Speech Synthesis Corpus
    The NEMLAR Speech Synthesis Corpus contains the recordings of 2 native Egyptian Arabic speakers (male and female, 35 and 27 years old respectively) recorded in a studio over 2 channels (voice + laryngograph). The recordings comprise more than 10 hours of data with transcriptions.

    ELRA-S0221 OrienTel Egypt MCA (Modern Colloquial Arabic) database
    This speech database contains the recordings of 750 Egyptian speakers recorded over the Egyptian fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.

    ELRA-S0222 OrienTel Egypt MSA (Modern Standard Arabic) database
    This speech database contains the recordings of 500 Egyptian speakers recorded over the Egyptian fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.

    ELRA-S0223 OrienTel English as spoken in Egypt database
    This speech database contains the recordings of 500 Egyptian speakers of English recorded over the Egyptian fixed and mobile telephone network. Each speaker uttered around 47 read and spontaneous items.

    ELRA-S0224 BITS Unit Selection Synthesis Corpus
    BITS-US:  This corpus contains 6,732 recordings spoken by 4 professional German speakers covering all German diphone combinations in different prosodic contexts. Each sentence was recorded in three channels: close microphone, large membrane microphone and laryngographic signal. All recordings are segmented and labelled into phonemic units as well as annotated prosodically.

    ELRA-S0225 SALA II Canadian French database
    The SALA II Canadian French database comprises 1000 Canadian speakers (502 males, 498 females) recorded over the Canadian mobile telephone network.

    ELRA-S0226-01 IDIOLOGOS 1 “Bootstrap” (NEOLOGOS Project)
    The IDIOLOGOS 1 “Bootstrap” database was produced within the French national project NEOLOGOS, as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT). It comprises 1000 adult French speakers (470 males, 530 females) recorded over the French fixed telephone network.

    ELRA-S0226-02 IDIOLOGOS 2 “Eigenspeakers” (NEOLOGOS Project)
    The IDIOLOGOS 2 “Eigenspeakers” database was produced within the French national project NEOLOGOS, as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT). It comprises 200 adult French speakers (97 males, 103 females) recorded over the French fixed telephone network.

    ELRA-S0227 PAIDIALOGOS (NEOLOGOS Project)
    The PAIDIALOGOS database was produced within the French national project NEOLOGOS, as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT). It comprises 1010 child French speakers (510 males, 500 females) recorded over the French fixed telephone network.

    ELRA-S0228-01 Mandarin Chinese Speech Synthesis Corpus (Basic Corpus)
    This corpus contains the recordings of 1 native Chinese speaker (female).
    The corpus is composed of 20 texts with 109,227 words and has been proofread manually. The corpus contents include: phrases, digit strings, letter strings, uncommon words, neutral tone, final retroflexion, Latin alphabet, interrogative sentences, 282 English words.
    The speaker has been recorded in a professional recording studio over 2 channels: microphone and glottis wave (fundamental frequency) signals for a total of 18.2 hours.
    Speech samples are stored as sequences of 16-bit 44.1 kHz PCM on two channels. The total data size is 5.67 GB for a total of 12,679 files. The data is encoded in GB-2312 format.
    The transcriptions include labels for four-class pause boundaries.
    This database is intended for use in text-to-speech and speech synthesis applications.

    ELRA-S0228-02 Mandarin Chinese Speech Synthesis Corpus
    This corpus contains the recordings of 1 native Chinese speaker (female).
    The corpus complements the Basic Corpus (ELRA-S0228-01) and aims to cover a variety of speech context data which does not include syllables.
    The corpus is composed of 28 texts with 75,841 words and has been proofread manually. The corpus contents include: text of statements, digit strings, uncommon words, letter strings, measurement units, neutral tone, final retroflexion, Latin alphabet, interrogative sentences, English words and room-ordering stimulation.
    The speaker has been recorded in a professional recording studio over 2 channels: microphone and glottis wave (fundamental frequency) signals for a total of 30.2 hours.

    ELRA-S0228-03 Mandarin Chinese Speech Synthesis Corpus (Integrated Corpus)
    The Mandarin Chinese Speech Synthesis Integrated Corpus includes both the Basic and Accessory Corpora (see ELRA-S0228-01 and ELRA-S0228-02).

    ELRA-S0228-04 Mandarin Chinese Telephone Speech Recognition Corpus - Person Name, Place Name (Mobile telephone 265)
    This corpus comprises 6,952 entries uttered by 265 speakers of different dialects, ages and educational levels (134 males and 131 females), recorded over the mobile telephone network. The database comprises 13,942 Chinese personal names and place names. Speech samples are stored as 16-bit 8 kHz WAV files, for a total of 17.6 hours of speech. The total data size is 964 MB.
    Each speaker read 15-30 items. Text files are stored in Unicode format. All data have been proofread manually.
    The transcriptions include non-speech markers (background noise, background speech, speaker sounds) as well as markers for mispronunciation, channel distortions, omitted words and duplicates.
    The corpus is intended for the testing of telephone natural speech recognition systems.
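As a rough cross-check of the figures in the entry above, the size of uncompressed PCM audio follows from duration × sample rate × bytes per sample × channels. The sketch below is illustrative only: the helper name `pcm_size_bytes` is hypothetical, and it assumes single-channel audio, ignores WAV headers, and reads the quoted size unit as mebibytes (2^20 bytes), none of which the catalogue states.

```python
def pcm_size_bytes(hours, sample_rate_hz, bits_per_sample, channels=1):
    """Return the payload size in bytes of uncompressed PCM audio (no file headers)."""
    seconds = hours * 3600
    bytes_per_second = sample_rate_hz * (bits_per_sample // 8) * channels
    return round(seconds * bytes_per_second)


# 17.6 hours of 16-bit, 8 kHz audio; mono is an assumption, not a catalogue fact.
size = pcm_size_bytes(17.6, 8000, 16)
print(f"{size / 2**20:.0f} MiB")
```

Under these assumptions the estimate lands within about one percent of the size quoted for this corpus, which suggests the quoted unit refers to bytes rather than bits.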

    ELRA-S0228-05 Mandarin Chinese Telephone Speech Recognition Corpus - Person Name, Place Name
    This corpus comprises 7,298 entries uttered by 285 speakers of different dialects, ages and educational levels (144 males and 141 females), recorded over the fixed telephone network. The database comprises 14,492 Chinese personal names and place names. Speech samples are stored as 16-bit 8 kHz WAV files, for a total of 17.6 hours of speech. The total data size is 968 MB.
    Each speaker read 15-30 items. Text files are stored in Unicode format. All data have been proofread manually.
    The transcriptions include non-speech markers (background noise, background speech, speaker sounds) as well as markers for mispronunciation, channel distortions, omitted words and duplicates.
    The corpus is intended for the testing of telephone natural speech recognition systems.

    ELRA-S0228-06Mandarin Chinese Telephone Speech Recognition Corpus - Digit String
    This corpus comprises 5,309 entries uttered by 265 speakers of different dialects, ages and various educational levels (134 males and 131 females), recorded over the fixed telephone network. The database comprises 7,606 Chinese digit strings. Speech samples are stored as a sequence of 16-bit 8kHz WAV for a total of 11.8 hours of speech. The total capacity of the data is 648 Mb.
    Each speaker read 25-30 items. Text files are stored in Unicode format. All data have been proofread manually.
    The transcriptions include non-speech markers (background noise, background speech, speaker sounds) as well as markers for mispronunciations, channel distortions, omitted words and duplicates.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-07Mandarin Chinese Telephone Speech Recognition Corpus - Digit String
    This corpus comprises 6,140 entries uttered by 265 speakers of different dialects, ages and various educational levels (144 males and 141 females), recorded over the mobile telephone network. The database comprises 8,109 Chinese digit strings. Speech samples are stored as a sequence of 16-bit 8kHz WAV for a total of 11.8 hours of speech. The total capacity of the data is 669 Mb.
    Each speaker read 25-30 items. Text files are stored in Unicode format. All data have been proofread manually.
    The transcriptions include non-speech markers (background noise, background speech, speaker sounds) as well as markers for mispronunciations, channel distortions, omitted words and duplicates.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-08Mandarin Chinese Telephone Speech Recognition Corpus - Stock
    This corpus comprises 3,085 entries uttered by 265 speakers of different dialects, ages and various educational levels (134 males and 131 females), recorded over the mobile telephone network. The database comprises 6,972 Chinese stocks. Speech samples are stored as a sequence of 16-bit 8kHz WAV for a total of 7 hours of speech. The total capacity of the data is 387 Mb.
    Each speaker read 15-30 items. Text files are stored in Unicode format. All data have been proofread manually.
    The transcriptions include non-speech markers (background noise, background speech, speaker sounds) as well as markers for mispronunciations, channel distortions, omitted words and duplicates.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-09Mandarin Chinese Telephone Speech Recognition Corpus - Stock
    This corpus comprises 3,077 entries uttered by 285 speakers of different dialects, ages and various educational levels (144 males and 141 females), recorded over the fixed telephone network. The database comprises 7,239 Chinese stocks. Speech samples are stored as a sequence of 16-bit 8kHz WAV for a total of 7 hours of speech. The total capacity of the data is 373 Mb.
    Each speaker read 15-30 items. Text files are stored in Unicode format. All data have been proofread manually.
    The transcriptions include non-speech markers (background noise, background speech, speaker sounds) as well as markers for mispronunciations, channel distortions, omitted words and duplicates.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-10Mandarin Chinese Telephone Speech Recognition Corpus – SMS (Mobile telephone 64)
    This corpus comprises 1,079 entries uttered by 64 speakers of different dialects, ages and various educational levels (52 males and 12 females), recorded over the mobile telephone network. The database comprises 3,190 Chinese short messages (SMS). Speech samples are stored as a sequence of 16-bit 8kHz WAV for a total of 3 hours of speech. The total capacity of the data is 161 Mb.
    Each speaker read 50 items. Text files are stored in Unicode format. All data have been proofread manually.
    The transcriptions include non-speech markers (background noise, background speech, speaker sounds) as well as markers for mispronunciations, channel distortions, omitted words and duplicates.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-11Mandarin Chinese Telephone Speech Recognition Corpus – SMS (Fixed phone 86)
    This corpus comprises 1,648 entries uttered by 86 speakers of different dialects, ages and various educational levels (64 males and 22 females), recorded over the fixed telephone network. The database comprises 4,282 Chinese short messages (SMS). Speech samples are stored as a sequence of 16-bit 8kHz WAV for a total of 3.7 hours of speech. The total capacity of the data is 205 Mb.
    Each speaker read 50 items. Text files are stored in Unicode format. All data have been proofread manually.
    The transcriptions include non-speech markers (background noise, background speech, speaker sounds) as well as markers for mispronunciations, channel distortions, omitted words and duplicates.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-12Mandarin Chinese Desktop Speech Recognition Corpus - SMS (200 people)
    This corpus comprises 7,276 entries uttered by 200 speakers of different dialects, ages and various educational levels (87 males and 113 females), recorded over 4 channels (Mic1: SHURE SM58; Mic2: ANC-700 Head-mounted; Mic3: TELEX M-60; Mic4: ACOUSTIC MAGIC). The database comprises 23,949 short messages (SMS) per channel. Speech samples are stored as a sequence of 16-bit 22.05kHz WAV for 35.6 hours of speech per channel. The total capacity of the data is 21.1 Gb.
    Each speaker read 120 items. Text files are stored in Unicode format. All data have been proofread manually.
    The transcriptions include non-speech markers (background noise, background speech, speaker sounds) as well as markers for mispronunciations, channel distortions, omitted words and duplicates.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-13Mandarin Chinese Desktop Speech Recognition Corpus - Digit String (200 people)
    This corpus comprises 1,500 entries uttered by 200 speakers of different dialects, ages and various educational levels (87 males and 113 females), recorded over 4 channels (Mic1: SHURE SM58; Mic2: ANC-700 Head-mounted; Mic3: TELEX M-60; Mic4: ACOUSTIC MAGIC). The database comprises 6,000 digit strings per channel. Speech samples are stored as a sequence of 16-bit 22.05kHz WAV for 11.5 hours of speech per channel. The total capacity of the data is 6.82 Gb.
    Each speaker read 30 items. Text files are stored in Unicode format. All data have been proofread manually.
    The transcriptions include non-speech markers (background noise, background speech, speaker sounds) as well as markers for mispronunciations, channel distortions, omitted words and duplicates.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-14Mandarin Chinese Desktop Speech Recognition Corpus - Person name, Place Name (10 people)
    This corpus comprises 782 entries uttered by 10 speakers of different dialects, ages and various educational levels (3 males and 7 females), recorded over 4 channels (Mic1: SHURE SM58; Mic2: ANC-700 Head-mounted; Mic3: TELEX M-60; Mic4: ACOUSTIC MAGIC). The database comprises 800 Chinese items per channel: 30 stocks, 10 nation names, 10 Chinese city names, 30 person names. Speech samples are stored as a sequence of 16-bit 22.05kHz WAV for 0.97 hours of speech per channel. The total capacity of the data is 587 Mb.
    Each speaker read 120 items. Text files are stored in Unicode format. All data have been proofread manually.
    The transcriptions include non-speech markers (background noise, background speech, speaker sounds) as well as markers for mispronunciations, channel distortions, omitted words and duplicates.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-15Mandarin Chinese Desktop Speech Recognition Corpus - SMS (120 people)
    This corpus comprises 7,142 entries uttered by 120 speakers of different dialects, ages and various educational levels (59 males and 61 females), recorded through a head-mounted noise-canceling microphone. The database comprises 16,499 short messages (SMS). Speech samples are stored as a sequence of 16-bit 22.05kHz WAV for 21.7 hours of speech. The total capacity of the data is 3.2 Gb.
    Each speaker read 120-150 items. Text files are stored in Unicode format. All data have been proofread manually.
    The transcriptions include non-speech markers (background noise, background speech, speaker sounds) as well as markers for mispronunciations, channel distortions, omitted words and duplicates.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-16Mandarin Chinese Desktop Speech Recognition Corpus - Digit String (120 people)
    This corpus comprises 1,500 entries uttered by 120 speakers of different dialects, ages and various educational levels (59 males and 61 females), recorded through a head-mounted noise-canceling microphone. The database comprises 3,600 digit strings. Speech samples are stored as a sequence of 16-bit 22.05kHz WAV for a total of 6.2 hours of speech. The total capacity of the data is 945 Mb.
    Each speaker read 120-150 items. Text files are stored in Unicode format. All data have been proofread manually.
    The transcriptions include non-speech markers (background noise, background speech, speaker sounds) as well as markers for mispronunciations, channel distortions, omitted words and duplicates.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-17Mandarin Chinese Desktop Speech Recognition Corpus - Person Name, Place Name (70 people)
    This corpus comprises 9,667 entries uttered by 70 speakers of different dialects, ages and various educational levels (38 males and 32 females), recorded through a head-mounted noise-canceling microphone. The database comprises 12,596 items. Speech samples are stored as a sequence of 16-bit 22.05kHz WAV for a total of 15 hours of speech. The total capacity of the data is 2.17 Gb.
    Each speaker read 60 person names, 20 country names, 10 Chinese city names, 30 street names, 50 company and organization names, 10 geographical names. Text files are stored in Unicode format. All data have been proofread manually.
    The transcriptions include non-speech markers (background noise, background speech, speaker sounds) as well as markers for mispronunciations, channel distortions, omitted words and duplicates.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-18Mandarin Chinese Desktop Speech Recognition Corpus - Stock (70 people)
    This corpus comprises 1,586 entries uttered by 70 speakers of different dialects, ages and various educational levels (38 males and 32 females), recorded through a head-mounted noise-canceling microphone. The database comprises 4,199 items. Speech samples are stored as a sequence of 16-bit 22.05kHz WAV for a total of 5.1 hours of speech. The total capacity of the data is 776 Mb.
    Each speaker read 60 stocks. Text files are stored in Unicode format. All data have been proofread manually.
    The transcriptions include non-speech markers (background noise, background speech, speaker sounds) as well as markers for mispronunciations, channel distortions, omitted words and duplicates.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-19Mandarin Chinese Desktop Speech Recognition Corpus - Spontaneous Speech (50 people)
    This corpus comprises spontaneous speech (elicited) from 50 speakers of different dialects, ages and various educational levels (21 males and 29 females), who spoke on 36 different topics in a working environment, recorded through a head-mounted noise-cancelling microphone. The database comprises 600 speech files. Speech samples are stored as a sequence of 16-bit 44.1kHz WAV for a total of 8 hours of speech. The total capacity of the data is 2.37 Gb.
    Text files are stored in Unicode format. All data have been proofread manually.
    The transcriptions include non-speech markers (background noise, background speech, speaker sounds) as well as markers for mispronunciations, channel distortions, omitted words and duplicates.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-20Mandarin Chinese Desktop Speech Recognition Corpus - Stock, Person Name, Digit String, Simple Chinese sentences, Spontaneous Speech (50 people)
    This corpus comprises 8,206 entries including stocks, person names, digit strings and 8,511 speech files composed of spontaneous speech, uttered by 50 speakers of different dialects, ages and various educational levels (22 males and 28 females), recorded from a stand microphone (SHURE SM58). Speech samples are stored as a sequence of 16-bit 44.1kHz WAV for a total of 24 hours of speech. The total capacity of the data is 7 Gb.
    Text files are stored in Unicode format. All data have been proofread manually.
    The transcriptions include non-speech markers (background noise, background speech, speaker sounds) as well as markers for mispronunciations, channel distortions, omitted words and duplicates.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-21Mandarin Chinese Desktop Speech Recognition Corpus - Simple Chinese sentences (850 people)
    This corpus comprises 14,011 entries uttered by 850 speakers of different dialects, ages and various educational levels (420 males and 430 females), recorded over 2 channels (Mic1: SHURE SM58; Mic2: Labtec Axis-002). The database comprises 104,750 sentences per channel. Speech samples are stored as a sequence of 16-bit 44.1kHz WAV for 150 hours of speech per channel. The total capacity of the data is 88 Gb.
    600 speakers read 120 sentences and 250 speakers read 131 sentences. Text files are stored in Unicode format. All data have been proofread manually.
    The transcriptions include non-speech markers (background noise, background speech, speaker sounds) as well as markers for mispronunciations, channel distortions, omitted words and duplicates.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-22Mandarin Chinese Desktop Speech Recognition Corpus - Digit String (849 people)
    This corpus comprises 750 entries uttered by 849 speakers of different dialects, ages and various educational levels (420 males and 429 females), recorded over 2 channels (Mic1: SHURE SM58; Mic2: Labtec Axis-002). The database comprises 12,750 digit strings per channel. Speech samples are stored as a sequence of 16-bit 44.1kHz WAV for 21 hours of speech per channel. The total capacity of the data is 12.9 Gb.
    Each speaker read 15 items. Text files are stored in Unicode format. All data have been proofread manually.
    The transcriptions include non-speech markers (background noise, background speech, speaker sounds) as well as markers for mispronunciations, channel distortions, omitted words and duplicates.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-23Mandarin Chinese Desktop Speech Recognition Corpus - Person name (849 people)
    This corpus comprises 2,250 entries uttered by 849 speakers of different dialects, ages and various educational levels (420 males and 429 females), recorded over 2 channels (Mic1: SHURE SM58; Mic2: Labtec Axis-002). The database comprises 12,750 person names per channel. Speech samples are stored as a sequence of 16-bit 44.1kHz WAV for 18.3 hours of speech per channel. The total capacity of the data is 11 Gb.
    Each speaker read 15 items. Text files are stored in Unicode format. All data have been proofread manually.
    The transcriptions include non-speech markers (background noise, background speech, speaker sounds) as well as markers for mispronunciations, channel distortions, omitted words and duplicates.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-24Mandarin Chinese Desktop Speech Recognition Corpus - Stock (849 people)
    This corpus comprises 1,584 entries uttered by 849 speakers of different dialects, ages and various educational levels (420 males and 429 females), recorded over 2 channels (Mic1: SHURE SM58; Mic2: Labtec Axis-002). The database comprises 13,600 stocks per channel. Speech samples are stored as a sequence of 16-bit 44.1kHz WAV for 20 hours of speech per channel. The total capacity of the data is 12 Gb.
    Each speaker read 16 items. Text files are stored in Unicode format. All data have been proofread manually.
    The transcriptions include non-speech markers (background noise, background speech, speaker sounds) as well as markers for mispronunciations, channel distortions, omitted words and duplicates.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-25Mandarin Chinese Desktop Speech Recognition Corpus - Spontaneous Speech (849 people)
    This corpus comprises spontaneous speech (elicited) from 849 speakers of different dialects, ages and various educational levels (420 males and 429 females), who spoke on 40 different topics, recorded over 2 channels (Mic1: SHURE SM58; Mic2: Labtec Axis-002). Speech samples are stored as a sequence of 16-bit 44.1kHz WAV for a total of 208 hours of speech per channel. The total capacity of the data is 122.8 Gb.
    Each speaker read 15 items. Text files are stored in Unicode format. All data have been proofread manually.
    The transcriptions include non-speech markers (background noise, background speech, speaker sounds) as well as markers for mispronunciations, channel distortions, omitted words and duplicates.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-26Mandarin Chinese Telephone Speech Recognition Corpus - Simple Chinese sentences (650 people)
    This corpus comprises 8,011 entries uttered by 650 speakers of different dialects, ages and various educational levels (340 males and 310 females), recorded over the fixed telephone network. The database comprises 80,750 simple Chinese sentences. Speech samples are stored as a sequence of 16-bit 8kHz WAV for a total of 134 hours of speech.
    400 speakers read 120 items, 250 speakers read 131 items. Text files are stored in Unicode format. All data have been proofread manually.
    The transcriptions include non-speech markers (background noise, background speech, speaker sounds) as well as markers for mispronunciations, channel distortions, omitted words and duplicates.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-27Mandarin Chinese Telephone Speech Recognition Corpus - Digit String (649 people)
    This corpus comprises 750 entries uttered by 649 speakers of different dialects, ages and various educational levels (340 males and 309 females), recorded over the fixed telephone network. The database comprises 9,750 digit strings. Speech samples are stored as a sequence of 16-bit 8kHz WAV for a total of 16.28 hours of speech.
    Each speaker read 15 items. Text files are stored in Unicode format. All data have been proofread manually.
    The transcriptions include non-speech markers (background noise, background speech, speaker sounds) as well as markers for mispronunciations, channel distortions, omitted words and duplicates.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-28Mandarin Chinese Telephone Speech Recognition Corpus - Person Name (649 people)
    This corpus comprises 2,250 entries uttered by 649 speakers of different dialects, ages and various educational levels (340 males and 309 females), recorded over the fixed telephone network. The database comprises 9,750 person names. Speech samples are stored as a sequence of 16-bit 8kHz WAV for a total of 13.97 hours of speech.
    Each speaker read 15 items. Text files are stored in Unicode format. All data have been proofread manually.
    The transcriptions include non-speech markers (background noise, background speech, speaker sounds) as well as markers for mispronunciations, channel distortions, omitted words and duplicates.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-29Mandarin Chinese Telephone Speech Recognition Corpus – Stock (649 people)
    This corpus comprises 1,584 entries uttered by 649 speakers of different dialects, ages and various educational levels (340 males and 309 females), recorded over the fixed telephone network. The database comprises 10,400 stocks. Speech samples are stored as a sequence of 16-bit 8kHz WAV for a total of 12.99 hours of speech.
    Each speaker read 16 items. Text files are stored in Unicode format. All data have been proofread manually.
    The transcriptions include non-speech markers (background noise, background speech, speaker sounds) as well as markers for mispronunciations, channel distortions, omitted words and duplicates.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-30Mandarin Chinese Telephone Speech Recognition Corpus - Spontaneous Speech (649 people)
    This corpus comprises spontaneous speech (elicited) from 649 speakers of different dialects, ages and various educational levels (340 males and 309 females), who spoke on 40 different topics in a working environment, recorded over the fixed telephone network. Speech samples are stored as a sequence of 16-bit 8kHz WAV for a total of 143.05 hours of speech.
    Each speaker read 15 items. Text files are stored in Unicode format. All data have been proofread manually. The total capacity of the data is 7.67 Gb.
    The transcriptions include non-speech markers (background noise, background speech, speaker sounds) as well as markers for mispronunciations, channel distortions, omitted words and duplicates.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-31Mandarin Chinese Desktop Speech Recognition Corpus - Monosyllabic (98 people)
    This corpus comprises 1,267 entries uttered by 98 speakers of different dialects, ages and various educational levels (46 males and 52 females), recorded over 3 channels (Mic 1: SHURE SM58; Mic 2: Labtec Axis-002; Mic 3: ATR 60C). The database comprises monosyllables. Speech samples are stored as a sequence of 16-bit 44.1kHz WAV for 88 hours of speech per channel. The total capacity of the data is 78 Gb.
    Text files are stored in Unicode format. All data have been proofread manually.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-32Mandarin Chinese Desktop Speech Recognition Corpus - Digit String (98 people)
    This corpus comprises 1,500 entries uttered by 98 speakers of different dialects, ages and various educational levels (46 males and 52 females), recorded over 4 channels (Mic 1: SHURE SM58; Mic 2: Labtec Axis-002; Mic 3: KOSS; Mic 4: ATR 60C). The database comprises digit strings. Speech samples are stored as a sequence of 16-bit 44.1kHz WAV for 8.7 hours of speech per channel. The total capacity of the data is 10 Gb.
    Each speaker read 50 items. Text files are stored in Unicode format. All data have been proofread manually.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-33Mandarin Chinese Speech Recognition Corpus (desktop) - place name (120 people)
    This corpus comprises 3,600 speech files uttered by 120 speakers of different dialects, ages and various educational levels, recorded over 3 channels (Mic 1: SHURE Beta53; Mic 2: AKG C4000b; Mic 3: Labtec Axis 002). The database comprises 4,858 place names. Speech samples are stored as a sequence of 16-bit 48kHz WAV for 6.26 hours of speech per channel. The total capacity of the data is 6.04 Gb.
    Text files are stored in Unicode format. All data have been proofread manually.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-34Mandarin Chinese Speech Recognition Corpus (desktop) - short message (120 people)
    This corpus comprises 3,600 speech files uttered by 120 speakers of different dialects, ages and various educational levels, recorded over 3 channels (Mic 1: SHURE Beta53; Mic 2: AKG C4000b; Mic 3: Labtec Axis 002). The database comprises 7,161 Chinese short messages (SMS) in total. Speech samples are stored as a sequence of 16-bit 48kHz WAV for 5.86 hours of speech per channel. The total capacity of the data is 5.65 Gb.
    Text files are stored in Unicode format. All data have been proofread manually.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-35Mandarin Chinese Speech Recognition Corpus (desktop) - person name (120 people)
    This corpus comprises 3,586 speech files uttered by 120 speakers of different dialects, ages and various educational levels, recorded over 3 channels (Mic 1: SHURE Beta53; Mic 2: AKG C4000b; Mic 3: Labtec Axis 002). The database comprises 2,250 person names in total. Speech samples are stored as a sequence of 16-bit 48kHz WAV for 6.19 hours of speech per channel. The total capacity of the data is 5.97 Gb.
    Text files are stored in Unicode format. All data have been proofread manually.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-36Mandarin Chinese Speech Recognition Corpus (desktop) - digit string (119 people)
    This corpus comprises 3,570 speech files uttered by 119 speakers of different dialects, ages and various educational levels, recorded over 3 channels (Mic 1: SHURE Beta53; Mic 2: AKG C4000b; Mic 3: Labtec Axis 002). The database comprises 1,500 digit strings in total. Speech samples are stored as a sequence of 16-bit 48kHz WAV for 7.54 hours of speech per channel. The total capacity of the data is 7.28 Gb.
    Text files are stored in Unicode format. All data have been proofread manually.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-37Mandarin Chinese Speech Recognition Corpus (in the car) - person name, place name in Beijing, stocks, digit string (20 people)
    This corpus comprises 9,599 speech files uttered by 20 speakers of different dialects, ages and various educational levels, recorded over 2 channels. The database comprises person names, place names in Beijing, stocks, digit strings. Speech samples are stored as a sequence of 16-bit 22.05kHz WAV for 10.45 hours of speech per channel. The total capacity of the data is 3.08 Gb.
    Each speaker read 15 items. Text files are stored in Unicode format. All data have been proofread manually.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-38Mandarin Chinese Speech Recognition Corpus (telephone channel) - Chinese single sentence (100 people)
    This corpus comprises sentences uttered by 100 speakers of different dialects, ages and various educational levels. Speech samples are stored as a sequence of 16-bit 8kHz WAV for a total of 7.3 hours of speech. The total capacity of the data is 400 Mb.
    Each speaker read 40 items. Text files are stored in Unicode format. All data have been proofread manually.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-39Mandarin Chinese Speech Recognition Corpus (telephone channel) - person name (100 people)
    This corpus comprises person names uttered by 100 speakers of different dialects, ages and various educational levels. Speech samples are stored as a sequence of 16-bit 8kHz WAV for a total of 6 hours of speech. The total capacity of the data is 328 Mb.
    Each speaker read 40 items. Text files are stored in Unicode format. All data have been proofread manually.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-40Mandarin Chinese Speech Recognition Corpus (telephone channel) - place name (100 people)
    This corpus comprises place names uttered by 100 speakers of different dialects, ages and various educational levels. Speech samples are stored as a sequence of 16-bit 8kHz WAV for a total of 6.2 hours of speech. The total capacity of the data is 338 Mb.
    Each speaker read 40 items. Text files are stored in Unicode format. All data have been proofread manually.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-41Mandarin Chinese Speech Recognition Corpus (telephone channel) - digit string (100 people)
    This corpus comprises digit strings uttered by 100 speakers of different dialects, ages and various educational levels. Speech samples are stored as a sequence of 16-bit 8kHz WAV for a total of 7.5 hours of speech. The total capacity of the data is 410 Mb.
    Each speaker read 40 items. Text files are stored in Unicode format. All data have been proofread manually.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-42Mandarin Chinese Speech Recognition Corpus (desktop) - single Chinese sentence (200 people)
    This corpus comprises 8,000 Chinese sentences uttered by 200 speakers of different dialects, ages and various educational levels, recorded over 2 channels. Speech samples are stored as a sequence of 16-bit 44.1kHz WAV for 12.21 hours of speech per channel. The total capacity of the data is 7.2 Gb.
    Each speaker read 40 items. Text files are stored in Unicode format. All data have been proofread manually.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-43Mandarin Chinese Speech Recognition Corpus (desktop) - person name (200 people)
    This corpus comprises 8,000 person names uttered by 200 speakers of different dialects, ages and various educational levels, recorded over 2 channels. Speech samples are stored as a sequence of 16-bit 44.1kHz WAV for 10 hours of speech per channel. The total capacity of the data is 5.92 Gb.
    Each speaker read 40 items. Text files are stored in Unicode format. All data have been proofread manually.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-44Mandarin Chinese Speech Recognition Corpus (desktop) - place name (200 people)
    This corpus comprises 8,000 place names uttered by 200 speakers of different dialects, ages and various educational levels, recorded over 2 channels. Speech samples are stored as a sequence of 16-bit 44.1kHz WAV for 10.49 hours of speech per channel. The total capacity of the data is 6.2 Gb.
    Each speaker read 40 items. Text files are stored in Unicode format. All data have been proofread manually.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-45Mandarin Chinese Speech Recognition Corpus (desktop) - digit string (200 people)
    This corpus comprises 8,000 digit strings uttered by 200 speakers of different dialects, ages and various educational levels, recorded over 2 channels. Speech samples are stored as a sequence of 16-bit 44.1kHz WAV for 12.35 hours of speech per channel. The total capacity of the data is 7.3 Gb.
    Each speaker read 40 items. Text files are stored in Unicode format. All data have been proofread manually.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-46Mandarin Chinese high clarity Speech Recognition Corpus (in recording studio) - single Chinese sentence (200 people)
    This corpus, recorded in a studio, comprises 8,000 Chinese sentences uttered by 200 speakers of different dialects, ages and educational levels, recorded over 4 channels. Speech samples are stored as 16-bit, 44.1 kHz WAV files, with 12 hours of speech per channel. The total size of the data is 14.22 GB.
    Each speaker read 40 items. Text files are stored in Unicode format. All data have been proofread manually.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-47Mandarin Chinese high clarity Speech Recognition Corpus (in recording studio) - (desktop) – person name (200 people)
    This corpus comprises 8,000 Chinese person names uttered by 200 speakers of different dialects, ages and educational levels, recorded over 4 channels. Speech samples are stored as 16-bit, 44.1 kHz WAV files, with 10 hours of speech per channel. The total size of the data is 12 GB.
    Each speaker read 40 items. Text files are stored in Unicode format. All data have been proofread manually.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-48Mandarin Chinese high clarity Speech Recognition Corpus (in recording studio) - (desktop) – place name (200 people)
    This corpus comprises 8,000 Chinese place names uttered by 200 speakers of different dialects, ages and educational levels, recorded over 4 channels. Speech samples are stored as 16-bit, 44.1 kHz WAV files, with 12.27 hours of speech per channel. The total size of the data is 14.45 GB.
    Each speaker read 40 items. Text files are stored in Unicode format. All data have been proofread manually.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-49Mandarin Chinese high clarity Speech Recognition Corpus (in recording studio) - (desktop) – digit string (200 people)
    This corpus comprises 8,000 digit strings uttered by 200 speakers of different dialects, ages and educational levels, recorded over 4 channels. Speech samples are stored as 16-bit, 44.1 kHz WAV files, with 13.3 hours of speech per channel. The total size of the data is 15.7 GB.
    Each speaker read 40 items. Text files are stored in Unicode format. All data have been proofread manually.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-50Korean Mandarin Speech Recognition Corpus (desktop) – person name (150 people)
    This corpus comprises 1,500 Korean Mandarin person names uttered by 150 speakers of different dialects, ages and educational levels, recorded over 4 channels. Speech samples are stored as 16-bit, 48 kHz WAV files, with 1.56 hours of speech per channel. The total size of the data is 2 GB.
    Each speaker read 10 items. Text files are stored in Unicode format. All data have been proofread manually.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-51Korean Mandarin Speech Recognition Corpus (desktop) – place name (150 people)
    This corpus comprises 1,500 Korean Mandarin place names uttered by 150 speakers of different dialects, ages and educational levels, recorded over 4 channels. Speech samples are stored as 16-bit, 48 kHz WAV files, with 1.53 hours of speech per channel. The total size of the data is 2 GB.
    Each speaker read 10 items. Text files are stored in Unicode format. All data have been proofread manually.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-52Korean Mandarin Speech Recognition Corpus (desktop) – digit string (110 people)
    This corpus comprises 13,200 Korean Mandarin digit strings uttered by 110 speakers of different dialects, ages and educational levels, recorded over 4 channels. Speech samples are stored as 16-bit, 48 kHz WAV files, with 18.87 hours of speech per channel. The total size of the data is 24.2 GB.
    Each speaker read 120 items. Text files are stored in Unicode format. All data have been proofread manually.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-53Korean Mandarin Speech Recognition Corpus (desktop) – single Korean sentences (40 people)
    This corpus comprises 4,800 Korean Mandarin sentences uttered by 40 speakers of different dialects, ages and educational levels, recorded over 4 channels. Speech samples are stored as 16-bit, 48 kHz WAV files, with 7.63 hours of speech per channel. The total size of the data is 9.82 GB.
    Each speaker read 120 items. Text files are stored in Unicode format. All data have been proofread manually.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-54Japanese Mandarin Speech Recognition Corpus (desktop) – single Japanese sentence (200 people)
    This corpus comprises 12,000 Japanese Mandarin sentences uttered by 200 speakers of different dialects, ages and educational levels, recorded over 4 channels. Speech samples are stored as 16-bit, 48 kHz WAV files, with 22.12 hours of speech per channel. The total size of the data is 28.45 GB.
    Each speaker read 60 items. Text files are stored in Unicode format. All data have been proofread manually.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-55Japanese Mandarin Speech Recognition Corpus (desktop) – digit string (200 people)
    This corpus comprises 8,000 Japanese Mandarin digit strings uttered by 200 speakers of different dialects, ages and educational levels, recorded over 4 channels. Speech samples are stored as 16-bit, 48 kHz WAV files, with 16.22 hours of speech per channel. The total size of the data is 23.23 GB.
    Each speaker read 40 items. Text files are stored in Unicode format. All data have been proofread manually.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-56Japanese Mandarin Speech Recognition Corpus (desktop) – Japanese person name (200 people)
    This corpus comprises 2,000 Japanese Mandarin person names uttered by 200 speakers of different dialects, ages and educational levels, recorded over 4 channels. Speech samples are stored as 16-bit, 48 kHz WAV files, with 4.41 hours of speech per channel. The total size of the data is 5.67 GB.
    Each speaker read 10 items. Text files are stored in Unicode format. All data have been proofread manually.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0228-57Japanese Mandarin Speech Recognition Corpus (desktop) – Japanese place name (200 people)
    This corpus comprises 2,000 Japanese Mandarin place names uttered by 200 speakers of different dialects, ages and educational levels, recorded over 4 channels. Speech samples are stored as 16-bit, 48 kHz WAV files, with 3.09 hours of speech per channel. The total size of the data is 3.96 GB.
    Each speaker read 10 items. Text files are stored in Unicode format. All data have been proofread manually.
    The corpus is intended for use in testing and in telephone natural speech recognition systems.

    ELRA-S0229LC-STAR Turkish phonetic lexicon
    The LC-STAR Turkish lexicon comprises 104,513 words, including a set of 59,213 common words, a set of 45,300 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 7,498 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.

    ELRA-S0230LC-STAR Russian phonetic lexicon
    The LC-STAR Russian lexicon comprises about 128,000 words, including a set of 77,154 common words, a set of 51,074 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 12,012 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.

    ELRA-S0231LC-STAR English-Russian Bilingual Aligned Phrasal lexicon
    The LC-STAR English-Russian Bilingual Aligned Phrasal lexicon comprises 10,519 phrases from the tourist domain. It is based on a list of short sentences obtained by translation from a US-English 10,000 phrase corpus. The lexicon is provided in XML format.

    ELRA-S0232Swiss-German Speecon database
    The Swiss-German Speecon database comprises the recordings of 550 adult Swiss-German speakers and 50 child Swiss-German speakers who uttered respectively over 290 items and 210 items (read and spontaneous).

    ELRA-S0233US English Speecon database
    The US English Speecon database comprises the recordings of 550 adult US English speakers and 50 child US English speakers who uttered respectively over 290 items and 210 items (read and spontaneous).

    ELRA-S0234SALA Spanish Chilean Database
    The SALA Spanish Chilean Database comprises 1,024 Chilean speakers (477 males, 547 females) recorded over the Chilean fixed telephone network.

    ELRA-S0235LC-STAR Hebrew (Israel) phonetic lexicon
    The LC-STAR Hebrew (Israel) phonetic lexicon comprises 109,580 words, including a set of 62,431 common words, a set of 47,149 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 8,677 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.

    ELRA-S0236LC-STAR English-Hebrew (Israel) Bilingual Aligned Phrasal lexicon
    The LC-STAR English-Hebrew (Israel) Bilingual Aligned Phrasal lexicon comprises 10,520 phrases from the tourist domain. It is based on a list of short sentences obtained by translation from a US-English 10,449 phrase corpus. The lexicon is provided in XML format.

    ELRA-S0237LC-STAR US English phonetic lexicon
    The LC-STAR US English phonetic lexicon comprises 102,310 words, including a set of 51,119 common words, a set of 51,111 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 6,807 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.

    ELRA-S0238MIST Multi-lingual Interoperability in Speech Technology database
    The MIST Multi-lingual Interoperability in Speech Technology database comprises the recordings of 74 native Dutch speakers (52 males, 22 females) who each uttered 10 sentences in each of four languages: Dutch, English, French and German. For each language, 5 sentences are identical for all speakers and 5 sentences are unique to each speaker. Dutch sentences are orthographically annotated.

    ELRA-S0239N4 (NATO Native and Non Native) database
    The N4 (NATO Native and Non Native) database comprises speech data recorded in the naval transmission training centers of four countries (Germany, The Netherlands, United Kingdom, and Canada) during naval communication training sessions in 2000-2002. The material consists of native and non-native speakers using NATO Naval English procedure between ships, and reading from a text, "The North Wind and the Sun," in both English and the speaker's native language. The audio material was recorded on DAT and downsampled to 16 kHz, 16-bit, and all the audio files have been manually transcribed and annotated with speaker identities using the Transcriber tool.

    ELRA-S0240French-Canadian Speecon database
    The French-Canadian Speecon database comprises the recordings of 550 adult French-Canadian speakers and 50 child French-Canadian speakers who uttered respectively over 290 items and 210 items (read and spontaneous).

    ELRA-S0241ESTER Corpus
    The ESTER Corpus is a subset of the ESTER Evaluation Package (catalogue ref. ELRA-E0021), which was produced within the French national project ESTER (Evaluation of Broadcast News enriched transcription systems), as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT). The ESTER project made it possible to carry out a campaign for the evaluation of Broadcast News enriched transcription systems for French.
    This corpus includes the material that was used for the ESTER evaluation campaign, excluding the textual data (available in this catalogue and referenced ELRA-W0015 and ELRA-W0023).

    ELRA-S0242SALA II US English database
    The SALA II US English database comprises 4,090 US English speakers (2,017 males, 2,073 females, including some speakers with Hispanic accents) recorded over the United States mobile telephone network.

    ELRA-S0243SpeechDat Catalan FDB database
    The SpeechDat Catalan FDB database contains the recordings of 1,005 Catalan speakers (474 males, 531 females) recorded over the Spanish fixed telephone network.

    ELRA-S0244Japanese Speecon database
    The Japanese Speecon database comprises the recordings of 556 adult Japanese speakers and 51 child Japanese speakers who uttered respectively over 290 items and 210 items (read and spontaneous).

    Prices available upon request. Please contact us.

    ELRA-S0245LC-STAR German Phonetic lexicon
    The LC-STAR German Phonetic lexicon comprises 102,169 entries, including a set of 55,507 common words, a set of 46,662 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 6,763 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.

    ELRA-S0246LC-STAR German Phonetic lexicon in the Touristic Domain
    The LC-STAR German Phonetic lexicon in the Touristic Domain comprises 8,782 entries from the following categories: nouns, adjectives and verbs. For each entry the following information is provided: orthographic form, part-of-speech (POS), phonemic transcription. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.

    ELRA-S0247LC-STAR Standard Arabic Phonetic lexicon
    The LC-STAR Standard Arabic Phonetic lexicon comprises 110,271 entries, including a set of 52,981 common words, a set of 50,135 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 7,155 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.

    ELRA-S0248LC-STAR English-German Bilingual Aligned Phrasal lexicon
    The LC-STAR English-German Bilingual Aligned Phrasal lexicon comprises 10,733 phrases from the tourist domain. It is based on a list of short sentences obtained by translation from a US-English 10,518 phrase corpus. The lexicon is provided in XML format.

    ELRA-S0249TC-STAR English Training Corpora for ASR: Transcriptions of EPPS Speech
    This corpus consists of transcriptions from 92 hours of EPPS (European Parliament Plenary Sessions) speeches held or interpreted in European English (a mixture of native and non-native English). The transcription files are stored in Transcriber XML file format.

    For corresponding recordings, see ELRA-S0251.

    ELRA-S0250TC-STAR English-Spanish Training Corpora for Machine Translation: Aligned Final Text Editions of EPPS
    This corpus consists of 34 million (English) and 38 million (Spanish) running words of bilingual, sentence-segmented and aligned texts in English and Spanish, obtained from the Final Text Editions provided by the European Parliament (from April 1996 to Sept. 2004, Dec. 2004 to May 2005, and Dec. 2005 to May 2006). The data is accompanied by tools for further preprocessing.

    ELRA-S0251TC-STAR English Training Corpora for ASR: Recordings of EPPS Speech
    This corpus consists of around 290 hours of recordings from EPPS (European Parliament Plenary Sessions) speeches held or interpreted in European English, 92 hours of which were annotated (transcribed). The transcriptions are not provided in the present package. Each file contains a single channel with 16-bit resolution at a sample rate of 16 kHz.

    For corresponding transcriptions, see ELRA-S0249.

    ELRA-S0252TC-STAR Spanish Training Corpora for ASR: Recordings of EPPS Speech
    This corpus consists of the recordings of around 283 hours from EPPS (European Parliament Plenary Sessions) speeches held or interpreted in European Spanish (a mixture of native and non-native Spanish). Each file contains a single channel with 16-bit resolution at a sample rate of 16kHz.

    ELRA-S0253TC-STAR English Test Corpora for ASR
    This corpus consists of 70 hours of recordings of EPPS (European Parliament Plenary Sessions) speeches held or interpreted in European English and other European languages. From this corpus, 16 hours of English speeches (native or non-native) were annotated (transcribed). Each speech file contains a single channel with 16-bit resolution at a sample rate of 16 kHz. The transcription files are stored in Transcriber XML file format.

    ELRA-S0254TC-STAR Spanish Test Corpora for ASR
    This corpus consists of 174 hours of recordings of EPPS (European Parliament Plenary Sessions) speeches held or interpreted in European Spanish and other European languages. From this corpus, 16 hours of Spanish speeches were annotated (transcribed). Each audio file contains a single channel with 16-bit resolution at a sample rate of 16kHz. The transcription files are stored in Transcriber XML file format.

    ELRA-S0255LC-STAR Finnish Phonetic lexicon
    The LC-STAR Finnish Phonetic lexicon comprises 189,409 entries, including a set of 144,233 common words, a set of 45,176 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 13,068 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.

    ELRA-S0256LC-STAR Mandarin Chinese Phonetic lexicon
    The LC-STAR Mandarin Chinese Phonetic lexicon comprises 104,368 entries, including a set of 38,098 common words, a set of 57,528 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 7,522 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.

    ELRA-S0257LC-STAR English-Finnish Bilingual Aligned Phrasal lexicon
    The LC-STAR English-Finnish Bilingual Aligned Phrasal lexicon comprises 10,520 phrases from the tourist domain. It is based on a list of short sentences obtained by translation from a US-English 10,518 phrase corpus. The lexicon is provided in XML format.

    ELRA-S0258Orientel United Arab Emirates MCA (Modern Colloquial Arabic)
    This speech database contains the recordings of 750 Arabic speakers recorded over the United Arab Emirates' fixed and mobile telephone network. Each speaker uttered around 48 read and spontaneous items.

    ELRA-S0259Orientel United Arab Emirates MSA (Modern Standard Arabic)
    This speech database contains the recordings of 500 Arabic speakers recorded over the United Arab Emirates' fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.

    ELRA-S0260Orientel English as spoken in the United Arab Emirates
    This speech database contains the recordings of 535 speakers of English recorded over the United Arab Emirates' fixed and mobile telephone network. Each speaker uttered around 51 read and spontaneous items.

    ELRA-S0261Hungarian SpeechDat(E) Database
    This speech database contains the recordings of 1,000 Hungarian speakers recorded over the Hungarian fixed telephone network. Each speaker uttered around 50 read and spontaneous items.

    ELRA-S0262SALA II Portuguese from Brazil database
    The SALA II Portuguese from Brazil database comprises 1,000 Brazilian speakers recorded over the Brazilian mobile telephone network.

    ELRA-S0263SALA II Spanish from Colombia Database
    The SALA II Spanish from Colombia database comprises 1,000 Colombian speakers recorded over the Colombian mobile telephone network.

    ELRA-S0264SALA II US Spanish West
    The SALA II US Spanish West database comprises 1,000 Spanish speakers recorded over the American mobile telephone network.

    ELRA-S0265Dutch from Belgium Speecon Database
    The Dutch from Belgium Speecon database comprises the recordings of 550 adult speakers and 50 child speakers who uttered respectively over 290 items and 210 items (read and spontaneous).

    Prices available upon request. Please contact us.

    ELRA-S0266Dutch from the Netherlands Speecon Database
    The Dutch from the Netherlands Speecon database comprises the recordings of 550 adult speakers and 50 child speakers who uttered respectively over 290 items and 210 items (read and spontaneous).

    Prices available upon request. Please contact us.

    ELRA-S0267Danish Speecon Database
    The Danish Speecon database comprises the recordings of 550 adult speakers and 50 child speakers who uttered respectively over 290 items and 210 items (read and spontaneous).

    Prices available upon request. Please contact us.

    ELRA-S0268UPC-TALP database of isolated meeting-room acoustic events
    This database has been produced within the CHIL Project (Computers in the Human Interaction Loop), in the framework of an Integrated Project (IP 506909) under the European Commission's Sixth Framework Programme. It contains a set of isolated acoustic events that occur in a meeting room environment and that were recorded for the CHIL Acoustic Event Detection (AED) task. The database can be used as training material for AED technologies as well as for testing AED algorithms in quiet environments without temporal sound overlapping. Approximately 60 sounds per sound class were recorded. Ten people (5 men and 5 women) participated in three sessions. During each session a person had to produce a complete set of sounds twice.

    ELRA-S0269LC-STAR Greek Phonetic lexicon
    The LC-STAR Greek Phonetic lexicon comprises 110,708 entries, including a set of 57,519 common words, a set of 45,162 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 8,027 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.

    ELRA-S0270LC-STAR Italian Phonetic lexicon
    The LC-STAR Italian Phonetic lexicon comprises 109,712 entries, including a set of 56,420 common words, a set of 45,253 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 8,039 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.

    ELRA-S0271LC-STAR English-Italian Bilingual Aligned Phrasal lexicon
    The LC-STAR English-Italian Bilingual Aligned Phrasal lexicon comprises 10,466 phrases from the tourist domain. It is based on a list of short sentences obtained by translation from a US-English 10,524 phrase corpus. The lexicon is provided in XML format.

    ELRA-S0272MEDIA speech database for French
    The MEDIA speech database for French was produced by ELDA within the French national project MEDIA (Automatic evaluation of man-machine dialogue systems), as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT). It contains 1,258 transcribed dialogues from 250 adult speakers. The method chosen for the corpus construction process is that of a ‘Wizard of Oz’ (WoZ) system. This consists of simulating a natural language man-machine dialogue. The scenario was built in the domain of tourism and hotel reservation.
    The semantic annotation of the corpus is available in this catalogue and referenced ELRA-E0024 (MEDIA Evaluation Package).

    ELRA-S0273LC-STAR Slovenian Phonetic lexicon
    The LC-STAR Slovenian Phonetic lexicon comprises 110,900 entries, including a set of 64,521 common words, a set of 45,012 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 5,491 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.

    ELRA-S0274LC-STAR English-Slovenian Bilingual Aligned Phrasal lexicon
    The LC-STAR English-Slovenian Bilingual Aligned Phrasal lexicon comprises 12,722 phrases from the tourist domain. It is based on a list of short sentences obtained by translation from a US-English 10,522 phrase corpus. The lexicon is provided in XML format.

    ELRA-S0275Slovenian BNSI Broadcast News Speech Corpus
    This speech database consists of TV news shows (both the evening news, “TV Dnevnik”, and the late night news, “Odmevi”) from the archive of the Slovenian national broadcaster RTV Slovenia. The recordings took place between June 1999 and May 2003. The database comprises a total of 36 hours of recordings, transcribed and manually checked using the Transcriber tool. 1,565 speakers were recorded (1,069 males, 477 females, 19 unspecified).

    ELRA-S0276Swedish EUROM1
    EUROM1_S:  EUROM1 is the first truly multilingual speech database produced in Europe. Over 60 speakers per language pronounced numbers, sentences and isolated words using a close-talking microphone.

    ELRA-S0277SpeechDat Galician Database for the Fixed Telephone Network
    The SpeechDat Galician Database for the Fixed Telephone Network contains the recordings of 653 speakers of Galician recorded over the fixed telephone network. Each speaker uttered around 44 read and spontaneous items.

    ELRA-S0278SmartWeb Handheld Corpus (SHC)
    This corpus contains recordings spoken by 156 speakers in a human-machine query situation. Users were asked to solve several tasks with a spoken query system to the WWW using a smart phone as a portable device in natural environments (office, hall, restaurant, street). Recorded channels are the Bluetooth headset over UMTS (telephone quality), the Bluetooth headset and an additional collar microphone in high quality.
    See also ELRA-S0279 and ELRA-S0280.

    ELRA-S0279SmartWeb Motorbike Corpus (SMC)
    This corpus contains recordings spoken by 36 speakers in a human-machine query situation on a running motorcycle (BMW). Bikers were asked to solve several tasks with a spoken query system to the WWW using an integrated system connected to a speech server via a UMTS connection. Recorded channels are the Bluetooth helmet microphone over UMTS (telephone quality) and, in part, the Bluetooth helmet microphone and an additional neck microphone in high quality.
    See also ELRA-S0278 and ELRA-S0280.

    ELRA-S0280SmartWeb Video Corpus (SVC)
    This multimodal corpus contains 99 recordings, each containing a human-human-machine dialogue: one speaker (who is being recorded) interacts with a human partner as well as with a dialogue system via a smart phone (SmartWeb system).
    See also ELRA-S0278 and ELRA-S0279.

    ELRA-S0281LILA Hindi-L1 database
    The LILA Hindi-L1 database comprises 2,030 Hindi speakers (1,012 males and 1,018 females, all speakers with Hindi as first language) recorded over the Indian mobile telephone network. Each speaker uttered around 60 read and spontaneous items.

    ELRA-S0282-01BAS PHATT 1.0.X (sub-set)
    The Ph@ttSessionz speech database contains recordings of 864 adolescent speakers of German (age range 12-20). The recordings were performed via the WWW in public schools (Gymnasium) in 41 locations in Germany. Recordings were done with SpeechRecorder in selected schools in the years 2005-2007. Both channels, the headset and the desktop microphone, were recorded in high quality.

    The BAS PHATT corpus is available in two versions: BAS PHATT 1.0.X (sub-set, ELRA-S0282-01) and BAS PHATT 1.1.X (complete corpus, ELRA-S0282-02).

    BAS PHATT 1.0.X contains 41 items.

    See also ELRA-S0282-02.

    ELRA-S0282-02BAS PHATT 1.1.X (complete corpus)
    The Ph@ttSessionz speech database contains recordings of 864 adolescent speakers of German (age range 12-20). The recordings were performed via the WWW in public schools (Gymnasium) in 41 locations in Germany. Recordings were done with SpeechRecorder in selected schools in the years 2005-2007. Both channels, the headset and the desktop microphone, were recorded in high quality.

    The BAS PHATT corpus is available in two versions: BAS PHATT 1.0.X (sub-set, ELRA-S0282-01) and BAS PHATT 1.1.X (complete corpus, ELRA-S0282-02).

    BAS PHATT 1.1.X contains 138 items.

    See also ELRA-S0282-01.

    ELRA-S0283Laboratory Conditions Czech Audio-Visual Speech Corpus
    UWB-05-LCAVC:  This is an audio-visual speech database for training and testing of Czech audio-visual continuous speech recognition systems. The corpus consists of about 25 hours of audio-visual recordings of 65 speakers in laboratory conditions. Data collection was done with static illumination, and recorded subjects were instructed to remain static. The average speaker age was 22 years. Speakers were asked to read 200 sentences each (50 common to all speakers and 150 specific to each speaker).

    ELRA-S0284Czech Audio-Visual Speech Corpus for Recognition with Impaired Conditions
    UWB-07-ICAVR I:  This is an audio-visual speech database for training and testing of Czech audio-visual continuous speech recognition systems, collected under impaired illumination conditions. The corpus consists of about 20 hours of audio-visual recordings of 50 speakers in laboratory conditions. Recorded subjects were instructed to remain static. The illumination varied, and portions of each speaker's material were recorded under several different conditions, such as full illumination or illumination from one side (left or right) only. These conditions make the database usable for training lip-/head-tracking systems under various illumination conditions independently of the language. Speakers were asked to read 200 sentences each (50 common to all speakers and 150 specific to each speaker).

    ELRA-S0285Czech Sign Language Corpus for Recognition – Amateur Signer
    UWB-06-SLR-A:  This is an amateur sign-language database comprising 25 signs from Czech sign language. 15 signers (4 women and 11 men) carried out 5 repetitions of each sign and were recorded from 3 different views. The first is a frontal view of the upper part of the body. The data contain 5,685 AVI files (one per sign performance), using up 7 GB of disk space, and are stored on DVDs.

    ELRA-S0286Czech Sign Language Corpus for Recognition – Professional Signer
    UWB-07-SLR-P:  This database comprises 378 signs from Czech sign language as performed by 4 everyday sign-language users (4 women, 2 of them deaf). 5 repetitions of each sign were recorded from 3 different views. The data contain 21,000 AVI files (one per sign performance), using up 20 GB of disk space, and are stored on DVDs.

    ELRA-S0287Cantonese Speecon database
    The Cantonese Speecon database comprises the recordings of 550 adult Cantonese speakers and 50 child Cantonese speakers who uttered respectively over 290 items and 210 items (read and spontaneous).

    ELRA-S0288Thai Speecon database
    The Thai Speecon database comprises the recordings of 552 adult Thai speakers and 50 child Thai speakers who uttered respectively over 290 items and 210 items (read and spontaneous).

    ELRA-S0289OrienTel Jordan MCA (Modern Colloquial Arabic) database
    This speech database contains the recordings of 757 Jordanian speakers recorded over the Jordanian fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.

    ELRA-S0290OrienTel Jordan MSA (Modern Standard Arabic) database
    This speech database contains the recordings of 556 Jordanian speakers recorded over the Jordanian fixed and mobile telephone network. Each speaker uttered around 49 read and spontaneous items.

    ELRA-S0291OrienTel English as spoken in Jordan database
    This speech database contains the recordings of 578 Jordanian speakers of English recorded over the Jordanian fixed and mobile telephone network. Each speaker uttered around 47 read and spontaneous items.

    ELRA-S0292Danish EUROM1
    EUROM1_D:  EUROM1 is the first truly multilingual speech database produced in Europe. Over 60 speakers per language pronounced numbers, sentences and isolated words using a close-talking microphone.

    ELRA-S0293The HIWIRE database, a noisy and non-native English speech corpus for cockpit communication
    The database contains 8,099 English utterances pronounced by non-native speakers (31 French, 20 Greek, 20 Italian, and 10 Spanish speakers). The collected utterances correspond to human input in a command and control aeronautics application. The data was recorded in a studio with a close-talking microphone, and real noise recorded in an airplane cockpit was artificially added to the data. The signals are provided in clean (studio recordings with close-talking microphone), low, mid and high noise conditions. The three noise levels correspond approximately to signal-to-noise ratios of 10 dB, 5 dB and -5 dB respectively.

    ELRA-S0294CHIEDE Corpus: a spontaneous child language corpus of Spanish
    The spontaneous child language corpus, CHIEDE, consists of 58,163 words in 30 texts, with 7 hours and 53 minutes of recordings and 59 child participants. About a third of the whole corpus is formed by child language and the remaining two thirds by adult speech. The main feature of CHIEDE is the spontaneity of the interactions: the texts are recordings of communicative situations in their natural context.

    ELRA-S0295LILA Korean database
    The LILA Korean database comprises 1,000 Korean speakers (500 males and 500 females) recorded over the Korean mobile telephone network. Each speaker uttered around 60 read and spontaneous items.

    ELRA-S0296FBK-Irst database of isolated meeting-room acoustic events
    This database has been produced within the CHIL Project (Computers in the Human Interaction Loop), in the framework of an Integrated Project (IP 506909) under the European Commission's Sixth Framework Programme. It contains a set of isolated acoustic events that occur in a meeting room environment and that were recorded for the CHIL Acoustic Event Detection (AED) task. The database can be used as training material for AED algorithms in quiet environments without temporal sound overlapping. The database contains 16 semantic classes of acoustic events. 9 people participated in the recordings. 3 experiments were recorded on different days, each one composed of 4 sessions and performed by 4 people.

    The database is made available freely via FTP only.

    ELRA-S0297Hungarian Speecon database
    The Hungarian Speecon database comprises the recordings of 555 adult Hungarian speakers and 50 child Hungarian speakers who uttered respectively over 290 items and 210 items (read and spontaneous).

    ELRA-S0298Czech Speecon database
    The Czech Speecon database comprises the recordings of 550 adult Czech speakers and 50 child Czech speakers who uttered respectively over 290 items and 210 items (read and spontaneous).

    ELRA-S0299Alcohol Language Corpus (BAS ALC)
    ALC contains recordings of 88 German speakers who are either intoxicated or sober. The type of speech ranges from read single digits to full conversation style. Recordings were made during a drinking test in which speakers drank beer or wine to reach a self-chosen level of alcoholic intoxication. Recordings were performed in two stationary automobiles. In the intoxicated state 30 items were sampled from each speaker, while in the sober state 60 items were recorded.

    ELRA-S0300SIGNUM Database
    The SIGNUM Database contains both isolated and continuous utterances of various signers. The corpus was recorded on video. For quick random access to individual frames, each video clip is stored as a sequence of images. The vocabulary comprises 450 basic signs in German Sign Language (DGS) representing different word types. Based on this vocabulary, overall 780 sentences were constructed. Each sentence ranges from two to eleven signs in length. The entire corpus was performed once by 25 native signers of different sexes and ages. One of them was chosen to be the so-called reference signer. His performances were recorded three times.

    ELRA-S0301Norwegian EUROM1
    EUROM1_N:  EUROM1 is the first truly multilingual speech database produced in Europe. Over 60 speakers per language pronounced numbers, sentences and isolated words using a close-talking microphone.

    ELRA-S0302TC-STAR female baseline voice: Laura
    Laura contains the recordings of one female English (British) speaker recorded in a noise-reduced room through a headset microphone. It consists of the recordings and annotations of read text material of approximately 10 hours of speech for baseline applications (Text-to-Speech systems).

    The TC-STAR male baseline voice: Ian is also available via ELRA under reference ELRA-S0303.

    ELRA-S0303TC-STAR male baseline voice: Ian
    Ian contains the recordings of one male English (British) speaker recorded in a noise-reduced room through a headset microphone. It consists of the recordings and annotations of read text material of approximately 10 hours of speech for baseline applications (Text-to-Speech systems).

    The TC-STAR female baseline voice: Laura is also available via ELRA under reference ELRA-S0302.

    ELRA-S0304SpeechDat(M) Italian Mobile Network Speech Database
    This speech database contains the recordings of 342 Italian speakers recorded over the Italian mobile telephone network. Each speaker uttered around 40 read and spontaneous items.

    ELRA-S0305EPAC Corpus: orthographic transcriptions
    This corpus consists of approx. 100 hours of manual orthographic transcriptions, which were produced from 1,677 hours of non-transcribed recordings from the ESTER Evaluation Campaign (Technolangue programme). The corpus also includes automatic transcriptions of the full 1,677 hours.

    ELRA-S0306TC-STAR Transcriptions of Spanish Parliamentary Speech
    This corpus consists of the transcriptions of 100 hours of Spanish Parliamentary speech. These comprise 38:24 hours of speech recorded during plenary sessions and commissions between September 2004 and December 2004, and 61:53 hours of speech recorded in the parliamentary plenary sessions as well as recordings of interpreters between May 2004 and January 2005.

    ELRA-S0307BABEL Polish database
    The BABEL Polish Database is a speech database that was produced by a research consortium funded by the European Union under the COPERNICUS programme (COPERNICUS Project 1304). It consists of the basic "common" set which contains the Many Talker Set (30 males, 30 females), the Few Talker Set (5 males, 5 females), the Very Few Talker Set (1 male, 1 female).

    ELRA-S0308Egyptian Arabic Speecon database
    The Egyptian Arabic Speecon database comprises the recordings of 550 adult Egyptian speakers and 50 child Egyptian speakers who uttered respectively over 290 items and 210 items (read and spontaneous).

    ELRA-S0309TC-STAR Spanish Baseline Female Speech Database
    This database contains the recordings of one female Spanish speaker recorded in a noise-reduced room simultaneously through a close talk microphone, a mid distance microphone and a laryngograph signal. It consists of the recordings and annotations of read text material of approximately 10 hours of speech for baseline applications (Text-to-Speech systems).

    The TC-STAR Spanish Baseline Male Speech Database is also available via ELRA under reference ELRA-S0310.

    ELRA-S0310TC-STAR Spanish Baseline Male Speech Database
    This database contains the recordings of one male Spanish speaker recorded simultaneously through a close talk microphone, a mid distance microphone and a laryngograph signal in a noise-reduced room. It consists of the recordings and annotations of read text material of approximately 10 hours of speech for baseline applications (Text-to-Speech systems).

    The TC-STAR Spanish Baseline Female Speech Database is also available via ELRA under reference ELRA-S0309.

    ELRA-S0311TC-STAR Bilingual Voice-Conversion Spanish Speech Database
    4 hours and 80 minutes of speech as spoken by 2 female speakers and 2 male speakers, covering both mimics and parallel voice conversion data.

    ELRA-S0312TC-STAR Bilingual Voice-Conversion English Speech Database
    4 hours and 80 minutes of speech as spoken by 2 female speakers and 2 male speakers, covering both mimics and parallel voice conversion data.

    ELRA-S0313TC-STAR Bilingual Expressive Speech Database
    8 hours of speech as spoken by 2 female speakers and 2 male speakers for each language (English and Spanish).

    ELRA-S0314LILA Marathi database
    The LILA Marathi database comprises 2,002 Marathi speakers (992 males and 1010 females) recorded over the Indian mobile telephone network. Each speaker uttered around 46 read and spontaneous items.

    ELRA-S0315A-SpeechDB
    A-SpeechDB© is an Arabic speech database which contains about 20 hours of continuous speech recorded through one desktop omni microphone by 205 native speakers from Egypt (about 30% of females and 70% of males), aged between 20 and 45. Automatically generated transcriptions are provided with a manually revised version for each sentence.

    ELRA-S0316SmartKom Home
    SKH:  Release SKH 1.0 contains 130 recordings in the technical setup (“scenario”) SmartKom Home which should be an intelligent communication assistant for the private environment. Naive users were asked to test a “prototype” for a market study not knowing that the system was in fact controlled by two human operators. They were asked to solve two tasks in a period of 4.5 minutes while they were left alone with the system.

    ELRA-S0317SmartKom Mobil
    SKM:  Release SKM 1.0 contains 146 recordings in the technical setup (“scenario”) SmartKom Mobil which is a portable PDA equipped with a net link and additional intelligent communication devices. Naive users were asked to test a “prototype” for a market study not knowing that the system was in fact controlled by two human operators. They were asked to solve two tasks in a period of 4.5 minutes while they were left alone with the system.

    ELRA-S0318SmartKom Audio
    SKAUDIO:  Release SKAUDIO 1.0 contains all audio channel recordings of the SmartKom corpora SmartKom Public (cf. ELRA-S0136), SmartKom Home (cf. ELRA-S0316) and SmartKom Mobil (cf. ELRA-S0317).

    ELRA-S0319GlobalPhone Bulgarian
    The GlobalPhone corpus was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 20 spoken languages: Arabic, Bulgarian, Chinese-Mandarin, Chinese-Shanghai, Croatian, Czech, French, German, Hausa, Japanese, Korean, Polish, Portuguese (Brazilian), Russian, Spanish (Latin America), Swedish, Tamil, Thai, Turkish, Vietnamese. In each language about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary (up to 65,000 words). The read articles cover national and international political news as well as economic news.

    ELRA-S0320GlobalPhone Polish
    The GlobalPhone corpus was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 20 spoken languages: Arabic, Bulgarian, Chinese-Mandarin, Chinese-Shanghai, Croatian, Czech, French, German, Hausa, Japanese, Korean, Polish, Portuguese (Brazilian), Russian, Spanish (Latin America), Swedish, Tamil, Thai, Turkish, Vietnamese. In each language about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary (up to 65,000 words). The read articles cover national and international political news as well as economic news.

    ELRA-S0321GlobalPhone Thai
    The GlobalPhone corpus was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 20 spoken languages: Arabic, Bulgarian, Chinese-Mandarin, Chinese-Shanghai, Croatian, Czech, French, German, Hausa, Japanese, Korean, Polish, Portuguese (Brazilian), Russian, Spanish (Latin America), Swedish, Tamil, Thai, Turkish, Vietnamese. In each language about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary (up to 65,000 words). The read articles cover national and international political news as well as economic news.

    ELRA-S0322GlobalPhone Vietnamese
    The GlobalPhone corpus was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks. The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 20 spoken languages: Arabic, Bulgarian, Chinese-Mandarin, Chinese-Shanghai, Croatian, Czech, French, German, Hausa, Japanese, Korean, Polish, Portuguese (Brazilian), Russian, Spanish (Latin America), Swedish, Tamil, Thai, Turkish, Vietnamese. In each language about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary (up to 65,000 words). The read articles cover national and international political news as well as economic news.

    ELRA-S0323European Parliament Interpretation Corpus (EPIC)
    The EPIC corpus is a parallel corpus of European Parliament speeches and their corresponding simultaneous interpretations. This corpus includes source speeches in Italian, English and Spanish and interpreted speeches in all possible combinations and directions. It contains a total of 357 speeches (177,295 words). The corpus has been orthographically transcribed. Non-tagged transcripts in text format are also available.

    ELRA-S0324Catalan-SpeechDat For the Fixed Telephone Network Database
    This speech database contains the recordings of 2000 Catalan speakers who called from fixed telephones and who were recorded over the fixed PSTN using an ISDN-BRI interface. Each speaker uttered around 50 read and spontaneous items. The speech database follows the specifications made within the SpeechDat (II) project. The database was validated by UVIGO. The Catalan-SpeechDat for the Fixed Telephone Network Database was funded by the Catalan Government.

    ELRA-S0325Catalan-SpeechDat for the Mobile Telephone Network Database
    This speech database contains the recordings of 2000 Catalan speakers who called from GSM telephones and who were recorded over the fixed PSTN using an ISDN-BRI interface. Each speaker uttered around 50 read and spontaneous items. The speech database follows the specifications made within the SpeechDat (II) project. The database was validated by UVIGO. The Catalan-SpeechDat for the Mobile Telephone Network Database was funded by the Catalan Government.

    ELRA-S0326Catalan SpeechDat-Car database
    The Catalan SpeechDat-Car database contains the in-car recordings of 300 speakers who uttered around 120 read and spontaneous items. Each speaker recorded two sessions. Recordings have been made through 4 different channels, via in-car microphones (1 close-talk microphone, 3 far-talk microphones). The 300 Catalan speakers were selected from 5 different dialectal regions and are balanced in gender and age groups. The database was validated by UVIGO. The Catalan-SpeechDat-Car Database was funded by the Catalan Government.

    ELRA-S0327Catalan Speecon database
    The Catalan Speecon database comprises the recordings of 550 adult Catalan speakers who uttered over 290 items (read and spontaneous). The data were recorded over 4 microphone channels in 4 recording environments (office, entertainment, car, public place). The speech database follows the specifications made within the EU-funded Speecon project. The database was validated by UVIGO. The Catalan-Speecon Database was funded by the Catalan Government.

    ELRA-S0328Spanish EUROM.1
    EUROM1 is a multilingual European speech database. It contains over 60 speakers per language who pronounced numbers, sentences, isolated words ... using a close-talking microphone in an anechoic room. Equivalent corpora for each of the European languages exist already, with the same number of speakers selected in the same way, and recorded in the same conditions with common file formats.

    ELRA-S0329Emotional speech synthesis database
    This database contains the recordings of one male and one female professional Spanish speaker recorded in a noise-reduced room. It consists of recordings and annotations of read text material in neutral style plus six MPEG expressions, all in fast, slow, soft and loud speech styles. The text material is composed of 184 items including phonetically balanced sentences, digits and isolated words. The text material was the same for all the modes and styles, giving a total of 3h 59min of recorded speech for the male speaker and 3h 53min for the female speaker. The Emotional speech synthesis database was created within the scope of the EU-funded Interface project.

    ELRA-S0330FESTCAT Catalan TTS baseline male speech database
    This database contains the recordings of one male Catalan professional speaker recorded in a noise-reduced room simultaneously through a close talk microphone, a mid distance microphone and a laryngograph signal. It consists of the recordings and annotations of read text material of approximately 10 hours of speech for baseline applications (Text-to-Speech systems). The FESTCAT Catalan TTS Baseline Male Speech Database was created within the scope of the FESTCAT project, funded by the Catalan Government.

    ELRA-S0331FESTCAT Catalan TTS baseline female speech database
    This database contains the recordings of one female Catalan professional speaker recorded in a noise-reduced room simultaneously through a close talk microphone, a mid distance microphone and a laryngograph signal. It consists of the recordings and annotations of read text material of approximately 10 hours of speech for baseline applications (Text-to-Speech systems). The FESTCAT Catalan TTS Baseline Female Speech Database was created within the scope of the FESTCAT project, funded by the Catalan Government.

    ELRA-S0332FESTCAT Catalan TTS baseline speech database - 8 speakers
    This database contains the recordings of four female and four male Catalan professional speakers recorded in a noise-reduced room simultaneously through a close talk microphone, a mid distance microphone and a laryngograph signal. It consists of the recordings and annotations of read text material of approximately 1 hour of speech per speaker for baseline applications (Text-to-Speech systems). The FESTCAT Catalan TTS baseline speech database - 8 speakers was created within the scope of the FESTCAT project funded by the Catalan Government.

    ELRA-S0333Spanish Festival HTS models - male speech
    This database contains the Festival HTS models trained with 10h of speech from the TC-STAR Spanish Baseline Male Speech Database (ELRA-S0310).

    ELRA-S0334Spanish Festival HTS models - female speech
    This database contains the Festival HTS models trained with 10h of speech from the TC-STAR Spanish Baseline Female Speech Database (ELRA-S0309).

    ELRA-S0335Bilingual (Spanish-English) Speech synthesis HTS models
    This database contains bilingual (English and Spanish) Festival HTS models. Models were trained with 9h of speech from 2 female bilingual speakers and 2 male bilingual speakers. Each speaker recorded 2h 15 min per language. The speech data can be found in the TC-STAR Bilingual Voice-Conversion Spanish Speech Database (ELRA-S0311) and in the TC-STAR Bilingual Expressive Speech Database (ELRA-S0313).

    ELRA-S0336Spanish Festival voice male
    This database contains a unit-selection voice (clunits technology) for use in the Festival Synthesis System (tested on version 2.0.95:beta April 2010). The voice was built using a subset of speech derived from the TC-STAR Spanish Baseline Male Speech Database: mid distance microphone, 2h26m, 16 kHz, 16 bits. The database was created within the scope of the METANET4U project funded by the European Commission.

    ELRA-S0337Spanish Festival voice female
    This database contains a unit-selection voice (clunits technology) for use in the Festival Synthesis System (tested on version 2.0.95:beta April 2010). The voice was built using a subset of speech derived from the TC-STAR Spanish Baseline Female Speech Database: mid distance microphone, 4h25m, 16 kHz, 16 bits. The database was created within the scope of the METANET4U project funded by the European Commission.

    ELRA-S0338ESTER 2 Corpus
    The ESTER 2 Corpus, produced within the ESTER 2 evaluation campaign, consists of a manually transcribed radio broadcast news corpus amounting to about 100 hours and quick transcriptions of African radio broadcasts amounting to about 6 hours. An annotation of named entities is provided within the development data (about 6 hours).

    ELRA-S0339Acoustic database for Polish unit selection speech synthesis
    This database contains parliamentary statements and newspaper reviews read by a semi-professional male speaker. It consists of a selection of 2150 annotated and manually verified sentences, including 100 rare phonemes in words. The total duration of the recordings is 3.45 hours. The database is phonetically annotated and manually corrected, and includes a lexicon of 11761 words with phonetic transcriptions.

    ELRA-S0340GlobalPhone French Pronunciation Dictionary
    The GlobalPhone pronunciation dictionaries contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The French dictionary contains 36837 entries (20710 words).

    ELRA-S0341GlobalPhone German Pronunciation Dictionary
    The GlobalPhone pronunciation dictionaries contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The German dictionary contains 48979 entries (46035 words).

    ELRA-S0342Acoustic database for Polish concatenative speech synthesis
    This database consists of 1443 nonsense words including all the diphones for the Polish language. The database includes information such as: the name of the diphone, context of the diphone, phonetic transcription in SAMPA, identifier of the wave file where it is placed, and three numbers: the beginning, the middle and the end of the diphone.

    ELRA-S0343VERIF1DE
    The speech corpus VERIF1DE contains 20 recordings (sessions) for each of 150 German speakers, made over the telephone network (10 sessions over the fixed network and 10 sessions over GSM). Each session contains 40 single recordings, mainly speech read from a prompt sheet.

    ELRA-S0344LILA Hindi Belt database
    The LILA Hindi Belt database comprises 2,023 Hindi speakers (1,011 males and 1,012 females, all speakers with Hindi as first language) recorded over the Indian mobile telephone network. Each speaker uttered 83 read and spontaneous items.

    ELRA-S0345Spoken Portuguese Corpus
    The Spoken Portuguese corpus consists of a total of 86 recordings (8h44m), collected among sociolinguistically diverse speakers having Portuguese as mother tongue or as second language. The corpus was recorded in a situation of spontaneous oral communication, on different themes of everyday life, with speakers of different ages and social and professional backgrounds. The corpus consists of audio files in .wav format, aligned transcriptions in XML Exmaralda format and transcriptions in plain text.

    ELRA-S0346Fundamental Portuguese Corpus
    The Fundamental Portuguese Corpus is a corpus of spoken language, collected between 1970 and 1974, composed of 1800 recordings (500 hours) made in Continental Portugal and the Islands. Of these 1800 conversations, a sample was selected and transcribed. The corpus consists of audio files in .wav format, aligned transcriptions in XML Exmaralda format and transcriptions in plain text.

    ELRA-S0347GlobalPhone Hausa
    The GlobalPhone Hausa corpus contains 7,895 utterances spoken by 33 male and 69 female speakers in the age range of 16 to 60 years. Native speakers of Hausa were asked to read prompted sentences of newspaper articles. The entire collection took place in 5 different locations in Cameroon. The speech data contains a variety of accents: Maroua, Douala, Yaounde, Bafoussam, Ngaoundere, and Nigeria.

    ELRA-S0348GlobalPhone Japanese Pronunciation Dictionary
    The GlobalPhone pronunciation dictionaries contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The Japanese dictionary contains 18094 entries.

    ELRA-S0349Quaero Broadcast News Extended Named Entity corpus
    This corpus consists of the manual annotation of (i) the ESTER 2 (see also ELRA-S0338) manual transcription corpus and (ii) the Quaero Speech Recognition Evaluation corpus (manual and automatic transcriptions coming from 3 different ASR systems). The corpus is fully manually annotated according to the Quaero extended and structured named entity definition.

    ELRA-S0350GlobalPhone Arabic Pronunciation Dictionary
    The GlobalPhone pronunciation dictionaries contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The Arabic dictionary contains 29230 entries (27059 words).

    ELRA-S0351GlobalPhone Bulgarian Pronunciation Dictionary
    The GlobalPhone pronunciation dictionaries contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The Bulgarian dictionary contains 20193 entries.

    ELRA-S0352GlobalPhone Czech Pronunciation Dictionary
    The GlobalPhone pronunciation dictionaries contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The Czech dictionary contains 33049 entries (32942 words).

    ELRA-S0353GlobalPhone Hausa Pronunciation Dictionary
    The GlobalPhone pronunciation dictionaries contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The Hausa dictionary contains 42662 entries (42079 words).

    ELRA-S0354GlobalPhone Polish Pronunciation Dictionary
    The GlobalPhone pronunciation dictionaries contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The Polish dictionary contains 36484 entries.

    ELRA-S0355GlobalPhone Portuguese (Brazilian) Pronunciation Dictionary
    The GlobalPhone pronunciation dictionaries contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The Portuguese (Brazilian) dictionary contains 54146 entries (54130 words).

    ELRA-S0356GlobalPhone Swedish Pronunciation Dictionary
    The GlobalPhone pronunciation dictionaries contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The Swedish dictionary contains about 25000 entries.

    ELRA-S0358GlobalPhone Croatian Pronunciation Dictionary
    The GlobalPhone pronunciation dictionaries contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The Croatian dictionary contains 23497 entries (20628 words).

    ELRA-S0359GlobalPhone Russian Pronunciation Dictionary
    The GlobalPhone pronunciation dictionaries contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The Russian dictionary contains 23497 entries (20628 words).

    ELRA-S0360GlobalPhone Spanish (Latin American) Pronunciation Dictionary
    The GlobalPhone pronunciation dictionaries contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The Spanish (Latin American) dictionary contains 43264 entries (33960 words).

    ELRA-S0361GlobalPhone Turkish Pronunciation Dictionary
    The GlobalPhone pronunciation dictionaries contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The Turkish dictionary contains 31330 entries (31087 words).

    ELRA-S0362GlobalPhone Vietnamese Pronunciation Dictionary
    The GlobalPhone pronunciation dictionaries contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The Vietnamese dictionary contains 38504 entries (29974 words).

    ELRA-S0363GlobalPhone Chinese-Mandarin Pronunciation Dictionary
    The GlobalPhone pronunciation dictionaries contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The Chinese-Mandarin dictionary contains 73388 pronunciations.

    ELRA-S0364GlobalPhone Korean Pronunciation Dictionary
    The GlobalPhone pronunciation dictionaries contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The Korean dictionary contains 3500 syllables.

    ELRA-S0365aGender
    aGender contains speech sample recordings over public telephone lines with read and (semi-)spontaneous speech. Native German speakers called a voice portal from their private phone, read text and answered some open questions. The corpus contains the voices of 945 German speakers (approx. minimum of 100 speakers per class), each delivering 18 speech items in up to six different sessions.

    ELRA-S0366LECTRA (LECture TRAnscriptions in European Portuguese)
    This corpus is composed of the audio and the manual transcriptions from seven 1-semester University courses in Portuguese. The corpus contains a total of 28 hours of audio speech that were manually transcribed by several trained annotators. The corpus is comprised of technical University lectures.

    ELRA-S0367CORAL Corpus
    The CORAL Corpus is a collection of spoken dialogues in European Portuguese. It consists of 56 dialogues about a predetermined subject: maps. One of the participants (the giver) has a map with some landmarks and a route drawn between them; the other (the follower) also has landmarks, but no route, and consequently must reconstruct it. Only orthographic transcription was done for the whole corpus. A pilot recording was annotated at several levels.

    ELRA-S0368Nepali Spoken Corpus
    The Nepali Spoken Corpus contains audio recordings of different social activities, made in their natural settings as far as possible, together with phonologically transcribed and annotated texts and information about the participants. A total of 17 types of activity were recorded. The total duration of the recorded material is 31 hours and 26 minutes.

    ELRA-S0369CLIPS_MT_MANUAL
    CLIPS_MT_MANUAL is a sub-corpus of the original Italian CLIPS corpus (Corpora e Lessici dell'Italiano Parlato e Scritto). This corpus contains 3228 inspected and partially repaired WAV signal files, each containing one dialogue turn (*.wav), 3228 corrected original CLIPS annotation files (*.acs, *.phn, *.std, *.wrd), 3228 BAS Partitur files containing the annotation tiers ORT, KAN and SAP (*.par), 3228 EMU database annotation files (*.vot, *.hlb) covering 30 maptask dialogues performed by 30 speakers (each speaker pair performing two different map tasks) recorded in 15 different locations in Italy in 2000-2004.

    ELRA-S0370MoveOn Speech and Noise Corpus
    The MoveOn Speech and Noise Corpus was recorded under the extreme conditions of the motorcycle environment within the MoveOn project. The speech utterances are in British English and address command-and-control and template-driven dialog systems, with a focus on, but not limited to, the police domain. The major part of the corpus comprises noisy speech and environmental noise recorded on a motorcycle. Several clean speech recording sessions with the same recording setup (including the motorcycle helmet) in an office environment complete the corpus.

    ELRA-S0371PortMedia French and Italian corpus
    This corpus contains 700 transcribed dialogues from about 140 French speakers and 604 transcribed dialogues from about 150 Italian speakers (several dialogues per speaker). The corpus was built using a ‘Wizard of Oz’ (WoZ) system, which simulates a natural-language man-machine dialogue. The scenario was built in the domain of tourist information and reservation. A manual transcription and semantic annotation of the corpus are provided with the corresponding wave files.

    ELRA-S0372GlobalPhone Thai Pronunciation Dictionary
    The GlobalPhone pronunciation dictionaries contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The Thai dictionary is divided into two sets: a small set with 12,420 pronunciation entries for 12,420 different words, which does not include pronunciation variants, and a larger set with 25,570 pronunciation entries for 22,462 different word units, which includes 3,108 entries of up to four pronunciation variants.

    ELRA-S0373GVLEX tales corpus
    The GVLEX tales corpus consists of 89 written tales, manually annotated with structures, speech turns, speakers and phrases, 7 of which were annotated by 2 human annotators (96 annotated texts in total); 12 tales read by a professional, transcribed and manually annotated, including audio files; and annotation and viewing software developed within the GV-LEX project.

    ELRA-S0374FoxPersonTracks: a Benchmark for Person Re-Identification from TV Broadcast Shows
    FoxPersonTracks is a person track dataset dedicated to person re-identification. The dataset is built from a set of real-life TV shows broadcast on the French TV channels BFMTV and LCP, provided during the REPERE challenge. It contains a total of 4,604 persontracks (short video sequences featuring an individual with no background) from 266 persons. The dataset also provides re-identification results using space-time histograms as a baseline, together with an evaluation tool to ease comparison with other re-identification methods.

    ELRA-S0375GlobalPhone Swahili
    The GlobalPhone Swahili corpus contains 7,728 utterances spoken by 70 speakers. Native speakers of Swahili were asked to read prompted sentences from newspaper articles. The entire collection took place in Nairobi, Kenya.

    ELRA-S0376GlobalPhone Swahili Pronunciation Dictionary
    The GlobalPhone pronunciation dictionaries contain the pronunciations of all word forms found in the transcription data of the GlobalPhone speech & text database. The Swahili dictionary contains 10664 entries.