R&D Catalogue of Language Resources

    Considering the needs expressed by several academic institutions in the Human Language Technology field, ELDA is pleased to offer access to a version of its Catalogue of Language Resources dedicated to academic research. On various occasions, in discussions with members of the academic R&D community, we concluded that it is important to provide quick and easy access to a list of resources produced specifically for R&D purposes in Human Language Technology.

    We therefore now provide a list of Language Resources, available at very affordable prices and intended for research use. To make this list easy to access, we have kept the interface and browsing tools of the ELDA catalogue. You may, of course, return to the full version of the catalogue at any time. Very soon, we will also implement an advanced search that will let you browse the catalogue using predefined selection criteria, such as the type of resource or the price (among many other criteria).

    As in the full version of the catalogue, the language resources available here are divided into four categories: "Speech and Related Resources", "Written Resources", "Terminological Resources", and "Multimodal/Multimedia Resources".

    1/ Spoken LRs

    a - Telephone recordings
    The databases catalogued in this section were produced from speaker recordings made over the (fixed or mobile) telephone network, or through a microphone. You will find speech resources recorded in various environments and covering a large number of European and non-European languages, e.g. the databases produced within the SpeechDat project.

    b - Desktop/Microphone recordings
    The databases catalogued in this section were produced from speaker recordings made through a microphone, e.g. the databases produced within the BABEL project.

    c - Broadcast Resources
    The databases catalogued in this section were produced from speaker recordings broadcast over radio, television or the internet, such as the Italian Broadcast News Corpus.

    d - Speech Related Resources
    This section contains pronunciation and phonetic lexicons, such as the BDLEX, PHONOLEX, and MHATLEX databases.

    2/ Written LRs

    a - Corpora
    This section contains monolingual and multilingual corpora, parallel or not, some of which are annotated. Examples include the corpora developed within the MULTEXT project, the Multilingual and Parallel Corpora (MLCC), French scientific corpora, and newspaper corpora in Arabic.

    b - Monolingual lexicons
    This section contains various types of monolingual dictionaries, e.g. a dictionary of French verbs, the Japanese word dictionary, and PAROLE lexicons in many languages.

    c - Multilingual lexicons
    Here you can find either bilingual or multilingual dictionaries and lexicons, such as the EuroWordNet databases.

    3/ Terminological LRs

    Monolingual, bilingual and multilingual terminological databases are available. They cover a large number of specialised domains, e.g. automobile engineering, insurance, linguistics, finance, etc., in a wide variety of languages.

    4/ Multimodal/Multimedia LRs

    The resources in this section were produced using several modalities, including speech. One example is the database produced within the M2VTS project.


    New Resources
  • ELRA-S0393 : Persian Speech Corpus
    This speech corpus was recorded through
    a "Blubbery" model microphone by one
    male speaker in Persian (Tehrani accent)
    in a professional studio. Speech
    synthesized using this corpus has
    produced a high-quality, natural
    voice. It consists of 399 utterances for
    a total of about 2.5 hours, with
    orthographic and phonetic

  • ELRA-S0391 : The FAME! Speech Corpus
    This Frisian corpus consists of 203
    audio segments, each approximately 5
    minutes long, extracted from various
    radio programs covering a time span of
    almost 50 years (1966-2015), adding a
    longitudinal dimension to the database.
    The content of the recordings is very
    diverse, including radio programs about
    culture, history, literature, sports,
    nature, agriculture, politics, society
    and languages. There are 309 identified
    speakers in the FAME! Speech Corpus, 21
    of whom appear at least 3 times in the
    database. The total duration of the
    manually annotated radio broadcasts sums
    up to 18 hours, 33 minutes and 57

  • ELRA-W0117 : Danish Propbank
    The Danish Propbank (DPB) is an
    87,000-token treebank from a variety of
    genres, annotated with morphosyntactic
    and semantic information, namely
    propositions/frames with VerbNet classes
    and semantic roles for both arguments
    and satellites. There are over 12,000
    frames with 32,000 role instances. The
    corpus has also been annotated with 20
    Named Entity classes and a 200-category
    semantic ontology for nouns.

  • ELRA-E0046 : ETAPE Evaluation Package
    The ETAPE Evaluation Package consists of
    ca. 30 hours of radio and TV data,
    selected to include mostly non-planned
    speech and a reasonable proportion of
    multiple speaker data. All data were
    carefully transcribed, including named
    entity annotation. This package
    includes the material that was used for
    the ETAPE evaluation campaign. It
    includes resources, scoring tools,
    results of the campaign, etc., that were
    used or produced during the campaign.
    The aim of this evaluation package is to
    enable external players to evaluate
    their own system and compare their
    results with those obtained during the
    campaign itself.

  • ELRA-W0114 : TRAD Chinese-French Email Parallel corpus – Development Set
    This is a parallel corpus of 15,000
    characters in Chinese (equivalent to
    10,000 words) and a reference
    translation in French. The source texts
    are a selection of private emails
    collected from the daily life and
    business domains.

  • (last update: September 2017)

    Copyright © 2006 ELRA
    ELRACatalogue R&D 0.8.0