ELRA
R&D Catalogue of Language Resources

    In response to needs expressed by several academic institutions in the Human Language Technology field, ELDA is pleased to offer a version of its Catalogue of Language Resources dedicated to academic research. In discussions with members of the academic R&D community on various occasions, we concluded that it was important to provide quick and easy access to a list of resources produced specifically for R&D purposes in Human Language Technology.

    We therefore now provide a list of Language Resources that are available at very affordable prices and intended for research use. To make this list easy to access, we have kept the interface and browsing tools of the ELDA catalogue. You may, of course, return to the full version of the catalogue at any time. Very soon, we will also add an advanced search that will let you browse the catalogue using predefined selection criteria, such as resource type or price, among many others.

    As in the full version of the catalogue, the language resources available here are divided into 4 categories: "Speech and Related Resources", "Written Resources", "Terminological Resources", and "Multimodal/Multimedia Resources".

    1/ Spoken LRs

    a - Telephone recordings
    The databases catalogued in this section consist of speaker recordings made over the telephone network (fixed or mobile) or through a microphone. They include speech resources recorded in various environments and covering a large number of European and non-European languages, e.g. the databases produced in the framework of the SpeechDat project.

    b - Desktop/Microphone recordings
    The databases catalogued in this section consist of speaker recordings made through a microphone, e.g. the databases produced in the framework of the BABEL project.

    c - Broadcast Resources
    The databases catalogued in this section consist of speech recorded from radio, television or internet broadcasts, such as the Italian Broadcast News Corpus.

    d - Speech Related Resources
    This section contains pronunciation and phonetic lexicons, such as the BDLEX, PHONOLEX and MHATLEX databases.

    2/ Written LRs

    a - Corpora
    This section contains monolingual and multilingual corpora, parallel or not, which may also be annotated. Examples include the corpora developed in the framework of the MULTEXT project, the Multilingual and Parallel Corpora (MLCC), French scientific corpora and newspaper corpora in Arabic.

    b - Monolingual lexicons
    The section dedicated to monolingual lexicons contains various types of dictionaries, e.g. a dictionary of French verbs, the Japanese word dictionary and PAROLE lexicons in many languages.

    c - Multilingual lexicons
    Here you can find either bilingual or multilingual dictionaries and lexicons, such as the EuroWordNet databases.

    3/ Terminological LRs

    Monolingual, bilingual and multilingual terminological databases are available. They cover a large number of specialised domains, such as automobile engineering, insurance, linguistics and finance, in a wide variety of languages.

    4/ Multimodal/Multimedia LRs

    The resources in this section have been produced using different modalities, including speech. One example is the database produced in the framework of the M2VTS project.


    LATEST UPDATES:

    New Resources
  • T0373 : BioLexicon
    BioLexicon is a large-scale English
    terminological resource which has been
    developed to address the needs emerging
    in text mining efforts in the biomedical
    domain. It contains over 2.2M lexical
    entries (over 3.3M semantic relations),
    and information on over 1.8M variants
    and on over 2M synonymy relations.
    BioLexicon is available in a relational
    database format (MySQL dump format) and
    it adheres to the EAGLES/ISO standards
    for lexical resources.

  • E0034 : EASy Evaluation Package
    The EASy Evaluation Package was produced
    within the French national project EASy
    (Evaluation of syntactic parsers of
    French), as part of the Technolangue
    programme funded by the French Ministry
    of Research and New Technologies (MRNT).
    The project carried out a
    campaign for the evaluation of syntactic
    parsers of French. This package includes
    the material that was used for the EASy
    evaluation campaign. It includes
    resources, protocols, scoring tools,
    results of the campaign, etc., that were
    used or produced during the campaign.
    The aim of these evaluation packages is
    to enable external players to evaluate
    their own system and compare their
    results with those obtained during the
    campaign itself. The campaign is
    divided into two tasks: evaluation of
    constituent annotations and evaluation
    of dependency relation annotations.

  • T0372-07 : Multilingual Dictionary of Sports – English-French-Portuguese trilingual database
    This dictionary was produced within the
    French national project EuRADic
    (European and Arabic Dictionaries and
    Corpora), as part of the Technolangue
    programme funded by the French Ministry
    of Industry. The current set consists of
    an English-French-Portuguese trilingual
    database which includes the following
    information for each language:
      • Mandatory information: term, reference/source, grammar
      • Mandatory and common information: field (sport), domain, additional circumscription
      • Optional information: definition and source OR linguistic and source reference, combinatorics, other form, synonym, variant

  • M0042 : ItalWordNet (Italian WordNet)
    ItalWordNet (Italian WordNet) is an
    updated version of the EuroWordNet
    Italian database. The ItalWordNet
    database was produced within a national
    Italian programme called SI-TAL. It
    contains a total of 49,360 synsets. The
    ItalWordNet is provided in XML format.
    The original EuroWordNet Italian
    database is also included in this
    package.

  • W0051 : English-Persian parallel Corpus
    The corpus consists of about 3,500,000
    English and Persian words aligned at
    sentence level (about 100,000
    sentences). The files are encoded in
    Unicode. The corpus was originally
    created with SQL Server, but it is
    distributed as an Access file.

    (last update: February 2010)

    Copyright © 2006 ELRA
    ELRACatalogue R&D 0.8.0