  • Catalogue of Language Resources

    ELRA releases free Language Resources.


    The ELRA Catalogue of Language Resources offers a repository of Language Resources (LRs) made available through ELRA.



    An increasing number of LRs in the various fields of Human Language Technology (see image on the left-hand side) are distributed on behalf of ELRA via its operational body ELDA, thanks to the contribution of various players of the HLT community.

    Through this repository, we aim to spare researchers and developers the effort of rebuilding resources that already exist, and to help them identify and access those resources.

    Other resources identified, but not available through ELRA, can be viewed in the Universal Catalogue.

    If you have any suggestions or comments, or need further details about ELRA and its Catalogue of Language Resources, please refer to the Contact Us section.

    ELRA is a partner of OLAC (Open Language Archives Community). The catalogue can be viewed as an OLAC repository.

    New Resources
  • ELRA-L0098 : Arabic dictionary of inflected words
    This dictionary consists of a list of 6
    million inflected forms, fully
    vowelized, and tagged with grammatical
    information which includes POS and
    grammatical features, including number,
    gender, case, definiteness, tense, mood
    and compatibility with clitic
    agglutination. The data is formatted in
    conformity with the data formats of
    Unitex/GramLab. This dictionary is also
    available together with recognition of
    agglutinated clitics and inflection
    system in the ELRA Catalogue under
    reference ELRA-L0099.

  • ELRA-L0099 : Arabic dictionary of inflected words with recognition of agglutinated clitics and inflection system
    This dictionary consists of 6 million
    inflected forms, fully vowelized,
    generated in compliance with the
    grammatical rules of Arabic and tagged
    with grammatical information which
    includes POS and grammatical features,
    including number, gender, case,
    definiteness, tense, mood and
    compatibility with clitic agglutination.
    It is accompanied by a grammatical
    resource that recognizes hundreds of
    millions of valid agglutinated words. So
    that the full-form dictionary can be
    updated, a dictionary of 65,000 lemmas
    and the data required to inflect them
    and regenerate the full-form dictionary
    are also provided. The data
    is formatted in conformity with the data
    formats of Unitex/GramLab. This
    dictionary is also available without
    recognition of agglutinated clitics and
    without inflection system in the ELRA
    Catalogue under reference ELRA-L0098.

  • ELRA-W0119 : Helsinki Corpus of Swahili
    This is a 25-million-word text corpus of
    Swahili, annotated for part-of-speech,
    morphology and syntax. The corpus
    contains prose text
    from domains such as fiction, news media
    and government documents, from the
    period between 1953 and 2016.

  • ELRA-W0120 : NUM 5M Mongolian written corpus
    This is a corpus of Mongolian text
    drawn mostly from online and printed
    daily newspapers, literature, and
    laws. Part of this corpus, about
    2,800 sentences with 100,000 words, has
    been POS-tagged manually and stored in
    XML TEI format.

  • ELRA-S0393 : Persian Speech Corpus
    This speech corpus was recorded with
    a "Blueberry" model microphone by one
    male speaker in Persian (Tehrani accent)
    in a professional studio. Speech
    synthesized from this corpus has
    produced a high-quality, natural voice.
    The corpus consists of 399 utterances for
    a total of about 2.5 hours, with
    orthographic and phonetic
    transcriptions.

  • (last update: October 2017)

    Copyright © 2008 ELRA