Considering the needs expressed by several academic institutions in the Human Language Technology field, ELDA is pleased to offer access to a version of its Catalogue of Language Resources dedicated to academic research. On various occasions, in discussions with members of the academic R&D community, we concluded that it is important to provide quick and easy access to a list of resources produced specifically for R&D purposes in Human Language Technology.
Thus, we now provide a list of Language Resources, available at very affordable prices and dedicated to research use. To facilitate access to this list, we have preserved the interface and browsing tools of the ELDA catalogue. Of course, you may return to the full version of the catalogue at any time. Very soon, we will also implement an advanced search that will allow you to browse our catalogue using pre-defined selection criteria, such as the type of resource or the price (and many more criteria).
As in the full version of the catalogue, the language resources available here are divided into 4 categories: "Speech and Related Resources", "Written Resources", "Terminological Resources", and "Multimodal/Multimedia Resources".
1/ Spoken LRs
a - Telephone recordings
The databases catalogued in this section have been produced with speaker recordings made over the telephone (fixed or mobile) network, or through a microphone. You will find speech resources recorded in various environments, and covering a large number of European and non-European languages, e.g. the databases produced in the framework of the SpeechDat project.
b - Desktop/Microphone recordings
The databases catalogued in this section have been produced with speaker recordings made with a microphone, e.g. the databases produced in the framework of the BABEL project.
c - Broadcast Resources
The databases catalogued in this section have been produced with speaker recordings made over radio, television or internet, such as the Italian Broadcast News Corpus.
d - Speech Related Resources
In this section you will find pronunciation and phonetic lexicons, such as the BDLEX, PHONOLEX, and MHATLEX databases.
2/ Written LRs
a - Corpora
This section contains monolingual and multilingual corpora, parallel or not, which may also be annotated. Examples include the corpora developed in the framework of the MULTEXT project, the Multilingual and Parallel Corpora (MLCC), French scientific corpora, and newspaper corpora in Arabic.
b - Monolingual lexicons
The section dedicated to monolingual lexicons contains various types of dictionaries, such as a dictionary of French verbs, the Japanese word dictionary, and PAROLE lexicons in many languages.
c - Multilingual lexicons
Here you can find either bilingual or multilingual dictionaries and lexicons, such as the EuroWordNet databases.
3/ Terminological LRs
Monolingual, bilingual and multilingual terminological databases are available. They cover a large number of specialised domains, such as automobile engineering, insurance, linguistics, and finance, in a wide variety of languages.
4/ Multimodal/Multimedia LRs
The resources you will find in this section have been produced using different modalities, including speech. One example is the database produced in the framework of the M2VTS project.
LATEST UPDATES:
New Resources
S0307: BABEL Polish database
The BABEL Polish Database is a speech database that was produced by a research consortium funded by the European Union under the COPERNICUS programme (COPERNICUS Project 1304). It consists of the basic "common" set, which contains the Many Talker Set (30 males, 30 females), the Few Talker Set (5 males, 5 females), and the Very Few Talker Set (1 male, 1 female).
S0305: EPAC Corpus: orthographic transcriptions
This corpus consists of approx. 100 hours of manual orthographic transcriptions, which were produced from 1,677 hours of non-transcribed recordings from the ESTER Evaluation Campaign (Technolangue programme). The corpus also includes automatic transcriptions of the full 1,677 hours.
T0373: BioLexicon
BioLexicon is a large-scale English terminological resource which has been developed to address the needs emerging in text mining efforts in the biomedical domain. It contains over 2.2M lexical entries (over 3.3M semantic relations), and information on over 1.8M variants and on over 2M synonymy relations. BioLexicon is available in a relational database format (MySQL dump format) and it adheres to the EAGLES/ISO standards for lexical resources.
E0034: EASy Evaluation Package
The EASy Evaluation Package was produced within the French national project EASy (Evaluation of syntactic parsers of French), as part of the Technolangue programme funded by the French Ministry of Research and New Technologies (MRNT). The project made it possible to carry out a campaign for the evaluation of syntactic parsers of French. This package contains the material used for the EASy evaluation campaign: resources, protocols, scoring tools, results of the campaign, etc., that were used or produced during the campaign. The aim of such evaluation packages is to enable external players to evaluate their own systems and compare their results with those obtained during the campaign itself. The campaign is divided into two actions: the evaluation of constituent annotations and the evaluation of dependency relation annotations.
T0372-01: Multilingual Dictionary of Sports – English-French-Greek-Arabic-German-Spanish-Portuguese multilingual database
This dictionary was produced within the French national project EuRADic (European and Arabic Dictionaries and Corpora), as part of the Technolangue programme funded by the French Ministry of Industry. The current set consists of an English-French-Greek-Arabic-German-Spanish-Portuguese multilingual database. It contains a nomenclature of 37,500 entries for English, French, Greek and Arabic, 20,000 entries for Spanish, 22,000 for German and 10,000 for Portuguese. For each language, the contents consist of:
• Mandatory information: term, grammar
• Mandatory information unless unavailable (no source): reference/source
• Mandatory and common information: field (sport), domain, additional circumscription
• Optional information: definition and source, linguistic and source reference, combinatorics, other form, synonym
(last update: July 2010)