The language resources available in this catalogue are divided into four categories: "Speech and Related Resources", "Written Resources", "Terminological Resources", and "Multimodal/Multimedia Resources".
1/ Spoken LRs
a - Telephone recordings
The databases catalogued in this section were produced from speaker recordings made over the telephone network (fixed or mobile) or through a microphone. They include speech resources recorded in various environments and covering a large number of European and non-European languages, e.g. the databases produced in the framework of the SpeechDat project.
b - Desktop/Microphone recordings
The databases catalogued in this section were produced from speaker recordings made through a microphone, e.g. the databases produced in the framework of the BABEL project.
c - Broadcast Resources
The databases catalogued in this section have been produced with speaker recordings made over radio, television or internet, such as the Italian Broadcast News Corpus.
d - Speech Related Resources
This section contains pronunciation and phonetic lexicons, such as the BDLEX, PHONOLEX, and MHATLEX databases.
2/ Written LRs
a - Corpora
This section contains monolingual and multilingual corpora, parallel or not, which may also be annotated. Examples include the corpora developed in the framework of the MULTEXT project, the Multilingual and Parallel Corpora (MLCC), French scientific corpora, and newspaper corpora in Arabic.
b - Monolingual lexicons
The section dedicated to monolingual lexicons contains various types of dictionaries, e.g. a dictionary of French verbs, the Japanese word dictionary, some PAROLE lexicons in many languages, etc.
c - Multilingual lexicons
Here you can find either bilingual or multilingual dictionaries and lexicons, such as the EuroWordNet databases.
3/ Terminological LRs
Monolingual, bilingual and multilingual terminological databases are available. They cover a large number of specialised domains, e.g. automobile engineering, insurance, linguistics, finance, etc., in a wide variety of languages.
4/ Multimodal/Multimedia LRs
The resources in this section have been produced using different modalities, including speech. An example of such resources is the database produced in the framework of the M2VTS project.
New Resources
M0043 : Russian => English MT optimized lexicon in OLIF XML
This lexicon is provided as structured XML in OLIF (Open Lexicon Interchange Format). It comprises 99,211 entries in its source language (Russian) and 134,828 entries in its target language (English). The source entries are distributed as follows: 64,487 nouns, 11,470 adjectives, 19,724 verbs, 1,762 adverbs, and 1,768 closed-class elements (interjections, special prefixes, suffixes, etc.). Noun entries carry gender and number information, and verb entries carry details on aspect and reflexivity. The entries contain semantic information in the form of domain specification or style information (e.g., colloquial, regional use, etc.). Definitions are available for 59,775 entries, and collocational information for 39,148 entries.
M0045 : Cebuano => English Bilingual Lexicon
This lexicon is provided as structured XML in OLIF (Open Lexicon Interchange Format). It comprises 1,988 entries in Cebuano and 1,990 in English. The source entries are distributed as follows: 1,052 nouns, 462 adjectives, 405 verbs, and 69 closed-class entries. The entries contain semantic information in the form of domain specification or style information (e.g., colloquial, regional use, etc.). Collocational information is also available for 500 entries.
M0044 : English => Swahili Bilingual Lexicon
This lexicon is provided as structured XML in OLIF (Open Lexicon Interchange Format). It comprises 58,247 entries in English and 58,300 in Swahili. The source entries are distributed as follows: 36,046 nouns, 3,013 adjectives, 18,308 verbs, and 880 closed-class entries. The entries contain semantic information in the form of domain specification or style information (e.g., colloquial, regional use, etc.). Collocational information is also available for 17,570 entries.
M0046 : English => Czech Bilingual Lexicon
This lexicon is provided as structured XML in OLIF (Open Lexicon Interchange Format). It comprises 31,718 entries in English and 32,125 in Czech. The source entries are distributed as follows: 17,797 nouns, 7,748 adjectives, 6,039 verbs, and 134 closed-class entries. The entries contain semantic information in the form of domain specification or style information (e.g., colloquial, regional use, etc.). Collocational information is also available for 3,065 entries.
S0273 : LC-STAR Slovenian Phonetic lexicon
The LC-STAR Slovenian Phonetic lexicon comprises 110,900 entries, including a set of 64,521 common words, a set of 45,012 proper names (including person names, family names, cities, streets, companies and brand names) and a list of 5,491 special application words. The lexicon is provided in XML format and includes phonetic transcriptions in SAMPA.
(last update: May 2008)