ELRA
  • Catalog Reference : S0192
    GlobalPhone Arabic
    GlobalPhone is a multilingual speech and text database collected at Karlsruhe University, Germany. The GlobalPhone corpus provides transcribed speech data for the development and evaluation of large vocabulary continuous speech recognition systems in the most widespread languages of the world. GlobalPhone is designed to be uniform across languages with respect to the amount of text and audio per language, the audio data quality (microphone, noise, channel), the collection scenario (task, setup, speaking style etc.), and the transcription conventions. As a consequence, GlobalPhone supplies an excellent basis for research in the areas of (1) multilingual speech recognition, (2) rapid deployment of speech processing systems to new languages, (3) language and speaker identification tasks, as well as (4) monolingual speech recognition in a large variety of languages.

    To date, the GlobalPhone corpus covers 15 languages: Arabic (Modern Standard Arabic), Chinese-Mandarin, Chinese-Shanghai, Croatian, Czech, French, German, Japanese, Korean, Portuguese (Brazilian), Russian, Spanish (Latin American), Swedish, Tamil, and Turkish. This selection covers a broad variety of language peculiarities relevant for Speech and Language Research and Development. It comprises widespread languages (Arabic, Chinese, Spanish), contains economically and politically important languages (Korean, Japanese, Arabic), and spans wide geographical areas (Europe, America, Asia). The speech covers a wide selection of phonetic characteristics, e.g. tonal sounds (Mandarin, Shanghai), pharyngeal sounds (Arabic), consonant clusters (German), nasals (French, Portuguese), palatalized sounds (Russian), and more. The written language contains large orthographic variations, such as phonological scripts (alphabetic scripts such as Roman, Cyrillic, Arabic; syllable-based scripts like Japanese Kana, Korean Hangul) and ideographic scripts (Chinese Hanzi and Japanese Kanji). The languages cover many morphological variations, e.g. agglutinative languages (Turkish, Korean) and compounding languages (German), and also include scripts that completely lack word segmentation (Chinese).

    The data acquisition was performed in countries where the language is officially spoken. In each language, about 100 adult native speakers were asked to read 100 sentences. The read texts were selected from national newspaper articles available on the web to cover a wide domain with a large vocabulary. The articles report national and international political news, as well as economic news, mostly from the years 1995-1998. The speech data was recorded with a Sennheiser 440-6 close-speaking microphone and is available with identical characteristics for all languages: PCM encoding, mono quality, 16-bit quantization, and 16 kHz sampling rate. Most of the speech data was recorded in a quiet office; some was recorded in apartments, i.e. in living rooms. The transcriptions are available in the original script of the corresponding language. In addition, all transcriptions have been romanized, i.e. transformed into Roman script by applying customized mapping algorithms. The transcripts are validated and supplemented by special markers for spontaneous effects like stuttering and false starts, and for non-verbal effects such as breathing, laughing, and hesitations. Speaker information, such as age, gender, and occupation, as well as information about the recording setup, complements the database. The entire GlobalPhone corpus contains over 300 hours of speech spoken by more than 1500 native adult speakers. The data are divided into speaker-disjoint sets for training, development, and evaluation (80:10:10) and are organized by language and speaker.
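
    As an illustration of what a speaker-disjoint 80:10:10 partition means, the following minimal Python sketch assigns whole speakers (never individual utterances) to exactly one of the three sets. It is not the official GlobalPhone partition, which ships with the corpus. The speaker IDs, the function name split_speakers, the random seed, and the choice to apply the ratio to speaker counts rather than to hours of audio are all assumptions made for this example.

```python
import random


def split_speakers(speaker_ids, ratios=(0.8, 0.1, 0.1), seed=0):
    """Partition speaker IDs into disjoint train/dev/eval sets.

    Hypothetical helper: the ratio is applied to the number of speakers,
    which is only one possible reading of "80:10:10".
    """
    ids = sorted(set(speaker_ids))      # deduplicate, fix a stable order
    rng = random.Random(seed)           # local RNG so the split is reproducible
    rng.shuffle(ids)
    n_train = round(len(ids) * ratios[0])
    n_dev = round(len(ids) * ratios[1])
    return {
        "train": ids[:n_train],
        "dev": ids[n_train:n_train + n_dev],
        "eval": ids[n_train + n_dev:],  # remainder, so no speaker is dropped
    }


# Hypothetical IDs for a 78-speaker language such as the Arabic part.
speakers = [f"AR{n:03d}" for n in range(1, 79)]
splits = split_speakers(speakers)
sizes = {name: len(ids) for name, ids in splits.items()}
```

    Because every utterance of a speaker follows that speaker into a single set, a system evaluated on the "eval" set never sees its speakers during training.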

    The Arabic corpus was produced using the Assabah newspaper. It contains recordings of 78 speakers (35 males, 43 females) recorded in Tunisia, Palestine and Jordan. The following age distribution has been obtained: 20 speakers are below 19, 35 speakers are between 20 and 29, 13 speakers are between 30 and 39, 6 speakers are between 40 and 49, and 4 speakers are over 50.

    For further information, please visit the following website: http://www.cs.cmu.edu/~tanja/GlobalPhone
    Applications
    Existing applications: Language identification, Speaker identification, Speech recognition
    Technical Information
    Size: approximately 2 GB per language
    Distribution medium : DVD
    Contents: speech corpus
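
    The quoted size can be related to the audio format given in the description (PCM, mono, 16-bit, 16 kHz) with a short calculation. The sketch below assumes uncompressed audio without file headers and reads the size figure as 2 gigabytes (2 × 10^9 bytes); the distributed files may be compressed or packaged differently, so the result is only an order-of-magnitude check. All names in the code are chosen for this example.

```python
SAMPLE_RATE_HZ = 16_000   # 16 kHz sampling rate, from the description
BYTES_PER_SAMPLE = 2      # 16-bit quantization
CHANNELS = 1              # mono


def pcm_bytes(seconds):
    """Size in bytes of raw, headerless PCM audio of the given duration."""
    return int(seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)


bytes_per_hour = pcm_bytes(3600)            # 115,200,000 bytes per hour
hours_in_2gb = 2 * 10**9 / bytes_per_hour   # roughly 17.4 hours

# Average per language implied by "over 300 hours" across 15 languages.
avg_hours_per_language = 300 / 15           # 20 hours
avg_gb_per_language = avg_hours_per_language * bytes_per_hour / 10**9
```

    Under these assumptions, 2 GB holds about 17 hours of speech, and the corpus-wide average of 20 hours per language corresponds to about 2.3 GB, which is consistent with the size stated above.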
    Members Prices
    Academic - Commercial 3000.00 EUR
    Academic - Research 600.00 EUR
    Commercial - Commercial 3000.00 EUR
    Commercial - Research 3000.00 EUR
    Non Member Prices
    Academic - Commercial 3600.00 EUR
    Academic - Research 700.00 EUR
    Commercial - Commercial 3600.00 EUR
    Commercial - Research 3600.00 EUR

    Special Prices

    Special prices for a purchase of several GlobalPhone languages
    (R. = Research, C. = Commercial):

    Languages   Use   Member price   Non Member price
    5           R.    2600           3000
    5           C.    13500          16200
    10          R.    5000           6000
    10          C.    24000          28800
    15          R.    7500           9000
    15          C.    31500          37800


    Copyright © 2008 ELRA