  • Catalog Reference : ELRA-S0192
    GlobalPhone Arabic
    The GlobalPhone corpus, developed in collaboration with the Karlsruhe Institute of Technology (KIT), was designed to provide read speech data for the development and evaluation of large vocabulary continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks.

    The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 22 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swahili (ELRA-S0375), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Ukrainian (ELRA-S0377), and Vietnamese (ELRA-S0322).

    In each language, about 100 sentences were read by each of the 100 speakers. The read texts were selected from national newspapers available on the Internet to provide a large vocabulary. The read articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6). The transcriptions are internally validated and supplemented by special markers for spontaneous effects such as stuttering and false starts, and for non-verbal effects such as laughing and hesitations. Speaker information such as age, gender, and occupation, as well as information about the recording setup, complements the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 2100 native adult speakers.
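As a rough illustration of what the stated audio format implies for storage, the sketch below converts between duration and size for uncompressed PCM at 16-bit, 16 kHz, mono. The function names are hypothetical, and the figures cover raw sample data only; they do not account for file headers or for compressed delivery.

```python
# Audio format parameters as stated in the corpus description.
SAMPLE_RATE_HZ = 16_000
BITS_PER_SAMPLE = 16
CHANNELS = 1


def pcm_bytes_per_second(sample_rate_hz=SAMPLE_RATE_HZ,
                         bits_per_sample=BITS_PER_SAMPLE,
                         channels=CHANNELS):
    """Return the data rate of uncompressed PCM audio in bytes per second."""
    return sample_rate_hz * (bits_per_sample // 8) * channels


def pcm_size_bytes(duration_seconds):
    """Return the raw sample size in bytes for a recording of the given length."""
    return int(round(duration_seconds * pcm_bytes_per_second()))


def pcm_duration_seconds(num_bytes):
    """Return the playing time in seconds of num_bytes of raw sample data."""
    return num_bytes / pcm_bytes_per_second()


if __name__ == "__main__":
    print(pcm_bytes_per_second())      # data rate in bytes per second
    print(pcm_size_bytes(3600))        # bytes needed for one hour of audio
    print(pcm_duration_seconds(10**9)) # seconds held in 10^9 bytes of samples
```

At this format, one second of speech occupies 32,000 bytes, so one hour of uncompressed audio is about 115 MB.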

    The data is compressed with the shorten program written by Tony Robinson. Alternatively, the data can be delivered uncompressed (unshortened).

    The Arabic corpus was produced using the Assabah newspaper. It contains recordings of 78 speakers (35 male, 43 female) recorded in Tunisia, Palestine and Jordan. The age distribution is as follows: 20 speakers are below 19, 35 speakers are between 20 and 29, 13 speakers are between 30 and 39, 6 speakers are between 40 and 49, and 4 speakers are over 50.
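The speaker figures above can be cross-checked with a few lines of code. The following is a minimal sketch that verifies that the gender and age breakdowns each add up to the stated 78 speakers and derives the share of each group. The dictionary keys and function names are illustrative and not part of the corpus documentation.

```python
# Speaker counts copied from the Arabic corpus description.
TOTAL_SPEAKERS = 78
SPEAKERS_BY_GENDER = {"male": 35, "female": 43}
SPEAKERS_BY_AGE = {
    "below 19": 20,
    "20-29": 35,
    "30-39": 13,
    "40-49": 6,
    "over 50": 4,
}


def is_consistent(breakdown, expected_total):
    """Return True if the group counts add up to the expected total."""
    return sum(breakdown.values()) == expected_total


def shares(breakdown):
    """Return each group's fraction of the total number of speakers."""
    total = sum(breakdown.values())
    return {group: count / total for group, count in breakdown.items()}


if __name__ == "__main__":
    print(is_consistent(SPEAKERS_BY_GENDER, TOTAL_SPEAKERS))
    print(is_consistent(SPEAKERS_BY_AGE, TOTAL_SPEAKERS))
    for group, share in shares(SPEAKERS_BY_AGE).items():
        print(f"{group}: {share:.1%}")
```

Both breakdowns sum to 78, so the published counts are internally consistent.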

    ISLRN : 720-473-726-952-8
    Existing applications : Language identification, Speaker identification, Speech recognition
    Technical Information
    Bytesize : approximately 2 GB per language
    Distribution medium : Downloadable
    Contents : speech corpus
    Member Prices
    Academic - Commercial 3000.00 EUR
    Academic - Research 600.00 EUR
    Commercial - Commercial 3000.00 EUR
    Commercial - Research 3000.00 EUR
    Non-Member Prices
    Academic - Commercial 3600.00 EUR
    Academic - Research 700.00 EUR
    Commercial - Commercial 3600.00 EUR
    Commercial - Research 3600.00 EUR

    Special Prices

    Special prices for the purchase of several GlobalPhone languages
    (R. = Research, C. = Commercial):

    Languages   Use   Member price   Non-member price
    5           R.    2600           3000
    5           C.    13500          16200
    10          R.    5000           6000
    10          C.    24000          28800
    15          R.    7500           9000
    15          C.    31500          37800
    20          R.    10000          12000
    20          C.    39000          50000
    22          R.    11200          13400
    22          C.    45000          57200

    Copyright © 2008 ELRA