    Catalog Reference : ELRA-S0220
    NEMLAR Speech Synthesis Corpus
    This corpus was produced within the NEMLAR project (http://www.nemlar.org). Two other resources, produced within the same project, are also available: NEMLAR Written Corpus (ELRA-W0042) and the NEMLAR Broadcast News Speech Corpus (ELRA-S0219).

    The NEMLAR Speech Synthesis Corpus contains the recordings of 2 native Egyptian Arabic speakers (male and female, 35 and 27 years old respectively) recorded in a studio over 2 channels (voice + laryngograph). The recordings comprise more than 10 hours of data with transcriptions.

    Speech samples are stored at 96 kHz with 24 bits per sample, as signed integers with the least significant byte first (“lohi” or Intel format).
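The sample format described above can be illustrated with a minimal Python sketch. It assumes that each sample is packed into exactly 3 bytes with no file header, and that the two channels (voice and laryngograph) are interleaved sample by sample; the catalogue entry states neither, so check both against the corpus documentation. The function names `decode_pcm24le` and `split_channels` are hypothetical.

```python
def decode_pcm24le(raw: bytes) -> list[int]:
    """Decode packed 24-bit signed integers, least significant byte first.

    Assumption: samples are packed as 3 bytes each with no header.
    """
    if len(raw) % 3 != 0:
        raise ValueError("byte count must be a multiple of 3")
    return [
        int.from_bytes(raw[i:i + 3], byteorder="little", signed=True)
        for i in range(0, len(raw), 3)
    ]


def split_channels(samples: list[int], n_channels: int = 2) -> list[list[int]]:
    """Split interleaved samples into one list per channel.

    Assumption: channels are interleaved; the source does not say so.
    """
    return [samples[c::n_channels] for c in range(n_channels)]


# Four samples: 1, -1, the most negative value, the most positive value.
raw = b"\x01\x00\x00" b"\xff\xff\xff" b"\x00\x00\x80" b"\xff\xff\x7f"
samples = decode_pcm24le(raw)
voice, laryngograph = split_channels(samples)
```

At 96 kHz and 3 bytes per sample, one second of one channel occupies 288,000 bytes under the packing assumption above.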

    The speakers read 2,032 prompted sentences covering approx. 42,000 words in three categories: transcribed speech (6,600 words - 20%), written text (16,500 words - 50%), and constructed phrases (10,300 words - 30%).

    The transcribed speech consists of text from different domains, produced in the Broadcast News task. The written text consists of news excerpts, novels and short stories with short sentences. Each paragraph is presented on a separate prompt sheet.

    Constructed phrases consist of frequent phrases and diphone coverage sentences. The frequently used phrases are derived from written text (articles, newspapers, etc.) and are divided into six sub-domains:
    • Frequently used colloquial expressions
    • Sports/Games
    • News
    • Finance
    • Culture/Entertainment
    • Consumer Information
    The diphone coverage sentences cover the diphones that are missing or rare in the rest of the data. To cover these diphones, the sentences were extracted from a large corpus of about 150,000 words.

    The database is provided with orthographic, prosodic and phonetic transcriptions in SAMPA. All transcriptions are segmented at the utterance (sentence/command word) level, annotated at the word level and checked manually. A pronunciation lexicon including 3,589 headwords with phonetics in SAMPA is also available.

    The database is distributed on 3 ISO 9660 DVD-ROM volumes. It has been validated by an external partner and a validation report is provided.

    ISLRN : 361-216-121-305-9
    Project : NEMLAR (Network for Euro-Mediterranean LAnguage Resources)
    Technical Information
    Distribution medium : Downloadable
    Contents : speech corpus
    Resource files
    • Validation report - Report dedicated to male recordings
    • Validation report - Report dedicated to female recordings
    Members Prices
    Academic - Commercial 5000.00 EUR
    Academic - Research 500.00 EUR
    Commercial - Commercial 5000.00 EUR
    Commercial - Research 1250.00 EUR
    Non Member Prices
    Academic - Commercial 10000.00 EUR
    Academic - Research 1000.00 EUR
    Commercial - Commercial 10000.00 EUR
    Commercial - Research 2500.00 EUR

    Special Prices

    Discounts are available if you purchase several NEMLAR resources (W0042, S0219 and S0220):
    • 15% discount for 2 resources,
    • 30% discount for 3 resources.
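The discount arithmetic can be sketched as follows. This assumes the percentage applies to the sum of the list prices of the resources bought together, which the listing does not spell out; the function name `discounted_total` is hypothetical, and the prices in the usage lines are placeholders because the prices of W0042 and S0219 are not given here.

```python
from decimal import Decimal

# Discount rate by number of NEMLAR resources purchased together.
NEMLAR_DISCOUNTS = {2: Decimal("0.15"), 3: Decimal("0.30")}


def discounted_total(prices: list[Decimal]) -> Decimal:
    """Return the total in EUR after the multi-resource discount.

    Assumption: the discount applies to the sum of the list prices.
    Any count other than 2 or 3 gets no discount.
    """
    rate = NEMLAR_DISCOUNTS.get(len(prices), Decimal("0"))
    total = sum(prices, Decimal("0"))
    return (total * (1 - rate)).quantize(Decimal("0.01"))


# Placeholder prices only, not actual catalogue prices.
two = discounted_total([Decimal("500.00"), Decimal("100.00")])
three = discounted_total([Decimal("500.00"), Decimal("100.00"), Decimal("400.00")])
```

`Decimal` is used instead of floating point so that amounts in euros and cents stay exact.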

    Copyright © 2008 ELRA