ELRA
  • Catalog Reference : ELRA-W0042
    NEMLAR Written Corpus
    This corpus was produced within the NEMLAR project (http://www.nemlar.org). Two other resources, produced within the same project, are also available: NEMLAR Broadcast News Speech Corpus (ELRA-S0219) and the NEMLAR Speech Synthesis Corpus (ELRA-S0220).

    The NEMLAR Written Corpus consists of about 500,000 words of Arabic text from 13 different categories. It aims to be a well-balanced corpus that represents the variety of syntactic, semantic and pragmatic features of the modern Arabic language. The categories are:
    • Political news: 48,000 words
    • Political debate: 30,000 words
    • Islamic text (Preaching and others): 29,000 words
    • Phrases of common words: 8,500 words
    • Text from broadcast news: 5,500 words
    • Business: 20,000 words
    • Arabic literature: 30,000 words
    • General news: 100,000 words
    • Interviews: 56,000 words
    • Scientific press: 50,000 words
    • Sports press: 50,000 words
    • Dictionary entries explanation: 52,000 words
    • Legal domain text: 21,000 words
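
As a quick arithmetic check on the figures above, the following minimal Python sketch sums the per-category word counts and computes each category's share of the corpus. The dictionary and function names are illustrative only and are not part of the corpus or of any ELRA tool; the counts are copied from the category list.

```python
# Word counts per category, copied from the category list above.
CATEGORY_WORDS = {
    "Political news": 48_000,
    "Political debate": 30_000,
    "Islamic text (Preaching and others)": 29_000,
    "Phrases of common words": 8_500,
    "Text from broadcast news": 5_500,
    "Business": 20_000,
    "Arabic literature": 30_000,
    "General news": 100_000,
    "Interviews": 56_000,
    "Scientific press": 50_000,
    "Sports press": 50_000,
    "Dictionary entries explanation": 52_000,
    "Legal domain text": 21_000,
}


def category_shares(counts):
    """Return each category's share of the total word count, in percent."""
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}


total_words = sum(CATEGORY_WORDS.values())
shares = category_shares(CATEGORY_WORDS)
largest = max(CATEGORY_WORDS, key=CATEGORY_WORDS.get)

print(f"{len(CATEGORY_WORDS)} categories, {total_words:,} words in total")
print(f"Largest category: {largest} ({shares[largest]:.1f}%)")
```

The listed counts add up to the stated total of about 500,000 words, with general news as the largest single category.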

    The data cover the period from the late 1990s to 2005.

    The corpus is provided in 4 different versions:
    • Raw text
    • Fully vowelized text
    • Text with Arabic lexical analysis
    • Text with Arabic POS-tags

    Diacritics, lexical analysis and POS-tags were generated by RDI’s tool Fassieh©. The accuracy of the automatic analysis is around 95%. To reach the accuracy rate of about 99% defined for this corpus, linguists used the visual revision mode of Fassieh©, in which the linguist either approves the most likely analysis (most of the time) or manually selects another one (in about 4% of cases).

    The database is distributed on 1 ISO 9660 CD-ROM volume. It has been validated by an external partner and a validation report is provided.

    ISLRN : 050-693-158-326-9
    Production
    Project : NEMLAR (Network for Euro-Mediterranean LAnguage Resources)
    Technical Information
    Development mode : Semi-automatic
    Distribution medium : Downloadable
    Contents
    written corpus 
    Resource files
  • Validation report
    Members Prices
    Academic - Commercial 1000.00 EUR
    Academic - Research 150.00 EUR
    Commercial - Commercial 1000.00 EUR
    Commercial - Research 250.00 EUR
    Non Member Prices
    Academic - Commercial 2000.00 EUR
    Academic - Research 300.00 EUR
    Commercial - Commercial 2000.00 EUR
    Commercial - Research 500.00 EUR

    Special Prices

    Discounts are available if you purchase several NEMLAR resources (W0042, S0219 and S0220):
    • 15% discount for 2 resources,
    • 30% discount for 3 resources.


    Copyright © 2008 ELRA