  • Catalog Reference : ELRA-W0042
    NEMLAR Written Corpus
    This corpus was produced within the NEMLAR project (http://www.nemlar.org). Two other resources produced within the same project are also available: the NEMLAR Broadcast News Speech Corpus (ELRA-S0219) and the NEMLAR Speech Synthesis Corpus (ELRA-S0220).

    The NEMLAR Written Corpus consists of about 500,000 words of Arabic text from 13 different categories, chosen to give a well-balanced corpus that represents the syntactic, semantic and pragmatic variety of the modern Arabic language. The categories are:
    • Political news: 48,000 words
    • Political debate: 30,000 words
    • Islamic text (Preaching and others): 29,000 words
    • Phrases of common words: 8,500 words
    • Text from broadcast news: 5,500 words
    • Business: 20,000 words
    • Arabic literature: 30,000 words
    • General news: 100,000 words
    • Interviews: 56,000 words
    • Scientific press: 50,000 words
    • Sports press: 50,000 words
    • Dictionary entries explanation: 52,000 words
    • Legal domain text: 21,000 words
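As a quick consistency check on the figures above, the following minimal Python sketch sums the per-category word counts and computes each category's share of the corpus. The counts are copied from the list; the names `CATEGORY_WORDS`, `total_words` and `shares` are illustrative only and are not part of the corpus distribution.

```python
# Approximate word counts per category, copied from the catalogue list above.
CATEGORY_WORDS = {
    "Political news": 48_000,
    "Political debate": 30_000,
    "Islamic text (Preaching and others)": 29_000,
    "Phrases of common words": 8_500,
    "Text from broadcast news": 5_500,
    "Business": 20_000,
    "Arabic literature": 30_000,
    "General news": 100_000,
    "Interviews": 56_000,
    "Scientific press": 50_000,
    "Sports press": 50_000,
    "Dictionary entries explanation": 52_000,
    "Legal domain text": 21_000,
}


def total_words(counts):
    """Return the sum of all category word counts."""
    return sum(counts.values())


def shares(counts):
    """Return each category's fraction of the total word count."""
    total = total_words(counts)
    return {name: n / total for name, n in counts.items()}


if __name__ == "__main__":
    print(f"{len(CATEGORY_WORDS)} categories, {total_words(CATEGORY_WORDS):,} words")
    # List categories from largest to smallest share.
    for name, share in sorted(shares(CATEGORY_WORDS).items(), key=lambda kv: -kv[1]):
        print(f"{name}: {share:.1%}")
```

The listed counts add up to the stated total of about 500,000 words, with General news as the largest single category.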

    The data cover the period from the late 1990s to 2005.

    The corpus is provided in four versions:
    • Raw text
    • Fully vowelized text
    • Text with Arabic lexical analysis
    • Text with Arabic POS-tags

    Diacritics, lexical analysis and POS-tags were generated with RDI’s tool Fassieh©. The automatic analysis is around 95% accurate. To reach the accuracy rate of about 99% defined for this corpus, linguists used the visual revision mode of Fassieh©, in which the linguist either approves the most likely analysis (most of the time) or manually selects another one (in about 4% of cases).

    The database is distributed on 1 ISO 9660 CD-ROM volume. It has been validated by an external partner and a validation report is provided.

    ISLRN: 050-693-158-326-9
    Project: NEMLAR (Network for Euro-Mediterranean LAnguage Resources)
    Technical Information
    Development mode: Semi-automatic
    Distribution medium: Downloadable
    Contents: written corpus
    Resource files: Validation report
    Member Prices
    Academic - Commercial 1000.00 EUR
    Academic - Research 150.00 EUR
    Commercial - Commercial 1000.00 EUR
    Commercial - Research 250.00 EUR
    Non Member Prices
    Academic - Commercial 2000.00 EUR
    Academic - Research 300.00 EUR
    Commercial - Commercial 2000.00 EUR
    Commercial - Research 500.00 EUR

    Special Prices

    Discounts are available if you purchase several NEMLAR resources (W0042, S0219 and S0220):
    • 15% discount for 2 resources,
    • 30% discount for 3 resources.
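The discount rule above can be sketched as a small Python calculation. This is only an illustration under stated assumptions: the page does not say how the discount is applied, so the sketch assumes it applies to the sum of the list prices of the resources purchased. The function name `bundle_price` and the example prices other than the 300.00 EUR non-member academic research price for W0042 are hypothetical, since the prices of S0219 and S0220 are not given here.

```python
from decimal import Decimal

# Discount rates from the catalogue page, keyed by number of NEMLAR resources bought.
DISCOUNT_BY_COUNT = {2: Decimal("0.15"), 3: Decimal("0.30")}


def bundle_price(list_prices):
    """Return the total in EUR after the multi-resource discount.

    Assumption (not confirmed by the source): the discount applies to the
    sum of the list prices. Any count other than 2 or 3 gets no discount.
    """
    prices = [Decimal(str(p)) for p in list_prices]
    rate = DISCOUNT_BY_COUNT.get(len(prices), Decimal("0"))
    subtotal = sum(prices, Decimal("0"))
    return (subtotal * (1 - rate)).quantize(Decimal("0.01"))


if __name__ == "__main__":
    w0042 = "300.00"        # non-member, academic research price from this page
    other = "100.00"        # hypothetical placeholder price for another resource
    print(bundle_price([w0042]))                # no discount for a single resource
    print(bundle_price([w0042, other]))         # 15% off the combined total
    print(bundle_price([w0042, other, other]))  # 30% off the combined total
```

`Decimal` is used instead of floating point so that amounts in EUR round predictably to two decimal places.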

    Copyright © 2008 ELRA