  • Catalog Reference : ELRA-W0049
    "Le Monde Diplomatique" Arabic tagged corpus
    This corpus contains 102,960 vowelised, lemmatised and tagged words (58 texts from Le Monde Diplomatique Arabic, see also ELRA-W0036-04).

    Each text comes with three files:
    - raw text in Arabic,
    - vowelised text in Arabic,
    - one XML file containing the morphological annotation of the text.

    Each word of a text is described by a set of attributes, such as its size, its rank in the text and the number of the paragraph in which it occurs. Each word corresponds to one node in the XML file. Each node carries the following positional features of the word in the text:
    - Paragraph number in the text, i.e. paragraph where the word can be found,
    - Sentence number in the paragraph,
    - Sentence number in the text,
    - Rank of the word in the text,
    - Rank of the first character of the word in the text,
    - Word size.

    Annotation information for each word is added as "sub-nodes":
    - Word as it appears in the non-vowelised text,
    - Vowelised word,
    - Word lemma,
    - Grammatical category of the word.
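
The node layout described above can be read with a few lines of Python. This is only a sketch: the catalogue entry does not give the XML schema, so every element and attribute name below (`word`, `par`, `sent_in_par`, `sent_in_text`, `rank`, `offset`, `size`, `raw`, `vowelised`, `lemma`, `pos`) is a hypothetical placeholder, and the two-word sample is made up for illustration rather than taken from the corpus.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Made-up sample. All tag and attribute names are ASSUMPTIONS; the real
# corpus files may use different names and a different nesting.
SAMPLE = """<text>
  <word par="1" sent_in_par="1" sent_in_text="1" rank="1" offset="0" size="3">
    <raw>كتب</raw>
    <vowelised>كَتَبَ</vowelised>
    <lemma>كَتَبَ</lemma>
    <pos>verb</pos>
  </word>
  <word par="1" sent_in_par="1" sent_in_text="1" rank="2" offset="4" size="5">
    <raw>الولد</raw>
    <vowelised>الوَلَدُ</vowelised>
    <lemma>وَلَد</lemma>
    <pos>noun</pos>
  </word>
</text>"""


@dataclass
class Word:
    # Positional features (one per item of the first list above)
    paragraph: int
    sentence_in_paragraph: int
    sentence_in_text: int
    rank: int
    char_offset: int
    size: int
    # Annotation sub-nodes (one per item of the second list above)
    raw: str
    vowelised: str
    lemma: str
    category: str


def parse_words(xml_text: str) -> list[Word]:
    """Turn every (hypothetical) <word> node into a Word record."""
    root = ET.fromstring(xml_text)
    words = []
    for node in root.iter("word"):
        words.append(Word(
            paragraph=int(node.get("par")),
            sentence_in_paragraph=int(node.get("sent_in_par")),
            sentence_in_text=int(node.get("sent_in_text")),
            rank=int(node.get("rank")),
            char_offset=int(node.get("offset")),
            size=int(node.get("size")),
            raw=node.findtext("raw"),
            vowelised=node.findtext("vowelised"),
            lemma=node.findtext("lemma"),
            category=node.findtext("pos"),
        ))
    return words


if __name__ == "__main__":
    for w in parse_words(SAMPLE):
        print(w.rank, w.raw, w.lemma, w.category)
```

With the actual files, the names in `parse_words` would need to be adjusted to whatever the distributed XML uses; the record structure itself simply mirrors the two feature lists above.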

    ISLRN : 124-139-628-259-2
    Project : EURADIC
    Technical Information
    Distribution medium : Downloadable
    Contents : written corpus
    Member Prices
    Academic - Commercial 975.00 EUR
    Academic - Research 185.00 EUR
    Commercial - Commercial 975.00 EUR
    Commercial - Research 975.00 EUR
    Non Member Prices
    Academic - Commercial 2000.00 EUR
    Academic - Research 400.00 EUR
    Commercial - Commercial 2000.00 EUR
    Commercial - Research 2000.00 EUR

    Copyright © 2008 ELRA