ELRA
    Catalog Reference: ELRA-W0019
    Dutch PAROLE Distributable Corpus
    The Dutch PAROLE Distributable Corpus is a 3-million-word selection from the 20-million-word Dutch PAROLE Reference Corpus.

    The Dutch corpus was annotated and checked according to the common core PAROLE tagset. The Dutch data were also checked for type.

    The Dutch PAROLE Distributable Corpus contains the following texts:

    BOOKS:
    Van Sterkenburg:
    Wdlijst tot wdboek, 1984, 65,344 words
    Taal vt Journaal, 1989, 56,215 words
    WNT-portret, 1992, 60,133 words

    NEWSPAPERS:
    Short Newspaper texts:
    MN_Collection, 1986-1988, 19,537 words
    CVNP(S)-Collection, 1983-1990, 179,220 words

    PERIODICAL:
    Short texts from
    - Local Papers, 1985-1988, 47,019 words
    - Magazines, 1985-1989, 164,589 words

    MISCELLANEOUS:
    Texts to be read out in TV-news broadcasts for:
    - General audience, 1992-1995, 1,285,824 words
    - Youth, 1991-1995, 1,008,658 words
    Short texts from Ephemera, 1985-1986, 131,692 words

    TOTAL: 3,018,231 words
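
    The listed word counts can be cross-checked against the stated total. The following is a minimal sketch in Python; the category labels and variable names are chosen here for illustration, and only the figures come from the listing above.

```python
# Word counts copied from the listing above, grouped by publication medium.
composition = {
    "Books": [65_344, 56_215, 60_133],
    "Newspapers": [19_537, 179_220],
    "Periodical": [47_019, 164_589],
    "Miscellaneous": [1_285_824, 1_008_658, 131_692],
}

# Sum each category, then sum the categories to get the corpus size.
category_totals = {name: sum(counts) for name, counts in composition.items()}
total_words = sum(category_totals.values())

# Print each category with its word count and its share of the corpus.
for name, words in category_totals.items():
    print(f"{name:<14}{words:>10,}  {words / total_words:6.1%}")
print(f"{'TOTAL':<14}{total_words:>10,}")
```

    The sum reproduces the stated total of 3,018,231 words, and it shows that the Miscellaneous category accounts for roughly 80% of the distributable corpus.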

    Over 250,000 words of corpus text have been PoS-tagged automatically. A total of 59,798 running words have been manually corrected and checked at least twice at maximal granularity, following a lexicographer's manual. The roughly 9,800 words beyond the required 50,000 compensate for the occurrence of ca. 5,300 "keywords" in the original texts. The fully corrected material has been put through an automated post-control operation that checks the pertinence relations between the various feature values and instantiates default values wherever a mismatch (indicating a correction error) is found.

    Ca. 200,000 words have been checked once for PoS and type; type was checked in addition to the required PoS for reasons of quality. This material has been put through an automated correction procedure that addresses the feature slots (positions) beyond the first two, which hold PoS and type, so as to resolve discrepancies between the manually corrected PoS and type and the possibly erroneous, automatically assigned values of the remaining slots.
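
    The idea of reconciling the remaining feature slots with a manually corrected PoS and type can be sketched as follows. This is an illustration only, not the procedure actually used for the corpus: the tag strings, the table of allowed values, the default marker "-" and the function name reconcile are all hypothetical and do not reproduce the PAROLE tagset.

```python
# Hypothetical positional tags: slot 0 = PoS, slot 1 = type,
# remaining slots = further features. All values below are invented
# for illustration and are NOT taken from the PAROLE tagset.
ALLOWED = {
    ("N", "c"): [{"m", "f", "n"}, {"s", "p"}],  # assumed: gender, number
    ("V", "m"): [{"i", "p"}, {"s", "p"}],       # assumed: mood, number
}
DEFAULT = "-"  # assumed marker for a default (reset) value


def reconcile(tag: str) -> str:
    """Keep the trusted PoS and type; reset incompatible later slots."""
    pos, typ, rest = tag[0], tag[1], tag[2:]
    slots = ALLOWED.get((pos, typ))
    if slots is None:
        return tag  # unknown PoS/type combination: leave the tag untouched
    fixed = []
    for i, allowed in enumerate(slots):
        value = rest[i] if i < len(rest) else DEFAULT
        # A value that is not permitted for this PoS/type is treated as an
        # error of the automatic tagger and replaced by the default.
        fixed.append(value if value in allowed else DEFAULT)
    return pos + typ + "".join(fixed)


print(reconcile("Ncfs"))  # consistent tag, returned unchanged
print(reconcile("Ncip"))  # first feature slot is not valid for ("N", "c")
```

    In this toy setting the first two slots are never modified, mirroring the assumption that PoS and type were verified by hand while the later slots still carry automatically assigned values.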

    ***
    Introduction to the PAROLE project

    The LE-PAROLE project (MLAP/LE2-4017) aims to offer a large-scale harmonised set of "core" corpora and lexica for all European Union languages.

    The language corpora and lexica were built between 1996 and 1998 according to the same design and composition principles.

    PAROLE Corpora:

    Harmonisation of corpus composition (the selection of corpus texts) was to be achieved by the obligatory application of common parameters for time of production and for classification by publication medium. No texts produced before 1970 were allowed. As for publication medium, each corpus had to include texts from the categories “Book”, “Newspaper”, “Periodical” and “Miscellaneous” in specific proportions within a fixed range.

    The harmonisation effort also applied to the textual and linguistic encoding of the language corpora involved. With respect to the markup of text structure and primary data, every corpus text was to be encoded according to the PAROLE DTD, which is compatible with the DTD of the Text Encoding Initiative (TEI) and with that of the Corpus Encoding Standard (CES). The level of encoding was set to Level 1 of the CES, which implies encoding text structure and textual features down to paragraph level, with the additional constraint that all legacy data was kept.

    As for linguistic corpus annotation, an equal proportion of the corpus texts (up to 250,000 running words) was to be morphosyntactically annotated according to a common core PAROLE tagset, extended with a set of language-specific features. The checking of the tags was split in two: 50,000 words had to be checked at maximum granularity and 200,000 for part of speech (PoS) only.

    The languages involved in PAROLE corpora are: Belgian French, Catalan, Danish, Dutch, English, French, Finnish, German, Greek, Irish, Italian, Norwegian, Portuguese and Swedish.

    PAROLE Lexica:

    The lexica (20,000 entries per language) were built in conformity with a model based on EAGLES guidelines and GENELEX results, which underlies a common lexical tool adapted from the EUREKA-GENELEX project. This software tool was extended to support the PAROLE model and the conversion and management of the resulting resources.

    The languages involved in PAROLE lexica are: Catalan, Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish and Swedish.

    ISLRN : 440-290-917-102-7
    Production
    Project : PAROLE
    Technical Information
    Distribution medium : Downloadable
    Contents : written corpus
    Member Prices
    Academic - Commercial 1600.00 EUR
    Academic - Research 270.00 EUR
    Commercial - Commercial 1600.00 EUR
    Commercial - Research 800.00 EUR
    Non Member Prices
    Academic - Commercial 2500.00 EUR
    Academic - Research 300.00 EUR
    Commercial - Commercial 2500.00 EUR
    Commercial - Research 1300.00 EUR

    Special Prices

    Special price for academic users from the Netherlands and Belgium. The data are supplied directly by the Instituut voor Nederlandse Lexicologie, http://www.inl.nl.

    Member Special Prices
    Academic - Research 150.00 EUR
    Non Member Special Prices
    Academic - Research 150.00 EUR

    Copyright © 2008 ELRA