  • Catalog Reference : ELRA-L0031
    Dutch PAROLE lexicon
    The entry list of the lexicon consists of about 20,200 entries distributed over 13 parts of speech (POS). The entries have been described along the dimensions of morphosyntax and syntax. Morphosyntactic information consists of various lexical properties, like gender, number, case, person, inflection, etc. Syntactic descriptions consist of typical complementation patterns associated with the various lemmata.

    The composition of the entry list of the lexicon is based on 3 corpora from the Instituut voor Nederlandse Lexicologie (INL) and 2 lexica. The corpora contain a total of about 54 million words and have been automatically annotated for part-of-speech and lemma. The lexica contain morphosyntactic information of various kinds. For verbs, nouns, adjectives and adverbs, lemmata that were covered by at least 2 corpora and the 2 lexica were selected on the basis of cumulative frequency, coverage (distribution over sources) and inflected forms. For the smaller parts of speech, these selection requirements appeared to be too strict. Entry selection for these parts of speech was based on ranked frequency.

    The entries, uniquely defined by the combination of part of speech (e.g. noun) and subtype (e.g. common vs. proper noun), are provided with morphosyntactic information according to the Dutch set of PAROLE categories and features, and, where available, with syntactic information. Morphosyntactic information is automatically extracted from the INL lexica. Syntactic data have been collected manually, by inspection of corpus data and - where necessary - consultation of reference works. The corpus consulted consists of the newspaper component and the varied component of the 38 Million Words Corpus 1996.

    Word forms in the Dutch PAROLE lexicon are not inflected according to general paradigms, but are related to their lemma by a set of string procedures. These procedures are not unique to a single word form: each one can be shared by many word forms. An example is suffixation with -e for adjectives, which produces "goede" from "goed" ("good"). Inflected forms can be derived directly by applying the string procedures to the lemma they are connected with.
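
    As a rough illustration of the idea, the sketch below models a string procedure as "strip a suffix, then append a suffix" and applies one shared procedure to two lemmata. The class name StringProcedure, its fields and the second lemma "klein" are assumptions made for this example; they are not taken from the lexicon's actual SGML encoding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StringProcedure:
    """Hypothetical string procedure: strip a suffix from the lemma, then append one."""
    remove: str  # suffix to strip from the lemma ("" if nothing is stripped)
    add: str     # suffix to append afterwards

    def apply(self, lemma: str) -> str:
        if self.remove and not lemma.endswith(self.remove):
            raise ValueError(f"{lemma!r} does not end in {self.remove!r}")
        stem = lemma[: len(lemma) - len(self.remove)]
        return stem + self.add


# One procedure object, shared by every adjective that inflects with -e.
SUFFIX_E = StringProcedure(remove="", add="e")

# "goed" is the example from the description; "klein" is added for illustration.
inflected = {lemma: SUFFIX_E.apply(lemma) for lemma in ("goed", "klein")}
print(inflected)
```

    Because the procedure is stored once and referenced by many word forms, a new inflected form needs only a pointer to an existing procedure rather than its own rule.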

    The lexicon is set up as an SGML file (over 30 MB of plain ASCII). Its contents have been encoded in a distributed manner: all formative entities (like lemmata, syntactic phrases, feature bundles) are SGML entities, related by a pointer mechanism to other entities.
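
    The distributed encoding can be pictured as a table of entities that refer to each other by identifier. The following sketch is only a mental model: the identifiers, field names and the resolve helper are hypothetical and do not reproduce the element or attribute names of the PAROLE SGML files.

```python
# Hypothetical entity table: every identifier and field name here is invented.
ENTITIES = {
    "lemma:goed": {
        "kind": "lemma",
        "form": "goed",
        "features": "fb:adj",             # pointer to a feature bundle
        "procedures": ["proc:suffix-e"],  # pointers to string procedures
    },
    "fb:adj": {"kind": "feature_bundle", "pos": "adjective"},
    "proc:suffix-e": {"kind": "string_procedure", "add": "e"},
}


def resolve(entity_id, table):
    """Return the entity with each pointer replaced by the entity it names."""
    resolved = {}
    for field, value in table[entity_id].items():
        if isinstance(value, str) and value in table:
            resolved[field] = resolve(value, table)
        elif isinstance(value, list):
            resolved[field] = [resolve(v, table) if v in table else v for v in value]
        else:
            resolved[field] = value
    return resolved


entry = resolve("lemma:goed", ENTITIES)
print(entry["features"]["pos"], entry["procedures"][0]["add"])
```

    Following pointers at read time keeps shared entities, such as a feature bundle used by many lemmata, stored in one place.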

    The lexicon contains the following categories: adjectives (3,298 entries), adpositions (80 entries), adverbs (554 entries), articles (3 entries), conjunctions (70 entries), determiners (59 entries), interjections (235 entries), nouns (12,279 entries), numerals (77 entries), pronouns (85 entries), residuals (186 entries), unique (1 entry), verbs (3,274 entries).
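
    The per-category counts can be checked against the stated totals of about 20,200 entries over 13 parts of speech. The short sketch below simply sums the figures listed above; the variable names are chosen for this example.

```python
# Entry counts per part of speech, copied from the category list above.
ENTRY_COUNTS = {
    "adjectives": 3298,
    "adpositions": 80,
    "adverbs": 554,
    "articles": 3,
    "conjunctions": 70,
    "determiners": 59,
    "interjections": 235,
    "nouns": 12279,
    "numerals": 77,
    "pronouns": 85,
    "residuals": 186,
    "unique": 1,
    "verbs": 3274,
}

total_entries = sum(ENTRY_COUNTS.values())
print(len(ENTRY_COUNTS), total_entries)  # 13 20201
```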

    ***
    Introduction on the PAROLE project

    The LE-PAROLE project (MLAP/LE2-4017) aims to offer a large-scale harmonised set of "core" corpora and lexica for all European Union languages.

    Language corpora and lexica were built according to the same design and composition principles, in the period 1996-1998.

    PAROLE Corpora:

    The harmonisation with respect to corpus composition (selection of corpus texts) was to be achieved by the obligatory application of common parameters for time of production and classification according to publication medium. No texts older than 1970 were allowed. As for publication medium, the corpus had to include specific proportions of texts from the categories “Book”, “Newspaper”, “Periodical” and “Miscellaneous” within a settled range.

    The harmonisation effort also applied to the textual and linguistic encoding of the language corpora involved. With respect to the mark up of text structure and primary data, every single corpus text was to be encoded according to the PAROLE DTD, which is compatible with the DTD of the Text Encoding Initiative (TEI) and with that of the Corpus Encoding Standard (CES). The level of encoding was set to Level 1 of the CES, implying the encoding of text structure and textual features up to Paragraph Level, with the additional constraint, however, that all legacy data was kept.

    As for linguistic corpus annotation, an equal proportion of the corpus texts (up to 250,000 running words) was to be morphosyntactically annotated according to a common core PAROLE tagset, extended with a set of language specific features. The checking of the tags was split in two: 50,000 words had to be checked for maximum granularity and 200,000 for part-of-speech (PoS) only.

    The languages involved in PAROLE corpora are: Belgian French, Catalan, Danish, Dutch, English, French, Finnish, German, Greek, Irish, Italian, Norwegian, Portuguese and Swedish.

    PAROLE Lexica:

    The lexica (20,000 entries per language) were built in conformity with a model based on EAGLES guidelines and GENELEX results, using a common lexical tool adapted from the EUREKA-GENELEX project. This software tool was extended to support the PAROLE model and the conversion and management of the resulting resources.

    The languages involved in PAROLE lexica are: Catalan, Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish and Swedish.

    ISLRN : 283-192-505-981-6
    Production
    Project : LE-PAROLE project (MLAP/LE2-4017)
    Creation date : 1996-1998
    Technical Information
    Bytesize : 30 MB
    Distribution medium : Downloadable
    Fileformat : Plain text
    Contents
    Written lexicon
    Members Prices
    Academic - Commercial 8000.00 EUR
    Academic - Research 300.00 EUR
    Commercial - Commercial 8000.00 EUR
    Commercial - Research 1600.00 EUR
    Non Member Prices
    Academic - Commercial 10000.00 EUR
    Academic - Research 400.00 EUR
    Commercial - Commercial 10000.00 EUR
    Commercial - Research 3000.00 EUR

    Special Prices

    Special price for academic users from the Netherlands and Belgium. The data are supplied directly by the Instituut voor Nederlandse Lexicologie, http://www.inl.nl.

    Members Special Prices
    Academic - Research 200.00 EUR
    Non Member Special Prices
    Academic - Research 200.00 EUR

    Copyright © 2008 ELRA