Catalog Reference : ELRA-W0043
PAROLE Italian Corpus
The PAROLE Italian Corpus comprises 3,135,651 words collected from four different domains:
• newspapers: 2,179,800 words from La Stampa, La Repubblica, Il Corriere della Sera, L’Unione Sarda, Il Sole 24ore, between 1992 and 1996,
• periodicals: 143,810 words from Casaviva, 100cose, Epoca, Espansione, Grazia, Panorama, Starbene, Storia Illustrata, Zerouno, between 1985 and 1988,
• books: 564,964 words, between 1970 and 1989,
• miscellaneous: 247,077 words from CNR documents, Patents, Maritime documents, Theater, between 1987 and 1997.
About 250,000 words were morphosyntactically annotated and lemmatized.
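As a quick consistency check, the four domain figures listed above can be summed and expressed as shares of the whole corpus. The sketch below is illustrative only: the names `DOMAIN_WORDS` and `domain_shares` are invented for this example and are not part of the corpus distribution.

```python
# Word counts per domain, copied from the catalogue description.
DOMAIN_WORDS = {
    "newspapers": 2_179_800,
    "periodicals": 143_810,
    "books": 564_964,
    "miscellaneous": 247_077,
}


def domain_shares(counts):
    """Return each domain's fraction of the total word count."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


total_words = sum(DOMAIN_WORDS.values())
shares = domain_shares(DOMAIN_WORDS)

print(f"total: {total_words:,} words")
for name, share in shares.items():
    # Left-align the domain name and print the share as a percentage.
    print(f"{name:<14} {share:6.1%}")
```

The four figures add up to the stated 3,135,651 words, with newspapers making up a little under 70% of the corpus.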
Introduction to the PAROLE project
The LE-PAROLE project (MLAP/LE2-4017) aims to offer a large-scale harmonised set of "core" corpora and lexica for all European Union languages.
The corpora and lexica were built between 1996 and 1998, according to the same design and composition principles.
Harmonisation of corpus composition (the selection of corpus texts) was to be achieved by the obligatory application of common parameters for time of production and for classification by publication medium. No texts produced before 1970 were allowed. As for publication medium, each corpus had to include texts from the categories “Book”, “Newspaper”, “Periodical” and “Miscellaneous”, each in a proportion falling within a fixed range.
The harmonisation effort also applied to the textual and linguistic encoding of the language corpora involved. With respect to the markup of text structure and primary data, every corpus text was to be encoded according to the PAROLE DTD, which is compatible with the DTD of the Text Encoding Initiative (TEI) and with that of the Corpus Encoding Standard (CES). The level of encoding was set to Level 1 of the CES, which means encoding text structure and textual features down to paragraph level, with the additional constraint that all legacy data was kept.
As for linguistic corpus annotation, an equal proportion of the corpus texts (up to 250,000 running words) was to be morphosyntactically annotated according to a common core PAROLE tagset, extended with a set of language specific features. The checking of the tags was split in two: 50,000 words had to be checked for maximum granularity and 200,000 for part-of-speech (PoS) only.
The languages involved in PAROLE corpora are: Belgian French, Catalan, Danish, Dutch, English, French, Finnish, German, Greek, Irish, Italian, Norwegian, Portuguese and Swedish.
The lexica (20,000 entries per language) were built in conformance with a model based on EAGLES guidelines and GENELEX results. This model underlies a common lexical tool adapted from the EUREKA-GENELEX project. The software tool was extended to support the PAROLE model and the conversion and management of the resulting resources.
The languages involved in PAROLE lexica are: Catalan, Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish and Swedish.
Period of coverage :
Version history :
Update frequency: every 3 years
Last update: 2004
Creation date :
Applications existing :
Distribution medium :
Number of languages
Character set :
ILC tagset and Parole tagset (EAGLES conformant)
Number of tokens :
Member Prices
Academic - Research 100.00 EUR
Commercial - Research 100.00 EUR
Non Member Prices
Academic - Research 150.00 EUR
Commercial - Research 150.00 EUR