    Catalog Reference : ELRA-W0119
    Helsinki Corpus of Swahili
    This is a 25-million-word text corpus of Swahili, annotated for part of speech, morphology and syntax. The corpus contains prose from the fiction, news media and government document domains, covering the period from 1953 to 2016.

    This package contains:
    - the Helsinki Corpus of Swahili 2.0 Non Annotated Version, which contains the raw material, formatted and corrected.
    - the Helsinki Corpus of Swahili 2.0 Annotated Version, annotated with Salama Tagger and with metadata added to each file.

    The source texts were collected from the Web (news media texts from 1988 to 2016 and open government webpages from 2004 to 2006) and from books (dating from 1953 to 1991, scanned and proofread). Part of the oldest news material, which predates scanners, was typed manually. The old material, collected before 2003, comprises the sections Books and News.
    The new material comprises a section Bunge (Hansards of the Tanzanian Parliament from the years 2004, 2005 and 2006) and a section News (from 2004 to 2015).

    Each word in the annotated corpus normally carries the following types of information: token, stem, part-of-speech, morphological description, syntactic tag, rest of verb description.
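    As a rough illustration of the per-word information listed here, the record could be modelled as below. This is only a sketch: the catalogue entry does not specify the file format, so the tab-separated layout, the field order, and the names `AnnotatedWord` and `parse_word` are all assumptions made for illustration, and the sample values are placeholders rather than real corpus data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AnnotatedWord:
    """One annotated word, with the six information types named in the description."""
    token: str
    stem: str
    pos: str                          # part-of-speech
    morphology: str                   # morphological description
    syntax: str                       # syntactic tag
    verb_rest: Optional[str] = None   # rest of verb description, if present


def parse_word(line: str, sep: str = "\t") -> AnnotatedWord:
    """Parse one line of a HYPOTHETICAL separator-delimited layout.

    Assumption: the first five fields are always present, in the order
    listed in the description; anything after them is treated as the
    rest of the verb description.
    """
    fields = line.rstrip("\n").split(sep)
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 fields, got {len(fields)}")
    token, stem, pos, morphology, syntax = fields[:5]
    verb_rest = sep.join(fields[5:]) or None
    return AnnotatedWord(token, stem, pos, morphology, syntax, verb_rest)


# Placeholder values only; these are not actual tags from the corpus.
word = parse_word("TOKEN\tSTEM\tPOS\tMORPH\tSYN")
print(word.token, word.verb_rest)
```

    The actual layout of the annotated files should be checked against the documentation shipped with the corpus before adapting anything like this.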

    The corpus was prepared at the University of Helsinki, Department of Asian and African Studies, under the auspices of Prof. Arvi Hurskainen.

    It is available from ELRA for commercial use only.
    For academic use, it is accessible via Kielipankki - the Language Bank of Finland in Korp (https://korp.csc.fi/).

    A version of the corpus with English glosses, in which each word is provided with one or more lexical equivalents, can be distributed on request (terms to be discussed on a case-by-case basis).

    ISLRN : 941-187-059-145-7
    Identification
    Period of coverage :
    Version : 2.0
    Version history :
    Technical Information
    Distribution medium : Downloadable
    Contents : written corpus
    Members Prices
    Academic - Commercial 7500.00 EUR
    Commercial - Commercial 7500.00 EUR
    Non Member Prices
    Academic - Commercial 15000.00 EUR
    Commercial - Commercial 15000.00 EUR

    Copyright © 2008 ELRA