Annotated tweet corpus in Arabizi, French and English – ELRA Catalogue

Resource name in French: Corpus de tweets annotés en arabizi, français et anglais

ISLRN: 482-848-308-105-6

ID: ELRA-W0323

The annotated tweet corpus in Arabizi, French and English was built by ELDA on behalf of INSA Rouen Normandie (Normandie Université, LITIS team) within the SAPhIRS project (System for the Analysis of Information Propagation in Social Networks), funded by the DGE (Direction Générale des Entreprises, France) through the RAPID programme (2017-2020). The project studied the mechanisms by which information and opinion propagate within social networks, namely identifying influential leaders and detecting the channels through which information and opinion are disseminated. The corpus, completed in 2020, was built by collecting and annotating tweets in 3 languages (Arabizi, French and English) on 3 predefined themes (Hooliganism, Racism, Terrorism).

For the collection, a Python tool based on the “GetOldTweets3” library was developed. It took as input information such as the language (EN/FR) and a keyword list. With this tool, up to 10,000 tweets per (keyword, language) pair were collected for English and French. For Arabizi, a specific process was set up. An Arabizi vocabulary list was first created by selecting the 1,000 most frequent words from two sources: a corpus of Arabizi SMS (Moroccan and Tunisian) and the “Training and test data for Arabizi detection and transliteration” resource (available from ELRA under reference ELRA-W0126, ISLRN ID: 986-364-744-303-9). Tweets containing each word from this vocabulary list and from the keyword list were then downloaded (places = Morocco, Tunisia, Algeria). Only tweets containing at least 5 words in Arabizi were kept.
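The Arabizi selection step described above (a frequency-based vocabulary, then a minimum number of Arabizi words per tweet) can be illustrated with a minimal Python sketch. This is not the project's collection tool: the function names, the tokenizer and the choice to count every matching token (rather than distinct words) are assumptions made for illustration only.

```python
import re
from collections import Counter

# Assumption: a token is a run of letters and digits, which keeps
# Arabizi spellings that use digits (e.g. "3lach") in one piece.
TOKEN_RE = re.compile(r"[^\W_]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def build_vocabulary(texts, size=1000):
    """Return the `size` most frequent tokens found in `texts`."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return {word for word, _ in counts.most_common(size)}


def has_enough_arabizi(tweet_text, vocabulary, min_words=5):
    """True if at least `min_words` tokens of the tweet are in the vocabulary."""
    hits = sum(1 for token in tokenize(tweet_text) if token in vocabulary)
    return hits >= min_words
```

With a vocabulary built from the source corpora, `has_enough_arabizi(text, vocabulary)` would keep a tweet only when five or more of its tokens appear in that vocabulary.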

For the annotation, a Django-based tool was developed to provide the following annotations for each tweet in a given sequence:
• Theme: with 5 possible annotations (Hooliganism, Racism, Terrorism, Others, Incomprehensible)
• Topic: the annotator can add a new topic if it does not exist in the proposed list
• Opinion: 3 possible annotations (Negative, Neutral, Positive)
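As an illustration, the annotation scheme listed above could be modelled as follows in Python. The class and field names (`TweetAnnotation`, `tweet_id`, `sequence_id`) are hypothetical; the label values are the ones given in this description, and the layout of the distributed CSV files may differ.

```python
from dataclasses import dataclass
from enum import Enum


class Theme(Enum):
    HOOLIGANISM = "Hooliganism"
    RACISM = "Racism"
    TERRORISM = "Terrorism"
    OTHERS = "Others"
    INCOMPREHENSIBLE = "Incomprehensible"


class Opinion(Enum):
    NEGATIVE = "Negative"
    NEUTRAL = "Neutral"
    POSITIVE = "Positive"


# The 3 predefined themes targeted by the corpus.
PREDEFINED_THEMES = frozenset({Theme.HOOLIGANISM, Theme.RACISM, Theme.TERRORISM})


@dataclass(frozen=True)
class TweetAnnotation:
    tweet_id: str      # kept as a string to avoid any loss of precision
    sequence_id: str   # hypothetical identifier of the annotated sequence
    theme: Theme
    topic: str         # free text: annotators could add new topics
    opinion: Opinion

    @property
    def has_predefined_theme(self):
        """True if the tweet carries one of the 3 predefined themes."""
        return self.theme in PREDEFINED_THEMES
```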

In total, 17,103 sequences were annotated from 585,163 tweets (196,374 in English, 254,748 in French and 134,041 in Arabizi), including those annotated with the themes “Others” and “Incomprehensible”. Of these, 4,578 sequences have at least 20 tweets annotated with the 3 predefined themes (Hooliganism, Racism and Terrorism); 1,866 of them contain an opinion change. The 4,578 sequences are distributed as follows: 2,141 sequences in English (57,655 tweets), 1,942 sequences in French (48,854 tweets) and 495 sequences in Arabizi (21,216 tweets). A sub-corpus of 8,733 tweets (1,209 in English, 3,938 in French and 3,585 in Arabizi) annotated as “hateful” is also provided; these tweets were selected according to their topic/opinion annotations and because they contained insults. The data are provided in CSV format.
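The sequence selection described above can be sketched in Python as shown below. Several points are assumptions rather than documented facts: the row keys (`sequence_id`, `theme`, `opinion`) are hypothetical, a tweet is counted when its theme is one of the 3 predefined ones, and an "opinion change" is taken to mean two consecutive on-theme tweets with different opinion labels. The corpus documentation may define these differently.

```python
from collections import defaultdict

PREDEFINED_THEMES = {"Hooliganism", "Racism", "Terrorism"}


def group_by_sequence(rows):
    """Group annotation rows (dicts) by sequence, preserving row order."""
    sequences = defaultdict(list)
    for row in rows:
        sequences[row["sequence_id"]].append(row)
    return dict(sequences)


def is_retained(sequence, min_tweets=20):
    """True if the sequence has at least `min_tweets` on-theme tweets."""
    on_theme = [r for r in sequence if r["theme"] in PREDEFINED_THEMES]
    return len(on_theme) >= min_tweets


def has_opinion_change(sequence):
    """True if two consecutive on-theme tweets carry different opinions."""
    opinions = [r["opinion"] for r in sequence if r["theme"] in PREDEFINED_THEMES]
    return any(a != b for a, b in zip(opinions, opinions[1:]))
```

Applying `is_retained` and then `has_opinion_change` to each group returned by `group_by_sequence` would give counts comparable in spirit to the figures above, under the stated assumptions.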

Remark: this corpus includes only tweet IDs and corresponding annotations. Original tweets may be obtained by using the Twitter API.
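Since only tweet IDs are distributed, a user would typically read the IDs from the CSV files and group them into batches before requesting the tweets. The sketch below covers that preparation step only and makes no API call. The column name `tweet_id` is hypothetical, and the default batch size of 100 is an assumption based on the per-request cap commonly documented for tweet lookup; check the current API documentation before relying on it.

```python
import csv
from itertools import islice


def read_tweet_ids(csv_file, id_column="tweet_id"):
    """Read tweet IDs from an open CSV file, keeping them as strings."""
    reader = csv.DictReader(csv_file)
    return [row[id_column] for row in reader if row.get(id_column)]


def batched(ids, batch_size=100):
    """Yield successive lists of at most `batch_size` IDs."""
    iterator = iter(ids)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Each batch can then be passed to whichever client is used to retrieve the tweets; tweets that have since been deleted or made private will not be returned.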

Resource description in French:

Le Corpus de tweets annotés en arabizi, français et anglais a été produit par ELDA pour le compte de l’INSA Rouen Normandie (Normandie Université, équipe LITIS), dans le cadre du projet SAPhIRS (Système pour l'Analyse de la Propagation d'Information dans les Réseaux Sociaux), financé par la DGE (Direction Générale des Entreprises, France) via le programme RAPID (2017-2020). Le projet visait à étudier les mécanismes de propagation d’information et d’opinion au sein des réseaux sociaux: repérer des leaders d’influence, détecter des canaux de diffusion d’information et d’opinion. L’objectif de la constitution de ce corpus, finalisé en 2020, était de collecter et annoter des tweets en 3 langues (arabizi, français et anglais) sur 3 thèmes prédéfinis (hooliganisme, racisme et terrorisme).

Pour la collecte, un outil a été développé sous Python (basé sur la librairie “GetOldTweets3”), en utilisant des informations telles que la langue (anglais/français) et une liste de mots-clés en entrée. Grâce à cet outil, un maximum de 10000 tweets par couple (mot-clé, langue) a pu être collecté pour l’anglais et le français. Pour l’arabizi, un processus spécifique a été établi, consistant en la création d’une liste de vocabulaire en arabizi à partir d’un corpus de SMS en arabizi (pour le marocain et le tunisien), ainsi que les Données de test et d’entraînement pour la détection et la translittération de l’arabizi (disponible via ELRA sous la référence ELRA-W0126, ISLRN ID: 986-364-744-303-9), en sélectionnant les 1000 mots les plus fréquents, et en téléchargeant les tweets contenant chaque mot de ces listes de vocabulaire et de mots-clés (lieux = Maroc, Tunisie, Algérie). Les tweets qui ont été conservés devaient contenir au moins 5 mots en arabizi.

Pour l’annotation, un outil tournant sous Django a été développé afin de fournir pour chaque tweet dans une séquence donnée les annotations suivantes:
• Thème: avec 5 annotations possibles (hooliganisme, racisme, terrorisme, autre et incompréhensible)
• Sujet: l’annotateur peut ajouter un nouveau sujet s’il n’existe pas dans la liste proposée
• Opinion: 3 annotations possibles (négatif, neutre, positif)

Au total, 17103 séquences ont été annotées à partir de 585163 tweets (196374 en anglais, 254748 en français et 134041 en arabizi), incluant l’annotation des thèmes “autre” et “incompréhensible”. À partir de ces séquences, 4578 séquences ayant au moins 20 tweets annotés avec les 3 thèmes prédéfinis (hooliganisme, racisme et terrorisme) ont été obtenues, y compris 1866 séquences contenant des changements d’opinion. Celles-ci sont réparties comme suit: 2141 séquences en anglais (57655 tweets), 1942 séquences en français (48854 tweets) et 495 séquences en arabizi (21216 tweets). Un sous-corpus de 8733 tweets (1209 en anglais, 3938 en français et 3585 en arabizi) annoté comme “haineux” d’après les annotations en sujet/opinion et en sélectionnant les tweets qui contenaient des insultes est également fourni. Les données sont fournies au format CSV.

Remarque: Ce corpus comprend uniquement les identifiants de tweets et les annotations correspondantes. Les tweets d’origine peuvent être récupérés en utilisant l’API Twitter.

MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	0.00 €	7000.00 €
Licence: Commercial Use - ELRA VAR	7000.00 €	7000.00 €

NON MEMBER	academic	commercial
Licence: Non Commercial Use - ELRA END USER	0.00 €	10000.00 €
Licence: Commercial Use - ELRA VAR	10000.00 €	10000.00 €

Distribution

Availability start date: 05/04/2022

Contact Person: Valérie Mapelli

text

Multilingual text corpus

Languages

French

Language Script: Latin

English

Language Script: Latin

Arabic

Language Script: Latin

Linguality

Linguality type: Multilingual

Multi-linguality type: Comparable

Size

17,103 Entries

Metadata

Created: 04/05/2022

Last Updated: 04/05/2022

Metadata Language: French, English (fr, en)

Version

Version: 1.0
