The NEMLAR Broadcast News Speech Corpus consists of about 40 hours of Standard Arabic news broadcasts. The broadcasts were recorded from four radio stations: Medi1, Radio Orient, RMC (Radio Monte Carlo), and RTM (Radio Television Maroc). All files were recorded in linear PCM format at a 16 kHz sampling rate with 16-bit samples.
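As a rough guide to the storage the broadcast recordings need, the stated format (16 kHz, 16 bit, linear PCM) can be turned into a size estimate. The sketch below is illustrative only: the function name `pcm_size_bytes` is hypothetical, and the single-channel (mono) default is an assumption, since the channel count is not stated in the description.

```python
def pcm_size_bytes(hours, sample_rate_hz=16_000, bits_per_sample=16, channels=1):
    """Estimate the size of uncompressed linear PCM audio in bytes.

    channels=1 (mono) is an assumption; the corpus description does not
    state the number of channels.
    """
    seconds = hours * 3600
    bytes_per_sample = bits_per_sample // 8
    return int(seconds * sample_rate_hz * bytes_per_sample * channels)


# About 40 hours of audio, as stated for the Broadcast News Speech Corpus.
size = pcm_size_bytes(40)
print(f"{size / 1e9:.2f} GB")
```

Under the mono assumption this comes to 32,000 bytes per second of audio; headers and transcription files would add to the total.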
The NEMLAR Speech Synthesis Corpus contains recordings of two native Egyptian Arabic speakers, one male (35 years old) and one female (27 years old), made in a studio over two channels (voice and laryngograph). The recordings comprise more than 10 hours of data with transcriptions.
The NEMLAR Written Corpus consists of about 500,000 words of Arabic text from 13 different categories. The corpus is provided in four versions: raw text, fully vowelized text, text with Arabic lexical analysis, and text with Arabic POS tags.