camel_tools.dialectid
Danger
Note: This component is not available on Windows.
This module contains the CAMeL Tools dialect identification component. The system distinguishes between 25 Arabic city dialects as well as Modern Standard Arabic. It is based on the system described by Salameh, Bouamor and Habash.
Classes
class camel_tools.dialectid.DIDPred
A named tuple containing dialect ID prediction results.
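As a rough illustration of how such a named tuple can be consumed, the sketch below uses a stand-in `DIDPred` built with `collections.namedtuple` so that it runs without camel-tools installed. The `top` field appears in the example at the end of this page; the `scores` field name and the score values are assumptions made for illustration.

```python
from collections import namedtuple

# Stand-in for camel_tools.dialectid.DIDPred. The 'scores' field name is an
# assumption; only 'top' is shown elsewhere on this page.
DIDPred = namedtuple('DIDPred', ['top', 'scores'])

# Made-up scores, for illustration only.
pred = DIDPred(top='BEI', scores={'BEI': 0.6, 'DAM': 0.3, 'CAI': 0.1})

# Fields can be read by name or unpacked like a regular tuple.
top, scores = pred

# Rank labels from most to least likely.
ranked = sorted(pred.scores.items(), key=lambda kv: kv[1], reverse=True)
print(pred.top, ranked[:2])
```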
class camel_tools.dialectid.DialectIdentifier(labels=None, labels_extra=None, char_lm_dir=None, word_lm_dir=None)
A class for training, evaluating and running the dialect identification model described by Salameh et al. After initializing an instance, you must run the train method once before using it.
Parameters:
- labels (set of str, optional) – The set of dialect labels used in the training data in the main model. If None, the default labels are used. Defaults to None.
- labels_extra (set of str, optional) – The set of dialect labels used in the training data in the extra features model. If None, the default labels are used. Defaults to None.
- char_lm_dir (str, optional) – Path to the directory containing the character-based language models. If None, use the language models that come with this package. Defaults to None.
- word_lm_dir (str, optional) – Path to the directory containing the word-based language models. If None, use the language models that come with this package. Defaults to None.
predict(sentences, output='label')
Predict the dialect probability scores for a given list of sentences.
Returns: A list of prediction results, each corresponding to its respective sentence.
static pretrained()
Load the default pre-trained model provided with camel-tools.
Raises: PretrainedModelError – When a pre-trained model compatible with the current Python version isn’t available.
Returns: The loaded model.
Return type: DialectIdentifier
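Since `pretrained()` may raise `PretrainedModelError`, callers may want to guard the load. The following is a minimal sketch: the helper name `load_identifier` is hypothetical, and catching `ImportError` is an assumption about how an unavailable installation might surface.

```python
def load_identifier():
    """Return a pre-trained DialectIdentifier, or None if unavailable.

    Hypothetical helper for illustration; not part of camel-tools.
    """
    try:
        # Imported lazily so the helper can be defined even where
        # camel-tools is not installed.
        from camel_tools.dialectid import (DialectIdentifier,
                                           PretrainedModelError)
    except ImportError:
        return None

    try:
        return DialectIdentifier.pretrained()
    except PretrainedModelError:
        # No pre-trained model compatible with this Python version.
        return None
```

Returning `None` keeps the sketch short; real code may prefer to log the failure or re-raise with more context.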
class camel_tools.dialectid.DialectIdError(msg)
Base class for all CAMeL Dialect ID errors.
class camel_tools.dialectid.UntrainedModelError(msg)
Error thrown when attempting to use an untrained DialectIdentifier instance.
class camel_tools.dialectid.PretrainedModelError(msg)
Error thrown when attempting to load a pretrained model provided with camel-tools.
Functions
camel_tools.dialectid.label_to_city(prediction)
Converts a dialect prediction using labels to use city names instead.
Parameters: pred (DIDPred) – The prediction to convert.
Returns: The converted prediction.
Return type: DIDPred
camel_tools.dialectid.label_to_country(prediction)
Converts a dialect prediction using labels to use country names instead.
Parameters: pred (DIDPred) – The prediction to convert.
Returns: The converted prediction.
Return type: DIDPred
camel_tools.dialectid.label_to_region(prediction)
Converts a dialect prediction using labels to use region names instead.
Parameters: pred (DIDPred) – The prediction to convert.
Returns: The converted prediction.
Return type: DIDPred
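To make the label conversions concrete, here is a minimal sketch of converting a label-level prediction to the country level. It uses a stand-in `DIDPred` with assumed `top` and `scores` fields, a small excerpt of the label table on this page, and made-up scores. Summing the scores of labels that share a country is an assumption about a reasonable aggregation, not a statement of how `label_to_country` is implemented; `to_country` is a hypothetical name.

```python
from collections import namedtuple

# Stand-in for camel_tools.dialectid.DIDPred (field names are assumptions).
DIDPred = namedtuple('DIDPred', ['top', 'scores'])

# Excerpt of the label table on this page.
LABEL_TO_COUNTRY = {
    'CAI': 'Egypt',
    'ALX': 'Egypt',
    'ASW': 'Egypt',
    'BEI': 'Lebanon',
    'DAM': 'Syria',
}


def to_country(prediction):
    """Hypothetical conversion: sum scores of labels sharing a country."""
    scores = {}
    for label, score in prediction.scores.items():
        country = LABEL_TO_COUNTRY[label]
        scores[country] = scores.get(country, 0.0) + score
    # The top prediction is recomputed from the aggregated scores.
    top = max(scores, key=scores.get)
    return DIDPred(top, scores)


# Made-up scores: Beirut is the top city, but the two Egyptian cities
# together outweigh it at the country level.
pred = DIDPred('BEI', {'BEI': 0.4, 'CAI': 0.25, 'ALX': 0.25, 'DAM': 0.1})
converted = to_country(pred)
print(converted.top, converted.scores)
```

Under this aggregation the top country can differ from the country of the top city, which is why the sketch recomputes `top` instead of simply renaming it.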
camel_tools.dialectid.label_city_pairs()
Returns the set of default label-city pairs.
Returns: The set of default label-city pairs.
Return type: frozenset of tuple
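A set of pairs like this converts naturally into lookup tables. The sketch below uses a stand-in frozenset with three entries taken from the label table on this page instead of calling `label_city_pairs()`; the `(label, city)` ordering inside each tuple is an assumption.

```python
# Stand-in for the value returned by label_city_pairs(); the (label, city)
# ordering inside each tuple is an assumption.
pairs = frozenset({('BEI', 'Beirut'), ('CAI', 'Cairo'), ('RAB', 'Rabat')})

# Forward lookup: label -> city.
label_to_city = dict(pairs)

# Reverse lookup: city -> label.
city_to_label = {city: label for label, city in pairs}

print(label_to_city['CAI'], city_to_label['Rabat'])
```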
Labels
Below is a table mapping output labels to their respective city, country, and region dialects:
Label | City | Country | Region |
---|---|---|---|
ALE | Aleppo | Syria | Levant |
ALG | Algiers | Algeria | Maghreb |
ALX | Alexandria | Egypt | Nile Basin |
AMM | Amman | Jordan | Levant |
ASW | Aswan | Egypt | Nile Basin |
BAG | Baghdad | Iraq | Iraq |
BAS | Basra | Iraq | Iraq |
BEI | Beirut | Lebanon | Levant |
BEN | Benghazi | Libya | Maghreb |
CAI | Cairo | Egypt | Nile Basin |
DAM | Damascus | Syria | Levant |
DOH | Doha | Qatar | Gulf |
FES | Fes | Morocco | Maghreb |
JED | Jeddah | Saudi Arabia | Gulf |
JER | Jerusalem | Palestine | Levant |
KHA | Khartoum | Sudan | Nile Basin |
MOS | Mosul | Iraq | Iraq |
MSA | Modern Standard Arabic | Modern Standard Arabic | Modern Standard Arabic |
MUS | Muscat | Oman | Gulf |
RAB | Rabat | Morocco | Maghreb |
RIY | Riyadh | Saudi Arabia | Gulf |
SAL | Salt | Jordan | Levant |
SAN | Sana’a | Yemen | Gulf of Aden |
SFX | Sfax | Tunisia | Maghreb |
TRI | Tripoli | Libya | Maghreb |
TUN | Tunis | Tunisia | Maghreb |
Examples
Below is an example of how to load and use the default pre-trained model.
from camel_tools.dialectid import DialectIdentifier
did = DialectIdentifier.pretrained()
sentences = [
'مال الهوى و مالي شكون اللي جابني ليك ما كنت انايا ف حالي بلاو قلبي يانا بيك',
'بدي دوب قلي قلي بجنون بحبك انا مجنون ما بنسى حبك يوم'
]
predictions = did.predict(sentences)
# Each prediction is a tuple containing both the top prediction and the
# percentage score of each dialect. To get only the top prediction, we can
# do the following:
top_dialects = [p.top for p in predictions]