camel_diac
About
The camel_diac tool allows you to diacritize Arabic text.
Usage
Below is the usage information that can be generated by running camel_diac --help.
Usage:
camel_diac [-d DATABASE | --db=DATABASE]
[-m MARKER | --marker=MARKER]
[-I | --ignore-markers]
[-S | --strip-markers]
[-p | --pretokenized]
[-o OUTPUT | --output=OUTPUT] [FILE]
camel_diac (-l | --list-schemes)
camel_diac (-v | --version)
camel_diac (-h | --help)
Options:
-d DATABASE --db=DATABASE
Morphology database to use. DATABASE could be the name of a builtin
database or a path to a database file. [default: calima-msa-r13]
-o OUTPUT --output=OUTPUT
Output file. If not specified, output will be printed to stdout.
-m MARKER --marker=MARKER
Marker used to prefix tokens not to be transliterated.
[default: @@IGNORE@@]
-I --ignore-markers
Transliterate marked words as well.
-S --strip-markers
Remove markers in output.
-p --pretokenized
Input is already pre-tokenized by punctuation. When this is set,
camel_diac will not split tokens by punctuation but any tokens that
do contain punctuation will not be diacritized.
-l --list
Show a list of morphological databases.
-h --help
Show this screen.
-v --version
Show version.
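As a rough illustration of how the options above combine, the following Python sketch assembles a camel_diac argument list. The helper name build_camel_diac_command and its parameters are hypothetical and not part of camel_tools; only the flags themselves are taken from the usage text above, and the file names are placeholders.

```python
from typing import List, Optional


def build_camel_diac_command(
    input_file: Optional[str] = None,
    db: Optional[str] = None,
    marker: Optional[str] = None,
    ignore_markers: bool = False,
    strip_markers: bool = False,
    pretokenized: bool = False,
    output: Optional[str] = None,
) -> List[str]:
    """Assemble an argument list for camel_diac (hypothetical helper).

    Options left at their defaults are omitted, so camel_diac falls back
    to its own defaults (for example, the default database and marker).
    """
    cmd = ["camel_diac"]
    if db is not None:
        cmd += ["-d", db]        # builtin database name or path to a database file
    if marker is not None:
        cmd += ["-m", marker]    # marker that prefixes tokens to be left alone
    if ignore_markers:
        cmd.append("-I")         # process marked words as well
    if strip_markers:
        cmd.append("-S")         # remove markers in the output
    if pretokenized:
        cmd.append("-p")         # input is already tokenized by punctuation
    if output is not None:
        cmd += ["-o", output]    # write to a file instead of stdout
    if input_file is not None:
        cmd.append(input_file)   # positional FILE argument comes last
    return cmd


if __name__ == "__main__":
    # "input.txt" and "output.txt" are placeholder file names.
    print(" ".join(build_camel_diac_command("input.txt", strip_markers=True, output="output.txt")))
```

Assuming camel_diac is installed and on your PATH, the resulting list can be passed to subprocess.run(cmd, check=True) to run the tool.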
Databases
We provide builtin databases, which can be passed to -d or --db, so that camel_diac can be run out of the box.
A list of available databases can be found at Databases.
You can always check what builtin databases are provided in your current camel_tools installation by running camel_diac --list.
Alternatively, you can pass in a path to a database of your choosing instead of
one of the builtin databases listed above.
If no database is specified, calima-msa-r13 is used.
Notes on markers
A marker is a string with no whitespace characters at the beginning, middle, or
end of it (in other words, it’s a single token without padding spaces). As a
rule of thumb, pick a marker that is unlikely to appear in your text. We
use @@IGNORE@@ as the default value, while some Arabic NLP tools use
@@LAT@@ to denote Latin/foreign text.
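The marker rules above can be sketched in Python as a small preprocessing step that runs before the text is handed to camel_diac. This is only an illustration under stated assumptions: the helper names is_valid_marker and mark_latin_tokens are hypothetical and not part of camel_tools, a token is treated as foreign if it contains any ASCII Latin letter, and the marker is assumed to be attached directly to the front of the token with no separating space.

```python
import re

DEFAULT_MARKER = "@@IGNORE@@"

# Assumption: a token counts as Latin/foreign if it contains an ASCII letter.
_LATIN_RE = re.compile(r"[A-Za-z]")


def is_valid_marker(marker: str) -> bool:
    """Return True if the marker is non-empty and contains no whitespace."""
    return bool(marker) and not any(ch.isspace() for ch in marker)


def mark_latin_tokens(line: str, marker: str = DEFAULT_MARKER) -> str:
    """Prefix whitespace-separated tokens containing Latin letters with the marker.

    Note that runs of whitespace are collapsed to single spaces in the result.
    """
    if not is_valid_marker(marker):
        raise ValueError("marker must be non-empty and contain no whitespace")
    tokens = line.split()
    marked = [marker + tok if _LATIN_RE.search(tok) else tok for tok in tokens]
    return " ".join(marked)


if __name__ == "__main__":
    # Latin-script tokens receive the marker; Arabic tokens are left unchanged.
    print(mark_latin_tokens("ذهب Ahmed الى London"))
```

A custom marker such as @@LAT@@ would then need to be passed to camel_diac with -m so that the tool recognizes the marked tokens.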