camel_dediac
About
The camel_dediac tool removes diacritics from (dediacritizes) Arabic text in multiple encoding schemes.
Usage
Below is the usage information that can be generated by running camel_dediac --help.
Usage:
camel_dediac [-s <SCHEME> | --scheme=<SCHEME>]
[-m <MARKER> | --marker=<MARKER>]
[-I | --ignore-markers]
[-S | --strip-markers]
[-o OUTPUT | --output=OUTPUT] [FILE]
camel_dediac (-l | --list)
camel_dediac (-v | --version)
camel_dediac (-h | --help)
Options:
-s <SCHEME> --scheme=<SCHEME>
The encoding scheme of the input text. [default: ar]
-o OUTPUT --output=OUTPUT
Output file. If not specified, output will be printed to stdout.
-m <MARKER> --marker=<MARKER>
Marker used to prefix tokens not to be de-diacritized.
[default: @@IGNORE@@]
-I --ignore-markers
De-diacritize words prefixed with a marker.
-S --strip-markers
Remove prefix markers in output if --ignore-markers is set.
-l --list
Show a list of available input encoding schemes.
-h --help
Show this screen.
-v --version
Show version.
Below is a list of currently available encoding schemes.
ar Arabic script
bw Buckwalter encoding
safebw Safe Buckwalter encoding
xmlbw XML Buckwalter encoding
hsb Habash-Soudi-Buckwalter encoding
See Encoding Schemes for more information on encodings.
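To make the effect of the scheme option concrete, here is a minimal Python sketch of dediacritization in two of the schemes listed above. It is not the tool's implementation: the function name dediac_sketch is hypothetical, and the diacritic sets are an assumption limited to the eight core marks (the three tanween forms, fatha, damma, kasra, shadda, and sukun) and their standard Buckwalter letters. The real tool may remove additional characters.

```python
import re

# Assumption: only the eight core diacritics are covered here.
# Arabic script: U+064B (fathatan) through U+0652 (sukun).
AR_DIACRITICS = re.compile("[\u064b-\u0652]")
# Buckwalter letters for the same eight marks, in the same order.
BW_DIACRITICS = re.compile("[FNKaui~o]")

# Hypothetical helper for illustration; not part of camel_dediac.
def dediac_sketch(text, scheme="ar"):
    """Remove the core diacritics from text in the given scheme."""
    patterns = {"ar": AR_DIACRITICS, "bw": BW_DIACRITICS}
    if scheme not in patterns:
        raise ValueError("unsupported scheme in this sketch: " + scheme)
    return patterns[scheme].sub("", text)

# The same word in Buckwalter and in Arabic script.
print(dediac_sketch("kataba", scheme="bw"))  # prints "ktb"
print(dediac_sketch("\u0643\u064e\u062a\u064e\u0628\u064e", scheme="ar"))
```

In the Buckwalter scheme, diacritics are ordinary Latin letters, so any foreign word in the input would lose its a, u, and i as well. Passing the wrong scheme therefore silently corrupts text rather than raising an error.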
Notes on markers
A marker is a string with no whitespace characters at the beginning, middle, or end of it (in other words, it’s a single token without padding spaces). As a rule of thumb, pick a marker that is unlikely to appear in your text. We use @@IGNORE@@ as the default value, while some Arabic NLP tools use @@LAT@@ to denote Latin/foreign text.
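The interaction between the marker options can be sketched in Python as follows. This is one reading of the option descriptions above, not the tool's actual code: process_line and strip_bw_diacritics are hypothetical names, tokens are assumed to be whitespace-separated, and the choice to dediacritize only the part of a token after the marker is a design assumption made for this sketch.

```python
import re

# Assumption: a simplified Buckwalter diacritic set, for illustration only.
BW_DIACRITICS = re.compile("[FNKaui~o]")

def strip_bw_diacritics(token):
    """Stand-in dediacritizer for Buckwalter-encoded tokens."""
    return BW_DIACRITICS.sub("", token)

# Hypothetical helper mirroring the documented option names.
def process_line(line, dediac_fn, marker="@@IGNORE@@",
                 ignore_markers=False, strip_markers=False):
    """Dediacritize a line while honoring marker-prefixed tokens."""
    out = []
    # split() collapses runs of whitespace; the real tool may preserve them.
    for token in line.split():
        if not token.startswith(marker):
            out.append(dediac_fn(token))
            continue
        if not ignore_markers:
            # Default: marked tokens pass through untouched.
            out.append(token)
            continue
        # Dediacritize only the text after the marker. The default marker
        # contains "N", which is itself a Buckwalter diacritic letter.
        body = dediac_fn(token[len(marker):])
        out.append(body if strip_markers else marker + body)
    return " ".join(out)

line = "kataba @@IGNORE@@data"
print(process_line(line, strip_bw_diacritics))
print(process_line(line, strip_bw_diacritics, ignore_markers=True))
print(process_line(line, strip_bw_diacritics,
                   ignore_markers=True, strip_markers=True))
```

Under these assumptions the three calls leave the marked token intact, dediacritize it while keeping the marker, and dediacritize it with the marker removed, respectively.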