camel_transliterate
About
The camel_transliterate tool allows you to transliterate text from one form
to another using one of the builtin transliteration schemes. It also allows
tokens to be prefixed with a marker to indicate that they should not be
transliterated.
Usage
Below is the usage information that can be generated by running
camel_transliterate --help.
Usage:
    camel_transliterate (-s SCHEME | --scheme=SCHEME)
                        [-m MARKER | --marker=MARKER]
                        [-I | --ignore-markers]
                        [-S | --strip-markers]
                        [-o OUTPUT | --output=OUTPUT] [FILE]
    camel_transliterate (-l | --list)
    camel_transliterate (-v | --version)
    camel_transliterate (-h | --help)
Options:
    -s SCHEME --scheme=SCHEME
            Scheme used for transliteration.
    -o OUTPUT --output=OUTPUT
            Output file. If not specified, output will be printed to stdout.
    -m MARKER --marker=MARKER
            Marker used to prefix tokens not to be transliterated.
            [default: @@IGNORE@@]
    -I --ignore-markers
            Transliterate marked words as well.
    -S --strip-markers
            Remove markers in output.
    -l --list
            Show a list of available transliteration schemes.
    -h --help
            Show this screen.
    -v --version
            Show version.
Below is a list of currently available transliteration schemes.
    ar2bw           Arabic to Buckwalter
    ar2safebw       Arabic to Safe Buckwalter
    ar2xmlbw        Arabic to XML Buckwalter
    ar2hsb          Arabic to Habash-Soudi-Buckwalter
    bw2ar           Buckwalter to Arabic
    bw2safebw       Buckwalter to Safe Buckwalter
    bw2xmlbw        Buckwalter to XML Buckwalter
    bw2hsb          Buckwalter to Habash-Soudi-Buckwalter
    safebw2ar       Safe Buckwalter to Arabic
    safebw2bw       Safe Buckwalter to Buckwalter
    safebw2xmlbw    Safe Buckwalter to XML Buckwalter
    safebw2hsb      Safe Buckwalter to Habash-Soudi-Buckwalter
    xmlbw2ar        XML Buckwalter to Arabic
    xmlbw2bw        XML Buckwalter to Buckwalter
    xmlbw2safebw    XML Buckwalter to Safe Buckwalter
    xmlbw2hsb       XML Buckwalter to Habash-Soudi-Buckwalter
    hsb2ar          Habash-Soudi-Buckwalter to Arabic
    hsb2bw          Habash-Soudi-Buckwalter to Buckwalter
    hsb2safebw      Habash-Soudi-Buckwalter to Safe Buckwalter
    hsb2xmlbw       Habash-Soudi-Buckwalter to XML Buckwalter
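For scripted use, the options documented above can be assembled into a command line programmatically. The following is a minimal Python sketch and not part of the tool itself: `build_command` is a hypothetical helper, and `input.txt` and `output.txt` are placeholder file names. It uses only the flags shown in the usage text.

```python
import shlex


def build_command(scheme, input_file=None, output=None, marker=None,
                  ignore_markers=False, strip_markers=False):
    """Build an argv list for camel_transliterate (hypothetical helper)."""
    # The scheme is the only required option in the usage pattern.
    argv = ["camel_transliterate", "--scheme=" + scheme]
    if marker is not None:
        argv.append("--marker=" + marker)
    if ignore_markers:
        argv.append("--ignore-markers")
    if strip_markers:
        argv.append("--strip-markers")
    if output is not None:
        argv.append("--output=" + output)
    # The optional FILE argument comes last.
    if input_file is not None:
        argv.append(input_file)
    return argv


# Placeholder file names, chosen only for illustration.
cmd = build_command("ar2bw", input_file="input.txt",
                    output="output.txt", strip_markers=True)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` on a system where camel_transliterate is installed. Building a list rather than a single string avoids shell quoting problems with marker values.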
Notes on markers
A marker is a string with no whitespace characters at the beginning, middle,
or end (in other words, a single token without padding spaces). As a rule of
thumb, pick a marker that is unlikely to appear in your text. We use
@@IGNORE@@ as the default value, while some Arabic NLP tools use @@LAT@@ to
denote Latin/foreign text.
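The interaction between markers and the --ignore-markers and --strip-markers options can be illustrated with a small Python sketch. This is not the tool's implementation: the four-letter `TABLE` is a tiny excerpt of an Arabic-to-Buckwalter mapping, `transliterate` is a hypothetical function, and keeping the marker text verbatim when a marked token is transliterated is an assumption.

```python
import re

# Tiny illustrative excerpt of an Arabic-to-Buckwalter table (not the full scheme).
TABLE = str.maketrans({"ك": "k", "ت": "t", "ا": "A", "ب": "b"})


def transliterate(text, marker="@@IGNORE@@",
                  ignore_markers=False, strip_markers=False):
    """Transliterate whitespace-separated tokens, honoring a marker prefix."""
    out = []
    # The capturing group keeps whitespace runs so spacing is preserved.
    for piece in re.split(r"(\s+)", text):
        marked = piece.startswith(marker)
        body = piece[len(marker):] if marked else piece
        # Marked tokens are skipped unless markers are being ignored.
        if not marked or ignore_markers:
            body = body.translate(TABLE)
        # Assumption: the marker itself is never transliterated.
        prefix = marker if marked and not strip_markers else ""
        out.append(prefix + body)
    return "".join(out)


sample = "كتاب @@IGNORE@@كتاب"
print(transliterate(sample))
print(transliterate(sample, strip_markers=True))
print(transliterate(sample, ignore_markers=True))
```

In this sketch, the default call leaves the marked token untouched, `strip_markers=True` removes the marker but still leaves the token in Arabic script, and `ignore_markers=True` transliterates the marked token as well.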
Notes on schemes
The transliteration schemes ar2bw, ar2safebw, ar2xmlbw, ar2hsb, bw2ar,
bw2safebw, bw2xmlbw, bw2hsb, safebw2ar, safebw2bw, safebw2xmlbw, safebw2hsb,
xmlbw2ar, xmlbw2bw, xmlbw2safebw, xmlbw2hsb, hsb2ar, hsb2bw, hsb2safebw, and
hsb2xmlbw use the conversion table listed in Encoding Schemes.
Input characters not listed in the conversion table are output as they appear,
without any transliteration.
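The pass-through rule for unlisted characters can be sketched in Python as a table lookup with the input character as its own fallback. The four-entry `TABLE` below is only a small excerpt of an Arabic-to-Buckwalter mapping chosen for illustration, and `map_string` is a hypothetical name, not the tool's API.

```python
# Small illustrative excerpt of an Arabic-to-Buckwalter table (not the full scheme).
TABLE = {"ك": "k", "ت": "t", "ا": "A", "ب": "b"}


def map_string(text, table=TABLE):
    """Map each character through the table; unlisted characters pass through."""
    return "".join(table.get(ch, ch) for ch in text)


# Digits, spaces, and punctuation are not in the table, so they are unchanged.
print(map_string("كتاب 2024!"))  # prints "ktAb 2024!"
```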