About uTeXer
uTeXer is a helper script written in Python that translates Unicode math signs and Latin ligatures into LaTeX and plain text to make them readable for blind computer users. It can be used for:
- translating formulas from web sites or from PDFs into LaTeX commands
- cleaning up the text versions of PDFs (generated e.g. by pdftotext from poppler-utils); these often contain Latin ligatures or other signs which make the text harder to read.
With this, ligatures can be removed from papers or other scientific documents and mathematical equations can be made readable for blind people.
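To illustrate the idea, here is a minimal Python sketch of this kind of replacement. It is not uTeXer's actual code: the table is a tiny hand-picked sample chosen for this example, and replace_signs is a hypothetical name.

```python
# Illustrative sketch only, not uTeXer's implementation.
# The mapping is a small hand-picked sample, not the real Unicode table.
REPLACEMENTS = {
    0xFB00: "ff",      # LATIN SMALL LIGATURE FF
    0xFB01: "fi",      # LATIN SMALL LIGATURE FI
    0xFB02: "fl",      # LATIN SMALL LIGATURE FL
    0x2264: "\\leq",   # LESS-THAN OR EQUAL TO
    0x2208: "\\in",    # ELEMENT OF
}


def replace_signs(text):
    """Replace every character that has an entry in REPLACEMENTS."""
    # str.translate accepts a dict mapping code points to replacement strings.
    return text.translate(REPLACEMENTS)


print(replace_signs("di\ufb00erent \ufb01le: x \u2264 y"))
```

A screen reader can then speak the ligature-free words and the LaTeX command names instead of stumbling over the special characters.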
Download/Installation
You can obtain a copy by using git:
git clone https://github.com/humenda/utexer.git
or download a zip file.
Installation on GNU/Linux
To run uTeXer, you need a working Python 3 installation. You can use
./install
which installs the program to /usr/local/*. Set PREFIX="/" to install it to /bin and /share directly, or PREFIX="/opt" or PREFIX="/usr" to install it below those directories.
You can also run it directly from the source.
Installation On Windows
As mentioned above, Python 3 must be installed.
Currently no installer exists, so you have to open the command line (Windows key + R, type cmd, press Enter) and change to uTeXer's directory using "cd". Copy the files you want to convert into uTeXer's directory; usage is then as described below.
I plan to ship an installer in the future.
Using uTeXer
uTeXer is a simple program; the help screen should explain most of it:
Usage: utexer [options] INPUTFILE
If no output file is specified with the -o option, the input file will be
overwritten. If no input file is specified, stdin/stdout will be used (but you
can redirect stdout with -o too).
Options:
  -h, --help            show this help message and exit
  -e ENC, --encoding=ENC
                        Set encoding for stdin (default UTF-8)
  -l, --ligature        replace ligatures through normal letters (at least in
                        Latin languages where they are only for better
                        readibility)
  -o FILE, --output=FILE
                        set output file (if unset, overwrite input file)
  -p, --pdftotext       Replace some signs generated just by PDFtotext
  -s, --strip-newpage   Strip the newpage character
  -u FILE, --userdict=FILE
                        set path to user-defined replacements/additions for
                        unicode mappings (format described in README)
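As a rough illustration of what the -s option is about: pdftotext separates pages with a form feed character (U+000C). The Python sketch below assumes that this is the "newpage character" meant in the help screen; strip_newpage is a hypothetical helper and not taken from uTeXer's source.

```python
# Assumption: the "newpage character" is the ASCII form feed (U+000C),
# which pdftotext writes between pages.
FORM_FEED = "\x0c"


def strip_newpage(text):
    """Remove all form feed characters from text (hypothetical helper)."""
    return text.replace(FORM_FEED, "")


page_dump = "end of page 1\n\x0cstart of page 2\n"
print(strip_newpage(page_dump))
```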
Where Do The LaTeX-commands Come From / How Do I Customize Them?
The initial Unicode table was downloaded from:
http://www.w3.org/Math/characters/unicode.xml
With the -u switch you can supply an additional Unicode table to override existing mappings or to add new code points. The format is simple:
<decimal_number><tab><replacement>
Example:
123 \{
This allows you to customize the LaTeX commands. For example, I don't like \varnothing; \emptyset seems more intuitive to me.
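The format above could be read with a few lines of Python. This is only a sketch under assumptions: parse_userdict is a hypothetical name, skipping blank lines is my own choice, and uTeXer's real parser may behave differently. The second sample line maps code point 8709 (U+2205, the empty set sign) to \emptyset.

```python
def parse_userdict(lines):
    """Parse '<decimal_number><tab><replacement>' lines into a dict.

    Hypothetical helper; blank lines are skipped, which is an assumption.
    """
    table = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        # Split only at the first tab so the replacement may contain tabs.
        number, replacement = line.split("\t", 1)
        table[int(number)] = replacement
    return table


# Two sample entries: 123 is "{", 8709 is U+2205 (empty set).
sample = ["123\t\\{\n", "8709\t\\emptyset\n"]
table = parse_userdict(sample)
print("A = \u2205".translate(table))
```

The resulting dict maps code points to strings, so it can be passed to str.translate directly.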
Make PDFs Readable
PDFs often contain ligatures. To read them painlessly with a screen reader, it is often not sufficient to just save the plain text or to copy the text from the PDF viewer. To get a usable result, extract the text using the poppler utils, a collection of command line programs available for GNU/Linux and Windows (see below). The text extracted by the command line program "pdftotext" can then be used as input for uTeXer, which in many cases yields a well usable result.
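The two steps can be chained from Python, for example. The sketch below assumes that both pdftotext and utexer are on the PATH (when running from the source directory, the utexer call will need adjusting) and uses the hypothetical file names paper.pdf and paper.txt. The options -l, -p and -s are the ones listed in the help screen; build_commands and pdf_to_readable_text are names made up for this example.

```python
import subprocess


def build_commands(pdf_path, txt_path):
    """Return the two command lines as argument lists (hypothetical helper)."""
    # Step 1: extract the plain text from the PDF.
    extract = ["pdftotext", pdf_path, txt_path]
    # Step 2: clean the text; without -o, uTeXer overwrites txt_path in place.
    clean = ["utexer", "-l", "-p", "-s", txt_path]
    return extract, clean


def pdf_to_readable_text(pdf_path, txt_path):
    """Run both steps; raises CalledProcessError if either tool fails."""
    for command in build_commands(pdf_path, txt_path):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Only print the commands here; call pdf_to_readable_text() to run them.
    for command in build_commands("paper.pdf", "paper.txt"):
        print(" ".join(command))
```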
Download Of Poppler-Utils
- Windows: Poppler along with XPDF at http://www.foolabs.com/xpdf/download.html
- Ubuntu / Debian:
apt-get install poppler-utils
Known Issues
uTeXer cannot fully translate formulas. In particular, formulas that span more than one line (e.g. fractions) as well as indices and powers are often not recognized, because they are not marked in Unicode but only by their relative height. This only matters for PDF output; on web pages, people often use tags to indicate subscripts and so on.
Overlines and underlines are also lost.
There are signs in the Unicode table which should not be translated or which are translated to uncommon LaTeX commands:
- \varnothing instead of the more common \emptyset
- { } instead of \lbrace and \rbrace, since source code is also replaced