uTeXer - unicode signs to LaTeX

About uTeXer

uTeXer is a helper script written in Python that translates Unicode math signs and Latin ligatures into LaTeX and plain text to make them readable for blind computer users. It can be used for:

  • translating formulas from web sites or from PDFs into LaTeX signs
  • cleaning up the text versions of PDFs (generated e.g. by pdftotext from poppler-utils); these often contain Latin ligatures or other signs which make the text harder to read.

With this, ligatures can be removed from papers or other scientific documents and mathematical equations can be made readable for blind people.
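
To illustrate the ligature case, here is a minimal Python sketch of the idea. It is not uTeXer's actual code: it assumes the Latin ligatures of the Unicode block "Alphabetic Presentation Forms" (U+FB00 to U+FB06), and the names LIGATURES and replace_ligatures are made up for this example.

```python
# Illustrative sketch only; uTeXer's own replacement table may differ.
# Keys are single ligature characters, values are their plain-letter spelling.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long s + t
    "\ufb06": "st",
}

# str.maketrans accepts a dict of single characters mapped to strings.
_TABLE = str.maketrans(LIGATURES)


def replace_ligatures(text):
    """Return text with Latin ligatures expanded to plain letters."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    # The input contains the "ffi" and "fi" ligatures as single characters.
    print(replace_ligatures("e\ufb03cient \ufb01le"))
```

A screen reader may skip or mispronounce the single ligature character, whereas the expanded letters are read as an ordinary word.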

Download/Installation

You can obtain a copy by using git:

git clone https://github.com/humenda/utexer.git

or download it as a zip file.

Installation on GNU/Linux

To run uTeXer, you need a working Python 3 installation. You can use

./install

which installs the program to /usr/local/*. Alternatively, set PREFIX="/" to install it to /bin and /share directly (or set it to /opt or /usr for those locations, respectively).

You can also run it directly from the source directory.

Installation On Windows

As mentioned above, Python 3 must be installed.

Currently, no installer exists. Therefore, open the command line (Windows key + R, type cmd, press Enter) and change to uTeXer's directory using "cd". Copy the files you want to convert into uTeXer's directory; the usage is then as described below.

I plan to ship an installer in the future.

Using uTeXer

uTeXer is a simple program; the help screen should explain most of it:

Usage: utexer [options] INPUTFILE

If no output file is specified with the -o option, the input file will be
overwritten. If no input file is specified, stdin/stdout will be used (but you
can redirect stdout with -o too).

Options:
  -h, --help            show this help message and exit
  -e ENC, --encoding=ENC
                        Set encoding for stdin (default UTF-8)
  -l, --ligature        replace ligatures through normal letters (at least in
                        Latin languages where they are only for better
                        readibility)
  -o FILE, --output=FILE
                        set output file (if unset, overwrite input file)
  -p, --pdftotext       Replace some signs generated just by PDFtotext
  -s, --strip-newpage   Strip the newpage character
  -u FILE, --userdict=FILE
                        set path to user-defined replacements/additions for
                        unicode mappings (format described in README)

Where Do The LaTeX-commands Come From / How Do I Customize Them?

The initial unicode table was downloaded from:

http://www.w3.org/Math/characters/unicode.xml

With the -u switch you can supply an additional Unicode table to override (or even add) code points. The format is simple, one entry per line:

<decimal_number><tab><replacement>

Example:

123 \{

This allows you to customize the LaTeX commands. For example, I don't like \varnothing; \emptyset seems more intuitive to me.
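
The following Python sketch shows how such a table could be read and applied. It only illustrates the file format described above and is not uTeXer's own parser; the function names are hypothetical, skipping blank lines is an assumption, and the example entry 8709 assumes that the empty-set sign in question is U+2205.

```python
def parse_userdict(lines):
    """Parse '<decimal_number><tab><replacement>' lines into a dict
    that maps code points (int) to replacement strings."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # assumption: blank lines are ignored
        number, tab, replacement = line.partition("\t")
        if not tab:
            raise ValueError("missing tab in line: %r" % line)
        mapping[int(number)] = replacement
    return mapping


def apply_mapping(text, mapping):
    """Replace every character whose code point has an entry in mapping."""
    return "".join(mapping.get(ord(ch), ch) for ch in text)


if __name__ == "__main__":
    # 123 is "{" (the example from above); 8709 is U+2205, the empty set.
    table = parse_userdict(["123\t\\{", "8709\t\\emptyset"])
    print(apply_mapping("A = {\u2205}", table))
```

Saved as a file, the two entries from the example could be passed to uTeXer with -u.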

Make PDFs Readable

PDFs often contain ligatures, and to read them painlessly with a screen reader it is often not sufficient to just save the plain text or copy the text from the PDF viewer. To get a usable result, extract the text using the poppler utils, a collection of command line programs available for GNU/Linux and Windows (see below). The text extracted by the command line program "pdftotext" can then be used as input for uTeXer, and in many cases this yields a quite usable result.
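
The two steps can be chained from a small Python script. This is a sketch under assumptions: it expects "pdftotext" and "utexer" to be on the PATH (when running uTeXer from its source directory, pass the path to the script instead), and the function names are invented for this example. The options -l and -p are taken from the help screen above; since no -o is given, uTeXer overwrites the extracted text file.

```python
import subprocess


def build_commands(pdf_path, txt_path, utexer="utexer"):
    """Return the two command lines: text extraction, then clean-up."""
    return [
        # pdftotext writes the plain text of pdf_path to txt_path.
        ["pdftotext", pdf_path, txt_path],
        # -l replaces ligatures, -p replaces signs produced by pdftotext;
        # without -o the input file txt_path is overwritten.
        [utexer, "-l", "-p", txt_path],
    ]


def pdf_to_readable_text(pdf_path, txt_path, utexer="utexer"):
    """Run both commands in order and stop if one of them fails."""
    for command in build_commands(pdf_path, txt_path, utexer):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Only prints the commands; call pdf_to_readable_text() to run them.
    for command in build_commands("paper.pdf", "paper.txt"):
        print(" ".join(command))
```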

Download Of Poppler-Utils

Known Issues

uTeXer cannot fully translate formulas. In particular, formulas that are taller than one line (e.g. fractions), as well as indices and powers, are often not recognized, because they are not marked by Unicode characters but by a change in their relative height. This only matters for PDF output; on web pages, people often use tags to indicate subscripts and so on.

Overlines and underlines are also lost.

Some signs in the Unicode table should not be translated, or are translated to LaTeX commands that are not commonly used:

  • \varnothing instead of the more common \emptyset
  • { } instead of \lbrace and \rbrace, since source code is also replaced