HACKING for Recode

Build dependencies

m4, GNU Make, libtool, Autoconf, automake, gettext, help2man (built with gettext), Perl, Flex, Python, Cython, tar and wget.

Building from git

git clone https://github.com/rrthomas/recode.git
cd recode
./bootstrap
./configure
make check
make install

Making a release

To make a release, you’ll need woger and github-release, suitably configured. After testing and pushing all the changes, update the version number in configure.ac, write the NEWS entry, and push those too.

Then execute: make release

The future

Motivation

Recode is due for a major overhaul. I want to add a run-time dependency between Recode and Python, with the admitted goal of shifting the internals of Recode from C to Python.

To experiment with what Recode might become, and to try out new concepts more easily, I created a subsidiary, standalone Python project named Recodec, which reproduces a good part of Recode’s functionality. My goal is now to merge Recodec back into Recode, rather than letting the two slowly drift apart. Recode is going to be a mix of Python, C and Cython.

Overall plan

Recode 4 should be organised thus:

  • The main program is written in Python, and through a Cython interface, calls the existing C API for doing the real work.
  • The C API becomes able to use steps written in Cython internally, alongside the existing C steps, although no Cython steps exist yet.
  • New Cython steps wrap many standard Python codecs, with some trickery to force Python codecs over actual, older Recode steps.
  • Recode library initialization is moved from C to Python, and gets called through Cython from the C API.
  • Initialization is extended to cover the Recodec Python API, which uses different tables and descriptive data.
  • More steps from Recodec get moved into Recode, either coexisting with or taking over the previous wrapping of Python codecs.
  • The remaining code from the Recodec engine gets moved into Recode, replacing C code that has the same functionality.
  • Special care is given to GNU libc or libiconv support, maybe going from the C side to the Python side.
  • Proper documentation and decisions follow an extensive comparison and diagnosis of multiple implementations of the same charsets or surfaces.
  • Profiling makes it possible to fine-tune when and how Cython gets used over Python; standard Python codecs might even be cythonized in Recode.
  • Program and library initialization get revised to avoid disk accesses and the building of descriptive structures, whenever possible.
  • The main program directly links to the Python API rather than through the C API, while the C API becomes a separate facility.
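The idea of steps that wrap standard Python codecs can be sketched in plain Python. This is only an illustration under simplifying assumptions: a step is modelled as a bytes-to-bytes function, and the names `charset_step`, `surface_step` and `run` are hypothetical, not part of Recode or Recodec. Only the standard `codecs` module is used.

```python
import codecs
from typing import Callable, List

# A "step" here is just a bytes -> bytes function. This is a hypothetical
# simplification for illustration, not Recode's actual step structure.
Step = Callable[[bytes], bytes]


def charset_step(source: str, target: str) -> Step:
    """Build a step that recodes between two charsets via Python codecs."""
    decode = codecs.lookup(source).decode
    encode = codecs.lookup(target).encode

    def step(data: bytes) -> bytes:
        text, _ = decode(data)   # bytes -> str, strict error handling
        out, _ = encode(text)    # str -> bytes in the target charset
        return out

    return step


def surface_step(name: str) -> Step:
    """Build a step that applies a bytes-to-bytes codec such as base64."""
    def step(data: bytes) -> bytes:
        return codecs.encode(data, name)

    return step


def run(steps: List[Step], data: bytes) -> bytes:
    """Apply each step in order, feeding the output of one into the next."""
    for step in steps:
        data = step(data)
    return data


# Latin-1 text recoded to UTF-8, then given a Base64 surface.
pipeline = [charset_step("latin-1", "utf-8"), surface_step("base64")]
result = run(pipeline, b"caf\xe9")
```

A Cython version would presumably keep the same shape while exposing each step to the C API; that part is not shown here.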

Planned differences

Whenever the Python library offers a charset or a surface which Recode also has, the Python library codec is used. In some cases this introduces differences, which will have to be resolved one by one: by accepting that the Python library does better, by getting the Python team to improve some codecs, or by overriding these from Recodec.
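A minimal sketch of that preference rule, assuming a fallback table for charsets the Python library does not know. The `LOCAL_TABLES` dictionary, the `x-demo` charset and the function name are made up for this illustration and do not come from Recode or Recodec.

```python
import codecs

# Hypothetical stand-in for charsets that only Recode/Recodec knows about.
# Both the charset name and its mapping are invented for this example.
LOCAL_TABLES = {
    "x-demo": {0x41: "\u0391"},  # byte 0x41 -> GREEK CAPITAL LETTER ALPHA
}


def decode_with_preference(data: bytes, charset: str) -> str:
    """Decode with the Python library codec if one exists, else a local table."""
    try:
        info = codecs.lookup(charset)
    except LookupError:
        # Python does not know this charset: fall back to our own table.
        table = LOCAL_TABLES[charset]  # KeyError if nobody knows the charset
        return "".join(table[byte] for byte in data)
    text, _ = info.decode(data)
    return text


print(decode_with_preference(b"\xe9", "latin-1"))  # handled by Python
print(decode_with_preference(b"A", "x-demo"))      # handled by the local table
```

Overriding a Python codec from Recodec would amount to consulting the local table first for selected charsets, instead of only on lookup failure.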

Other differences may occur, especially in the Asian charset area, because libiconv, the GNU libc recoding facilities, and various contributors to the Python codecs project do not fully agree on how things should be done. Recodec is likely to offer configuration mechanisms to choose among the various possibilities, but is unlikely to attempt to rule on who is right and who is wrong! ☺
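Finding such disagreements is mostly a matter of comparing mappings entry by entry. The sketch below does this for single-byte charsets only, and it compares two distinct Python codecs as a stand-in, since comparing against libiconv or GNU libc would need bindings that are not assumed here. The function names are hypothetical.

```python
def decode_byte(value: int, charset: str):
    """Return the character a single byte decodes to, or None if undefined."""
    try:
        return bytes([value]).decode(charset)
    except UnicodeDecodeError:
        return None


def differing_bytes(charset_a: str, charset_b: str) -> list:
    """List the byte values that two single-byte charsets decode differently."""
    return [
        value
        for value in range(256)
        if decode_byte(value, charset_a) != decode_byte(value, charset_b)
    ]


# ISO 8859-15 replaces a handful of ISO 8859-1 positions (euro sign, etc.).
diff = differing_bytes("latin-1", "iso8859-15")
print([hex(value) for value in diff])
```

Multi-byte charsets, which is where the Asian disagreements live, would need the same comparison over byte sequences rather than single bytes.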

Issues of reversibility and canonicity, which were very present in Recode 3.X, are fading out. While some of these were moderately easy to implement, other cases stayed pending as fairly difficult to solve without a significant loss of efficiency. I think these issues are better abandoned than kept forever as half-hearted and not wholly dependable. Any user concerned about such things might try the reverse recoding to find out whether the original file is recoverable; some new option might automate a (costly) reversibility test.
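Such a reversibility test could look roughly like the following, using only the built-in Python codecs. The function name is hypothetical and no such option exists yet; the sketch simply recodes forward and back, then compares bytes.

```python
def is_reversible(data: bytes, source: str, target: str) -> bool:
    """Recode source -> target -> source and check the bytes come back intact."""
    try:
        forward = data.decode(source).encode(target)
        back = forward.decode(target).encode(source)
    except UnicodeError:
        # Some character could not be represented in one of the charsets.
        return False
    return back == data


print(is_reversible("caf\u00e9".encode("latin-1"), "latin-1", "utf-8"))
print(is_reversible("caf\u00e9".encode("utf-8"), "utf-8", "ascii"))
```

The cost mentioned above is visible here: the data is recoded twice and the whole input is held in memory for the comparison.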

One drawback of the whole move is that the Global Interpreter Lock in Python gets in the way of parallel execution of the code. This would have been more of a concern if the GNU libc recoding facilities relied on the Recode library, but as things stand now, I’m guessing that users will not be much impacted in practice.

Other pointers

Documentation

  • IETF references
  • Various references
    • Unicode charset mappings. The Unicode consortium makes available plenty of charset mappings for converting legacy charsets to Unicode.
    • Normalisation et internationalisation: Inventaire et prospectives des normes clefs pour le traitement informatique du français. (392p.) or this other copy. This is a report, written in French, discussing charset issues and many other topics as well. Laurent Bourbeau and François Pinard, 1995-10.
  • Recode specific
    • ETL presentation

      In 1999, the organisers of the m17n99 conference in Tsukuba, Japan, were kind enough to invite me. This has been for me a fabulous trip and experience, and I met many extraordinary people in there. At the conference, I presented the Translation Project, and Recode. The Recode presentation slides are available.

Programs

libiconv
This comprehensive charset converter library, by Bruno Haible, revolves around Unicode, and supports Asian encodings among many others. Even Recode uses it!
tcs
Here is the main recoding tool from the Plan 9 project.
yudit
This GUI editor, by Gaspar Sinai, 1999-01, handles many encodings, among which UTF-8. It also installs uniconv, a recoding program, and uniprint, a printing tool.
ucs-fonts
These 6x13 fonts, by Markus Kuhn, 1998-11, covering Unicode characters besides the Asian sets, merely replace the Linux fixed 6x13 font. They work nicely with yudit.
MtRecode
This charset converter is oriented towards SGML text manipulation. It may be freely downloaded for non-commercial, non-military use. Pointer given by Jean Véronis, 1996-06.
sp
This quite nice SGML structure analyser, by James Clark, contains internal C++ modules for handling many charsets.
b2c
This program, by Jörg Heitkötter, 1997-11, is able to generate interpreted character dumps, but properly embedded within complete C header files.
PyRecode
This wrapper, by Andreas Jung, provides Recode functionality to Python programs. Also see this link and this other link.