Python 3000, Files and Text Encodings

Guessing the Encoding of Text Files

After a brief meander about the impendingness of Python 3000, this article explores ways to guess the encoding used in arbitrary text files.


As I'm sure you're aware, Guido has often pronounced that there will never be a Python 2.10 [1]. We're currently on Python 2.4 and momentum is building for Python 2.5. This leaves a finite, and ever dwindling, amount of time before Python 3.0 must be upon us.

In the past this has seemed such a distant possibility that it is jokingly referred to as Python 3000, or Py 3k for short. The 3000 being the year it was expected to actually arrive...

Py 3k will be an especially significant version of Python because Guido will allow it to break backwards compatibility. At last various warts and inconsistencies in the Python language will be removed. The downside is that a lot (probably most) legacy code will break, possibly never to traverse the great divide that this little version increment signifies.

A few of the major changes on the cards include:

  • Old style classes will vanish forever. All classes will then be new style classes.
  • Integer division will return floating point results. [2]
  • Print will become a function. [3]
  • All strings will be Unicode objects. Binary data will be stored using a new bytes builtin type.

Interestingly, after much argumentification, lambda will stay.

Frighteningly enough, this is no longer hypothesis and wild conjecture. Guido has recently jumped into the loving arms of Google, who have freed him up to work 50% of his time on Python. This is great news. In the last couple of weeks (on Python-dev) there has been a veritable clamour of debate on the semantics and even implementation details of Python 3000. The new bytes builtin [4] may even scramble its way into Python 2.5, which is due this year.

After that long rambly introduction, it is the last change on my list that the rest of this entry is about: strings and Unicode.

Arguably, the default behaviour of Python for reading files is broken. The following basic snippet works perfectly for reading binary data or text on a Linux type platform:

data = open('filename.dat').read()
# ...code that modifies data...
open('filename.dat', 'w').write(data)

This is such a mind-numbingly obvious idiom that it scarcely warrants mention; aside from the minor fact that if you move this code to Windoze (and you are reading binary data) it breaks.

When reading the data, any occurrences of \r\n are silently converted to \n. On writing, \n becomes \r\n. For text this is what you want (and very useful), but it silently corrupts binary data.

This of course is a minor wart, and although it bites most newbies at some point, it's easily solved by using binary mode.

data = open('filename.dat', 'rb').read()
# ...code that modifies data...
open('filename.dat', 'wb').write(data)
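
As a quick sanity check, here is a self-contained sketch (the filename is arbitrary, and it uses the b'' bytes literal of later Pythons) showing that binary mode round-trips awkward bytes like an embedded \r\n on any platform:

```python
import os
import tempfile

# bytes that text mode on Windows would mangle: an embedded \r\n pair
data = b'\x00\r\n\x01'

handle, filename = tempfile.mkstemp()
os.close(handle)
try:
    f = open(filename, 'wb')
    f.write(data)
    f.close()
    f = open(filename, 'rb')
    assert f.read() == data   # byte-for-byte identical
    f.close()
finally:
    os.remove(filename)
```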

Nonetheless, the default mode is not binary mode and coders for the Linux platform are that bit less likely to write platform portable code. (Darn finger typing for the sake of a platform that's already broken, grumble, grumble...)

What's more serious is that probably most programmers avoid Unicode like the plague.

It's much easier for us Western European and other English speaking programmers to just assume that all text is encoded in Latin-1 (or some other ASCII compatible encoding) and only ever use the str object. This is of course a fundamentally broken approach.

Python actually makes using Unicode fairly straightforward once you understand the basic issues. For a great introduction, may I humbly recommend A Crash Course in Character Encoding. (There is an error in the code there by the way: it uses codecs.BOM_UTF7, which of course doesn't exist. Unfortunately PyZine aren't updating any more, so it can't be fixed.)

If you don't explicitly know the encoding of text represented as a byte-string, then effectively you have random binary data rather than text. Unfortunately the Latin-1 assumption 'just works' most of the time, but causes horrible problems when it doesn't.
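
To make "random binary data" concrete, here is a small sketch (using UTF-8 and Latin-1 as a hypothetical mismatched pair): the same characters produce different byte strings under different encodings, and decoding with the wrong one silently produces the wrong text:

```python
text = u'caf\xe9'                      # the four characters 'cafe' with an e-acute

utf8_bytes = text.encode('utf-8')      # the \xe9 becomes two bytes: \xc3\xa9
wrong = utf8_bytes.decode('latin-1')   # decodes without any error...

assert wrong == u'caf\xc3\xa9'         # ...but gives the wrong text: classic mojibake
```

No exception is raised anywhere, which is exactly why this class of bug goes unnoticed for so long.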

Python 3000 is going to simplify this for the majority of cases. All string instances will be unicode. Binary data (as mentioned before) will use the bytes type.

This means that when you open a file in text mode, a new I/O layer will determine the encoding and return a unicode object [5].

When you open a file in binary mode, you will get back a bytes instance.

So how does the I/O layer determine the encoding of a piece of text? Guido says that it will do it in the same way that programs like Notepad do. In other words, it will check for the existence of a UTF8/UTF16 BOM, and if this isn't present it will assume that the text is encoded in the default way for the user's locale.

If you know the encoding you will be able to specify it when you open a text file.
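
Whatever form the new builtins end up taking, the codecs module already lets you be explicit today. A hedged sketch (the filename is just an example):

```python
import codecs
import os

# write a small UTF-8 file, then read it back with the encoding
# stated explicitly - much as the proposed Py3k text mode would
out = codecs.open('example.txt', 'w', encoding='utf-8')
out.write(u'caf\xe9\n')
out.close()

f = codecs.open('example.txt', 'r', encoding='utf-8')
text = f.read()      # already decoded to a unicode object
f.close()
os.remove('example.txt')

assert text == u'caf\xe9\n'
```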

This means you must either know the encoding of the text, or hope that Python gets the right encoding for you. Obviously knowing the encoding is preferable, but what do you do if you don't know?

This is the problem I face in rest2web. It reads in source text files, and outputs target HTML in a consistent encoding. My source files come from a variety of places, and unless I created the text myself I rarely know the encoding used.

There are various heuristics you can use to make an educated guess at the encoding in use.

The basic technique is to check for the existence of a UTF8 or UTF16 BOM. If this is absent, you can use the locale module to determine the standard encoding(s) for the user's locale. You can then try to decode the text to Unicode, using these encodings (along with a couple of common ones) one by one. You can then assume that the first decode which works is the right one.

Here is an example function which does this (taken from rest2web, but borrowing heavily from code in the docutils project):

import codecs
import locale

encoding_matrix = {
    codecs.BOM_UTF8: 'utf_8',
    codecs.BOM_UTF16: 'utf_16',
    codecs.BOM_UTF16BE: 'utf_16_be',
    codecs.BOM_UTF16LE: 'utf_16_le',
}

def guess_encoding(data):
    """
    Given a byte string, guess the encoding.

    First it tries for a UTF8/UTF16 BOM.

    Next it tries the standard 'UTF8', 'ISO-8859-1', and 'cp1252' encodings,
    plus several gathered from locale information.

    The calling program *must* first call
        locale.setlocale(locale.LC_ALL, '')

    If successful it returns
        (decoded_unicode, successful_encoding)
    If unsuccessful it raises a ``UnicodeError``.
    """
    # check for a byte order mark first
    for bom, enc in encoding_matrix.items():
        if data.startswith(bom):
            return data.decode(enc), enc
    # no BOM, so build a list of likely encodings to try in turn
    encodings = ['ascii', 'UTF-8']
    successful_encoding = None
    try:
        encodings.append(locale.nl_langinfo(locale.CODESET))
    except AttributeError:
        pass
    try:
        encodings.append(locale.getlocale()[1])
    except (AttributeError, IndexError):
        pass
    try:
        encodings.append(locale.getdefaultlocale()[1])
    except (AttributeError, IndexError):
        pass
    # latin-1 goes last - it will decode *any* byte string
    encodings.append('latin-1')
    for enc in encodings:
        # some of the locale calls may have returned None
        if not enc:
            continue
        try:
            decoded = unicode(data, enc)
            successful_encoding = enc
            break
        except (UnicodeError, LookupError):
            pass
    if successful_encoding is None:
        raise UnicodeError('Unable to decode input data. Tried the'
            ' following encodings: %s.' % ', '.join([repr(enc)
                for enc in encodings if enc]))
    if successful_encoding == 'ascii':
        # ascii is a strict subset of our default ISO8859-1,
        # so report the superset
        successful_encoding = 'ISO8859-1'
    return decoded, successful_encoding

A couple of important things to note. In order to be effective, the code calling this function must first have imported locale and called locale.setlocale(locale.LC_ALL, ''). Because of the way this affects the underlying C access to locale information, it should only ever be done once by an application. This is why it is left to the code calling this function.
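
(As an aside, you can see what the locale machinery reports on your own system with a couple of lines; the exact output will vary with platform and settings:

```python
import locale

# an application should make this call exactly once, at startup
locale.setlocale(locale.LC_ALL, '')

# the encoding Python will assume for text in the current locale
print(locale.getpreferredencoding(False))    # e.g. 'UTF-8' or 'cp1252'
```

On my machine this is a Latin-1 variant, but on a modern Linux desktop it will usually be UTF-8.)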

Secondly, on many platforms this code will attempt to decode with several 8-bit encodings (ascii, UTF8, Latin-1, macroman and cp1252 are all common). Unfortunately, text encoded with one of these may well decode successfully with one of the others. Any bytes representing non-ascii characters are likely to still be valid in another encoding, but refer to the wrong character.
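
For example (a sketch using the b'' bytes literal of later Pythons), the Windows cp1252 "smart quote" bytes are also perfectly valid Latin-1. Both decodes succeed; they just disagree about which characters the bytes mean:

```python
raw = b'\x93hello\x94'             # 'hello' in cp1252 curly quotes

as_cp1252 = raw.decode('cp1252')
as_latin1 = raw.decode('latin-1')  # also succeeds - latin-1 never fails

assert as_cp1252 == u'\u201chello\u201d'   # the intended curly quotes
assert as_latin1 == u'\x93hello\x94'       # C1 control characters instead
```

Since latin-1 accepts every possible byte, a try-each-encoding loop can never use it to rule anything out, which is why it sits at the end of the list above.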

This means that the above function may return the wrong eight-bit encoding. sigh

There is a further trick you can use to determine which 8-bit encoding is in use. This recipe from the Python Cookbook knows which bytes commonly appear in several 8-bit encodings and will 'guess' which one is in use.


Python 3000 won't use any of these tricks: the I/O layer will just use the default encoding that the user has set. More than likely the user didn't set it explicitly, but simply uses the operating system default for the language they chose.

In summary, if you don't know the encoding of a text file it is a lucky dip as to whether you can decode it correctly. This won't improve dramatically in Python 3000, but at least Python will be no worse at it than any other program you use.

[1] This is because he hates the ambiguity of double digit version numbers. Rightly so IMHO. A recent thread on Python-dev did suggest hexadecimal or roman numerals as a way round this.
[2] Presumably, only if necessary.
[3] This fact alone could consign millions of lines of unmaintained code to the rubbish heaps of history.
[4] But not the all-strings-to-Unicode change, although that was suggested as an optional switch.
[5] Instead of using open to open files, you may have the choice of two new builtin functions opentext or openbinary. This is still a matter of debate on Python-dev.


Last edited Tue Aug 2 00:51:34 2011.