Python and Unicode

Nathan Osman
published Aug. 20, 2014, 9:36 p.m.

As one of the programmers who maintains the source code for the 2buntu website, I am pleased to announce an upcoming milestone - Python 3 support. We're not quite there yet, but we're very close. You can keep track of our progress as we near the finish line on this page.

I thought it might be worth taking a few minutes to go over some of the challenges we faced during the conversion and share some tips. Virtually all of the issues we faced dealt with Unicode in one form or another, so I'll focus on that aspect.

Bytes vs. Characters

Before exploring the different ways Python handles Unicode, there is an important concept that must be explained. One must understand the difference between a sequence of bytes and a sequence of characters. If you open a text file in a text editor, you are looking at characters. The letter "a" is a single character. The Greek letter omega "ω" is a single character.

One byte consists of 8 bits, and can therefore store 256 unique values (2^8 = 256). This is enough to store the characters of the Latin alphabet. But as soon as you start adding other alphabets, it immediately becomes clear that one byte is not sufficient. Even two bytes are not sufficient.

The solution to this problem is obvious - in order to store such a vast array of characters, an encoding must be used that maps characters to multi-byte values.


Although many different character encodings exist, Python 3 uses UTF-8 by default, and it is arguably the most common encoding in use today. I won't go into too many technical details here, but UTF-8 can be summed up as follows:

  • Latin characters are stored as a single byte, just like ASCII
  • if a byte has a value > 127, it is part of a multi-byte sequence
  • characters may require anywhere from 1 to 4 bytes for storage
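The variable width is easy to see from the interpreter. Here is a quick sketch (Python 3) that encodes one sample character from each width class and prints its byte count (it uses encode(), which is covered in more detail later in this article):

```python
# Each character below needs a different number of bytes in UTF-8:
# 'a' (ASCII), 'ω' (Greek), '€' (euro sign), '😀' (emoji).
for ch in 'aω€😀':
    print(ch, len(ch.encode('utf-8')))
# a 1
# ω 2
# € 3
# 😀 4
```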

If you only take away one thing from this article, it's this:

Characters do not directly correspond with bytes.

That is to say, you cannot assume that a 1846 byte file contains 1846 characters. If the characters are entirely ASCII, then the file does indeed contain 1846 characters. But as soon as you begin storing non-ASCII characters, the file size will increase.
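You can see this effect on disk directly. The sketch below (assuming Python 3; the temporary file is purely for illustration) writes a five-character string and then checks the file size:

```python
import os
import tempfile

# 'naïve' is five characters, but 'ï' needs two bytes in UTF-8,
# so the file on disk is six bytes, not five.
text = 'naïve'
with tempfile.NamedTemporaryFile('w', encoding='utf-8',
                                 suffix='.txt', delete=False) as f:
    f.write(text)
    path = f.name
print(len(text), os.path.getsize(path))  # 5 6
os.remove(path)
```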

Python 2

Python 2 makes a clear distinction between these two types:

  • a str is a sequence of bytes
  • a unicode is a sequence of characters

Yes, I do realize that str is probably not a suitable name for a sequence of bytes. But the distinction is very important. Let's open the Python interpreter and do some exploration:

>>> type('Hello, world!')
<type 'str'>

A string literal (the text between quotes) is a sequence of bytes (confusingly named str) by default. What happens if we try to stick some Unicode in there?

>>> type('Hello, ω!')
<type 'str'>

Wait, what? How can this be? Well, let's see how long this so-called string is:

>>> len('Hello, ω!')
10

If you count the string carefully, you will only see nine characters. However, the string consists of ten bytes since omega "ω" requires two bytes of storage in the UTF-8 encoding. It's important to remember that this is not a sequence of characters but a sequence of bytes.

Let's try finding the length of a unicode string with the same text:

>>> len(u'Hello, ω!')
9

(Prefixing a string literal with "u" lets the interpreter know that you want to instantiate a Unicode string.) As we would expect, the output is 9 this time around since len() is measuring the number of characters instead of the number of bytes.

Python 3

With Python 3, things are a little bit backwards. A string literal is now Unicode by default and you must use the "b" prefix to indicate a raw sequence of bytes.

>>> type('Hello, ω!')
<class 'str'>

Remember, str in Python 3 is a Unicode string. Let's confirm this by checking the length of the string:

>>> len('Hello, ω!')
9

Whew! All is well in the universe. Let's do one last check with the "b" prefix to see if the length is 10:

>>> len(b'Hello, ω!')
  File "<stdin>", line 1
SyntaxError: bytes can only contain ASCII literal characters.

Huh. What's going on? Python 3 will not let you use non-ASCII characters in a bytes literal. You would need to specify the escape sequences (\xcf\x89 in this case) by hand.

>>> len(b'Hello, \xcf\x89!')
10

Doing Conversions

In Python 3, you can use the encode() and decode() methods to convert between the two types. Here's another important concept to remember:

To perform a conversion between bytes and characters, you need an encoding.

Remember that UTF-8 is one of many possible encodings (although it is arguably the most common) and a sequence of bytes has no meaning apart from an encoding. Here's the proper way to do it:

>>> 'Hello, ω!'.encode(encoding='utf8')
b'Hello, \xcf\x89!'

This gives us a sequence of bytes that represent the original string in UTF-8. The encoding parameter is optional since "utf8" is the default. However, it is always better to be explicit.

To decode a sequence of bytes back into a sequence of characters, we do the opposite:

>>> b'Hello, \xcf\x89!'.decode(encoding='utf8')
'Hello, ω!'
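To see why the encoding matters so much, try decoding the same bytes with a different encoding (Latin-1 is used here purely as an example). The bytes are unchanged, but the characters you get back are not:

```python
data = b'Hello, \xcf\x89!'

# UTF-8 treats \xcf\x89 as one two-byte character (ω)...
print(data.decode('utf-8'))
# ...while Latin-1 maps every byte to its own character, so the
# same ten bytes decode to ten (different) characters.
print(repr(data.decode('latin-1')))
```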


When opening a file in Python 3, you have the option of opening the file either in text mode or in binary mode. As you might have guessed, text mode will cause read() to return Unicode and binary mode will return a series of bytes:

>>> f = open('myfile.txt', 'r')
>>> type(f.read())
<class 'str'>

We add "b" to the mode string to open the file in binary mode:

>>> f = open('myfile.txt', 'rb')
>>> type(f.read())
<class 'bytes'>
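One caveat: in text mode, Python 3 uses the platform's default encoding unless you say otherwise, so it is safest to pass the encoding explicitly. A small round-trip sketch (reusing the myfile.txt name from the examples above):

```python
import os

text = 'Hello, ω!'

# Write and read in text mode with an explicit encoding...
with open('myfile.txt', 'w', encoding='utf-8') as f:
    f.write(text)
with open('myfile.txt', 'r', encoding='utf-8') as f:
    assert f.read() == text
# ...then read the same file in binary mode to see the raw bytes.
with open('myfile.txt', 'rb') as f:
    assert f.read() == text.encode('utf-8')
os.remove('myfile.txt')
```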


As you can see, the distinction between bytes and characters is critical when dealing with strings. Python 2 and Python 3 each provide two distinct types, one for sequences of bytes and one for sequences of characters, but the types have different names in the two versions, and string literals are handled differently as well.

I realize this is a very complex subject, so feel free to ask a question in the comments below and I'll get back to you.