ODBC and Unicode in Linux

August 10, 2012 – 5:53 pm by akojima

Recently I’ve had to spend countless hours trying to debug a problem mixing various ODBC components. Since I couldn’t find any answers online, at least not in a way that was easy for a ODBC beginner like me to understand, I’ve decided to write it down here so it can maybe save some frustration to other people.

ODBC in Linux

ODBC Managers

An ODBC manager is a library that’s responsible for interfacing between a ODBC driver and the application. The app tells the ODBC manager to open a connection and run queries; the manager loads the correct driver and forwards the ODBC API calls to the driver and the driver talks to the server, which does whatever the app told it to.

There are two ODBC manager libraries in Linux. unixODBC and iODBC. Both seem to work well, but unixODBC is somehow more popular, while iODBC is shipped as part of Mac OS X. We picked iODBC.

ODBC for Python

If you want to use ODBC from Python, you’ll need a module that does the role of the ODBC manager. Most will be just wrappers over a ODBC manager written in C, meaning either unixODBC or iODBC. We picked pyodbc.

Compiling pyodbc against iODBC is not very straightforward, because it just assumes everyone uses unixODBC in Linux (although it works just fine with iODBC, since that’s what is assumed when it’s compiled for the Mac). I won’t go into details, but the 2 things that need to be done to compile pyodbc with iODBC are:

  • edit setup.py so that the settings['libraries'].append(‘odbc’) line for linux is changed to settings['libraries'].append(‘iodbc’), making pyodbc link to iodbc
  • set the CFLAGS and LDFLAGS environment variables to your iODBC install paths and run setup.py. Example:

CFLAGS=-I/usr/include/libiodbc LDFLAGS=-L/usr/lib python setup.py install

Connecting to MS SQL Server (FreeTDS)

FreeTDS is a free driver/library that talks the Sybase and the MS SQL Server protocols. Apparently you can use it directly or you can use it as a ODBC driver.

As with pyodbc, it is probably a good idea to compile it against the driver manager you’re using.

The big wchar_t mess

In ODBC, there are 2 string types: SQL_CHAR and SQL_WCHAR. The 1st is just a plain char* array representing a plain string string. The 2nd is a Unicode string.

In Windows, MS chose to hardcode the type of SQL_WCHAR to unsigned short (2 bytes per char). The driver assumes 2 bytes per char, the ODBC manager assumes 2 bytes per char and the app assumes 2 bytes per char, so everybody understands each other because they’re talking in 2 bytes per char.

In Linux, the default is to use wchar_t. wchar_t is 4 bytes.

By default, unixODBC uses unsigned short, like Windows. But you can also compile it with SQL_WCHART_CONVERT, which will define SQL_WCHAR as wchar_t instead.

iODBC uses wchar_t always.

You can see the mess brewing up, but there’s more.

FreeTDS uses 2 bytes per char, like Windows. No matter what the driver manager is used. So by now, you can see that every component of the ODBC stack can use either 2 bytes per char or 4 bytes per char depending on how you setup everything. It may not seem like a big deal, but if you’re a ODBC newbie and have no idea about how ODBC handles Unicode data and just assumed everything would just play together nicely and in the same way (because ODBC is supposed to be a standard after all), then you’re set up for the same countless hours of hairpulling and sleepless nights as me; trying to figure out why is the Unicode data from the server presented by Python as a series of characters with 0s sprinkled between each char or just fails with some MemoryError exception.

If you use FreeTDS with unixODBC the pair will work OK, because both use 2 bytes per char by default. But if you need to use pyodbc, then it won’t, because pyodbc assumes the ODBC manager talks wchar_t. That will cause Unicode resultset strings from the server to be displayed garbled, as if it had forgotten to decode from UCS-2. Or just crash, depending whether you’re running 32 or 64bits.

If you use FreeTDS with iODBC it will not work, because FreeTDS sends data as unsigned short and iODBC sends the data to Python as wchar_t. That will cause some error in Python and you’ll get some MemoryError exception, which says nothing at all about what is the prolem.

So by default, there’s no way that things will work out OK, except in some some few lucky combinations.

BUT it seems the FreeTDS developers realized that just assuming 2 bytes per char for SQL_WCHAR is not something that’s good enough for everyone and added “experimental” support for properly using wchar_t. Unfortunately that is not the default and there’s no auto-detection either. So if you’re having to use FreeTDS with iODBC (or unixODBC configured for SQL_WCHART_CONVERT), you’ll need to use the magic –enable-odbc-wide configure option when compiling it. Also, that only actually works in version 0.92, 0.91 apparently has the option too, but it doesn’t work correctly. Then everything will finally talk in wchar_t from end to end and you get to sleep without worrying about the deadline in the next day.

 

 

Alfredo Kojima

  1. 2 Responses to “ODBC and Unicode in Linux”

  2. Small correction: Recent versions of pyodbc examines the size of SQL_WCHAR and convert data dynamically, not just assume wchar_t.

    Apparently it isn’t working correctly, but I’ll look into it.

    By mkleehammer on Sep 27, 2012

  3. Thanks for the Post. I found after struggling a hole day and reading your post that pyodbc-IODBC will work with Vertica, in contrast to pyodbc-unixODBC that will send out garbled hex strings.

    Greetings from Chile.

    By Pablo on Mar 18, 2014

Post a Comment


+ 2 = eight