#ifndef DOXYGEN_SKIP
/* $Id: RFC5_UNICODE.dox 10035 2006-09-25 14:15:40Z fwarmerdam $ */
#endif /* DOXYGEN_SKIP */

/*!
\page rfc5_unicode RFC 5: Unicode support in GDAL

Author: Andrey Kiselev
Contact: dron@ak4719.spb.edu
Status: Development

\section rfc5_summary Summary

This document contains a proposal on how to make the GDAL core locale-independent while preserving support for native character sets.

\section rfc5_main Main concepts

GDAL should be modified to support the following three main ideas:

  1. Users work in a localized environment using their native languages. That means we cannot assume the ASCII character set when working with string data passed to GDAL.
  2. GDAL uses UTF-8 encoding internally when working with strings.
  3. GDAL uses the Unicode versions of third-party APIs whenever possible.

So all strings used in GDAL are in UTF-8, not in plain ASCII. That means we should convert the user's input from the local encoding to UTF-8 during interactive sessions. The opposite should be done for GDAL output. For example, when a user passes a filename as a command-line parameter to the GDAL utilities, that filename should be converted to UTF-8 immediately and only afterwards passed to functions like GDALOpen() or OGROpen(). All functions which take character strings as parameters assume UTF-8 (with the exception of a few which do the conversion between different encodings, see \ref rfc5_implementation). The same is valid for output functions. Output functions embedded in GDAL (CPLError/CPLDebug) should convert all strings from UTF-8 to the local encoding just before printing them. Custom error handlers should be aware of the UTF-8 issue and do the proper transformation of the strings passed to them.

The string encoding pops up again when GDAL needs to call a third-party API. UTF-8 should be converted to the encoding suitable for that API. In particular, that means we should convert UTF-8 to UTF-16 before calling the CreateFile() function in the Windows implementation of VSIFOpenL(). Another example is the PostgreSQL API. When a PostgreSQL database stores strings in UTF-8, we should notify the server that the passed string is already in UTF-8, so that it is stored as is, without any conversion or loss.

For file format drivers the string representation should be worked out on a per-driver basis. Not all file formats support non-ASCII characters. For example, various .HDR labeled rasters are just 7-bit ASCII text files, and it is not a good idea to write 8-bit strings to such files. When we need to pass strings extracted from such a file outside the driver (e.g., in a SetMetadata() call), we should convert them to UTF-8. If the extracted strings are used only internally in the driver, there is no need for any conversion.
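
The driver-side conversion described above can be sketched as follows. This is only a minimal illustration, assuming the 8-bit strings read from the file are encoded in ISO-8859-1 (Latin-1); the function name Latin1ToUtf8 is hypothetical and is not part of the GDAL API. Other file encodings would need a real conversion table or a conversion library.

```cpp
#include <string>

// Hypothetical helper, not part of GDAL: convert an ISO-8859-1 (Latin-1)
// byte string to UTF-8. Latin-1 bytes map directly to the code points
// U+0000..U+00FF, so no lookup table is needed for this one encoding.
std::string Latin1ToUtf8(const std::string &osLatin1)
{
    std::string osUtf8;
    osUtf8.reserve(osLatin1.size() * 2);  // worst case: two bytes per input byte

    for (unsigned char ch : osLatin1)
    {
        if (ch < 0x80)
        {
            // 7-bit ASCII is encoded identically in UTF-8.
            osUtf8 += static_cast<char>(ch);
        }
        else
        {
            // U+0080..U+00FF become two bytes: 110000xx 10xxxxxx.
            osUtf8 += static_cast<char>(0xC0 | (ch >> 6));
            osUtf8 += static_cast<char>(0x80 | (ch & 0x3F));
        }
    }
    return osUtf8;
}
```

A driver following this RFC would apply such a conversion only to strings that leave the driver, for example just before handing a value to SetMetadata(). Pure ASCII input passes through unchanged, which matches the note that users of the ASCII subset are not affected.
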
In some cases the file encoding can differ from the local system encoding, and we have no way to know the file encoding other than to ask the user (for example, imagine a case where someone added an 8-bit non-ASCII string field to the plain text .HDR file mentioned above). That means we cannot convert from the local encoding to UTF-8; we must convert from the file encoding to UTF-8. So we need a way to get the file encoding on a per-datasource basis. The natural solution is to introduce an optional open parameter "ENCODING" to the GDALOpen/OGROpen functions. Unfortunately, those functions do not accept options. That should be introduced in another RFC. Fortunately, there is no need to add the encoding parameter immediately, because it is independent of the general i18n process. We can add UTF-8 support as defined in this RFC and add support for forcing a per-datasource encoding later, when open options are introduced.

\section rfc5_implementation Implementation

\section rfc5_backward Backward Compatibility

GDAL/OGR backward compatibility will be broken by this new functionality in the way 8-bit characters are handled. Previously, users could rely on all 8-bit character strings being passed through GDAL/OGR without change, containing exactly the same data all the way. Now that is true only for 7-bit ASCII and 8-bit UTF-8 encoded strings. Note that if you used only the ASCII subset with GDAL, you are not affected by these changes.

\section rfc5_references References

*/