Preventing encoding-related issues
Encoding-related problems are among the most frequent issues people run into when they start with a fresh PostgreSQL setup. In PostgreSQL, every database in your instance has a specific encoding. One database might use en_US.UTF-8, while some other database might have been created as de_AT.UTF-8 (which denotes German as it is used in Austria).
To figure out which encodings your database system is using, run psql -l from your Unix shell. This prints a list of all databases in the instance, including their encodings.
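If you prefer to check from inside the database, the same information can be read from the pg_database system catalog. The following is a minimal sketch using the standard pg_encoding_to_char function; only the output formatting will differ between versions:

test=# SELECT datname,
              pg_encoding_to_char(encoding) AS encoding,
              datcollate,
              datctype
       FROM   pg_database;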
So where can we actually expect trouble? Once a database has been created, many people want to load data into the system. Let's assume that you are loading data into a UTF-8 database. However, the data you are loading contains characters such as ä, ö, and so on, encoded in a single-byte character set. In an old extended-ASCII DOS code page, for example, ö is stored as the single byte 148, and byte 148 on its own is not a valid UTF-8 sequence; in UTF-8, ö has to be encoded as the code point U+00F6 (two bytes). Boom! Your import will fail and PostgreSQL will error out.
If you are planning to load data into a new database, ensure that the encoding or character set of the data is the same as that of the database. Otherwise, you may face ugly surprises.
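If re-exporting the data in the right encoding is not practical, PostgreSQL can often convert it on the fly: by telling the server which encoding the client data actually uses, the server transcodes it to the database encoding during the load. The following is a minimal sketch assuming the input file is in LATIN1 (the table name and file path are just placeholders; adjust the encoding name to whatever single-byte character set your data really uses):

test=# SET client_encoding TO 'LATIN1';
SET
test=# -- placeholder table and file; the COPY now converts LATIN1 to UTF-8
test=# COPY t_customer FROM '/tmp/customers.txt';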
To create a database using the correct locale, check out the syntax of CREATE DATABASE:
test=# \h CREATE DATABASE
Command:     CREATE DATABASE
Description: create a new database
Syntax:
CREATE DATABASE name
    [ [ WITH ] [ OWNER [=] user_name ]
           [ TEMPLATE [=] template ]
           [ ENCODING [=] encoding ]
           [ LC_COLLATE [=] lc_collate ]
           [ LC_CTYPE [=] lc_ctype ]
           [ TABLESPACE [=] tablespace_name ]
           [ CONNECTION LIMIT [=] connlimit ] ]
The ENCODING and LC* settings are used here to define the proper encoding and locale for your new database.
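Here is a minimal sketch of such a CREATE DATABASE call (the database name and the de_AT.UTF-8 locale are just examples). Note that when the locale differs from the one used by template1, PostgreSQL requires you to clone from template0 instead:

test=# CREATE DATABASE austria
           WITH ENCODING 'UTF8'
                LC_COLLATE 'de_AT.UTF-8'
                LC_CTYPE 'de_AT.UTF-8'
                TEMPLATE template0;
CREATE DATABASE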