Introducing UTF-8 support in SQL Server 2019 preview

December 18, 2018

With the first public preview of SQL Server 2019, we announced support for the widely used UTF-8 character encoding as an import or export encoding, and as a database-level or column-level collation for string data. This is an asset for companies extending their businesses to a global scale, where providing global multilingual database applications and services is critical to meeting customer demands and specific market regulations. The benefits of UTF-8 support extend to scenarios where legacy applications require internationalization and use inline queries: converting such an application and its underlying database to UTF-16 can be costly, because the changes and testing involved may require complex string-processing logic that affects application performance.

To limit the number of changes required for the above scenarios, UTF-8 is enabled in the existing data types CHAR and VARCHAR. String data is automatically encoded to UTF-8 when you create an object with, or change an object’s collation to, a collation with the “UTF8” suffix, for example from LATIN1_GENERAL_100_CI_AS_SC to LATIN1_GENERAL_100_CI_AS_SC_UTF8. Refer to Set or Change the Database Collation and Set or Change the Column Collation for more details on how to perform those changes. NCHAR and NVARCHAR remain unchanged and only allow UTF-16 encoding.

UTF-8 is available only to Windows collations that support supplementary characters, as introduced in SQL Server 2012. You can see all available UTF-8 collations by executing the following command in your SQL Server 2019 CTP:

SELECT Name, Description FROM fn_helpcollations()
WHERE Name LIKE '%UTF8';
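
The supplementary characters mentioned above are code points beyond U+FFFF, which UTF-16 represents as a surrogate pair and UTF-8 represents as four bytes. As a rough illustration outside the database, the Python sketch below compares the encoded size of one character from the Basic Multilingual Plane with one supplementary character. The helper name encoded_sizes and the two sample characters are assumptions chosen for this example; the sketch measures encoded bytes only and does not reproduce how SQL Server stores or sorts the data.

```python
def encoded_sizes(ch: str) -> dict:
    """Return the code point and the UTF-8 / UTF-16 byte counts for one character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "utf8_bytes": len(ch.encode("utf-8")),
        # "utf-16-le" is used so that no byte order mark is added to the count.
        "utf16_bytes": len(ch.encode("utf-16-le")),
    }


# Hypothetical sample characters, written as escapes to avoid editor issues.
bmp_char = "\u00e9"                 # LATIN SMALL LETTER E WITH ACUTE (inside the BMP)
supplementary_char = "\U0001F600"   # GRINNING FACE (beyond U+FFFF)

for ch in (bmp_char, supplementary_char):
    print(encoded_sizes(ch))
```

In this sketch the supplementary character takes four bytes in both encodings, while the accented Latin letter takes two bytes in both.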

Additionally, if your dataset uses primarily Latin characters, significant storage savings may also be achieved compared to UTF-16 data types. For example, changing an existing column data type from NCHAR(10) to CHAR(10) using a UTF-8 enabled collation translates into a nearly 50 percent reduction in storage requirements. This is because NCHAR(10) requires 22 bytes for storage, whereas CHAR(10) requires 12 bytes for the same Unicode string.
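
To see why the savings depend on the data being primarily Latin, the Python sketch below compares the encoded payload of a few sample strings in UTF-8 and UTF-16. The function names and the sample strings are assumptions made for illustration. The sketch counts encoded bytes only and ignores any per-value overhead the storage engine adds, so its absolute figures can differ from the ones quoted above.

```python
def payload_bytes(text: str) -> tuple:
    """Return (utf8_bytes, utf16_bytes) for text, excluding storage-engine overhead."""
    # "utf-16-le" avoids counting a byte order mark.
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


def utf8_saving_percent(text: str) -> float:
    """Percentage saved by UTF-8 relative to UTF-16; negative means UTF-8 is larger."""
    utf8, utf16 = payload_bytes(text)
    return 100.0 * (utf16 - utf8) / utf16


# Hypothetical sample values: one ASCII-only string and one Japanese string.
samples = {
    "latin": "SQL Server",                               # 10 ASCII characters
    "japanese": "\u30c7\u30fc\u30bf\u30d9\u30fc\u30b9",  # 6 katakana characters
}

for label, text in samples.items():
    utf8, utf16 = payload_bytes(text)
    print(f"{label}: utf8={utf8} utf16={utf16} saving={utf8_saving_percent(text):.1f}%")
```

For the ASCII-only sample the UTF-8 payload is half the UTF-16 payload, while for the katakana sample UTF-8 is larger, because each of those characters needs three bytes in UTF-8 and two in UTF-16. It is worth measuring representative data before changing a column's type.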