Chapter 13—Unicode

Omnis Studio fully supports Unicode, which means you can expand the market for your Omnis applications by supporting the majority of world languages and the display of special characters, including scientific and mathematical symbols.

Earlier versions of Omnis Studio shipped with both a Unicode and a non-Unicode version of the development kit, but from Omnis Studio 5 onwards only the Unicode-compatible version is provided. The Unicode version of Omnis Studio is available for Windows, macOS, and Linux, and allows you to localize your applications and deploy them to virtually any market, anywhere in the world.

You should also refer to the Localization chapter for information about localizing and deploying your desktop applications for non-English speaking markets.

What is Unicode?

Unicode provides a mechanism for representing characters or symbols used in many of the languages in the world, as well as scientific and technical environments. The Unicode standard is maintained by the Unicode Consortium (www.unicode.org) who set the standards for Unicode and promote its worldwide use. They define Unicode as: “a character coding system designed to support the worldwide interchange, processing, and display of the written texts of the diverse languages and technical disciplines of the modern world.” In the context of client-server and Internet-based computing, Unicode allows the seamless exchange and processing of character data across different platforms, software products and programming environments.

The Unicode Consortium provides information and resources concerning Unicode, including the standard definition and maintenance, character code tables, a locale identifier repository, and lists of Unicode-enabled products. The last major version update of Unicode was version 9, which is capable of representing over 100,000 different characters, used in many different languages throughout the world. Many operating systems and software products have adopted Unicode, which is now universally accepted as the standard for character representation. For example, the latest versions of Windows and macOS, as well as all varieties of Linux, support Unicode. All web standards, such as the latest versions of HTML, XML, and JSON, support Unicode, as do the latest versions of Internet Explorer and all Mozilla-based browsers. In addition, SQL databases such as the most recent versions of Sybase, Oracle, and DB2 support Unicode.

Together with the display of multiple languages in Omnis, the use of Unicode encoding affects the sort order of dynamic data, for example, in list variables, as well as the querying and retrieval of data from Unicode compatible Server databases.

DAMs

The DAMs provided with Omnis Studio (from version 5.0 onwards) are able to function in Unicode or 8-bit compatibility mode. This means that after converting your existing libraries, it is possible to continue interacting with non-Unicode databases.

In 8-bit compatibility mode, all DAMs:

Return non-Unicode character data types via the $createnames() and $coltext attributes
Bind outgoing character variables using the database's non-Unicode data types
Convert all data inside outgoing character bind variables to single-byte characters
Define incoming character columns using the database's non-Unicode data types
Convert all data inside incoming character bind variables from bytes into characters

Switching to 8-bit compatibility mode

To switch to 8-bit compatibility mode, set the session property $unicode to kFalse; its default value is kTrue. Because the property is set per session, multiple Unicode and 8-bit session objects can exist side by side if required.

Character Mapping

This section is applicable to session objects operating in 8-bit compatibility mode only.

When reading data from a server database, Omnis expects the character set to be the same as that used in an Omnis data file. The Omnis character set is based on the macOS extended character set, but is standard ASCII up to character code 127. Beyond this value, the data could be in any number of different formats depending on the client software that was used to enter the data.

When assigned, the $maptable session property identifies files containing translation tables for 8-bit character codes read into and sent out of Omnis. For example, suppose you are working with a database that stores EBCDIC characters. In order to accommodate this database, you should create an '.IN' map file that translates EBCDIC characters to ASCII characters when Omnis is reading server data, and a matching '.OUT' file that reverses the process by converting ASCII to EBCDIC characters when Omnis is sending data to the server.

Under Windows and Linux, Omnis uses the same character set as under macOS, so in the general case, mixed platform Omnis applications should have no need for character mapping. However, if the data in a server table was created by another software package, running under Windows for example, the characters past ASCII code 127 would appear incorrect when read using Omnis. In this situation the $maptable property should be used to map the character set.

There are two kinds of character maps: IN and OUT files. IN files are used to translate characters coming from a server database into Omnis. OUT files are used to translate characters that travel from Omnis back to a server database.
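
To make the idea of paired IN and OUT maps concrete, here is a minimal Python sketch (not Omnis code, and not the map file format, which this chapter does not describe). It assumes that a map can be modelled as a 256-entry byte table, uses Python's "cp037" codec as one example of an EBCDIC code page, and uses "latin-1" as a stand-in for the client character set. The function names are hypothetical.

```python
def build_in_table(server_codec: str, client_codec: str) -> bytes:
    """Build a 256-entry table mapping server bytes to client bytes (the role of an .IN map)."""
    table = bytearray(range(256))
    for code in range(256):
        try:
            char = bytes([code]).decode(server_codec)
            mapped = char.encode(client_codec)
        except UnicodeError:
            continue  # leave codes that cannot be mapped unchanged
        if len(mapped) == 1:
            table[code] = mapped[0]
    return bytes(table)


def invert_table(table: bytes) -> bytes:
    """Build the reverse table (the role of an .OUT map)."""
    inverse = bytearray(range(256))
    for source, target in enumerate(table):
        inverse[target] = source
    return bytes(inverse)


# Assumption: cp037 represents the server's EBCDIC data, latin-1 the client side.
IN_TABLE = build_in_table("cp037", "latin-1")
OUT_TABLE = invert_table(IN_TABLE)

ebcdic_bytes = "Hello".encode("cp037")
client_bytes = ebcdic_bytes.translate(IN_TABLE)   # server -> client
round_trip = client_bytes.translate(OUT_TABLE)    # client -> server
print(client_bytes, round_trip == ebcdic_bytes)
```

Deriving the OUT table by inverting the IN table mirrors the requirement that the two maps reverse each other; a hand-edited pair that is not a true inverse would not round-trip in this way.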

The Character Map Editor

The Character map editor is accessed via the Add-On tools menu item and enables you to create character-mapping files. You can change a given character to another character by entering a numeric code for a new character. The column for the Server Character for both .IN and .OUT files may not actually represent what the character is on the server. This column is only provided as a guide. The Numeric value is the true representation in all cases.

To change a character, select a line in the list box and change the numeric code in the Server Code edit box. Once the change has been recorded, press the Update button to update the character map. You can increase/decrease the value in the Server Code edit box by pressing the button with the left and right arrows. Pressing the left arrow decreases the value, pressing the right arrow increases the value.

The File menu lets you create new character map files, save, save as, and so on. The Make Inverse Map option creates the inverse of the current map, that is, it creates an ".IN" file if the current file is an ".OUT" character map, and vice versa.

Using the Map Files

Establish the character mapping tables by setting the session property $maptable to the path of the two map files. Both files must have the same name but with the extensions .IN and .OUT and be located in the same folder. The $maptable property establishes both .IN and .OUT files at the same time. For example:

Do SessObj.$maptable.$assign('C:\Program Files\Omnis Software\Charmaps\pubs') Returns #F

In this example, the two map files are called "pubs.in" and "pubs.out".

The session property $charmap controls the mode of character mapping that is to be applied to the data. Set the character mapping mode using a command of the form:

Do SessObj.$charmap.$assign(pCharMap) Returns #F

The potential values for the character mapping mode parameter pCharMap are:

kSessionCharMapOmnis
Use the internal Omnis character set.
kSessionCharMapNative
This is the default and specifies that the client machine character set is to be used.
kSessionCharMapTable
Use the character mapping table specified in the $maptable property. If the $maptable property is not set, an attempt to assign kSessionCharMapTable fails.

If you wish to use the character mapping tables defined using the $maptable property, you must set $charmap to kSessionCharMapTable.

Interpreting 8-bit Data

This section is applicable to the MySQL, PostgreSQL and Openbase DAMs which interface with their respective client libraries using the UTF-8 encoding.

When operating in Unicode mode, it is possible to receive mixed 8-bit and Unicode data, since UTF-8 character codes 0x00 to 0x7F are identical to ASCII character codes.

However, where this data was created using the non-Unicode version of Omnis, it may contain extended ASCII characters. In this case, the Unicode DAM will encounter decoding errors, mistaking the extended characters for UTF-8 encoded bytes.

This issue was not a concern for the non-Unicode version of Omnis Studio since extended characters were always read and written as bytes, irrespective of the database encoding.

In order to avoid problems when upgrading to the Unicode version of Omnis Studio, it is advisable to convert tables containing ASCII extended characters to UTF-8. This process is simplified where the database character set is already set to UTF-8 (as is often the case with MySQL). All that is required is to read and update each row in the table and repeat this for all tables used by the application. In so doing, Omnis will convert the 8-bit data to Unicode and then write the converted Unicode data back to the database.

In order to facilitate this within the DAM, the session property $validateutf8 is provided. When set to kTrue (the default), any fetched character data is validated using the rules for UTF-8 encoding. Where a given text buffer fails validation, it is assumed to be non-Unicode data and is interpreted accordingly. When written back to the database, all character data will be converted to UTF-8. Such updates will result in frequently accessed records having their contents refreshed automatically.

By setting $validateutf8 to kFalse, validation is skipped and the DAM reverts to the previous behavior, in which case extended ASCII characters should be avoided.
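
The validate-then-fall-back behavior described for $validateutf8 can be illustrated with a short Python sketch. This is not the DAM's implementation: the function name is hypothetical, and the choice of "latin-1" as the fallback 8-bit character set is an assumption made for the example.

```python
def decode_fetched(raw: bytes, fallback: str = "latin-1") -> str:
    """Decode a fetched text buffer, treating it as 8-bit data if it is not valid UTF-8."""
    try:
        # Strict decoding doubles as validation against the UTF-8 encoding rules.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: data that fails validation is in the fallback 8-bit character set.
        return raw.decode(fallback)


utf8_row = "café".encode("utf-8")   # written by a Unicode client
legacy_row = b"caf\xe9"             # written as a single extended 8-bit character

print(decode_fetched(utf8_row), decode_fetched(legacy_row))
```

Writing the decoded text back as UTF-8 is what gradually converts legacy rows; with validation skipped, the second buffer would simply fail to decode.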

Aside from the issue of UTF-8 encoded data, the DAMs provided with Studio 5.0 are able to retrieve non-Unicode data from non-Unicode database columns in either Unicode or 8-bit compatibility mode. The DAM knows the text capabilities of each character data type and assigns encoding values to each result column accordingly.

The difference in 8-bit compatibility mode is that it is also possible to write data back to non-Unicode columns.

In Unicode mode, the DAM assumes that it will be writing to Unicode compatible data types and this will cause data insertion/encoding mismatch errors if the clientware tries to insert into non-Unicode database columns.

Character Mapping in Unicode Mode

Character mapping to and from the Omnis character set is also possible where session objects are operating in Unicode mode. This was previously removed from the Unicode DAMs since it provided compatibility between the various 8-bit character sets. Where Unicode DAMs encounter 8-bit data however, it is necessary to indicate the character set used by the data. For this reason the session $charmap property can be used to indicate that fetched 8-bit data uses either:

kSessionCharMapRoman
Use the Mac Roman character set to interpret the characters
kSessionCharMapLatin1
Use the Windows/Linux character set to interpret the characters
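
The reason the character set has to be indicated is that the same byte value stands for different characters in different 8-bit sets. The Python sketch below (not Omnis code) shows this with Python's "mac_roman" and "latin-1" codecs. Treating "latin-1" as equivalent to the Windows/Linux character set is an assumption, and the dictionary keys are hypothetical labels rather than Omnis constants.

```python
# Hypothetical labels mapped to Python codec names; the latin-1 choice is an assumption.
CHARMAP_CODECS = {"roman": "mac_roman", "latin1": "latin-1"}


def interpret(raw: bytes, charmap: str) -> str:
    """Interpret fetched 8-bit data using the indicated character set."""
    return raw.decode(CHARMAP_CODECS[charmap])


mac_bytes = b"caf\x8e"      # "café" as stored by a Mac Roman client
latin_bytes = b"caf\xe9"    # "café" as stored by a Latin-1 client

print(interpret(mac_bytes, "roman"), interpret(latin_bytes, "latin1"))
# Reading Latin-1 bytes as Mac Roman yields a different final character:
print(interpret(latin_bytes, "roman"))
```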

Fetching Data to a File

The $fetchtofile() method has the iEncoding parameter, as follows:

Do StatementObj.$fetchtofile(cFilename [,iRowCount=1] [,bAppend=kTrue] [,bColumnNames=kTrue] [,iEncoding=kUniTypeUTF8/kUniTypeLatin1])

where iEncoding is an optional parameter specifying the type of encoding to be used. It should be one of the Unicode type constants and defaults to kUniTypeUTF8. The corresponding Unicode Byte Order Marker (BOM) is written to the beginning of the file when the file is empty or when bAppend is set to kFalse.
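
The BOM rule above (write the marker only when the file is empty or is being overwritten) can be sketched in Python as follows. This is an illustration, not the behavior of $fetchtofile() itself: the function name, the tab-delimited row layout, and the lowercase encoding names are assumptions made for the example.

```python
import codecs
import os

# Only Unicode encodings have a BOM; latin-1 output gets none.
BOMS = {"utf-8": codecs.BOM_UTF8}


def write_rows(path, rows, append=True, encoding="utf-8"):
    """Write rows to a text file, emitting a BOM only at the very start of the file."""
    is_empty = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "ab" if append else "wb") as handle:
        if is_empty or not append:
            handle.write(BOMS.get(encoding, b""))
        for row in rows:
            # Assumption: one tab-delimited line per row.
            handle.write(("\t".join(row) + "\n").encode(encoding))
```

Calling write_rows() twice with append left at its default produces a single BOM followed by both batches of rows; a BOM in the middle of a file would be read back as a stray character.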

Server Specific Programming

Certain DAMs, namely DAMORA8 and DAMODBC, also provide session properties which allow mixing of Unicode and 8-bit data when the DAM is operating in Unicode mode.

Oracle DAM

This section summarizes recent changes made to the Unicode Oracle Object DAM designed to enable insertion and retrieval of mixed ANSI and Unicode character types.

In the case of Oracle 8i and later, these data types are:

Type	Description
CHAR	Fixed single-byte character data, limited to 2000 bytes.
NCHAR	Fixed multi-byte character data, limited to 2000 bytes. (1000 UCS-2 encoded characters)
VARCHAR2	Varying length, single-byte character data, limited to 4000 bytes.
NVARCHAR2	Varying length, multi-byte character data, limited to 4000 bytes. (2000 UCS-2 encoded characters)
CLOB	Character Large Object: single-byte character data.
NCLOB	National Character Large Object: multi-byte character data.
LONG	Varying length, single-byte character data, limited to 2GB. Supported for backward compatibility only.

By default, the Unicode Oracle DAM maps all Omnis character data to the NVARCHAR2 and NCLOB data types, dependent on the field length of the Omnis bind variable. However, the Oracle DAM provides session properties which affect the Omnis to Oracle data type mappings:

$nationaltonvarchar
If set to kTrue, Character and National data types are treated differently when being inserted to VARCHAR2 / NVARCHAR2 columns. The National character subtype will be used with Unicode data, whilst the Character subtype will be reserved for non-Unicode data.
$nationaltonclob
If set to kTrue, large Character and National data types are treated differently when being inserted to CLOB / NCLOB columns. The onus is upon the developer not to put Unicode characters into Character subtypes when using these properties; otherwise data insertion/encoding mismatch errors will occur.
$maxvarchar2
Sets the byte limit above which Omnis character fields will be mapped to CLOB/NCLOB data types as opposed to VARCHAR2 / NVARCHAR2 columns. The maximum value is 4000 bytes.
$longchartoclob
If set to kTrue (the default), Omnis large character fields > $maxvarchar2 in byte length will be mapped to the CLOB/NCLOB data type. If set to kFalse, the LONG data type is used.
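
One possible reading of how these four properties combine is sketched below in Python. It is an interpretation of the descriptions above, not the DAM's actual logic: the function and parameter names are hypothetical, the default of 4000 for the byte limit is an assumption (the text gives 4000 only as the maximum), and the precedence between $longchartoclob and $nationaltonclob is also assumed.

```python
def oracle_type(byte_length, is_national, *, national_to_nvarchar=False,
                national_to_nclob=False, max_varchar2=4000, long_char_to_clob=True):
    """Pick an Oracle data type for an Omnis character field (illustrative reading only)."""
    if byte_length <= max_varchar2:
        # Short fields: NVARCHAR2 by default, VARCHAR2 for Character subtypes on request.
        if national_to_nvarchar and not is_national:
            return "VARCHAR2"
        return "NVARCHAR2"
    if not long_char_to_clob:
        # Assumption: LONG is used for any oversized field when the property is kFalse.
        return "LONG"
    if national_to_nclob and not is_national:
        return "CLOB"
    return "NCLOB"


print(oracle_type(100, is_national=False))
print(oracle_type(100, is_national=False, national_to_nvarchar=True))
print(oracle_type(5000, is_national=False, national_to_nclob=True))
```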

Reading Unicode and Non-Unicode Data

The Oracle DAM automatically detects the data type of retrieved character columns and converts the data accordingly. There is no need to modify any properties in order to retrieve mixed ANSI and/or Unicode Data.

ODBC DAM

The ODBC DAM provides the $nationaltowchar session property.

By default, Omnis Character and National fields are mapped to the SQL_WCHAR, SQL_WVARCHAR and SQL_WLONGVARCHAR data types. By setting $nationaltowchar to kTrue only National fields will be mapped to these types (to the equivalent server data types) and Character fields will be mapped to SQL_CHAR, SQL_VARCHAR and SQL_LONGVARCHAR as determined by the Omnis field length. Character fields mapped in this way are subject to data loss/truncation where such fields contain Unicode characters. When setting this property, please note that Unicode data types usually have precision limits half that of their corresponding ANSI data types. For example, this is 8000 for the SQL Server VARCHAR() data type but 4000 for NVARCHAR(). $nationaltowchar affects both the text returned by the $createnames() method and the binding of input parameters.
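
The two risks mentioned above, data loss in ANSI columns and halved precision in Unicode columns, can be seen in a small Python sketch. It is not ODBC code: "cp1252" is an assumed example of an ANSI code page, a two-byte-per-character encoding is assumed for the Unicode column, and the function name is hypothetical.

```python
def to_ansi(text, codec="cp1252"):
    """Encode text for an 8-bit column and report whether any character was lost."""
    data = text.encode(codec, errors="replace")   # unencodable characters become "?"
    lossy = data.decode(codec) != text
    return data, lossy


print(to_ansi("cafe"))
print(to_ansi("\u1EC6"))   # E with circumflex and dot below has no cp1252 code

# 4000 characters stored as two bytes each occupy the same space as 8000 single bytes.
print(len(("a" * 4000).encode("utf-16-le")))
```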

Character Normalization

Originally, Unicode was a 16-bit character set. It has subsequently been extended to include code point values up to and including U+10FFFF. It is not expected that it will be extended any further. Windows and macOS still represent Unicode character strings using arrays of Short (16-bit) integers. This is not a problem, because the UTF-16 standard allows code points U+10000 and greater to be represented by pairs of 16-bit values (each member of the pair occupies space in the 16-bit range that is not used for code points). This representation is referred to as a surrogate pair.

Internally Omnis uses UTF-32 to represent code points, that is, each code point occupies 32 bits, and the value of each code point is between 0 and U+10FFFF inclusive. This allows for straightforward processing of character strings, since every code point occupies the same space in memory.
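
As a worked example of the surrogate pair arithmetic defined by UTF-16, and of the fixed width of UTF-32, the following Python sketch computes the pair for a code point above U+FFFF. The function name is hypothetical; the arithmetic is the standard UTF-16 encoding rule.

```python
def to_surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF need a surrogate pair")
    offset = code_point - 0x10000           # a 20-bit value
    high = 0xD800 + (offset >> 10)          # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)         # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(f"{high:04X} {low:04X}")

# In UTF-32 every code point occupies four bytes, whether or not UTF-16 needs a pair.
print(len("a\U0001F600".encode("utf-32-be")))
```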

Unicode allows a significant number of characters to be represented by more than one sequence of code points. For example, consider the letter E with circumflex and dot below, a character that occurs in Vietnamese (Ệ). This character has five possible representations in Unicode:

U+0045 Latin capital letter E, U+0302 combining circumflex accent, U+0323 combining dot below
U+0045 Latin capital letter E, U+0323 combining dot below, U+0302 combining circumflex accent
U+00CA Latin capital letter E with circumflex, U+0323 combining dot below
U+1EB8 Latin capital letter E with dot below, U+0302 combining circumflex accent
U+1EC6 Latin capital letter E with circumflex and dot below

A character represented by more than one code point is referred to as a composite character. A character represented by a single code point is referred to as a pre-composed character.

As far as the end-user is concerned each of these representations usually needs to be treated identically. This leads to some interesting consequences for Omnis. These are discussed in the following sections. Note the term end-user character means the character that the end-user is working with – in the example above, the end-user character is Ệ.

Normalization of a Unicode character string converts the string into a standard, defined format. Once normalized, a Unicode character string has only one possible representation, thereby making it possible to compare it with other character strings, and produce results useful to the end-user. The Unicode standard recommends two forms of normalization. These are:

Canonical decomposition, referred to as NFD:
Pre-composed characters are replaced by their equivalent composite characters; composite characters are replaced with a single fixed composite representation.
Canonical decomposition followed by canonical composition, referred to as NFC:
After carrying out NFD, all composite characters are replaced with their pre-composed equivalent, where one exists.

Omnis provides two functions to normalize character strings:

nfd(string) carries out canonical decomposition on the string and returns the normalized string.
nfc(string) carries out canonical decomposition followed by canonical composition on the string and returns the normalized string.

These functions are not available in client-side web client methods.
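
To see what the two normalization forms do to the Vietnamese example, the Python sketch below applies the standard library's unicodedata.normalize() to all five representations. It illustrates NFD and NFC as defined by the Unicode standard; it does not call the Omnis nfd() and nfc() functions, which are assumed here to follow the same standard forms.

```python
import unicodedata

# The five representations of E with circumflex and dot below.
FORMS = [
    "\u0045\u0302\u0323",   # E, circumflex, dot below
    "\u0045\u0323\u0302",   # E, dot below, circumflex
    "\u00CA\u0323",         # E with circumflex, dot below
    "\u1EB8\u0302",         # E with dot below, circumflex
    "\u1EC6",               # fully pre-composed
]

nfd_forms = {unicodedata.normalize("NFD", s) for s in FORMS}
nfc_forms = {unicodedata.normalize("NFC", s) for s in FORMS}

# Each set collapses to a single string: one fixed representation per form.
print([f"U+{ord(c):04X}" for c in next(iter(nfd_forms))])
print([f"U+{ord(c):04X}" for c in next(iter(nfc_forms))])
```

NFD yields the fixed sequence U+0045, U+0323, U+0302, and NFC yields the single code point U+1EC6, so strings normalized to the same form can be compared directly.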

Comparing Text

Omnis uses two types of comparison for character strings:

Comparison of the UTF-8 values of the strings. This is called Character comparison.
Comparison according to the rules for the locale specified via the localization data file; prior to comparison, the input data is normalized. This is called National comparison. National comparison is more likely to produce results that the end-user would expect. Note that upper casing used in conjunction with national comparison may not have an effect, since sometimes the rules for the locale ignore the case of the characters.

The natcmp() function uses national comparison. Note that natcmp() is not available in client-side web client methods.

Omnis compares text for many different reasons, and in many different places. Key areas are:

Sorting lists
Searching lists
Manipulating data file indexes
Expressions, for example the test on an if statement

Omnis supports two types of character variable – character and national.

Sorting Lists

When using the character type, Omnis uses character comparison.

When using the national type, Omnis uses national comparison.
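
The practical difference between the two sort orders can be illustrated in Python. Sorting by code point value corresponds to character comparison. The second sort uses a deliberately crude stand-in for national comparison (strip accents, ignore case); real locale rules are considerably richer, and the key function here is hypothetical rather than anything Omnis uses.

```python
import unicodedata


def crude_national_key(text):
    """Rough approximation of a locale-aware key: drop combining marks, ignore case."""
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()


words = ["zebra", "Apple", "\u00e9clair", "apple"]

by_code_point = sorted(words)                        # character comparison
by_crude_key = sorted(words, key=crude_national_key)  # approximates national comparison

print(by_code_point)
print(by_crude_key)
```

In code point order the accented word sorts after "zebra", because U+00E9 is greater than every unaccented Latin letter; with the approximate key it sorts between "apple" and "zebra", which is closer to what an end-user would expect.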

Searching Lists

When using the character type, Omnis uses character comparison.

Searches that directly use a character column of national type use national comparison.

Other searches, for example searches using a calculation, will behave as if they are operating on normal character data. However, you can use natcmp() as part of the calculation, in order to use national comparison.

Manipulating Data File Indexes

Indexes for national fields use national comparison.

Expressions

To ensure the correct behavior of expressions that test the value of character variables, you must either normalize their value first using nfc() or nfd(), or you must use the natcmp() function.
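
The need to normalize before testing for equality is easy to reproduce. In the Python sketch below (an illustration using unicodedata, not Omnis notation), two strings that display identically compare as different until both are normalized to the same form. The helper name is hypothetical.

```python
import unicodedata


def same_text(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "caf\u00e9"       # pre-composed e with acute
decomposed = "cafe\u0301"    # e followed by combining acute accent

print(composed == decomposed)            # raw comparison
print(same_text(composed, decomposed))   # comparison after normalization
```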

Drawing Text

Depending on the font and operating system you use, different representations of the same end-user character may not always be drawn in the same way. The same applies if you try to use strings that require surrogate pairs. Generally speaking, you will get the best results if you normalize the text using nfc(), as the issues generally occur with composite characters.

Entering Text

Wherever possible, you should use the nfc() normalization form for data that is to be edited. If composite characters are present in the data, multiple left or right arrow key presses are required to skip a composite character. In addition, clicking and selecting in the text may highlight an area which, when copied to the clipboard, does not contain exactly what appeared to be highlighted.

Omnis performs NFC normalization on character data pasted from the clipboard when running in the thick client (runtime); no normalization occurs when pasting characters into a remote form when using the web client.

Character Translation

The following functions allow you to translate a specified character in a string to its Unicode value and to allow the reverse.

unicode(string,position[,returnhex])
returns the Unicode value of the character at the specified position in the string. The first position in string is 1. If Boolean returnhex is true (default false) it returns a hex string representing the value, of the form 'U+h'.
unichr(num1[,num2]...)
returns a string formed by concatenating the supplied Unicode character codes. Each code is either a number or a string of the form 'U+h', where h is 1-6 characters representing a hexadecimal value.

These functions are available in client-side methods as well as the thick client, but will generate an error if used in the non-Unicode version of Omnis.
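
The behavior described for these two functions can be mimicked in Python to show the 1-based position and the 'U+h' notation at work. These are Python analogues written for illustration, not the Omnis functions themselves; in particular, emitting uppercase hexadecimal digits without zero padding is an assumption about the exact output format.

```python
def unicode_value(string, position, returnhex=False):
    """Return the code point of the character at a 1-based position."""
    if position < 1:
        raise ValueError("position is 1-based")
    code = ord(string[position - 1])
    # Assumption: hex output is uppercase and not zero-padded.
    return f"U+{code:X}" if returnhex else code


def unichr(*codes):
    """Concatenate characters given as numbers or as 'U+h' strings."""
    chars = []
    for code in codes:
        if isinstance(code, str):
            if not code.upper().startswith("U+"):
                raise ValueError(f"expected a 'U+h' string, got {code!r}")
            code = int(code[2:], 16)
        chars.append(chr(code))
    return "".join(chars)


print(unicode_value("AB", 2), unicode_value("\u1EC6", 1, True))
print(unichr(72, "U+69"))
```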

Unicode Clients

Locale Identifier

The locale() function returns the Locale Identifier (LCID) for the current client machine/operating system. As well as the language of the machine, the Locale Identifier specifies the decimal, thousand and list separators, currency values, units of measurement, date formats, and character sort order. The Locale is specified at the operating system level and is in the form language_country, where language is the ISO 639 language name, and country is the ISO 3166 country name. For example, the Locale for the UK is “en_GB”. On macOS, there may be other information, such as a script code, between the language and country (this is because macOS uses ICU locales).
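
Code that inspects a locale identifier needs to allow for the optional middle component. The Python sketch below splits an identifier of the form described above; the function and type names are hypothetical, and the example identifier containing a script code is an assumed illustration of the macOS case rather than a value taken from Omnis.

```python
from typing import NamedTuple, Optional


class Locale(NamedTuple):
    language: str
    script: Optional[str]
    country: str


def parse_locale(identifier):
    """Split language_country or language_script_country into its parts."""
    parts = identifier.split("_")
    if len(parts) == 2:
        return Locale(parts[0], None, parts[1])
    if len(parts) == 3:
        return Locale(parts[0], parts[1], parts[2])
    raise ValueError(f"unexpected locale identifier: {identifier!r}")


print(parse_locale("en_GB"))
print(parse_locale("zh_Hans_CN"))   # assumed example with a script code
```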

Unicode Data Handling

The uniconv() function allows you to translate Unicode character data from one type to another. The syntax is:

uniconv(srctype,src,dsttype,dst,bom,errtext)

The function converts src and stores the result in dst. It returns zero for success, or a non-zero error code together with error text in errtext. The src and dst parameters are either binary or character variables, depending on the values of srctype and dsttype.

srctype and dsttype are one of the kUniType... constants (see below).

The bom parameter is a Boolean: if true, dst has a Unicode Byte Order Marker (BOM) if relevant for the destination type.

The kUniType... constants are as follows:

kUniTypeAuto
The source encoding is automatically detected from the conversion source; possible encodings are identified by the remaining kUniType... constants (allowed only for the source type).
kUniTypeUTF8
The data is stored in a binary variable and contains Unicode character data encoded using UTF-8
kUniTypeUTF16
The data is stored in a binary variable and contains Unicode character data encoded using UTF-16LE if the machine is little-endian, or UTF-16BE if the machine is big-endian. Useful when writing cross-platform code that interacts with the OS.
kUniTypeUTF16BE or kUniTypeUTF16LE
The data is stored in a binary variable and contains Unicode character data encoded using UTF-16BE (big-endian) or UTF-16LE (little-endian)
kUniTypeUTF32
The data is stored in a binary variable and contains Unicode character data encoded using UTF-32LE if the machine is little-endian, or UTF-32BE if the machine is big-endian. Useful when writing cross-platform code that interacts with the OS.
kUniTypeUTF32BE or kUniTypeUTF32LE
The data is stored in a binary variable and contains Unicode character data encoded using UTF-32BE (big-endian) or UTF-32LE (little-endian)
kUniTypeNativeCharacters
The data is stored in a binary variable and contains a stream of bytes, where each byte is a character in the Latin 1 character set for the machine (ANSI on Windows, MacRoman on macOS, ISO-8859-1 on Unix)
kUniTypeCharacter
The data is stored in a character variable. Note – this constant has been moved since the last Unicode build, so you need to re-enter it in your code.
kUniTypeAnsi…
The data is stored in a binary variable, and contains character data where each byte is encoded using the specified ANSI code page. A range of constants are provided to cater for most world or regional languages, including Cyrillic, Greek, Hebrew, Arabic, Thai, and so on
kUniTypeISO8859...
The data is stored in a binary variable, and contains character data where each byte is encoded using the specified ISO 8859 code page.
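
The general shape of such a conversion (decode from the source encoding, encode to the destination, optionally prefix a BOM, and report failure through a code and message) can be sketched in Python. This is an analogue for illustration only: it does not reproduce uniconv()'s parameters or error codes, and the function name, the use of Python codec names in place of kUniType... constants, and the error code value 1 are all assumptions.

```python
import codecs
import sys

BOMS = {
    "utf-8": codecs.BOM_UTF8,
    "utf-16-le": codecs.BOM_UTF16_LE,
    "utf-16-be": codecs.BOM_UTF16_BE,
    "utf-32-le": codecs.BOM_UTF32_LE,
    "utf-32-be": codecs.BOM_UTF32_BE,
}


def native_utf16():
    """Name the UTF-16 variant matching this machine's byte order."""
    return "utf-16-le" if sys.byteorder == "little" else "utf-16-be"


def transcode(src, src_encoding, dst_encoding, bom=False):
    """Return (error_code, converted_bytes, error_text); error_code 0 means success."""
    try:
        converted = src.decode(src_encoding).encode(dst_encoding)
    except (UnicodeDecodeError, UnicodeEncodeError) as exc:
        return 1, b"", str(exc)
    if bom:
        # Encodings without a BOM (for example latin-1) are left unprefixed.
        converted = BOMS.get(dst_encoding, b"") + converted
    return 0, converted, ""


print(transcode("\u00e9".encode("utf-8"), "utf-8", "utf-16-le", bom=True))
print(transcode(b"\xff", "utf-8", "latin-1")[0])
```

The native_utf16() helper shows the idea behind an encoding whose byte order follows the machine: the same source runs unchanged on little-endian and big-endian hardware.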

There are two sys() functions to assist OEM conversion when using the uniconv() function.

sys(218) modifies OEM conversion to map CR to CR and LF to LF.
sys(219) reverts to the original mapping for the OEM code page.

Formfile

The $filereadencoding and $filewriteencoding properties have been changed. In previous versions of Omnis Studio, the Formfile component defined kFFEncoding… constants. These constants should no longer be used, and you are advised to use the kUniType… constants to identify the file encoding. Formfile has been extended, so that you can use any of the kUniType... constants except kUniTypeCharacter for the $filereadencoding property, and any of the kUniType... constants except kUniTypeAuto and kUniTypeCharacter for the $filewriteencoding property.

In addition, there is also a kUniTypeBinary constant to identify files that are to be treated as raw binary data.

Code that uses the old kFFEncoding… constants should continue to work.

Fileops

The Fileops component has two methods, $readcharacter() and $writecharacter() which allow you to read and write Unicode character data from and to a file.

$readcharacter(encoding,variable)
reads all data from a file containing character data into variable; encoding is one of the kUniType… constants (listed above), identifying the encoding of the file.
$writecharacter(encoding,variable)
replaces the contents of the file with the character data stored in variable; encoding is one of the kUniType… constants, identifying the encoding of the file.

For $readcharacter, specify the encoding as any kUniType... constant except kUniTypeBinary and kUniTypeCharacter.

For $writecharacter, specify the encoding as any kUniType... constant except kUniTypeAuto, kUniTypeBinary and kUniTypeCharacter.

Note that the $readcharacter() and $writecharacter() methods use the kUniType… constants and not the kFFEncoding… constants, which should no longer be used.
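
Reading with an automatically detected encoding is only permitted on the read side because detection needs existing file content to inspect. One common way to detect the encoding is to look for a Byte Order Marker, sketched below in Python. This chapter does not say how Omnis performs its detection, so the BOM-based approach, the function name, and the UTF-8 default are all assumptions.

```python
import codecs

# UTF-32 markers are checked first because the UTF-32LE BOM begins with the UTF-16LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def read_characters(data, default="utf-8"):
    """Decode file contents, choosing the encoding from a leading BOM if present."""
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding)
    return data.decode(default)   # assumption: no BOM means the default encoding


sample = codecs.BOM_UTF16_LE + "h\u00e9".encode("utf-16-le")
print(read_characters(sample))
```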

Mixing Char & Binary data

You cannot concatenate a Character variable to a Binary in the Unicode version of Omnis Studio. The correct method is to use $readfile to read the file into a Binary variable, and then parse the binary variable. Assigning Character to Binary and vice-versa is likely to cause problems, including data corruption, and should therefore be avoided.
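
The same separation between character and binary data exists in other Unicode-aware environments, which may help explain the restriction. In Python, for example, joining text to bytes is rejected outright, and the text must first be converted with an explicitly chosen encoding. The sketch below is an analogy, not Omnis behavior; the helper name is hypothetical.

```python
def append_text(buffer: bytes, text: str, encoding: str = "utf-8") -> bytes:
    """Append text to binary data by encoding it explicitly first."""
    return buffer + text.encode(encoding)


try:
    b"abc" + "d"          # mixing binary and character data directly
    mixed_allowed = True
except TypeError:
    mixed_allowed = False

print(mixed_allowed)
print(append_text(b"ab", "\u00e9"))
```

The explicit encoding step is the point: the bytes that represent a character depend on the encoding, so there is no single correct result for an implicit conversion.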

Import/Export and Report File Encoding

There are a number of Omnis Preferences ($root.$prefs) that control the encoding of import text files, export files, and report data written to text files and the port. These are:

$importencoding
The encoding used for imported data when importing from port, or when the import file does not have a Unicode Byte Order Marker (BOM). Any of the kUniType... constants, except kUniTypeAuto, kUniTypeCharacter, kUniTypeBinary and the kUniTypeUTF32… values.
$exportencoding
The encoding used for exporting data and printing to port or text file. Any of the kUniType... constants, except kUniTypeAuto, kUniTypeCharacter and kUniTypeBinary.
$exportbom
If true, and the $exportencoding preference identifies a Unicode encoding, a Unicode BOM is output at the start of the output file.

The default value of $importencoding and $exportencoding is kUniTypeUTF8, but you can set them using the Preferences option in the Options menu in the bottom-left corner of the Studio Browser. You can also set the corresponding “importencoding” and “exportencoding” items in the “prefs” group in the Omnis configuration file (config.json) using the Edit configuration option in the same menu.

In a multi-threaded server, there is a separate value of each of these properties for each thread.

Omnis Data File Conversion

Omnis datafiles are supported for backwards compatibility only in legacy Omnis applications, and therefore they should not be used for new applications.

WARNING: YOU SHOULD MAKE A SECURE BACKUP OF YOUR OMNIS DATA FILES BEFORE CONVERTING THEM IN THE UNICODE VERSION OF OMNIS STUDIO (note all versions after Omnis Studio 5 are Unicode based and will convert Omnis datafiles to Unicode automatically).

When you access an Omnis data file you are asked to confirm that you want to convert the data. After you select Yes, Omnis displays a dialog which offers two types of conversion:

Full
whereby a full conversion of the Character based data in your Omnis data file takes place. The existing indexes are dropped and a new index of your data is built
Quick
whereby the indexes are dropped and rebuilt, but the Character data in your Omnis data file is not converted. This is OK for files containing only 7 bit data: Omnis does not check that the file contains only 7 bit data, so it’s your responsibility to know whether or not it is safe to run this conversion process.

The full data file conversion mechanism converts the data in your Omnis data file and rebuilds the indexes. When data file conversion takes place, all data marked as Character is converted, including any characters >= 128. Note that in the case where character data is stored in a binary or external file, for example, text stored in a document file, conversion of this data does not take place.
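
Since deciding whether a Quick conversion is safe is left to you, it can help to be precise about what "7 bit data" means: every byte value is below 128. The Python sketch below expresses that test. How you obtain the raw character bytes from your own data (for example from an export made with the non-Unicode version) is outside the scope of the sketch, and the function name is hypothetical.

```python
def is_seven_bit(samples):
    """Return True if every byte in every sample is below 128."""
    return all(sample.isascii() for sample in samples)


print(is_seven_bit([b"Smith", b"12 High Street"]))
print(is_seven_bit([b"Smith", b"caf\x8e"]))   # contains a byte >= 128
```

A single value with a byte of 128 or above is enough to require the Full conversion.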

Testing Data File Conversion

Omnis Studio can perform a full conversion of Omnis data files to Unicode, as described above. If this is the first time you have used the Unicode version of Omnis, we suggest that you make a secure copy/backup of your Omnis data file and convert one of the copies using the ‘Full’ conversion mechanism. We suggest that you check the results of the full conversion carefully, making sure that the Character data has converted successfully and that the indexes have been rebuilt successfully.

You may want to perform some regression tests on your application and data – you should normally do this with a new version of Studio, but when converting to Unicode Omnis, and converting your data files, you need to be especially sensitive to possible data file and indexing issues.

Data File Commands

The Open data file and Prompt for data file commands have an existing option called “Convert without user prompts”. If this is checked, and the new “Full Unicode conversion” option is checked, no dialogs are displayed and your data is converted to Unicode using the Full conversion process.