Unicode Support

Description

Official support for Unicode was added in the VAST 2022 (11.0.0) release to complement the existing National Language Support (NLS) feature.

Overview

There are many inherent complexities in the digital representation and transmission of the world's languages.

The concept of a character is itself complex when considered across the spectrum of all languages. Even with Unicode, users still face encoding issues when data arrives over the wire or from the filesystem. Endianness can also become an issue in some encoded forms (such as UTF-16LE and UTF-16BE).
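To make the endianness issue concrete, here is a small illustrative sketch in Python (it is not VAST code and uses only the Python standard library). It encodes the same two-character text in three forms and shows what happens when UTF-16 bytes are read with the wrong byte order.

```python
text = "A\u00fc"  # "A" followed by "ü" (U+00FC)

utf16_le = text.encode("utf-16-le")
utf16_be = text.encode("utf-16-be")
utf8 = text.encode("utf-8")

print(utf16_le.hex(" "))  # 41 00 fc 00
print(utf16_be.hex(" "))  # 00 41 00 fc
print(utf8.hex(" "))      # 41 c3 bc

# Reading little-endian bytes as big-endian does not fail here.
# It silently produces two unrelated code points instead.
misread = utf16_le.decode("utf-16-be")
print([hex(ord(c)) for c in misread])  # ['0x4100', '0xfc00']
```

The misread succeeds without an error, which is why the byte order has to be known (or signalled) before decoding.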

Normalization forms introduce further complexity, since many "user-perceived" characters have several possible code point representations. It can also be ambiguous what a character is and how it can be accessed from a string. Even the Unicode Standard's usage of the term character is not consistent.
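As an illustration of multiple code point representations, the following Python sketch (standard library only, not VAST code) builds "ü" in its precomposed and decomposed forms and converts between them with the NFC and NFD normalization forms.

```python
import unicodedata

composed = "\u00fc"     # ü as a single code point
decomposed = "u\u0308"  # u followed by COMBINING DIAERESIS

# Both render as the same user-perceived character, but they differ
# when compared code point by code point.
print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 1 2

# Normalizing to a common form makes them identical.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```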

The UnicodeSupport module was created to help solve many of these issues by providing a modern String implementation that adheres to the Unicode Standard.

Unicode String Model

Three main abstractions have been developed to facilitate working with Unicode data. They are the Unicode counterparts to the locale-based <String>/<Character>.

They are: <UnicodeScalar>, <Grapheme>, and <UnicodeString>. These will be discussed in more detail below, but the components of the model logically form a tree structure. At the root is the <UnicodeString>, which is composed of 0 or more <Grapheme>s (children). Each <Grapheme> is logically composed of 1 or more <UnicodeScalar>s.

This is only a logical model to help you reason about the relationships. The representation in memory is far more compact and different.

UnicodeScalar

A <UnicodeScalar> represents any Unicode code point except those in the surrogate range (U+D800 to U+DFFF), which is reserved for UTF-16 encoding.
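The definition can be expressed as a simple range check. The sketch below is illustrative Python, not VAST code, and the function name is_unicode_scalar is hypothetical. It also shows why the surrogate range is reserved: UTF-16 uses a pair of values from that range to encode code points above U+FFFF.

```python
def is_unicode_scalar(code_point):
    """Return True if code_point is a Unicode scalar value."""
    in_code_space = 0 <= code_point <= 0x10FFFF
    is_surrogate = 0xD800 <= code_point <= 0xDFFF
    return in_code_space and not is_surrogate

print(is_unicode_scalar(0x00FC))    # True  (ü)
print(is_unicode_scalar(0xD83D))    # False (a high surrogate)
print(is_unicode_scalar(0x1F600))   # True  (an emoji)
print(is_unicode_scalar(0x110000))  # False (outside the code space)

# U+1F600 is one scalar, but UTF-16 needs a surrogate pair to encode it.
print("\U0001F600".encode("utf-16-be").hex())  # d83dde00
```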

Code points are the unique values that the Unicode consortium assigns to the Unicode code space. (The code space is the complete range of possible Unicode values.) Depending on your definition of what a character could be, a code point can represent a character. There are many other classifications for code points as well.

Since Unicode scalars are a subset of code points, many languages consider a Unicode scalar to be the native character, and accessing a string by an index value would result in a Unicode scalar.

In VAST, our <UnicodeScalar> is able to describe many properties from the Unicode Character Database, such as a scalar's name, its case, and whether it is alphabetic or numeric. VAST Unicode Views (described later in this document) make it easy to access Unicode scalars from <Grapheme>s and <UnicodeString>s.
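For readers unfamiliar with the Unicode Character Database, the following Python sketch shows the kind of properties it exposes. It uses Python's standard unicodedata module rather than the VAST <UnicodeScalar> API, and the helper name describe is hypothetical.

```python
import unicodedata

def describe(scalar):
    """Collect a few Unicode Character Database properties for one scalar."""
    return {
        "name": unicodedata.name(scalar),
        "category": unicodedata.category(scalar),  # e.g. "Ll" = lowercase letter
        "alphabetic": scalar.isalpha(),
        "numeric": scalar.isnumeric(),
        "uppercase": scalar.upper(),
    }

info = describe("\u00e9")  # é
print(info["name"])        # LATIN SMALL LETTER E WITH ACUTE
print(info["category"])    # Ll
print(info["alphabetic"])  # True
print(info["numeric"])     # False

# Numeric properties cover more than the digits 0-9.
print(unicodedata.numeric("\u00bd"))  # 0.5 (VULGAR FRACTION ONE HALF)
```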

Grapheme

A <Grapheme> in VAST represents a user-perceived character and is the basic unit of a VAST <UnicodeString>.

Only a few programming languages with Unicode support consider a character to be the written expression of a character as a user might see it on a screen, rather than a digital code point. While the visual expression is called a glyph, the digital expression of this concept is called a grapheme cluster.

In VAST, we call this a <Grapheme>. The <Grapheme> is logically composed of one or more <UnicodeScalar>s. It is identified using extended grapheme cluster boundary algorithms from Text Segmentation in the Unicode Standard.
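The idea of grouping scalars into user-perceived characters can be sketched as follows. This Python function is a deliberately simplified approximation and is not the extended grapheme cluster algorithm from the Unicode Standard: it only attaches combining marks to the preceding scalar and keeps a carriage return and line feed together. The name simple_graphemes is hypothetical.

```python
import unicodedata

MARK_CATEGORIES = ("Mn", "Me", "Mc")  # nonspacing, enclosing, spacing marks

def simple_graphemes(text):
    """Group code points into approximate user-perceived characters.

    Simplification: the real boundary rules also cover emoji sequences,
    Hangul syllables, regional indicators, and more.
    """
    clusters = []
    for scalar in text:
        is_mark = unicodedata.category(scalar) in MARK_CATEGORIES
        completes_crlf = bool(clusters) and clusters[-1] == "\r" and scalar == "\n"
        if clusters and (is_mark or completes_crlf):
            clusters[-1] += scalar   # extend the current cluster
        else:
            clusters.append(scalar)  # start a new cluster
    return clusters

clusters = simple_graphemes("u\u0308ber\r\n")
print(len("u\u0308ber\r\n"))  # 7 code points
print(len(clusters))          # 5 clusters
```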

The VAST <Grapheme> will abstract the details of how to group enough Unicode scalars together to form what we would think of as a character on the screen. It also abstracts the details regarding normalization. Normalization is problematic simply because it can create multiple binary representations of what is really the same string or character. Therefore, concepts like comparison and hashing will give incorrect results, if unhandled.

However, the VAST <Grapheme> handles these details transparently. It detects and ensures a common normalized form for various operations where these differences would matter, so the user can focus on programming, not worrying about what normalization form a string is in.
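One way to picture this behavior is a wrapper that compares and hashes on a normalized key while leaving the stored text untouched. The Python class below is a hypothetical sketch of that general technique, not a description of how VAST implements it, and the class name NormalizedText is an assumption made for illustration.

```python
import unicodedata

class NormalizedText:
    """Text that compares and hashes by its NFC form but keeps its original form."""

    def __init__(self, text):
        self.text = text  # stored exactly as given, never rewritten

    def _key(self):
        return unicodedata.normalize("NFC", self.text)

    def __eq__(self, other):
        if not isinstance(other, NormalizedText):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        # Equal objects must hash equally, so hash the same normalized key.
        return hash(self._key())

composed = NormalizedText("\u00fc")     # ü as one code point
decomposed = NormalizedText("u\u0308")  # u + combining diaeresis

print(composed == decomposed)        # True
print(len({composed, decomposed}))   # 1
print(len(decomposed.text))          # 2 (original form is preserved)
```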

The VAST <Grapheme> is the best Unicode analog to the standard Smalltalk <Character> class. A "ü" always maps to one <Grapheme>, even though it may be logically composed of one Unicode scalar (U+00FC) or two (U+0075 followed by the combining diaeresis U+0308). We maintain the original form for various reasons and do not implicitly convert a <Grapheme>'s internals from one encoded normalization form to another.

In VAST, our <Grapheme> is also able to describe many of the same properties from the Unicode Character Database that a <UnicodeScalar> can.

UnicodeString

A <UnicodeString> in VAST is the Unicode analog to the standard <String> class and is API compatible with it.

Our <UnicodeString> is logically a sequenceable collection of <Grapheme>s, and it is both mutable and growable. The standard <String> class is also mutable, but it cannot grow without making a copy of itself. The ability to grow and shrink is a natural consequence of how a container with contiguous, compact backing storage has to work when each element has a variable length in that storage. As such, <UnicodeString> is a subclass of <AdditiveSequenceableCollection>, similar to <OrderedCollection>.

The VAST <UnicodeString> is designed to be API compatible with <EsString> and subclasses, but it is not a subclass of <EsString>. There are a few reasons for this beyond just the ability to grow. <EsString> is locale-sensitive and is expected to have subclasses with fixed-length code units. Each one of those code units is expected to create a fully composed character. (A Unicode string might not have fixed-length code units nor code units that always create a fully composed character.)

Many APIs up and down the <Collection> hierarchy, which <EsString> is a part of, have the implicit expectation that the elements are self-contained and meaningful in isolation. In other words, it isn't expected that grabbing the first character (string at: 1) requires collecting a variable number of elements to form it (as can be the case in Unicode). Nor is it expected that copying from one index to another may copy from the middle of one character to the middle of another, giving nonsense as a result (also a possibility in Unicode). Nor that reversing the characters in a string may give the wrong answer because the interpretation of what constitutes a character does not work in reverse, or works incorrectly (again, as can happen in Unicode).
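These failure modes are easy to reproduce in a language whose strings are indexed by code point. The Python sketch below is illustrative only and is not VAST code. The grapheme list is written out by hand to keep the example short.

```python
import unicodedata

text = "u\u0308a"             # "üa", with ü in decomposed form
graphemes = ["u\u0308", "a"]  # the same text split by hand into graphemes

# Copying by code point index can cut a character in half.
print(text[:1] == "u")  # True: the diaeresis was lost

# Reversing by code point moves the diaeresis onto the wrong letter.
naive_reversed = text[::-1]
print(unicodedata.normalize("NFC", naive_reversed) == "\u00e4u")  # True: "äu"

# Reversing by grapheme keeps each character intact.
grapheme_reversed = "".join(reversed(graphemes))
print(unicodedata.normalize("NFC", grapheme_reversed) == "a\u00fc")  # True: "aü"
```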

Because <UnicodeString> is using graphemes, it naturally works correctly within the context of the <Collection> hierarchy and works efficiently for a VAST user.

Views

A <UnicodeView> in VAST is a readable, positionable, bidirectional stream that provides a particular interpretation of the elements in a <UnicodeString>.

As discussed, even the concept of how developers want to work with characters is not always consistent. We have chosen the basic accessible unit of the <UnicodeString> to be a <Grapheme> (backed by an extended grapheme cluster) and that should be thought of as VAST's Unicode character. However, sometimes a user might want to view the <UnicodeString> as a collection of <UnicodeScalar>s and be able to efficiently work with those.

Our <UnicodeScalarView> offers this and is available to both a <Grapheme> and <UnicodeString>. A user might want to get the <UnicodeString>, <Grapheme> or <UnicodeScalar> in terms of UTF-8, UTF-16, or UTF-32 encoded bytes, and our <Utf8View>, <Utf16View> and <Utf32View> offer that. And, of course, our <UnicodeString> has a <GraphemeView> that gives linear performance over the contents of the <UnicodeString> and is the analog to asking a standard <String> for a #readStream.
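To see why the choice of view matters, consider how many elements the same text has under each interpretation. The Python sketch below is illustrative and is not VAST code. It counts scalars and UTF code units with the standard library, and the grapheme count is stated by hand.

```python
text = "u\u0308\U0001F600"  # decomposed "ü" followed by one emoji

scalar_count = len(text)                        # Python strings index by code point
utf8_units = len(text.encode("utf-8"))          # 1-byte code units
utf16_units = len(text.encode("utf-16-le")) // 2  # 2-byte code units
utf32_units = len(text.encode("utf-32-le")) // 4  # 4-byte code units
grapheme_count = 2                              # "ü" and the emoji, counted by hand

print(grapheme_count)  # 2
print(scalar_count)    # 3
print(utf8_units)      # 7
print(utf16_units)     # 4
print(utf32_units)     # 3
```

The same two user-perceived characters therefore occupy anywhere from 2 to 7 positions, depending on the unit being counted.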

All views keep bookmarks to the internal representation of a <UnicodeString> so they can pick up exactly where they left off in a call such as #next or #previous. The position in the stream is expressed in terms of the basic unit of the view. For a <GraphemeView>, these are <Grapheme>-based positions. For a <UnicodeScalarView>, these are <UnicodeScalar>-based positions. For any of the UTF views, each position is the associated code-unit position.

Views will always have a consistent view of the <UnicodeString>, even if code is making modifications to the <UnicodeString> during the usage of the view. This will be discussed in more detail below, but in short, a view is augmented by our copy-on-write feature for <UnicodeString>s.

Unlike our standard streams, views have the ability to get the previous element and stream backwards. This is really powerful, especially considering the complexity of the encoding under the hood.
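The general shape of a bidirectional, positionable stream can be sketched in a few lines. The Python class below is a hypothetical illustration, not the VAST <UnicodeView> implementation: it works over an already-split sequence of elements and takes a snapshot on creation, whereas a real view must decode variable-length elements in both directions.

```python
class SimpleView:
    """A readable, positionable stream that can move forwards and backwards."""

    def __init__(self, elements):
        self._elements = tuple(elements)  # snapshot: later changes are not seen
        self.position = 0                 # counted in elements of this view

    def at_end(self):
        return self.position >= len(self._elements)

    def next(self):
        """Return the element at the current position and advance, or None at the end."""
        if self.at_end():
            return None
        element = self._elements[self.position]
        self.position += 1
        return element

    def previous(self):
        """Step back one element and return it, or None at the start."""
        if self.position == 0:
            return None
        self.position -= 1
        return self._elements[self.position]

source = ["u\u0308", "b", "e", "r"]
view = SimpleView(source)
first = view.next()         # "ü" (decomposed), position becomes 1
second = view.next()        # "b", position becomes 2
source.append("!")          # modifying the source does not affect the view
stepped_back = view.previous()  # "b" again, position becomes 1
print(view.position)        # 1
```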

Constant Time Indexing

In this case, constant-time indexing is the ability to select a character in a string with the time it takes being independent of the size of the string.

Constant-time indexing is a popular topic of discussion in relation to Unicode. A character, which VAST defines as a <Grapheme>, is variable-length both in its number of code points and in the number of bytes in its encoded forms. Variable-length elements mean you can't do constant-time indexing without optimizations (which will be discussed below).

The next logical question might be: "How important is constant-time indexing with strings?" Instead, perhaps the question should be: "How important is constant-time random accessing of strings?" When considering how developers interact with strings, is it often the case that they jump around from one element to another in random order? Unless building something like a text editor, this is much less likely. Streaming through a string linearly, making copies of strings/substrings, looking for the index of an element or substring, regex, etc. is the more likely scenario.

Thankfully, we have solved most of these issues with Unicode Views, which not only provide linear-time streaming but also give a richer position object that allows constant-time indexing back into the string. We've also updated <ReadStream> and <WriteStream> with adapter classes to offer a better path for <UnicodeString>, since the idea is that a <UnicodeString> would one day be a drop-in replacement for <String>.
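One general way a position can allow constant-time access into variable-length storage is to remember the byte offset at which each element starts. The Python sketch below illustrates that offset-table technique. It is an assumption made for illustration and does not describe the actual layout of VAST's position objects. The function names are hypothetical.

```python
def build_offsets(graphemes):
    """Record the UTF-8 byte offset where each grapheme starts (plus the end)."""
    offsets = [0]
    for grapheme in graphemes:
        offsets.append(offsets[-1] + len(grapheme.encode("utf-8")))
    return offsets

def grapheme_at(storage, offsets, index):
    """Fetch one grapheme using two offset lookups, without scanning the storage."""
    return storage[offsets[index]:offsets[index + 1]].decode("utf-8")

graphemes = ["u\u0308", "b", "\U0001F600"]  # split by hand for brevity
storage = "".join(graphemes).encode("utf-8")
offsets = build_offsets(graphemes)          # built once, in linear time

print(offsets)  # [0, 3, 4, 8]
print(grapheme_at(storage, offsets, 1))  # b
```

Building the table costs one linear pass, which matches the streaming behavior described above. After that, each lookup is independent of the string's length.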

In general, maintaining constant-time indexing is not critical. However, in VAST, where existing legacy algorithms expect strings to be indexed in constant time, we try to maintain it. What matters most is correctness, and helping the user achieve it in what is a very complex area.

Optimizations

Copy-On-Write

Every <UnicodeString> uses copy-on-write semantics, and this is implemented in the virtual machine. Copy-on-write allows multiple <UnicodeString>s to share the same underlying storage until a write occurs. If a write is about to occur and the <UnicodeString> is sharing its storage, then it first makes a copy of the storage and performs a write on that. At this point, that particular <UnicodeString> now has its own unique storage, so there would be no further need for it to perform any copies again unless another <UnicodeString> starts sharing its storage. There are very quick write-barriers in the VM that detect the locations where writes occur and if action needs to be taken.
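The copy-on-write idea can be illustrated outside the virtual machine. The Python sketch below is a simplified model under stated assumptions: it uses an explicit sharer count instead of VM write barriers, and the names CowBuffer and _Storage are hypothetical. It is not how VAST implements the feature.

```python
class _Storage:
    """Backing bytes plus a count of how many buffers currently share them."""

    def __init__(self, data):
        self.data = bytearray(data)  # always a fresh copy of the given bytes
        self.sharers = 1

class CowBuffer:
    """A byte buffer whose copies share storage until one of them writes."""

    def __init__(self, data=b""):
        self._storage = _Storage(data)

    def copy(self):
        clone = CowBuffer.__new__(CowBuffer)  # skip __init__: no new storage
        clone._storage = self._storage
        self._storage.sharers += 1
        return clone

    def _ensure_unique(self):
        # The "write barrier": copy the storage only if someone else shares it.
        if self._storage.sharers > 1:
            self._storage.sharers -= 1
            self._storage = _Storage(self._storage.data)

    def append(self, data):
        self._ensure_unique()
        self._storage.data.extend(data)

    def to_bytes(self):
        return bytes(self._storage.data)

    def shares_storage_with(self, other):
        return self._storage is other._storage

original = CowBuffer(b"abc")
duplicate = original.copy()
print(original.shares_storage_with(duplicate))  # True: no bytes copied yet

duplicate.append(b"d")                          # first write triggers the copy
print(original.shares_storage_with(duplicate))  # False
print(original.to_bytes(), duplicate.to_bytes())  # b'abc' b'abcd'
```

After the write, each buffer owns its storage, so further writes to either one proceed without copying.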

Even if only a substring copy is being made, the storage can still be shared and we use special string slice objects to keep track of the offsets into the shared storage. But this all happens behind the scenes, so the user just works with <UnicodeString>s and doesn't care if its internals reflect a slice into other shared storage.

As hinted at earlier, views also use this mechanism when initialized on a <UnicodeString>. The <UnicodeString> is informed of this and marks its storage as shared. Now a consistent view of the original string can be guaranteed even if that string is later modified while it's being viewed.

ASCII-Optimization

The reality is that many strings will be composed of only ASCII characters. The internal storage is currently UTF-8, so if we can detect that all characters are in the ASCII range (ASCII is a subset of UTF-8), then we can mark the ASCII-flag in the object and begin to use byte-index optimized algorithms to achieve features like constant-time indexing.

There is one caveat: the Windows line terminator (\r\n) is the only grapheme in the ASCII range that is 2 bytes long, because it is made up of two scalars. This means that to use constant-time algorithms we have to prove that the <UnicodeString> is both ASCII and single-scalar. This is checked on string creation (and again on mutation), and we mark the single-scalar performance flag if we can prove there is no \r\n inside. If there is one, we still use faster algorithms than full grapheme-breaking, but they are O(n).
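The two checks can be sketched as a single scan over the UTF-8 bytes. The Python function below is an illustration of the idea only, not VAST code. The name performance_flags is hypothetical, and for non-ASCII input it simply declines to claim the single-scalar property rather than trying to prove it.

```python
def performance_flags(utf8_bytes):
    """Return (is_ascii, is_single_scalar) for UTF-8 encoded storage."""
    # Every ASCII character is a single byte below 0x80 in UTF-8.
    is_ascii = all(byte < 0x80 for byte in utf8_bytes)
    # Within ASCII, CR LF is the only grapheme made of two scalars.
    is_single_scalar = is_ascii and b"\r\n" not in utf8_bytes
    return is_ascii, is_single_scalar

print(performance_flags(b"hello"))                  # (True, True)
print(performance_flags(b"line one\r\nline two"))   # (True, False)
print(performance_flags("gr\u00fcn".encode("utf-8")))  # (False, False)

# When both flags hold, grapheme N is simply byte N of the storage.
storage = b"hello"
if performance_flags(storage) == (True, True):
    print(chr(storage[1]))  # e
```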

Canonical Form

Currently, we use special algorithms to quickly determine if a string and/or an argument string are in a common normalized form. If so, we don't need to perform any normalization to ensure certain operations will be correct such as comparison, equality or hashing operations.

The <UnicodeString> has an isNFC performance flag that it marks when it finds out the string is normalized in this form. However, this happens lazily so we don't slow down string creation.
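A lazily computed flag of this kind can be sketched as a cached check that only runs the first time it is needed. The Python class below is a hypothetical illustration, not VAST code. It relies on unicodedata.is_normalized, which is available in Python 3.8 and later, and the names LazyNfcText and check_count are assumptions.

```python
import unicodedata

class LazyNfcText:
    """Text with an is_nfc flag that is computed on first use, not at creation."""

    def __init__(self, text):
        self.text = text
        self._is_nfc = None   # None means "not checked yet"
        self.check_count = 0  # for demonstration: how often the check ran

    @property
    def is_nfc(self):
        if self._is_nfc is None:
            self.check_count += 1
            self._is_nfc = unicodedata.is_normalized("NFC", self.text)
        return self._is_nfc

composed = LazyNfcText("\u00fcber")
decomposed = LazyNfcText("u\u0308ber")

print(composed.check_count)  # 0: creating the text did no normalization work
print(composed.is_nfc)       # True
print(decomposed.is_nfc)     # False
print(composed.is_nfc, composed.check_count)  # True 1: the result was cached
```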

Last modified date: 04/28/2022