EsUnicodeEncoding
Description
Abstract container class whose subclasses implement the Unicode encoding forms (UTF-8, UTF-16 and UTF-32, including the explicit little-endian and big-endian variants).
Responsibility
  • Collection class that holds the contents of the associated UnicodeString view (for example #utf8 or #utf16)
    • (#contents)
    • 'smalltalk' asUnicodeString utf8 contents isKindOf: Utf8
  • Conversion to a UnicodeString
    • (#asUnicodeString, #asUnicodeString:)
    • (Utf32LE with: 16r1F600) asUnicodeString first name = 'GRINNING FACE'
  • Conversion to any other unicode encoding container class
    • (#asUtf8, #asUtf16, #asUtf16LE, #asUtf16BE, #asUtf32, #asUtf32LE, #asUtf32BE...)
    • (Utf32 with: 16r1F600) asUtf16BE asUtf32LE asUtf16LE asUtf8 asUtf32 = (Utf32 with: 16r1F600)
  • Validate content according to unicode encoding rules of the container class
    • (#isValid, #isInvalid)
    • 'a' asUnicodeString utf8 contents isValid.
    • (Utf8 with: 233) isInvalid
  • Container class that can be iterated in code unit slot sizes
    • (<Utf8> is byte class, <Utf16> is word class, <Utf32> is long class)
    • 16r1F600 asUnicodeString utf16 contents size = 2
Examples
Convert utf8 -> utf16BE -> utf32LE -> utf8
| utf8 |

utf8 := 'Smalltalk' utf8 contents asUtf16BE asUtf32LE asUtf8.
self assert: [utf8 asUnicodeString asSBString = 'Smalltalk']
Detect validity of UTF-16LE
"Valid UTF-16LE"
self assert: [(Utf16LE with: 16r97) isValid].

"Invalid UTF-16LE (Surrogate range)"
self assert: [(Utf16LE with: 16rD800) isInvalid].
Repair and convert invalid UTF-32BE to a UnicodeString
| invalidUtf32BE repairedUniStr |

invalidUtf32BE := Utf32BE with: 16rD834.
self assert: [invalidUtf32BE isInvalid].
repairedUniStr := invalidUtf32BE asUnicodeString: true.
self assert: [repairedUniStr size = 1 and: [repairedUniStr unicodeScalars first = UnicodeScalar replacementCharacter]]
Class Methods
asUnicodeString:
  Answer a new unicode string instance created from @anObject according to the decoding algorithm of the subclass.
    
     Arguments:
        anObject - <Object> see subclass implementation for details
     Answers:
        <UnicodeString>
     Raises:
        <Exception> EsPrimErrValueOutOfRange if @anObject contains invalid bytes
asUnicodeString:repair:
  Answer a new unicode string instance created from @anObject according to the decoding algorithm of the subclass.
     If @repair is true, then invalid sequences will be replaced with the unicode replacement character U+FFFD.
    
     Arguments:
        anObject - <Object> see subclass implementation for details
        repair - <Boolean>        
     Answers:
        <UnicodeString>
     Raises:
        <Exception> EsPrimErrValueOutOfRange if @anObject contains invalid bytes and @repair is false
copyFromOSMemory:
  Answer a new instance of UnicodeString,
     copying the bytes from anOSObject encoded in receiver's format.
     
     Arguments:
        anOSObject - <OSObject>
     Answers:
        <UnicodeString>
    
Instance Methods
asByteArray
  Answer the receiver as a <ByteArray>
     
     Examples:
        self assert: [(Utf16LE with: 16r3DD8 with: 16r00DE) asByteArray = #[216 61 222 0]]
     
     Answers:
        <ByteArray>
     Raises:
        <Exception> receiver with invalid content cannot be converted to unicode string
asUnicodeString
  Answer the receiver as a <UnicodeString> instance.
     Do not repair invalid sequences, but instead raise
     an exception.
     
     Answers:
        <UnicodeString>
     Raises:
        <Exception> receiver with invalid content cannot be converted to unicode string
asUnicodeString:
  Answer the receiver as a <UnicodeString> instance.
     If @repair is true, then repair invalid encodings such
     that a valid unicode string can be created.
     
     Repairing typically involves detecting invalid sequences
     and replacing with the unicode replacement character
     [UnicodeScalar replacementCharacter]
    
     Examples:
        'Grinning Face (U+1F600)'.
        self assert: [(Utf8 with: 16rF0 with: 16r9F with: 16r98 with: 16r80) asUnicodeString first name = 'GRINNING FACE'].
        
        'Invalid UTF8-encoding'.
        self assert: [[(Utf8 with: 233) asUnicodeString. false] on: Exception do: [:ex | ex exitWith: true]].
        
        'Repaired invalid UTF8-encoding'.
        self assert: [((Utf8 with: 233) asUnicodeString: true) = UnicodeScalar replacementCharacter asUnicodeString].
    
        self assert: [(System bigEndian
            ifTrue: [(Utf16 with: 16r3DD8 with: 16r00DE) asUnicodeString]
            ifFalse: [(Utf16 with: 16rD83D with: 16rDE00) asUnicodeString]) first name = 'GRINNING FACE'].
        
        'Invalid because of isolated surrogate - 16rD800'.
        self assert: [((Utf16LE with: 16rD800) asUnicodeString: true) = UnicodeScalar replacementCharacter asUnicodeString].
      
        'Grinning Face (U+1F600)'.
        self assert: [(Utf32LE with: 16r1F600) asUnicodeString first name = 'GRINNING FACE'].
        
        'Invalid UTF32-encoding'.
        self assert: [[(Utf32LE with: 16rD800) asUnicodeString. false] on: Exception do: [:ex | ex exitWith: true]].
        
        'Repaired invalid UTF32-encoding'.
        self assert: [((Utf32LE with: 16rD800) asUnicodeString: true) = UnicodeScalar replacementCharacter asUnicodeString].
    
     Arguments:
        repair - <Boolean>     
     Answers:
        <UnicodeString>
     Raises:
        <Exception> if @repair is false and receiver contains invalid contents
asUtf16
  Answer the receiver as a <Utf16> instance.
     Do not repair invalid sequences, but instead raise
     an exception.
     
     Answers:
        <Utf16>
     Raises:
        <Exception> receiver with invalid content cannot be converted to utf16
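A minimal sketch of the strict conversion, using only selectors that appear elsewhere on this page. The expected results are inferred from the overview examples and have not been taken from a test run.

```smalltalk
| grinning |

"U+1F600 lies outside the Basic Multilingual Plane, so UTF-16 needs
 a surrogate pair, i.e. two 16-bit code units."
grinning := (Utf32 with: 16r1F600) asUtf16.
self assert: [grinning size = 2].
self assert: [grinning asUnicodeString first name = 'GRINNING FACE'].

"Invalid content is not repaired; the conversion raises an exception instead."
self assert: [[(Utf8 with: 233) asUtf16. false] on: Exception do: [:ex | ex exitWith: true]].
```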
asUtf16:
  Answer the receiver as a <Utf16> instance.
     If @repair is true, then repair invalid encoded elements
     such that a valid utf16 can be created.
     
     Repairing typically involves detecting invalid sequences
     and replacing with the unicode replacement character
     [UnicodeScalar replacementCharacter]
    
     Arguments:
        repair - <Boolean>     
     Answers:
        <Utf16>
     Raises:
        <Exception> if @repair is false and receiver contains invalid contents
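As an illustration of the repairing variant, the sketch below converts an invalid UTF-8 container to UTF-16. It assumes the repaired result decodes to a single replacement character, by analogy with the #asUnicodeString: examples on this page.

```smalltalk
| repaired |

"(Utf8 with: 233) is a lone byte 16rE9, which is not well-formed UTF-8."
repaired := (Utf8 with: 233) asUtf16: true.

"The repaired container is valid UTF-16."
self assert: [repaired isValid].

"Assumption: the invalid byte was replaced by U+FFFD."
self assert: [repaired asUnicodeString = UnicodeScalar replacementCharacter asUnicodeString].
```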
asUtf16BE
  Answer the receiver as a <Utf16BE> instance.
     Do not repair invalid sequences, but instead raise
     an exception.
     
     Answers:
        <Utf16BE>
     Raises:
        <Exception> receiver with invalid content cannot be converted to utf16 big endian
asUtf16BE:
  Answer the receiver as a <Utf16BE> instance.
     If @repair is true, then repair invalid encoded elements
     such that a valid utf16 can be created.
     
     Repairing typically involves detecting invalid sequences
     and replacing with the unicode replacement character
     [UnicodeScalar replacementCharacter]
    
     Arguments:
        repair - <Boolean>     
     Answers:
        <Utf16BE>
     Raises:
        <Exception> if @repair is false and receiver contains invalid contents
asUtf16LE
  Answer the receiver as a <Utf16LE> instance.
     Do not repair invalid sequences, but instead raise
     an exception.
     
     Answers:
        <Utf16LE>
     Raises:
        <Exception> receiver with invalid content cannot be converted to utf16 little endian
asUtf16LE:
  Answer the receiver as a <Utf16LE> instance.
     If @repair is true, then repair invalid encoded elements
     such that a valid utf16 can be created.
     
     Repairing typically involves detecting invalid sequences
     and replacing with the unicode replacement character
     [UnicodeScalar replacementCharacter]
    
     Arguments:
        repair - <Boolean>     
     Answers:
        <Utf16LE>
     Raises:
        <Exception> if @repair is false and receiver contains invalid contents
asUtf32
  Answer the receiver as a <Utf32> instance.
     Do not repair invalid sequences, but instead raise
     an exception.
     
     Answers:
        <Utf32>
     Raises:
        <Exception> receiver with invalid content cannot be converted to utf32
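The following sketch shows how the number of code units changes between encodings. It reuses the GRINNING FACE byte sequence from the #asUnicodeString: examples and assumes that containers of the same class compare equal by content, as the overview example suggests.

```smalltalk
| utf8 utf32 |

"Four UTF-8 code units (bytes) encode U+1F600."
utf8 := Utf8 with: 16rF0 with: 16r9F with: 16r98 with: 16r80.
utf32 := utf8 asUtf32.

"UTF-32 stores every scalar in exactly one 32-bit code unit."
self assert: [utf8 size = 4].
self assert: [utf32 size = 1].
self assert: [utf32 = (Utf32 with: 16r1F600)].
```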
asUtf32:
  Answer the receiver as a <Utf32> instance.
     If @repair is true, then repair invalid encoded elements
     such that a valid utf32 can be created.
     
     Repairing typically involves detecting invalid sequences
     and replacing with the unicode replacement character
     [UnicodeScalar replacementCharacter]
    
     Arguments:
        repair - <Boolean>     
     Answers:
        <Utf32>
     Raises:
        <Exception> if @repair is false and receiver contains invalid contents
asUtf32BE
  Answer the receiver as a <Utf32BE> instance.
     Do not repair invalid sequences, but instead raise
     an exception.
     
     Answers:
        <Utf32BE>
     Raises:
        <Exception> receiver with invalid content cannot be converted to utf32 big endian
asUtf32BE:
  Answer the receiver as a <Utf32BE> instance.
     If @repair is true, then repair invalid encoded elements
     such that a valid utf32 can be created.
     
     Repairing typically involves detecting invalid sequences
     and replacing with the unicode replacement character
     [UnicodeScalar replacementCharacter]
     
     Arguments:
        repair - <Boolean>
     Answers:
        <Utf32BE>
     Raises:
        <Exception> if @repair is false and receiver contains invalid contents
asUtf32LE
  Answer the receiver as a <Utf32LE> instance.
     Do not repair invalid sequences, but instead raise
     an exception.
     
     Answers:
        <Utf32LE>
     Raises:
        <Exception> receiver with invalid content cannot be converted to utf32 little endian
asUtf32LE:
  Answer the receiver as a <Utf32LE> instance.
     If @repair is true, then repair invalid encoded elements
     such that a valid utf32 can be created.
     
     Repairing typically involves detecting invalid sequences
     and replacing with the unicode replacement character
     [UnicodeScalar replacementCharacter]
    
     Arguments:
        repair - <Boolean>     
     Answers:
        <Utf32LE>
     Raises:
        <Exception> if @repair is false and receiver contains invalid contents
asUtf8
  Answer the receiver as a <Utf8> instance.
     Do not repair invalid sequences, but instead raise
     an exception.
     
     Answers:
        <Utf8>
     Raises:
        <Exception> receiver with invalid content cannot be converted to utf8
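A short sketch of strict conversion to UTF-8, assuming content-based equality between Utf8 containers. It also shows why a lone 233 (16rE9) is invalid as UTF-8: the scalar U+00E9 is encoded as two bytes, not one.

```smalltalk
| ascii accented |

"Scalars below 16r80 are encoded as a single UTF-8 byte."
ascii := (Utf32 with: 16r41) asUtf8.
self assert: [ascii = (Utf8 with: 16r41)].

"U+00E9 requires two UTF-8 bytes: 16rC3 16rA9."
accented := (Utf32 with: 16rE9) asUtf8.
self assert: [accented = (Utf8 with: 16rC3 with: 16rA9)].

"An isolated surrogate is rejected rather than repaired."
self assert: [[(Utf32LE with: 16rD800) asUtf8. false] on: Exception do: [:ex | ex exitWith: true]].
```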
asUtf8:
  Answer the receiver as a <Utf8> instance.
     If @repair is true, then repair invalid encoded elements
     such that a valid utf8 can be created.
     
     Repairing typically involves detecting invalid sequences
     and replacing with the unicode replacement character
     [UnicodeScalar replacementCharacter]
    
     Arguments:
        repair - <Boolean>     
     Answers:
        <Utf8>
     Raises:
        <Exception> if @repair is false and receiver contains invalid contents
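The repairing conversion can be sketched as follows, starting from the isolated surrogate used in the UTF-16LE examples. That the repaired content decodes to exactly one replacement character is an assumption based on those examples.

```smalltalk
| invalid repaired |

invalid := Utf16LE with: 16rD800.
self assert: [invalid isInvalid].

"With repair enabled the conversion succeeds and yields valid UTF-8."
repaired := invalid asUtf8: true.
self assert: [repaired isValid].

"Assumption: the isolated surrogate was replaced by U+FFFD."
self assert: [repaired asUnicodeString = UnicodeScalar replacementCharacter asUnicodeString].
```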
isInvalid
  Answer true if the content of the container is invalid according to
     the rules of the encoding.
     
     Answers:
        <Boolean> true if invalid, false if valid
isValid
  Answer true if the content of the container is valid according to
     the rules of the encoding.
     
     @see method comments in subclass overrides for examples.
     
     Answers:
        <Boolean> true if valid, false if invalid
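One possible way to combine the validity tests with the conversion methods is to check first and repair only when needed. This is a sketch of a usage pattern, not a prescribed idiom; the variable names are illustrative.

```smalltalk
| container str |

container := Utf16LE with: 16rD800.

"Use the strict conversion for valid content and fall back to the
 repairing conversion otherwise, so that no exception is raised."
str := container isValid
    ifTrue: [container asUnicodeString]
    ifFalse: [container asUnicodeString: true].

self assert: [container isInvalid].
self assert: [str = UnicodeScalar replacementCharacter asUnicodeString].
```

Passing true to #asUnicodeString: unconditionally should give the same result; the explicit check is useful when invalid input also needs to be logged or counted.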
Last modified date: 01/06/2026