When characters aren't characters – Writings from the sticks

There are two “text handling” types in Xojo – String and Text.

And they vary quite a bit in how they handle textual data.

While strings use UTF-8 as their default encoding, you still have to worry about which Unicode form (composed or decomposed) the characters in the string are in. Strings don't deal with “characters” in the way you and I perceive them.

For instance, if you run this code:

Dim s1 As String = "ü"
Dim s2 As String = &u75 + &u308

Break

the text the debugger shows for both strings looks the same.

But if you change this to

Dim s1 As String = "ü"
Dim s2 As String = &u75 + &u308

If s1 = s2 Then
  Break
End If

what you will find is that while they appear to you and me to hold the same contents, they are not “the same”: the break point is never hit. This is because the first one uses the composed form (a single code point, U+00FC) and the second uses the decomposed form (u, U+0075, followed by the combining diaeresis, U+0308), and the two forms encode to different UTF-8 bytes.
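To see the difference for yourself, a minimal sketch along these lines dumps the raw bytes of each string. It assumes the classic framework's EncodeHex function, and it spells the composed ü as &uFC so the result doesn't depend on how the editor saved the literal:

```xojo
Dim s1 As String = &uFC          // composed ü (one code point)
Dim s2 As String = &u75 + &u308  // u followed by a combining diaeresis

// Passing True asks EncodeHex to put a space between bytes
Dim hex1 As String = EncodeHex(s1, True)
Dim hex2 As String = EncodeHex(s2, True)

Break // inspect hex1 and hex2 in the debugger
```

In UTF-8, U+00FC is the two bytes C3 BC, while U+0075 U+0308 is the three bytes 75 CC 88, so a comparison that doesn't normalize first has no way to treat them as equal.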

And there's no built-in mechanism to tell which form a string is in, nor any to convert one form into the other 🙁

So years ago the “next” framework was created and a new type was added – Text. It deals with these issues much better. That same code, using Text, looks like

Dim t1 As Text = "ü"
Dim t2 As Text = &u75 + &u308

If t1 = t2 Then
  Break
End If

but this time when you run you WILL hit the break point. Text handles the different forms seamlessly and you get the result you expect.

And the differences go further than this. When you split a string up into “characters” you get different numbers of characters from the two apparently equal strings. Not so with Text.

Dim s1 As String = "ü"
Dim s2 As String = &u75 + &u308

If s1 = s2 Then
  Break // won't stop here, but you might expect it should
End If

Dim s1Chars() As String = s1.Split("")
Dim s2Chars() As String = s2.Split("")

Break // note that s1Chars.Ubound < s2Chars.Ubound
      // and the contents are totally different

// Note: LenB / MidB / AscB walk the raw UTF-8 bytes,
// so despite their names these arrays hold byte values
Dim s1CodePoints() As UInt32
For i As Integer = 1 To s1.LenB
  s1CodePoints.Append AscB(s1.MidB(i,1))
Next
Dim s2CodePoints() As UInt32
For i As Integer = 1 To s2.LenB
  s2CodePoints.Append AscB(s2.MidB(i,1))
Next

Break // again the Ubounds are different - this time they should be!

Dim t1 As Text = "ü"
Dim t2 As Text = &u75 + &u308

If t1 = t2 Then
  Break
End If

Dim t1Chars() As Text = t1.Split
Dim t2Chars() As Text = t2.Split

Break // note that t1Chars.Ubound = t2Chars.Ubound
      // and the chars are "the same"!

Dim t1CodePoints() As UInt32
For Each cp As UInt32 In t1.Codepoints
  t1CodePoints.Append cp
Next
Dim t2CodePoints() As UInt32
For Each cp As UInt32 In t2.Codepoints
  t2CodePoints.Append cp
Next

Break // these should differ since one uses the composed form
      // and the other uses the decomposed form

Text just handles things seamlessly.

With the transition to API 2, it will be a shame if String doesn't adopt some of these capabilities and there's still no framework-provided means to normalize strings so they all use either the composed or the decomposed form, which would let us deal with the inconsistencies that can arise.
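In the meantime, one possible workaround is to route comparisons through Text. This is only a rough sketch: it assumes String.ToText is available and that both strings carry a valid encoding, and SameCharacters is a hypothetical helper name, not a framework method.

```xojo
Function SameCharacters(a As String, b As String) As Boolean
  // Hypothetical helper: Text's = operator treats the composed
  // and decomposed forms as equal, so compare the Text versions.
  // Assumes both strings have a defined encoding; ToText may
  // raise an exception otherwise.
  Return a.ToText = b.ToText
End Function
```

With this, SameCharacters(s1, s2) should return True for the two strings above. It only compares, though. It doesn't normalize anything, so the underlying bytes of the two strings still differ.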

2 Replies to “When characters aren't characters”

Anonymous says:

April 25, 2020 at 5:15 pm

2020r1 will get a String.Characters iterator like Text.Characters. First step in the right direction.

Norman Palardy says:

April 25, 2020 at 6:08 pm

Can't discuss unreleased versions.
