There are two “text handling” types in Xojo – String and Text.
And they vary quite a bit in how they handle textual data.
While strings use UTF-8 as their default encoding, you still have to worry about which Unicode normalization form the characters in the string are in. Strings don't deal with “characters” in the way you and I perceive them.
For instance, if you run this code:
Dim s1 As String = "ü"
Dim s2 As String = &u75 + &u308
Break
What you see in the debugger as the text they hold is the same.
But if you change this to:
Dim s1 As String = "ü"
Dim s2 As String = &u75 + &u308
If s1 = s2 Then
Break
End If
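What you will find is that while they appear to you and me to hold the same contents, they are not “the same”. This is because the first one uses one Unicode normalization form (a composed character, the single code point U+00FC) and the second uses a different form (decomposed characters: the letter u, U+0075, followed by the combining diaeresis, U+0308). Both are valid UTF-8, but the underlying bytes differ, so the comparison fails.
And there's no built-in mechanism to know whether a string is in one form or the other, nor any to convert one into the other 🙁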
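To see the difference between the two forms directly, one option is to dump the raw bytes. This is a minimal sketch, assuming the "ü" literal was typed as the precomposed character and that both strings are UTF-8 encoded. It uses the classic framework's EncodeHex function with its optional insertSpaces argument; the variable names h1 and h2 are only for illustration.

```xojo
Dim s1 As String = "ü"            // assumed precomposed: code point U+00FC
Dim s2 As String = &u75 + &u308   // "u" (U+0075) + combining diaeresis (U+0308)

// EncodeHex returns the bytes of each string as hexadecimal text
Dim h1 As String = EncodeHex(s1, True)
Dim h2 As String = EncodeHex(s2, True)

Break // inspect h1 and h2 in the debugger
```

In UTF-8, U+00FC is encoded as the two bytes C3 BC, while U+0075 followed by U+0308 is encoded as the three bytes 75 CC 88, so h1 and h2 should show different byte sequences even though both strings render as the same glyph.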
So years ago the “next” framework was created and a new type added – Text. And it deals with these issues much better. That same code, using Text, looks like:
Dim t1 As Text = "ü"
Dim t2 As Text = &u75 + &u308
If t1 = t2 Then
Break
End If
But this time when you run it you WILL hit the breakpoint. Text handles the different forms seamlessly and you get the result you expect.
And the differences go further than this. When you split a string up into “characters” you get different numbers of characters from the two apparently equal strings. Not so with Text.
Dim s1 As String = "ü"
Dim s2 As String = &u75 + &u308
If s1 = s2 Then
  Break // won't stop here, though you might expect it to
End If
Dim s1Chars() As String = s1.Split("")
Dim s2Chars() As String = s2.Split("")
Break // note that s1Chars.Ubound < s2Chars.Ubound
// and the contents are totally different
// despite the names, the next two arrays collect the raw UTF-8 bytes of each string
Dim s1CodePoints() As UInt32
For i As Integer = 1 To s1.LenB
  s1CodePoints.Append AscB(s1.MidB(i, 1))
Next
Dim s2CodePoints() As UInt32
For i As Integer = 1 To s2.LenB
  s2CodePoints.Append AscB(s2.MidB(i, 1))
Next
Break // again the Ubounds are different - this time they should be!
Dim t1 As Text = "ü"
Dim t2 As Text = &u75 + &u308
If t1 = t2 Then
  Break // stops here: Text treats the two forms as equal
End If
Dim t1Chars() As Text = t1.Split
Dim t2Chars() As Text = t2.Split
Break // note that t1Chars.Ubound = t2Chars.Ubound
// and the chars are "the same"!
Dim t1CodePoints() As UInt32
For Each cp As UInt32 In t1.Codepoints
  t1CodePoints.Append cp
Next
Dim t2CodePoints() As UInt32
For Each cp As UInt32 In t2.Codepoints
  t2CodePoints.Append cp
Next
Break // these should differ, since one is stored in composed form
// and the other in decomposed form
Text just handles things seamlessly.
With the transition to API 2 it will be a shame if String doesn't adopt some of these capabilities and there's still no framework-provided means to normalize strings so they all use the composed or the decomposed form, which would let us deal with the inconsistencies that can arise.
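Until something like that exists, one possible workaround is to compare through Text. The sketch below is an illustration under stated assumptions, not a framework feature: SameText is a hypothetical helper name, and it relies on String.ToText (which needs each string to have a valid encoding defined) and on the Text comparison behavior shown earlier.

```xojo
// Hypothetical helper, not part of the Xojo framework
Function SameText(a As String, b As String) As Boolean
  // ToText assumes each string has a valid, defined encoding (for example UTF-8)
  Dim ta As Text = a.ToText
  Dim tb As Text = b.ToText
  // Relies on Text treating composed and decomposed forms as equal
  Return ta = tb
End Function
```

With the earlier s1 and s2, SameText(s1, s2) would be expected to return True where s1 = s2 does not. It only compares, though; it does not give you a normalized String back.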
2020r1 will get a String.Characters iterator like Text.Characters. A first step in the right direction.
Can't discuss unreleased versions.