Slicing a string to get a substring
D's strings are just arrays of characters. This means any operation that you can do on arrays also works on strings. However, since a string is an array of UTF-8 code units, some behaviors may surprise you. Here, you'll get a substring by slicing and look at the potential pitfalls.
How to do it…
Let's try to get a substring from a string using the following steps:
Declare a string as follows:
string s = "月明かり is some Japanese text.";
Get the correct index for start and end. You'll get the Japanese text out by searching the string for the first space and slicing up to that point by using the following code:
import std.string;
string japaneseText = s[0 .. s.indexOf(" ")];
Loop over the string, looking at the UTF-8 code units as well as the Unicode code points. So, you can see the difference in your string by using the following code:
import std.stdio;
foreach(idx, char c; japaneseText)
    writefln("UTF-8 Code unit at index %d is %d", idx, c);
foreach(dchar c; japaneseText)
    writefln("UTF-32 code unit with value %d is %c", c, c);
The program prints twelve UTF-8 code units but only four dchars, because the Japanese text is composed of multibyte characters, unlike English text.
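The difference between the two loops can also be checked by counting instead of printing. The following is a minimal sketch, not part of the original recipe; the helper names codeUnitCount and codePointCount are made up for illustration, and walkLength comes from std.range:

```d
import std.range : walkLength;
import std.stdio : writefln;
import std.string : indexOf;

/// Number of UTF-8 code units (bytes) in the string (hypothetical helper).
size_t codeUnitCount(string text)
{
    return text.length;
}

/// Number of Unicode code points (hypothetical helper).
/// walkLength decodes the string while it walks over it.
size_t codePointCount(string text)
{
    return text.walkLength;
}

void main()
{
    string s = "月明かり is some Japanese text.";
    string japaneseText = s[0 .. s.indexOf(" ")];

    writefln("%d code units, %d code points",
        codeUnitCount(japaneseText), codePointCount(japaneseText));
}
```

For this string, the program should report 12 code units and 4 code points, since each of the four Japanese characters takes three bytes in UTF-8.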
How it works…
D's implementation of strings uses Unicode. Unicode is a complicated standard that could take up a whole book on its own, but you can use it in D knowing just some basics. D strings, as well as D source code, use UTF-8 encoding. This means you can paste text from any language into a D source file and process it with D code.
However, UTF-8 has a complication: the length of a single code point is variable. Often, one code point is one character, though Unicode's complexity means that a grapheme (that is, what you might call a visible character) may consist of more than one code point! For English text, UTF-8 maps directly to ASCII, which means that one code unit is one character. However, other languages have too many characters to express in one byte. Japanese is one example where all the characters are multibyte in UTF-8.
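To make the three levels (code units, code points, and graphemes) concrete, here is a small sketch. It assumes std.utf.codeLength and std.uni.byGrapheme from the standard library; the helper names utf8Width and graphemeCount are invented for this example:

```d
import std.range : walkLength;
import std.stdio : writefln;
import std.uni : byGrapheme;
import std.utf : codeLength;

/// UTF-8 code units needed to encode one code point (hypothetical helper).
size_t utf8Width(dchar c)
{
    return codeLength!char(c);
}

/// Number of graphemes, that is, visible characters (hypothetical helper).
size_t graphemeCount(string text)
{
    return text.byGrapheme.walkLength;
}

void main()
{
    writefln("'a' needs %d code unit(s)", utf8Width('a'));
    writefln("'月' needs %d code unit(s)", utf8Width('月'));

    // 'e' followed by U+0301 (combining acute accent):
    // two code points that render as one visible character.
    string accented = "e\u0301";
    writefln("%d code units, %d code points, %d grapheme(s)",
        accented.length, accented.walkLength, graphemeCount(accented));
}
```

The accented example is the case where even iterating by code point is not enough: the letter and its combining accent are separate code points, but a reader sees a single character.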
So, while there are only four Japanese characters in your string, if you slice s[0 .. 4], you won't get all four characters. D's slice operator works on code units. You'll get a partial result here, which may not be usable.
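The following sketch shows what goes wrong with s[0 .. 4] and one possible way to slice by code point count instead. It assumes std.utf.validate (which throws UTFException on malformed input) and std.utf.toUTFindex (which converts a code point index to a code unit index). The helper names are hypothetical, and firstCodePoints assumes count does not exceed the number of code points in the text:

```d
import std.stdio : writeln;
import std.utf : toUTFindex, UTFException, validate;

/// True if the text is well-formed UTF-8 (hypothetical helper).
bool isValidUtf8(string text)
{
    try
    {
        validate(text);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

/// First `count` code points of `text`, cut on a code point boundary
/// (hypothetical helper; assumes `text` has at least `count` code points).
string firstCodePoints(string text, size_t count)
{
    return text[0 .. text.toUTFindex(count)];
}

void main()
{
    string s = "月明かり is some Japanese text.";

    // Four code units: one whole character plus one byte of the next.
    writeln(isValidUtf8(s[0 .. 4]));

    // Four code points: the complete Japanese word.
    writeln(firstCodePoints(s, 4));
}
```

Note that the slice itself does not fail; D only reports the problem when something tries to decode the broken sequence.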
Instead, you found the correct index by using the standard library function indexOf. This searches the string for the given substring and returns its index, or -1 if it could not be found. The slice [start .. end] goes from start, inclusive, to end, exclusive. So, [0 .. indexOf(…)] goes from the start, up to, but not including, the space. This slice is safe to use, even if it contains multibyte characters.
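Because indexOf returns -1 when nothing is found, slicing with its result directly is only safe when the space is known to exist. A defensive variant might look like the following sketch; firstWord is a hypothetical helper, and returning the whole string when there is no space is just one reasonable choice:

```d
import std.stdio : writeln;
import std.string : indexOf;

/// Text before the first space, or the whole string if there is no space
/// (hypothetical helper).
string firstWord(string text)
{
    immutable spaceIndex = text.indexOf(' ');
    return spaceIndex < 0 ? text : text[0 .. spaceIndex];
}

void main()
{
    writeln(firstWord("月明かり is some Japanese text."));
    writeln(firstWord("nospace"));
}
```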
Finally, you looped over the Japanese text to examine the encodings. The foreach loop understands UTF encoding. The first variant asks for chars, or UTF-8 code units, and yields them without decoding. The second variant asks for dchars, which are UTF-32 code units that are numerically equivalent to Unicode code points. Asking for dchars is slower than iterating over chars, but has the advantage of removing much of the complexity of handling multibyte characters. The second loop prints only one entry per Japanese character, or any other character that cannot be encoded in a single UTF-8 unit.
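The decoding foreach can also be given an index variable. In that form, the index is the code unit offset where each code point starts, which is exactly what a safe slice needs. A minimal sketch, with codePointOffsets as a made-up helper name:

```d
import std.stdio : writefln;

/// Code unit offsets at which each code point of `text` starts
/// (hypothetical helper).
size_t[] codePointOffsets(string text)
{
    size_t[] offsets;
    // With a dchar loop variable, the index is the offset of the first
    // code unit of each decoded code point, not a running character count.
    foreach (offset, dchar c; text)
        offsets ~= offset;
    return offsets;
}

void main()
{
    foreach (offset; codePointOffsets("月明かり"))
        writefln("code point starts at code unit %d", offset);
}
```

Any of these offsets can be used as a slice boundary without cutting a character in half.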
There's more…
D also supports UTF-16 and UTF-32 strings. These are typed wstring and dstring, respectively. Let's look at each of these as follows:
wstring: This is very useful on Windows, because the Windows operating system natively works with UTF-16.
dstring: This eats a lot of memory, about four times as much as string for English text, but sidesteps some of the issues discussed here. The reason is that each array index corresponds to one Unicode code point.
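The trade-off between the three string types can be seen by converting the same text and comparing lengths and storage. This is a sketch under the assumption that std.conv.to is used for the conversion; byteSize is a hypothetical helper:

```d
import std.conv : to;
import std.stdio : writefln, writeln;

/// Bytes of storage used by the characters of a string of any width
/// (hypothetical helper).
size_t byteSize(C)(const(C)[] text)
{
    return text.length * C.sizeof;
}

void main()
{
    string  utf8  = "月明かり";
    wstring utf16 = utf8.to!wstring;
    dstring utf32 = utf8.to!dstring;

    writefln("string:  length %d, %d bytes", utf8.length, byteSize(utf8));
    writefln("wstring: length %d, %d bytes", utf16.length, byteSize(utf16));
    writefln("dstring: length %d, %d bytes", utf32.length, byteSize(utf32));

    // In a dstring, index i is code point i, so a plain slice is safe.
    writeln(utf32[0 .. 2]);

    // For English text, a dstring costs four bytes per character.
    writefln("\"abc\" as dstring: %d bytes", byteSize("abc"d));
}
```

Keep in mind that wstring is still a variable-length encoding: code points outside the Basic Multilingual Plane take two UTF-16 code units, so only dstring gives one index per code point.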