D Cookbook

Discover the advantages of programming in D with over 100 incredibly effective recipes.

Product type: Paperback
Published in: May 2014
ISBN-13: 9781783287215
Length: 362 pages
Author: Adam Ruppe

Slicing a string to get a substring


D's strings are just arrays of characters. This means that any operation you can do on arrays also works on strings. However, since string is an array of UTF-8 code units, some behaviors may surprise you. Here, you'll get a substring by slicing, and we'll discuss the potential pitfalls.
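As a minimal sketch of the "strings are arrays" point, the following program applies ordinary array operations to a string. The variable name and the sample text are chosen for illustration only; the text is plain ASCII, so each character is one code unit.

```d
import std.stdio : writeln;

void main()
{
    string s = "Hello, world";

    // string is an alias for immutable(char)[], so array operations apply.
    static assert(is(string == immutable(char)[]));

    writeln(s.length);      // 12 (the number of UTF-8 code units)
    writeln(s[0 .. 5]);     // Hello
    writeln(s[$ - 5 .. $]); // world ($ is the array length)
    writeln(s ~ "!");       // Hello, world! (array concatenation)
}
```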

How to do it…

Let's try to get a substring from a string using the following steps:

  1. Declare a string as follows:

    string s = "月明かり is some Japanese text.";
  2. Get the correct indexes for the start and end of the substring. You'll get the Japanese text out by searching the string for the first space and slicing up to that point, using the following code:

    import std.string;
    // indexOf returns the index of the first space, counted in code units
    string japaneseText = s[0 .. s.indexOf(" ")];
  3. Loop over the string twice, looking first at the UTF-8 code units and then at the Unicode code points, so you can see the difference between the two views of your string:

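    import std.stdio;
    // Asking for char yields the raw UTF-8 code units, one per byte
    foreach(idx, char c; japaneseText)
       writefln("UTF-8 code unit at index %d is %d", idx, c);
    // Asking for dchar decodes the text and yields one code point per iteration
    foreach(dchar c; japaneseText)
       writefln("UTF-32 code unit with value %d is %c", c, c);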

The program prints 12 UTF-8 code units but only four dchars, because each character of the Japanese text occupies three bytes in UTF-8, unlike English text, where each character fits in a single byte.

How it works…

D's implementation of strings uses Unicode. Unicode is a complicated standard that could take up a whole book on its own, but you can use it in D knowing just some basics. D's string type, as well as D source code, uses the UTF-8 encoding. This means you can paste text from any language into a D source file and process it with D code.

However, UTF-8 has a complication: the number of code units (bytes) used to encode a single code point is variable, from one to four. Often, one code point is one character, though Unicode's complexity means that a grapheme (what you might call a visible character) may consist of more than one code point. For English text, UTF-8 maps directly to ASCII, which means that one code unit is one character. Other languages, however, have too many characters to express in one byte each. Japanese is one example where all the native characters are multibyte in UTF-8.
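The three levels mentioned here (code units, code points, and graphemes) can be counted separately. The following is a small sketch rather than a definitive approach; codePointCount and graphemeCount are hypothetical helper names, built on walkLength from std.range and byGrapheme from std.uni.

```d
import std.range : walkLength;
import std.stdio : writefln;
import std.uni : byGrapheme;

// Hypothetical helper: counts code points by decoding the UTF-8 text.
size_t codePointCount(string s)
{
    return s.walkLength;
}

// Hypothetical helper: counts graphemes (visible characters).
size_t graphemeCount(string s)
{
    return s.byGrapheme.walkLength;
}

void main()
{
    // "e" followed by U+0301 COMBINING ACUTE ACCENT displays as one character.
    string s = "e\u0301";

    writefln("code units:  %d", s.length);         // 3
    writefln("code points: %d", codePointCount(s)); // 2
    writefln("graphemes:   %d", graphemeCount(s));  // 1
}
```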

So, while there are only four Japanese characters in your string, slicing s[0 .. 4] won't give you all four of them. D's slice operator works on code units, so you'll get the three bytes of the first character plus the first byte of the second. This partial result is not valid UTF-8 and may not be usable.
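To see why a slice at an arbitrary code unit index is risky, the sketch below checks a few slices with std.utf.validate, which throws a UTFException on malformed input. The isValidUtf8 wrapper is a hypothetical helper introduced only for this illustration.

```d
import std.stdio : writeln;
import std.utf : UTFException, validate;

// Hypothetical helper: reports whether text is well-formed UTF-8.
bool isValidUtf8(string text)
{
    try
    {
        validate(text);
        return true;
    }
    catch (UTFException e)
    {
        return false;
    }
}

void main()
{
    string s = "月明かり is some Japanese text.";

    // Each of the four Japanese characters takes three bytes in UTF-8.
    writeln(isValidUtf8(s[0 .. 3]));  // true: exactly the first character
    writeln(isValidUtf8(s[0 .. 4]));  // false: the second character is cut short
    writeln(isValidUtf8(s[0 .. 12])); // true: all four characters
}
```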

Instead, you found the correct index by using the standard library function indexOf. It searches the string for the given substring and returns the index of the first match, measured in code units, or -1 if the substring could not be found. The slice [start .. end] includes the element at start but excludes the element at end. So, [0 .. indexOf(…)] goes from the start of the string up to, but not including, the space. This slice is safe to use even if it contains multibyte characters, because the index of a match always falls on a character boundary.
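Since indexOf returns -1 when nothing is found, slicing with an unchecked result fails at runtime when the separator is missing. A minimal sketch of a guarded version follows; firstField is a hypothetical helper name, and returning the whole string on a miss is just one possible policy.

```d
import std.stdio : writeln;
import std.string : indexOf;

// Hypothetical helper: returns the part of text before the first
// occurrence of separator, or all of text if the separator is absent.
string firstField(string text, string separator)
{
    auto end = text.indexOf(separator);
    if (end == -1)
        return text;
    return text[0 .. end];
}

void main()
{
    writeln(firstField("月明かり is some Japanese text.", " ")); // 月明かり
    writeln(firstField("nospace", " "));                        // nospace
}
```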

Finally, you looped over the Japanese text to examine the encodings. The foreach loop understands UTF encoding. The first variant asks for chars, which are UTF-8 code units, and yields them without decoding. The second variant asks for dchars, which are UTF-32 code units that are numerically equivalent to Unicode code points. Asking for dchars is slower than iterating over chars, because each code point must be decoded, but it removes much of the complexity of handling multibyte characters. The second loop prints only one entry per code point, so each Japanese character, like any other character that cannot be encoded in a single UTF-8 code unit, appears once.
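The two loop variants can also be used to count, and the decoding variant accepts an index variable that reports where each code point starts in the underlying array. The sketch below assumes the same Japanese text as the recipe; countCodeUnits and countCodePoints are hypothetical helper names.

```d
import std.stdio : writefln;

// Hypothetical helper: iterates over raw UTF-8 code units, no decoding.
size_t countCodeUnits(string text)
{
    size_t n;
    foreach (char c; text)
        ++n;
    return n;
}

// Hypothetical helper: iterates over decoded code points.
size_t countCodePoints(string text)
{
    size_t n;
    foreach (dchar c; text)
        ++n;
    return n;
}

void main()
{
    string text = "月明かり";

    // Prints "12 code units, 4 code points"
    writefln("%d code units, %d code points",
             countCodeUnits(text), countCodePoints(text));

    // The index is the code unit offset of each code point: 0, 3, 6, 9.
    foreach (i, dchar c; text)
        writefln("%c starts at code unit %d", c, i);
}
```

The offsets reported by the indexed loop are valid slice boundaries, so they can be used to slice the string without splitting a character.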

There's more…

D also supports UTF-16 and UTF-32 strings. Their types are wstring and dstring, respectively:

  • wstring: This is very useful on Windows, because the Windows operating system natively works with UTF-16.

  • dstring: This uses a lot of memory, about four times as much as string for English text, but it sidesteps some of the issues discussed here, because each array index corresponds to exactly one Unicode code point.
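As a rough illustration of the trade-off, the following sketch converts the recipe's Japanese text with std.conv.to and compares lengths. All four characters lie in the Basic Multilingual Plane, so each needs only one UTF-16 code unit; characters outside that plane would take two.

```d
import std.conv : to;
import std.stdio : writeln;

void main()
{
    string  s = "月明かり";
    wstring w = s.to!wstring; // UTF-16
    dstring d = s.to!dstring; // UTF-32

    writeln(s.length); // 12 code units of one byte each
    writeln(w.length); // 4 code units of two bytes each
    writeln(d.length); // 4 code units of four bytes each

    // With dstring, indexes count code points, so this slice is safe.
    writeln(d[0 .. 2]); // 月明

    // For English text, dstring takes four times the memory of string.
    writeln("text".length * char.sizeof);   // 4
    writeln("text"d.length * dchar.sizeof); // 16
}
```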
