As I am editing the 13th edition of Core Java, I realize that I need to tone down the coverage of Unicode code points. I used to recommend that readers avoid char and use code points instead, but I came to realize that with modern Unicode, code points are problematic too. Just use String.
The Best of Times, The Worst of Times
Transport yourself back in time to October 1991. Unicode 1.0.0 saw the light of day, and a bright day it was, with 7,129 characters and ample space in its 2-byte reservoir for the characters of all languages, past, present, or future. No more incompatible code pages for European languages, no more multi-byte encodings of Asian languages!
Unicode was an instant success. Operating systems and programming languages eagerly embraced it. In 1996, the Java 1.0 language specification confidently stated:
The integral types are byte, short, int, and long, whose values are 8-bit, 16-bit, 32-bit and 64-bit signed two’s-complement integers, respectively, and char, whose values are 16-bit unsigned integers representing Unicode characters.
This happy state of affairs lasted almost ten years, until March 2001. Only five years for Java, though. When Unicode 3.1 was released, it broke through the 16-bit barrier and clocked in at 94,140 code points, due to the addition of many Chinese/Japanese/Korean ideographs. You would have thought they could have counted them ten years earlier…
👋 to fixed-width encoding. Unicode characters, now in the “basic multilingual plane” and 16 “astral planes”, can be encoded with UTF-8, a multi-byte encoding that requires one byte for classic ASCII, four bytes for astral characters (such as the waving hand sign), and two or three bytes for those in between. Over 98% of web pages are encoded with UTF-8. JEP 400 makes UTF-8 the default charset for Java 18 and above.
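You can see the variable widths for yourself. A minimal sketch, assuming java.nio.charset.StandardCharsets is imported, that counts the UTF-8 bytes of a few sample characters with the standard getBytes method:

"A".getBytes(StandardCharsets.UTF_8).length  // 1 byte for classic ASCII
"é".getBytes(StandardCharsets.UTF_8).length  // 2 bytes
"中".getBytes(StandardCharsets.UTF_8).length  // 3 bytes
"👋".getBytes(StandardCharsets.UTF_8).length  // 4 bytes for an astral character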
That’s cold comfort for the Java VM (and Windows, which also embraced 16-bit characters). Having bought into the 16-bit char world, they have to resort to the icky UTF-16 encoding, which uses one 16-bit value for the characters in the basic multilingual plane and two 16-bit values for the astral characters, taking advantage of a small window of “surrogate” characters that could be pressed into service for those astral planes.
var wavingHandSign = "👋";
wavingHandSign.length()        // 2
wavingHandSign.substring(0, 1) // A malformed string with one surrogate character
The result: the worst of all worlds. An encoding that is both bulky and variable-width.
As an aside, this is unrelated to the internal storage of strings in the JVM. JEP 254 specifies compact strings, where the JVM stores strings with no code point ≥ 256 in a byte array and all others as a char array. In principle, the JVM is free to choose any internal implementation, such as UTF-8. But for backward compatibility, any method that takes a UTF-16 based index (such as charAt or substring) needs to find the index in constant time. That limits the choices.
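To see the UTF-16-based indexing that those methods must preserve, here is a small snippet in the same style as above, using only standard String methods:

var s = "a👋b";
s.length()                      // 4: 'a', two surrogate chars, 'b'
s.codePointCount(0, s.length()) // 3 code points
s.charAt(1)                     // '\uD83D', the high surrogate of 👋

With an internal UTF-8 representation, charAt would have to scan from the start of the string (or maintain an extra index structure) to locate a UTF-16 index.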
And as an aside to the aside: Check out this race condition that Wouter Coekaerts found in the String(char[]) constructor as a result of this optimization.
The Java API for Code Points
In the Core Java book, starting with Java 5, I told readers not to use char but to use code points instead. Some reviewers were unhappy and felt that one shouldn’t make such a big deal of it.
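For readers who haven’t used it, here is a minimal sketch of the code point API, using only the standard codePoints, offsetByCodePoints, and codePointAt methods of String:

String greeting = "Hello, 👋";
greeting.length()                          // 9 char values
greeting.codePoints().count()              // 8 code points
int i = greeting.offsetByCodePoints(0, 7); // UTF-16 index of the 8th code point
greeting.codePointAt(i)                    // 0x1F44B, the full waving hand sign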