Let’s abuse a bug in java.lang.String to make some weird Strings.
We’ll make “hello world” not start with “hello”, and show that not all empty Strings are equal to each other.
Intro: Equality among Strings
Before we get started, let’s look at what it takes for two Strings to be equal in the JDK.
Why is "foo".equals("fox") false?
Because Strings are compared character by character, and the third character of those Strings is different.
Why is "foo".equals("foo") true?
You might think that in this case the Strings are also compared character by character.
But String literals are interned.
When the same String appears as a constant multiple times in the source code, it’s not just another String with the same content.
It is the same instance of String.
And the first thing that String.equals does is if (this == anObject) { return true; }, so it doesn’t even look at the contents.
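A small sketch can make the difference between "same instance" and "same content" concrete. The class name InternDemo and the helper sameInstance are made up for illustration; the behaviour shown relies only on literal interning and String.intern().

```java
// Illustration only: InternDemo and sameInstance are hypothetical names.
class InternDemo {
    // Reference comparison, the same check String.equals performs first.
    static boolean sameInstance(String a, String b) {
        return a == b;
    }

    public static void main(String[] args) {
        String literal1 = "foo";
        String literal2 = "foo";
        // new String(...) always creates a fresh instance with the same content.
        String copy = new String("foo");

        System.out.println(sameInstance(literal1, literal2));      // true: both literals are interned
        System.out.println(sameInstance(literal1, copy));          // false: different instances
        System.out.println(literal1.equals(copy));                 // true: contents are compared
        System.out.println(sameInstance(literal1, copy.intern())); // true: intern() returns the shared instance
    }
}
```

Only the first comparison is answered by the identity shortcut; the equals call on the copy has to fall through to the content comparison.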
Why is "foo!".equals("foo⁉") false?
Since JDK 9 (since JEP 254: Compact Strings), a String represents its content internally as a byte array.
“foo!” only contains simple characters, with a codepoint less than 256. The String class internally encodes such values using the latin-1 encoding, with one byte per character.
“foo⁉” contains a character (⁉) that cannot be represented using latin-1, so it encodes the whole String using UTF-16, with two bytes per character.
The String.coder field keeps track of which of the two encodings was used.
When comparing two Strings with different coders, String.equals always returns false.
It does not even look at the contents, because if one String can be represented in latin-1, and the other one can not, then they can’t be the same. Or can they?
Note: The compact strings feature can be disabled, but it’s enabled by default. This blog post assumes it is enabled.
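The order of the checks described in this section can be modelled with a toy class. This is a rough sketch, not the JDK source: MiniString, its fields and the choice of little-endian UTF-16 are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical model of the equality checks described above; not java.lang.String.
final class MiniString {
    static final byte LATIN1 = 0;
    static final byte UTF16 = 1;

    final byte coder;   // which encoding the bytes use
    final byte[] value; // the encoded content

    MiniString(byte coder, byte[] value) {
        this.coder = coder;
        this.value = value;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;                          // same instance: contents are never inspected
        }
        if (!(other instanceof MiniString)) {
            return false;
        }
        MiniString that = (MiniString) other;
        if (coder != that.coder) {
            return false;                         // different encodings: contents are never inspected
        }
        return Arrays.equals(value, that.value);  // same encoding: compare the bytes
    }

    @Override
    public int hashCode() {
        return 31 * coder + Arrays.hashCode(value);
    }

    public static void main(String[] args) {
        MiniString plain = new MiniString(LATIN1, "foo!".getBytes(StandardCharsets.ISO_8859_1));
        MiniString wide = new MiniString(UTF16, "foo\u2049".getBytes(StandardCharsets.UTF_16LE));
        MiniString plainAgain = new MiniString(LATIN1, "foo!".getBytes(StandardCharsets.ISO_8859_1));

        System.out.println(plain.equals(wide));       // false: the coders differ
        System.out.println(plain.equals(plainAgain)); // true: same coder, same bytes
    }
}
```

The model takes the coder as a constructor argument, so nothing in it guarantees that the coder matches the content. The real class depends on its constructors to keep that promise.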
Creating a broken String
How are Strings created? How exactly does java.lang.String choose to use latin-1 or not?
Strings can be created in multiple ways; we’ll focus here on the String constructor that takes a char[].
It first tries to encode the characters as latin-1 using StringUTF16.compress. If that fails, compress returns null and the constructor falls back to using UTF-16.
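To make the "try latin-1, otherwise give up" step concrete, here is a minimal stand-in for the compression step on its own. CompressSketch and its compress method are hypothetical; the sketch assumes only the contract described above (latin-1 bytes on success, null on failure) and is not the JDK's code.

```java
// Hypothetical stand-in for the compression step; not the JDK implementation.
final class CompressSketch {
    // Returns one byte per char if every char fits in latin-1, otherwise null.
    static byte[] compress(char[] chars) {
        byte[] out = new byte[chars.length];
        for (int i = 0; i < chars.length; i++) {
            char c = chars[i];
            if (c > 0xFF) {
                return null; // not representable in latin-1: the caller must use UTF-16
            }
            out[i] = (byte) c;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(compress("foo!".toCharArray()).length);      // 4: one byte per character
        System.out.println(compress("foo\u2049".toCharArray()) == null); // true: U+2049 does not fit
    }
}
```

This models the compression step in isolation; what the constructor does with the result is a separate matter.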
Here is a simplified version of how that is implemented.
(For readability I removed irrelevant indirections, checks and arguments from the actual implementation here and here.)
/**
 * Allocates a new {@code String} so that it represents the sequence of
 * characters currently contained in the character array argument. The
 * contents of the character array are copied; subsequent modification of
 * the character array does not affect the newly created string.
 */
public String(char value[]) {
    byte[] val = StringUTF16.compress(value);
    if (val != null) {
        this.value = val;
        this.coder = LATIN1;
        return;
    }
    this.coder = UTF16;
    this.value = StringUTF16.toBytes(value);
}
There is the bug. This code does not always preserve the assumptions made by String.equals that we talked about above.
Do you see it?
The javadoc points out that