Why Can’t You Reverse a String with a Flag Emoji? by da12

What do you think is the output of the following Python code?

>>> flag = "🇺🇸"
>>> reversed_flag = flag[::-1]
>>> print(reversed_flag)

Questions like this make me want to immediately open a Python REPL and try the code out because I think I know what the answer is, but I’m not very confident in that answer.

Here’s my line of thinking when I first saw this question:

The flag string contains a single character.
The [::-1] slice reverses the flag string.
The reversal of a string with a single character is the same as the original string.
Therefore, reversed_flag must be "🇺🇸".

That’s a perfectly valid argument. But is the conclusion true? Take a look:

>>> flag = "🇺🇸"
>>> reversed_flag = flag[::-1]
>>> print(reversed_flag)
🇸🇺

What in the world is going on here?

Does `"🇺🇸"` Really Contain a Single Character?

When the conclusion of a valid argument is false, one of its premises must be false, too. Let’s start from the top:

The flag string contains a single character.

Is that so? How can you tell how many characters a string has?

In Python, you can use the built-in len() function to get the total number of characters in a string:

>>> len("🇺🇸")
2

Oh.

That’s weird. You can only see a single thing in the string "🇺🇸" (namely, the US flag), but a length of 2 jibes with the result of flag[::-1]. Since the reverse of "🇺🇸" is "🇸🇺", this seems to imply that "🇺🇸" is somehow "🇺" followed by "🇸".

How Can You Tell What Characters Are In a String?

There are a few different ways that you can see all of the characters in a string using Python:

>>> # Convert a string to a list
>>> list("🇺🇸")
['🇺', '🇸']

>>> # Loop over each character and print
>>> for character in "🇺🇸":
...     print(character)
...
🇺
🇸

The US flag emoji isn’t the only flag emoji with two characters:

>>> list("🇿🇼")  # Zimbabwe
['🇿', '🇼']

>>> list("🇳🇴")  # Norway
['🇳', '🇴']

>>> list("🇨🇺")  # Cuba
['🇨', '🇺']

>>> # What do you notice?
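
If nothing jumps out, here’s one way to investigate. This is only a sketch: it leans on Python’s standard unicodedata module to look up the official Unicode name of each character, and the helper name describe() is made up for this illustration. The flag is written with escape sequences so that both characters are visible in the source.

```python
import unicodedata

def describe(text):
    """Return the Unicode name of each character in text."""
    return [unicodedata.name(character) for character in text]

# The US flag, spelled out as its two underlying characters
us_flag = "\U0001F1FA\U0001F1F8"

for name in describe(us_flag):
    print(name)
# REGIONAL INDICATOR SYMBOL LETTER U
# REGIONAL INDICATOR SYMBOL LETTER S
```

The two names end in U and S, which lines up with the letters you can see in the list() output above.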

And then there’s the Scottish flag:

>>> list("🏴󠁧󠁢󠁳󠁣󠁴󠁿")
['🏴', '\U000e0067', '\U000e0062', '\U000e0073', '\U000e0063',
 '\U000e0074', '\U000e007f']

OK, what is that all about?
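
Here’s a hint, offered as a sketch rather than the full story. The invisible characters in that list are Unicode "tag" characters, and the ones between U+E0020 and U+E007E mirror printable ASCII. Assuming that mapping, subtracting 0xE0000 from each one recovers the letter it stands for. The variable names below are my own.

```python
# Scotland's flag: a black flag, five tag characters, and a terminator
scotland = "\U0001F3F4\U000E0067\U000E0062\U000E0073\U000E0063\U000E0074\U000E007F"

# Keep only the tag characters that shadow printable ASCII;
# this skips the leading flag and the trailing U+E007F terminator.
tags = [c for c in scotland if 0xE0020 <= ord(c) <= 0xE007E]

# Shift each tag character down to the ASCII letter it mirrors
hidden = "".join(chr(ord(c) - 0xE0000) for c in tags)

print(len(scotland))  # 7
print(hidden)         # gbsct
```

So the one flag you see on screen is carrying seven characters, five of which spell out a short code.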

💪🏻 Challenge: Can you find any non-emoji strings that look like a single character but actually contain two or more characters?

The unnerving thing about these examples is that they imply that you can’t tell what characters are in a string just by looking at your screen.

Or, perhaps more deeply, it makes you question your understanding of the term character.

What Is a Character, Anyway?

The term character in computer science can be confusing. It tends to get conflated with the word symbol, which, to be fair, is a synonym for the word character as it’s used in English vernacular.

In fact, when I googled character computer science, the very first result I got was a link to a Techopedia article that defines a character as:

“[A] display unit of information equivalent to one alphabetic letter or symbol.”

— Techopedia, “Character (Char)”

That definition seems off, especially in light of the US flag example, which indicates that a single symbol may be composed of two or more characters.

The second Google result I got was Wikipedia. In that article, the definition of a character is a bit more liberal:

“[A] character is a unit of information that roughly corresponds to a grapheme, grapheme-like unit, or symbol, such as in an alphabet or syllabary in the written form of a natural language.”

— Wikipedia, “Character (computing)”

Hmm… using the word “roughly” in a definition makes the definition feel, shall I say, non-definitive.

But the Wikipedia article goes on to explain that the term character has been used historically to “denote a specific number of contiguous bits.”

Then, a significant clue to the question about how a string with one symbol can contain two or more characters:

“A character is most commonly assumed to refer to 8 bits (one byte) today… All [symbols] can be represented with one or more 8-bit code units with UTF-8.”

— Wikipedia, “Character (computing)”

OK! Maybe things are starting to make a little bit more sense. A character represents a unit of text and is often stored as one byte of information. The symbols that we see in a string can be made up of multiple 8-bit (1 byte) UTF-8 code units.

Characters are not the same as symbols. It seems reasonable now that one symbol could be made up of multiple characters, just like flag emojis.
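
You can poke at this from Python, too. The sketch below assumes that str.encode("utf-8") is a fair stand-in for "the UTF-8 code units" the Wikipedia quote mentions, and compares the number of bytes it produces with what len() reports. The sample strings and labels are my own picks.

```python
samples = {
    "A": "A",
    "e with acute accent": "\u00e9",
    "US flag": "\U0001F1FA\U0001F1F8",
}

for label, text in samples.items():
    encoded = text.encode("utf-8")  # bytes object, one item per 8-bit unit
    print(f"{label}: len() is {len(text)}, UTF-8 bytes: {len(encoded)}")
# A: len() is 1, UTF-8 bytes: 1
# e with acute accent: len() is 1, UTF-8 bytes: 2
# US flag: len() is 2, UTF-8 bytes: 8
```

Interesting: the flag takes eight bytes, but len() says 2. Whatever Python is counting as a character, it doesn’t seem to be bytes.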

But what is a UTF-8 code unit?

A little further down the Wikipedia article on characters, there’s a section called Encoding that explains:

“Computers and communication equipment represent characters using a character encoding that assigns each character to something – an integer quantity represented by a sequence of digits, typically – that can be stored or transmitted through a network. Two examples of usual encodings are ASCII and the UTF-8 encoding for Unicode.”

— Wikipedia, “Character (computing)”

There’s another mention of UTF-8! But now I need to know what a character encoding is.

What Exactly Is a Character Encoding?

According to Wikipedia, a character encoding assigns each character to a number. What does that mean?

It means that you can pair each character with a number. For example, you could pair each uppercase letter in the English alphabet with an integer from 0 through 25.

You can represent this pairing using tuples in Python:

>>> pairs = [(0, "A"), (1, "B"), (2, "C"), ..., (25, "Z")]
>>> # I'm omitting several pairs here -----^^^

Stop for a moment and ask yourself: “Can I create a list of tuples like the one above without explicitly writing out each pair?”

One way is to use Python’s enumerate() function. enumerate() takes an argument called iterable and returns an iterator of tuples. Each tuple contains a count, which starts at 0 by default, and a value obtained from iterating over iterable.

Here’s a look at enumerate() in action:

>>> letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
>>> enumerated_letters = list(enumerate(letters))
>>> enumerated_letters
[(0, 'A'), (1, 'B'), (2, 'C'), (3, 'D'), (4, 'E'), (5, 'F'), (6, 'G'),
(7, 'H'), (8, 'I'), (9, 'J'), (10, 'K'), (11, 'L'), (12, 'M'), (13, 'N'),
(14, 'O'), (15, 'P'), (16, 'Q'), (17, 'R'), (18, 'S'), (19, 'T'), (20, 'U'),
(21, 'V'), (22, 'W'), (23, 'X'), (24, 'Y'), (25, 'Z')]

There’s an easier way to make all of the letters, too.

Python’s string module has a variable called ascii_uppercase that points to a string containing all of the uppercase letters in the English alphabet:

>>> import string
>>> string.ascii_uppercase
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

>>> enumerated_letters = list(enumerate(string.ascii_uppercase))
>>> enumerated_letters
[(0, 'A'), (1, 'B'), (2, 'C'), (3, 'D'), (4, 'E'), (5, 'F'), (6, 'G'),
 (7, 'H'), (8, 'I'), (9, 'J'), (10, 'K'), (11, 'L'), (12, 'M'), (13, 'N'),
 (14, 'O'), (15, 'P'), (16, 'Q'), (17, 'R'), (18, 'S'), (19, 'T'),
 (20, 'U'), (21, 'V'), (22, 'W'), (23, 'X'), (24, 'Y'), (25, 'Z')]

OK, so we’ve associated characters with integers. That means we’ve got a character encoding!

But, how do you use it?

To encode the string "PYTHON" as a sequence of integers, you need a way to look up the integer associated with each character. But looking things up in a list of tuples is hard. It’s also really inefficient. (Why?)
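
If you want to test your answer to that "Why?", here’s a rough sketch of what a lookup in a list of tuples has to do. The function name find_number() is hypothetical, and it returns the number of pairs it had to inspect alongside the result, purely so you can see the cost.

```python
import string

enumerated_letters = list(enumerate(string.ascii_uppercase))

def find_number(letter, pairs):
    """Scan the pairs one at a time until the letter turns up.

    Returns (number, steps), where steps is how many pairs were checked.
    """
    for steps, (number, character) in enumerate(pairs, start=1):
        if character == letter:
            return number, steps
    raise ValueError(f"{letter!r} is not in the encoding")

print(find_number("A", enumerated_letters))  # (0, 1)
print(find_number("Z", enumerated_letters))  # (25, 26)
```

Finding "Z" means walking past every other pair first, and that work is repeated for each character you encode.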

Dictionaries are good for looking things up. If we convert enumerated_letters to a dictionary, we can quickly look up the letter associated with an integer:

>>> int_to_char = dict(enumerated_letters)

>>> # Get the character paired with 1
>>> int_to_char[1]
'B'

>>> # Get the character paired with 15
>>> int_to_char[15]
'P'

However, to encode the string "PYTHON" you need to be able to look up the integer associated with a character. You need the reverse of int_to_char.

How do you swap keys and values in a Python dictionary?

One way is to use the reversed() function to reverse key-value pairs from the int_to_char dictionary:

>>> # int_to_char.items() is a "list" of key-value pairs
>>> int_to_char.items()
dict_items([(0, 'A'), (1, 'B'), (2, 'C'), (3, 'D'), (4, 'E'), (5, 'F'),
(6, 'G'), (7, 'H'), (8, 'I'), (9, 'J'), (10, 'K'), (11, 'L'), (12, 'M'),
(13, 'N'), (14, 'O'), (15, 'P'), (16, 'Q'), (17, 'R'), (18, 'S'),
(19, 'T'), (20, 'U'), (21, 'V'), (22, 'W'), (23, 'X'), (24, 'Y'),
(25, 'Z')])

>>> # The reversed() function can reverse a tuple
>>> pair = (0, "A")
>>> tuple(reversed(pair))
('A', 0)
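
Putting those pieces together, here’s a minimal sketch of where this seems to be heading: reverse every pair to build the opposite lookup, then use it to encode "PYTHON". The names char_to_int and encode() are my own and aren’t defined anywhere above.

```python
import string

int_to_char = dict(enumerate(string.ascii_uppercase))

# Reverse each (integer, character) pair to build the opposite lookup
char_to_int = dict(tuple(reversed(pair)) for pair in int_to_char.items())

def encode(text):
    """Replace each character with its paired integer."""
    return [char_to_int[character] for character in text]

print(char_to_int["P"])  # 15
print(encode("PYTHON"))  # [15, 24, 19, 7, 14, 13]
```

Note that this toy encoding only knows about the 26 uppercase letters, so encode() raises a KeyError for anything else.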
