
What Is Entropy? by jfantl
People say many things about entropy: entropy increases with time, entropy is disorder, entropy increases with energy, entropy determines the arrow of time, etc. But for a long time I had no idea what entropy was, and from what I can tell, neither do most other people. This is the introduction I wish I had when first told about entropy, so hopefully you find it helpful. My goal is that by the end of this long post we will have a rigorous and intuitive understanding of those statements, and in particular, why the universe looks different when moving forward through time versus when traveling backward through time.
This journey begins with defining and understanding entropy. There are multiple formal definitions of entropy across disciplines—thermodynamics, statistical mechanics, information theory—but they all share a central idea: entropy quantifies uncertainty. The easiest introduction to entropy is through Information Theory, which will lead to entropy in physical systems, and then finally to the relationship between entropy and time.
Information Theory
Imagine you want to communicate to your friend the outcome of some random events, like the outcome of a dice roll or the winner of a lottery, but you want to do it using as few bits (only 1s and 0s) as possible. How few bits could you use?
The creator of Information Theory, Claude Shannon, was trying to answer questions like these during his time at Bell Labs. He was developing the mathematical foundations of communication and compression, and he eventually discovered that the minimum number of bits required for a message is directly related to the uncertainty of the message. He was then able to formulate an equation to quantify the uncertainty of a message. When he shared it with the mathematician John von Neumann, von Neumann suggested calling it entropy for two reasons:
Von Neumann, Shannon reports, suggested that there were two good reasons for calling the function “entropy”. “It is already in use under that name,” he is reported to have said, “and besides, it will give you a great edge in debates because nobody really knows what entropy is anyway.” Shannon called the function “entropy” and used it as a measure of “uncertainty,” interchanging the two words in his writings without discrimination.
— Harold A. Johnson (ed.), Heat Transfer, Thermodynamics and Education: Boelter Anniversary Volume (New York: McGraw-Hill, 1964), p. 354.
Later we will see that the relationship between Shannon’s entropy and the pre-existing definition of entropy was more than coincidental: the two are deeply intertwined.
But now let us see how Shannon found definitions for these usually vague terms of “information” and “uncertainty”.
In Information Theory, the information of an observed state is formally defined as the number of bits needed to communicate that state (at least for a system with equally likely outcomes whose count is a power of two; we’ll see shortly how to generalize this). Here are some examples of information:
- If I flip a fair coin, it will take one bit of information to tell you the outcome: I use a `0` for heads and a `1` for tails.
- If I roll a fair 8-sided dice, I can represent the outcome with 3 bits: I use `000` for a 1, `001` for a 2, `010` for a 3, etc.
The more outcomes a system can have, the more bits (information) it will require to represent its outcome. If a system has $N$ equally likely outcomes, then it will take $\log_2(N)$ bits of information to represent an outcome of that system.
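As a quick check, here is a minimal Python sketch of that relationship (the 6-sided dice is my own addition, to show that the number of bits need not be a whole number):

```python
import math

# Bits of information needed for N equally likely outcomes: log2(N)
print(math.log2(2))   # fair coin         -> 1.0 bit
print(math.log2(8))   # fair 8-sided dice -> 3.0 bits
print(math.log2(6))   # fair 6-sided dice -> ~2.585 bits (not a whole number)
```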
Entropy is defined as the expected number of bits of information needed to represent the state of a system (this is a lie, but it’s the most useful definition for the moment; we’ll fix it later). So the entropy of a coin is 1 bit, since on average we expect it to take 1 bit of information to represent the outcome of the coin. An 8-sided dice has an entropy of 3 bits, since we expect it to take an average of 3 bits to represent the outcome.
It initially seems that entropy is an unnecessary definition, since we can just look at how many bits it takes to represent the outcome of our system and use that value, but this only works when the outcomes are all equally likely.
Imagine now that I have a weighted 8-sided dice, so the number 7 comes up $50$% of the time while each of the other faces comes up about $7.14$% of the time. Now, if we are clever, we can reduce the expected number of bits needed to communicate the outcome of the dice. We can decide to represent a 7 with a `0`, and all the other numbers with `1XXX`, where the `X`s are some unique bits for each face. This means that $50$% of the time we only have to use 1 bit of information to represent the outcome, and the other $50$% of the time we use 4 bits, so the expected number of bits (the entropy of the dice) is 2.5. This is lower than the 3 bits of entropy for the fair 8-sided dice.
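As a quick check of that arithmetic, here is a small Python sketch (the probabilities and code lengths are exactly the ones described above; the variable names are mine):

```python
# Expected code length for the weighted 8-sided dice described above:
# "0" encodes a 7 (1 bit), "1XXX" encodes each of the other faces (4 bits).
p_seven = 0.5
p_other = 0.5 / 7            # ~7.14% for each of the remaining 7 faces

expected_bits = p_seven * 1 + 7 * p_other * 4
print(expected_bits)         # 2.5 bits on average, vs. 3 bits for the fair dice
```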
Fortunately, we don’t need to come up with a clever encoding scheme for every possible system, there exists a pattern to how many bits of information it takes to represent a state with probability $p$. We know if $p=0.5$ such as in the case of a coin landing on heads, then it takes 1 bit of information to represent that outcome. If $p=0.125$ such as in the case of a fair 8-sided dice landing on the number 5, it takes 3 bits of information to represent that outcome. If $p=0.5$ such as in the case of our unfair 8-sided dice landing on the number 7, then it takes 1 bit of information, just like the coin, which shows us that all that matters is the probability of the outcome. With this, we can discover an equation for the number of bits of information needed for a state with probability $p$.
\[ I(p) = -\log_2(p) \]
This value $I$ is usually called information content or surprise, since the lower the probability of a state occurring, the higher the surprise when it does occur.
When the probability is low, the surprise is high, and when the probability is high, the surprise is low. This is a more general formula than “the number of bits needed”, since it allows exceptionally likely states (such as one that is $99$% likely) to have a surprise of less than 1, which would make less sense if we tried to interpret the value as “the number of bits needed to represent the outcome”.
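Written as a couple of lines of Python (a sketch; the function name `surprise` is my own choice):

```python
import math

def surprise(p):
    """Information content, in bits, of an outcome that occurs with probability p."""
    return -math.log2(p)

print(surprise(0.5))    # 1.0 bit   (fair coin landing heads)
print(surprise(0.125))  # 3.0 bits  (fair 8-sided dice landing on a 5)
print(surprise(0.99))   # ~0.0145 bits (a near-certain outcome is barely surprising)
```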
And now we can fix our definition of entropy (the lie I told earlier). Entropy is not necessarily the expected number of bits used to represent a system (although it is when you use an optimal encoding scheme), but more generally the entropy is the expected surprise of the system.
And now we can calculate the entropy of systems like a dice or a coin or any system with known probabilities for its outcomes. The expected surprise (entropy) of a system with $N$ possible outcomes each with probability $p_i$ (all adding up to 1) can be calculated as
\begin{align}
\sum_{i=1}^{N} p_i \cdot I(p_i) = -\sum_{i=1}^{N} p_i \cdot \log_2(p_i) \label{shannon_entropy}\tag{Shannon entropy}
\end{align}
And notice that if all the $N$ probabilities are the same (so $p_i = \frac{1}{N}$), then the entropy simplifies to
\[ -\sum_{i=1}^{N} p_i \cdot \log_2(p_i) = -\sum_{i=1}^{N} \frac{1}{N} \cdot \log_2\left(\frac{1}{N}\right) = \log_2(N) \]
Here are some basic examples using $\eqref{shannon_entropy}$; a short code check of these numbers follows the list.
- The entropy of a fair coin is
\[ -\left(0.5 \cdot \log_2(0.5) + 0.5 \cdot \log_2(0.5)\right) = \log_2(2) = 1 \]
- The entropy of a fair 8-sided dice is
\[ -\sum_{i=1}^{8} 0.125 \cdot \log_2(0.125) = \log_2(8) = 3 \]
- The entropy of an unfair 8-sided dice, where the dice lands on one face $99$% of the time and on each of the other seven faces with equal probability for the remaining $1$% of the time (about $0.14$% each), is
\[ -\left(0.99 \cdot \log_2(0.99) + \sum_{i=1}^{7} \frac{0.01}{7} \cdot \log_2\left(\frac{0.01}{7}\right)\right) \approx 0.109 \]
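For reference, here is a minimal Python implementation of $\eqref{shannon_entropy}$ that reproduces the three numbers above (the function name and the `p > 0` guard are my own choices; the guard reflects the usual convention that $0 \cdot \log_2(0) = 0$):

```python
import math

def shannon_entropy(probs):
    """Expected surprise, in bits, of a distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))              # fair coin         -> 1.0
print(shannon_entropy([1/8] * 8))               # fair 8-sided dice -> 3.0
print(shannon_entropy([0.99] + [0.01/7] * 7))   # unfair dice       -> ~0.109
```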
Hopefully it is a bit more intuitive now that entropy represents uncertainty. An 8-sided dice would have higher entropy than a coin since we are more uncertain about the outcome of the 8-sided dice than we are about the coin (8 equally likely outcomes are more uncertain than only 2 equally likely outcomes). But a highly unfair 8-sided dice has less entropy than even a coin since we have very high certainty about the outcome of the unfair dice. Now we have an actual equation to quantify that uncertainty (entropy) about a system.
It is not clear right now how this definition of entropy has anything to do with disorder, heat, or time, but this idea of entropy as uncertainty is fundamental to understanding the entropy of the universe which we will explore shortly. For reference, this definition of entropy is called Shannon entropy.
We will move on now, but I recommend looking further into Information Theory. It has many important direct implications for data compression, error correction, cryptography, and even linguistics, and touches nearly any field that deals with uncertainty, signals, or knowledge.
Physical Entropy
Now we will see entropy from a very different lens, that of Statistical Mechanics. We begin with the tried-and-true introduction to entropy which every student is given.
Balls in a box
I shall give you a box with 10 balls in it, $p_0$ through $p_9$, and we will count how many balls are on the left side of the box and how many are on the right. Assume every ball is equally likely to be on either side. Immediately we can see that it is highly unlikely that we count all the balls on the left side of the box, and far more likely that we count an equal number of balls on each side. Why is that?
Well, there is only one state in which we count all the balls on the left, and that is if every ball is on the left (truly astounding, but stay with me). But there are many ways in which the box is balanced: we could have $p_0$ through $p_4$ on one side and the rest on the other, or the same groups but flipped from left to right, or we could have all the even balls on one side and the odd on the other, or again flipped, or any of the other many possible combinations.
This box is a system whose entropy we can measure, at least once I tell you how many balls are counted on each side. It can take a moment to see, but think of the box together with our left and right counts as a system whose outcome is finding out where each individual ball actually is, similar to rolling a dice and seeing which face it lands on.
This would mean that the box where we count all the balls on the left side only has one possible outcome: all the balls are on the left side. We would take this to mean that this system has $0$ entropy (no expected surprise) since we already know where we will find each individual ball.
The box with balanced sides (5 on each) has many possible, equally likely outcomes, and in fact we can count them. A famous equation in combinatorics is the N-choose-k equation, which calculates exactly this scenario. It tells us that there are $\binom{10}{5} = 252$ possible ways in which we can place 5 balls on each side. The entropy of this system is then $-\sum_{i=1}^{252} \frac{1}{252} \cdot \log_2\left(\frac{1}{252}\right) = \log_2(252) \approx 7.98$. This is the same as calculating the entropy of a 252-sided dice.
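Here is a small Python sketch of that counting argument (the function name is mine; `math.comb` computes the N-choose-k value mentioned above):

```python
import math

def split_entropy(n_balls, n_left):
    """Entropy, in bits, of a box of n_balls when all we measure is the count on
    the left side, assuming each ball is equally likely to be on either side."""
    arrangements = math.comb(n_balls, n_left)   # ways the balls can match this count
    return math.log2(arrangements)

print(split_entropy(10, 10))  # 0.0   -> all balls on the left: only one arrangement
print(split_entropy(10, 5))   # ~7.98 -> balanced box: 252 arrangements
```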
And if we were to increase the number of balls, the entropy of the balanced box would increase since there would then be even more possible combinations that could make up a balanced box.
We should interpret these results as: The larger the number of ways there are to satisfy the large-scale measurement (counting the number of balls on each side), the higher the entropy of the system. When all the balls are on the left, there is only one way to satisfy that measurement and so it has a low entropy. When there are many ways to balance it on both sides, it has high entropy.
Here we see 1000 balls bouncing around in a box. They all start on the left, so the box begins with 0 entropy, but once the balls start crossing to the right and changing the count on each side, the entropy increases.
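The animation itself is interactive, but as a rough stand-in, here is a toy simulation under a deliberately simplified dynamic (one randomly chosen ball switches sides each step, rather than balls physically bouncing around); it tracks the same entropy of the left/right count and shows it climbing from 0 toward its maximum:

```python
import math, random

N = 1000
on_left = [True] * N                 # all 1000 balls start on the left side

for step in range(20001):
    left = sum(on_left)              # the large-scale measurement: balls on the left
    if step % 5000 == 0:
        # entropy = log2(number of arrangements consistent with this count)
        entropy = math.log2(math.comb(N, left))
        print(f"step {step:6d}  left {left:4d}  entropy {entropy:7.1f} bits")
    i = random.randrange(N)          # simplified dynamic: one random ball switches sides
    on_left[i] = not on_left[i]
```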
15 Comments
IIAOPSW
It's the name for the information bits you don't have.
More elaborately, it's the number of bits needed to fully specify something which is known to be in some broad category of state, but whose exact details are unknown.
alganet
Nowadays, it seems to be a buzzword to confuse people.
We IT folk should find another word for disorder that increases over time, especially when that disorder has human factors (number of contributors, number of users, etc.). It clearly cannot be treated in the same way as in chemistry.
bargava
Here is a good overview of entropy [1]
[1] https://arxiv.org/abs/2409.09232
brummm
I love that the author clearly describes why saying entropy measures disorder is misleading.
glial
One thing that helped me was the realization that, at least as used in the context of information theory, entropy is a property of an individual (typically the person receiving a message) and NOT purely of the system or message itself.
> entropy quantifies uncertainty
This sums it up. Uncertainty is the property of a person and not a system/message. That uncertainty is a function of both a person's model of a system/message and their prior observations.
You and I may have different entropies about the content of the same message. If we're calculating the entropy of dice rolls (where the outcome is the 'message'), and I know the dice are loaded but you don't, my entropy will be lower than yours.
ponty_rick
As a software engineer, I learned what entropy was in computer science when I changed the way that a function was called which caused the system to run out of entropy in production and caused an outage. Heh.
DadBase
My old prof taught entropy with marbles in a jar and cream in coffee. “Entropy,” he said, “is surprise.” Then he microwaved the coffee until it burst. We understood: the universe favors forgetfulness.
NitroPython
Love the article, my mind is bending but in a good way lol
gozzoo
The visualisation is great, the topic is interesting and very well explained. Can somebody recommend some other blogs with a similar style of presentation?
nihakue
I'm not in any way qualified to have a take here, but I have one anyway:
My understanding is that entropy is a way of quantifying how many different ways a thing could 'actually be' and yet still 'appear to be' how it is. So it is largely a result of an observer's limited ability to perceive / interrogate the 'true' nature of the system in question.
So for example you could observe that a single coin flip is heads, and entropy will help you quantify how many different ways that could have come to pass. e.g. is it a fair coin, a weighted coin, a coin with two head faces, etc. All these possibilities increase the entropy of the system. An arrangement _not_ counted towards the system's entropy is the arrangement where the coin has no heads face, only ever comes up tails, etc.
Related, my intuition about the observation that entropy tends to increase is that it's purely a result of more likely things happening more often on average.
Would be delighted if anyone wanted to correct either of these intuitions.
karpathy
What I never fully understood is that there is some implicit assumption about the dynamics of the system. So what that there are more microstates of some macrostate as far as counting is concerned? We also have to make assumptions about the dynamics, and in particular about some property that encourages mixing.
TexanFeller
I don’t see Sean Carroll’s musings mentioned yet, so repeating my previous comment:
Entropy got a lot more exciting to me after hearing Sean Carroll talk about it. He has a foundational/philosophical bent and likes to point out that there are competing definitions of entropy set on different philosophical foundations, one of them seemingly observer dependent:
– https://youtu.be/x9COqqqsFtc?si=cQkfV5IpLC039Cl5
– https://youtu.be/XJ14ZO-e9NY?si=xi8idD5JmQbT5zxN
Leonard Susskind has lots of great talks and books about quantum information and calculating the entropy of black holes which led to a lot of wild new hypotheses.
Stephen Wolfram gave a long talk about the history of the concept of entropy which was pretty good: https://www.youtube.com/live/ocOHxPs1LQ0?si=zvQNsj_FEGbTX2R3
jwilber
There’s an interactive visual of Entropy here in the Where To Partition section (midway thru the article): https://mlu-explain.github.io/decision-tree/
vitus
The problem with this explanation (and with many others) is that it misses why we should care about "disorder" or "uncertainty", whether in information theory or statistical mechanics. Yes, we have the arrow of time argument (second law of thermodynamics, etc), and entropy breaks time-symmetry. So what?
The article hints very briefly at this with the discussion of an unequally-weighted die, and how by encoding the most common outcome with a single bit, you can achieve some amount of compression. That's a start, and we've now rediscovered the idea behind Huffman coding. What information theory tells us is that if you consider a sequence of two dice rolls, you can then use even fewer bits on average to describe that outcome, and so on; as you take your block length to infinity, your average number of bits for each roll in the sequence approaches the entropy of the source. (This is Shannon's source coding theorem, and while entropy plays a far greater role in information theory, this is at least a starting point.)
There's something magical about statistical mechanics where various quantities (e.g. energy, temperature, pressure) emerge as a result of taking partial derivatives of this "partition function", and that they turn out to be the same quantities that we've known all along (up to a scaling factor — in my stat mech class, I recall using k_B * T for temperature, such that we brought everything back to units of energy).
https://en.wikipedia.org/wiki/Partition_function_(statistica…
https://en.wikipedia.org/wiki/Fundamental_thermodynamic_rela…
If you're dealing with a sea of electrons, you might apply the Pauli exclusion principle to derive Fermi-Dirac statistics that underpins all of semiconductor physics; if instead you're dealing with photons which can occupy the same energy state, the same statistical principles lead to Bose-Einstein statistics.
Statistical mechanics is ultimately about taking certain assumptions about how particles interact with each other, scaling up the quantities beyond our ability to model all of the individual particles, and applying statistical approximations to consider the average behavior of the ensemble. The various forms of entropy are building blocks to that end.
anon84873628
Nitpick in the article conclusion:
>Heat flows from hot to cold because the number of ways in which the system can be non-uniform in temperature is much lower than the number of ways it can be uniform in temperature …
Should probably say "thermal energy" instead of "temperature" if we want to be really precise with our thermodynamics terms. Temperature is not a direct measure of energy; rather, it is an intensive property describing the relationship between a change in energy and a change in entropy.