Obfuscating Your Primary Keys by evanwalsh

One of the most common follow-up feature requests I get from clients after building a web application is to make urls containing model ids more appealing. Nothing screams “we are amateurs!” like inviting a new user to your website and redirecting them to /users/4/ or sending a customer a billing invoice at /invoices/12/ or linking to a page on your new photo-sharing site at /photos/31/. This is a common problem that affects database-backed web applications that need to surface permalinks to content that is accessed with a primary key embedded in the url. This can be anything with a “Detail Page” view—users, products, orders, comments, photos, profiles, posts, etc. In this post I would like to share my own research into this question with a discussion of the options you might consider for different use cases.

In various forms, this question has been around for a very long time and has been asked numerous times on Stack Overflow [1], [2], [3], [4], Security Stack Exchange [5], [6], [7], [8], DBA Stack Exchange [9], [10], [11], [12], and in various other forums [13], [14], [15].

Many of these questions are motivated by the same concern my clients raise: how can we make the ids in our urls look a bit less lonely and a bit more professional? However, this is more than merely a cosmetic or aesthetic issue because information leaked by urls can be damaging when used in enumeration attacks. As an application developer you are surely protecting your site against username enumeration attacks, but similar techniques can also be used to reveal sensitive information about your customers, your products, or your business, when directed against other areas of your site that you may not think of securing as tightly as user access.

The most obvious example of information leakage here is traffic or volume. When using simple auto-incrementing integers, you not only give away the total number of existing ids every time someone creates one, you also reveal the rate of creation when someone waits a week and checks to see how many new ids have been created in the meantime. But even if you don’t care about revealing this information, there is guaranteed to be a bored script-kiddie out there who will spot your sequential ids and proceed to slam your site with DDoS-level requests for every possible unsigned 32-bit integer on the off chance they might discover something interesting. Just for the lulz. Or, another may decide to simply scrape your entire product database and spin up their own copy somewhere else. There are other more subtle information “locality” leaks, and you really have no way of knowing in advance what trouble they might lead to. Sequential ids make it easier to guess the id range of another user’s resources. They also reveal which of two objects was created first, or which other objects were created around the same time.

Some commenters have the mistaken belief that exposing the primary key of one of their database tables is a security breach. This may be true if you are using a natural primary key such as a social security number, but if you have technical primary keys implemented as simple auto-incrementing serials, then there is nothing special about them at all. Knowing that my customer_id in your database is 1334234 is useless to me or to anyone else, apart from the fact that it is used to access a particular web page. Obscuring that primary key with some alternate hashed representation like sdhkj478s presents essentially the same risk. In either case, there exists a mechanism to look up a customer record using that identifier. The id itself is not a secret. However, the process of generating ids is what may need to be guarded in order to prevent the various forms of information leakage noted above.

Responses to this question are often filled with harsh criticism and strong warnings that ‘obscurity is not security!’. Reliance on obscurity to prevent unauthorized users from viewing secrets is hopefully not a mistake that you will make. I assume that you have meticulously reviewed the security section of the documentation of any web application framework you are using, that you are aware of and know how to protect against all the attacks listed in the OWASP Web Security Testing Guide, that you never hard-code credentials into source code, you know the difference between authn and authz, you never say ‘hacker’ when you mean ‘cracker’, etc.

But security is hard, really hard. Exploits have a habit of turning up in ways even a science fiction writer could hardly anticipate. Witness the Spectre and Meltdown attacks on Intel CPUs. You’d have to be some kind of twisted genius to even suspect the attack surface involved here, something that generations of chip designers hadn’t even conceived until the attack was revealed. The very nature of timing attacks and what it takes to guard against them is so antithetical to the typical engineering mindset that they can completely bulldoze our entire sense of security in one shot. For example, consider how you as an engineer would design a string comparison function. Surely, you’d aim for it to be as optimal as possible: terminating immediately upon determining that two strings cannot match. And this is exactly how optimization minded engineers unwittingly introduce security flaws. Using your tight optimal string matching function, an attacker can now reverse engineer the credentials to your website by timing the response—one character at a time.
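To make the string comparison example concrete, here is a minimal Python sketch. The function names are hypothetical and exist only for illustration. The first function is the "optimal" early-exit comparison described above, and the second delegates to the standard library's hmac.compare_digest, which is designed so that its running time does not depend on where the inputs differ.

```python
import hmac


def naive_equal(a: bytes, b: bytes) -> bool:
    """Early-exit comparison: the running time grows with the length of the
    matching prefix, which is what a timing attack measures."""
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x != y:
            return False  # returns sooner the earlier the first mismatch is
    return True


def constant_time_equal(a: bytes, b: bytes) -> bool:
    """Comparison that avoids the content-dependent early exit."""
    return hmac.compare_digest(a, b)
```

Both functions return the same answers. The only difference is how much the elapsed time tells an observer about the secret.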

The point is it doesn’t matter that it takes an evil genius to discover these flaws: five minutes after the embargo ends someone has a pre-packaged drive-by exploit that will drain your bank account, revoke your driver’s license, subscribe you to spam newsletters, and leave your refrigerator door open and all you did was click on a link to a cute kitten picture.

As a result, I would personally never underestimate the security profession, nor would I dare let my paranoia slip for even a moment to think that my skills are good enough to secure a website against a determined attacker. Of course, my clients are not holding state secrets, I never store payment instrument details, no PII, no HIPAA or COPPA concerns, and pretty much nothing of interest that could be monetizable outside of my customers’ businesses. But I still fear that one day I might discover one of my site’s databases dumped into a pastebin because I wasn’t fast enough doing a dnf update to correct some obscure security vulnerability in one of the 3,000 packages on the webserver.

So it’s not much of a surprise when this question comes up that it is met with a lot of alarm bells and a tendency to assume that the person asking the question has no idea what they’re doing and is very liable to end up with their data being actively traded on various dark markets. The seemingly innocuous goal of substituting sequential integers in urls with an encoded representation is treated as if the question poster had just asked why they can’t just chmod 777 *, turn off SELinux, and run an anonymous ftp server out of their home directory.

A lot of the skepticism boils down to the fact that obscuring the ids does nothing to change the fact that they are simply identifiers used to access a resource on your site, so what’s the point? Another class of responses addresses solving the information leakage problem by simply using GUIDs. Recent versions of major RDBMS are now all fairly good with UUID-like datatypes and despite it not being as efficient as simple integers, you are likely never to have a performance issue if set up correctly.

Lastly, after shredding the poster for advocating security through obscurity and proposing GUIDs as the solution to all of life’s woes, a lot of the comments in the thread amount to asking “who cares?” If it’s not a security feature and adds no functionality to the site, does it really matter how your urls look? This is largely a matter of personal preference that each developer will need to decide for themselves. However, given the number of times I have been approached to “fix” the urls because “they look dumb”, I can only suggest that you familiarize yourself with the possibilities because eventually a client or a project manager will come to you with this exact scenario.

I’ll explore several different classes of solutions that I’ve considered for various use cases that are all mainly focused on the cosmetic treatment of ids.

Tweaking the sequence
GUIDs
Hashing techniques
Irreversible randomness
Feistel ciphers
Other ciphers
Random permutations

Tweaking the sequence

A simple and unobtrusive way to disguise the paucity of rows in your database that might otherwise be a source of embarrassment to your client is to restart the sequence at a larger number, rather than just 1.

CREATE SEQUENCE model_id_seq AS int MINVALUE 100000 INCREMENT BY 23;

This method will quickly jumpstart your sequence so that your new site looks like it has a bit more history behind it. By maintaining a monotonically increasing sequence, this method also maintains sort order of the ids. However, this quickly resembles the “New Checking Account” problem. In the same way that nobody is fooled into thinking you’ve written a thousand checks when you hand them one with number 1001, your clients won’t be impressed into thinking you’ve had twenty million jobs when you send them invoice number 20210001, especially if the next one you send them 4 months later is 20210002. Optionally, you could also further tweak the sequence by specifying an increment so that successively generated ids will be incremented by some value more than just 1.

These tweaks might be good enough for your site, but they really don’t do much to obscure your traffic because the increment will always result in a regular pattern. This is also quite wasteful because you will be skipping a large portion of available numbers. Of course, by the time you reach the 32-bit integer limit, you will surely have a much better plan—and a bigger budget—for dealing with this question!
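To see why a fixed increment hides so little, consider this rough Python sketch of what an outside observer could do with a handful of ids collected from urls. The function names and the sample ids are made up for illustration. The greatest common divisor of the gaps between observed ids is a multiple of the increment, and with a few samples it usually is the increment itself.

```python
from functools import reduce
from math import gcd


def guess_increment(observed_ids):
    """Estimate the sequence step as the GCD of the gaps between observed ids."""
    ids = sorted(observed_ids)
    gaps = [later - earlier for earlier, later in zip(ids, ids[1:])]
    return reduce(gcd, gaps, 0)


def ids_issued_between(earlier_id, later_id, increment):
    """Count how many ids were handed out between two observations."""
    return (later_id - earlier_id) // increment


# Hypothetical ids seen in urls over a few weeks.
seen = [100023, 100092, 100230, 100345]
step = guess_increment(seen)                       # 23
print(step, ids_issued_between(seen[0], seen[-1], step))  # prints "23 14"
```

Once the step is known, the ids are as informative as a plain serial starting at 1.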

In conclusion, I would recommend this option only if you’re looking for a fast way of introducing some diversity to your id sequence.

GUIDs

The GUID, or UUID, is a 128-bit number typically represented as 32 hexadecimal characters. A random (version 4) GUID has 122 random bits, which gives roughly 5e36 unique values, so there is virtually no chance of a collision before the end of the universe, assuming the algorithm generating the GUIDs is correctly implemented. By convention, GUIDs are written as 5 groups of hexadecimal digits separated by 4 dashes, so you usually see them occupying 36 characters. DO NOT store them in your database as varchar(36). Use a native data type, which modern database systems all have available.
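The size difference between the textual and the native form is easy to check with Python's standard uuid module. This is only a small illustration of why a native 16-byte type is preferable to a 36-character string, and the variable names are arbitrary.

```python
import uuid

guid = uuid.uuid4()          # random (version 4) GUID

text_form = str(guid)        # dashed hex string, as it appears in a url
binary_form = guid.bytes     # raw value, what a native column type stores

print(len(text_form))        # 36 characters
print(len(binary_form))      # 16 bytes
print(uuid.UUID(bytes=binary_form) == guid)  # True: nothing is lost
```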

Given the extremely low chance of collision, combined with the fact that GUIDs are now well optimized in databases such that they introduce only marginal cost compared to integers, GUIDs are the most commonly suggested solution to the question of obfuscating primary key sequences. You should definitely consider them when you want to solve this problem in a standard way and you don’t mind having this long ugly string in your urls.

While GUIDs easily solve the information leakage problem of an integer sequence, they don’t do much for the cosmetic concern. Now, instead of /users/4/ you will have urls like /users/0a28dc8f-53b7-40c3-bb16-2f280412544f. I can tell you from experience that the same clients who object to the former will not be too keen on the GUID solution either. It’s not just cosmetic, there’s a real usability issue with GUIDs as well. Basically, you need to be 100% certain that no one will ever want to write down by hand or speak the url, ever. For example, if you’re ever in a situation where a user needs to call customer support and read out their user id, forcing them to read out a GUID over the phone is a textbook case of worst user experience ever. Another example would be coupon codes or other short codes that need to be not easily guessable.

GUIDs solve the technical problems of the system, but they’re horrible for users that need to use those systems. Since urls constitute part of that human interface, you do need to think through all the possible ways your ids will be needed by humans before committing to GUIDs. And remember it’s not just clients and users who will be judging your site based on urls—other programmers will too. Every time I see a site using GUIDs for their model ids, I think wow, what a cop-out, the lazy programmers just took the easy way out with these ugly GUIDs. I bet they also store them in their database as varchar(36).

Hashing techniques

Another class of solutions to this problem uses modular arithmetic and bit shuffling to obfuscate a number or string in a reversible way that is also not easily reverse engineered. A popular example is hashids. The benefit here is that you can specify an output alphabet to which the input numbers can be matched. This lets you map your input integers into a condensed alphanumeric string that will be a lot shorter than the input. These so-called “short codes” are a lot easier for users to handle compared to GUIDs while also retaining a large input space.
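The general idea can be sketched in a few lines of Python. To be clear, this is not the hashids algorithm. It is a deliberately simple stand-in that assumes 32-bit ids, scrambles them with a modular multiplication (the constant is an arbitrary odd number, so it has an inverse modulo 2^32), and then writes the result in a 62-character alphabet. All names are hypothetical.

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
MODULUS = 2 ** 32
MULTIPLIER = 2654435761                    # arbitrary odd constant (assumption)
INVERSE = pow(MULTIPLIER, -1, MODULUS)     # modular inverse, Python 3.8+


def to_base62(n: int) -> str:
    """Write a non-negative integer using the 62-character alphabet."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, remainder = divmod(n, 62)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def from_base62(code: str) -> int:
    """Inverse of to_base62."""
    n = 0
    for ch in code:
        n = n * 62 + ALPHABET.index(ch)
    return n


def encode(id_: int) -> str:
    """Scramble a 32-bit id and render it as a short code."""
    return to_base62((id_ * MULTIPLIER) % MODULUS)


def decode(code: str) -> int:
    """Recover the original id from a short code."""
    return (from_base62(code) * INVERSE) % MODULUS
```

Any id below 2^32 becomes a code of at most six characters, and decode(encode(n)) returns n. Anyone who guesses the multiplier can undo it just as easily, so a sketch like this gives no real protection.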

The problem with these approaches is that they attempt to imitate crypto without going so far as to implement crypto. This can easily manifest a false sense of security as you might begin to think that it would be too hard for someone to reverse the algorithm and obtain your original sequence. Be incredibly wary of this false sense of security because it has already been demonstrated that hashids is easily broken. Well, maybe not easy for you and me, but for the people who do this sort of thing for a living, it’s trivial. In general, any time you see published tech talks about breaking the security of some system, it means there’s a kit circulating in the wild that wraps up the whole exploit into a 7 line shell script. To you and me it might look like encryption. But to Mossad, it’s no more of a challenge than rot13.

Another problem is that home-grown pseudo-crypto like this is horribly inefficient. Actual cryptography is pretty much by definition very computationally intensive. Libraries that implement it are also designed to take maximum advantage of the hardware to make cryptographic functions as optimized as possible. Doing this sort of thing outside of an actual crypto library—in an interpreted language running on a webserver—basically means hammering your CPU every time you need to compute one of these ids. So, not only do you not get the security of actual cryptography, you also waste a lot of resources calculating an obfuscated id that some kid out there can trivially reverse.

Irreversible randomness

Oftentimes you may want a random id that retains some useful properties of an integer sequence like sortability but doesn’t need to be reversible. In this case, you aren’t mapping an input sequence, you simply want to generate a random id on demand and assign it to an object. This combines the benefit of a large random space—not as large as a GUID but good enough—along with a timestamp so that you can maintain sort order. The best explanation and example of this approach is described in this post on the Instagram Engineering blog. There’s another good discussion here, and another similar example as implemented by Firebase.
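As a rough sketch of the idea, the Python below packs a millisecond timestamp into the high bits of a 64-bit integer and fills the low bits with random data. The bit widths, the custom epoch, and the function names are all assumptions chosen for illustration, not the layout used by any of the systems mentioned above.

```python
import secrets
import time

EPOCH_MS = 1_600_000_000_000   # hypothetical custom epoch (a point in 2020)
RANDOM_BITS = 22               # assumed width of the random suffix


def new_id(now_ms=None):
    """Build an id whose high bits are a timestamp and low bits are random."""
    if now_ms is None:
        now_ms = time.time_ns() // 1_000_000
    return ((now_ms - EPOCH_MS) << RANDOM_BITS) | secrets.randbits(RANDOM_BITS)


def created_at_ms(id_):
    """Recover the creation time (in Unix milliseconds) embedded in an id."""
    return (id_ >> RANDOM_BITS) + EPOCH_MS
```

Ids created in different milliseconds sort in creation order, and the value fits in a signed 64-bit column for decades. Two ids created in the same millisecond can collide with a probability of about 1 in 4 million under these assumptions, so a unique constraint with a retry is still needed. Note also that the timestamp is recoverable by anyone who knows the layout.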

This technique was developed for dealing with massive scale, when you would ordinarily turn to a GUID but want to maintain time-ordered sortability while thousands of objects are being created every second. Such environments are very likely using sharding because this kind of scale can’t be managed on a single database server. As a result, you can’t have a primary key sequence that can orchestrate the assignment of unique ids across shards.

Even though none of my sites has anywhere near this kind of scale, I’ve used this technique in a few places because the effect does lend an air of professionalism. You have very large numbers incrementing in an unpredictable way over time. This method provides a nice illusion that there’s a massive scale in operation behind your modest objects.

Feistel ciphers

Feistel ciphers are a nifty way of mapping an arbitrary sequence to a non-cyclical permutation within the same range. This method is a lot more flexible and thus can cover more use cases than any of the other techniques described thus far. It works well not only for obfuscating primary keys, but also for generating short codes, coupon codes, or something like a “license plate” format. You can also use this quite effectively for small input ranges, say only the numbers from 100000 to 999999.

Feistel ciphers are a method of format-preserving encryption, meaning the output range is the same as the input range. Moreover, the result is fully reversible and unique: each input maps to exactly one output and vice versa. Bearing in mind that anything with a small key size can probably be broken by brute force, Feistel ciphers are much more difficult to decipher compared to something like hashids. The calculation is also not particularly intense which means you get almost all the benefits of a random permutation without trading off any of the time or storage requirements.
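To show how the pieces fit together, here is a minimal Python sketch of a balanced Feistel network combined with "cycle walking" to stay inside an arbitrary range. It is not the algorithm used by any particular library. The SHA-256 based round function, the choice of 4 rounds, and all the names are assumptions made for illustration.

```python
import hashlib


def _round_value(half: int, key: bytes, round_no: int, bits: int) -> int:
    """Keyed round function: hash the half-block and keep `bits` bits."""
    data = key + bytes([round_no]) + half.to_bytes(8, "big")
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest[:8], "big") & ((1 << bits) - 1)


def feistel(n: int, key: bytes, half_bits: int, rounds: int = 4,
            decrypt: bool = False) -> int:
    """Permute the integers in [0, 2**(2 * half_bits))."""
    mask = (1 << half_bits) - 1
    left, right = n >> half_bits, n & mask
    order = range(rounds - 1, -1, -1) if decrypt else range(rounds)
    for r in order:
        if decrypt:
            # Undo one round: (L', R') -> (R' xor F(L'), L')
            left, right = right ^ _round_value(left, key, r, half_bits), left
        else:
            # Apply one round: (L, R) -> (R, L xor F(R))
            left, right = right, left ^ _round_value(right, key, r, half_bits)
    return (left << half_bits) | right


def permute(n: int, key: bytes, lo: int, hi: int, decrypt: bool = False) -> int:
    """Map n in [lo, hi] to another value in [lo, hi], reversibly."""
    size = hi - lo + 1
    half_bits = max(1, ((size - 1).bit_length() + 1) // 2)
    x = n - lo
    while True:
        # Cycle walking: re-apply the cipher until the result is in range.
        x = feistel(x, key, half_bits, decrypt=decrypt)
        if x < size:
            return x + lo


key = b"hypothetical secret key"
code = permute(100001, key, 100000, 999999)
assert permute(code, key, 100000, 999999, decrypt=True) == 100001
```

Because the network is a permutation of a power-of-two domain that is only slightly larger than the target range, the loop rarely needs more than a couple of iterations, and every input in the range still maps to exactly one output in the range.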

The implementation I like the most and have used in production numerous times is Daniel Vérité’s PostgreSQL extension permuteseq, which was developed based on the Pseudo encrypt function. Daniel’s README as well as the wiki article provide some good background on why and how to use it.

I’ll share two examples where I used this in different scenarios. The first case involves
