
PostgreSQL JSONB – Powerful Storage for Semi-Structured Data
Object-oriented or relational? You’ve seen this battle in your projects. For many years, we tried to fit business data, which is usually grouped by business use case, into a normalised table structure. Sometimes it fits better, sometimes worse. We learned to cheat the mismatch with Object-Relational Mappers; they fixed some issues but created others.
Then document databases appeared, looking at the problem from a different angle. They accepted that real-world data is fuzzy and rarely follows a rigid structure; it’s semi-structured. It’s the classic dilemma: rigid schemas that ensure data integrity but resist change, versus flexible document stores that adapt easily but sacrifice guarantees.
Still, those databases didn’t give us the consistency guarantees we were comfortable with, like transactions, isolation levels, etc.
PostgreSQL added the JSONB column type to answer the question: “Can we have both?”.
Then other databases like MySQL and SQLite followed.
Let’s discuss how it works and whether it’s the actual answer.
First, terminology: the B in JSONB stands for Binary. PostgreSQL also has a JSON type, but they’re different implementations:
- JSON stores data as text, requiring parsing on each operation.
- JSONB uses a binary representation.
The binary format of JSONB enables efficient indexing and querying and eliminates the need for repeated parsing. The downside is that it takes up more storage space (as it stores not only text but also metadata) and loses the exact formatting and ordering of the original JSON input. That’s usually acceptable, but worth knowing.
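A quick way to see the difference is to cast the same text to both types; JSON echoes the input verbatim, while JSONB normalises it (a small snippet you can try in psql):

SELECT '{"b": 2,  "a": 1, "a": 3}'::json;
-- {"b": 2,  "a": 1, "a": 3}    (text kept as-is, duplicate key included)

SELECT '{"b": 2,  "a": 1, "a": 3}'::jsonb;
-- {"a": 3, "b": 2}             (whitespace normalised, keys reordered, last duplicate wins)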
You can think of a JSONB column as a table inside the column. Each path to a property can logically be seen as a separate column.
To visualise this conceptual model, imagine if our database could access any path directly without scanning the entire document:
+----+---------------+--------------------------+--------------------+
| id | customer.name | customer.contact.address | email              |
+----+---------------+--------------------------+--------------------+
| 1  | "John Smith"  | "Portland"               | "john@example.com" |
+----+---------------+--------------------------+--------------------+
Of course, that’s a bit of an oversimplification.
PostgreSQL doesn’t create these virtual columns, but the binary representation effectively allows it to access paths directly without scanning the whole document. This approach provides several benefits:
- Path traversal is much faster than parsing text.
- Fields can be indexed individually (see the index sketch after this list).
- Query performance is more predictable.
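To make the indexing point concrete, here is a minimal sketch; the customers table and the index names are hypothetical, but the index types are standard PostgreSQL:

-- a hypothetical table holding customer documents
CREATE TABLE customers (
    id            bigserial PRIMARY KEY,
    customer_data jsonb NOT NULL
);

-- a GIN index over the whole document (supports containment queries with @>)
CREATE INDEX idx_customers_data ON customers USING GIN (customer_data);

-- an expression index on one extracted field, e.g. the city
CREATE INDEX idx_customers_city
    ON customers ((customer_data -> 'customer' -> 'contact' -> 'address' ->> 'city'));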
When PostgreSQL stores a JSONB document, it doesn’t simply dump JSON text into a field. It transforms the document into a binary representation.
Have a look at the following JSON:
{
  "customer": {
    "name": "John Smith",
    "contact": {
      "email": "john@example.com",
      "phone": "555-1234",
      "address": {
        "city": "Portland",
        "country": "USA"
      }
    }
  }
}
This hierarchical structure can be flattened into a set of paths, one per nested value:
Path: customer.name = "John Smith"
Path: customer.contact.email = "john@example.com"
Path: customer.contact.phone = "555-1234"
Path: customer.contact.address.city = "Portland"
Path: customer.contact.address.country = "USA"
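In SQL, these flattened paths map directly onto the #>> operator, which takes the whole path as an array (reusing the hypothetical customers table sketched earlier):

SELECT customer_data #>> '{customer,name}'                 AS name,
       customer_data #>> '{customer,contact,email}'        AS email,
       customer_data #>> '{customer,contact,address,city}' AS city
FROM customers;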
The goal: read a specific field performantly without parsing the entire document. How does PostgreSQL achieve that? To understand it, let’s discuss how the document is stored.
A JSONB document is stored as a hierarchical, tree-like structure of key-value pairs, each carrying metadata about its type alongside the actual data. To enable versioning, the format begins with a “magic” single-byte header that identifies the format’s version (currently version 1), followed by a header with metadata.
When new JSON is stored, it has to be parsed and converted from text to key-value pairs. This conversion happens through a process called “tokenisation”.
For example, the following customer data:
{ "id": "cust_web_123", "email": "web_user@example.com", "source": "website", "web_data": { "first_visit": "2023-03-12T15:22:47Z", "utm_source": "google_search" }, "contact": { "phone": "+1234567890" } }
Would be tokenised and stored with internal tree-like structures tracking:
Root: Object {
  Count: 5,
  Children: [
    Key: "id",
    Value: String "cust_web_123",
    Key: "email",
    Value: String "web_user@example.com",
    Key: "source",
    Value: String "website",
    Key: "web_data",
    Value: Object {
      Count: 2,
      Children: [
        Key: "first_visit",
        Value: String "2023-03-12T15:22:47Z",
        Key: "utm_source",
        Value: String "google_search"
      ]
    },
    Key: "contact",
    Value: Object {
      Count: 1,
      Children: [
        Key: "phone",
        Value: String "+1234567890"
      ]
    }
  ]
}
Of course, this is a somewhat simplified form, but you can think of each array element as a separate node, with the array index as just another segment of the nested path. Each key and value in a JSONB document is stored with its data type and metadata. For objects and arrays, PostgreSQL maintains counts and offsets to child elements.
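You can peek at these parsed key-value pairs from SQL; jsonb_each and jsonb_typeof expose them one level at a time (a small illustration):

SELECT key, value, jsonb_typeof(value) AS type
FROM jsonb_each('{"id": "cust_web_123", "source": "website", "contact": {"phone": "+1234567890"}}'::jsonb);
--   key    |          value           |  type
-- ---------+--------------------------+--------
--  id      | "cust_web_123"           | string
--  source  | "website"                | string
--  contact | {"phone": "+1234567890"} | object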
PostgreSQL’s JSONB tokenisation is clever. It not only parses the data but also preserves the actual types of values. Each value in the binary format includes a type tag identifying whether it’s a string, number, boolean, null, array, or object. Thanks to that, PostgreSQL can:
- ensure that data maintains its semantic meaning,
- enable indexing,
- enable type-specific operations (like numeric comparisons),
- avoid type conversion when not needed.
What’s more, it’s not limited to JSON types. It distinguishes different numeric types (integers, floating-point) and handles string data with character encoding.
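Because the types survive tokenisation, values behave according to their type; a number compares as a number, not as text (a quick illustration):

-- jsonb comparison is typed: 10 > 9 numerically (as text, '10' would sort before '9')
SELECT '{"qty": 10}'::jsonb -> 'qty' > '9'::jsonb AS typed_compare;   -- true

-- ->> returns text, so arithmetic needs an explicit cast
SELECT ('{"qty": 10}'::jsonb ->> 'qty')::int * 2  AS doubled;         -- 20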
So, when you search for:
customer_data->'customer'->'contact'->'address'->>'city'
PostgreSQL can navigate directly to that specific token without scanning the entire document. Using the hierarchical structure described above, it gets the exact value together with its type.
PostgreSQL uses path extraction operators (-> and ->>) for querying the data inside JSON.
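The difference between the two matters in practice: -> returns a jsonb value (useful for further navigation), while ->> unwraps it to text (again assuming the hypothetical customers table):

SELECT customer_data -> 'customer' -> 'contact' -> 'address' -> 'city'   AS as_jsonb,  -- "Portland"
       customer_data -> 'customer' -> 'contact' -> 'address' ->> 'city'  AS as_text    -- Portland
FROM customers;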
When PostgreSQL encounters such a query, it:
- Parses the path into individual segments.
- Locates the root object or array in the binary structure.
- Computes a hash of the key for each path segment.
- Uses the hash to look up the corresponding entry in the structure.
- Navigates to the next level if needed.
- Extracts and returns the value in the requested format.
This algorithm is heavily optimised for typical access patterns. For simple path expressions, PostgreSQL can retrieve values with near-constant time complexity, regardless of document size. However, the performance characteristics become more variable for more complex expressions, especially those involving arrays or filtering.
The algorithm also has specific optimisations for handling arrays differently from objects. Arrays are stored with implicit numeric keys, and PostgreSQL includes specialised operations for array traversal and element access. This becomes important when dealing with large arrays or performing operations like jsonb_array_elements().
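For instance, jsonb_array_elements_text() expands an array into one row per element; the sketch below assumes documents that contain a web_data.pages_viewed array, as in the customer model discussed later:

SELECT p.page
FROM customers,
     LATERAL jsonb_array_elements_text(customer_data -> 'web_data' -> 'pages_viewed') AS p(page);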
Understanding this internal path representation explains why some JSONB queries perform better than others. Operations that align with PostgreSQL’s internal traversal algorithm (like direct path access to deeply nested values) can be remarkably fast, while operations that require scanning or restructuring (like complex filtering within arrays) might perform less optimally.
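A rough contrast under the same assumptions ('/pricing' is just an illustrative value): the first query aligns with direct path traversal, while the second must expand an array per document:

-- fast: direct path access, near-constant time per document
SELECT customer_data #>> '{contact,phone}' FROM customers;

-- slower: filtering inside an array forces element-by-element scanning
SELECT c.id
FROM customers c
WHERE EXISTS (
    SELECT 1
    FROM jsonb_array_elements_text(c.customer_data -> 'web_data' -> 'pages_viewed') AS p(page)
    WHERE p.page = '/pricing'
);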
Read more in the documentation about JSON Functions and Operators.
To see how JSONB actually works, let’s discuss a scenario that manages customer data.
Here’s how a typical customer model might look in TypeScript:
interface Customer {
  id: string;
  email: string;
  source: string;
  name?: string;
  // Different structure depending on acquisition channel
  // At least one of these will be present per customer
  web_data?: {
    first_visit: string;
    utm_source: string;
    browser: string;
    pages_viewed: string[];
  };
  mobile_data?: {
    device: string;
    os_version: string;
    app_version: string;
    notification_token: string;
  };
  crm_data?: {
    account_manager: string;
    company: string;
    department: string;
    role: string;
    meeting_notes: string[];
  };
  contact: {
    phone?: string;
    address?: {
      city: string;
      country: string;
      postal_code?: string;
    };
  };
  // The purchases element type was cut off in the source; its shape is assumed here
  purchases: Array<Record<string, unknown>>;
}
If you tried to model it with relational tables, you’d face a classic data modelling challenge. You’d end up with either:
- A complex set of tables with joins: Customer_Web, Customer_Mobile, Customer_CRM with relationships between them.
- A wide table with mostly NULL values: a single table with columns for every possible field across all sources.
With JSONB, a third option appears: keep the stable, always-present fields as regular columns and store the source-specific, semi-structured parts in a single JSONB column.
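A minimal sketch of that hybrid shape (table and column names are illustrative):

CREATE TABLE customers_hybrid (
    id            text  PRIMARY KEY,
    email         text  NOT NULL,
    source        text  NOT NULL,
    customer_data jsonb NOT NULL  -- web_data, mobile_data, crm_data, contact, purchases
);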