Git Internals – Git Objects by naltun

Git is a content-addressable filesystem.
Great.
What does that mean?
It means that at the core of Git is a simple key-value data store.
In other words, you can insert any kind of content into a Git repository, and Git will hand you back a unique key you can use later to retrieve that content.
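
To make the key-value idea concrete, here is a toy sketch in plain shell. It is not Git's implementation: the helpers `kv_put` and `kv_get` are hypothetical names invented for this illustration, the store is a temporary directory, and the key is the SHA-1 of the raw content only (Git also hashes a header). It assumes `sha1sum` from GNU coreutils is available.

```shell
# Toy content-addressable store (illustration only; not Git's on-disk layout).
store=$(mktemp -d)

# kv_put: read content from stdin, save it under its SHA-1, print the key.
kv_put() {
    tmp=$(mktemp)
    cat > "$tmp"
    key=$(sha1sum < "$tmp" | cut -d' ' -f1)
    mv "$tmp" "$store/$key"
    echo "$key"
}

# kv_get: print the content stored under the given key.
kv_get() {
    cat "$store/$1"
}

key=$(echo 'test content' | kv_put)
kv_get "$key"
```

Because the key is derived from the content, storing the same content twice yields the same key, which is the property the rest of this section relies on.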

As a demonstration, let’s look at the plumbing command git hash-object, which takes some data, stores it in your .git/objects directory (the object database), and gives you back the unique key that now refers to that data object.

First, you initialize a new Git repository and verify that there is (predictably) nothing in the objects directory:

$ git init test
Initialized empty Git repository in /tmp/test/.git/
$ cd test
$ find .git/objects
.git/objects
.git/objects/info
.git/objects/pack
$ find .git/objects -type f

Git has initialized the objects directory and created pack and info subdirectories in it, but there are no regular files.
Now, let’s use git hash-object to create a new data object and manually store it in your new Git database:

$ echo 'test content' | git hash-object -w --stdin
d670460b4b4aece5915caf5c68d12f560a9fe3e4

In its simplest form, git hash-object would take the content you handed to it and merely return the unique key that would be used to store it in your Git database.
The -w option then tells the command to not simply return the key, but to write that object to the database.
Finally, the --stdin option tells git hash-object to get the content to be processed from stdin; otherwise, the command would expect a filename argument at the end of the command containing the content to be used.
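
The difference between these options can be checked in a throwaway repository. The sketch below assumes `git` is installed and uses SHA-1 object names (the default at the time of writing); the temporary directory and the file name `sample.txt` are placeholders chosen for this example.

```shell
# Work in a throwaway repository so nothing real is touched.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Without -w, hash-object only computes and prints the key.
key=$(echo 'test content' | git hash-object --stdin)
echo "$key"

# Nothing has been written, so this prints no files.
find .git/objects -type f

# With a filename argument instead of --stdin, the content is read from that file.
echo 'test content' > sample.txt
git hash-object sample.txt
```

Both invocations should print the same key, since the key depends only on the content and not on where it came from.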

The output from the above command is a 40-character hexadecimal string.
This is the SHA-1 hash: a checksum of the content you’re storing plus a header, which you’ll learn about in a bit.
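
As a preview, the key can be reproduced without Git. The sketch below assumes the header for a blob is the word `blob`, a space, the content size in bytes, and a NUL byte, and that `sha1sum` from GNU coreutils is available; `content_file` is just a scratch file for this example.

```shell
# Put the same content in a scratch file; echo appends a trailing newline.
content_file=$(mktemp)
echo 'test content' > "$content_file"

# Size of the content in bytes (tr strips padding some wc versions add).
size=$(wc -c < "$content_file" | tr -d ' ')

# Hash the header ("blob <size>" followed by a NUL byte) plus the content.
key=$( { printf 'blob %s\0' "$size"; cat "$content_file"; } | sha1sum | cut -d' ' -f1 )
echo "$key"
```

Under those assumptions this prints d670460b4b4aece5915caf5c68d12f560a9fe3e4, the same key that git hash-object returned.
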
Now you can see how Git has stored your data:

$ find .git/objects -type f
.git/objects/d6/70460b4b4aece5915caf5c68d12f560a9fe3e4

If you again examine your objects directory, you can see that it now contains a file for that new content.
This is how Git stores the content initially — as a single file per piece of content, named with the SHA-1 checksum of the content and its header.
The subdirectory is named with the first 2 characters of the SHA-1, and the filename is the remaining 38 characters.
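
The mapping from key to path is simple enough to sketch in a few lines of shell. This only illustrates the naming rule described above for loose objects; the variable names are arbitrary.

```shell
key=d670460b4b4aece5915caf5c68d12f560a9fe3e4

dir=$(printf '%s' "$key" | cut -c1-2)    # first 2 characters
file=$(printf '%s' "$key" | cut -c3-)    # remaining 38 characters

path=".git/objects/$dir/$file"
echo "$path"
```

This prints .git/objects/d6/70460b4b4aece5915caf5c68d12f560a9fe3e4, matching the find output above.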

Once you have content in your object database, you can examine that content with the git cat-file command.
This command is sort of a Swiss army knife for inspecting Git objects.
Passing -p to cat-file instructs the command to first figure out the type of content, then display it appropriately:

$ git cat-file -p d670460b4b4aece5915caf5c68d12f560a9fe3e4
test content

Now, you can add content to Git and pull it back out again.
You can also do this with content in files.
For example, you can do some simple version control on a file.
First, create a new file and save its contents in your database:

$ echo 'version 1' > test.txt
$ git hash-object -w test.txt
83baae61804e65cc73a7201a7252750c76066a30

Then, write some new content to the file, and save it again:

$ echo 'version 2' > test.txt
$ git hash-object -w test.txt
1f7a7a472abf3dd9643fd615f6da379c4acb3e3a

Your object database now contains both versions of this new file (as well as the first content you stored there):

$ find .git/objects -type f
.git/objects/1f/7a7a472abf3dd9643fd615f6da379c4acb3e3a
.git/objects/83/baae61804e65cc73a7201a7252750c76066a30
.git/objects/d6/70460b4b4aece5915caf5c68d12f560a9fe3e4

At this point, you can delete your local copy of that test.txt file, then use Git to retrieve, from the object database, either the first version you saved:

$ git cat-file -p 83baae61804e65cc73a7201a7252750c76066a30 > test.txt
$ cat test.txt
version 1

or the second version:

$ git cat-file -p 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a > test.txt
$ cat test.txt
version 2

But remembering the SHA-1 key for each version of your file isn’t practical; plus, you aren’t storing the filename in your system, just the content.
An object of this kind, which holds content and nothing else, is called a blob.
You can have Git tell you the object type of any object in Git, given its SHA-1 key, with git cat-file -t:

$ git cat-file -t 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a
blob
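
Alongside -t, git cat-file also accepts -s, which reports the size of the object's content in bytes. A minimal sketch, assuming `git` is installed and uses SHA-1 object names, run in a throwaway repository so it does not depend on the state built up above:

```shell
# Throwaway repository for the experiment.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Store the same content as the second version of test.txt.
key=$(echo 'version 2' | git hash-object -w --stdin)

type=$(git cat-file -t "$key")   # object type
size=$(git cat-file -s "$key")   # content size in bytes
echo "$key is a $type of $size bytes"
```

The size counts the trailing newline that echo adds, so 'version 2' is reported as 10 bytes.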

Tree Objects

The next type of Git object we’ll examine is the tree, which solves the problem of storing the filename and also allows you to store a group of files together.
Git stores content in a manner similar to a UNIX filesystem, but a bit simplified.
All the content is stored as tree and blob objects, with trees corresponding to UNIX directory entries and blobs corresponding more or less to inodes or file contents.
A single tree object contains one or more entries, each of which is the SHA-1 hash of a blob or subtree with its associated mode, type, and filename.
For example, let’s say you have a project where the most-recent tree looks something like:

$ git cat-file -p master^{tree}
100644 blob a906cb2a4a904a152e80877d4088654daad0c859      README
100644 blob 8f94139338f9404f26296befa88755fc2598c289      Rakefile
040000 tree 99f1a6d12cb4b6f19c8655fca46c3ecf317074e0      lib
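
Each line of this listing carries the four fields of one tree entry: mode, type, key, and filename. As a rough sketch, the fields can be split apart in shell; the listing is pasted in as sample text here rather than read from a real repository, and the loop assumes filenames without leading whitespace.

```shell
# Sample listing copied from the output above.
listing='100644 blob a906cb2a4a904a152e80877d4088654daad0c859      README
100644 blob 8f94139338f9404f26296befa88755fc2598c289      Rakefile
040000 tree 99f1a6d12cb4b6f19c8655fca46c3ecf317074e0      lib'

# Split each entry into mode, type, key, and filename; keep only subtrees.
subtrees=$(printf '%s\n' "$listing" | while read -r mode type key name; do
    if [ "$type" = tree ]; then
        echo "$name"
    fi
done)
echo "subtrees: $subtrees"
```

Only lib is reported, since it is the one entry whose type is tree rather than blob.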

The master^{tree} syntax specifies the tree object that is pointed to by the last commit on your master branch.
Notice that the lib subdirectory isn’t a blob but a pointer to another tree:

$ git cat-file -p 99f1a6d12cb
