TL;DR: use --depth 2. Read on for why.
Shallow clones can (but do not always) defeat an important optimization. In your case this happens on the first push, but not on subsequent pushes. Other cases can defeat it too, so other pushes might also be slow.
We start with the fact that Git is really all about commits, which are shaped into a Directed Acyclic Graph. The graph has vertices or nodes (whichever term you prefer) that are numbered by commit hash IDs. Relatively immaterial here, but helpful for concreteness, is the fact that the edges / arcs between the nodes are stored as part of the nodes themselves, rather than being kept separately. Each node stores the hash IDs of its predecessor nodes.
A repository is, at heart, a database of these commit objects. A complete—non-shallow—repository has the entire graph, from every root to every tip commit. A single-branch clone potentially drops some part of the graph, but never has any “gaps” in the graph. For instance, given:
           node--node--tip1
          /
root--node
          \
           node--node--tip2
we can drop either tip and the nodes on that row, but not the nodes and root on the middle row. In all of these cases, then, we can—as Git always does—start at the tip and work backwards and eventually arrive at the root.
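As a rough illustration of that backwards walk, here is a toy Python model of the diagram above. It is only a sketch: the names (`a1`, `mid`, and so on) are made-up labels standing in for hash IDs, and `reachable` is a hypothetical helper, not anything Git itself provides.

```python
# Toy model of the diagram: each commit records only the IDs of its parents.
# The labels are illustrative stand-ins for real commit hash IDs.
parents = {
    "root": [],
    "mid": ["root"],
    "a1": ["mid"], "a2": ["a1"], "tip1": ["a2"],
    "b1": ["mid"], "b2": ["b1"], "tip2": ["b2"],
}

def reachable(tip, parents):
    """Return every commit reachable from `tip` by following parent links."""
    seen, stack = set(), [tip]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents[commit])  # step backwards along each edge
    return seen

# A single-branch clone of tip1 keeps the top row plus the shared middle row.
single_branch = reachable("tip1", parents)
print(sorted(single_branch))
```

Dropping `tip2` and its row leaves no gap: every parent link in `single_branch` still points at a commit that is present, all the way back to `root`.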
Now, there are two properties of each node that are important here:
- The number is unique. It’s a universally unique ID. No node in any other Git repository (that we’ll meet anyway) can re-use that ID.
- The data in the node are strictly read-only. That includes the outgoing edge links.
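Both properties fall out of how the numbers are made: in a SHA-1 repository, an object's ID is the SHA-1 of a short `<type> <size>` header, a NUL byte, and the object's bytes. The Python sketch below assumes that scheme; `object_id` and `commit_body` are hypothetical helpers, and the author line, message, and all-`a` / all-`b` parent IDs are invented for the example.

```python
import hashlib

def object_id(kind: str, body: bytes) -> str:
    """SHA-1 over a '<kind> <size>' header, a NUL byte, and the body."""
    header = f"{kind} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()

# ID of a tree with no entries; computed here rather than hard-coded.
EMPTY_TREE = object_id("tree", b"")

def commit_body(parent_id: str) -> bytes:
    """Build a made-up commit whose only varying field is its parent link."""
    return (
        f"tree {EMPTY_TREE}\n"
        f"parent {parent_id}\n"
        "author A U Thor <author@example.com> 0 +0000\n"
        "committer A U Thor <author@example.com> 0 +0000\n"
        "\n"
        "example\n"
    ).encode()

id_a = object_id("commit", commit_body("a" * 40))
id_b = object_id("commit", commit_body("b" * 40))
print(id_a != id_b)  # the parent link is part of the hashed data
```

Since the parent line is inside the hashed bytes, editing an edge would produce a different number, that is, a different commit. The node stored under the old number never changes.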
What this means is that if we have a gap-free repository—one that’s either totally complete, or at least as complete as required for the tip commits it contains—on each side of a sender-to-receiver operation, we can have the sending repository simply enumerate for us some set of commits, by their numbers. If we, the
12 Comments
rafaelcosta
I'm wondering what the "because when we read it in, we mangle it" part really means… does this mean that there's no way to reference the commit (signaling that it's just a reference and has no actual data) without actually reading the contents of it?
Update: just realized why it wouldn't make sense: `git push` would send only the delta from the previous commit, and the previous commit is… non-existent (we only know its ID), so we'd be back at square one (sending everything).
edflsafoiewq
Do blobless clones suffer from this?
necovek
I like the fact that none of this was tested, even if described with such authority :)
Anyone try it out yet?
(Not that I don't trust it, but I usually fetch the full history locally anyway)
nopurpose
`git clone --filter blob:none` FTW
Timwi
This seems like a bug to me. Even if the previous commit is “mangled” as they call it, there's no reason why you can't diff against it and only send the diff.
haunter
Wait is this a bug actually?
jbreckmckye
Why can't git push, when it encounters a `.git/shallow`, just ask the git server to fill in the remaining history by verifying the parent hashes the client can send?
zeristor
Should have the (2021) suffix
bradley13
Ok, I'm a simplistic Git user, but: I always do a full clone. Maybe (probably) I will never need all that history, but…maybe I will. Disk space is cheap.
mg
Reading this again reminds me how beautifully git uses the file system as a database, with everything laid out nicely in directories and files.
Except for performance, is there any downside to this?
In other words: When you store data in an application that only reads and writes data occasionally, is it a good idea to use the git approach and store it in files?
wvh
That's a beautiful answer. Sometimes people explain something you already know, but different parts of your brain light up. This doesn't just explain git once more, but also plants some seeds related to hashed state optimisations in other, future challenges.
kruador
It isn't mangled. The commit is there as-is. Instead the repository has a file, ".git/shallow", which tells it not to look for the parents of any commit listed there. If you do a '--depth 1' clone, the file will list the single commit that was retrieved.
This is similar to the 'grafts' feature. Indeed 'git log' says 'grafted'.
You can test this using "git cat-file -p" with the commit that got retrieved, to print the raw object.
> git clone --depth 1 https://github.com/git/git
> git log
commit 388218fac77d0405a5083cd4b4ee20f6694609c3 (grafted, HEAD -> master, origin/master, origin/HEAD)
Author: Junio C Hamano <gitster@pobox.com>
Date: Mon Feb 10 10:18:17 2025 -0800
> git cat-file -p 388218fac77d0405a5083cd4b4ee20f6694609c3
tree fc620998515e75437810cb1ba80e9b5173458d1c
parent 50e1821529fd0a096fe03f137eab143b31e8ef55
author Junio C Hamano <gitster@pobox.com> 1739211497 -0800
committer Junio C Hamano <gitster@pobox.com> 1739211512 -0800
The ninth batch
Signed-off-by: Junio C Hamano <gitster@pobox.com>
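To make the distinction concrete, here is a toy Python model of that behaviour. It is a sketch of the idea only, not Git's code: `walk` is a hypothetical helper, and the short IDs are abbreviations of the hashes in the output above.

```python
# The clone holds one commit object; its parent line is intact, but the
# parent object itself was never downloaded.
objects = {"388218f": ["50e1821"]}   # commit -> recorded parent IDs
shallow = {"388218f"}                # stand-in for the contents of .git/shallow

def walk(tip, objects, shallow=frozenset()):
    """Follow parent links from `tip`, stopping at commits listed in `shallow`."""
    seen, stack = set(), [tip]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        if commit not in shallow:    # listed commits are treated as parentless
            stack.extend(objects[commit])
    return seen

print(walk("388218f", objects, shallow))
```

Without the `shallow` set the same walk would go looking for `50e1821` and fail with a `KeyError`, because that object is not in the store.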
I can't reproduce the problem pushing to Bitbucket, using the most recent Git for Windows (2.47.1.windows.2). It only sent 3 objects (which would be the blob of the new file, the tree object containing the new file, and the commit object describing the tree), not the 6000+ in the repository I tested it on.
It may be that there was a bug that has now been fixed. Or it may be something that only happens/happened with GitHub (i.e. a bug at the receiving end, not the sending one!)
I note that the Stack Overflow user who wrote the answer left a comment underneath saying
"worth noting: I haven't tested this; it's just some simple applied math. One clone-and-push will tell you if I was right. :-)"