Published 15 November 2019
I've spent the last couple of weeks getting to understand git better and how it works under the hood. Once I understood the inner workings, things like git's 3-way merge and interactive rebase became much clearer to me, and I felt more confident with git. So, without wasting any more time, let's start understanding how git works.
Every git repo contains a .git directory, and everything git tracks or is aware of is stored in it. The .git directory is where all the magic happens. Let's create a brand new git repository and have a look at what's inside the .git directory. If you list the files in the .git directory, you will see a directory structure similar to this.
[Figure: directory tree of a fresh git repo]
I have briefly explained what each file/directory is about; if some or most of it doesn't make much sense to you, do not worry. After we go through the process of making a commit, things will become clearer. In this post, we will focus on the objects directory and look at the other directories in the next post. Let's make our first commit and see if we notice any changes in our .git directory. In the brand new repo, create a file named file1.txt with the content "hello world" in it.
[Figure: file1.txt]
Now let's create our first commit by committing file1.txt to git and see how it changes the .git directory.
1. git add file1.txt
2. git commit -m "Initial commit"
As you can see, after the commit there are three new directories in the objects directory, with one file in each of them. Now, let me explain what these three files are and how and why git created them. In git's terms, these three files are called object files. There are four types of object files in git: blob, tree, commit, and tag.
If you look at the contents of any of these object files with the cat command, you will see gibberish. That is because git compresses the contents of object files with zlib before storing them.
You can decompress them using any tool that supports zlib. If you have Go installed on your machine, you can use the tool by Kevin Cantwell, or you can use openssl by running
openssl zlib -d <fileName>
If you calculate the SHA-1 hash of any of these object files after decompressing them, you will notice that the hash is the same as the file name prefixed by the two-character folder name.
[Figure: shasum of decompressed object files]
This is because git names its object files based on their content. This way of naming files is called content-based addressing.
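If you would rather check this from a script, here is a minimal Python sketch of the same idea. It assumes a loose (unpacked) object file; the function name verify_loose_object is my own and not part of git.

```python
import hashlib
import zlib
from pathlib import Path


def verify_loose_object(object_path: Path) -> bool:
    """Return True if the SHA-1 of the decompressed file matches its path-derived name."""
    # Loose objects are zlib-compressed on disk.
    raw = zlib.decompress(object_path.read_bytes())
    # The object name is the two-character folder name followed by the file name.
    expected = object_path.parent.name + object_path.name
    return hashlib.sha1(raw).hexdigest() == expected
```

For example, you could call it with Path(".git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad") from the root of the repo.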
If you don't have any tools to decompress with zlib, you can also use git's built-in cat-file command to view the contents of object files. It will not show you the contents exactly as they are stored, only the main content without the metadata header.
git cat-file -p <hash-of-the-objectFile>
You can find the type of any object by using the following command:
git cat-file -t <hash of object file>
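To see where the type comes from, here is a rough Python sketch that splits a loose object into its header and body. It assumes the "<type> <size>" plus NUL header layout described in this post; read_loose_object is a made-up name, and this is not how cat-file itself is implemented.

```python
import zlib


def read_loose_object(compressed: bytes) -> tuple[str, int, bytes]:
    """Split a zlib-compressed loose object into (type, declared size, body)."""
    raw = zlib.decompress(compressed)
    # The header ends at the first NUL byte; everything after it is the body.
    header, _, body = raw.partition(b"\x00")
    obj_type, _, size = header.partition(b" ")
    return obj_type.decode("ascii"), int(size), body
```

The first element is roughly what cat-file -t reports, and for a blob the body is what cat-file -p prints.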
Let's find out what types these three objects are:
├── 3b
│   └── 18e512dba79e4c8300dd08aeb37f8e728b8dad - blob
├── 82
│   └── 424451ac502bd69712561a524e2d97fd932c69 - tree
├── d3
│   └── 89f2a5b3c591dd7f1b286cdb50bc4d26021dce - commit
├── info
└── pack
Now let's go step by step through what just happened when we made this commit and see if we can replicate these hashes ourselves. Creating a git commit consists of two stages: staging and committing. For the sake of simplicity, we will look at these two stages as a single step. If you would like to know why git uses a staging area, you can read more about it here.
When we create a commit, the first thing git does is create a blob object for each file in the repository. In our case we only had one file, file1.txt, so git took the contents of file1.txt (remember, only the content of the file and no other information like the file name or permissions) and prefixed it with the word "blob", a space, the file size in bytes, and a NUL character. Git used the SHA-1 hash of the result as the file name, and finally compressed the result with zlib before storing it in the objects directory.
[Figure: blob object] The text "blob" is there so that git can identify this as a blob object. 12 is the file size (the 11 characters of "hello world" plus the trailing newline), \x00 is the hex escape for the NUL character (the backslash is for escaping), and it is followed by the actual contents of the file.
Let's verify that this is true by running the following command.
As you can see, it matches the name of our blob object. Git uses the first two characters of the hash as the directory name and the remaining 38 as the file name. This avoids file-system-related problems (many file systems limit how many files you can have in a single directory).
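The blob naming scheme described above can also be sketched in a few lines of Python. This is only an illustration of the header-plus-content layout; blob_hash is a name I picked, not a git API.

```python
import hashlib


def blob_hash(content: bytes) -> str:
    """Hash file content the way this post describes for blob objects."""
    # Header: the word "blob", a space, the size in bytes, then a NUL byte.
    header = b"blob " + str(len(content)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


# file1.txt holds "hello world" followed by a newline (12 bytes).
print(blob_hash(b"hello world\n"))
# prints 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```

Note that the file name never enters the calculation, which is why two files with the same content end up sharing one blob.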
OK, so that is how git stores the contents of files, but how does git know which file name this content belongs to? For that, git creates another type of object called a tree object. A tree is a simple object that has a bunch of pointers to blobs and other trees - it generally represents the contents of a directory or subdirectory at that particular commit, along with other information like file permissions and the type of each entry (i.e. whether it is a directory or an ordinary file). Let's look at what a tree object looks like using git's built-in cat-file command.
git cat-file -p <hash of the tree object>
100644 blob 3b18e512dba79e4c8300dd08aeb37f8e728b8dad file1.txt
As you can see, a tree object contains a list of entries, each with a mode, object type, SHA-1 hash, and name. In our case, the tree object has only one entry since we only had one file; if we had any other files or folders, they would appear in this tree object as well. A tree object represents the contents of a single directory. (In our case we only had one directory, which is the working directory. If we had subdirectories with files in them, git would have created a tree object for each one of those directories.)
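The first six characters represent the file mode. Git uses the following modes:
100644 - regular file
100755 - executable file
120000 - symbolic link
040000 - directory (tree)
160000 - gitlink (submodule)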
After the mode comes the type of the object (tree, blob or commit), followed by its SHA-1 hash and finally the file or directory name. An object referenced by a tree may be a blob, representing the contents of a file, or another tree, representing the contents of a subdirectory.
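Here is a small Python sketch of how those entries could be read back out of a decompressed tree object (with the "tree <size>" header already removed). It assumes the raw layout of mode, space, name, NUL, then a 20-byte binary hash; parse_tree is a hypothetical helper. In the raw bytes the object type is not stored, and directories appear with mode 40000 without the leading zero; cat-file adds both for display.

```python
def parse_tree(body: bytes) -> list[tuple[str, str, str]]:
    """Parse a raw tree body into (mode, name, sha1-hex) entries."""
    entries = []
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)      # end of the mode
        nul = body.index(b"\x00", space)   # end of the file name
        mode = body[pos:space].decode("ascii")
        name = body[space + 1:nul].decode("utf-8")
        sha = body[nul + 1:nul + 21].hex() # 20 raw bytes, shown as 40 hex chars
        entries.append((mode, name, sha))
        pos = nul + 21
    return entries
```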
You can use the following command to recreate the tree hash:
The tree object is a bit weird. Its entries do not end with a newline character (that is why we have to pass the -n flag to echo; by default echo appends a newline at the end), and it stores the hash values in binary rather than as hex text. Why is that? Nobody knows.
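The same calculation can be sketched in Python, which makes the binary hash part a little easier to see. This assumes the entries are already in the order git would store them and that each entry is mode, space, name, NUL, and the 20 raw hash bytes; tree_body and tree_hash are names I made up for illustration.

```python
import hashlib


def tree_body(entries: list[tuple[str, str, str]]) -> bytes:
    """Build the raw tree body from (mode, name, sha1-hex) entries."""
    return b"".join(
        mode.encode("ascii") + b" " + name.encode("utf-8") + b"\x00"
        + bytes.fromhex(sha)  # the hash is stored as 20 raw bytes, not hex text
        for mode, name, sha in entries
    )


def tree_hash(entries: list[tuple[str, str, str]]) -> str:
    """Prefix the body with "tree <size>" and a NUL byte, then take the SHA-1."""
    body = tree_body(entries)
    header = b"tree " + str(len(body)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


print(tree_hash([("100644", "file1.txt", "3b18e512dba79e4c8300dd08aeb37f8e728b8dad")]))
```

Comparing the printed value with the tree hash listed earlier is a quick way to check that the assumed layout is right.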
Finally, git creates a commit object to tie everything together. In our case the commit object is d389f2a5b3c591dd7f1b286cdb50bc4d26021dce. This is the hash you see with the git log command. The commit object contains information about the commit like the author, committer, commit message, a reference to the root tree object and a reference to the parent commit. (In our case we don't have a parent commit since this is our first commit, which is also called the root commit.) Let's look at what is inside our commit object with git's cat-file command.
git cat-file -p <commit object hash>
tree 82424451ac502bd69712561a524e2d97fd932c69
author kr5hn4 <email@example.com> 1565912913 +0530
committer kr5hn4 <firstname.lastname@example.org> 1565912913 +0530

Initial commit
As you know, the cat-file command does not show the metadata in object files. Let me show you how git calculated the SHA-1 sum of the commit object.
[Figure: commit object]
You can use the following command to recreate the commit hash:
[Figure: commit hash]
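As a rough Python sketch of the same calculation: the commit body is the text that cat-file prints, prefixed with "commit <size>" and a NUL byte. I am assuming a plain commit with no extra headers (such as a signature) and a message that ends with a single newline; commit_hash and the identity string in the example are hypothetical. The result only matches a real commit when the author and committer lines are byte-for-byte identical to the ones git wrote.

```python
import hashlib


def commit_hash(tree: str, author: str, committer: str, message: str,
                parents: tuple[str, ...] = ()) -> str:
    """Hash a commit built from its tree, optional parents, identities and message."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]  # a root commit has no parent lines
    lines += [f"author {author}", f"committer {committer}", "", message]
    body = ("\n".join(lines) + "\n").encode("utf-8")
    header = b"commit " + str(len(body)).encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


# Hypothetical identity; substitute the exact author/committer lines from your repo.
who = "Some Name <some.name@example.com> 1565912913 +0530"
print(commit_hash("82424451ac502bd69712561a524e2d97fd932c69", who, who, "Initial commit"))
```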
And this is how git links commits to trees and trees to blobs.
Many people like to think that git just stores the file changes between commits, but that is not the case. A git commit is like taking a snapshot of the working directory at a particular point in history, and this snapshot is a copy of the entire working directory, made up of commit, tree and blob objects. That means every time you make a commit, git stores the entire contents of the working directory at the time of the commit in the .git directory. Now, you might wonder, isn't that inefficient disk space-wise? Not really: since git uses content-based addressing, it reuses the same blob and tree objects across multiple commits if the contents of the file/directory have not changed.
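To make the reuse concrete, here is a toy in-memory store in Python. It is only a sketch of content-based addressing and not git's actual storage code; ObjectStore and put_blob are made-up names.

```python
import hashlib


class ObjectStore:
    """Toy content-addressed store (hypothetical, not git's implementation)."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}

    def put_blob(self, content: bytes) -> str:
        raw = b"blob " + str(len(content)).encode("ascii") + b"\x00" + content
        key = hashlib.sha1(raw).hexdigest()
        # Identical content always produces the same key, so it is stored once.
        self.objects.setdefault(key, raw)
        return key


store = ObjectStore()
first = store.put_blob(b"hello world\n")   # file1.txt in the first commit
second = store.put_blob(b"hello world\n")  # unchanged file1.txt in a later commit
changed = store.put_blob(b"hello git\n")   # a file whose content did change
print(first == second, len(store.objects))
# prints: True 2
```

Three "snapshots" of a file were stored, but only two objects exist, because the unchanged content maps to the same name.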
Here is a visual representation of a sample repo showing how git connects commits, trees and blobs together. This type of data structure is also called a Merkle DAG, because it's a combination of a Merkle tree and a DAG (directed acyclic graph).
That's all for this post. In the next post, we will look at how git maintains branches, defines tags, and handles its various configuration options.