Krishna Biradar
  • November 14, 2019

Deconstructing a git commit

I’ve spent the last few weeks diving deep into understanding Git and how it really works behind the scenes. At first, it felt a bit overwhelming, but once I started grasping the core ideas, like how Git’s 3-way merge and interactive rebase actually function, things started to make sense. I began to feel much more confident using Git.

Now that I’ve gotten the hang of it, let me take you on a little journey into the world of Git. Every Git repo contains a .git directory, and everything Git tracks or is aware of is stored there. The moment you ask Git to manage a folder, it creates this .git directory, and that’s where all the magic happens.

Let’s take the first step and create a new Git repository with the git init command. Once we do, we’ll peek inside the .git directory to see what’s going on in there.

[Figure: directory tree of a fresh Git repo]

I’ve given a quick overview of what each file and folder inside the .git directory does. Don’t worry if it’s still a bit fuzzy right now—that’s completely normal! Once we go through the process of making our first commit, everything will start to make sense.

So, let’s jump right in by making our very first commit and see how things change in the .git directory.

First, let’s create a file named file1.txt and add the text “hello world” inside. Then, we’ll commit it and see what happens!

[Figure: file1.txt]

Now let’s create our first commit by committing file1.txt and see how it changes the .git directory.

 git add file1.txt
 git commit -m "Initial commit"

[Figure: the .git directory after the first commit. Your object file names may differ, since the commit hash depends on your Git configuration (author name and email) and on the time of the commit.]

After the commit, you might notice that three new directories have popped up in the objects folder, each containing a single file. So, what are these files all about, and why did Git create them?

In Git terms, these files are called object files. Git uses four different types of object files:

  1. Blob: This is where Git stores the actual contents of a file.
  2. Tree: This holds information about files and directories, such as their names and permissions. It references blob objects and other tree objects.
  3. Commit: A commit refers to a specific tree object (think of it as a snapshot of the project at a particular moment in time). It also stores metadata like the timestamp, author, and a reference to previous commits (the parent commit).
  4. Tag: A tag marks a specific commit as important. For example, tags are often used to highlight release versions.

If you try to peek inside any of these object files using the cat command, you’ll just see a bunch of gibberish. That’s because Git compresses the contents of these files with Zlib before storing them.

[Figure: raw object files]

You can decompress these files using any tool that supports zlib. If your OpenSSL build includes zlib support (not every build does), you can run the command openssl zlib -d < <fileName>. Otherwise, use any other tool that can decompress zlib data, whatever works best for you!
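If you'd rather not depend on OpenSSL, Python's built-in zlib module can do the same job. This is only a small sketch: the sample bytes below are a stand-in I built in memory, and on a real repository you would read the bytes of a file under .git/objects instead.

```python
import zlib


def inflate(raw: bytes) -> bytes:
    """Decompress the raw bytes of a loose object file."""
    return zlib.decompress(raw)


# Stand-in for the contents of a file under .git/objects.
# On a real repo you would use: raw = open(path, "rb").read()
sample = zlib.compress(b"blob 12\x00hello world\n")

# repr() makes the NUL byte and the trailing newline visible.
print(repr(inflate(sample)))
```

Printing the repr rather than the raw bytes keeps the terminal readable, since decompressed objects can contain NUL and other binary bytes.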

[Figure: decompressed object files]

If you calculate the SHA-1 hash of any of these object files after decompressing them, you’ll notice something interesting—the SHA-1 hash matches the file’s name, including the two-character prefix of the folder where the file is stored. This is how Git names and organizes its object files.

[Figure: shasum of decompressed object files]

This happens because Git uses content-based addressing: it names its object files based on their content. That’s one of the reasons Git is also known as a content-addressable filesystem.

Now, let’s break down what just happened when we made that commit and see if we can replicate the hashes ourselves. Creating a Git commit happens in two stages: staging and committing. But for simplicity, we’ll treat these two steps as one. (If you’re curious about why Git uses a staging area, you can read more about it here.)

When we create a commit, one of the first things Git does is create a blob object for each file in the repository. In our case, we only had one file, file1.txt. So Git took the content of file1.txt (just the content, without any additional information like the file name or permissions) and prefixed it with a header: the word “blob”, a space, the content size in bytes, and a NUL character. Then Git calculated the SHA-1 hash of the header plus the content and used that hash as the object’s name. Finally, Git compressed the header and content with zlib before storing them in the objects directory.
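The blob naming step can be sketched in a few lines of Python. The function name blob_hash is my own, and I'm assuming file1.txt contains "hello world" followed by a newline, which is what gives a content size of 12 bytes.

```python
import hashlib


def blob_hash(content: bytes) -> str:
    """Return the object name Git would give a blob with this content."""
    # Header: the word "blob", a space, the size in bytes, then a NUL byte.
    header = b"blob " + str(len(content)).encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


# Assumed content of file1.txt, including the trailing newline.
print(blob_hash(b"hello world\n"))
```

If everything lines up, the printed value should be the same one that git hash-object file1.txt reports for that file.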

[Figure: the blob object. The text blob tells Git that this is a blob object, 12 is the content size of this blob object, \x00 is the hex code for the NUL character (the backslash is for escaping), and the actual contents of the file follow.]

Let’s verify that this is true by running the following command.

[Figure: recomputing the blob hash]

As you can see, the hash value matches one of the object names in the objects directory from our first commit. Git takes the first two characters of the hash as the directory name and uses the rest as the file name. This keeps any single directory from growing too large, since many file systems slow down or hit limits when one directory holds a huge number of files.
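Here's a rough sketch of that storage layout in Python. The helper name write_loose_object is hypothetical, and it writes into a temporary directory rather than a real .git folder, so treat it as an illustration of the naming scheme and not as a replacement for Git itself.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def write_loose_object(objects_dir: Path, data: bytes) -> Path:
    """Store header+content under <first 2 hex chars>/<remaining 38>."""
    name = hashlib.sha1(data).hexdigest()
    path = objects_dir / name[:2] / name[2:]
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(data))
    return path


with tempfile.TemporaryDirectory() as tmp:
    objects_dir = Path(tmp) / "objects"
    path = write_loose_object(objects_dir, b"blob 12\x00hello world\n")
    # Path relative to the objects directory: two-character folder, then the rest.
    relative = path.relative_to(objects_dir).as_posix()
    # Reading the file back and decompressing it returns the original bytes.
    round_trip = zlib.decompress(path.read_bytes())
    print(relative)
```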

So, that’s how Git stores the content of files, but how does Git know which file this content belongs to?

That’s where another type of object, called a tree object, comes in. A tree object is a simple structure that holds references (or pointers) to blob objects and other tree objects. Essentially, it represents the contents of a directory: for each entry it records the file mode, the file or directory name, and the SHA-1 of the corresponding blob or tree object.

In our case, we had just one directory—the working directory—with a single file in it, file1.txt. If we had more files or subfolders, they would all be represented in this tree object. And here’s how our tree object looks:

[Figure: the tree object. The text tree tells Git that this is a tree object, 37 is the content size of this tree object, \x00 is the hex code for the NUL character (the backslash is for escaping), and a list of the contents of the directory follows.]

The first six characters after the NUL character (100644) represent the file mode. Git uses the following modes:

  • 100644 for a normal file
  • 100755 for an executable file
  • 040000 for a directory
  • 120000 for a symbolic link

In the raw object, the mode is followed by a space, the file or directory name, a NUL character, and then the SHA-1 hash of the referenced object. Tools such as git cat-file -p also print the type of each entry (tree, blob, or commit) next to its hash. An object referenced by a tree may be a blob, representing the contents of a file, or another tree, representing the contents of a subdirectory.
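To make the entry layout concrete, here is a small parser sketch for the body of a decompressed tree object (the part after the "tree <size>" header). As far as I understand the raw format, each entry is the mode, a space, the name, a NUL byte, and a 20-byte binary SHA-1, and the object type is implied by the mode. The names parse_tree_body and MODE_TO_TYPE are mine, and the sample body is built in memory.

```python
import hashlib

# The raw object writes the directory mode without the leading zero (40000).
# 160000 marks a submodule entry, which points at a commit.
MODE_TO_TYPE = {
    "100644": "blob",
    "100755": "blob",
    "120000": "blob",
    "40000": "tree",
    "160000": "commit",
}


def parse_tree_body(body: bytes):
    """Return a list of (mode, type, sha1_hex, name) tuples."""
    entries = []
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\x00", space)
        mode = body[pos:space].decode()
        name = body[space + 1:nul].decode()
        sha_hex = body[nul + 1:nul + 21].hex()  # 20 raw bytes -> 40 hex chars
        entries.append((mode, MODE_TO_TYPE.get(mode, "unknown"), sha_hex, name))
        pos = nul + 21
    return entries


# Sample body: one entry for file1.txt pointing at the "hello world" blob.
blob_sha = hashlib.sha1(b"blob 12\x00hello world\n").digest()
entries = parse_tree_body(b"100644 file1.txt\x00" + blob_sha)
for entry in entries:
    print(*entry)
```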

You can use the following command to recreate the tree hash:

[Figure: recreate tree hash]

The tree object is a bit unusual. It doesn’t include a newline character at the end, which is why we need to pass the -n flag to echo: this tells echo not to add the newline it normally would. Additionally, the tree object stores the hash values in binary form rather than as hex text. Why does Git do this? I don’t know of a documented reason; it seems to be one of those quirks of Git’s design.
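Here's a rough Python equivalent of that calculation, which sidesteps the echo quirks because bytes in Python never get a newline appended. The helper hash_object is my own name, and I'm again assuming file1.txt holds "hello world" plus a newline.

```python
import hashlib


def hash_object(obj_type: str, body: bytes) -> str:
    """SHA-1 over '<type> <size>' + NUL + body, as described above."""
    header = f"{obj_type} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


blob_sha = hash_object("blob", b"hello world\n")

# One entry: mode, space, name, NUL, then the blob hash as 20 raw bytes.
tree_body = b"100644 file1.txt\x00" + bytes.fromhex(blob_sha)
tree_sha = hash_object("tree", tree_body)

print(len(tree_body))  # 37, the size that appears in the tree header
print(tree_sha)
```

The 37 comes from 6 bytes of mode, 1 space, 9 bytes of file name, 1 NUL, and 20 bytes of binary hash.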

Finally, Git creates a commit object to tie everything together. In our case, the commit object is d389f2a5b3c591dd7f1b286cdb50bc4d26021dce. This is the hash you see when you run the git log command. The commit object holds crucial information about the commit: who made the commit (the author), who applied it (the committer), the commit message, a reference to the tree object representing the state of the working directory, and a reference to the parent commit (if there is one). Since this is our first commit, there’s no parent—making it a root commit.

Now, let me show you how Git calculated the SHA-1 sum of this commit object.

[Figure: the commit object. The word commit tells Git that this is a commit object, 195 is the content size of this commit object, \x00 is the hex code for the NUL character (the backslash is for escaping), and a reference to the root tree object follows, along with other information like the author, the committer, and the commit message.]

You can use the following command to recreate the commit hash:

[Figure: recreate commit hash]
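The same idea can be sketched in Python. The author and committer strings below are hypothetical placeholders (name, email, Unix timestamp, and timezone), so the resulting hash will not match the one from my repository; plug in the exact values from your own commit to reproduce yours.

```python
import hashlib


def hash_object(obj_type: str, body: bytes) -> str:
    """SHA-1 over '<type> <size>' + NUL + body."""
    header = f"{obj_type} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def commit_body(tree_sha, author, committer, message, parents=()):
    """Assemble the text of a commit object; a root commit has no parents."""
    lines = [f"tree {tree_sha}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {author}", f"committer {committer}", "", message]
    return ("\n".join(lines) + "\n").encode()


# Root tree for a directory that contains only file1.txt.
blob_sha = hash_object("blob", b"hello world\n")
tree_sha = hash_object("tree", b"100644 file1.txt\x00" + bytes.fromhex(blob_sha))

# Hypothetical identity and timestamp, for illustration only.
who = "Jane Doe <jane@example.com> 1573689600 +0000"

root = hash_object("commit", commit_body(tree_sha, who, who, "Initial commit"))
child = hash_object("commit", commit_body(tree_sha, who, who, "Second commit", parents=[root]))
print(root)
print(child)
```

Because the author, committer, and timestamps are part of the hashed text, two people committing identical files still end up with different commit hashes.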

And this is how Git links commits to trees, and trees to blobs.

A common misconception is that Git stores only the changes between commits, but that’s not exactly how it works. Instead, each Git commit is like taking a snapshot of the entire working directory at a particular point in time. This snapshot consists of commit, tree, and blob objects. So, every time you make a commit, Git saves the entire contents of the working directory in the .git directory.

You might be thinking, “Isn’t that inefficient in terms of disk space?” Actually, no. Thanks to Git’s content-based addressing, it reuses the same blobs and tree objects across multiple commits if the contents of the files or directories haven’t changed. So, Git efficiently manages storage by avoiding duplication.
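A toy in-memory object store shows why unchanged files cost nothing extra. This is only a sketch under my own assumptions: the store is a plain dict, the helper names are made up, and entries are sorted by name, which ignores the extra ordering rules real Git applies to directories.

```python
import hashlib

store = {}  # object name -> header + body, standing in for .git/objects


def put(obj_type: str, body: bytes) -> str:
    """Add an object to the store; identical content maps to the same key."""
    data = f"{obj_type} {len(body)}".encode() + b"\x00" + body
    name = hashlib.sha1(data).hexdigest()
    store.setdefault(name, data)
    return name


def put_tree(entries) -> str:
    """entries: iterable of (mode, name, sha1_hex) tuples."""
    body = b"".join(
        f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(sha)
        for mode, name, sha in sorted(entries, key=lambda e: e[1])
    )
    return put("tree", body)


# Snapshot 1: only file1.txt.
b1 = put("blob", b"hello world\n")
t1 = put_tree([("100644", "file1.txt", b1)])

# Snapshot 2: file1.txt unchanged, file2.txt added.
b1_again = put("blob", b"hello world\n")
b2 = put("blob", b"second file\n")
t2 = put_tree([("100644", "file1.txt", b1_again), ("100644", "file2.txt", b2)])

print(b1 == b1_again)  # True: the unchanged file reuses the same blob
print(len(store))      # 4: two blobs and two trees, not five objects
```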

Here’s a visual representation of a sample repository, showing how Git connects commits, trees, and blobs together.

[Figure: commit-graph.png]

This type of data structure is also called a Merkle DAG, because it’s a combination of a Merkle tree and a DAG (directed acyclic graph).

That’s all for this post. In the next one, we’ll dive into how Git handles branches, defines tags, and explores various Git configuration options. Stay tuned!
