Free | Lessons: 24 | Length: 3.1 hours

3.2 Git Internals

In this video I wanna give you a little bit of an inside look at how Git actually works. We're gonna talk about how it stores content and how it tracks the different versions of the code that you write.

Let's start by checking out how it stores the actual content. I want you to think about our sample project, or any project that might be using Git to track its different versions. All of these projects are gonna have two types of content: files and directories. Git has a special way to store files and a special way to store directories, so let's look at these two different ways of storing content.

First, Git uses a blob to store the content of a file. Notice that I didn't say it uses it to store the file. It just stores the content of the file, what is inside the file. So this should raise two questions. First of all, what is a blob? And second of all, what happens to the file's name and the file's permissions? How about the file's path? Where is all that information stored? We're gonna come back to that second question in a minute.

But first of all, what is a blob? A blob is a type of Git object. Git uses several different types of objects to store content within a Git repository, and these objects are referenced by a SHA-1 hash. I mentioned this briefly when we were talking about commits, if you recall. Basically, to hash something is to convert its content into a fixed-length string, which for SHA-1 is 40 hexadecimal characters. This 40-character string is, for all practical purposes, unique to the content that was passed into the hash function. In fact, if we changed a single character in that content, we would get a different SHA-1 hash out. SHA-1 is not the only hashing algorithm there is, but it is the one that Git uses. So it's this hash that identifies every piece of content within your Git repository. In fact, these hashes are also used to store the Git objects in the Git object database.
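To make the hashing idea concrete, here is a minimal Python sketch. It relies on the fact that Git hashes a short header of the form `blob <size>\0` followed by the file's bytes, not the bare content on its own. The function name `blob_hash` is made up for this illustration.

```python
import hashlib


def blob_hash(content: bytes) -> str:
    """Return the SHA-1 ID that Git would assign to a blob with this content."""
    # Git hashes a small header ("blob <size in bytes>\0") plus the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


original = blob_hash(b"hello world\n")
changed = blob_hash(b"hello World\n")  # one character differs

print(original)             # 40 hexadecimal characters
print(changed)              # a different 40-character ID
print(original == changed)  # False
```

If you have Git installed, running `git hash-object` on a file with the same content should print the same ID as `blob_hash`.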
If you open up the .git folder within your project and look inside the objects folder, you should see something like this. As you can see, Git splits up each hash: the first part is used for a folder name and the rest is used for the name of a file inside that folder. Together they uniquely identify the contents of these files.

Now, what is actually in these files that you're looking at, the ones with the hashes as file names? Well, it's not just the plain-text content of your files. We're gonna talk about specifically what content goes into these, but what you need to know is that it has been compressed using a data compression library called zlib. It's this compression that keeps the .git folder at a reasonable size.

So back to blobs. A blob is one of these types of Git objects that are identified by a hash and stored in those files. Inside each blob object is a bit of header information followed by the contents of one of our files, all compressed using the zlib compression library.

Then we need to be able to store directories, and for that Git uses a tree. The tree object is created in a similar way to the blob object, in that it's a bunch of content that is hashed, and that hash is the ID for the file that holds the compressed version of that content. However, the content that makes up a tree is a little different from the blob content. First of all, there's some header information, and then, for every file and directory inside of that directory, there are four things listed. There's the file permissions, or the mode. There's the object type, whether it's a blob or a tree, in other words, whether it's a file or a directory. There's the SHA-1 hash that identifies that file or directory.
And then there is the file or directory name, the name that you gave it.

This is an example of a tree. I've taken the fake project that we've been working on throughout this course and used one of Git's more low-level commands to print out the contents of one of these objects. As you can see there, the hash that I passed to the command is the hash of a specific tree object. And as you can see, it has four lines of content within it. Each line has the file mode, the type of object (in this case, they're all blobs), the hash that identifies that file, and the name that we have given that file. So that is how a tree is structured.

One more type of object that we wanna talk about is a commit. You know that a commit is how Git stores the "snapshots" that we take of our project as time goes on. You have already made commits, so you have created this type of Git object, and those objects are in the Git object database inside of our project folder. Of course, a commit is created in exactly the same way as the other objects, since it is an object. Its content is hashed, that hash becomes the identifier and the file name, and the content that we originally hashed is compressed and put inside that file.

So what content are we hashing in the case of a commit? Well, we're hashing the author information and the committer information. In our case, the author and committer information is the same. However, you can sometimes have a different person author the code than the person who ends up committing it, so Git allows you to keep that information separate. Then, of course, we have the commit message, the first line of which is sometimes called the subject. And then we have the SHA-1 hash of any parent commits. A parent commit is just the commit that was made before this commit.
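The storage scheme described so far (header, hash, folder and file split, zlib compression) can be sketched in Python as follows. This is an illustration under stated assumptions, not Git's own code: the helper names `encode_object`, `tree_body` and `commit_body` are hypothetical, the author line is made-up sample data, and the tree sort is simplified to files only. One detail the pretty-printed tree listing hides is that the stored tree keeps only the mode, name, and raw hash of each entry; the object type shown in the listing is derived from the mode.

```python
import hashlib
import zlib


def encode_object(obj_type: str, body: bytes):
    """Return (object_id, relative_path, compressed_bytes) for one Git object."""
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\0"
    raw = header + body
    object_id = hashlib.sha1(raw).hexdigest()
    # The first two hex characters name the folder, the other 38 name the file.
    path = f"objects/{object_id[:2]}/{object_id[2:]}"
    return object_id, path, zlib.compress(raw)


def tree_body(entries):
    """Build a tree's body from (mode, name, object_id) tuples.

    Simplification: entries are sorted by name, which is only correct here
    because this sketch handles plain files and no subdirectories.
    """
    body = b""
    for mode, name, object_id in sorted(entries, key=lambda entry: entry[1]):
        body += mode.encode() + b" " + name.encode() + b"\0" + bytes.fromhex(object_id)
    return body


def commit_body(tree_id, parent_ids, author, committer, message):
    """Build a commit's body: tree, parents, author, committer, blank line, message."""
    lines = [f"tree {tree_id}"]
    lines += [f"parent {parent}" for parent in parent_ids]
    lines += [f"author {author}", f"committer {committer}", "", message]
    return ("\n".join(lines) + "\n").encode()


# Hypothetical sample data for the README and library.js files.
readme_id, readme_path, _ = encode_object("blob", b"# Sample project\n")
library_id, _, _ = encode_object("blob", b"function add(a, b) { return a + b; }\n")
tree_id, _, _ = encode_object(
    "tree",
    tree_body([("100644", "README", readme_id), ("100644", "library.js", library_id)]),
)
who = "Jane Doe <jane@example.com> 1700000000 +0000"  # made-up author and committer
commit_id, commit_path, _ = encode_object(
    "commit", commit_body(tree_id, [], who, who, "Initial commit")
)

print(readme_path)  # objects/<2 characters>/<38 characters>
print(commit_path)
```

Note that the commit body only mentions the tree's hash, never the file contents themselves.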
You'll recall from our previous video that the working directory starts out with the exact content of the last commit, and that the changes you make in a commit are just changes from that last commit. So it's important that Git be able to reference the commit that took place before this one, and included inside of the commit object is the SHA-1 hash of the parent commit. As you'll see when we get into branching and merging, a given commit can have multiple parents. We're gonna talk about this a bit more later in this screencast, but when you have two branches that you're merging together, that merge commit has two parent commits. So there may be more than one parent SHA-1 within a commit object.

Finally, there is the SHA-1 of the tree that the commit points to. A commit object points to the tree object that represents the state of the project at that commit. The commit object itself doesn't hold any of the content or changes that you made. It just points to a tree, which in turn points to that content.

So let's see how exactly this works. How does Git track this content? Let's say that we have a little project here. We've created a README file and a library.js file inside of our folder, and we've committed those. Now, it's obvious here that Commit 1 is the commit object that we just talked about. But what might not be obvious is that the root directory is a tree object, and those two files are blob objects inside the Git database. So what you see here is a view of what's going on in the Git database. Remember, I just said that the commit object points to the tree, which points to the files or other trees inside that directory. So in this case, our Commit 1 has all the commit information inside of it, and one of those bits of information is the hash of our tree here.
In this case, our tree is just the root directory of our project. That tree object, as you saw previously, lists the files that are inside of it and points to them by their hashes. So this is how Git stores a commit.

Now, what happens when we make a second commit? Let's say that we changed some things in the library.js file and we commit again. We changed the library.js file, so that means we're gonna have to create another blob that represents the new library.js file. Since we created a new blob, we're going to have a new hash for it. And since it has a new hash, we also need to create a new tree, because the tree object that you see there holds the reference to the old library.js blob. We don't get rid of the old blob or the old tree. We just create new ones. And when we create that new commit, the new commit points to that new tree object. So here you can see that Commit 2 points to a brand new tree and a brand new blob that represents our library.js file, and these represent the changes that have been made in that commit.

We also need to represent that README file, but the README file hasn't changed. So the way Git does it is this: since the README file has not changed, the new tree object that the commit points to points to the old README blob, because the hash for the README file has not changed, since the contents inside it have not changed. So the hash listed in the new tree is going to be the same, and it's going to point to the same README blob.

This would obviously continue as we add files, as we remove files, and as we make more changes. The commit will always point to the tree that represents the full directory. If a file or directory has not changed, we just point back to the existing object for it. However, if it has changed, then when we create a commit we create a new blob or a new tree. So this is the way that Git tracks changes.
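The two-commit walkthrough above can be imitated with a small in-memory model in Python. This is a deliberately simplified sketch: the dictionary `store` stands in for the Git object database, the tree and commit bodies are plain text rather than Git's real formats (so the IDs will not match real Git), and the names `put`, `snapshot` and `commit` are hypothetical.

```python
import hashlib

store = {}  # object ID -> (type, body); a stand-in for the Git object database


def put(obj_type: str, body: bytes) -> str:
    """Store an object under the hash of its content and return that hash."""
    raw = f"{obj_type} {len(body)}".encode() + b"\0" + body
    object_id = hashlib.sha1(raw).hexdigest()
    store[object_id] = (obj_type, body)  # identical content lands on the same key
    return object_id


def snapshot(files: dict):
    """Store one blob per file plus a tree listing them. Returns (tree_id, entries)."""
    entries = {name: put("blob", content) for name, content in files.items()}
    listing = "".join(
        f"100644 blob {object_id}\t{name}\n" for name, object_id in sorted(entries.items())
    )
    return put("tree", listing.encode()), entries


def commit(files: dict, parents: list, message: str):
    """Store a commit that points at a tree. Returns (commit_id, tree_id, entries)."""
    tree_id, entries = snapshot(files)
    lines = [f"tree {tree_id}"] + [f"parent {p}" for p in parents] + ["", message]
    return put("commit", "\n".join(lines).encode()), tree_id, entries


readme = b"Sample project\n"
c1, t1, e1 = commit({"README": readme, "library.js": b"version 1\n"}, [], "Commit 1")
c2, t2, e2 = commit({"README": readme, "library.js": b"version 2\n"}, [c1], "Commit 2")

print(e1["README"] == e2["README"])          # True: the README blob is reused
print(e1["library.js"] == e2["library.js"])  # False: a new blob was created
print(t1 == t2)                              # False: so a new tree was needed
print(len(store))                            # 7: 3 blobs, 2 trees, 2 commits
```

The old library.js blob and the old tree are still in `store` after the second commit, which mirrors the point that nothing is thrown away.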
And as you can see, it's not really tracking changes, in the sense that it's not creating a new blob that holds only the changes made since the last library.js file was committed. It's creating a new object that has every bit of the content. You could take that second library.js blob right there, decompress it, and you would have the full contents of the latest library.js file. So it doesn't just hold the changes that were made. This is one case where Git can be a little bit faster when you're doing reverting operations, because it doesn't have to add up all the changes between your latest commit and the commit that you want to revert back to and slowly roll back those changes. It just has to jump all the way back to the commit that you're trying to revert to, and the whole contents of that file are there inside of the blob.

So I wanna wrap up this video by talking about a couple of ways to reference a specific commit. As you learn more about Git and more Git commands, you'll find several commands where knowing how to reference a specific commit is very important.

Of course, the most obvious way to do it is using the SHA-1 hash, and here's a little example diagram of that. As you can see, we have a commit, and the hash of that commit will point to the state of the project at that commit. So that's pretty straightforward. I should mention that you don't always have to use the full 40-character string. Often, just six or seven characters is gonna be enough to uniquely identify that commit. As long as you don't have two commits with the same first few characters, and as long as you take enough characters to uniquely identify that commit, you don't have to supply the whole 40 characters.

Then there are branches. We're gonna talk about branches more in the next couple of videos, but basically they are exactly what they sound like.
They're a way to branch off and experiment with something different in your code without hurting the main branch. All this time that we've been working, we've been working on the master branch, which is the default branch that Git creates for you. Git keeps a reference for every branch, and that reference just points to the latest commit that you made on that specific branch. So in this example, you can see that the master branch reference points to a commit, which points to the actual tree.

There's also a bit of a narrower version of a branch reference, which is called HEAD, in all uppercase letters. HEAD is a reference that Git keeps for you that points to the latest commit on the current branch. So when you're on the master branch, HEAD points to the latest commit on the master branch. When you're on the my-awesome-feature branch, HEAD will point to the latest commit on the my-awesome-feature branch.

Finally, we have ancestry references. There are two types of ancestry references that we're going to discuss here. Both of these are not really references in themselves. They're a way to modify the references we have been talking about. I'm gonna use HEAD in these examples, but a branch name or a SHA-1 hash will work just as well as the base for these ancestry references.

If you look at our little diagram here, you can see that HEAD is pointing to Commit 4. That was the most recent commit on our branch, and as you can see it goes back so that Commit 1 is the commit that we are farthest away from. Commit 1 was the first commit in this project, and Commit 4 is the latest commit. As you can see, what I've done here for Commit 3 is add a reference pointing to it that says HEAD~. You can use a tilde to get a reference to a commit's parent. So HEAD points to Commit 4, the latest commit, but HEAD~ points to the parent of Commit 4, which is Commit 3.
Now of course, you can add a number after the tilde to reach commits that are even further back. So HEAD~2 gets the grandparent of HEAD. Since HEAD in this case is Commit 4, HEAD~2 references Commit 2, and HEAD~3 references Commit 1.

Then you can also use the caret (^) to get the parents of a merge commit. You'll remember that I said merge commits are where two branches that we have made merge back together. In this case, you can see that Commit 1 is on one branch and Commit 2 is on a separate branch. After Commit 2, we wanted to merge these two branches back together, and we did that in Commit 3. What we did is merge the branch that had Commit 2 back into the branch that had Commit 1. You could say Commit 1 is on the master branch, Commit 2 is on a specific feature branch, and Commit 3 is where we merged the feature branch into the master branch. So Commit 1 is on the master branch, and that is the branch we made the merge commit from. If that didn't make sense, come back to this after you have watched the branching and merging videos, and it will make a little bit more sense to you.

As you can see, we use HEAD^ to reference the parent commit on the branch that we merged into. So if Commit 3 is HEAD, which it is in this case, HEAD^ references Commit 1, and HEAD^2 references Commit 2, because that is the second parent, the latest commit on the other branch. In this case, HEAD^ and HEAD~ would both equal Commit 1, because both of them select the first parent of Commit 3.

And that's it for our three videos. Now, we're ready to dive into some more practical Git commands.
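As a recap of the tilde and caret rules, here is a small Python sketch that resolves references against a hand-written commit graph. It is a toy model, not how Git implements this: the function `resolve`, the commit names `c1` to `c4`, and the dictionaries `refs` and `parents` are all made up for illustration, and HEAD is mapped straight to a commit for simplicity.

```python
import re


def resolve(ref: str, refs: dict, parents: dict) -> str:
    """Resolve 'HEAD', 'HEAD~2', 'HEAD^2' and similar to a commit name."""
    match = re.fullmatch(r"([^~^]+)((?:[~^]\d*)*)", ref)
    if match is None:
        raise ValueError(f"cannot parse {ref!r}")
    base, suffix = match.groups()
    commit = refs.get(base, base)  # a named reference, or already a commit name
    for operator, digits in re.findall(r"([~^])(\d*)", suffix):
        count = int(digits) if digits else 1
        if operator == "~":
            for _ in range(count):
                commit = parents[commit][0]  # follow the first parent, count times
        elif count > 0:
            commit = parents[commit][count - 1]  # pick the count-th parent
        # '^0' leaves the commit unchanged
    return commit


# Linear history: Commit 1 <- Commit 2 <- Commit 3 <- Commit 4 (HEAD)
linear = {"c4": ["c3"], "c3": ["c2"], "c2": ["c1"], "c1": []}
refs = {"HEAD": "c4", "master": "c4"}
print(resolve("HEAD~", refs, linear))   # c3
print(resolve("HEAD~2", refs, linear))  # c2
print(resolve("HEAD~3", refs, linear))  # c1

# Merge: Commit 3 (HEAD) has two parents, Commit 1 (master) and Commit 2 (feature).
merged = {"c3": ["c1", "c2"], "c1": [], "c2": []}
merge_refs = {"HEAD": "c3"}
print(resolve("HEAD^", merge_refs, merged))   # c1
print(resolve("HEAD^2", merge_refs, merged))  # c2
print(resolve("HEAD~", merge_refs, merged))   # c1
```

In this sketch, asking for the parent of the very first commit raises an `IndexError`, because that commit has no parents to follow.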
