how_does_git_work.tex

\section{How does Git work internally?}\label{howdoesgitwork}

To understand how Git works we need to know some basic commands.

\subsection {Basic local commands}

Just like any other VCS Git stores its data in the repository directory.
There are two major ways to get a repository.
One would be by simply creating a new empty Git repository in a directory of choice:
\begin{lstlisting}
$ git init
\end{lstlisting}
This command creates a new subdirectory .git/ where all the version control
data is stored. \cite[page 55]{gitinternals2008} \\
The second is to clone an existing repository over an URL. Possible protocols
are FILE, SSH, HTTP(S), or GIT. Of course the target location must host a
Git repository:
\begin{lstlisting}
$ git clone 
    git://github.com/rails/rails.git
\end{lstlisting}
Git will clone the whole Ruby on Rails project from github and we are
ready to work. \cite[page 56]{gitinternals2008} \\

After we have a new repository we can add some content:
\begin{lstlisting}
$ touch README
$ git status
$ git add README
$ git commit -m "Added README file"
\end{lstlisting}

This code produces an empty README file. The status command displays any change
of the repository. There we can see that the README file has been added and we
track this file simply by telling Git to add the file to the index. Additionally
Git now stages the file, but we are going to cover the staging feature in the next
section. Then we commit the changes with a short and descriptive message. After
this, Git stores the changes into the repository and points the default master
branch to the new commit. \cite[page 55]{gitinternals2008} \\

\subsection {History}

With every commit Git will append information to the history of the project. This history
can be browsed using the \emph {git log} command:

\begin{lstlisting}
$ git log
commit 8133a...
Author: User <user@provider.org>
Date:   Wed Nov 17 19:29:03 2010 +0100

    [commit message]
    
commit 3131f...
Author: User <user@provider.org>
Date:   Wed Nov 17 19:29:03 2010 +0100

    [commit message]
    
...
\end{lstlisting}

Each commit consists of four parts: the commit SHA1 string, the author with his E-Mail address, the commit date and a short descriptive message.
But this is not all that can be seen in the history. With several additional commands the history can be formatted or all the changes can be displayed.
As an example this code formats the history quite differently:

\begin{lstlisting}
$ git log 
   --pretty=format:"%h, %ar : %s"
8133ae1 4 days ago : fix issue 8645
741a1bc 4 days ago : merged crazyidea
3131fd0 4 days ago : added scripts
\end{lstlisting}

There are several more formatting place holders that can be used. \cite[chapter 2.4]{gitpro2009} \\

As an example of the capability of the log command, there is a shell script\footnote{https://github.com/d1rk/git-ftp on 21.11.2010} on GitHub that allows to upload a Git repository over the FTP protocol. This script uses the \emph{git log} command to get enough information to upload the changes to the FTP server. 

\subsection{Git over network}

After there has been committed a reasonable amount of work in the local Ruby on Rails
repository we can share the work telling Git:

\begin{lstlisting}
$ git push origin master 
\end{lstlisting}

This command will only succeed if the user has the appropriate access rights to the
project. But if we had them, the master parameter tells Git to take the default
branch and transmit the changes to the host. We could also specify any other
branch and transmit it to the destination host. \\ 

Also one should know that the origin parameter is our destination. It is possible to
add many different socalled remotes:

\begin{lstlisting}
$ git remote add [name] [url]
\end{lstlisting}

This command links a short name to any URL and the work can be shared over any network Git is designed for. \cite[chapter 2.9]{gitpro2009}  \\

The last very important command is:
\begin{lstlisting}
$ git fetch origin
\end{lstlisting}

The fetch command is used to get work from other people of the project since
the last fetch. Very important is that this command does not
automatically merge the new updates into the local master branch. It just
downloads the updates and creates a new branch \emph{FETCH\_HEAD}. Then this branch \emph{FETCH\_HEAD} has to
be manually merged into the desired destination branch. 
If this feature is not needed following command can be executed:
\begin{lstlisting}
$ git pull origin
\end{lstlisting}
Git will automatically merge the changes into the master branch.
\cite[chapter 2.9]{gitpro2009} \\

\subsection {Staging}

Now that we covered the fundamental basics of Git, we can take a deeper look at
some parts of the logic used by Git. One high level feature of Git is the
\emph{interactive adding}. We create a new repository, add a file and commit
it. If we modify the same file again and try to commit with the command above,
Git will not apply the modified file to the repository. This is because this file is
not explicitly added to the socalled staging mode. The modified file has to be explicitly added to the staging mode and has to be commit. This feature gives much
more control over what goes into the repository or not. This feature is often
ignored and it can be simply committed a unstaged file with this command:

\begin{lstlisting}
$ git commit -a -m "message"
\end{lstlisting}

The parameter \emph{-a} automatically adds the file to the staging mode and
commits it. \cite[chapter 2.2]{gitpro2009}

The picture\footnote{http://progit.org/figures/ch2/18333fig0201-tn.png accessed on 
09.11.2010} taken out of the Pro Git book illustrates how files change their
lifecycle modes:

\begin{figure}[h]
  \centering
  \includegraphics{img/file_status_lifecycle}
  \caption{File status lifecycle}
  \label{fig:File status lifecycle}
  
\end{figure} 

As we can see there are four modes a file can have. First we have to track them
once to get them into the lifecycle. If we modify we can use \emph{-a} on the
commit command to automatically stage and commit them, or we can take alot more
control of our repository and stage them manually and avoid mistakes and
redundant changes that do not belong to a certain commit.
We can simply get information about the mode using the Git command
\emph{status}. \cite[chapter 2.2]{gitpro2009} 

\subsection{How does git store its data?}
Git uses a unique storage model compared to other common DVCS like Baazar or Mercurial. 
In traditional distributed or centralized version control the system keeps differences between revision A and B. 
These changes are saved and marked with a revision number. 
This way one can, at any time, switch back to any revision and rebuild the complete content.
However, this does have some major downsides which we will see later. \\
Git handles it quite differently. It takes socalled \emph{snapshots} in every
new version. So if file A changes in the repository the next commit Git will take a new snapshot of this file and all other files. 
To reduce overhead unchanged files are linked to the identical file in the previous version. 
It creates a new mini filesystem rather than a VCS, enabling all the features centralized version 
control would give and gives some additional on top of it. \cite{gitpro2009} 

\subsection{Checksums for integrity}

\zitat{ Every single piece of data, when Git tracks your content, we compress it,
we delta it against everything else, but we also do a SHA1 hash of the content
and we actually check it when we use it. If you have discorruption, if you have
DRAM corruption, if you have any kind problems at all Git will notice them. It is
not a question of if, it is a guarantee } \cite{googletechtalk2007}

Git is secure. As we can see in Linus Torvalds speech, Git uses the SHA1 hash
as an integrity feature. Secure Hash Algorithm 1 is a string with 40 hexadezimal
characters. It creates a checksum of two pieces of data and makes it nearly
impossible that 2 files have the same checksum. This makes discorruption or loss
of data impossible without notice. \cite[chapter 1.3]{gitpro2009}

\subsection {Branching}

Branching is a major feature in Git. But also like the staging mode,
branching can be skipped and it can just be sticked to the master
branch. \\
Assumed we have a new feature request for the Ruby on Rails project called
newfeature we add a branch:
\begin{lstlisting}
$ git branch crazyidea
$ git checkout crazyidea
\end{lstlisting}

To understand what branching does technically we have to take a deeper look at
what a commit does internally: \
After executing the commit command Git creates a new commit object that
contains a pointer to the snapshot of the content staged, some metadata and
pointers to previous commits. If there are zero previous commits no pointer
is stored and if a branch is merged the commit can point to multible previous commits.

After staging a single file, Git creates a checksum and stores the version in the repository. Additionally it adds the
checksum to the staging area. This file is now a blob object.
When executing the commit command, Git will create checksums for each
subdirectory and adds this tree object to the repository. Finally Git creates
the commit object that contains the metadata and a pointer to the root object
tree. This is necessary to be able to recreate the snapshot.

We have three new objects in our repository describing the part of the history
of the project. There is one blob for the file we staged, the tree object that
points to the blob and a commit object that points to that tree object including
all metadata, like commit message and author. 
\\

Lets see what the releationship between these objects looks like:

\begin{figure}[h]
  \includegraphics[width=0.4\textwidth]{img/branch2}
  \caption{Git objects after a commit}
  \label{fig: git objects after a commit}
  %\author{Richard Plangger}
\end{figure}

It is obvious that every commit looks like this and every commit, exept the first
has one or more parents.
A branch is just a moveable pointer towards a commit. Also the master branch is
one of those ligthweight pointers towards a commit. In addition there is a
pointer called \emph{HEAD} which specifies which branch we are currently working on. So by
executing the command \emph{git checkout [branchname]} we move the \emph{HEAD}
pointer. 

Now it is time to make a new commit in the crazyidea branch and switch back to
the master branch and also commit some changes.

\begin{figure}[h]
  \centering
  \includegraphics{img/branch3}
  \caption{The new branch crazyidea}
  \label{fig: a new branch}
\end{figure}

There has been created a new commit pointing to one commit of the master branch.
The new commit in its own branch now contains the history from the master branch
to the commit where it was splitted from and the changes in the new
branch. The last two commits that have been made in the master branch are not
visible in our new branch. 

Because a new branch in Git is equal to a single
file with the SHA1 checksum branches are very cheap. In other version control systems branching means to
copy the whole content into a new folder and not just to create a small pointer. This copy process can take seconds or even minutes and discourages developers to use
this powerful feature. \cite[chapter 3.1]{gitpro2009}

\subsection {Merging}

Merging is the need of putting together data from different branches. Maybe one
day our crazyidea will turn out to be brilliant and also the crazyidea is
stable enough to put it to our master branch we need a way to put them together.
At this point the merge command pops in:

\begin{lstlisting}
$ git merge [branchname]
\end{lstlisting}

Git now merges all differences into the branch we are currently on and we can see
the result in figure \ref{fig:crazyidea_merge}.
\begin{figure}[h]
  \centering 
  \includegraphics{img/merge1}
  \caption{Merged the crazy idea into the master branch}
  \label{fig:crazyidea_merge}
\end{figure}

After the merge all data that is not contained in master but in crazyidea is put
together and will be part of master. As we can see this is a very powerful feature
that allows developers to be flexible and work on several different parts of the
project, without breaking the whole project. \cite[page 30-31]{gitinternals2008}

\subsection {What makes Git a DVCS?}

Linus Torvalds did quite a good job to summarize what distributed means:

\zitat{Being distributed very much means that you do not have one
central location that keeps track of your data, no single place is more important
than any other place \ldots } \cite{googletechtalk2007}

Nearly all commands that Git offer are local operations. There is no need to
have any network connection at all to work on a project and track changes
afterwards. Moreover the whole history of a project is contained in a
local repository. Also all contributors of one project have that history if
they pull from each other. Hence no repository has more information about the project
than the other.