Dan Negrey - Git Fundamentals

Of all the tools I use as a data scientist, the one that I cherish the most is Git. As a free and open source distributed version control system, Git plays an integral role in my work by supporting many of the most important considerations of data science workflows, including collaboration, experimentation, reproducibility and, of course, source code management.

At their core, version control systems (VCS) all serve one broad and common purpose: tracking changes to files. What distinguishes one system from another, however, is how that purpose is implemented and what additional features are present. To get a better understanding of the history and evolution of version control systems, I recommend reading the introduction at Ry’s Git Tutorial. You might also want to bookmark his tutorial and work your way through all of the sections as he does a terrific job demonstrating Git’s feature set in much greater detail than what I’m covering in this post.

Before we move on, let’s clarify something that often comes up when people first start learning about Git. Many of you may have heard of companies like GitHub, GitLab or Bitbucket. These are each examples of web-based repository hosting services. Git itself is just a lightweight command line tool. Services like GitHub provide software development platforms that center around the use of Git but add a rich suite of additional features. The focus of this post is on learning the fundamentals of the Git command line tool.

Prerequisites

A basic understanding of Linux (Unix-like) commands is assumed. Specifically, this post makes extensive use of the following:

mkdir: create new directories
cd: change the current working directory
echo: display a line of text
cat: concatenate files and print on the standard output
ls: list directory contents
rm: remove files or directories

For brevity, I am using echo with redirection (>) to write files. In reality, you’d be using a visual editor such as vi to write and edit files.

Create a Local Git Repository

Before we can do anything with Git, we must initialize a directory as a Git repository. Let’s do so in a brand new directory that we’ll create called fundamentals underneath our home directory (~):

cd ~
mkdir fundamentals
cd fundamentals
git init

Initialized empty Git repository in /home/dan/fundamentals/.git/
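If you'd like to confirm that the directory really is a repository, here is a minimal sketch. It assumes a throwaway directory from mktemp -d (my choice for illustration, not part of the walkthrough above) so that nothing in your home directory is touched:

```shell
# Hypothetical scratch location; mktemp -d avoids touching real projects.
repo="$(mktemp -d)"
cd "$repo"
git init -q

# Git keeps all of its bookkeeping in the hidden .git directory.
ls -d .git

# Prints "true" when the current directory is inside a Git working tree.
inside="$(git rev-parse --is-inside-work-tree)"
echo "$inside"
```

Deleting the .git directory is all it takes to turn the folder back into an ordinary directory, so be careful with it.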

Add a README File

The first thing you’ll want to do with any new Git repository is add a README.md file to the project root. As a plain text file, it will be the easiest place to save and read notes about your project. The “.md” extension indicates that it is a Markdown file. Markdown is a lightweight markup language that lets you write easy-to-read plain text files which can then be converted to HTML. Hosting services like GitHub and GitLab will automatically render your README.md file as HTML on the repository’s main page (for example: https://github.com/rstudio/blogdown). Here is a brief primer on some of the more commonly used Markdown syntax:

Headers

# is an h1 header
## is an h2 header
### is an h3 header (and so on)

Regular Writing

Regular writing becomes a <p> tag

Inline Code

Enclose inline code in `single ticks`

Unordered Lists

* item one in unordered list
* item two in unordered list
* item three in unordered list

Italics and Bold

*italics*
**bold**
***bold-and-italics***

Hyperlinks

[hyperlink-alt-text](hyperlink-href)

Code Chunks

    Indent four spaces for a code block

Ordered Lists

1. item one in ordered list
2. item two in ordered list
3. item three in ordered list

Now let’s actually create our README.md file. A popular convention (and one that I use) is to have the first line of your README.md file be an <h1> header with the name of your repository:

echo \# fundamentals > README.md
cat README.md

# fundamentals
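A README usually grows beyond one line. If you want to stay on the command line rather than open an editor, a here-document is one option. The sketch below is only an illustration: the scratch directory and the body text of the file are made up for the example.

```shell
workdir="$(mktemp -d)"   # hypothetical scratch directory
cd "$workdir"

# Quoting the delimiter ('EOF') stops the shell from expanding anything in the body.
cat > README.md <<'EOF'
# fundamentals

Notes on the *basic* Git workflow.

* git init
* git add
* git commit
EOF

# The first line is still the h1 header with the repository name.
first_line="$(head -n 1 README.md)"
echo "$first_line"
```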

Git Status

Now that we actually have a file in our repository, we are ready to use Git. The one command that you’ll find yourself using regularly in order to check the status of your project and see what changes have occurred since the last “clean” state is git status:

git status

On branch main

No commits yet

Untracked files:
  (use "git add <file>..." to include in what will be committed)
    README.md

nothing added to commit but untracked files present (use "git add" to track)

Notice the response from our command. It lists our README.md file as an untracked file. This is Git’s way of telling you that a new file is present in the repository. It also says to use "git add" to track.
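Once you are comfortable reading the full report, you may prefer the condensed form. A minimal sketch, again in a throwaway repository (the mktemp -d location is an assumption for the demo):

```shell
repo="$(mktemp -d)"   # hypothetical scratch repository
cd "$repo"
git init -q
echo "# fundamentals" > README.md

# --short prints one line per path; "??" marks an untracked file.
short_status="$(git status --short)"
echo "$short_status"
```

Here the single line of output is ?? README.md, which carries the same information as the longer "Untracked files" report.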

Staging

Git allows you to review your changes before they get recorded into version control. This is called staging. You can add files to the current staging area (“snapshot”) with git add and remove them from it with git rm --cached (plain git rm also deletes the file from your working directory):

git add README.md
git status

On branch main

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
    new file:   README.md

Now, README.md is being tracked and is ready to be committed into version control.
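To see staging work in both directions, here is a small sketch that stages a file and then unstages it with the git rm --cached command that git status suggests. The scratch repository is an assumption for the demo, and the two-letter codes come from the short status format:

```shell
repo="$(mktemp -d)"   # hypothetical scratch repository
cd "$repo"
git init -q
echo "# fundamentals" > README.md

git add README.md
staged="$(git status --short)"      # "A  README.md": added to the staging area
echo "$staged"

# --cached removes the file from the staging area only; the file stays on disk.
git rm --cached -q README.md
unstaged="$(git status --short)"    # back to "?? README.md"
echo "$unstaged"
```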

Git Commit

When you are ready to officially record (“commit”) your staged changes, use the git commit command. Every commit is accompanied by a message, so running git commit on its own opens your editor and asks for one. You can skip that step by supplying the message directly with the -m option:

git commit -m "initial commit"
git status

[main (root-commit) 33276a2] initial commit
 1 file changed, 1 insertion(+)
 create mode 100644 README.md
On branch main
nothing to commit, working tree clean

This will commit any staged files. Note that each commit is given a unique identifier known as a SHA-1 hash.
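If you ever need that identifier in a script, git rev-parse will hand it to you. A minimal sketch, assuming a scratch repository and a placeholder identity (both made up for the demo):

```shell
repo="$(mktemp -d)"   # hypothetical scratch repository
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity for the demo
git config user.email "user@example.com"

echo "# fundamentals" > README.md
git add README.md
git commit -q -m "initial commit"

full_hash="$(git rev-parse HEAD)"           # full identifier of the latest commit
short_hash="$(git rev-parse --short HEAD)"  # abbreviated form, as shown in brackets above
echo "$full_hash"
echo "$short_hash"
```

The abbreviated hash is simply a prefix of the full one, and Git accepts either form wherever a commit is expected.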

Git Log

Use the git log command to print a summary of all the commits that you’ve made. This will include the full commit hash, author, date and message. For an abbreviated result, use the --oneline option, which prints only the abbreviated commit hash (its first 7 characters here) and the commit message:

git log

commit 33276a2309d5b5347a72754220d7fdf3320617ca
Author: Dan Negrey <dnegrey@gmail.com>
Date:   Tue Feb 7 20:51:18 2023 -0500

    initial commit

git log --oneline

33276a2 initial commit
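Between the full log and --oneline there is plenty of middle ground. The sketch below uses the --format option with a few placeholders. The scratch repository, the notes.txt file and the identity are all assumptions for the demo:

```shell
repo="$(mktemp -d)"   # hypothetical scratch repository
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity for the demo
git config user.email "user@example.com"

echo "# fundamentals" > README.md
git add README.md
git commit -q -m "initial commit"
echo "some notes" > notes.txt
git add notes.txt
git commit -q -m "added notes"

# %h = abbreviated hash, %an = author name, %s = subject line of the message.
git log --format='%h %an %s'

# Newest commit first, subject lines only.
subjects="$(git log --format='%s')"
# Total number of commits reachable from HEAD.
count="$(git rev-list --count HEAD)"
echo "$count"
```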

.gitignore

There may be instances when you want Git to ignore certain files in your repository. A good example is working data files that your code produces while it is executing. Generally, if you subscribe to the principles of reproducible research, then you should be able to ignore any output files that your code produces, as your code should be able to reproduce the output when needed. For Git to ignore certain files, you’ll need to create a .gitignore file in your project root and list in it the file names or patterns that describe what is to be ignored.

Let’s confirm a clean working directory:

pwd

/home/dan/fundamentals

ls -l

total 4
-rw-r--r-- 1 dan dan 15 Feb  7 20:51 README.md

git status

On branch main
nothing to commit, working tree clean

Now, let’s create a few data files that we’ll want to ignore:

for i in 1 2 3; do
    touch data${i}.csv
done
ls -l

total 4
-rw-r--r-- 1 dan dan  0 Feb  7 20:51 data1.csv
-rw-r--r-- 1 dan dan  0 Feb  7 20:51 data2.csv
-rw-r--r-- 1 dan dan  0 Feb  7 20:51 data3.csv
-rw-r--r-- 1 dan dan 15 Feb  7 20:51 README.md

git status

On branch main
Untracked files:
  (use "git add <file>..." to include in what will be committed)
    data1.csv
    data2.csv
    data3.csv

nothing added to commit but untracked files present (use "git add" to track)

Next, we simply create our .gitignore file with the correct pattern to ignore these new data files:

echo data\*.csv > .gitignore
cat .gitignore

data*.csv

git status

On branch main
Untracked files:
  (use "git add <file>..." to include in what will be committed)
    .gitignore

nothing added to commit but untracked files present (use "git add" to track)

ls -la

total 20
drwxr-xr-x  3 dan dan 4096 Feb  7 20:51 .
drwxr-x--- 22 dan dan 4096 Feb  7 20:51 ..
-rw-r--r--  1 dan dan    0 Feb  7 20:51 data1.csv
-rw-r--r--  1 dan dan    0 Feb  7 20:51 data2.csv
-rw-r--r--  1 dan dan    0 Feb  7 20:51 data3.csv
drwxr-xr-x  8 dan dan 4096 Feb  7 20:51 .git
-rw-r--r--  1 dan dan   10 Feb  7 20:51 .gitignore
-rw-r--r--  1 dan dan   15 Feb  7 20:51 README.md

Great! Git will now ignore any file in our repository that matches the pattern data*.csv. However, it recognizes that we have introduced a new file - namely, the .gitignore file. So we simply add .gitignore to the staging area and commit:

git add .gitignore
git commit -m "added .gitignore"
git status

[main 7d1bece] added .gitignore
 1 file changed, 1 insertion(+)
 create mode 100644 .gitignore
On branch main
nothing to commit, working tree clean

git log --oneline

7d1bece added .gitignore
33276a2 initial commit
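When a .gitignore grows, it can be hard to tell which rule is hiding a given file. The git check-ignore command answers that question. A minimal sketch in a scratch repository (the notes.txt file is a made-up counterexample):

```shell
repo="$(mktemp -d)"   # hypothetical scratch repository
cd "$repo"
git init -q
touch data1.csv notes.txt
echo 'data*.csv' > .gitignore

# -v reports the source file, line number and pattern responsible for the match.
git check-ignore -v data1.csv

# With -q nothing is printed; the exit status is 0 if the path is ignored, 1 if not.
if git check-ignore -q notes.txt; then ignored=yes; else ignored=no; fi
echo "notes.txt ignored: $ignored"
```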

Discard Changes

Let’s add a new file to our project:

echo "Hello, world!" > file1
cat file1

Hello, world!

git add file1
git commit -m "added file1"

[main 337001e] added file1
 1 file changed, 1 insertion(+)
 create mode 100644 file1

git log --oneline

337001e added file1
7d1bece added .gitignore
33276a2 initial commit

git status

On branch main
nothing to commit, working tree clean

Now, this is where things start to heat up! Let’s make a change to file1:

echo "Goodbye, world:(" > file1
cat file1

Goodbye, world:(

git status

On branch main
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
    modified:   file1

no changes added to commit (use "git add" and/or "git commit -a")

Based on the above response from git status, Git is aware that file1 has changed. But, recall staging! This change has not yet been staged. Sometimes, changes like this might occur by accident or they may no longer be desired. To undo unstaged changes to a file, use the git checkout -- command (in Git 2.23 and later, the git restore command suggested in the status output above does the same job):

cat file1

Goodbye, world:(

git checkout -- file1
cat file1

Hello, world!

git status

On branch main
nothing to commit, working tree clean

Voila! Everything is back to the way it was before we changed file1.
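Discarding is permanent, since the edit was never committed, so it is worth looking at what you are about to throw away. Here is a sketch that previews the change with git diff first. The scratch repository and identity are assumptions for the demo:

```shell
repo="$(mktemp -d)"   # hypothetical scratch repository
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity for the demo
git config user.email "user@example.com"

echo "Hello, world!" > file1
git add file1
git commit -q -m "added file1"
echo "Goodbye, world:(" > file1

# Show exactly what differs from the last commit before throwing it away.
git --no-pager diff

# Discard the unstaged edit; file1 returns to its committed content.
git checkout -- file1
restored="$(cat file1)"
echo "$restored"
```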

Git Revert

In some cases, it will be necessary to undo an entire commit. To do so, use the git revert command and supply it with the SHA-1 hash of the commit that you would like to revert. Suppose we didn’t want a .gitignore file. We would want to revert our second commit:

git log --oneline

337001e added file1
7d1bece added .gitignore
33276a2 initial commit

git revert 7d1bece

[main 3d77b78] Revert "added .gitignore"
 Date: Tue Feb 7 20:51:18 2023 -0500
 1 file changed, 1 deletion(-)
 delete mode 100644 .gitignore

git status

On branch main
Untracked files:
  (use "git add <file>..." to include in what will be committed)
    data1.csv
    data2.csv
    data3.csv

nothing added to commit but untracked files present (use "git add" to track)

git log --oneline

3d77b78 Revert "added .gitignore"
337001e added file1
7d1bece added .gitignore
33276a2 initial commit

ls -la

total 20
drwxr-xr-x  3 dan dan 4096 Feb  7 20:51 .
drwxr-x--- 22 dan dan 4096 Feb  7 20:51 ..
-rw-r--r--  1 dan dan    0 Feb  7 20:51 data1.csv
-rw-r--r--  1 dan dan    0 Feb  7 20:51 data2.csv
-rw-r--r--  1 dan dan    0 Feb  7 20:51 data3.csv
-rw-r--r--  1 dan dan   14 Feb  7 20:51 file1
drwxr-xr-x  8 dan dan 4096 Feb  7 20:51 .git
-rw-r--r--  1 dan dan   15 Feb  7 20:51 README.md

Recall that the point of Git is to track all of your changes! Notice that git revert did not simply roll back or remove the specified commit. Instead, it created a new commit reflective of the state we desired. In fact, by removing our .gitignore file, Git is now aware of the data files that it was previously ignoring! We can now revert our revert to get back our .gitignore file:

git revert 3d77b78

[main d19f5b3] Revert "Revert "added .gitignore""
 Date: Tue Feb 7 20:51:18 2023 -0500
 1 file changed, 1 insertion(+)
 create mode 100644 .gitignore

git status

On branch main
nothing to commit, working tree clean

git log --oneline

d19f5b3 Revert "Revert "added .gitignore""
3d77b78 Revert "added .gitignore"
337001e added file1
7d1bece added .gitignore
33276a2 initial commit

ls -la

total 24
drwxr-xr-x  3 dan dan 4096 Feb  7 20:51 .
drwxr-x--- 22 dan dan 4096 Feb  7 20:51 ..
-rw-r--r--  1 dan dan    0 Feb  7 20:51 data1.csv
-rw-r--r--  1 dan dan    0 Feb  7 20:51 data2.csv
-rw-r--r--  1 dan dan    0 Feb  7 20:51 data3.csv
-rw-r--r--  1 dan dan   14 Feb  7 20:51 file1
drwxr-xr-x  8 dan dan 4096 Feb  7 20:51 .git
-rw-r--r--  1 dan dan   10 Feb  7 20:51 .gitignore
-rw-r--r--  1 dan dan   15 Feb  7 20:51 README.md
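By default git revert opens your editor so you can adjust the generated message. In scripts, or when the default message is fine, the --no-edit option skips that step. A minimal sketch, assuming a scratch repository with just two commits and a placeholder identity:

```shell
repo="$(mktemp -d)"   # hypothetical scratch repository
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity for the demo
git config user.email "user@example.com"

echo "# fundamentals" > README.md
git add README.md
git commit -q -m "initial commit"
echo 'data*.csv' > .gitignore
git add .gitignore
git commit -q -m "added .gitignore"

# Undo the most recent commit by recording a new commit that reverses it.
git revert --no-edit HEAD

# History grows rather than shrinks: two original commits plus the revert.
count="$(git rev-list --count HEAD)"
subject="$(git log -1 --format='%s')"
echo "$count $subject"
```

The original commit is still in the log afterwards, which is exactly why a revert can itself be reverted.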

Closing Remarks

With just a few simple commands, you’ve taken your first step into a larger world! The use of version control, and more specifically Git, may be a total paradigm shift for you. It may seem challenging to learn and difficult to incorporate into your everyday workflow. But over time, that difficulty will subside: you’ll learn to work in far more efficient ways than you did previously, and Git will become an indispensable tool that is integral to your approach. And we’ve only just scratched the surface. Git’s branching model is one of its key differentiators and will be next up on our list along with remote repositories, so stay tuned to the blog by following me on Twitter @NegreyDan!