cd ~
mkdir fundamentals
cd fundamentals
git init
Initialized empty Git repository in /home/dan/fundamentals/.git/
Dan Negrey
March 16, 2017
Of all the tools I use as a data scientist, the one that I cherish the most is Git. As a free and open source distributed version control system, Git plays an integral role in my work by seamlessly fostering many of the most important considerations of data science workflows including collaboration, experimentation, reproducibility and of course, source code management.
At their core, version control systems (VCS) all serve one broad and common purpose: tracking changes to files. What distinguishes one system from another, however, is how that purpose is implemented and what additional features are present. To get a better understanding of the history and evolution of version control systems, I recommend reading the introduction at Ry’s Git Tutorial. You might also want to bookmark his tutorial and work your way through all of the sections as he does a terrific job demonstrating Git’s feature set in much greater detail than what I’m covering in this post.
Before we move on, let’s clarify something that often comes up when people first start learning about Git. Many of you may have heard of companies like GitHub, GitLab or Bitbucket. These are each examples of web-based repository hosting services. Git itself is just a lightweight command line tool. Services like GitHub provide software development platforms that center around the use of Git but add a rich suite of additional features. The focus of this post is on learning the fundamentals of the Git command line tool.
A basic understanding of Linux (Unix-like) commands is assumed. Specifically, this post makes extensive use of the following:
mkdir: create new directories
cd: change the current working directory
echo: display a line of text
cat: concatenate files and print on the standard output
ls: list directory contents
rm: remove files or directories
For brevity, I am using echo with redirection (>) to write files. In reality, you’d be using a visual editor such as vi to write and edit files.
Before we can do anything with Git, we must initialize a directory as a Git repository. Let’s do so in a brand new directory that we’ll create called fundamentals underneath our home directory (~):
The first thing you’ll want to do with any new Git repository is add a README.md file to the project root. As a plain text file, it will be the easiest place to save and read notes about your project. The “.md” extension indicates that it is a markdown file. Markdown is a text-to-HTML conversion tool that allows you to create easy-to-read and easy-to-write plain text files which get converted to HTML. Hosting services like GitHub and GitLab will automatically render your README.md file to HTML at the main repository site (for example: https://github.com/rstudio/blogdown). Here is a brief primer on some of the more commonly used markdown syntax:
Now let’s actually create our README.md file. A popular convention (and one that I use) is to have the first line of your README.md file be an <h1> header with the name of your repository:
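A minimal sketch of this step, assuming the header text is simply the repository name (the 15-byte README.md in the listings further down is consistent with that). The mktemp scratch directory is my stand-in for ~/fundamentals so the snippet can be run anywhere:

```shell
# Scratch directory standing in for ~/fundamentals (an assumption for this sketch).
cd "$(mktemp -d)"

# In markdown, a line starting with "# " renders as an <h1> header.
echo "# fundamentals" > README.md

# Print the file back to confirm what was written.
cat README.md
```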
Now that we actually have a file in our repository, we are ready to use Git. The one command that you’ll find yourself using regularly in order to check the status of your project and see what changes have occurred since the last “clean” state is git status:
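As a rough sketch, the same check can be reproduced in a throwaway repository (the mktemp directory is an assumption of mine, not the directory used in this post). The --short option is a compact alternative to the long report:

```shell
# Throwaway repository so nothing in a real project is touched.
cd "$(mktemp -d)" && git init -q
echo "# fundamentals" > README.md

git status           # long form: README.md is listed under "Untracked files"
git status --short   # compact form: "??" marks an untracked file
```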
On branch main
No commits yet
Untracked files:
(use "git add <file>..." to include in what will be committed)
README.md
nothing added to commit but untracked files present (use "git add" to track)
Notice the response from our command. It lists our README.md file as an untracked file. This is Git’s way of telling you that a new file is present in the repository. It also says to use "git add" to track.
Git allows you to review your changes before they get recorded into version control. This is called staging. You can add files to the current staging area (“snapshot”) using git add and take them back out of it using git rm --cached:
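Here is a small sketch of staging and unstaging, run in a scratch repository of my own choosing rather than ~/fundamentals. Note that git rm without --cached would also delete the file from disk, which is why the sketch uses the --cached form:

```shell
# Throwaway repository with one untracked file.
cd "$(mktemp -d)" && git init -q
echo "# fundamentals" > README.md

git add README.md              # stage the new file
git status --short             # "A  README.md": staged as a new file

git rm --cached -q README.md   # unstage it; the file itself stays on disk
git status --short             # "?? README.md": untracked again

git add README.md              # stage it once more, ready to commit
git status                     # long form, matching the output in this post
```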
On branch main
No commits yet
Changes to be committed:
(use "git rm --cached <file>..." to unstage)
new file: README.md
Now, README.md is being tracked and is ready to be committed into version control.
When you are ready to officially record (“commit”) your staged changes, use the git commit command. Every commit is accompanied by a message, so by default Git opens an editor and prompts you for one. You can avoid the prompt and supply the message with the commit by using the -m option:
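A minimal sketch of the commit step in a scratch repository. The user.name and user.email values are placeholders I made up so the snippet also runs on a machine where Git has no identity configured yet:

```shell
# Throwaway repository with a staged README.md.
cd "$(mktemp -d)" && git init -q
git config user.name "Example User"        # placeholder identity (assumption)
git config user.email "user@example.com"   # placeholder identity (assumption)
echo "# fundamentals" > README.md
git add README.md

# -m supplies the message inline, so no editor is opened.
git commit -m "initial commit"

# With everything committed, the working tree is reported as clean.
git status
```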
[main (root-commit) 33276a2] initial commit
1 file changed, 1 insertion(+)
create mode 100644 README.md
On branch main
nothing to commit, working tree clean
This will commit any staged files. Note that each commit is given a unique identifier known as a SHA-1 hash.
Use the git log command to print a summary of all the commits that you’ve made. This will include the full commit hash, author, date and message. For an abbreviated result, use the --oneline option, which prints only the commit message and an abbreviated commit hash (its first 7 characters here):
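Both forms can be compared side by side with a sketch like the following. It assumes a scratch repository and a placeholder identity; the hashes, author and date you see will differ from the ones printed in this post:

```shell
# Throwaway repository containing a single commit.
cd "$(mktemp -d)" && git init -q
git config user.name "Example User"        # placeholder identity (assumption)
git config user.email "user@example.com"   # placeholder identity (assumption)
echo "# fundamentals" > README.md
git add README.md
git commit -q -m "initial commit"

# --no-pager prints straight to the terminal instead of opening a pager.
git --no-pager log             # full hash, author, date and message
git --no-pager log --oneline   # abbreviated hash and message, one line per commit
```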
commit 33276a2309d5b5347a72754220d7fdf3320617ca
Author: Dan Negrey <dnegrey@gmail.com>
Date: Tue Feb 7 20:51:18 2023 -0500
initial commit
There may be instances when you want Git to ignore certain files in your repository. A good example of this includes working data files that your code produces while it is executing. Generally, if you subscribe to the principles of reproducible research, then you should be able to ignore any output files that your code produces, as your code should be able to reproduce the output when needed. For Git to ignore certain files, you’ll need to create a .gitignore file in your project root, and list in it the file names or patterns that describe what is to be ignored.
Let’s confirm a clean working directory:
Now, let’s create a few data files that we’ll want to ignore:
total 4
-rw-r--r-- 1 dan dan 0 Feb 7 20:51 data1.csv
-rw-r--r-- 1 dan dan 0 Feb 7 20:51 data2.csv
-rw-r--r-- 1 dan dan 0 Feb 7 20:51 data3.csv
-rw-r--r-- 1 dan dan 15 Feb 7 20:51 README.md
On branch main
Untracked files:
(use "git add <file>..." to include in what will be committed)
data1.csv
data2.csv
data3.csv
nothing added to commit but untracked files present (use "git add" to track)
Next, we simply create our .gitignore file with the correct pattern to ignore these new data files:
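A sketch of the whole ignore step in a scratch repository. The data*.csv pattern is the one used in this post. Creating the empty data files with touch is my own assumption, and git check-ignore is an extra command, not used elsewhere in this post, that reports which pattern caused a file to be ignored:

```shell
# Throwaway repository with three empty stand-in data files.
cd "$(mktemp -d)" && git init -q
touch data1.csv data2.csv data3.csv   # touch is an assumption; any empty files will do

git status --short                    # all three files show up as untracked ("??")

# One pattern per line; "*" matches any run of characters in a file name.
echo "data*.csv" > .gitignore

git status --short                    # now only .gitignore itself is listed
git check-ignore -v data1.csv         # names the .gitignore line that matched
```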
On branch main
Untracked files:
(use "git add <file>..." to include in what will be committed)
.gitignore
nothing added to commit but untracked files present (use "git add" to track)
total 20
drwxr-xr-x 3 dan dan 4096 Feb 7 20:51 .
drwxr-x--- 22 dan dan 4096 Feb 7 20:51 ..
-rw-r--r-- 1 dan dan 0 Feb 7 20:51 data1.csv
-rw-r--r-- 1 dan dan 0 Feb 7 20:51 data2.csv
-rw-r--r-- 1 dan dan 0 Feb 7 20:51 data3.csv
drwxr-xr-x 8 dan dan 4096 Feb 7 20:51 .git
-rw-r--r-- 1 dan dan 10 Feb 7 20:51 .gitignore
-rw-r--r-- 1 dan dan 15 Feb 7 20:51 README.md
Great! Git will now ignore any file in our repository that matches the pattern data*.csv. However, it recognizes that we have introduced a new file - namely, the .gitignore file. So we simply add .gitignore to the staging area and commit:
[main 7d1bece] added .gitignore
1 file changed, 1 insertion(+)
create mode 100644 .gitignore
On branch main
nothing to commit, working tree clean
Let’s add a new file to our project:
[main 337001e] added file1
1 file changed, 1 insertion(+)
create mode 100644 file1
Now, this is where things start to heat up! Let’s make a change to file1:
On branch main
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
modified: file1
no changes added to commit (use "git add" and/or "git commit -a")
Based on the above response from git status, Git is aware that file1 has changed. But, recall staging! This change has not yet been staged. Sometimes, changes like this might occur by accident or they may no longer be desired. To undo unstaged changes to a file, use the git checkout -- command followed by the file name (the git restore command suggested in the status output above does the same thing):
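A hedged sketch of discarding an unstaged change. The contents of file1 here are hypothetical, since the actual text is not shown in this post, and the scratch directory and identity are placeholders:

```shell
# Throwaway repository with file1 committed.
cd "$(mktemp -d)" && git init -q
git config user.name "Example User"        # placeholder identity (assumption)
git config user.email "user@example.com"   # placeholder identity (assumption)
echo "original line" > file1               # hypothetical content
git add file1
git commit -q -m "added file1"

# Make an edit that we will decide not to keep.
echo "unwanted edit" >> file1
git status --short       # " M file1": modified but not staged

# Restore file1 from the staging area, discarding the unstaged change.
git checkout -- file1
# git restore file1      # equivalent in Git 2.23 and later

cat file1                # back to the committed content
```

Be aware that the discarded edit is not recoverable through Git, because it was never staged or committed.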
Voila! Everything is back to the way it was before we changed file1.
In some cases, it will be necessary to undo an entire commit. To do so, use the git revert command and supply it with the SHA-1 hash of the commit that you would like to revert. Suppose we didn’t want a .gitignore file. We would want to revert our second commit:
git revert 7d1bece
[main 3d77b78] Revert "added .gitignore"
Date: Tue Feb 7 20:51:18 2023 -0500
1 file changed, 1 deletion(-)
delete mode 100644 .gitignore
On branch main
Untracked files:
(use "git add <file>..." to include in what will be committed)
data1.csv
data2.csv
data3.csv
nothing added to commit but untracked files present (use "git add" to track)
3d77b78 Revert "added .gitignore"
337001e added file1
7d1bece added .gitignore
33276a2 initial commit
total 20
drwxr-xr-x 3 dan dan 4096 Feb 7 20:51 .
drwxr-x--- 22 dan dan 4096 Feb 7 20:51 ..
-rw-r--r-- 1 dan dan 0 Feb 7 20:51 data1.csv
-rw-r--r-- 1 dan dan 0 Feb 7 20:51 data2.csv
-rw-r--r-- 1 dan dan 0 Feb 7 20:51 data3.csv
-rw-r--r-- 1 dan dan 14 Feb 7 20:51 file1
drwxr-xr-x 8 dan dan 4096 Feb 7 20:51 .git
-rw-r--r-- 1 dan dan 15 Feb 7 20:51 README.md
Recall that the point of Git is to track all of your changes! Notice that git revert did not simply roll back or remove the specified commit. Instead, it created a new commit reflective of the state we desired. In fact, now that our .gitignore file has been removed, Git is once again aware of the data files that it was previously ignoring! We can now revert our revert to get back our .gitignore file:
git revert 3d77b78
[main d19f5b3] Revert "Revert "added .gitignore""
Date: Tue Feb 7 20:51:18 2023 -0500
1 file changed, 1 insertion(+)
create mode 100644 .gitignore
d19f5b3 Revert "Revert "added .gitignore""
3d77b78 Revert "added .gitignore"
337001e added file1
7d1bece added .gitignore
33276a2 initial commit
total 24
drwxr-xr-x 3 dan dan 4096 Feb 7 20:51 .
drwxr-x--- 22 dan dan 4096 Feb 7 20:51 ..
-rw-r--r-- 1 dan dan 0 Feb 7 20:51 data1.csv
-rw-r--r-- 1 dan dan 0 Feb 7 20:51 data2.csv
-rw-r--r-- 1 dan dan 0 Feb 7 20:51 data3.csv
-rw-r--r-- 1 dan dan 14 Feb 7 20:51 file1
drwxr-xr-x 8 dan dan 4096 Feb 7 20:51 .git
-rw-r--r-- 1 dan dan 10 Feb 7 20:51 .gitignore
-rw-r--r-- 1 dan dan 15 Feb 7 20:51 README.md
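The revert round trip above can be reproduced with a sketch along these lines. It assumes a scratch repository, a placeholder identity and hypothetical contents for file1. It refers to commits as HEAD~1 and HEAD instead of by hash, since your hashes will differ, and passes --no-edit so that Git keeps the default revert message instead of opening an editor:

```shell
# Throwaway repository with the same three commits as in this post.
cd "$(mktemp -d)" && git init -q
git config user.name "Example User"        # placeholder identity (assumption)
git config user.email "user@example.com"   # placeholder identity (assumption)
echo "# fundamentals" > README.md && git add README.md && git commit -q -m "initial commit"
echo "data*.csv" > .gitignore && git add .gitignore && git commit -q -m "added .gitignore"
echo "hello" > file1 && git add file1 && git commit -q -m "added file1"   # hypothetical content

git revert --no-edit HEAD~1    # new commit that undoes "added .gitignore"
ls -a                          # .gitignore is gone from the working directory

git revert --no-edit HEAD      # revert the revert: .gitignore comes back
git --no-pager log --oneline   # five commits in total; nothing was erased
```

Depending on your Git version, the message of the second revert may read Reapply "added .gitignore" rather than Revert "Revert "added .gitignore"".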
With just a few simple commands, you’ve taken your first step into a larger world! The use of version control, and more specifically Git, may be a total paradigm shift for you. It may seem challenging to learn and difficult to incorporate into your everyday workflow. But over time, this will subside. You’ll learn to work in far more efficient ways than you did previously, and Git will become an indispensable tool that is integral to your approach. And we’ve only just scratched the surface. Git’s branching model is one of its key differentiators and will be next up on our list along with remote repositories, so stay tuned to the blog by following me on Twitter @NegreyDan!