Git: Splitting Repos and Scrubbing Sensitive Data
Last updated on November 15, 2023
This website, or rather its many parts, lives in a monolithic GitHub repository. I wanted to split out the guides (like the one you are reading right now) into their own public GitHub repository.
At the same time, I wanted to keep the rest private and somehow end up with a unified repository, where I directly "link" the public one into the private one.
If you’re curious about how to do that and also how to remove sensitive data from any Git repo, this post is for you.
Splitting Repositories
Before you start, you’ll need to install git-filter-repo, a handy Python script that lets you do all the things you never even knew you wanted to do with a Git repository.
Follow the project's installation instructions; essentially, you need to download the git-filter-repo Python script and put it somewhere on your $PATH.
Then, do a full, new clone of your repository and cd into it.
git clone https://github.com/{username}/{repository name}
cd {repository name}
In my case, I wanted to make one specific subfolder of this repository (marcobehler-guides/eins/zwei/drei) the ROOT for my new repository.
The following command did the trick:
git filter-repo --subdirectory-filter {relative-folder-path}
# e.g. git filter-repo --subdirectory-filter marcobehler-guides/eins/zwei/drei
You’ll end up with a new Git repository that only contains the files from your specified subdirectory.
As a bonus, this command keeps the entire Git history for all those files!
git log
...
commit 53b84195d1197773b3c8969dc2ea07faef6041c7
Author: Marco Behler <marco@marcobehler.com>
Date: Mon Nov 13 17:15:32 2018 +0100
...
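To see the effect on a small scale, here is a minimal sketch that builds a throwaway repository and filters it down to one subfolder. All names (mono, guides/docs, intro.txt) are made up for illustration. It assumes git-filter-repo is on your $PATH and skips the filter step otherwise; --force is only there because the sandbox repository is not a fresh clone.

```shell
#!/bin/sh
set -eu
# Fixed identity so the commits work without any global Git config.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical sandbox repository standing in for the monolith.
sandbox=$(mktemp -d)
git init -q "$sandbox/mono"
cd "$sandbox/mono"

# One commit outside the subfolder, two commits inside it.
echo "private" > secret-plans.txt
git add . && git commit -q -m "Add private file"
mkdir -p guides/docs
echo "v1" > guides/docs/intro.txt
git add . && git commit -q -m "Add intro guide"
echo "v2" > guides/docs/intro.txt
git add . && git commit -q -m "Update intro guide"

if command -v git-filter-repo >/dev/null 2>&1; then
    # --force: needed here only because this sandbox is not a fresh clone.
    git filter-repo --force --subdirectory-filter guides/docs
    filtered=yes
else
    echo "git-filter-repo not found, skipping the filter step"
    filtered=no
fi

# If the filter ran, intro.txt now sits in the repository root and only
# the commits that touched guides/docs are left in the log.
git log --oneline
ls
```

On a real repository, I would stick to the fresh clone described above instead of reaching for --force.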
Subtree Merges
Now that I had two repositories, I asked myself how I could link these two, i.e. end up with one unified repository. Or put another way: I wanted to include the new repository in my old repository.
There seem to be two choices for this: Submodules and Subtrees.
I went down the Subtree path. If you have experience with Submodules, please let me know in the comments. For subtrees, you’ll want to execute these 3 steps:
-
Add the URL to your new repository as a remote to your (old) repository.
cd old-repository
git remote add -f {remote name} {url}
# e.g. git remote add -f marcobehler-guides https://github.com/marcobehler/marcobehler-guides.git
-
Make your old repository aware that we want to merge possibly unrelated changes into it.
$ git merge -s ours --no-commit --allow-unrelated-histories {remote name}/{branch name}
# e.g.
$ git merge -s ours --no-commit --allow-unrelated-histories marcobehler-guides/main
> Automatic merge went well; stopped before committing as requested
-
Copy the new repository’s content into a subfolder of your old repository.
git read-tree --prefix={relative subfolder path} -u {remote name}/{branch name}
# e.g. git read-tree --prefix=marcobehler-guides/ -u marcobehler-guides/main
-
Tada! The files are now in your unified (old) repository. Since the merge was started with --no-commit, finish it with a git commit.
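The three steps above can be tried out end to end in a sandbox. This is a minimal sketch with two throwaway local repositories; every name in it (guides, old-repository, guides-remote, git-guide.txt) is hypothetical, and a local path stands in for the GitHub URL. The final git commit is my addition to close the merge that --no-commit leaves open.

```shell
#!/bin/sh
set -eu
# Fixed identity so the commits work without any global Git config.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

sandbox=$(mktemp -d)

# Stand-in for the new (public) repository.
git init -q "$sandbox/guides"
cd "$sandbox/guides"
git symbolic-ref HEAD refs/heads/main   # make sure the branch is called main
echo "guide text" > git-guide.txt
git add . && git commit -q -m "Add guide"

# Stand-in for the old (private) repository.
git init -q "$sandbox/old-repository"
cd "$sandbox/old-repository"
git symbolic-ref HEAD refs/heads/main
echo "private" > notes.txt
git add . && git commit -q -m "Add private notes"

# Step 1: add the new repository as a remote and fetch it right away (-f).
git remote add -f guides-remote "$sandbox/guides"

# Step 2: start a merge of the unrelated history without touching any files.
git merge -s ours --no-commit --allow-unrelated-histories guides-remote/main

# Step 3: read the remote branch's tree into the guides/ subfolder.
git read-tree --prefix=guides/ -u guides-remote/main

# Close the merge that --no-commit left open.
git commit -q -m "Subtree-merge guides-remote/main into guides/"

git log --oneline --graph
ls guides
```

The log should show a merge commit with both histories attached, and the guide file should live under guides/.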
Challenges with the subtree approach:
-
If there are new changes in the public repo, you’ll have to manually sync the changes.
git pull -s subtree {remote name} {branch name}
# e.g. git pull -s subtree marcobehler-guides main
-
If you create a fresh clone of your unified repository in the future, you’ll also have to repeat part of the setup: remotes are local configuration and are not cloned, so you need to add the remote again before you can sync.
Does anyone know any better ways for the syncing?
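Until someone suggests something better, both challenges can at least be wrapped in a small helper. The sketch below is one possible way to do it, not an established recipe: sync_subtree is a hypothetical function that re-adds the remote if the clone does not have it and then pulls with the subtree strategy. The --no-rebase and --no-edit flags are my additions, because newer Git versions may refuse to pull divergent branches unless told how to reconcile them. The rest of the script is a sandbox with made-up names to try it out.

```shell
#!/bin/sh
set -eu
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical helper: add the remote if it is missing, then sync the subtree.
sync_subtree() {
    remote_name=$1
    remote_url=$2
    branch_name=$3
    if ! git remote get-url "$remote_name" >/dev/null 2>&1; then
        git remote add -f "$remote_name" "$remote_url"
    fi
    git pull --no-rebase --no-edit -s subtree "$remote_name" "$branch_name"
}

sandbox=$(mktemp -d)

# Public repository with two files.
git init -q "$sandbox/guides"
cd "$sandbox/guides"
git symbolic-ref HEAD refs/heads/main
echo "guide v1" > git-guide.txt
echo "about" > about.txt
git add . && git commit -q -m "Add guides"

# Unified repository, set up with the three subtree steps.
git init -q "$sandbox/old-repository"
cd "$sandbox/old-repository"
git symbolic-ref HEAD refs/heads/main
echo "private" > notes.txt
git add . && git commit -q -m "Add private notes"
git remote add -f guides-remote "$sandbox/guides"
git merge -s ours --no-commit --allow-unrelated-histories guides-remote/main
git read-tree --prefix=guides/ -u guides-remote/main
git commit -q -m "Subtree-merge guides"

# A new change lands in the public repository.
cd "$sandbox/guides"
echo "guide v2" > git-guide.txt
git commit -q -am "Update guide"

# A fresh clone of the unified repository does not know guides-remote yet.
git clone -q "$sandbox/old-repository" "$sandbox/fresh-clone"
cd "$sandbox/fresh-clone"
sync_subtree guides-remote "$sandbox/guides" main

cat guides/git-guide.txt
```

In a real setup you would call sync_subtree with your own remote name, URL and branch, e.g. the marcobehler-guides values from above.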
Removing Sensitive Data
Along the way I noticed I wanted to remove a couple of files from my new repository and also remove any trace of these files/contents from the Git history. (It might even have been the case that a friend asked me how to get rid of a leaked credential in his repository.)
While you can use git filter-repo (see above) to do that job, I used BFG Repo-Cleaner, because it seems to be simpler and faster (the website claims 10-720x faster than git filter-branch - who wouldn’t want that for a single run ;) ).
bfg is a good, old Java program, so you’ll need to have a JDK installed. Then simply download the .jar file and you can run it like so:
java -jar bfg.jar --delete-files {name of the file with sensitive data}
# e.g. java -jar bfg.jar --delete-files passwords.txt
# (as far as I can tell, --delete-files matches on the file name only, not on a path like mysubDir/passwords.txt)
Important note: I erroneously assumed that BFG will delete the file starting from my current commit. Not so.
BFG will only delete the history of the file. Which means, you’ll actually first need to remove (git rm) the file. Commit that change so it’s gone. Then run BFG to clean up the history of the file.
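Now there won’t be any trace of your sensitive data left in the rewritten history. To also strip the old objects from your local repository, the BFG documentation suggests running git reflog expire --expire=now --all and then git gc --prune=now --aggressive.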
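Why the git rm alone is not enough is easy to check with plain Git. The following sketch uses a throwaway repository with made-up names and content (leaky, hunter2); it removes the file in a new commit and then shows that the history still hands out the secret. The BFG call itself is only shown as a comment, since it needs bfg.jar.

```shell
#!/bin/sh
set -eu
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

sandbox=$(mktemp -d)
git init -q "$sandbox/leaky"
cd "$sandbox/leaky"

# Commit a (fake) secret, followed by an unrelated commit.
mkdir mysubDir
echo "hunter2" > mysubDir/passwords.txt
git add . && git commit -q -m "Oops, commit a password file"
echo "hello" > README.txt
git add . && git commit -q -m "Add README"

# Step 1: remove the file from the current state and commit that.
git rm -q mysubDir/passwords.txt
git commit -q -m "Remove password file"

# The file is gone from the working tree ...
ls

# ... but two commits still reference it (the one adding it and the one
# deleting it), and the old content is one command away.
commits_with_secret=$(git rev-list --all --count -- mysubDir/passwords.txt)
echo "commits still touching the file: $commits_with_secret"
leaked=$(git show HEAD~2:mysubDir/passwords.txt)
echo "still readable from history: $leaked"

# Step 2 (not run here): let BFG rewrite the history.
#   java -jar bfg.jar --delete-files passwords.txt
```

After a BFG run, the same rev-list check is a quick way to confirm that no commit touches the file anymore.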
Fin
That’s all. I have the feeling I’ll need another couple years to fully understand what Git, or rather tools like git filter-repo, are capable of doing. It almost looks like a runner-up to ffmpeg in terms of complexity. So, stay tuned for more Git posts!
Meanwhile, you might enjoy my Git: Merge, Cherry-Pick & Rebase guide. Or, if you prefer video and are using IntelliJ IDEA, check out 5 great Git & IntelliJ IDEA Tricks.
let mut author = ?
I'm @MarcoBehler and I share everything I know about making awesome software through my guides, screencasts, talks and courses.
Follow me on Twitter to find out what I'm currently working on.