Extracting Code from a git Repo
Today I nerd sniped myself. I wanted to do something completely different, but then decided to extract the JavaScript compressor that is built into DokuWiki into its own repository.
My goals were:
- have a new repository with only the relevant code
- have a working git commit history for that code
- do not have any old cruft in the repo that is unrelated to the code
It turns out this isn't so easy, but in the end I managed.
Quick side note: I used ChatGPT a lot to figure out how things work. It was quite helpful for getting started, but whenever things got more difficult it started to hallucinate command options, which was a bit frustrating. So in the end I had to combine its advice with good old googling and thinking for myself.
Now to get started, you need to clone the original repo. I used an HTTP checkout, just to make sure that I wouldn't accidentally push anything back to origin.
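As an extra safeguard on top of the HTTP clone, the push URL of origin can be pointed at a dummy value. This is an addition, not something from the workflow described here. The sketch uses a throwaway local repository as a stand-in for the real upstream so it can run anywhere; the directory names and the `DISABLED` value are arbitrary.

```shell
#!/bin/bash
set -euo pipefail

# Stand-in for the real upstream (hypothetical; a real clone would use the project's HTTP URL).
work=$(mktemp -d)
git init -q "$work/upstream"
git -C "$work/upstream" -c user.name=Demo -c user.email=demo@example.com \
  commit -q --allow-empty -m "initial"

# Clone it, then make any push to origin fail on purpose.
git clone -q "$work/upstream" "$work/clone"
git -C "$work/clone" remote set-url --push origin DISABLED

# The fetch URL is untouched, the push URL is now the dummy value.
git -C "$work/clone" remote -v
```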
In my case I needed only two files and a directory:
lib/exe/js.php
_test/tests/lib/exe/js_js_compress.test.php
_test/tests/lib/exe/js_js_compress/
Additionally, only one function in the js.php file was of interest. So the first step was to edit that file, remove everything but that function and commit the change.
Next, git-filter-repo is the hero of the day. I installed it from the AUR on my Arch Linux system.
    # delete all tags
    git tag | xargs git tag -d

    # delete all branches
    git branch | grep -v "master" | xargs git branch -D

    # remove everything we don't want
    git filter-repo \
        --path '_test/tests/lib/exe/js_js_compress' \
        --path '_test/tests/lib/exe/js_js_compress.test.php' \
        --path 'lib/exe/js.php' \
        --replace-refs delete-no-add \
        --prune-empty always \
        --prune-degenerate always \
        --commit-callback '
            commit.message += b"\n\nOriginal commit:\ndokuwiki/dokuwiki@" + commit.original_id
        ' \
        --force
The above call rewrites the history. The path options tell it which files we want to keep; filter-repo removes all commits that do not touch these files and strips all changes to unrelated files from the commits that remain.
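One way to check a rewrite like this is to list every path that still appears anywhere in the history. The sketch below runs against a throwaway repository with made-up files, not the real clone. It shows that a file deleted in a later commit still shows up in the list, which is exactly the kind of leftover the path filter is meant to remove.

```shell
#!/bin/bash
set -euo pipefail

# Throwaway repo standing in for the clone (file names are made up for the demo).
demo=$(mktemp -d)
git init -q "$demo"
cd "$demo"
git config user.name "Demo"
git config user.email "demo@example.com"

mkdir -p lib/exe
echo "v1" > lib/exe/js.php
echo "x" > README
git add .
git commit -q -m "add files"
git rm -q README
git commit -q -m "drop README"

# Every path that any commit on any ref has ever touched.
# After a successful filter run, only the kept paths should be listed here.
paths_in_history=$(git log --all --pretty=format: --name-only | sed '/^$/d' | sort -u)
echo "$paths_in_history"
```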
I'm not sure how necessary the prune and replace-refs options are.
The commit callback will append the original commit ID to each commit that is kept. Useful if some greater context is needed in the future.
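To look up the original commit later, the appended ID can be pulled back out of a commit message. This is a minimal sketch that assumes the message format produced by the callback above; the demo commit and its hash are placeholders, not real DokuWiki commits.

```shell
#!/bin/bash
set -euo pipefail

# Throwaway repo with one commit whose message mimics the callback's output.
demo=$(mktemp -d)
git init -q "$demo"
cd "$demo"
git config user.name "Demo"
git config user.email "demo@example.com"

# Placeholder hash, for illustration only.
fake_id=0123456789abcdef0123456789abcdef01234567
git commit -q --allow-empty \
  -m "Fix compressor" \
  -m "Original commit:
dokuwiki/dokuwiki@$fake_id"

# Print only the part after the "dokuwiki/dokuwiki@" marker.
original_id=$(git log -1 --format=%B | sed -n 's|^dokuwiki/dokuwiki@||p')
echo "$original_id"
```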
At this point, the repo still contains a whole bunch of commits that touched the js.php file but addressed other functions in that file: functions we no longer have or care about.
Ideally filter-repo would be able to remove those, too, but I couldn't figure out how. Instead I opted to use git blame to get the commits that are still relevant, then use git filter-branch to remove all other commits:
- prune.sh
    #!/bin/bash

    # Get the commit hashes to keep:
    # We want the newest commit (which removed all unwanted cruft)
    # We want all commits that show up in a git blame on any existing file
    # (blame prefixes boundary commits with '^'; only those lines are dropped,
    # not every line whose content happens to contain a '^')
    commit_hashes=$(git rev-parse HEAD)" "$(git ls-files \
        | xargs -I{} git blame --minimal --abbrev=40 -- {} \
        | grep -v '^\^' \
        | awk '{print $1}' \
        | sort -u)

    # Rewrite the Git history to remove obsolete commits
    if [ "$commit_hashes" != "" ]; then
        git filter-branch --commit-filter '
            if echo "'"$commit_hashes"'" | grep -q "$GIT_COMMIT"; then
                git commit-tree "$@";
            else
                skip_commit "$@";
            fi' HEAD
    fi
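To see what the blame step in prune.sh collects, here is a small sketch on a throwaway repository. It is a variant, not the script's exact pipeline: it reads `git blame --line-porcelain`, where each line record starts with the full commit hash, so no parsing of the human-readable columns is needed. Unlike prune.sh it also keeps boundary commits, and it assumes 40-character SHA-1 hashes.

```shell
#!/bin/bash
set -euo pipefail

# Throwaway repo: three commits, the middle one is fully overwritten later.
demo=$(mktemp -d)
git init -q "$demo"
cd "$demo"
git config user.name "Demo"
git config user.email "demo@example.com"

printf 'one\ntwo\n' > a.txt
git add a.txt
git commit -q -m "first"
printf 'one\nTWO\n' > a.txt
git commit -q -am "second (obsolete later)"
printf 'one\n2\n' > a.txt
git commit -q -am "third"

# Every commit that still owns at least one line of a tracked file.
blamed_commits() {
  git ls-files | while IFS= read -r f; do
    git blame --line-porcelain -- "$f"
  done | grep -E '^[0-9a-f]{40} ' | cut -d' ' -f1 | sort -u
}

kept=$(blamed_commits)
echo "$kept"
```

In this demo only the first and third commits are listed, because no line of the final file comes from the second one.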
The result is a git repo with just those commits that address the current code.
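One thing to be aware of: git filter-branch keeps backups of the rewritten refs under refs/original/, so the old commits are still stored in the repository. The sketch below follows the cleanup steps described in the git filter-branch documentation. The backup ref is simulated by hand in a throwaway repository so the example does not depend on an actual rewrite.

```shell
#!/bin/bash
set -euo pipefail

# Throwaway repo with a hand-made backup ref, mimicking what filter-branch leaves behind.
demo=$(mktemp -d)
git init -q "$demo"
cd "$demo"
git config user.name "Demo"
git config user.email "demo@example.com"
git commit -q --allow-empty -m "initial"
git update-ref refs/original/refs/heads/master HEAD

# Delete all backup refs, then expire reflogs and prune unreachable objects.
git for-each-ref --format='%(refname)' refs/original/ | xargs -n 1 git update-ref -d
git reflog expire --expire=now --all
git gc -q --prune=now

remaining=$(git for-each-ref refs/original/ | wc -l)
echo "backup refs left: $remaining"
```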
From there it's easy to continue to clean up the repo structure with git mv.
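A restructuring step of that kind might look like the following. The target layout (`src/` and `tests/`) is purely hypothetical, and the demo runs in a throwaway repository with empty placeholder files.

```shell
#!/bin/bash
set -euo pipefail

# Throwaway repo that mirrors the original directory layout.
demo=$(mktemp -d)
git init -q "$demo"
cd "$demo"
git config user.name "Demo"
git config user.email "demo@example.com"

mkdir -p lib/exe _test/tests/lib/exe
touch lib/exe/js.php _test/tests/lib/exe/js_js_compress.test.php
git add .
git commit -q -m "original layout"

# Move the files into a flatter, hypothetical layout and commit the renames.
mkdir -p src tests
git mv lib/exe/js.php src/js.php
git mv _test/tests/lib/exe/js_js_compress.test.php tests/js_js_compress.test.php
git commit -q -m "Flatten directory layout"

git ls-files
```

History for a moved file can still be traced across the rename with `git log --follow -- src/js.php`.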