I had my PDF experiments inside DTFoundation. Those included a rather large (compared to the other source code) PDF file I was using for testing and the Demo. The problem was that I now use DTFoundation almost everywhere, since it is the central repository for all my generally reusable code.
Because of this file, every clone of the repository took forever. So I decided to split the PDF stuff into its own repository and deleted the file. However, since git keeps all history forever, clones still took a long time.
Fortunately, git has a facility that allows you to change history, in my case to make it appear as if I had never checked the PDF file into DTFoundation. Thanks to Alex Blewitt for pointing me in the right direction. I found the necessary steps documented under the heading Removing Sensitive Data.
First I needed to find out the full path to the PDF file from the root folder. That was easy to find by browsing the commit history on GitHub. Before my history rewrite, a clone was 8.48 MiB in size.
git clone git@github.com:Cocoanetics/DTFoundation.git
Cloning into 'DTFoundation'...
remote: Counting objects: 1912, done.
remote: Compressing objects: 100% (649/649), done.
remote: Total 1912 (delta 1237), reused 1908 (delta 1233)
Receiving objects: 100% (1912/1912), 8.48 MiB | 382 KiB/s, done.
Resolving deltas: 100% (1237/1237), done.
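If you would rather find the offending path from the command line than by browsing GitHub, git can list the largest blobs in the whole history. What follows is a minimal sketch, not the approach used above. It builds a throwaway repository so that it runs anywhere; the file name big.pdf and the demo identity are made up. Inside a real clone only the pipeline at the end is needed.

```shell
#!/bin/sh
set -e

# Throwaway repository so the sketch is self-contained (all names are hypothetical).
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

mkdir -p Demo/Resources
head -c 200000 /dev/zero > Demo/Resources/big.pdf   # stand-in for the large PDF
echo "small" > README
git add .
git commit -q -m "Add files"
git rm -q Demo/Resources/big.pdf
git commit -q -m "Delete the big file (it stays in history)"

# The actual recipe: list every object reachable from any ref, annotate it
# with type and size, keep only blobs, and sort by size, largest first.
largest=$(git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
  awk '$1 == "blob"' |
  sort -k2,2nr |
  head -n 5)
echo "$largest"
```

Each output line holds the object type, the size in bytes, and the path relative to the repo root, which is the form filter-branch expects.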
The next step was to use the filter-branch command which goes through the entire git history and applies the git rm command on each commit. Don’t ask me what those weird options mean, I’m glad it worked!
git filter-branch --index-filter 'git rm --cached --ignore-unmatch Demo/PDFDemo/Resources/RemovedFile.pdf' --prune-empty --tag-name-filter cat -- --all
Rewrite 0f526d2a078373240495b554ccc758332e6f1068 (158/278)rm 'Demo/PDFDemo/Resources/RemovedFile.pdf'
Rewrite 3b2624864bf4e1f20ba80146da5ad85f31218057 (163/278)rm 'Demo/PDFDemo/Resources/RemovedFile.pdf'
Rewrite 773b516a3dbb3576122bcb5f70e5751506dcc079 (164/278)rm 'Demo/PDFDemo/Resources/RemovedFile.pdf'
…
Rewrite 34e5d7244543cbf8ee257cbcc10096f129f19489 (265/278)rm 'Demo/PDFDemo/Resources/RemovedFile.pdf'
Rewrite ad0395249e74cef910d9f2533341453650a3f5ac (266/278)rm 'Demo/PDFDemo/Resources/RemovedFile.pdf'
Rewrite 4e74124d5d0b1852af6f100657eb7df3fba5a92f (278/278)
Ref 'refs/heads/master' was rewritten
Ref 'refs/remotes/origin/master' was rewritten
Ref 'refs/remotes/origin/gh-pages' was rewritten
WARNING: Ref 'refs/remotes/origin/master' is unchanged
Notice that you need to put the entire path of the retroactively removed file (relative to the repo root) in here. A Rewrite line is printed for every commit; you can see that the file was found by the rm '…' output that follows the Rewrite line for each commit where this file was in existence.
Once this is done, you need to push the changes back up to GitHub. Here it is key to use the --force option.
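For anyone wondering what those options do, here is an annotated sketch run against a throwaway repository, based on my reading of the git documentation. The demo identity and commit messages are invented; the path is the one from above. FILTER_BRANCH_SQUELCH_WARNING is set only because newer git versions print a warning and pause before filter-branch starts.

```shell
#!/bin/sh
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1   # newer git warns and pauses otherwise

pdf=Demo/PDFDemo/Resources/RemovedFile.pdf

# Throwaway repository with three commits (identity and messages are hypothetical).
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

mkdir -p "$(dirname "$pdf")"
head -c 200000 /dev/zero > "$pdf"
echo one > README
git add .
git commit -q -m "Add README and PDF"
echo two >> README
git commit -q -am "Edit README only"
head -c 100 /dev/zero > "$pdf"
git commit -q -am "Change the PDF only"   # becomes empty once the PDF is gone

# --index-filter CMD      run CMD against the index of every commit (no checkout)
#   git rm --cached       remove the path from the index only
#   --ignore-unmatch      do not fail on commits that never contained the file
# --prune-empty           drop commits that no longer change anything
# --tag-name-filter cat   keep tag names, but point tags at the rewritten commits
# -- --all                rewrite every ref, not just the current branch
git filter-branch \
  --index-filter "git rm --cached --ignore-unmatch $pdf" \
  --prune-empty \
  --tag-name-filter cat \
  -- --all

commits=$(git rev-list --count HEAD)
hits=$(git log --oneline -- "$pdf" | wc -l | tr -d ' ')
echo "commits left: $commits, commits touching the PDF: $hits"
```

In this demo the third commit disappears entirely, because the PDF was the only thing it changed.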
git push origin master --force
The first time I did that I didn’t have the full path for the file and so the push would still be as slow as the original clone with the file still in history. After I did it right, the push was done right away because git didn’t have to transmit the entire PDF any more.
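Because the rewrite used -- --all and --tag-name-filter cat, other branches and tags may have been rewritten as well, and as far as I can tell anything on the server that still points at the old commits keeps the file reachable there. Below is a sketch of force-pushing everything, using a local bare repository as a stand-in for GitHub. All paths, the demo identity, and the tag v1.0 are hypothetical.

```shell
#!/bin/sh
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1   # newer git warns and pauses otherwise

pdf=Demo/PDFDemo/Resources/RemovedFile.pdf
work=$(mktemp -d)

git init -q --bare "$work/origin.git"    # stand-in for the GitHub remote
git init -q "$work/clone"
cd "$work/clone"
git config user.email "demo@example.com"
git config user.name "Demo"
git remote add origin "$work/origin.git"

mkdir -p "$(dirname "$pdf")"
head -c 200000 /dev/zero > "$pdf"
echo one > README
git add .
git commit -q -m "Add README and PDF"
git tag v1.0
git push -q origin --all
git push -q origin --tags

branch=$(git symbolic-ref --short HEAD)  # master or main, depending on git config
before=$(git --git-dir="$work/origin.git" rev-parse "refs/heads/$branch")

git filter-branch --index-filter "git rm --cached --ignore-unmatch $pdf" \
  --prune-empty --tag-name-filter cat -- --all >/dev/null

# A plain push is refused: the rewritten history does not descend from
# what the remote already has.
if git push -q origin "$branch" 2>/dev/null; then rejected=no; else rejected=yes; fi

# Force-push all branches, then all tags.
git push -q --force origin --all
git push -q --force origin --tags

after=$(git --git-dir="$work/origin.git" rev-parse "refs/heads/$branch")
echo "plain push rejected: $rejected"
echo "remote $branch moved from $before to $after"
```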
Proof that the action was successful was visible on a fresh clone:
git clone git@github.com:Cocoanetics/DTFoundation.git
Cloning into 'DTFoundation'...
remote: Counting objects: 1911, done.
remote: Compressing objects: 100% (660/660), done.
remote: Total 1911 (delta 1235), reused 1897 (delta 1221)
Receiving objects: 100% (1911/1911), 1.85 MiB | 300 KiB/s, done.
Resolving deltas: 100% (1235/1235), done.
Much better! Now we’re down to 1.85 MiB, which speeds up the transfer quite a bit.
Another place where I could see that I had changed history was the commit detail page on GitHub. After a reload the file was no longer shown. The --prune-empty option also removes commits that would become empty because the removed file was the only one they changed.
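The same check can be made from the command line. One caveat, as far as I understand the filter-branch documentation: the original commits are kept as a backup under refs/original and in the reflog, so the repository where the rewrite ran still contains the file until those are cleared. A sketch on a throwaway repository (the demo identity is hypothetical):

```shell
#!/bin/sh
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1   # newer git warns and pauses otherwise

pdf=Demo/PDFDemo/Resources/RemovedFile.pdf

demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

mkdir -p "$(dirname "$pdf")"
head -c 200000 /dev/zero > "$pdf"
echo one > README
git add .
git commit -q -m "Add README and PDF"
blob=$(git rev-parse "HEAD:$pdf")        # object id of the PDF, checked again later

git filter-branch --index-filter "git rm --cached --ignore-unmatch $pdf" \
  --prune-empty --tag-name-filter cat -- --all >/dev/null

# 1. No commit on any branch or tag should mention the path any more.
#    (--all would also walk the refs/original backup, so name the namespaces.)
remaining=$(git log --oneline --branches --tags -- "$pdf" | wc -l | tr -d ' ')

# 2. Drop the backup refs and reflog entries, then prune unreachable objects.
git for-each-ref --format='%(refname)' refs/original |
  xargs -n 1 git update-ref -d
git reflog expire --expire=now --all
git gc -q --prune=now

# 3. The PDF blob itself should now be gone from the object database.
if git cat-file -e "$blob" 2>/dev/null; then blob_present=yes; else blob_present=no; fi
echo "commits still touching the PDF: $remaining, blob still stored: $blob_present"
```

Step 2 is destructive, since it throws away the only local backup of the old history, so it is worth running only after the rewritten history has been checked.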
Conclusion
I feel a bit queasy from this endeavor because filter-branch is the sort of command that might remove too much if you are not careful. But the application of our time travel experiment was a full success.
This is a great trick to have up your sleeve if you too need to change history.
Do you need to clone that repository often?
Rewriting history can lead to strange behavior/problems when working in teams, like when rebasing a public branch.