Not long ago, my colleague Krzysztof Babis, who is also a co-author of this article, and I needed to modify the history of a Git repository. After some brief research, we settled on the git-filter-repo tool.
In this step-by-step guide, you’ll learn how to use git-filter-repo to efficiently rewrite the history of your Git repository.
Before we go any further, remember to back up your repository before using git-filter-repo, as it makes permanent changes to your history.
You may face this type of task for a variety of reasons, such as migration to another repository vendor, changes in code ownership, staff turnover, or common errors such as pushing files with sensitive data.
As a preview of what git-filter-repo can do, the last case (a pushed file with sensitive data) can be fixed with a single command:
git filter-repo --path sensitive-file.txt --invert-paths
Our goal was a bit different, as we needed to change the email addresses of the commit authors as a result of the brand and domain name changes, as well as the corresponding strings in the commit messages, and of course the contents of the files. The possibilities are almost endless, as this is just the tip of the iceberg, but here we focus on these two main issues.
Let’s take a look at some basic operations through two examples taken directly from the documentation:
1. Simple modifications of commit messages with --replace-message
If you want to modify commit or tag messages, you can do so with the same syntax as --replace-text. For example, with a file named expressions.txt containing
foo==>bar
then running
git filter-repo --replace-message expressions.txt
will replace foo in commit or tag messages with bar.
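To make the rule format more concrete, here is a minimal Python sketch of how literal old==>new rules could be applied to a message. It is only a simplified model for illustration, not git-filter-repo's actual implementation, and the helper names parse_rules and apply_rules are our own.

```python
def parse_rules(text):
    """Parse lines of the form old==>new into (old, new) byte pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or "==>" not in line:
            continue  # this sketch skips blank lines and lines without a replacement
        old, new = line.rsplit("==>", 1)
        rules.append((old.encode("utf-8"), new.encode("utf-8")))
    return rules


def apply_rules(message, rules):
    """Apply every literal rule, in order, to a commit message given as bytes."""
    for old, new in rules:
        message = message.replace(old, new)
    return message


rules = parse_rules("foo==>bar")
print(apply_rules(b"Add foo handling", rules))
```

Trying rules on a few sample messages this way is a cheap check before touching real history.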
2. Basic changes of user names and emails with --mailmap
To modify the names and emails of commit authors, you can create a mailmap file in the format accepted by git-shortlog. For example, if you have a file named my-mailmap you can run
git filter-repo --mailmap my-mailmap
and if the current contents of that file are as follows (if the specified mailmap file is version controlled, historical versions of the file are ignored), the names and emails are updated based on the specified mapping:
Name For User <email@addre.ss>
<new@ema.il> <old1@ema.il>
New Name And <new@ema.il> <old2@ema.il>
New Name And <new@ema.il> Old Name And <old3@ema.il>
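If there are many identities to map, the mailmap lines can be generated instead of typed by hand. The sketch below is one possible way to do it; the function name mailmap_line and the sample identities are hypothetical and only illustrate the line formats shown above.

```python
def mailmap_line(new_name, new_email, old_email, old_name=None):
    """Build one mailmap entry mapping an old identity to a new one."""
    new_part = f"{new_name} <{new_email}>" if new_name else f"<{new_email}>"
    old_part = f"{old_name} <{old_email}>" if old_name else f"<{old_email}>"
    return f"{new_part} {old_part}"


# Hypothetical identities, used only for illustration
entries = [
    mailmap_line("Adam Jones", "adam.jones@example.com", "adam.jones@outdated.com"),
    mailmap_line(None, "bart.clear@example.com", "bart.clear@outdated.com"),
]
print("\n".join(entries))
```

The printed lines can be saved as my-mailmap and passed to --mailmap.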
Prerequisites
Git-filter-repo is a command-line tool. So if you’ve pushed a file that contains, say, sensitive or leaked data, you can use it directly for simple changes. For more complicated solutions, scripts are the way to go. In the second part of this article, we’ll go step by step through building such a script.
To use git-filter-repo as part of our script, a few prerequisites are needed:
- python3 (git-filter-repo is a script written in Python, so pip is used to install it)
- terminal
- IDE to write the script that will be executed against our Git repository
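A quick way to confirm that the command-line prerequisites are in place is to look them up on PATH. The sketch below assumes the executables are named git, python3 and git-filter-repo, which may differ on your system (on Windows, for example, Python is often just python); the helper name missing_tools is our own.

```python
import shutil


def missing_tools(tools=("git", "python3", "git-filter-repo"), which=shutil.which):
    """Return the tools from the list that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


missing = missing_tools()
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("All prerequisites found.")
```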
Making a custom script and how it works
Problem #1: Unstructured commit author data
The first step was to analyze the current state of affairs, which fortunately could be done using git log: git log --format='%an <%ae>'
The repository has grown over the years, so we expected a long list of authors, presented in a not entirely consistent way. The results did not surprise us:
Adam Jones <adam.jones@outdated.com>
Adam Jones <adamjones@MacBook-Pro-Adam.local>
Bart Clear <bart.clear@outdated.com>
bart.clear@outdated.com <Bart Clear>
Robot - Build Service (outdated) <Robot - Build Service (outdated)>
Outdated Build Service <Outdated - Build Service (outdated)@outdated.com>
Josh <>
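On a repository with thousands of commits, the raw log is easier to read once duplicates are collapsed. The following sketch counts how often each identity occurs; count_identities is a hypothetical helper, and the sample text only imitates the shape of the output above.

```python
from collections import Counter


def count_identities(log_output):
    """Count how many commits each 'Name <email>' identity appears in."""
    lines = [line.strip() for line in log_output.splitlines() if line.strip()]
    return Counter(lines)


# Sample text in the same shape as the git log output
sample = """Adam Jones <adam.jones@outdated.com>
Adam Jones <adam.jones@outdated.com>
Josh <>
"""
for identity, commits in count_identities(sample).most_common():
    print(f"{commits:4d}  {identity}")
```

In a real run, the input would come from the git log command itself, for example via subprocess.run with capture_output=True and text=True.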
We have identified the following issues:
- Some users appear under several different email addresses
- Some commits do not have their author data configured correctly
- There are also some special cases that stand out, such as Robot - Build Service
Based on them, we defined the following goals to achieve:
- Unification of the e-mail domain → Each user should have an address in @example.com
- Assigning missing email addresses → If the email is empty, we generate it based on the first and last name.
- Replacing specific names → Build Services should have specific names and addresses
- Fixing swapped fields → Cases where the name and surname are saved as the email address and vice versa should be corrected
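The issues listed above can also be detected automatically, which helps to estimate how many identities each goal affects. This is a rough sketch under our own assumptions: the function find_issues, the simplified address pattern and the issue labels are hypothetical and were not part of the original script.

```python
import re

# Deliberately simple address check, good enough for triage
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[a-zA-Z]{2,}$")


def find_issues(name, email, domain="example.com"):
    """Return a list of problems found in one author identity."""
    issues = []
    if not email:
        issues.append("missing email")
    elif not EMAIL_RE.match(email):
        issues.append("email field is not an address")
    elif not email.lower().endswith("@" + domain):
        issues.append("wrong domain")
    if EMAIL_RE.match(name):
        issues.append("name field contains an address")
    return issues


print(find_issues("Josh", ""))
print(find_issues("bart.clear@outdated.com", "Bart Clear"))
print(find_issues("Adam Jones", "adam.jones@outdated.com"))
```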
Following the KISS principle, we kept it simple and iterated based on a scaled-down copy of the repository.
Step 1: Email Unification
To start with, we created a simple function that replaces the email domain with @example.com.
import subprocess

callback_code = '''
def map_email(email):
    email_prefix = email.split("@")[0]
    return f"{email_prefix}@example.com"

commit.author_email = map_email(commit.author_email.decode("utf-8")).encode("utf-8")
commit.committer_email = map_email(commit.committer_email.decode("utf-8")).encode("utf-8")
'''

subprocess.run([
    'git', 'filter-repo', '--force', '--commit-callback', callback_code
], check=True)
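Because git filter-repo hands commit fields to the callback as bytes, every value has to be decoded before mapping and encoded again afterwards. The sketch below isolates that round trip so it can be tried without a repository; map_email_bytes is a hypothetical wrapper name.

```python
def map_email(email):
    """Keep the local part and swap the domain for example.com."""
    email_prefix = email.split("@")[0]
    return f"{email_prefix}@example.com"


def map_email_bytes(raw):
    """Decode a bytes field, map it, and encode it back to bytes."""
    return map_email(raw.decode("utf-8")).encode("utf-8")


print(map_email_bytes(b"adam.jones@outdated.com"))
print(map_email_bytes(b"adamjones@MacBook-Pro-Adam.local"))
```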
Step 2: Handling empty emails
Next, we added support for cases where commit authors didn’t include their email addresses. We generated them according to the company’s mailing address policy:
def map_email(name, email):
    if email == "" or email == "<>":
        name_surname = name.lower().replace(" ", ".").replace("-", ".")
        return f"{name_surname}@example.com"
    email_prefix = email.split("@")[0]
    return f"{email_prefix}@example.com"
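Before running the function against real history, it can be exercised on a handful of sample identities. In the sketch below the function is copied from the step above, while the name Anna-Maria Nowak is a made-up example.

```python
def map_email(name, email):
    """Generate an address from the name when the email is missing."""
    if email == "" or email == "<>":
        name_surname = name.lower().replace(" ", ".").replace("-", ".")
        return f"{name_surname}@example.com"
    email_prefix = email.split("@")[0]
    return f"{email_prefix}@example.com"


samples = [
    ("Josh", ""),
    ("Anna-Maria Nowak", "<>"),  # hypothetical name with a hyphen
    ("Adam Jones", "adam.jones@outdated.com"),
    ("bart.clear@outdated.com", "Bart Clear"),  # swapped fields
]
for name, email in samples:
    print(f"{name!r}, {email!r} -> {map_email(name, email)}")
```

The last sample shows a limitation worth knowing: a non-empty value without an @ sign is used as the prefix unchanged, so swapped fields need separate handling.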
Step 3: Special Cases
It’s time to handle special cases. As a first step, we added a map_name function:
def map_name(name):
    if "Robot - Build Service (outdated)" in name:
        return "Robot Build Service"
    if "Outdated Build Service" in name:
        return "Media Build Service"
    return name
We left the best part for last: handling names that actually contain an email address and turning them back into a name and surname, using a regular expression (regex):
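import re

def map_name(name):
    # In a raw string a single backslash escapes the dot; when this code is
    # embedded in a regular (non-raw) string, the backslash must be doubled
    if re.match(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$', name):
        name_part = name.split("@")[0]
        # Assumes the local part has the form first.last (exactly one dot)
        first_name, last_name = name_part.split(".")
        return f"{first_name.capitalize()} {last_name.capitalize()}"
    return name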
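The unpacking into first_name and last_name works only when the local part contains exactly one dot. If your history also contains addresses such as josh@outdated.com, a more tolerant variant may be useful. The sketch below is our own assumption about how that could look, with name_from_email as a hypothetical helper name.

```python
import re

EMAIL_LIKE = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")


def name_from_email(name):
    """Turn an address stored in the name field into a capitalized name."""
    if not EMAIL_LIKE.match(name):
        return name  # regular names are left untouched
    local_part = name.split("@")[0]
    # Split on dots, underscores and hyphens, ignoring empty pieces
    parts = [part for part in re.split(r"[._-]+", local_part) if part]
    return " ".join(part.capitalize() for part in parts)


for candidate in ("bart.clear@outdated.com", "josh@outdated.com", "Adam Jones"):
    print(f"{candidate} -> {name_from_email(candidate)}")
```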
We used this in the script fragment responsible for transforming the author and committer information for each history entry:
# Rewrite author information with debug output
author_name = commit.author_name.decode("utf-8")
author_email = commit.author_email.decode("utf-8")
new_author_name = map_name(author_name)
new_author_email = map_email(author_name, author_email)
print(f"Rewriting Author: {author_name} <{author_email}> to {new_author_name} <{new_author_email}>")
commit.author_name = new_author_name.encode("utf-8")
commit.author_email = new_author_email.encode("utf-8")
# Rewrite committer information with debug output
committer_name = commit.committer_name.decode("utf-8")
committer_email = commit.committer_email.decode("utf-8")
new_committer_name = map_name(committer_name)
new_committer_email = map_email(committer_name, committer_email)
print(f"Rewriting Committer: {committer_name} <{committer_email}> to {new_committer_name} <{new_committer_email}>")
commit.committer_name = new_committer_name.encode("utf-8")
commit.committer_email = new_committer_email.encode("utf-8")
So bart.clear@outdated.com <Bart Clear> became Bart Clear <bart.clear@outdated.com>.
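One way to handle an entry whose two fields are swapped is to put them back in place before any other mapping runs. The sketch below shows that idea in isolation; fix_swapped and its simplified address pattern are hypothetical and are not taken from the script described in this article.

```python
import re

EMAIL_LIKE = re.compile(r"^[^@\s]+@[^@\s]+\.[a-zA-Z]{2,}$")


def fix_swapped(name, email):
    """Swap name and email back when they were stored in each other's field."""
    name_is_address = EMAIL_LIKE.match(name) is not None
    email_is_address = EMAIL_LIKE.match(email) is not None
    # Only swap when there is a non-empty, non-address value to use as the name
    if name_is_address and email and not email_is_address:
        return email, name
    return name, email


print(fix_swapped("bart.clear@outdated.com", "Bart Clear"))
print(fix_swapped("Adam Jones", "adam.jones@outdated.com"))
```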
Problem #2: Unacceptable phrases in commit titles
We needed to remove sensitive information and standardize naming conventions, so we improved our script to automatically update commit messages and replace specific text throughout a repository’s history.
Step 1: Setting the Goal
We wanted to replace certain words in commit messages and file contents across all Git history. The replacements are:
"to_be_replaced" → "example"
"replace_me" → "example"
"replaceable" → "example"
"renameable" → "example"
Step 2: Preparing a List of Replacements with or without Regular Expressions
First, we needed a structured way to store our text replacements. We created a mapping list that followed the format expected by the --replace-text option of git filter-repo:
replace_text_code = '''
to_be_replaced==>example
replaceable==>example
replace_me==>example
renameable==>example
'''
Each line specified a replacement in the form:
old_text==>new_text
This was used to replace occurrences of these words in file contents.
If we need a more robust solution, such as case-insensitive matching or other regular expressions, we have to use a specific syntax:
regex:(?i)old_text==>new_text
In our case, this replaces the text regardless of case, so OLD_text, Old_Text, etc. are all matched.
replace_text_code = '''
regex:(?i)to_be_replaced==>example
regex:(?i)replaceable==>example
regex:(?i)replace_me==>example
regex:(?i)renameable==>example
'''
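When the list of phrases grows, the rules can be generated from a plain list instead of being maintained by hand. The sketch below is one possible approach; build_replace_rules is a hypothetical helper, and it uses re.escape so that phrases containing regex metacharacters are still matched literally.

```python
import re


def build_replace_rules(words, replacement, case_insensitive=False):
    """Return replacement rules, one 'old==>new' line per word."""
    lines = []
    for word in words:
        if case_insensitive:
            # re.escape keeps regex metacharacters in the word literal
            lines.append(f"regex:(?i){re.escape(word)}==>{replacement}")
        else:
            lines.append(f"{word}==>{replacement}")
    return "\n".join(lines) + "\n"


words = ["to_be_replaced", "replaceable", "replace_me", "renameable"]
print(build_replace_rules(words, "example", case_insensitive=True), end="")
```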
Step 3: Modifying Commit Messages with Regular Expressions
Next, we needed to make sure commit messages also followed these rules. We used regular expressions to search and replace text while keeping the message formatting intact.
commit_callback_code = '''
message = re.sub(b"to_be_replaced", b"example", message, flags=re.MULTILINE | re.IGNORECASE)
message = re.sub(b"replaceable", b"example", message, flags=re.MULTILINE | re.IGNORECASE)
message = re.sub(b"replace_me", b"example", message, flags=re.MULTILINE | re.IGNORECASE)
message = re.sub(b"renameable", b"example", message, flags=re.MULTILINE | re.IGNORECASE)
return message'''
Breaking It Down:
- re.sub(pattern, replacement, message, flags=...) → Searches for a pattern and replaces it.
- b"to_be_replaced" → The b prefix means we’re working with byte strings, which is the form in which git filter-repo passes the message to the callback.
- flags=re.MULTILINE | re.IGNORECASE → re.IGNORECASE makes the replacement case-insensitive. re.MULTILINE only changes how ^ and $ match, so it is harmless here but not required: re.sub already processes every line of the message.
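The same substitutions can be tried on a sample message without running git at all. The sketch below folds the four patterns into one compiled expression, which is our own variation rather than the form used in the callback; the sample message is made up.

```python
import re

# One compiled, case-insensitive pattern that covers all four phrases
PHRASES = re.compile(rb"to_be_replaced|replaceable|replace_me|renameable", re.IGNORECASE)


def update_commit_message(message):
    """Replace every listed phrase in a commit message given as bytes."""
    return PHRASES.sub(b"example", message)


sample = b"Fix Replace_Me in TO_BE_REPLACED\n\nAlso touches renameable parts."
print(update_commit_message(sample).decode("utf-8"))
```

Note that no re.MULTILINE flag is involved, yet the phrase in the message body is replaced as well.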
Step 4: Writing the Replacement Rules to a Temporary File
The --replace-text option of git filter-repo requires a file containing replacement rules. Instead of manually creating a file, we generated a temporary one:
import tempfile

with tempfile.NamedTemporaryFile(mode='w', delete=False) as replace_file:
    replace_file.write(replace_text_code)
    replace_file_path = replace_file.name  # Store the file path
Why Use a Temporary File?
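- Prevents cluttering your project with extra files.
- Keeps the rules outside the repository. Note that with delete=False the file is not removed automatically, so delete it yourself once git filter-repo has finished.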
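Since delete=False leaves the file on disk, cleanup has to be explicit. A minimal sketch of that pattern, with write_rules_file as a hypothetical helper name, could look like this:

```python
import os
import tempfile


def write_rules_file(rules_text):
    """Write the rules to a temporary file and return its path."""
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".txt", delete=False, encoding="utf-8"
    ) as rules_file:
        rules_file.write(rules_text)
    return rules_file.name


path = write_rules_file("replace_me==>example\n")
try:
    with open(path, encoding="utf-8") as handle:
        print(handle.read(), end="")  # git filter-repo would read the file here
finally:
    os.remove(path)  # delete=False means cleanup is our job
```

The try/finally block makes sure the file is removed even if the step that uses it fails.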
Step 5: Running git filter-repo to Rewrite History
We finally applied the changes using subprocess.run(), which calls a system command from Python:
import subprocess

try:
    subprocess.run([
        'git', 'filter-repo', '--force',
        # The callback option takes the body of the function as a string
        '--message-callback', commit_callback_code,
        '--replace-text', replace_file_path
    ], check=True)
    print("Git history rewritten successfully.")
except subprocess.CalledProcessError as e:
    print(f"Error while running git-filter-repo: {e}")
Breaking It Down:
- git filter-repo --force → Lets the command run even if the repository is not a fresh clone, a case in which git filter-repo would otherwise refuse to rewrite history.
- --message-callback → Runs our Python code to modify commit messages.
- --replace-text → Uses the temporary file to replace text in file contents across the repository.
Step 6: Handling Errors
If something goes wrong (e.g., git filter-repo is not installed or a commit message contains unexpected content), we catch the error:
except subprocess.CalledProcessError as e:
    print(f"Error while running git-filter-repo: {e}")
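The command assembly and the error handling can be separated so that each part is easy to check on its own. The sketch below is one way to structure it; build_command and run_filter_repo are hypothetical names, and the extra FileNotFoundError branch covers the case where the git executable itself is missing, which subprocess.run reports differently from a failing command.

```python
import subprocess


def build_command(commit_callback, message_callback, replace_file_path):
    """Assemble the git filter-repo call as an argument list."""
    return [
        "git", "filter-repo", "--force",
        "--commit-callback", commit_callback,
        "--message-callback", message_callback,
        "--replace-text", replace_file_path,
    ]


def run_filter_repo(command, runner=subprocess.run):
    """Run the command and report success as a boolean."""
    try:
        runner(command, check=True)
    except FileNotFoundError:
        print("git executable not found on PATH")
        return False
    except subprocess.CalledProcessError as error:
        print(f"git filter-repo failed with exit code {error.returncode}")
        return False
    return True
```

Passing the arguments as a list means the callback code needs no shell quoting, and the runner parameter allows a dry run with a stand-in function.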
Combined final solution
After the above steps, the script took the following form:
import subprocess
import tempfile
callback_code = '''
import re
def map_email(name, email):
    if email == "" or email == "<>":
        name_surname = name.lower().replace(" ", ".").replace("-", ".")
        return f"{name_surname}@example.com"
    email_prefix = email.split("@")[0]
    return f"{email_prefix}@example.com"

def map_name(name):
    if "Robot - Build Service (outdated)" in name:
        return "Robot Build Service"
    if "Outdated Build Service" in name:
        return "Media Build Service"
    # The backslash is doubled because this code lives inside a regular (non-raw) string
    if re.match(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$', name):
        name_part = name.split("@")[0]
        first_name, last_name = name_part.split(".")
        return f"{first_name.capitalize()} {last_name.capitalize()}"
    return name

# Rewrite author information with debug output
author_name = commit.author_name.decode("utf-8")
author_email = commit.author_email.decode("utf-8")
new_author_name = map_name(author_name)
new_author_email = map_email(author_name, author_email)
print(f"Rewriting Author: {author_name} <{author_email}> to {new_author_name} <{new_author_email}>")
commit.author_name = new_author_name.encode("utf-8")
commit.author_email = new_author_email.encode("utf-8")
# Rewrite committer information with debug output
committer_name = commit.committer_name.decode("utf-8")
committer_email = commit.committer_email.decode("utf-8")
new_committer_name = map_name(committer_name)
new_committer_email = map_email(committer_name, committer_email)
print(f"Rewriting Committer: {committer_name} <{committer_email}> to {new_committer_name} <{new_committer_email}>")
commit.committer_name = new_committer_name.encode("utf-8")
commit.committer_email = new_committer_email.encode("utf-8")
'''
commit_callback_code = '''
message = re.sub(b"to_be_replaced", b"example", message, flags=re.MULTILINE | re.IGNORECASE)
message = re.sub(b"replaceable", b"example", message, flags=re.MULTILINE | re.IGNORECASE)
message = re.sub(b"replace_me", b"example", message, flags=re.MULTILINE | re.IGNORECASE)
message = re.sub(b"renameable", b"example", message, flags=re.MULTILINE | re.IGNORECASE)
return message'''
replace_text_code = '''
regex:(?i)to_be_replaced==>example
regex:(?i)replaceable==>example
regex:(?i)replace_me==>example
regex:(?i)renameable==>example
'''
try:
    with tempfile.NamedTemporaryFile(mode='w', delete=False, encoding='utf-8') as replace_file:
        replace_file.write(replace_text_code)
    # Leaving the with block closes the file, so git filter-repo can read it
    print(f"Temporary file created at: {replace_file.name}")
    subprocess.run([
        'git', 'filter-repo', '--force', '--commit-callback', callback_code, '--message-callback', commit_callback_code, '--replace-text', replace_file.name
    ], check=True)
    print("Git history rewritten successfully.")
except subprocess.CalledProcessError as e:
    print(f"Error while running git-filter-repo: {e}")
except Exception as e:
    print(f"Unexpected error: {e}")
The time has come for the grand finale, in which:
- We made a backup copy of the repository before running the script, being aware that the operation is irreversible
- We reminded the team members about the need to clone the repository again
- We ran the script via python3 rewrite-authors.py
- We checked the result using git log --format='%an <%ae>'
- We pushed the changes using git push --force --all to overwrite the history
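The check of the result can be automated too, which is handy right before a force push. The sketch below scans log output in the Name <email> format and lists every identity that is still outside the expected domain; find_unmigrated is a hypothetical helper and the sample text is made up.

```python
import re

IDENTITY_RE = re.compile(r"^(?P<name>.*) <(?P<email>[^<>]*)>$")


def find_unmigrated(log_output, domain="example.com"):
    """Return identities whose email does not end with the expected domain."""
    leftovers = set()
    for line in log_output.splitlines():
        match = IDENTITY_RE.match(line.strip())
        if match is None:
            continue  # not an identity line
        if not match.group("email").lower().endswith("@" + domain):
            leftovers.add(line.strip())
    return sorted(leftovers)


sample = """Adam Jones <adam.jones@example.com>
Josh <>
Bart Clear <bart.clear@outdated.com>
"""
for identity in find_unmigrated(sample):
    print("Still to fix:", identity)
```

An empty result is a reasonable signal that the author data is ready to be pushed.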
Final words
That’s it! As you can see, filter-repo is a very powerful tool. It allows us to modify git history in a way that completely rewrites author names, emails, commit messages, and their contents.
Whether you need to remove sensitive data, standardize author details, or reorganize your commit messages, filter-repo provides a robust and flexible solution.
I’d also like to give a special thanks to Krzysztof Babis, who you can find on LinkedIn. Krzysztof is an excellent iOS developer and created the underlying script (the main part) and the step-by-step explanation.