Brad's PyNotes

Difflib Module

TL;DR

The difflib module provides tools for comparing sequences (especially text strings) and generating difference reports in various formats. It can find the similarity between strings, produce unified or context diffs like Unix diff tools, and identify close matches from a list of possibilities.

Interesting!

The get_close_matches() function makes fuzzy string matching trivially easy - perfect for “did you mean?” suggestions in command-line tools or fixing typos.

Finding Close Matches

The simplest entry point is get_close_matches(), which finds similar strings from a list:

python code snippet start

from difflib import get_close_matches

words = ['apple', 'banana', 'apricot', 'avocado', 'grape']
possibilities = get_close_matches('aple', words, n=3, cutoff=0.6)
print(possibilities)
# Output: ['apple', 'grape']

# Great for command suggestions
valid_commands = ['start', 'stop', 'restart', 'status']
user_input = 'stat'
suggestions = get_close_matches(user_input, valid_commands)
if suggestions:
    print(f"Did you mean: {suggestions[0]}?")
    # Output: Did you mean: start?
else:
    print("No suggestions found")

python code snippet end

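The cutoff parameter (default 0.6) sets the minimum similarity score a candidate must reach: 1.0 accepts only identical strings, and 0.0 accepts anything. The n parameter (default 3) caps how many matches are returned, best match first.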
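To see the cutoff in action, here is a small sketch that queries the same command list at three thresholds. The variable names (strict, default, loose) are purely illustrative.

```python
from difflib import get_close_matches

valid_commands = ['start', 'stop', 'restart', 'status']

# A high cutoff keeps only very close candidates
strict = get_close_matches('stat', valid_commands, n=3, cutoff=0.85)
print(strict)   # ['start']

# Defaults: n=3, cutoff=0.6
default = get_close_matches('stat', valid_commands)
print(default)  # ['start', 'status', 'restart']

# cutoff=0.0 filters nothing, so n alone limits the result
loose = get_close_matches('stat', valid_commands, n=4, cutoff=0.0)
print(loose)    # ['start', 'status', 'restart', 'stop']
```

For "did you mean?" prompts, a stricter cutoff tends to avoid surprising suggestions, at the cost of sometimes suggesting nothing.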

Measuring Similarity

SequenceMatcher calculates how similar two sequences are:

python code snippet start

from difflib import SequenceMatcher

def similarity_ratio(str1, str2):
    return SequenceMatcher(None, str1, str2).ratio()

print(similarity_ratio('hello world', 'hello there'))  # 0.6363...
print(similarity_ratio('Python', 'Python'))            # 1.0
print(similarity_ratio('Python', 'Java'))              # 0.0

# Works with any sequence
list1 = [1, 2, 3, 4, 5]
list2 = [1, 2, 4, 5, 6]
print(SequenceMatcher(None, list1, list2).ratio())    # 0.8

python code snippet end

The ratio() method returns a value between 0 and 1, where higher values indicate greater similarity. If ratio() is too slow, quick_ratio() computes an upper bound on it more cheaply: the real ratio is never higher than the quick one, but it may be lower.
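As a rough illustration of what "upper bound" means here, the sketch below compares the ratio methods on a pair of anagrams. The word pair is an arbitrary choice; real_quick_ratio() is a third, even cheaper estimate that SequenceMatcher also provides.

```python
from difflib import SequenceMatcher

matcher = SequenceMatcher(None, 'listen', 'silent')

exact = matcher.ratio()             # order-sensitive, most expensive
quick = matcher.quick_ratio()       # compares character counts only
rough = matcher.real_quick_ratio()  # looks only at the two lengths

print(exact, quick, rough)

# Each cheaper estimate can overshoot the real ratio, never undershoot it
print(exact <= quick <= rough)       # True

# Anagrams share every character, so quick_ratio() reports 1.0
# even though the strings differ
print(quick == 1.0 and exact < 1.0)  # True
```

A common pattern is to use the cheap bounds as a pre-filter: if quick_ratio() is already below your threshold, there is no need to call ratio() at all.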

Generating Unified Diffs

Create Unix-style unified diffs between two lists of lines, such as the contents of two text files:

python code snippet start

from difflib import unified_diff

original = ['Line 1\n', 'Line 2\n', 'Line 3\n', 'Line 4\n']
modified = ['Line 1\n', 'Line 2 modified\n', 'Line 3\n', 'Line 5\n']

diff = unified_diff(original, modified,
                   fromfile='original.txt',
                   tofile='modified.txt')

for line in diff:
    print(line, end='')

python code snippet end

Output:

code snippet start

--- original.txt
+++ modified.txt
@@ -1,4 +1,4 @@
 Line 1
-Line 2
+Line 2 modified
 Line 3
-Line 4
+Line 5

code snippet end

This is the same unified format that diff -u produces, making it a good fit for version control tooling or patch generation.
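One practical wrinkle: the example above relies on every input line ending in '\n'. If the lines come from str.splitlines(), which strips line endings, passing lineterm='' keeps the header lines consistent with the content lines. A minimal sketch, with made-up file names and text:

```python
from difflib import unified_diff

old_text = "alpha\nbeta\ngamma\n"
new_text = "alpha\nBETA\ngamma\n"

diff_lines = list(unified_diff(
    old_text.splitlines(),   # lines without trailing newlines
    new_text.splitlines(),
    fromfile='old.txt',      # illustrative file names
    tofile='new.txt',
    lineterm='',             # don't append '\n' to the header lines
    n=1,                     # one line of context around each change
))

print('\n'.join(diff_lines))
# --- old.txt
# +++ new.txt
# @@ -1,3 +1,3 @@
#  alpha
# -beta
# +BETA
#  gamma
```

The n argument controls how many unchanged context lines surround each change; the default is 3.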

Line-by-Line Comparison

For more detailed comparison with intra-line changes highlighted:

python code snippet start

from difflib import Differ

d = Differ()
text1 = ['Hello world\n', "Python is great isn't it!\n"]
text2 = ['Hello world\n', "Python is fabo isn't it!\n"]

result = list(d.compare(text1, text2))
for line in result:
    print(repr(line))

python code snippet end

Each output line starts with a two-character prefix: '- ' for removed lines, '+ ' for added lines, '  ' (two spaces) for unchanged lines, and '? ' for intra-line markers:

code snippet start

'  Hello world\n'
"- Python is great isn't it!\n"
'?           ^^^ ^\n'
"+ Python is fabo isn't it!\n"
'?           ^ ^^\n'

code snippet end
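Because every line from compare() carries a two-character prefix, the result is easy to post-process. The sketch below uses a hypothetical helper, split_changes(), to collect removed and added lines while skipping unchanged lines and '? ' hints.

```python
from difflib import Differ

def split_changes(old_lines, new_lines):
    """Hypothetical helper: return (removed, added) lines without prefixes."""
    removed, added = [], []
    for line in Differ().compare(old_lines, new_lines):
        prefix, text = line[:2], line[2:]
        if prefix == '- ':
            removed.append(text)
        elif prefix == '+ ':
            added.append(text)
        # '  ' (unchanged) and '? ' (intra-line hints) are ignored
    return removed, added

old = ['red\n', 'green\n', 'blue\n']
new = ['red\n', 'blue\n', 'yellow\n']

removed, added = split_changes(old, new)
print(removed)  # ['green\n']
print(added)    # ['yellow\n']
```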

HTML Output

For web applications, HtmlDiff generates side-by-side HTML comparison tables:

python code snippet start

from difflib import HtmlDiff

html_diff = HtmlDiff()
text1_lines = ['Line 1', 'Line 2', 'Line 3']
text2_lines = ['Line 1', 'Modified Line 2', 'Line 3']

# Generate complete HTML document with built-in styling
html_doc = html_diff.make_file(text1_lines, text2_lines,
                               fromdesc='Original',
                               todesc='Modified')

# Write to file
with open('diff_output.html', 'w', encoding='utf-8') as f:
    f.write(html_doc)

# Note: make_table() is also available for embedding, but requires
# you to provide CSS styling to match the diff table classes

python code snippet end

The make_file() method produces a complete HTML document with all necessary styling, while make_table() gives you just the table for embedding (but you’ll need to add CSS for the diff highlighting). See an HTML diff example generated with HtmlDiff.make_file().
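For the embedding case, here is a minimal sketch of wrapping make_table() output in your own page. The colours and the page skeleton are assumptions chosen for illustration; the class names (diff_header, diff_add, diff_chg, diff_sub) are the ones difflib puts on the table cells.

```python
from difflib import HtmlDiff

old_lines = ['Line 1', 'Line 2', 'Line 3']
new_lines = ['Line 1', 'Modified Line 2', 'Line 3']

# make_table() returns only the <table> markup, with no <html> or <style>
table_html = HtmlDiff().make_table(old_lines, new_lines,
                                   fromdesc='Original',
                                   todesc='Modified')

# Assumed styling: adjust the colours to match your own site
css = """
table.diff { font-family: monospace; border-collapse: collapse; }
.diff_header { background-color: #e0e0e0; }
.diff_add { background-color: #aaffaa; }
.diff_chg { background-color: #ffff77; }
.diff_sub { background-color: #ffaaaa; }
"""

# Illustrative page skeleton; in practice this would be your own template
page = (
    "<html><head><meta charset='utf-8'>"
    f"<style>{css}</style></head>"
    f"<body>{table_html}</body></html>"
)
```

Keeping the CSS in your own stylesheet means the diff can match the rest of the site, which is the main reason to prefer make_table() over make_file().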

The difflib module simplifies sequence comparison. It combines programmatic tools for measuring how similar two sequences are with standardised diff output that shows changes to people in a familiar format.

Also see: the string module provides the building blocks for text manipulation, while regular expressions offer pattern-based text processing. For formatting text output, check out the textwrap module.

Reference: difflib - Helpers for computing deltas