Compares the text inside two XML documents and marks up the differences with <del> and <ins> tags.
This is the result of about 7 years of trying to get this right and coded simply. I've used code like this in one form or another to compare bill text on GovTrack.us <https://www.govtrack.us>.
The comparison is completely blind to the structure of the two XML documents. It does a word-by-word comparison on the text content only, and then it goes back into the original documents and wraps changed text in new <del> and <ins> wrapper elements.
The documents are then concatenated to form a new document and the new document is printed on standard output. Or use this as a library and call compare yourself with two lxml.etree.Element nodes (the roots of your documents).
The script is written in Python 3.
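To make the "blind to structure" idea concrete, here is a minimal sketch of a text-only, word-by-word comparison. It is not this module's implementation: it assumes the standard library's xml.etree.ElementTree and difflib in place of lxml and diff_match_patch, it only reports the changed words rather than wrapping them back into the documents, and the two sample documents and the helper names text_content and changed_words are made up for illustration.

```python
import difflib
import xml.etree.ElementTree as ET


def text_content(xml_string):
    """Return all text in the document, ignoring element boundaries."""
    root = ET.fromstring(xml_string)
    return "".join(root.itertext())


def changed_words(xml1, xml2):
    """Compare two documents word by word, looking at text content only."""
    words1 = text_content(xml1).split()
    words2 = text_content(xml2).split()
    matcher = difflib.SequenceMatcher(None, words1, words2, autojunk=False)
    deleted, inserted = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            deleted.extend(words1[i1:i2])   # words only in the first document
            inserted.extend(words2[j1:j2])  # words only in the second document
    return deleted, inserted


# Hypothetical sample documents: the markup differs (<b> vs <i>),
# but only the changed word is reported.
doc1 = "<html>Here is <b>some bold</b> text.</html>"
doc2 = "<html>Here is <i>some italic</i> text.</html>"
deleted, inserted = changed_words(doc1, doc2)
print(deleted, inserted)
```

Because only the text is compared, swapping <b> for <i> in the sample does not register as a change on its own; the one difference reported is the word that was replaced.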
Example
Comparing these two documents:
and:
Yields:
On Ubuntu, get dependencies with:
For really fast comparisons, get Google's Diff Match Patch library <https://code.google.com/p/google-diff-match-patch/>, as re-written and sped-up by @leutloff <https://github.com/leutloff/diff-match-patch-cpp-stl> and then turned into a Python extension module by me <https://github.com/JoshData/diff_match_patch-python>:
Or if you can't install that for any reason, use the pure-Python library:
This is also at <https://code.google.com/p/google-diff-match-patch/source/browse/trunk/python3/diff_match_patch.py>. xml_diff will use whichever is installed.
Finally, install this module:
Then call the module from the command line:
Or use the module from Python:
The two DOMs are modified in-place.
Optional Arguments
The compare function takes other optional keyword arguments:
merge is a boolean (default false) that indicates whether the comparison function should perform a merge. If true, dom1 will contain not just <del> nodes but also <ins> nodes and, similarly, dom2 will contain not just <ins> nodes but also <del> nodes. Although the two DOMs will now contain the same semantic information about changes, and the same text content, each preserves its original structure, since the comparison is only over text and not structure. The new ins/del nodes contain content from the other document (including whole subtrees), and so there's no guarantee that the final documents will conform to any particular structural schema after this operation.
word_separator_regex (default r'\s+|[^\s\w]') is a regular expression for how to separate words. The default splits on one or more whitespace characters in a row and single instances of non-word characters.
differ is a function that takes two arguments (text1, text2) and returns an iterator over difference operations given as tuples of the form (operation, text_length), where operation is one of '=' (no change in text), '+' (text inserted into text2), or '-' (text deleted from text1). (See xml_diff/__init__.py's default_differ function for how the default differ works.)
tags is a two-tuple of tag names to use for deleted and inserted content. The default is ('del', 'ins').
make_tag_func is a function that takes one argument, which is either 'ins' or 'del', and returns a new lxml.etree.Element to be inserted into the DOM to wrap changed content. If given, the tags argument is ignored.
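As an illustration of the differ interface described above, here is a minimal sketch of a custom differ built on the standard library's difflib. It is not the module's default_differ. It assumes that text_length counts characters and that a replaced run can be reported as a '-' followed by a '+'. The names split_words and word_differ are made up for this example, and the pattern is the documented default for word_separator_regex.

```python
import difflib
import re

# Same pattern as the documented default for word_separator_regex.
WORD_SEPARATOR = re.compile(r"\s+|[^\s\w]")


def split_words(text):
    """Split text into tokens, keeping separators so no characters are lost."""
    tokens = []
    pos = 0
    for match in WORD_SEPARATOR.finditer(text):
        if match.start() > pos:
            tokens.append(text[pos:match.start()])  # the word before the separator
        tokens.append(match.group())                # the separator itself
        pos = match.end()
    if pos < len(text):
        tokens.append(text[pos:])                   # trailing word, if any
    return tokens


def word_differ(text1, text2):
    """Yield (operation, text_length) tuples with operation in '=', '+', '-'."""
    words1 = split_words(text1)
    words2 = split_words(text2)
    matcher = difflib.SequenceMatcher(None, words1, words2, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        len1 = sum(len(w) for w in words1[i1:i2])
        len2 = sum(len(w) for w in words2[j1:j2])
        if tag == "equal":
            yield ("=", len1)
        else:
            if len1:
                yield ("-", len1)  # characters deleted from text1
            if len2:
                yield ("+", len2)  # characters inserted into text2


ops = list(word_differ("Here is some bold text.", "Here is some italic text."))
print(ops)  # [('=', 13), ('-', 4), ('+', 6), ('=', 6)]
```

A function shaped like this could then be passed as the differ keyword argument to compare, provided the character-count assumption matches what the module expects.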