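Binary file types like the Microsoft Word XML Format Document (.docx), the OpenDocument Text format (.odt) and the Portable Document Format (.pdf) are awkward to work with in git, because by default git cannot show a readable diff for them.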
With a simple configuration change and some open source, cross-platform tools, git can be adapted to diff those formats as well.
Installing the tools
First, one needs tools that can convert the binary files to plain text
formats. For most formats, like .docx and .odt, the open source tool Pandoc
will do the trick. It can even export those files to Markdown format, or (my
personal choice) reStructuredText. A markup language like reStructuredText
makes it possible to make a detailed comparison between structured documents,
for instance when a heading level has changed.
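To get a feel for what git will later compare, it can help to run the conversion by hand first. The following is only a minimal sketch: the function name preview_as_rst is made up for this illustration, and it assumes pandoc is installed and on the PATH. Pandoc infers the input format from the file extension and writes to stdout when no output file is given.

```shell
# Hypothetical helper: convert a .docx or .odt file to reStructuredText on stdout.
preview_as_rst() {
    # $1: path to the document to convert
    if ! command -v pandoc >/dev/null 2>&1; then
        echo "pandoc not found on PATH" >&2
        return 127
    fi
    pandoc --to=rst "$1"
}

# Usage (report.docx is a placeholder file name):
#   preview_as_rst report.docx | head -n 20
```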
There's a tiny catch with pdftotext: by default it writes its output to a
file instead of to stdout, and git expects the converted text on stdout.
This can be fixed by creating a tiny wrapper named pdftostdout, which
executes the program with the correct parameters. A dash as the last
parameter instructs pdftotext to use stdout:
printf '#!/bin/sh\nexec pdftotext "$1" -\n' > /usr/bin/pdftostdout
chmod +x /usr/bin/pdftostdout
Of course this wrapper can be stored anywhere, as long as it can be executed and found by git.
Add new text conversion handlers to git
After installing both programs and the wrapper, git needs to be told how to convert the binary file types to text format. This can be accomplished by modifying the global git configuration:
git config --global diff.docx.textconv "pandoc --to=rst"
git config --global diff.odt.textconv "pandoc --to=rst"
git config --global diff.pdf.textconv pdftostdout
This creates a new diff handler for each of the file types.
The --to=rst parameter tells pandoc to output the reStructuredText format.
This makes comparing hierarchies easier than just using the plain text
format.
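A quick way to check that the handlers ended up in the global configuration is to list every textconv entry. This is just a sketch; the function name list_textconv_handlers is invented for this example and is not part of git.

```shell
# Hypothetical helper: list all diff handlers with a textconv command
# that are defined in the global git configuration.
list_textconv_handlers() {
    git config --global --get-regexp '^diff\..*\.textconv$'
}

# Usage:
#   list_textconv_handlers
```

Each configured handler is printed as its key followed by its command, for example diff.docx.textconv pandoc --to=rst. If nothing is configured, git prints nothing and exits with a non-zero status.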
Instruct git to apply the correct handlers per file type
The gitattributes file defines attributes per path, or per file. That means that you can specify handlers per file _type_, which will automatically convert the binary format to text format, using the correct tool.
The gitattributes file can be specified locally (per git repository), per
system, or globally. Globally is usually the preferred choice: configure once
per user, and use everywhere, with each repository. The global gitattributes
file can be found under ~/.config/git/attributes.
As the global and system gitattributes files have the lowest precedence,
they can easily be overridden locally. This can be done by creating a
.gitattributes file in the root of a repository.
The following snippet adds the correct conversion handler per file type to the global gitattributes file:
mkdir -p ~/.config/git
echo "*.docx diff=docx" >> ~/.config/git/attributes
echo "*.odt diff=odt" >> ~/.config/git/attributes
echo "*.pdf diff=pdf" >> ~/.config/git/attributes
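To confirm that git picks up the attributes, one can ask it which diff handler applies to a given path. The sketch below wraps git check-attr in a small function; the name show_diff_driver and the file name report.docx are placeholders for this example.

```shell
# Hypothetical helper: print which diff handler git would apply to the given paths.
# Must be run inside a git repository; the paths do not need to exist.
show_diff_driver() {
    git check-attr diff -- "$@"
}

# Usage:
#   show_diff_driver report.docx
#   report.docx: diff: docx
```

If the output says "unspecified" instead of a handler name, the attributes file is most likely not in the location git reads.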
And that's all there is to it. Now git diff will show all changes in
plain text format for the binary file types.
Any binary format can be diffed with git, as long as there's a tool which converts the binary format to plain text. One just needs to add the conversion handlers and attributes in the same way.
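As an illustration of that last point, here is a self-contained sketch that applies the same recipe to gzip-compressed text files. The choice of gzip, the handler name gz and the file name notes.gz are assumptions made for this demo only. It uses a throwaway repository and repository-local settings, so the global configuration is left alone.

```shell
# Create a throwaway repository for the demo.
demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init -q .

# Conversion handler: git appends the file path as the last argument,
# so this effectively runs "gzip -dc <file>" and reads the text from stdout.
git config diff.gz.textconv "gzip -dc"

# Attribute: use the gz handler for every *.gz file in this repository.
echo "*.gz diff=gz" > .gitattributes

# Stage a first version of the compressed file.
printf 'first line\n' | gzip -c > notes.gz
git add .gitattributes notes.gz

# Change the file and compare the working tree with the staged version.
printf 'first line\nsecond line\n' | gzip -c > notes.gz
diff_output="$(git diff -- notes.gz)"
printf '%s\n' "$diff_output"
```

Instead of a bare "Binary files differ" message, the diff should contain the added line as +second line.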