conversion_tools

Working with binary file formats such as Microsoft Word (docx), OpenDocument Text (odt) and the Portable Document Format (pdf) in combination with git has its difficulties. Out of the box, git only provides diffing for plain text formats. For binary files it merely reports that the files differ, without showing what changed.

With a simple configuration change and some open source, cross-platform tools, git can be adapted to diff those formats as well.

Installing the tools

First, one needs tools that can convert the binary files to plain text formats. For most formats, like docx and odt, the open source tool Pandoc [1] will do the trick. It can even export those files to Markdown format, or (my personal choice) reStructuredText [2]. A markup language like reStructuredText makes it possible to make a detailed comparison between structured documents, for instance when a heading level has changed.

For PDF, there's the open source tool pdftotext, which is part of the Poppler [3] utils package and available for (almost) all operating systems. It converts a PDF file to plain text.
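Before going further, it can help to confirm that both converters are reachable from the shell. The following is a minimal sketch; the helper name check_tool is made up for this example and is not part of either tool.

```shell
# Hypothetical helper: report whether a command can be found on PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: missing"
  fi
}

check_tool pandoc
check_tool pdftotext
```

If either tool is reported as missing, install it through the package manager of your operating system before continuing.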

There's a tiny catch with pdftotext: by default it writes its output to a text file instead of stdout, whereas git expects the converted text on stdout.

This can be fixed by creating a tiny wrapper named pdftostdout around pdftotext, which executes the program with the correct parameters. A dash as the last parameter instructs pdftotext to use stdout:

printf '#!/bin/sh\nexec pdftotext "$1" -\n' > /usr/bin/pdftostdout
chmod +x /usr/bin/pdftostdout

Of course this wrapper can be stored anywhere, as long as it can be executed and found by git.
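As a sketch of the "anywhere" option, the wrapper could live in a per-user directory, which avoids needing root permissions. The directory $HOME/.local/bin is an assumption here; any directory that is on your PATH will do.

```shell
# Assumed per-user location; pick any directory that is on PATH.
bindir="$HOME/.local/bin"
mkdir -p "$bindir"

# Write the wrapper. The quoted 'EOF' keeps $1 literal inside the script.
cat > "$bindir/pdftostdout" <<'EOF'
#!/bin/sh
# Convert the PDF given as first argument to text on stdout.
exec pdftotext "$1" -
EOF
chmod +x "$bindir/pdftostdout"

# Report whether the chosen directory is reachable through PATH.
case ":$PATH:" in
  *":$bindir:"*) echo "pdftostdout is on PATH" ;;
  *) echo "add $bindir to PATH so git can find pdftostdout" ;;
esac
```

Writing the script with a quoted here-document matters: the shell must not expand $1 while the file is being created, only when the wrapper itself runs.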

Add new text conversion handlers to git

After installing both programs and the wrapper, git needs to be told how to convert the binary file types to text. This can be accomplished by modifying the global git configuration:

git config --global diff.docx.textconv "pandoc --to=rst"
git config --global diff.odt.textconv "pandoc --to=rst"
git config --global diff.pdf.textconv pdftostdout

This creates new diff handlers for each of the file types.
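To double-check what ended up in the configuration, the handlers can be listed again. This is a small sketch; the function name list_textconv is invented for this example.

```shell
# Hypothetical helper: print every diff.<name>.textconv entry
# from the global git configuration, one "key value" pair per line.
list_textconv() {
  git config --global --get-regexp '^diff\..*\.textconv$'
}

list_textconv || echo "no textconv handlers configured"
```

Each line of output shows a key such as diff.pdf.textconv followed by the command git will run for that handler.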

Note

The parameter --to=rst tells pandoc to output reStructuredText. This makes comparing document hierarchies easier than with plain text output.

Instruct git to apply the correct handlers per file type

Finally, git needs to know which conversion handler to use for which file type. That can be accomplished by modifying the global gitattributes [4] file.

The gitattributes file defines attributes per path, or per file. That means that you can specify handlers per file _type_, which will automatically convert the binary format to text format, using the correct tool.

The gitattributes file can be specified locally (per git repository), per system, or globally (per user). Globally is usually the preferred choice: configure once per user, and use it everywhere, in every repository. By default, the global gitattributes file can be found under $HOME/.config/git/attributes.

Note

As the global and system gitattributes files have the lowest precedence, they can easily be overridden per repository. This can be done by creating a .gitattributes file in the root of a repository.

The following snippet adds the correct conversion handlers per file type to the global gitattributes file:
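As an illustration of such an override, the sketch below creates a throwaway repository in which PDF files are treated as opaque binaries again, regardless of any global setting. The file name report.pdf is hypothetical and does not need to exist.

```shell
# Throwaway repository, just for the demonstration.
repo=$(mktemp -d)
cd "$repo" && git init -q

# Repository-level rule: unset the diff attribute for PDF files here.
echo "*.pdf -diff" > .gitattributes

# Ask git which diff attribute applies to a (hypothetical) file name.
git check-attr diff -- report.pdf
# prints: report.pdf: diff: unset
```

The git check-attr command is a convenient way to see which rule wins for a given path when several attributes files are in play.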

mkdir -p ~/.config/git
echo "*.docx diff=docx" >> ~/.config/git/attributes
echo "*.odt diff=odt" >> ~/.config/git/attributes
echo "*.pdf diff=pdf" >> ~/.config/git/attributes

And that's all there is to it. Now git diff will show all changes in plain text format for the binary file types docx, odt and pdf.

Any binary format can be diffed with git, as long as there's a tool which converts the binary format to plain text. One just needs to add the conversion handlers and attributes in the same way.
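To illustrate the same recipe with a different format, here is a sketch that diffs gzip-compressed text files, using gzip -dc as the converter. The handler name gz, the file name notes.txt.gz and the commit identity are all made up for this example, and the handler is configured per repository (without --global) so nothing outside the throwaway repository is touched.

```shell
# Throwaway repository with a local identity for the test commit.
repo=$(mktemp -d)
cd "$repo" && git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

# Conversion handler: decompress the file to stdout before diffing.
git config diff.gz.textconv "gzip -dc"
# Attribute: apply that handler to every *.gz file in this repository.
echo "*.gz diff=gz" > .gitattributes

# Commit a first version, then change the compressed content.
printf 'first line\n' | gzip > notes.txt.gz
git add . && git commit -q -m "initial version"
printf 'first line\nsecond line\n' | gzip > notes.txt.gz

# The diff now shows the decompressed text instead of a binary notice.
git diff -- notes.txt.gz
```

The diff output should contain the added line as regular text, which is the same mechanism at work as for docx, odt and pdf.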

[1] https://pandoc.org/
[2] http://docutils.sourceforge.net/rst.html
[3] https://poppler.freedesktop.org/
[4] https://git-scm.com/docs/gitattributes
