Diff binary files like docx, odt and pdf with git


Working with binary file types like the Microsoft Word XML Format Document (docx), the OpenDocument Text format (odt) and the Portable Document Format (pdf) in combination with git has its difficulties. Out of the box, git only provides diffing for plain text formats. Comparing binary files in textual format is not supported.

With a simple configuration change and some open source, cross-platform tools, git can be adapted to diff those formats as well.

Installing the tools

First, one needs tools that can convert the binary files to plain text formats. For most formats, like docx and odt, the open source tool Pandoc [1] will do the trick. It can even export those files to Markdown format, or (my personal choice) reStructuredText [2]. A markup language like reStructuredText makes it possible to make a detailed comparison between structured documents, for instance when a heading level has changed.

For PDF, there's the open source tool pdftotext, which is part of the Poppler [3] utils package and available for (almost) all operating systems. It can convert a PDF file to plain text.
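
To give an idea of how such a conversion step could look, here is a minimal Python sketch of a converter script that prints a text version of a document to stdout. The function names, the choice of reStructuredText as Pandoc output and the detour via a temporary output file are my assumptions for illustration only.

```python
import os
import subprocess
import sys
import tempfile


def converter_command(path, output_path):
    """Return the command line that converts `path` to text in `output_path`."""
    extension = os.path.splitext(path)[1].lower()
    if extension in (".docx", ".odt"):
        # Assumption: reStructuredText as the plain text target format
        return ["pandoc", "-t", "rst", "-o", output_path, path]
    if extension == ".pdf":
        # pdftotext writes to a file here instead of to stdout
        return ["pdftotext", "-layout", path, output_path]
    raise ValueError(f"no converter known for {path}")


def to_text(path):
    """Convert `path` using a temporary file and return the resulting text."""
    with tempfile.TemporaryDirectory() as directory:
        output_path = os.path.join(directory, "converted.txt")
        subprocess.run(converter_command(path, output_path), check=True)
        with open(output_path, encoding="utf-8", errors="replace") as handle:
            return handle.read()


if __name__ == "__main__" and len(sys.argv) == 2:
    # Write bytes so the console encoding doesn't get in the way
    sys.stdout.buffer.write(to_text(sys.argv[1]).encode("utf-8"))
```

Git can call a script like this through the textconv setting of a diff driver (diff.NAME.textconv in the git configuration), combined with a matching diff=NAME line in .gitattributes. The driver name and the location of the script are up to you.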

There's a tiny catch with pdftotext, as it has issues using stdout as output, instead of writing to files. This is …

Automating repetitive git / setup tasks

Imagine you work on a large number of projects. Each of those projects has its own git repository and accompanying notes file outside of the repo. Each git repository has its own githooks and custom setup.

Imagine having to work with multiple namespaces on different remote servers. Imagine setting up these projects by hand, multiple times a week.

Automation to the rescue! Where I usually use Bash shell scripts to automate workflows, I'm moving more and more towards Python. It's cross-platform and sometimes easier to work with, as you have a large number of libraries at your disposal.

I wrote a simple Python script that does all of those things more or less 'automated'. Feed the script the name of the repository you want to clone, and optionally a namespace, patchfile and templatefile variable (either command-line or using a configuration file). The script will then:

  • clone the repository
  • modify the repository (e.g. apply githooks)
  • optionally modify the repository based on given variables
  • create a new notes file from a template
  • optionally modify the notes file based on given variables
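
The steps above could be sketched roughly as follows. This is not the actual script, just a minimal illustration: the function names, the parameters, the URL layout (remote/namespace/repository.git) and the $repo and $namespace template placeholders are all assumptions.

```python
import shutil
import subprocess
from pathlib import Path
from string import Template


def clone_url(remote, namespace, repo):
    """Build a clone URL; the remote/namespace/repo.git layout is an assumption."""
    parts = [remote.rstrip("/")]
    if namespace:
        parts.append(namespace.strip("/"))
    parts.append(repo + ".git")
    return "/".join(parts)


def render_notes(template_text, variables):
    """Fill $placeholders in the notes template, leaving unknown ones untouched."""
    return Template(template_text).safe_substitute(variables)


def setup_project(remote, repo, target_dir, namespace=None,
                  hooks_dir=None, template_file=None, notes_file=None):
    """Clone a repository, install githooks and create a notes file."""
    target = Path(target_dir) / repo
    subprocess.run(["git", "clone", clone_url(remote, namespace, repo), str(target)],
                   check=True)
    if hooks_dir:
        # shutil.copy keeps the permission bits, so hooks stay executable
        for hook in Path(hooks_dir).iterdir():
            if hook.is_file():
                shutil.copy(hook, target / ".git" / "hooks" / hook.name)
    if template_file and notes_file:
        text = Path(template_file).read_text(encoding="utf-8")
        variables = {"repo": repo, "namespace": namespace or ""}
        Path(notes_file).write_text(render_notes(text, variables), encoding="utf-8")
    return target
```

Keeping the URL building and the template rendering in small separate functions makes it easy to feed them from either command-line arguments or a configuration file.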

The advantage is that you can use a configuration file containing the location of the remote git repository, the patchfile …

automatic XML validation when using git

Recently I worked on a project which involved manually editing a bunch of XML files. Emacs is my favorite ~operating system~ editor, and it has XML validation built in (using the nXML mode). It highlights validation errors while you type. Unfortunately, even with Emacs showing potential issues in RED, I managed to commit a number of broken XML files to my local git repository. Subsequently, when I pushed my errors to the remote 'origin' git repository, the errors broke builds.

Of course this can be completely prevented by locally using pre-commit hooks. If your local git repository validates XML files before you can commit them, and denies invalid XML files, then one part of the problem is solved.

A pre-receive hook on the receiving server side can do the same as a pre-commit hook locally: validate XML files before letting somebody push a commit that can break the build process.
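
To illustrate the idea, here is a minimal sketch of what the pre-commit side could look like. It is an assumption-laden example, not the code from my repository: the function names are made up, and it only checks whether the staged XML files are well-formed (it does not validate against a schema).

```python
import subprocess
import sys
import xml.etree.ElementTree as ET


def is_well_formed(data):
    """Return True if `data` (bytes or str) parses as XML."""
    try:
        ET.fromstring(data)
    except ET.ParseError:
        return False
    return True


def staged_xml_files():
    """List added, copied or modified files in the index that end in .xml."""
    result = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM", "-z"],
        capture_output=True, text=True, check=True)
    names = [name for name in result.stdout.split("\0") if name]
    return [name for name in names if name.lower().endswith(".xml")]


def staged_content(name):
    """Return the staged version of a file, not the one in the working tree."""
    return subprocess.run(["git", "show", ":" + name],
                          capture_output=True, check=True).stdout


def main():
    """Report broken XML files and return a non-zero status if there are any."""
    broken = [name for name in staged_xml_files()
              if not is_well_formed(staged_content(name))]
    for name in broken:
        print(f"invalid XML: {name}", file=sys.stderr)
    return 1 if broken else 0
```

Saved as .git/hooks/pre-commit with a final sys.exit(main()) line, a non-zero exit status makes git refuse the commit.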

I looked around the Internet but couldn't find a lightweight quick script to do only and exactly that. That's the reason I whipped up a basic pre-commit and pre-receive hook, written in Python.

You can find the very basic and rough code at https://github.com/PeterMosmans/git-utilities.
By changing the …
git on Windows - location of configuration files

Git is used as the distributed version control system for the majority of projects I work on. On Windows I use the official Git for Windows version, as well as the 'native' mingw/MSYS2 git binary when using the MSYS2 shell.

The location of the system and global gitconfig configuration files varies, depending on which environment (native Windows command, Windows shell or MSYS2 shell) you're using, and depending on which binary (Git for Windows versus native git). There's a logic to it, but it can be hard to figure out...

Git version 2.8 introduced a much easier method of finding where the git configuration files are stored: the --show-origin flag. This flag shows, for each setting, exactly which configuration file it comes from.

Retrieve the locations (and name-value pairs) of all git configuration files:

git config --list --show-origin

Retrieve the location (and name-value pairs) of the system git configuration file:

git config --list --system --show-origin

Retrieve the unique locations of all git configuration files:

git config --list --show-origin | awk -F'\t' '{print $1}' | uniq
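
The same can be done from Python, which fits nicely when the lookup is part of a larger setup script. The sketch below is a minimal example with made-up function names. It relies on the origin and the name-value pair being separated by a tab in the output of --show-origin, so that paths containing spaces stay intact.

```python
import subprocess


def unique_origins(config_output):
    """Return the unique origins from `git config --list --show-origin` output."""
    origins = []
    for line in config_output.splitlines():
        # Each line looks like "file:<path><TAB>name=value"
        origin, separator, _ = line.partition("\t")
        if separator and origin not in origins:
            origins.append(origin)
    return origins


def git_config_origins():
    """Ask git for all configuration settings and return their unique origins."""
    result = subprocess.run(["git", "config", "--list", "--show-origin"],
                            capture_output=True, text=True, check=True)
    return unique_origins(result.stdout)
```

Calling git_config_origins() inside a repository returns a list of entries like file:<path>, in the order in which git reads them.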


Regardless of where you use git on Windows, the repository (local) configuration always resides at the same location, in the root directory of your repository …
