Diff binary files like docx, odt and pdf with git


Working with binary file types such as Microsoft Word documents (docx), OpenDocument Text (odt) and the Portable Document Format (pdf) in combination with git has its difficulties. Out of the box, git only provides diffing for plain text formats. For binary files it can only report that two versions differ, not show what changed in the text.

With a simple configuration change and some open source, cross-platform tools, git can be adapted to diff those formats as well.

Installing the tools

First, one needs tools that can convert the binary files to plain text formats. For most formats, like docx and odt, the open source tool Pandoc [1] will do the trick. It can even export those files to Markdown or (my personal choice) reStructuredText [2]. A markup language like reStructuredText makes a detailed comparison between structured documents possible, for instance when a heading level has changed.
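As a rough illustration, the configuration change could look something like the sketch below. It assumes Pandoc is installed and on the PATH. The driver name `pandoc` and the scratch repository are illustrative choices and not taken from the article itself. PDF is deliberately left out.

```shell
# Create a throwaway repository so the sketch does not touch a real one.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo" || exit 1

# Map the binary formats to a diff driver. "pandoc" is an arbitrary label
# that only has to match the name used in the git config below.
printf '%s\n' '*.docx diff=pandoc' '*.odt diff=pandoc' >> .gitattributes

# Define how that driver turns a file into text: git appends the path of the
# file to this command and diffs whatever the command writes to stdout.
git config diff.pandoc.textconv 'pandoc --to=rst'
```

With something like this in place, `git diff` on a changed docx or odt file would compare the reStructuredText renderings instead of reporting that the binary files differ.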

For PDF, there's the open source tool pdftotext, which is part of the Poppler [3] utils package and available for (almost) all operating systems. It converts a PDF file to plain text.

There's a tiny catch with pdftotext: it has issues using stdout as output, instead of writing to files. This is …

more ...


Generate list of used content tags for Pelican

If your Pelican-generated site uses lots of different tags for articles, it can be difficult to remember tag names or use them consistently. I therefore needed a quick way to print the unique tags that are stored, comma-separated, in the text files.

This shell one-liner, run from within the content directory, will sort and show all tags from reStructuredText (*.rst) files:

grep -h '^:tags:' *.rst | sed -e 's/^:tags:\s*//;s/\s*,\s*/\n/g' | sort -u

First, grep filters on the :tags: property and prints only the matching lines (without the filename, thanks to the -h flag).

Then sed removes the :tags: keyword (and the whitespace following it) and splits the tags onto separate lines by replacing each comma, plus any surrounding whitespace, with a newline character.

Finally, sort takes care of sorting and only printing unique entries.
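To see the three steps at work, here is a small sketch that runs the same pipeline against a throwaway directory. The two article files and their tags are made up for illustration, and GNU sed is assumed (for `\s` and for `\n` in the replacement).

```shell
# Build a scratch content directory with two hypothetical articles.
content_dir=$(mktemp -d)
printf ':title: First post\n:tags: git, pelican\n\nBody text.\n' > "$content_dir/first.rst"
printf ':title: Second post\n:tags: pelican, bash,git\n\nBody text.\n' > "$content_dir/second.rst"

# Run the one-liner from inside that directory and capture the result.
tags=$(cd "$content_dir" && grep -h '^:tags:' *.rst | sed -e 's/^:tags:\s*//;s/\s*,\s*/\n/g' | sort -u)
printf '%s\n' "$tags"
```

This prints bash, git and pelican, each on its own line: the duplicates are gone and the inconsistent spacing around the commas no longer matters.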

Analogously, one can do the same for categories:

grep -h '^:category:' *.rst | sed -e 's/^:category:\s*//' | sort -u

As Pelican only allows one category, this is somewhat simpler.
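A quick way to try the category variant is to run it against a few sample files. The file names and category values below are hypothetical, and GNU sed is again assumed.

```shell
# Scratch directory with three hypothetical articles, two sharing a category.
content_dir=$(mktemp -d)
printf ':title: One\n:category: Linux\n\nText.\n' > "$content_dir/one.rst"
printf ':title: Two\n:category: Security\n\nText.\n' > "$content_dir/two.rst"
printf ':title: Three\n:category:   Linux\n\nText.\n' > "$content_dir/three.rst"

# No splitting on commas is needed, as each file carries a single category.
categories=$(cd "$content_dir" && grep -h '^:category:' *.rst | sed -e 's/^:category:\s*//' | sort -u)
printf '%s\n' "$categories"
```

The output is Linux followed by Security. The extra spaces in the third file are stripped by sed, so both Linux entries collapse into one.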

For maximum readability, tr can convert the newlines into spaces, so that the output is one big line:

grep -h '^:tags:' *.rst | sed -e 's/^:tags:\s*//;s/\s*,\s*/\n/g' | sort -u | tr '\n' ' '; echo

The last echo is meant to end …

more ...

Convert WordPress to static site generator Pelican


After a number of years using WordPress as blogging software, I converted the site to a static site generator: Pelican.

Pelican converts reStructuredText into static HTML. No more PHP, no more databases, but straight static HTML.

The process of converting the site was relatively painless. The conversion tool did a great job of converting an XML export of WordPress into reStructuredText pages.

What needed (and still needs) some manual care were the code blocks in articles (the biggest reason for the move from WordPress to Pelican) and the escaping of variables. WordPress gets pretty complex once you try to use it for code snippets and console output. reStructuredText is much more flexible and allows you to edit the site using any text editor. There are tools to do that with WordPress and its API, but it always felt like a difficult workaround.

I thought about keeping the URLs as-is: over the years the number of visitors to the site has steadily risen, as has the level of indexing by search engines. You don't want dead links, but on the other hand, a transition to another content management system would be the perfect moment to 'clean up' the category …

more ...