Wednesday, November 12, 2014

Generating Documentation with Markdown and Pandoc

Introduction

Over the years I have written a lot of documentation. I would say that about 98% of it has been in Microsoft Word. The other 2% has been written in text, usually a readme.txt. I generally use text when I need a break from Word. (It turns out that I have been using a convention that is very similar to Markdown for my text documents and did not know it.)

Problems with Word

For me, Word has been a necessary evil. I do not feel that it is a great tool for documentation. I find that I spend too much time on formatting and get too distracted by it. In particular, converting code to a fixed-width font consumes a lot of time.

For clarification: I am referring to documentation written by developers for developers on the same team. I am not referring to API documentation for external developers or end user documentation.

The other large issue I have with Word is that the file format is binary. I firmly believe that documentation should live as close to the source code as possible. For this reason I prefer storing documents in source control over an external wiki (or some lesser repository). But that means putting binary files in source control, and as you likely know by now, that causes problems with branching and merging. Specifically, most source control systems do not know how to merge binary files.

One of the reasons that I believe documentation should live with source code is specifically the case where you are branching the code. Take feature development for example. Suppose that a code change in a branch causes the documentation to change. If you have stored your documentation external to the code you are now faced with a dilemma. Do you change the doc to reflect the as-is state or the to-be? One or the other will be wrong. How will people know? Technically, you could put both in. However, once you merge you will have to remember to remove the old documentation. (Of course, the problem only gets worse if you have additional branches.)

Now suppose that a bug fix causes a change to the documentation in the mainline branch. Again you are faced with the problem of deciding where this change goes. On the other hand, if it is in source control you are in a situation where the documentation needs to be merged. This brings us back to the problem with Word’s binary files. Again, they cannot be merged, and speaking from experience, large documents are very hard to merge manually.
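To make the merge argument concrete, here is a minimal Python sketch showing that a line-based diff of a plain-text document isolates exactly the line that changed, which is the information a source control system works from when it merges. The document contents and the helper name changed_lines are made up for illustration. A binary file offers no comparable line structure to work with.

```python
import difflib

# Two hypothetical versions of the same plain-text document:
# the mainline copy and a feature-branch copy with one sentence changed.
mainline = [
    "Deployment\n",
    "\n",
    "Copy the build output to the server.\n",
    "Restart the service.\n",
]
feature_branch = [
    "Deployment\n",
    "\n",
    "Copy the build output to the staging server.\n",
    "Restart the service.\n",
]


def changed_lines(old, new):
    """Return only the removed and added lines from a unified diff."""
    diff = list(
        difflib.unified_diff(old, new, fromfile="mainline", tofile="feature", n=0)
    )
    # The first two lines of a non-empty unified diff are the file headers.
    body = diff[2:]
    return [line for line in body if line.startswith(("+", "-"))]


if __name__ == "__main__":
    for line in changed_lines(mainline, feature_branch):
        print(line, end="")
```

Only the one edited sentence shows up in the result; the untouched lines around it are left alone.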

Down the Hill We Go

Further options seem to go from bad to worse. I have seen Word documents stored on shared network drives. To me this is the worst of the worst. You are still stuck with Word, but now you have no version control at all. Furthermore, a strange thing seems to happen in this case: people stop collaborating. Suddenly, rather than change the document themselves, people start emailing changes to the original author. It is a peculiar behaviour I have noticed.

Then there are all the external repositories, things like SharePoint. You do get the versioning back, but still lose the branching and merging capability. Another worst of the worst is the SharePoint wiki. Even more cumbersome to use than Word. At that point you are better off putting Word documents in source control. Or alternatively, getting a usable wiki system. Or painting on cave walls.

In summary, my order of preference:

  1. A branchable/mergeable format in source control
  2. A binary file in source control
  3. Wiki
  4. SharePoint (or similar) document repository
  5. Cave drawings
  6. SharePoint wiki
  7. Network share

Alternatives to Word

Several alternatives to Word exist; however, very few are available “out of the box” in most organizations. That has led me in the direction of text with some sort of markup.

Text

I only recently encountered Markdown. Prior to that I was using my own syntax that was quite similar (which is not surprising given that they both have the same source: text email conventions). Marking up text is good, but not great. It can be branched and merged, but _I am italics_ does not scream out italics to everyone.

HTML

Another option is writing HTML. Again it is text with markup. However, HTML has two problems:

  1. If you think formatting Word documents is a pain, give HTML a whirl. (I suppose you could use an editor, but I am picturing handwritten.)
  2. It is not the input format for the final document. It is the document.

Now the second point is sort of moot if you think of a browser as the document viewer. There is not much difference between loading an HTML file in Chrome and loading a PDF in Acrobat. However, this does differ from the experience that you get with a tool like…

LaTeX

I finally got fed up with Word last year and began to look for a replacement. The idea of writing in text and generating a PDF (or some such document) was where I kept landing. Since TeX and LaTeX are king, that is where I looked.

Things never really got off the ground with me and LaTeX. There are two connected issues I have with it:

  1. The syntax is complex. Not overly complex, but complex enough.
  2. Because of #1, I could not see getting it absorbed into the organization I was working for. Remember, I want to store the text files in source control.

PostScript

The last thing I looked at was writing PostScript by hand. This way I would store them in source control, but everyone else would think they were PDFs. However, PostScript is a little too cumbersome to write by hand (RTF would have the same issue).

At this point I put my search on hiatus. I had spent enough time, in vain, looking for alternatives. It was time to get back to work and that meant suffering through Word.

Enter Markdown

Introduction

It has been about six months since I had to write any developer documentation. Last week I wrote a bit of documentation for my current client and realized I did not know where to put it. I fired off a quick email to my manager asking where such things should live.
His answer:

You can create some Word docs… We can check those into TFS or put them up onto a SharePoint.

I might have cringed a bit.

Markdown

Here I was, once again looking at my old nemesis. To the web I went. I quickly found this discussion on StackOverflow:

http://stackoverflow.com/questions/12537/what-tools-are-used-to-write-documentation

The answer from Colonel Panic, in particular, caught my eye:

I write in Markdown, the same formatting syntax we use on Stack Overflow. Because the documents are plain text, they can live alongside code in version control. That’s useful.

I render the documents to HTML and PDF with the swiss army knife Pandoc. With a short stylesheet, these look better than documents from word processors.

Well now, what have we here? This is perfect! A simple markup that I already know and the ability to convert it to the format bosses love. I was sure that PDF would be an acceptable format, but a quick check of the website revealed that pandoc also supports conversion to DOCX (and about 25 other formats).

Pandoc

I downloaded and installed the Windows msi on my machine. Loading PowerShell, I found that it was not in the path. The documentation implies that it should just be there, so I checked the path from the system settings and found it was there. I am not sure why PowerShell was not picking it up. So… when in doubt, reboot.
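A quick way to see whether a newly installed tool is visible to the current process is to search the PATH that the process inherited. The sketch below uses Python's shutil.which for that. The helper name find_tool is made up for this example, and the guess that the PowerShell session was holding a stale copy of PATH from before the install is an assumption, not something verified here.

```python
import os
import shutil


def find_tool(name, path=None):
    """Return the full path of an executable found on PATH, or None."""
    # shutil.which searches the given PATH string, or the current
    # process's PATH environment variable when path is None.
    return shutil.which(name, path=path)


if __name__ == "__main__":
    location = find_tool("pandoc")
    if location:
        print("pandoc found at", location)
    else:
        # Listing the entries makes it easy to spot a missing install folder.
        print("pandoc is not on the PATH seen by this process:")
        for entry in os.environ.get("PATH", "").split(os.pathsep):
            print("  ", entry)
```

If the install folder shows up in the system settings but not in this listing, the process is probably working from an older copy of the environment, and starting a fresh session (or rebooting) should pick up the change.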

Next I created a simple Readme.md and ran

pandoc Readme.md -o Readme.docx

And sure enough I have my Word doc. I could not be happier.

Next I tried

pandoc Readme.md -o Readme.pdf

Unfortunately, that resulted in the following error:

pandoc.exe: pdflatex not found. pdflatex is needed for pdf output.

First I found a blog recommending that I download protext.exe from http://tug.ctan.org/tex-archive/systems/win32/protext/ but the file is 1.7GB. Something smells fishy. If there is not a copy of Debian Linux in there I am going to say it is a little too big for my taste.

Then I landed back at the pandoc installation page, where it says

For PDF output, you’ll also need to install LaTeX. We recommend MiKTeX.

I opted for the 64-bit Net installer to see if I could trim down the download a bit. Still, 158MB is better than 1.7GB (11.01 times better to be somewhat exact). I chose the basic install and picked a mirror nearby. In the end I have no idea if that saved anything. I am guessing not. I still feel that it is far too heavy a requirement for another application. I also have a hard time believing all that weight is necessary. For reference, the source for txt2pdf. A more comparable example would be wkhtmltopdf, which clocks in at 13MB. I digress…

After installing it I once again had to reboot (I tried logging out but Windows just sat at the logging out screen until I rebooted). After rebooting I ran:

pandoc Readme.md -o Readme.pdf

This time MiKTeX popped up a few times asking to install additional packages. After that, I had my PDF.
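The steps above can be wrapped in a small script so that the conversion is repeatable. What follows is only a sketch: the function names build_commands and convert are hypothetical, it uses just the `pandoc INPUT -o OUTPUT` form shown above, and it assumes that quietly skipping the PDF when pdflatex is missing is the behaviour you want.

```python
import shutil
import subprocess
from pathlib import Path


def build_commands(source, have_pdflatex):
    """Build one pandoc command line per output format.

    Uses only the `pandoc INPUT -o OUTPUT` form; pandoc infers the
    output format from the extension of the output file.
    """
    stem = Path(source).with_suffix("")
    formats = ["docx"]
    if have_pdflatex:
        # The error message above says pdflatex is needed for PDF output,
        # so only request a PDF when pdflatex is available.
        formats.append("pdf")
    return [["pandoc", str(source), "-o", f"{stem}.{ext}"] for ext in formats]


def convert(source):
    """Run the conversions, doing nothing if pandoc or the input is missing."""
    if shutil.which("pandoc") is None:
        print("pandoc not found on PATH; nothing converted")
        return
    if not Path(source).is_file():
        print(f"{source} not found; nothing converted")
        return
    have_pdflatex = shutil.which("pdflatex") is not None
    for command in build_commands(source, have_pdflatex):
        print("running:", " ".join(command))
        subprocess.run(command, check=True)


if __name__ == "__main__":
    convert("Readme.md")
```

Keeping the command construction in its own function means it can be checked without pandoc installed, and a script like this can sit in source control next to the Markdown it converts.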

Conclusion

Now I just need to figure out how to do the same thing with Visio. For reference, this video pretty much sums up my experience using Visio. I think it might be more annoying to use than iTunes.
