Update for the filter module: faster caching

Over the last year, the code base of the filter module has matured considerably. The module now has all the features that I wanted when I started it about a year and a half ago. The last remaining limitation (in my eyes, at least) was that caching of results required a call to an external program (mtxrun) to calculate md5 hashes; as such, caching was slow. That is no longer the case. Since early December, md5 sums are calculated on the Lua end, so there is no time penalty for caching. As a result, in MkIV, recompiling is much faster for documents with lots of external filter environments that have caching enabled (i.e., environments defined with the continue=yes option).

Since the vim module uses the filter module in the background, recompiling MkIV documents that use the vim module will also be much faster. In this blog post, I explain both the old and the new implementation of caching.

The filter module works as follows. Suppose you want an environment

\startmarkdown
....
\stopmarkdown

in which you write content in Markdown and have a Markdown-to-ConTeXt converter like pandoc translate it to ConTeXt. Using the filter module, such an environment can be defined as

\defineexternalfilter
    [markdown]
    [filter={pandoc -t context}]

This defines a markdown start-stop environment. The contents of the environment are written to a \jobname-temp-markdown.tmp file, which is processed using pandoc -t context; the result is written to the \jobname-temp-markdown.tex file, which is finally read back into ConTeXt.

When a second markdown start-stop environment is encountered, the \jobname-temp-markdown.tmp file is overwritten, and the above process is repeated.
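
For illustration, a minimal document using such an environment might look like the following. This is only a sketch: it assumes the filter module is loaded with \usemodule[filter] and that pandoc is available on the path.

\usemodule[filter]

\defineexternalfilter
    [markdown]
    [filter={pandoc -t context}]

\starttext

\startmarkdown
Some *emphasized* text, followed by a list:

- first item
- second item
\stopmarkdown

\stoptext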

The above process works fine for fast programs like pandoc, but for slow external programs, re-running the external program on each compilation slows things down considerably. The filter module allows you to cache results by passing the continue=yes option:

\defineexternalfilter
    [markdown]
    [
      filter={pandoc -t context},
      continue=yes,
    ]

Now, the contents of each markdown start-stop environment are written to a \jobname-temp-markdown-<n>.tmp file, where <n> is a count of the number of markdown environments so far. These files are processed using pandoc -t context, and the results are written to \jobname-temp-markdown-<n>.tex files, which are finally read back into ConTeXt.

Now, to cache the results, all we need to do is check whether the \jobname-temp-markdown-<n>.tmp file has changed since the previous compilation. If it has, re-process the file using pandoc -t context; otherwise, simply reuse the result of the previous compilation.
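
The idea behind such a check can be sketched in a few lines of Lua. The following is only a conceptual illustration, not the module's actual code: the function checksumchanged and the oldsums table are made up for this example, and the pandoc command line is hypothetical. It relies on the md5 library that LuaTeX provides.

\startluacode
-- Conceptual sketch: decide whether the external program needs to be re-run.
-- "oldsums" stands in for whatever store holds the checksums from the
-- previous run; the real implementation keeps them in the .tuc file (see below).
local oldsums = { }

local function checksumchanged(filename)
    local f = io.open(filename, "rb")
    if not f then
        return true -- no file yet, so it certainly needs processing
    end
    local newsum = md5.sumhexa(f:read("*all"))
    f:close()
    local changed = (oldsums[filename] ~= newsum)
    oldsums[filename] = newsum
    return changed
end

if checksumchanged("myfile-temp-markdown-1.tmp") then
    os.execute("pandoc -t context -o myfile-temp-markdown-1.tex myfile-temp-markdown-1.tmp")
end
\stopluacode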

One low-cost method to check whether the contents of a file have changed is to store an md5 sum of the contents and check whether the md5 sum of the new contents differs from it. The ConTeXt wrapper script mtxrun provides this feature. If you call

mtxrun --ifchanged=<filename> --direct <program>

then mtxrun calculates the md5 sum of <filename>, stores it in <filename>.md5, and runs <program> only if the md5 sum has changed.
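
For example, for the first markdown environment of a file named myfile.tex, such a call could look something like the line below. Treat it as an illustration; the exact command line that the module builds may differ.

mtxrun --ifchanged=myfile-temp-markdown-1.tmp --direct pandoc -t context -o myfile-temp-markdown-1.tex myfile-temp-markdown-1.tmp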

In the older implementation of the filter module, I used mtxrun with appropriate options to cache the results. (This implementation is still in use in MkII.) However, this method requires a call to an external program, mtxrun, for each filter environment. Such a call to calculate the md5 sum is faster than a call to the actual external program (like pandoc), but it still takes a non-negligible amount of time. I had some documents with around 70-80 code snippets that use the vim module, which in turn uses the filter module, and those 70-80 calls to mtxrun took a considerable amount of time.

In the new implementation, the md5 sum is computed in Lua and stored in the .tuc file. No calls to external programs are required, so the processing overhead is minimal. In fact, ConTeXt provides a Lua function, job.files.run, that takes care of computing the md5 sum and storing it in the tuc file. So, instead of calling mtxrun, all I have to do is use:

\ctxlua{job.files.run("<filename>", "<program>")}

The job.files.run function stores the md5 sum in the tuc file and runs <program> only if the md5 sum has changed. With this implementation, there is very little overhead even for multiple md5 sum calculations.
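
Concretely, for the hypothetical file names used in the sketch above, such a call might look like this (again only an illustration; the macro-level plumbing inside the module is more involved):

\ctxlua{job.files.run("myfile-temp-markdown-1.tmp",
                      "pandoc -t context -o myfile-temp-markdown-1.tex myfile-temp-markdown-1.tmp")}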

With this change, I think that the filter module is feature complete. From now on, I’ll only be making bug fixes to the filter module and will concentrate on adding features to my other modules: vim (which is more or less stable now), mathsets (which needs to be rewritten for MkIV), and simpleslides (which needs cleanup to keep up with MkIV).
