HTML export

The question of translating TeX to (X)HTML arises frequently. Almost everyone wants it. After all, on the web, (X)HTML is the de facto standard markup; PDF, for all its hyperlink abilities, is clumsier to use. On the other hand, for print, PDF (especially TeX-generated PDF) is the de facto standard, at least in math-heavy fields; (X)HTML, for all its print CSS abilities, is clumsier to use. Often, you want both an (X)HTML version and a PDF version of a document. With the popularity of e-ink devices, EPUB (which is essentially a zipped collection of (X)HTML files) is also becoming popular. Generating these multiple output formats from the same source is tricky.


The easiest solution, of course, is to use an ASCII markup language like Markdown, reStructuredText, AsciiDoc, etc., whose tools generate (X)HTML and TeX output from the same source. For simple tasks these are great: the syntax is fairly intuitive and the various translation tools are fairly robust. But these simple markups lack an important ability of TeX: programmable macros. To give an example, suppose I need to write n-tuples $(a_1, \dots, a_n)$ fairly often in a document. When using TeX, I am in the habit of defining a macro

\def\TUPLE#1{(#1_1,\dots,#1_n)}

and then using $\TUPLE a$. This saves typing and prevents typos. ASCII markup languages lack this ability.
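
As a quick check of how such a macro behaves, here is a minimal plain TeX sketch. The file name tuple.tex and the \beta example are assumptions for illustration only.

```tex
% tuple.tex: a plain TeX sketch, run with e.g. `tex tuple.tex`
\def\TUPLE#1{(#1_1,\dots,#1_n)}

% An undelimited argument is a single token or a braced group.
$\TUPLE a$       % expands to (a_1,\dots,a_n)
$\TUPLE\beta$    % expands to (\beta_1,\dots,\beta_n)
$\TUPLE{x'}$     % braces are needed when the argument is more than one token
\bye
```

Changing the delimiters or the separator for every tuple in the document then means editing one line.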

A more robust solution is to use XML as the input language and then use an XML toolchain to translate the input to both HTML and TeX. (There are various TeX-based solutions for parsing XML directly, both in LaTeX and ConTeXt.) However, typing XML is a pain, especially when it comes to typing MathML.

A solution that meets these needs midway is to input text in (X)HTML and math in TeX format, then use MathJax to render the HTML and one of the TeX XML-processing packages to convert to PDF. I think that until browsers become mature in displaying MathML, MathJax is an excellent ad hoc solution. In addition, it does parse simple TeX macros. However, you do lose the ability to write complicated TeX macros.

Of course, if you want to parse TeX macros, the only real solution is to use TeX to parse TeX. Why? Because TeX has the amazing ability to change its grammar on the fly. For example, in ConTeXt:

\newcatcodetable \weirdcatcodes
\startcatcodetable \weirdcatcodes
    \catcode`\@ = 0
    \catcode`\( = 1
    \catcode`\) = 2
\stopcatcodetable
\setcatcodetable \weirdcatcodes

After this, @ takes the usual meaning of \, ( of {, and ) of }. So, to start a section, you have to use

@section (A section)
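
The same effect can be tried in plain TeX with bare \catcode assignments instead of a ConTeXt catcode table. This is only a sketch: the macro name hello is made up, and the group is there so that the changes stay local.

```tex
% A plain TeX sketch; run with e.g. `tex` on this file.
\begingroup
  \catcode`\@=0  % @ becomes an escape character, like \
  \catcode`\(=1  % ( begins a group, like {
  \catcode`\)=2  % ) ends a group, like }
  @def@hello(Hello, world!)
  @hello
@endgroup
\bye
```

Inside the group, @def@hello(...) is read exactly as \def\hello{...} would be, because TeX matches group delimiters by category code, not by character. After @endgroup the usual catcodes are restored.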

This ability to change catcodes is best illustrated by David Carlisle’s xii.tex:

\let~\catcode~`76~`A13~`F1~`j00~`P2jdefA71F~`7113jdefPALLF
PA''FwPA;;FPAZZFLaLPA//71F71iPAHHFLPAzzFenPASSFthP;A$$FevP
A@@FfPARR717273F737271P;ADDFRgniPAWW71FPATTFvePA**FstRsamP
AGGFRruoPAqq71.72.F717271PAYY7172F727171PA??Fi*LmPA&&71jfi
Fjfi71PAVVFjbigskipRPWGAUU71727374 75,76Fjpar71727375Djifx
:76jelse&U76jfiPLAKK7172F71l7271PAXX71FVLnOSeL71SLRyadR@oL
RrhC?yLRurtKFeLPFovPgaTLtReRomL;PABB71 72,73:Fjif.73.jelse
B73:jfiXF71PU71 72,73:PWs;AMM71F71diPAJJFRdriPAQQFRsreLPAI
I71Fo71dPA!!FRgiePBt'el@ lTLqdrYmu.Q.,Ke;vz vzLqpip.Q.,tz;
;Lql.IrsZ.eap,qn.i. i.eLlMaesLdRcna,;!;h htLqm.MRasZ.ilk,%
s$;z zLqs'.ansZ.Ymi,/sx ;LYegseZRyal,@i;@ TLRlogdLrDsW,@;G
LcYlaDLbJsW,SWXJW ree @rzchLhzsW,;WERcesInW qt.'oL.Rtrul;e
doTsW,Wk;Rri@stW aHAHHFndZPpqar.tridgeLinZpe.LtYer.W,:jbye

To parse such TeX, you need TeX.

Now, from TeX’s point of view, (X)HTML is no different from any other backend like DVI or PDF. TeX just needs to write the output in a specific format to a file. LuaTeX makes this easy, and ConTeXt MkIV now supports an XHTML backend. Simply add

\setupbackend[export=yes, xhtml=yes]

to your preamble. This creates a \jobname.xhtml file (and a \jobname.export XML file). For example,

\setupbackend[export=yes, xhtml=yes]

\starttext
\input ward
\stoptext

gives

<?xml version='1.0' encoding='UTF-8' standalone='yes' ?>

<!-- input filename   : xhtml             -->
<!-- processing date  : Sun May 29 22:02:02 2011 -->
<!-- context version  : 2011.05.18 22:26  -->
<!-- exporter version : 0.20              -->

<document xmlns:m="http://www.w3.org/1998/Math/MathML" version="0.20" language="en" date="Sun May 29 22:02:02 2011" file="xhtml" context="2011.05.18 22:26" xmlns:xhtml="http://www.w3.org/1999/xhtml">
The Earth, as a habitat for animal life, is in old age and has a fatal illness. Several, in fact. It would be happening whether humans had ever evolved or not. But our presence is like the effect of an old-age patient who smokes many packs of cigarettes per day --- and we humans are the cigarettes. 
</document>

This is not really an XHTML file; it is just an XML file. With a proper stylesheet, most browsers will be able to display it. Creating such a stylesheet is easy. A simple example of such a stylesheet is here. To use this stylesheet, save it as, say, mkiv-export.css, then add

\setupbackend[export=yes, xhtml=yes, css=mkiv-export.css]

This adds the following line to the generated XHTML output

<?xml-stylesheet type="text/css" href="mkiv-export.css"?>

The XHTML file exports the structure of the document, not the style. Consider the following, more complicated example:

\setupbackend[export=yes, xhtml=yes, css=mkiv-export.css]

\definestartstop[important][style=italic]

\starttext
\section {The first attempt}

The XHTML export with \important{structured text}, but not manual {\em style}
\italic{commands}. But the good thing is that it works with math!

Consider a quadratic equation $ax^2 + bx + c = 0$. The roots of this equation are
\startformula
  x = (-b ± \sqrt{b^2 - 4ac})/2a
\stopformula
\stoptext

This gives

<?xml version='1.0' encoding='UTF-8' standalone='yes' ?>

<!-- input filename   : xhtml             -->
<!-- processing date  : Sun May 29 22:08:38 2011 -->
<!-- context version  : 2011.05.18 22:26  -->
<!-- exporter version : 0.20              -->

<?xml-stylesheet type="text/css" href="mkiv-export.css"?>

<document xmlns:m="http://www.w3.org/1998/Math/MathML" version="0.20" language="en" date="Sun May 29 22:08:38 2011" file="xhtml" context="2011.05.18 22:26" xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <xhtml:a name="aut_1"><section location="aut:1" detail="section">
    <sectionnumber>1</sectionnumber>  
    <sectiontitle>The first attempt</sectiontitle>  
    <sectioncontent>
The XHTML export with <construct detail="important">structured text</construct>, but not manual style commands. But the good thing is that it works with math!

      <break/>
 Consider a quadratic equation 
      <m:math>
        <m:mrow>
          <m:mi>𝑎</m:mi>
          <m:msup>
            <m:mi>𝑥</m:mi>
            <m:mn>2</m:mn>
          </m:msup>
          <m:mo>+</m:mo>
          <m:mi>𝑏</m:mi>
          <m:mi>𝑥</m:mi>
          <m:mo>+</m:mo>
          <m:mi>𝑐</m:mi>
          <m:mo>=</m:mo>
          <m:mn>0</m:mn>
        </m:mrow>
      </m:math>
. The roots of this equation are

      <formula>
 
        <formulacontent>
          <m:math>
            <m:mrow>
              <m:mi>𝑥</m:mi>
              <m:mo>=</m:mo>
              <m:mo>(</m:mo>
              <m:mo>−</m:mo>
              <m:mi>𝑏</m:mi>
              <m:mo>±</m:mo>
              <m:mo>󿁰 </m:mo>
              <m:mroot>
                <m:mrow>
                  <m:msup>
                    <m:mi>𝑏</m:mi>
                    <m:mn>2</m:mn>
                  </m:msup>
                  <m:mo>−</m:mo>
                  <m:mn>4</m:mn>
                  <m:mi>𝑎</m:mi>
                  <m:mi>𝑐</m:mi>
                </m:mrow>
              </m:mroot>
              <m:mo>)</m:mo>
              <m:mi>/</m:mi>
              <m:mn>2</m:mn>
              <m:mi>𝑎</m:mi>
            </m:mrow>
          </m:math>
        </formulacontent>
      </formula>
    </sectioncontent>
  </section></xhtml:a>
</document>

Notice some features of the output: the section number is exported with the section title (if you change the conversion of the section number, the output will honor that); the structure command \important is exported, but the style commands \em and \italic are not; math is exported as MathML with Unicode symbols! There are some interesting features that we are experimenting with. I’ll post more about the MathML export in the future.
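
To try the remark about section number conversion, something along these lines should do. It is a sketch that assumes the conversion key of \setuphead with the Romannumerals conversion; I would expect the exported sectionnumber element to then contain I instead of 1, but I have not checked this against a particular ConTeXt version.

```tex
\setupbackend[export=yes, xhtml=yes]

% Assumption: number sections with uppercase Roman numerals
\setuphead[section][conversion=Romannumerals]

\starttext
\section{The first attempt}
Some text.
\stoptext
```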

So, if you are interested in XHTML output for TeX sources, play around with the new export feature. It is not perfect … yet. But with MathML and SVG (remember, MetaPost has an SVG backend), it is possible to get fully working XHTML output for TeX input, generated by TeX itself rather than by a pre- or post-processor. After all, only TeX can parse TeX:

\setupbackend[export=yes, xhtml=yes, css=mkiv-export.css]

\starttext
\let\bye=\donothing %xii.tex ends with \bye
\input xii
\stoptext

(The output is here)


10 thoughts on “HTML export”

  1. Loved this post – I never thought to generate XHTML from TeX. I always figured you’d want to translate DVI to HTML or something.

    In any case, one remark intrigued me:

    Because TeX has the amazing ability to change its grammar on the fly.

    I have not seen any writing talking about this aspect of TeX, but it was something Knuth clearly valued. Do you know of any references or articles that talk about this aspect of TeX?

    • I don’t know of any articles that discuss “why” Knuth decided to include category codes in TeX. There are a couple of places where they are useful in modern macro packages. For example, the verse environment in LaTeX (or the lines environment in ConTeXt) makes newlines significant. Category codes are also useful for parsing text; tikz uses some catcode trickery to parse input, and so do some of the LaTeX packages for displaying chemical formulas and the ConTeXt MkII implementation of XML parsing.

      But I don’t know why Knuth went through the trouble of implementing category codes in TeX.

  2. Your blog is one of the best TeX-related blogs on WordPress. How can I tell? Easy: you are producing theft-worthy code. Keep up the good work.

  3. Great stuff! Do you know if there is anything being done/planned like this for LuaTeX with LaTeX type markup and not the ConTeXt that you show here please?

        • Technically, it’s not a project to convert LaTeX to *HTML*, but to another markup that then produces the HTML (or XHTML+MathML+SVG). It would be easy to add HTML as an output format – I just haven’t done that yet as I figured that TeX4ht was a pretty good solution for going all the way from LaTeX to (X)HTML. I now have a working system (it produced the blog post you linked to, and I’m using it to produce nLab pages) available from . I should also say that it was your comments on one of my TeX-SX questions that got me over the initial difficulty in this.

  4. This is really neat. I’ve been looking for a way to do something like this–the flexibility of TeX, but with output in multiple formats. Where is the source code for the xhtml exporter? I’m wondering how easy it is to tweak.

    Two notes on my own efforts in this direction: First, pandoc does allow you to define your own simple LaTeX macros and applies them to math. This gives you the power of macros in every output format, but only for math — though of course you can’t do everything you can do in TeX. Pandoc has many options for displaying math in HTML, including mathjax and MathML.

    Second, I have a half-finished project, HeX (https://github.com/jgm/HeX), that reads a LaTeX-like language and produces LaTeX, HTML, and whatever else you like. You write macros in Haskell, and the macros can specify different results for different output formats (example: https://gist.github.com/1115661). So far it is pretty basic, but it does support MathML output in HTML. I still like the idea, but if something similar could be done using luatex, that might make more sense, I think.

    • The main source for HTML export is in back-exp.lua. Most of it is concerned with creating a DOM tree out of a flat TeX file. Keeping track of tags was already part of ConTeXt (needed for tagged PDF). The current mechanism is not too flexible. There were some discussions on the ConTeXt mailing list about a markdown export, but nothing really came out of it.

      Thank you for pointing out that pandoc processes macros. Somehow, I had missed that. I was not aware of HeX. It looks interesting. Thanks.

      You might also be interested in knowing that ConTeXt now uses your lpeg parser for markdown to natively process markdown. See m-markdown.lua. Currently, the parser is not as robust as pandoc. I will blog about that sometime in the future. The main advantage of the Lua-based parser is that there is no need for an external binary.
