Phil Archer

phil@philarcher.org

HTML5 or XHTML? Polyglot Documents Mean You Don't Have To Choose

Every Web developer is excited about HTML5 — and rightly so. It adds new features to the primary language of the Web and is designed with experience and practicality in mind.

However, it's not yet completed. As of today, browser manufacturers are all working hard on implementing it, but we're a long way from being able to assume full support for HTML5 across the board. The specification documents are in Last Call, meaning that the working group believes them to be complete, subject to receiving and dealing with comments submitted by the community. After that there are further critical stages to go before the press release goes out saying "HTML5 is a W3C Recommendation." (Don't hold me to it, but I'd say 2013 at the earliest.)

What is stable and fully implemented is XHTML (I hope you'll allow me to leave aside the peculiarities of Internet Explorer or we'll be here all night). So what should you use — the not quite stable but very exciting HTML5 or the older, stable XHTML?

You can pretty much do both at the same time.

I'm about to do it on my own site. Time me. It's .

— done!

In other words, it took me less than 3 minutes to change my site from being written exclusively in HTML5 to being written in both HTML5 and XML: what's called a polyglot document.

Now, OK, I may be being a bit unfair on the timing. I knew what I was about to do and everything was ready before I started. You may also have noticed that I said HTML5 and XML, not HTML5 and XHTML. But let's work through it.

As I noted on , I made a few changes to the markup on this site to change it from XHTML 1.0 Strict to HTML5. That's an easy transition to make since I was already working in the stricter markup language and have long been used to validating my pages. So every element was properly closed, ampersands were encoded, element names were written in lower case and so on. You don't have to do this in HTML5 but you can, and I do, as much out of habit as anything. The advice is that you should continue as you begin. In other words, for me now to stop closing tags and quoting attribute values, or to start using anything other than lower case element names, would be bad practice, so I haven't stopped doing any of the things I always did in XHTML 1.0 Strict.

Because of that, all I had to do to go from XHTML 1.0 Strict to HTML5 back in April was to change the page template so that the top lines went from:

<?xml version="1.0" encoding="windows-1252"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en-GB">
<head>
  <meta http-equiv="content-language" content="en-GB" />
  <meta http-equiv="Content-Type" content="text/html; charset=windows-1252" />

to:

<!doctype html>
<html lang="en-GB">
<head>
  <meta charset="windows-1252" />

And that was it. I then played around with some of the nice new HTML elements like aside and article but, just to emphasise the point, continued to make sure that elements were closed so that, for example, every <p> is matched by a </p>.

So what I wanted to do today was to make a few changes so that my Web pages could be parsed as either HTML5 or XML (remember the point of XHTML: it's HTML encoded in XML). My reference for all this is Polyglot Markup: HTML-Compatible XHTML Documents. As W3C documents go it's remarkably short. The first line of the abstract tells you what it's about:

A document that uses polyglot markup is a document that is a stream of bytes that parses into identical document trees (with the exception of the xmlns attribute on the root element) when processed as HTML and when processed as XML.

So, if an XML parser happens to wander into my bit of Web space, it'll be happy, as will an HTML parser. Importantly, they will derive an identical DOM. Remember that browsers don't render HTML directly. They create a DOM from the HTML and render that (it's the DOM you manipulate with JavaScript).

Incidentally, notice that the aim is to please both an HTML and an XML parser, not an XHTML parser. Polyglot documents are not valid XHTML.
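The "both parsers are happy" idea can be tried out with nothing more than Python's standard library. The sketch below is only a rough check under stated assumptions: the SAMPLE page is a hypothetical example, not the markup of this site, and html.parser is a tokenizer rather than a full HTML5 tree builder, so all it compares is the sequence of element names each parser reports, not a complete DOM.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample page for illustration; not the markup of the author's site.
SAMPLE = """<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-GB" xml:lang="en-GB">
<head>
  <meta charset="UTF-8" />
  <title>Polyglot test</title>
</head>
<body>
  <p>Hello, <em>world</em>.</p>
</body>
</html>"""

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def xml_outline(source):
    """Element names in document order, as reported by an XML parser."""
    root = ET.fromstring(source)
    return [element.tag.replace(XHTML_NS, "") for element in root.iter()]


class OutlineParser(HTMLParser):
    """Records the name of every start tag the HTML tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        self.names.append(tag)


def html_outline(source):
    """Element names in document order, as reported by an HTML tokenizer."""
    parser = OutlineParser()
    parser.feed(source)
    parser.close()
    return parser.names


# Both parsers should report the same elements in the same order.
print(xml_outline(SAMPLE) == html_outline(SAMPLE))
```

If the XML parser raises a ParseError, the page is not well-formed XML and so cannot be polyglot, whatever an HTML parser makes of it.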

All of which sounds terribly complicated, not to say arcane, but let me cut to the chase. The first steps I took today were:

  1. add in the XHTML namespace on the root element, i.e. the html element;
  2. add in an xml:lang attribute;
  3. make the word doctype uppercase.

So the top few lines now look like this:

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-GB" xml:lang="en-GB">
<head>

Notice that I've included the html, head and body elements. These are optional in HTML5 but required for polyglot documents.
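The steps above lend themselves to an automated check. What follows is a minimal sketch, not a validator: preamble_problems and local_name are hypothetical names, the doctype test is deliberately simplistic, and the rule that lang and xml:lang should carry the same value is an assumption based on my reading of the Polyglot Markup document.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on element names."""
    return tag.rsplit("}", 1)[-1]


def preamble_problems(source):
    """Return a list of problems found at the top of a page (empty if none)."""
    problems = []
    # Simplification: only the plain, upper-case form of the doctype is accepted.
    if not source.lstrip().startswith("<!DOCTYPE html>"):
        problems.append("doctype should be written <!DOCTYPE html>")
    try:
        root = ET.fromstring(source)
    except ET.ParseError as error:
        problems.append("not well-formed XML: %s" % error)
        return problems
    if root.tag != "{%s}html" % XHTML_NS:
        problems.append("root element is not html in the XHTML namespace")
    # Assumption: lang and xml:lang are expected to carry the same value.
    lang = root.get("lang")
    if lang is None or lang != root.get(XML_LANG):
        problems.append("lang and xml:lang should both be present and equal")
    # head and body must be written out explicitly.
    children = {local_name(child.tag) for child in root}
    for name in ("head", "body"):
        if name not in children:
            problems.append("missing explicit %s element" % name)
    return problems
```

Calling preamble_problems() on a page that starts like the three lines shown above, and that has explicit head and body elements, should return an empty list.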

There were a couple of other things to take care of though.

Notice that I don't use the preferred character encoding of UTF-8. This is simply because I use a Windows PC and am used to an HTML editor that doesn't support UTF-8. I could use a different editor of course but, well, I'm comfortable with the one I've used for years (CuteHTML). Looking at the relevant WHATWG FAQ, I notice that for polyglot documents UTF-8 is the only character encoding that can be declared using the <meta charset="…" /> element. That's because XML declares its character encoding in the XML declaration (<?xml version="1.0" encoding="UTF-8"?>); there is no meta element through which you can declare the charset for XML. To get round this I've finally done what I should have done ages ago and set the character encoding at server level, using a one line .htaccess file that simply says:

AddDefaultCharset windows-1252

I was able to use my HTTP Header viewer to confirm that this worked as expected. Doing this, however, does produce another warning in the W3C validator, which recommends that you include a document level character encoding. Well, I have a reason not to and I'm sticking with it. Let's move on!
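If you don't have a header viewer to hand, the same confirmation can be scripted. This is a small sketch using only the Python standard library; charset_from_content_type is a hypothetical helper name, and the header value passed to it is simply what I'd expect the server to send once the .htaccess line is in place.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type header, or None."""
    message = Message()
    message["Content-Type"] = header_value
    # get_content_charset() returns the value in lower case, or None if absent.
    return message.get_content_charset()


print(charset_from_content_type("text/html; charset=windows-1252"))
```

For a live page, urllib.request.urlopen(url).headers.get_content_charset() reads the same parameter straight from the response headers.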

Scripts and style definitions can be included within polyglot documents, but there are restrictions on the characters you can use and it's easy to forget those little details. So the advice is clear: define all your styles and scripts in external files.

On this very simple Web site I don't use any document.write() or document.writeln() in what little JavaScript there is but, if I had done so, I'd need to change it to write to the innerHTML property instead, since neither method works when a page is parsed as XML.

I do, however, include the Google Analytics code and that had been written within a script element. Not any more — it's now in an external file. This, incidentally, is good practice anyway, especially for mobile. The script is included in every page and so it's better to make it a separate file that can be cached rather than shipping the code with every page. There is no noscript element in polyglot documents by the way.

Finally I had 2 style definitions specific to the home page that were embedded at document level. That approach, document level definitions, seems right but, well, for the sake of copying the content into a little text file and replacing it with a link it hardly seems worth arguing with. What I didn't do though was to copy the styles into the primary stylesheet for the site since that would mean shipping those few bytes with the stylesheet even when they weren't required. Again, for mobile, every byte matters.

Tables

I've covered what I had to do to this site to make documents polyglot. As you can see, it wasn't much. But that's because this is a very simple site, hand coded with a bit of PHP templating. I don't have any need to use tables anywhere but if I did I'd have to make sure that all tr elements were wrapped in one of thead, tbody or tfoot. Likewise any col element(s) would need to be wrapped in a colgroup.
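The reason for the table rules is that an HTML parser quietly adds a tbody around stray rows and a colgroup around stray col elements, whereas an XML parser adds nothing, so the two trees would differ. Here is a minimal sketch of a check for both rules; table_problems is a hypothetical name, and it assumes the markup is well-formed and in the XHTML namespace.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.w3.org/1999/xhtml}"
ROW_PARENTS = {NS + "thead", NS + "tbody", NS + "tfoot"}


def table_problems(source):
    """List every tr or col element that lacks its required wrapper element."""
    root = ET.fromstring(source)
    problems = []
    for parent in root.iter():
        parent_name = parent.tag.replace(NS, "")
        for child in parent:
            if child.tag == NS + "tr" and parent.tag not in ROW_PARENTS:
                problems.append("tr directly inside " + parent_name)
            if child.tag == NS + "col" and parent.tag != NS + "colgroup":
                problems.append("col directly inside " + parent_name)
    return problems


# Hypothetical table written the polyglot way: explicit colgroup, thead and tbody.
GOOD_TABLE = """<table xmlns="http://www.w3.org/1999/xhtml">
  <colgroup><col /><col /></colgroup>
  <thead><tr><th>Name</th><th>Value</th></tr></thead>
  <tbody><tr><td>a</td><td>1</td></tr></tbody>
</table>"""

print(table_problems(GOOD_TABLE))
```

A table that puts tr or col straight inside table is perfectly good HTML5, but this check flags it because the XML tree would lack the wrappers the HTML parser adds.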

SVG and MathML

This very simple site does not include any SVG or MathML but if it did, I'd have to follow a few extra rules on those. Chapter and verse can be seen in the Polyglot Markup document.

Summary

Converting an XHTML document into a polyglot document is easy. By following a few relatively simple rules — some of which actively encourage good practice — your markup can be parsed as either XML or HTML5. Add in an HTML5 shiv (I use the one created by Remy Sharp) and you're good to go with a document that is very likely to work as you'd expect in just about any browser.