Project Nayuki


Practical guide to XHTML

Overview

HTML has been the primary language for web pages for decades now. Web browsers and other programs that consume HTML code have always handled malformed code in a lax way, and continue to do so – they try to silently fix errors to yield mostly reasonable behaviors. But this leniency comes at a cost of subtle edge cases, complicated rules, errors revealed by unrelated changes, nasty surprises, and little incentive to write quality code.

XHTML is a modified version of HTML that obeys XML syntax strictly. It retains all the good features of HTML, requires the rejection of documents with syntax errors, and eliminates unnecessarily complicated behaviors. I believe that XHTML is a useful tool in the real world as an alternative to the absolutely ubiquitous HTML. Practicing what I preach, this website (Project Nayuki) has been served as XHTML continuously since the year 2011, and is supported perfectly by all the major web browsers.

This page describes what XHTML is, why you should use it, and how to use it.

How to use XHTML

You can treat this section as a checklist of things to do to write good XHTML code or convert existing HTML code.

Feature, followed by its HTML behavior, then its XHTML behavior:
Media type

Must be “text/html”. Cannot use XHTML mode unless the code is polyglot.

Must be “application/xhtml+xml” or “application/xml”. Check this to be sure; it’s easy to accidentally continue serving a page as “text/html”, which silently disables error-checking and XML features.

Local filename extension

Must be “.html” or “.htm” (lazy). The web browser ascribes the content type “text/html” to the file.

Must be “.xhtml” or “.xht” (lazy) or “.xml”. The web browser ascribes the content type “application/xhtml+xml” to the file.

Character encoding

Several options:

  • HTML code in <head>: <meta charset="ISO-8859-1"> (HTML5)
  • HTML code in <head>: <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"> (legacy)
  • HTTP response header: Content-Type: text/html; charset=ISO-8859-1

There is no standard default character encoding. If not explicitly declared, web browsers can behave differently.

Several options:

  • All XML documents are treated as UTF-8 by default. This is the best encoding anyway.
  • XML code at beginning of file: <?xml version="1.0" encoding="ISO-8859-1"?>
  • HTTP response header: Content-Type: application/xhtml+xml; charset=ISO-8859-1

Note that meta charset is ignored, but can be included for polyglot code.

XML namespaces

Unnecessary:

  • <html>
  • <svg>

Mandatory values:

  • <html xmlns="http://www.w3.org/1999/xhtml">
  • <svg xmlns="http://www.w3.org/2000/svg">

Void elements

Either self-closing or no end tag:

  • <br/> (compatible with XHTML)
  • <br> (incompatible with XHTML)

(Ditto for link, img, input, etc.)

Either self-closing or end tag:

  • <br/> (compatible with HTML)
  • <br></br> (incompatible with HTML)

(Ditto for link, img, input, etc.)

Attribute values

Three choices:

  • <elem key="val">
  • <elem key='val'>
  • <elem key=val>

Two choices:

  • <elem key="val">
  • <elem key='val'>

Boolean attributes

Two choices:

  • <input type="checkbox" checked="checked">
  • <input type="checkbox" checked> (popular)

One choice:

  • <input type="checkbox" checked="checked">

Special characters

Often optional to escape, but safer to escape:

  • (4 < 9), <a href="P&G">
  • (4 &lt; 9), <a href="P&amp;G">

(However, special rules apply inside <style> and <script>.)

Always escape when outside of CDATA:

  • (4 &lt; 9), <a href="P&amp;G">

Character entity semicolon

Sometimes optional:

  • 7 &times; 3 (proper)
  • 7 &times 3 (lazy)

Mandatory:

  • 7 &times; 3

Named character entities

Numeric available for all characters, plus rich palette for popular ones:

  • &#xA0; &#xE9; &#x2122; (hexadecimal)
  • &#160; &#233; &#8482; (decimal)
  • &nbsp; &eacute; &trade; (named)

Numeric available for all characters, but rich palette only available if using XHTML 1.0 DTD:

  • &#xA0; &#xE9; &#x2122; (hexadecimal)
  • &#160; &#233; &#8482; (decimal)
  • &nbsp; &eacute; &trade; (not always available)

Element/attribute names

Case-insensitive:
<TABLE Class="a" iD="b"></tAbLe>

Always lowercase for features of (X)HTML:
<table class="a" id="b"></table>
(User-defined things can use uppercase.)
(<svg viewBox="..."> must be camel case.)

Style and script elements
  • Plain code (risky unescaped characters, incompatible with XHTML):

    <style>
      body { background: url(?a=b&c=d) }
    </style>
    <script>
      let z = false < true;
    </script>
    
  • Wrapped in an HTML comment (risky unescaped characters, incompatible with XHTML):

    <style>
      <!--
      body { background: url(?a=b&c=d) }
      -->
    </style>
    <script>
      <!--
      let z = false < true;
      // -->
    </script>
    
  • Rephrased to avoid HTML-special characters (safe, compatible with XHTML):

    <style>
      body { background: url("?a=b\000026c=d") }
    </style>
    <script>
      let z = true > false;
    </script>
    
  • Escaped code (safe, incompatible with HTML):

    <style>
      body { background: url(?a=b&amp;c=d) }
    </style>
    <script>
      let z = false &lt; true;
    </script>
    
  • Wrapped in CDATA (almost safe, incompatible with HTML):

    <style>
      <![CDATA[
      body { background: url(?a=b&c=d) }
      ]]>
    </style>
    <script>
      <![CDATA[
      let z = false < true;
      ]]>
    </script>
    
  • Wrapped in CDATA with inner comments (almost safe, compatible with HTML):

    <style>
      /*<![CDATA[*/
      body { background: url(?a=b&c=d) }
      /*]]>*/
    </style>
    <script>
      //<![CDATA[
      let z = false < true;
      //]]>
    </script>
    
CDATA sections

Feature unavailable. However, the text inside style and script elements behaves mostly like the XML CDATA feature.

All characters are allowed between the opening <![CDATA[ and the closing ]]>, except for the 3-char sequence ]]>.

Note that ]]> is forbidden outside of CDATA sections, such as in element text and attribute values. It should be escaped as ]]&gt;. It can also be escaped as &#x5D;&#x5D;&#x3E;. Or, to avoid character entities, it can be represented by splitting across two CDATA sections like <![CDATA[...]]]]><![CDATA[>...]]>.

Comment blocks

Can contain extra double hyphens:
<!------ example -- -- -->

Must not contain extra double hyphens:
<!--==== example - - -->

Element.innerHTML

Accepts arbitrary text and is parsed according to HTML rules and error correction:
document.querySelector("body").innerHTML = "<b><i>X & Y < Z</b></i>";

Must be a well-formed XML fragment:
document.querySelector("body").innerHTML = "<b><i>X &amp; Y &lt; Z</i></b>";

Element.tagName

Always uppercase:

<script>
let elem = document.createElement("img");
console.log(elem.tagName);  // "IMG"
</script>

Always lowercase:

<script>
let elem = document.createElement("img");
console.log(elem.tagName);  // "img"
</script>
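
Because tagName differs in case between the two modes, script code that compares tag names can break when a page switches from HTML to XHTML. Here is a minimal sketch of a mode-independent comparison. It assumes the standard DOM property Element.localName, which is lowercase for (X)HTML elements in both modes; the helper names isElementNamed and tagNameLower are hypothetical, made up for illustration.

```javascript
// Hypothetical helper: compare an element's name in a way that works in
// both HTML mode and XHTML mode. Element.localName is lowercase for
// (X)HTML elements in either mode, whereas Element.tagName is uppercase
// in HTML mode only.
function isElementNamed(elem, name) {
    return elem.localName === name;
}

// Alternative when only tagName is at hand: normalize the case first.
function tagNameLower(elem) {
    return elem.tagName.toLowerCase();
}
```

In a browser, isElementNamed(document.createElement("img"), "img") should hold regardless of which mode the page is served in.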

Advantages of XHTML

XML syntax

HTML is a one-of-a-kind language, and having knowledge of its intricate rules is not transferrable to other languages. By contrast, XHTML is an application of XML, which means it follows all the syntax rules defined by XML. XML is used in other data formats like SVG, MathML, RSS, configuration files, and more. You only need to learn XML syntax once, and it covers many technologies.

Advanced web developers will need to learn XML at some point in their careers. Even if you invest your career in HTML5, you cannot avoid XML in the long run. Whereas if you choose to use XHTML, you could get away with not having to learn the quirks of HTML syntax.

XML tools

Because XHTML is an XML data format, you can use generic XML tools and libraries to generate, manipulate, and parse such data. XHTML is also amenable to XML technologies like XSLT and embedding XML documents within another. Meanwhile, HTML is a unique technology with its own tools and tag-soup parsers, applicable to nothing but HTML.

Simpler syntax

In HTML, bare ampersands and less-than-signs are allowed in many but not all places, e.g.: (0 <= i && i < n), <a href="example?abc=xyz&foo=bar">. In XHTML, ampersands and less-than-signs must be escaped (except in CDATA blocks): (0 &lt;= i &amp;&amp; i &lt; n), <a href="example?abc=xyz&amp;foo=bar">.
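
If markup is generated by a script, the escaping rule above can be applied mechanically. The following is a minimal sketch, not a complete serializer; the function names escapeText and escapeAttr are hypothetical. The output is safe in both HTML and XHTML.

```javascript
// Escape text so that it is safe as element content in both HTML and XHTML.
// The ampersand must be replaced first so that the entities produced by the
// later replacements are not escaped a second time.
function escapeText(s) {
    return s.replace(/&/g, "&amp;")
            .replace(/</g, "&lt;")
            .replace(/>/g, "&gt;");
}

// Escape text for use inside a double-quoted attribute value.
function escapeAttr(s) {
    return escapeText(s).replace(/"/g, "&quot;");
}

// escapeText("0 <= i && i < n")          gives "0 &lt;= i &amp;&amp; i &lt; n"
// escapeAttr("example?abc=xyz&foo=bar")  gives "example?abc=xyz&amp;foo=bar"
```

Escaping the greater-than sign is not always required, but doing so also takes care of the ]]> sequence, which XML forbids in ordinary text.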

In HTML, element and attribute names are case-insensitive: <HTML LaNg="en"><body></BODY></hTmL>. In XHTML, the predefined names are all in lowercase: <html lang="en"><body></body></html>.

In HTML, element attribute values have 3 syntax choices: <element aaa=NoQuotes bbb='single quotes' ccc="double quotes">. In XHTML, only single quotes and double quotes are allowed.

In HTML, Boolean attributes can be written minimally like <button disabled>. In XHTML, all attributes must have values, and the conventional value of a Boolean attribute is the name itself, like <button disabled="disabled">.

No optional start tags (implicit elements)

Coming from the days when HTML was defined as an SGML application with a rich DTD, HTML exhibits a number of surprising implicit behaviors. Some elements are implicitly inserted even when you don’t write their start tags. For example, this HTML code for a table:

<table>
    <tr>
        <td>Alpha</td>
        <td>Beta</td>
    </tr>
</table>

is interpreted as the following DOM tree in memory (which affects scripts and styling):

<table>
    <tbody>
        <tr>
            <td>Alpha</td>
            <td>Beta</td>
        </tr>
    </tbody>
</table>

As a more extreme example, this entire HTML document:

asdf

gets implicitly wrapped in a bunch of elements (and this behavior is standardized across all HTML5 browsers):

<html>
    <head></head>
    <body>asdf</body>
</html>

However, in XHTML, elements are never implicitly added to the DOM tree – what you see is what you get; the written code matches the machine’s interpretation.

No optional end tags (forbidden nesting)

HTML simultaneously defines some elements as having optional closing tags and disallows some combinations of nested elements. For example, a paragraph will terminate another – this code:

<p>The quick brown fox
<p>jumps over the lazy dog

is interpreted as:

<p>The quick brown fox</p>
<p>jumps over the lazy dog</p>

Similarly, <li> and <td> will close the previous one. But HTML’s rules make it impossible to nest a <div> inside of <p>, because this:

<p><div>Hello</div></p>

is actually interpreted as:

<p></p>
<div>Hello</div>
<p></p>

which is a gross mangling of the coder’s intent. It is still possible to put <div> into <p> via XHTML or JavaScript.

No special error correction

In HTML, these two examples of unclosed tags:

<p><span>One</p>
<p>Two</p>
<p>Three</p>

<p><em>Four</p>
<p>Five</p>
<p>Six</p>

get interpreted differently like this:

<p><span>One</span></p>
<p>Two</p>
<p>Three</p>

<p><em>Four</em></p>
<p><em>Five</em></p>
<p><em>Six</em></p>

This means that if you forget to close some types of tags, they could keep replicating until the end of the document. Both examples above are syntax errors in XHTML and will not be corrected implicitly.

No special void elements

Some elements in HTML are defined as void/empty, which means they never have an end tag and cannot contain any child elements or text. The misuse of void elements causes divergent behavior. For example, <br>test</br> results in the DOM tree <br/>test<br/> (the </br> becomes a standalone <br>); but <img src="jkl">0123</img> yields <img src="jkl"/>0123 (the </img> is deleted).

Here’s another example where structurally similar pieces of HTML code are interpreted differently, because <span> is not a void element but <track> is a void element:

<div>
	<span>
	<span>
</div>

<video>
	<track>
	<track>
</video>

becomes the DOM tree:

<div>
    <span>
        <span>
        </span>
    </span>
</div>

<video>
    <track/>
    <track/>
</video>

XHTML has no special treatment for void elements, or any element for that matter. Writing <br/> or <br></br> will always behave the same. You can also self-close things like <div/>, which is not allowed in HTML (must be written as <div></div>).

No special text for styles and scripts

In HTML, the text inside of <style> and <script> elements is treated specially in a number of ways. For compatibility with ancient browsers, wrapping the text in an HTML comment does not affect the interpretation:

<style>
    <!--
    html { font-family: sans-serif; }
    -->
</style>

<script>
    <!--
    alert("Bonjour");
    // -->
</script>

Furthermore, tag-like pieces of text (other than the end tag) are treated as raw text, not child elements:

<script>
    elem.innerHTML = "<i>Note</i>";
    var b = 0;
    console.log(1<b>2);
</script>

XHTML has no such special treatment for <style> and <script> elements. Characters need to be escaped properly. Commenting out the text will disable it. Having child elements is allowed but probably never what you want. The best practice is to use CDATA:

<script>
    <![CDATA[
    while (i < n && foo()) { ... }
    ]]>
</script>
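
When script or style text is inserted by a generator rather than typed by hand, the one sequence that CDATA cannot hold (]]>) has to be handled. Here is a minimal sketch of the splitting trick, where the section is closed after ]] and a new one is opened before >. The function name wrapInCdata is hypothetical, and the result is meant for XHTML mode only (it is not polyglot).

```javascript
// Wrap arbitrary text in CDATA for use in an XHTML document.
function wrapInCdata(code) {
    // "]]>" cannot appear inside a CDATA section, so end the current
    // section after "]]" and start a new one before ">".
    const safe = code.split("]]>").join("]]]]><![CDATA[>");
    return "<![CDATA[" + safe + "]]>";
}

// wrapInCdata("a]]>b")  gives  "<![CDATA[a]]]]><![CDATA[>b]]>"
```

An XML parser concatenates adjacent CDATA sections, so the script or stylesheet still sees the original text.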
Bounded errors

Once upon a time, I saw a page where someone listed all the articles of their blog. They used a combination of div and span elements, not ul and li, so no implicit end tags could be inserted. Anyway, for each blog item, they forgot to close a <div> element, so the nesting of DOM nodes kept getting deeper and deeper. At some point in the document, after reaching a depth of perhaps a thousand nodes, my browser gave up and just ignored all tags from that point on, only keeping the text content (i.e. anything not surrounded by angle brackets). The result was that the bottom of the rendered page was a mass of text with no line breaks or formatting. This problem could have been caught much earlier and more easily had the author used XHTML.

Ubiquitous support

XHTML has been fully supported for over a decade by all the major web browsers, including Google Chrome, Mozilla Firefox, Apple Safari, Microsoft Edge, and Microsoft Internet Explorer 9+. You can’t use a lack of compatibility as an excuse to avoid considering XHTML as a technology.

Debugging

You can use XHTML mode as a tool to check HTML code quality without unleashing it in the real world. This can help you detect non-obvious unclosed elements, garbage syntax that got silently skipped/corrected, and risky characters that should have been escaped (primarily < and &). You can write polyglot code so that it passes XHTML syntax checks but still yields the same document content when parsed in HTML mode.
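
For a quick check before trying a page in XHTML mode, one kind of risky character can be found with a plain text scan. This is a rough heuristic sketch, not a parser: it only looks for ampersands that do not start a complete character reference, and it will also flag ampersands inside CDATA sections and comments, where XML allows them. The names BARE_AMPERSAND and findBareAmpersands are hypothetical.

```javascript
// Matches "&" that does not begin a complete character reference such as
// "&amp;", "&#160;", or "&#xA0;" (name or number, then a semicolon).
const BARE_AMPERSAND = /&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)/g;

// Return the string index of every suspicious ampersand in the markup.
function findBareAmpersands(markup) {
    const positions = [];
    for (const match of markup.matchAll(BARE_AMPERSAND)) {
        positions.push(match.index);
    }
    return positions;
}
```

An empty result does not prove that a page is well-formed; serving the page as application/xhtml+xml and loading it in a browser remains the real test.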

Disadvantages of XHTML

Thoroughly unpopular

After the W3C’s vision of XHTML failed to replace HTML on the web, the conversation around XHTML faded out. Hardly any articles/tutorials/etc. mention XHTML anymore, and those that do are often outdated (like from year 2005). Few people use XHTML technology in practice, which means few can teach how to use it and troubleshoot problems. Hence, XHTML is caught in a vicious cycle where the lack of adoption is self-perpetuating.

Unfamiliar strictness

For the longest time, the basic triad of web technologies – HTML, CSS, JavaScript – has been extremely forgiving toward syntactic and semantic errors as compared to more traditional machine languages (XML, Java, Python, etc.). Most commonly, erroneous elements in HTML/CSS/JS are either skipped (e.g. unknown tags) or fixed (e.g. forgot to close a tag). The web development community has propagated this mindset of non-strict syntax through actual code in the real world, the tone and content of tutorial materials, and the lack of talk about detecting and fixing errors. So, XHTML’s very strict syntactic requirements fly in the face of this culture of tolerance and might require some getting used to.

To be more specific, browsers exhibit a few behaviors when encountering an XML syntax error. Mozilla Firefox is the strictest, giving a “yellow screen of death” that shows nothing but the piece of code in error and its character position in the file/stream. Google Chrome is somewhat more forgiving, rendering the contents of the document up until the error (i.e. a prefix), along with a syntax error message like Firefox. In either case, the error is hard to ignore, and the author must fix it to make the page fully functional. In contrast to this draconian erroring, parsing a page in HTML mode can never experience a fatal error, but the resulting interpretation of the content could be anywhere from subtly to grossly wrong.

No document.write()

Back in the 2000s, JavaScript code often used document.write() to add HTML elements to a page (for actual content) or dump some text for debugging. This is no longer possible in XHTML because the JavaScript engine is no longer allowed to inject text into the XML document parser. The advantage is that it decouples XML from JavaScript: A document parser doesn’t require a JavaScript engine, and parsing can always finish in a finite amount of time (whereas it is undecidable whether arbitrary JS code terminates). The workaround for document.write() is to instead manipulate a document through the DOM API (available for both HTML and XHTML).
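
As an illustration of the DOM workaround, here is a minimal sketch that appends a paragraph of text without involving the parser at all. The function name appendParagraph and its parameters are hypothetical; it uses only the standard createElement(), textContent, and appendChild() members.

```javascript
// Append a paragraph of plain text to a parent node using DOM methods.
// `doc` is the Document object (the global `document` in a browser) and
// `parent` is the node that receives the new paragraph.
function appendParagraph(doc, parent, text) {
    const p = doc.createElement("p");
    p.textContent = text;  // Stored as text; no markup parsing takes place
    parent.appendChild(p);
    return p;
}
```

In a browser, a call might look like appendParagraph(document, document.querySelector("body"), "X & Y < Z"). The text needs no escaping because it never passes through the HTML or XML parser.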

Hard to use in existing frameworks

Writing XHTML code from scratch by hand is not hard at all. Retrofitting an existing software application – say a web forum – to XHTML is a significant undertaking. Converting a large platform like WordPress, along with its marketplace of third-party content like plug-ins and themes, is essentially impossible.

I have suspicions that web frameworks like Django, Ruby on Rails, etc. come with many components that generate or manipulate HTML content, but you end up reimplementing things from scratch if you want to do XHTML instead. Similarly, I believe there exist many templating engines and mini-languages out there that cater to HTML but work poorly for XHTML.

Some third-party problems

I personally encountered these issues at some point in time, and they might persist to the present day:

  • If a web page is in XHTML mode and uses Google AdSense’s JavaScript library, it fails to display advertisements.

  • If a web page is in XHTML mode and uses Stripe Checkout’s JavaScript library, the functionality fails.

  • If you go on LinkedIn and post a link to a web page that is served in XHTML mode, then LinkedIn fails to render a preview (text and image) for that link.

  • If the Internet Archive Wayback Machine saved a web page that was served in XHTML mode around year 2018, then it injects some elements and scripts into the page in a syntactically erroneous way, such that the user sees a completely broken archived page when trying to view it.

Notes

Continuous validation

In the early 2000s, it was popular to run your HTML code through the W3C Validator service to check the overall syntax and element attributes, allowing you to fix errors that your browser didn’t tell you about. Presumably this practice helped to prepare a transition to XHTML, but the transition never came, and people gradually stopped doing or discussing code validation. Thankfully, if you serve your web pages in XHTML mode, then the strict XML parsing serves as a basic layer of mandatory validation that ensures your code is at least syntactically well-formed.

Polyglot code

It’s possible to write markup code that heavily relies on XHTML/XML features and breaks subtly or badly when parsed in HTML mode. This has worked in all modern web browsers for many years now, and 90+% of users will see everything perfectly. But coding this way can shut out very old browsers as well as headless tools like bots/spiders/crawlers/analyzers/archivers that might be unaware that there exist fringes of the web that are not tag-soup HTML. Also, some libraries or services that you (the developer/designer) may choose to use might be broken for XHTML mode, thus forcing you to use HTML. For these reasons, it’s a good idea to write polyglot code that works correctly even when served as the text/html media type and parsed as HTML, so that you can always revert to that mode as a last-ditch solution.
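
To make the idea concrete, here is a minimal sketch of a page generator whose output should parse the same way in both modes. It combines items from the checklist earlier on this page: the xmlns attribute, a self-closed meta charset declaring UTF-8 (the XML default), and the <!DOCTYPE html> line. The function name polyglotPage is hypothetical, and bodyMarkup is assumed to already be well-formed polyglot markup.

```javascript
// Hypothetical helper that builds a small polyglot page as a string.
// `title` is plain text and gets escaped here; `bodyMarkup` is assumed to
// be well-formed markup already and is inserted as-is.
function polyglotPage(title, bodyMarkup) {
    const safeTitle = title.replace(/&/g, "&amp;")
                           .replace(/</g, "&lt;")
                           .replace(/>/g, "&gt;");
    return [
        '<!DOCTYPE html>',
        '<html xmlns="http://www.w3.org/1999/xhtml">',
        '<head>',
        '<meta charset="UTF-8"/>',
        '<title>' + safeTitle + '</title>',
        '</head>',
        '<body>',
        bodyMarkup,
        '</body>',
        '</html>',
    ].join("\n");
}
```

The same output can then be served as either application/xhtml+xml or text/html, which keeps the fallback option open.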

Non-polyglot code

If you are happy with HTML5 and the HTML syntax, then don’t bother writing polyglot code. I see many web pages with code like <link ... />, which I assume was mindlessly copied from tutorials that mindlessly copied best-practice recommendations from many years ago. I interpret this as a form of cargo-culting because these developers probably haven’t heard of XHTML before, and likely have no intention to switch the content type to application/xhtml+xml.

Document type declarations

HTML5 documents in HTML mode must have the DOCTYPE of <!DOCTYPE html>. Older versions such as HTML 4.01 had longer and more intricate DOCTYPEs, whose functionality interacted with full-blown SGML parsers.

HTML5 documents in XHTML mode ignore the DOCTYPE because that’s the nature of XML parsing. Declaring <!DOCTYPE html> is fine for polyglot purposes. The old XHTML 1.0 and related versions had a bunch of DOCTYPEs available, but they are no longer relevant.

HTML round-trip alteration

There are many DOM trees that, when serialized as HTML code and reparsed in HTML mode, cannot recreate the original DOM tree. For example, the tree <p><div></div></p> can be serialized (by reading Element.innerHTML) into the code “<p><div></div></p>”, which can be parsed (by writing Element.innerHTML) into the tree <p/><div/><p/>. This shouldn’t be too surprising because the HTML parser forbids certain kinds of nesting such as div in p.

This train of thought goes further, though. There exists at least one piece of HTML code C0 such that if you parse C0 into the DOM tree T0, then serialize T0 into the HTML code C1, then parse C1 into the DOM tree T1, the trees T0 and T1 are different. Because of this fact, trying to “sanitize” HTML code by running it through some cycle(s) of parsing and serialization might not catch the fragments of elements that you wish to disallow. On the other hand, XHTML (and XML in general) is round-trip-safe and has consistent and reasonable rules for parsing and serialization, so it is easy to devise a correct algorithm to sanitize such code.

SVG code in HTML

Although HTML treats self-closing tags as just start tags, the distinction is important when embedding an XML format like SVG or MathML into HTML. For example, this code:

<html><body>
    <svg>
        <rect>
        <rect>
    </svg>
    <svg>
        <circle/>
        <circle/>
    </svg>
</body></html>

is interpreted as:

<html><body>
    <svg>
        <rect>
            <rect>
            </rect>
        </rect>
    </svg>
    <svg>
        <circle></circle>
        <circle></circle>
    </svg>
</body></html>

In contrast, embedding SVG in XHTML just requires setting the xmlns on the svg element, and then all the other syntax behaves the same because both formats are based on XML.

History

Although I didn’t experience much of this history personally and can’t offer a super-detailed story, there should be enough mentioned here to let you search relevant topics like SGML and read more deeply into them. The Wikipedia pages on these topics provide a tremendous amount of detail already.

In order to understand HTML, we must acknowledge its parent, SGML – the Standard Generalized Markup Language. SGML is what gave us the familiar <start> and </end> tags, as well as attributes, character entities, and comments. I presume that back in the day, SGML was used internally within organizations as a way to represent structured textual data as well as rich text documents.

HTML was designed as an application of SGML, which means defining a set of tags and attributes and semantics, and also implementing the common rules and features prescribed by SGML. HTML was supposed to be parsed with an SGML parser, which would support all sorts of features like document type definitions (DTDs), omitted tags, null end tags, and more. But it seems that web browsers throughout history never implemented SGML fully; instead, they had ad hoc and incompatible parsers that didn’t handle all types of correct HTML code or incorrect code with any consistency. The result was that in practice, HTML was never treated as a form of SGML, nor was it even a standard – it was just a hodgepodge of whatever features and bugs the major browser vendors supported at any point in time.

HTML debuted in the early 1990s and evolved quickly in its first couple of years. The community defined the major versions 2 (first public standard), 3, and 4, along with a few minor versions. These versions changed the set of tags and attributes (mostly adding to them in a backward-compatible way) while retaining the basic SGML syntax.

Within a few years of HTML’s release, the generic language known as XML was created by drastically simplifying SGML into a small set of features. New data formats started using XML as their basis, and the World Wide Web Consortium (W3C) decided that the future of HTML would also be based on XML syntax instead of SGML (or ad hoc parsing). The first version of HTML based on XML was XHTML 1.0, which was essentially HTML 4.01 with a handful of tiny syntactical tweaks but no change in elements/attributes/semantics. Some later versions of XHTML added features without much problem, but XHTML 2 was a proposal that radically reorganized existing features in an incompatible way, and to the best of my knowledge, no major software ever implemented it.

Although the W3C was hard at work proposing and revising the XHTML standard for about a decade, in the end the effort was largely wasted. Web browser vendors grew weary of the W3C’s lack of practical progress, and formed their own group (WHATWG) in order to advance HTML 4 into HTML5 in a backward-compatible way. Despite the colossal failure of the original XHTML standards from the W3C that drove into a dead end, miraculously the WHATWG quietly acknowledged the XML syntax in a small section of the HTML5 standard, and all browser vendors actually implemented the necessary code so that XHTML documents can use all the features available in HTML5.

Incidentally, HTML5 changed the way that HTML (not XML/XHTML) code is parsed. The standard finally capitulated to these pragmatic facts: Almost all HTML code out in the wild is malformed (whether lightly or heavily), web browsers want to handle errors in a lenient way, and browser makers have no desire to implement the full set of SGML features. To those ends, HTML5 is now defined as a unique snowflake language not based on SGML, it doesn’t support any SGML features that weren’t explicitly included, and error handling is standardized so that all browsers interpret malformed code in the same way (unlike the free-for-all in the past).

More info