From IEEE Internet Computing
Standards: Is HTML in a Race to the Bottom? A Large-Scale Survey and Analysis of Conformance to W3C Standards
The World Wide Web Consortium (W3C) promulgates the HTML standards used on the Web, but it has no authority to enforce the adoption of one standard in favor of another. In this environment, developers have some incentive to ignore up-to-date W3C standards given that the transitional versions of HTML 4.01 and XHTML 1.0 offer most of the capabilities of the newer ones but are less stringent in their requirements. If most Web sites migrate to these "transitional" standards and remain there, future versions might be mere academic exercises for the W3C.

The "race to the bottom" is a
familiar phenomenon that occurs when multiple standards compete for acceptance.
In this environment, the most lenient standard usually attracts the greatest
support (acceptance, usage, and so on), leading to a competition among
standards to be less stringent. This also tends to drive competing standards
toward the minimum possible level of quality. One key prerequisite for a race
to the bottom is an unregulated market because regulators mandate a minimum
acceptable quality for standards and sanction those who don't
comply.1,2 In examining current HTML standards, we've come to
suspect that a race to the bottom could, in fact, be occurring because so many
competing versions of HTML exist.
At this time, some nine different versions of HTML (including its successor, XHTML) are supported as W3C standards, with the most up-to-date being XHTML 1.1. Although some versions are very old and lack some of the newer versions' capabilities, others are reasonably contemporaneous. In particular, HTML 4.01 and XHTML 1.0 both have "transitional" and "strict" versions. Clearly, the W3C's intent is to provide a pathway to move from HTML 4.01 to XHTML 1.1, and the transitional versions are steps on that path. It also aims to develop XHTML standards that support device independence (everything from desktops to cell phones), accessibility, and internationalization. As part of this effort, HTML 4.01's presentational elements (used to adjust the appearance of a page for older browsers that don't support style sheets) are eliminated in XHTML 1.1.

Our concern is that Web site designers might decline to follow the newer versions' more stringent formatting requirements and will instead keep using transitional versions. To determine if this is likely, we surveyed the top 100,000 most popular Web sites to discover what versions of HTML are in widespread use.

Document types

The W3C maintains the basic Document Type Definitions for each HTML version (see www.w3.org/QA/2002/04/valid-dtd-list.html). HTML 2.0 dates back to 1995 and came from the IETF. However, since 1996, the W3C has produced the HTML recommendations, and, in 2000, HTML became an international standard (ISO/IEC 15445:2000). The most recent HTML recommendation, which the W3C published in 1999, was HTML 4.01, with errata produced in 2001. More recently, the W3C has concentrated on XHTML, a reformulation of HTML into an XML vocabulary. The most significant difference between HTML and XHTML is the requirement that the document be well-formed and that all elements be explicitly closed, as with XML. XHTML 1.0 became a W3C recommendation in 2000 and XHTML 1.1 in 2001.
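The well-formedness requirement that separates XHTML from HTML can be illustrated with a minimal sketch. It assumes Python's standard XML parser as a stand-in for a real validator; the function name and the two sample fragments are illustrative only and don't come from the survey. Note that well-formedness is necessary but not sufficient: a valid XHTML document must also conform to its DTD, which this sketch doesn't check.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses as XML (all elements properly closed)."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# HTML 4.01 style: the <p> end tags are omitted and <br> is never closed.
html_style = "<div><p>First paragraph<br><p>Second paragraph</div>"

# XHTML style: every element is explicitly closed, as XML requires.
xhtml_style = "<div><p>First paragraph<br /></p><p>Second paragraph</p></div>"

if __name__ == "__main__":
    print("HTML-style fragment well-formed: ", is_well_formed(html_style))
    print("XHTML-style fragment well-formed:", is_well_formed(xhtml_style))
```

An XML parser rejects the first fragment outright, whereas a fault-tolerant HTML parser would accept both.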
There are three types of XHTML 1.0, each with an analogous HTML 4.01 version: Strict, Transitional, and Frameset.
XHTML 1.1 is a reformulation of XHTML 1.0 Strict, using XHTML Modularization. The deprecated HTML features (such as presentational elements and frames) that XHTML 1.0 Strict still allowed have been removed from version 1.1. Presentation is thus restricted to style sheets, and older browsers that can't comprehend CSS will have difficulty with XHTML 1.1.

Browser support

David Hammond, an advocate for standards-based Web technologies, examined mainstream browsers' level of compliance with common Web technologies, recommendations, and standards (see www.webdevout.net/browser-support). With respect to HTML and XHTML, he explored compliance in terms of functionality for every language element within these recommendations. Table 1 summarizes browser support for HTML 4.01, XHTML 1.0, and XHTML 1.1. The Internet Explorer browser currently lags behind the Firefox 2 and Opera 9 browsers in compliance with current HTML and XHTML standards. However, note that even these browsers fall far short of compliance with XHTML 1.1.
Clearly, the tools used to produce HTML or XHTML documents also significantly impact which document types are used. We have little insight into which tools are dominant because the market is itself fragmented. Although many Web sites are hand coded or created using WYSIWYG editors (such as Adobe Dreamweaver [www.adobe.com] or MS Expression Web), others are dynamically generated from database or business logic back ends (Apache Tomcat Struts, Adobe ColdFusion 4, and Microsoft .NET) using tier generators (Iron Speed and Blue Ink). We aren't aware of any studies that establish these differing approaches' prevalence.

The survey

To perform our survey, we used the W3C HTML Validator (validator.w3.org/source/) to test each Web site's main page; we assume the Document Type Definitions remain constant across the entire site. We performed the survey on Alexa's top 100,000 Web site list during Fall 2006. Alexa composes and ranks its list based on the geometric mean of the number of individuals visiting a site and the number of pages they access while there. According to Alexa.com, Web sites that aren't on this list have less than a 0.00125 percent chance of being visited by the average Internet user.3

We determined these Web sites' geographic location by obtaining their IP addresses and comparing them with an address database purchased from IP2Location, which maps IP addresses to a particular nation and claims to have more than 95 percent accuracy4 (top-level domains don't reliably indicate the actual locations where Web sites are hosted5). Most of the 100,000 Web sites came from 131 different countries; for 1,235 Web sites, the country of origin was unknown or the Web site was unreachable at the time of the query.

Results

We first investigated the proportion of Web sites that actually included the mandatory DOCTYPE declaration. We suspected that some Web sites had ignored this syntactic rule, relying instead on Web browsers' fault-tolerance capabilities to display their pages.
However, we found that this practice was much more widespread than expected. Figure 1 plots the number of valid document types observed for each group of 1,000 Web sites, ranked in order of popularity (centiles). Roughly 20 percent of even the most popular sites don't bother with a DOCTYPE declaration. Although this omission isn't fatal to a Web site, it's emblematic of the broader lack of interest in following current Web standards that we've observed. This finding, in fact, echoes prior research, which found that numerous Platform for Privacy Preferences Project (P3P) privacy policies (which are likewise XML documents) were syntactically invalid.6
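The check described above, whether a page declares a document type and which one, can be approximated with a short sketch. The survey itself relied on the W3C HTML Validator; the regular-expression approach below is a simplified substitute, and the function names and label format are assumptions made for illustration.

```python
import re
from typing import Optional

# Matches a DOCTYPE declaration that carries a public identifier, e.g.
# <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "...">
DOCTYPE_RE = re.compile(
    r'<!DOCTYPE\s+\w+\s+PUBLIC\s+["\']([^"\']+)["\']',
    re.IGNORECASE,
)


def public_identifier(page_source: str) -> Optional[str]:
    """Return the DOCTYPE public identifier, or None if none is declared."""
    match = DOCTYPE_RE.search(page_source)
    return match.group(1) if match else None


def document_type(page_source: str) -> str:
    """Reduce a page to a short label such as 'XHTML 1.0 Transitional'."""
    identifier = public_identifier(page_source)
    if identifier is None:
        return "no DOCTYPE"
    # W3C identifiers look like "-//W3C//DTD XHTML 1.0 Transitional//EN".
    parts = identifier.split("//")
    if len(parts) >= 3 and parts[2].startswith("DTD "):
        return parts[2][len("DTD "):]
    # Anything else (for example, an ISO identifier) is reported unchanged.
    return identifier


if __name__ == "__main__":
    sample = (
        '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
        '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
        "<html><head><title>t</title></head><body></body></html>"
    )
    print(document_type(sample))
    print(document_type("<html><body></body></html>"))
```

This sketch only reads the declaration; unlike a validator, it says nothing about whether the page actually conforms to the declared document type.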
Figure 1. Valid document types vs. popularity. Roughly 20 percent of the most popular sites don't use a DOCTYPE declaration.

In addition, we noticed a slight downward trend, indicating that less popular Web sites are even less likely to provide a DOCTYPE declaration. We used Spearman's correlation coefficient to quantify this trend. We found that a statistically significant relationship does indeed exist between popularity and the proportion of Web sites providing DOCTYPE declarations (p < 0.001). However, the effect size (r) of this relationship is minuscule, meaning that this relationship has no operational significance.

Table 2 presents the 14 different document types our survey found to be in use, along with the number of Web sites using each type. The three most common document types (XHTML 1.0, HTML 4.01, and HTML 4.0) are all transitional versions; in fact, HTML 4.0 is a deprecated predecessor of HTML 4.01. The strict versions of XHTML 1.0 and HTML 4.01, together with XHTML Basic 1.0, XHTML 1.1, and ISO HTML (a strict standard under ISO/IEC control), account for less than 2.5 percent of Web sites surveyed. Furthermore, as previously noted, a full 22 percent of the sites observed didn't provide a document type at all.

Clearly, Web site designers are opting en masse for less restrictive standards. A race to the bottom is apparent, in that designers are embracing XHTML 1.0's expanded capabilities but appear wholly uninterested in using the more rigorously defined versions or XHTML 1.1. A significant minority appears content to ignore the mandatory DOCTYPE declaration completely, and trust that fault-tolerant Web browsers will mask this omission.
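The Spearman rank correlation used earlier in this section to relate popularity to DOCTYPE usage can be sketched in a few lines. This is a minimal standard-library version, computed as the Pearson correlation of average ranks; it doesn't compute the p-value reported above, and the sample numbers are placeholders invented for illustration, not survey data.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Placeholder values only: popularity group index versus the share of
    # sites in that group that declare a DOCTYPE.
    group = [1, 2, 3, 4, 5]
    share_with_doctype = [0.80, 0.79, 0.81, 0.76, 0.75]
    print(round(spearman_rho(group, share_with_doctype), 3))
```

A negative coefficient here would correspond to the downward trend described above: the further down the popularity ranking, the smaller the share of sites with a DOCTYPE declaration.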
The results in Table 2 are an aggregate picture across all 131 countries included in the Alexa 100,000 ranking. However, in previous research,7 we observed that Internet standards might be used (or ignored) in very different ways in different nations or cultures. Additionally, Web sites in Asian nations could wish to use the ruby characters available in XHTML 1.1, providing an additional adoption driver for stricter versions of XHTML in those countries. In Table 3, we break the proportion of document types down by country. We include only countries that host more than 1 percent of the total Web sites; all others are collected into the "other nations" category. All document types that account for less than 1 percent of the total are grouped into the "other" category. We found that, although specific proportions vary between countries, the top three transitional standards from Table 2 still dominate the Web sites from every single country. The most interesting difference from Table 3 is how often Web sites provide no DOCTYPE declaration; this varies between nations from 14.1 percent (third most common) in the UK to more than 40 percent of Web sites in China (second most common). In all nations, however, "no DOCTYPE" was either the second- or third-most common result.
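The grouping rule behind the per-country breakdown can be made concrete with a small sketch. It assumes each surveyed site has already been reduced to a (country, document type) pair; the function name, the input format, and the sample data are hypothetical and are not taken from the survey.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def breakdown(
    sites: Iterable[Tuple[str, str]], threshold: float = 0.01
) -> Dict[str, Dict[str, float]]:
    """Return, for each country, the share of its sites using each document type.

    Countries hosting no more than `threshold` of all sites are pooled into
    "other nations"; document types below `threshold` are pooled into "other".
    """
    pairs = list(sites)
    total = len(pairs)
    country_counts = Counter(country for country, _ in pairs)
    doctype_counts = Counter(doctype for _, doctype in pairs)

    table: Dict[str, Counter] = defaultdict(Counter)
    for country, doctype in pairs:
        # A country keeps its own row only if it hosts more than the threshold.
        row = country if country_counts[country] / total > threshold else "other nations"
        # A document type keeps its own column unless it falls below the threshold.
        col = doctype if doctype_counts[doctype] / total >= threshold else "other"
        table[row][col] += 1

    return {
        row: {col: count / sum(cols.values()) for col, count in cols.items()}
        for row, cols in table.items()
    }


if __name__ == "__main__":
    # Invented sample: 200 sites, two large countries and two tiny ones.
    sample = (
        [("US", "XHTML 1.0 Transitional")] * 150
        + [("UK", "no DOCTYPE")] * 48
        + [("XX", "HTML 3.2"), ("YY", "XHTML 1.0 Transitional")]
    )
    for country, shares in breakdown(sample).items():
        print(country, shares)
```

Shares are computed within each row, so every country's proportions sum to 1, which is what allows "no DOCTYPE" rates to be compared across nations of very different sizes.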
Our survey findings imply that the current effort to develop XHTML 2.0 might well be just an academic exercise. The race to the bottom, however, isn't inevitable. Ronald Dye and Shyam Sunder suggest that the US Securities and Exchange Commission should deliberately introduce a standards competition;1 they feel a race to the bottom is unlikely because the transparency of American securities regulation is a key selling point for standards users (corporations). This market already comprises companies that self-select for strong standards, which reassures investors. Karim Jamal, Michael Maier, and Sunder describe an actual market inversion,2 in which the most stringent Web privacy seals (BBBOnline and TRUSTe, for example) were also the least expensive and most widely used. In both cases, the benefits of the more stringent standards were perceived to outweigh the greater compliance costs.

This points us to a way to end the race to the bottom in HTML. First, XHTML 2.0 must provide significant added value, in the form of new capabilities that aid Web site developers in their objectives. Second, it must have no transitional version, because any transitional version would become the de facto standard for Web site design.
Patricia Beatty is a network engineer in San Francisco. Her research interests include trust and standards compliance in Web-based systems. Beatty has a B.Sc. and an M.Sc. in computer engineering, both from the University of Alberta. Contact her at plbeatty@gmail.com.

Scott Dick is an assistant professor of computer engineering at the University of Alberta. His research interests include computational intelligence, data mining, and machine learning. Dick has a PhD in computer science and engineering from the University of South Florida. He is a member of the IEEE and the ACM. Contact him at dick@ece.ualberta.ca.

James Miller is a professor of computer engineering at the University of Alberta. His research interests include software verification and validation, and embedded, Web-based, and ubiquitous environments. Miller has a PhD in computer science from the University of Strathclyde. He is a member of the IEEE. Contact him at jm@ece.ualberta.ca.