Is (X)HTML validity necessary for RDFa?

Is it necessary to produce valid (X)HTML when you want to add RDFa annotations?

What do RDFa parsers and distillers do when they encounter invalid mark-up? Some distillers throw a parsing error upfront; others parse up to the point of the syntax error and output only the first few triples from an invalid page. What about Facebook and the Open Graph Protocol? I have seen many pages using this mark-up that don't validate. Does Facebook employ a more fault-tolerant parser?

I hear a lot about RDFa being simple to add because you "just need to annotate your HTML pages with a little bit of extra mark-up". This implies that your mark-up is already valid, which is often not the case, especially when your pages are generated on the fly by a content management system, where it may be difficult to modify your HTML (e.g., add the correct DOCTYPE).

I think this is a huge stumbling block for RDFa use in practice. First, you need to tidy your HTML (and possibly beautify it in order to know what's what) so that it validates before you can start adding RDFa. When I was trying to enrich existing pages with RDFa, I discovered that this preparatory step can be very difficult to get done.

I would like to hear whether it is in fact necessary to add RDFa only to valid (X)HTML documents. Or does it make sense to add RDFa even to invalid (X)HTML code? Is invalid (X)HTML a real issue hindering the use of RDFa in practice? Will major RDFa users (search engines, Facebook) handle erroneous mark-up or ignore it? Is there anyone more knowledgeable about RDFa parsers who can shed some light on this?

EDIT: The question re-phrased: Do major RDFa users (search engines like Google or Yahoo, Facebook) care about validity of (X)HTML+RDFa?

Gut reaction: At a basic level (properly closed tags) I guess the answer is yes, but on a broader level, probably not. This doesn't say anything about the parsers, as these may be looking at the DOM on a broad level.

As we discovered (off list), RDFa validates in some places and not in others. Intelligently written parsers, i.e. ones that do not rely on XML-like structure, will be able to make sense of documents that are well defined, even given the issues you mention.

From the point of view of someone who uses HTML5 and RDFa, sure, validation is nice, but it isn't possible in every case because the validators just aren't up to scratch — the same will be true of parsers.

My advice for alleviating these issues would be to treat (RDFa) annotations as part of the semantics of your HTML and CSS rather than dropping in big hidden divs. Ideally, you should therefore provide classes that represent annotations in different contexts, i.e. both cases where the HTML and content don't lend themselves to inline annotation (so you have to embed the annotations, much like embedding metadata) and cases where they do.

It is important to recognise that RDFa is primarily a Semantic Web enabler, not necessarily a way to annotate fragments of HTML (a secondary use case).

Well, I would say that adopters like Facebook do not really care about RDFa validity. However, they care about the validity of their own contribution - the Open Graph Protocol. Often you don't even have to define the namespace of OGP.

I'm with brixmat: it very much depends on the parser they use for the actual HTML. If the parser is built upon a fault-tolerant HTML parser, i.e. something like the quirks-mode-capable parsers that all browsers employ, then they should parse the RDFa fine. Yes, properly closed tags etc. should be used, and they are probably a must for the less tolerant parsers to work correctly.
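To illustrate the point about fault-tolerant parsers, here is a minimal sketch using only Python's standard-library `html.parser`. It is not a real RDFa processor and is not how Facebook or any search engine actually works; it only shows that a lenient parser can still see `property`/`content` attributes in a page that would never validate. The class name `PropertyCollector` and the sample page are made up for this example.

```python
from html.parser import HTMLParser


class PropertyCollector(HTMLParser):
    """Collects (property, content) pairs from any start tag carrying both attributes.

    Hypothetical helper for illustration only; it ignores prefixes, subjects,
    and everything else a real RDFa processor would have to handle.
    """

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if "property" in attr_map and "content" in attr_map:
            self.pairs.append((attr_map["property"], attr_map["content"]))


# Deliberately invalid: no DOCTYPE, undeclared prefixes, unclosed <p>, <b>
# and <span>, and no closing </body> or </html>.
BROKEN_PAGE = """
<html>
<head>
<meta property="og:title" content="Example page">
<meta property="og:type" content="article">
</head>
<body>
<p>Some <b>unclosed text
<span property="dc:creator" content="Jane Doe">
"""

collector = PropertyCollector()
collector.feed(BROKEN_PAGE)
collector.close()

for name, value in collector.pairs:
    print(name, "=", value)
```

A strict XML parser would reject this input at the first unclosed `<meta>`, which matches the behaviour the question describes for some distillers.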

Looking at the RDFa 1.0 specification with respect to conformance, it doesn't actually say explicitly that the (X)HTML document must be well formed.

Of course, there is the issue that particularly badly formed (X)HTML may cause the RDFa attributes to result in unintended triples, but I've yet to see an example of (X)HTML+RDFa that was that bad :-)

Formally, RDFa 1.0 is bound to XHTML, i.e., it does indeed require valid XHTML. The question is what distillers do. The RDFa 1.0 version of my RDFa distiller tries to parse the content with an XML parser (Python's minidom parser) and, if that parser fails, falls back to the HTML5 parser in Python. If even that cannot parse the content, then you are really out of luck...
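The try-XML-first, fall-back-to-HTML strategy can be sketched roughly as follows. This is not the distiller's actual code: the function name `collect_properties` is hypothetical, and the standard-library `html.parser` stands in for the HTML5 parser mentioned above, purely so the example runs without extra dependencies. It only collects `property` attribute values rather than producing triples.

```python
from html.parser import HTMLParser
from xml.dom import minidom
from xml.parsers.expat import ExpatError


class _PropertyCollector(HTMLParser):
    """Lenient fallback: records every `property` attribute value seen."""

    def __init__(self):
        super().__init__()
        self.properties = []

    def handle_starttag(self, tag, attrs):
        value = dict(attrs).get("property")
        if value is not None:
            self.properties.append(value)


def collect_properties(markup):
    """Return (parser_used, property_values), trying the strict XML parser first."""
    try:
        document = minidom.parseString(markup)
    except ExpatError:
        # Not well-formed XML: fall back to the tolerant HTML parser
        # (a stand-in here for a real HTML5 parser).
        collector = _PropertyCollector()
        collector.feed(markup)
        collector.close()
        return "html", collector.properties
    elements = document.getElementsByTagName("*")
    values = [e.getAttribute("property") for e in elements if e.hasAttribute("property")]
    return "xml", values


WELL_FORMED = '<div xmlns="http://www.w3.org/1999/xhtml"><span property="dc:title">T</span></div>'
BROKEN = '<div><span property="dc:title">T</div>'  # <span> is never closed

print(collect_properties(WELL_FORMED))
print(collect_properties(BROKEN))
```

Both inputs yield the same property list; only the parser that produced it differs.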

RDFa 1.1 is a bit different insofar as it has a draft for RDFa 1.1+HTML5, i.e., it is based on the more permissive version of HTML. The shadow distiller (i.e., the beta for RDFa 1.1) that I already have out there looks at the media type and, if it is text/html, uses the HTML5 parser in Python, completely bypassing the XML parser.
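A rough sketch of that media-type dispatch, again not the distiller's real code. The function name `choose_parser` and the returned labels are invented for illustration, and the behaviour for media types other than text/html is an assumption (the XML-first path), since only the text/html case is described above.

```python
def choose_parser(content_type):
    """Pick a parsing strategy from a Content-Type header value.

    Hypothetical helper: returns a label rather than a parser object.
    """
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "text/html":
        # Lenient path: skip the XML parser entirely.
        return "html5"
    # Assumption: everything else goes through the strict XML parser first.
    return "xml-first"


print(choose_parser("text/html; charset=utf-8"))
print(choose_parser("application/xhtml+xml"))
```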

(I am not sure how the other RDFa tools behave in this respect.)

I hope this helps.


From a spec perspective: XHTML is irrelevant, the future is HTML5. HTML5 parsers have well-defined behaviour for many classes of broken markup (including previous versions of HTML). Any HTML5+RDFa parser worth its salt should be expected to be able to deal with markup that doesn't validate.

From a major consumer perspective: Search engines, like browsers, have always been able to deal with invalid markup. They can't afford to be picky because, strictly speaking, most of the web is broken.