Is adding language tags to literals a best practice?

ldodds · July 11, 2010, 10:00pm

Do people consider it a best practice to always add language tags to RDF literals?

Or, to ask this slightly differently, are there any good reasons for not specifying a language for a literal property?

DaveReynolds · July 11, 2010, 10:00pm

There is a difference between a string and linguistic text; you should only use a language tag for the latter.

Thus, for example, an rdfs:comment or rdfs:label is arguably always linguistic text that you might translate into other languages in the future. Those should always have lang tags, since a mix of plain literals and lang-tagged literals is such a pain to query.
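To illustrate why mixed data is awkward, here is a minimal Python sketch that models a literal as a lexical form plus an optional language tag. The `Literal` type, the `labels` data and both helper functions are hypothetical and belong to no RDF library; the filter only mimics what a language filter in a query would do.

```python
from typing import List, NamedTuple, Optional

class Literal(NamedTuple):
    """Hypothetical stand-in for an RDF literal: text plus an optional language tag."""
    lexical: str
    lang: Optional[str] = None  # None models a plain (untagged) literal

# Made-up rdfs:label values; the English label of ex:dog was left untagged.
labels = {
    "ex:cat": [Literal("cat", "en"), Literal("chat", "fr")],
    "ex:dog": [Literal("dog"), Literal("chien", "fr")],
}

def labels_in(subject: str, lang: str) -> List[str]:
    """Return the labels of `subject` whose language tag equals `lang`."""
    return [lit.lexical for lit in labels[subject] if lit.lang == lang]

def labels_in_or_untagged(subject: str, lang: str) -> List[str]:
    """Workaround for mixed data: also accept literals with no tag at all."""
    return [lit.lexical for lit in labels[subject] if lit.lang in (lang, None)]

print(labels_in("ex:cat", "en"))              # ['cat']
print(labels_in("ex:dog", "en"))              # []
print(labels_in_or_untagged("ex:dog", "en"))  # ['dog']
```

Every consumer of the mixed data has to remember the second form, and it can no longer tell an untagged English label apart from an untagged label in any other language.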

However, foaf:name[*], for example, should not have a language tag: my name is my name, and while there are equivalent names in other languages, those are not my name. Similarly, there are times when you are representing the lexical form of something, e.g. some user input, a keyword in a programming language, or the lexical part of a typed literal; those should not have lang tags.

Unfortunately, we have no way to declare whether the range of a particular property is expected to be with or without a language tag. [Well, using rdf:PlainLiteral, the rdf:langRange facet and OWL 2 you can require a language tag, so I guess you can also exclude them, but not in a way that is any use in RDFS.]

-- Update

[*] So that's a bit of an oversimplification, as pointed out by Dan C. Proper names are sometimes transliterated into alternative scripts, so lang tags could be used there; indeed, the FOAF spec explicitly allows for either. I'd personally still argue that it is preferable to stick to no lang tags as the common case and use other mechanisms such as aliases/pseudonyms for transliterations, but YMMV.

A more clear-cut FOAF example is foaf:nick (as in things like IM nicknames and login names).

cygri · July 11, 2010, 10:00pm

If the literal is likely to be understood only by speakers of a single language, then add a language tag. If it is likely to work for speakers of many languages, keep it without a language tag.

If the file or dataset has only a few exceptions, then it is perhaps better/simpler to go for consistency and mark them the same way as the rest of the file.

database_animal · July 11, 2010, 10:00pm

Just the other day there was somebody on the dbpedia list who was trying to do a very simple query (look up a label) and he was scratching his head figuring out how to do it... Until somebody told him he had to add a language identifier to the literal.
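That kind of head-scratcher is easy to reproduce in a toy model. The sketch below is not the actual DBpedia data or query; the triples and the `subjects_with_label` helper are made up for illustration. It only shows that an exact match compares the whole term, so a bare string is not equal to the same string carrying a language tag.

```python
# Each triple is (subject, predicate, object). A literal object is modelled
# as a (lexical form, language tag) pair, with None meaning "no tag".
triples = [
    ("ex:Berlin", "rdfs:label", ("Berlin", "en")),
    ("ex:Berlin", "rdfs:label", ("Berlino", "it")),
]

def subjects_with_label(literal):
    """Return subjects whose rdfs:label is exactly the given literal term."""
    return [s for s, p, o in triples if p == "rdfs:label" and o == literal]

print(subjects_with_label(("Berlin", None)))  # []
print(subjects_with_label(("Berlin", "en")))  # ['ex:Berlin']
```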

Somebody whose first experience is like that might come away with a negative impression of language identifiers. It's all too easy to get the feeling sometimes that multilingual support is like the tail wagging the dog.

On the other hand, I'm thinking of a multilingual project that I did in PostgreSQL many years ago. Keeping track of multiple languages in a relational database is a big pain no matter how you cut it. I mean, do you write

create table item (
    id        integer primary key,
    label_en  varchar(200),
    label_jp  varchar(200),
    label_zh  varchar(200),
    ...
)

or do you create a separate table for text fields that are in each language? Do you build the language id into the table name or do you make it a secondary key on the table? Do you build one big table of string blobs that are language tagged?
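As a rough sketch of one of those alternatives (a separate label table with the language id as part of the key), here is a small example using Python's built-in sqlite3 module rather than PostgreSQL. The `item_label` table, its columns and the `label_for` helper are hypothetical names chosen for illustration, not the schema from the original project.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table item (id integer primary key);
    create table item_label (
        item_id integer not null references item(id),
        lang    text    not null,
        label   text    not null,
        primary key (item_id, lang)  -- one label per item per language
    );
""")
conn.execute("insert into item (id) values (1)")
conn.executemany(
    "insert into item_label (item_id, lang, label) values (?, ?, ?)",
    [(1, "en", "cat"), (1, "fr", "chat")],
)

def label_for(item_id, lang):
    """Return the label of an item in one language, or None if missing."""
    row = conn.execute(
        "select label from item_label where item_id = ? and lang = ?",
        (item_id, lang),
    ).fetchone()
    return row[0] if row else None

print(label_for(1, "en"))  # cat
print(label_for(1, "zh"))  # None
```

With this layout, adding a language is a row insert rather than a schema change, but every query that needs a label now carries an extra lookup or join.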

No matter which one of those choices you pick in an RDBMS, you're going to find it much more difficult to build and maintain the app. When a maintenance programmer comes in three years later, he might even add more bugs than he fixes.

Having lived through that hell, I'm very happy that string literals work the way they do in RDF.