Comments on: When Linked Data Rules Fail

By: David

David — Thu, 18 Feb 2010 06:25:21 +0000

Fred –

Very interesting… haven’t fully read your entire article or all the comments yet…

At the moment I’m going to side with Jim Hendler’s remark that his project is the FIRST STEP in a long, long journey.

First you have to expose the data to start seeing the discrepancies… some folks report “expenditures” as positive & others as negatives. Who would have thought?

My analogy(s)

– in the physical world there are multiple constituencies (the dairy, the brand aggregator, the distributor, the local food inspector, the grocery store manager, the customer) to make sure that when you go to the grocery store, you ALWAYS find milk in milk jugs. You NEVER find orange juice or motor oil in milk jugs.

We’re not even remotely close to such congruence (e.g. expectations & reality) in the data realm. It’s Tuesday? Well, that must mean we’ll find my mail in your post box.

In the early days of the Industrial Age in the UK, it took 75 years to agree on screw thread standards. And there are still 5 standard threads in the UK today, 250 years later.

First step is to make the data visible… make it possible for MANY eyeballs to look at this crap.

Once people start SEEING that my FOOBAR is your FUBAR then we can move forward with figuring out what to do.

It’s going to be a long, slow process.

Looking at it from another angle… have you ever worked with an organization with SERIOUS data quality issues? What’s the organizational response when you try, ever so diplomatically, to point out data quality issues? Ever tried to talk to a senior manager about data quality issues?

People who work with data know it’s often incomprehensible crap… and they’ve long since learned to keep their personal observations to themselves, even though the bad data is clearly costing the organization serious money.

Just my two cents…

By: Denny Vrandecic

Denny Vrandecic — Wed, 18 Nov 2009 23:43:42 +0000

Hi Glenn,

I think I finally got it: we had a different understanding of the term “model”. I was using the technical term as in “model theory”, in which case both RDF sets are actually satisfied by the same model (and thus one cannot be bad and the other good). You were referring to the concrete representation of this in RDF triples. Sorry for my confusion.

“But it would only work if each newspaper used a unique predicate”

Correct.

“which is exactly counter to the idea of common ontologies.”

Mostly correct. I do think that on the SemWeb everyone should be able to extend their ontologies. In that case — and now we come full circle to Fred’s original point — it would be really helpful to formally define that extension with a schema and a mapping. It would make sense to have a generic

news:last_use

property that is somehow linked to nyt:last_use, e.g. like this:

nyt:last_use rdf:type news:last_use .
nyt:last_use news:for_paper glenn:nytimes .

and thus can again relate the common ontology to that one extended by the NYT (This is meant as plain RDF, but is also OWL2 compatible. Yay! And it needs no RIF semantics and inference structure).

In SPARQL it enables the very same queries that are enabled in your system, but quite a number of triples and individuals can be saved in the serialization of the data.
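
A minimal sketch of how the two schema triples above could be used, written in Python with plain tuples instead of a real triple store. The subject ex:obama, the date value, and the helper last_use_by_paper are hypothetical names chosen for illustration; a real deployment would presumably run an equivalent SPARQL query against the published data.

```python
# Triples are (subject, predicate, object) tuples.
TRIPLES = {
    # Schema triples from the comment above
    ("nyt:last_use", "rdf:type", "news:last_use"),
    ("nyt:last_use", "news:for_paper", "glenn:nytimes"),
    # Hypothetical instance data, invented for this example
    ("ex:obama", "nyt:last_use", "2009-11-17"),
}

def last_use_by_paper(triples, subject):
    """Return {paper: date} via every predicate typed as news:last_use."""
    # Paper-specific predicates that are declared as news:last_use
    specific = {s for (s, p, o) in triples
                if p == "rdf:type" and o == "news:last_use"}
    # Which paper each of those predicates belongs to
    paper_of = {s: o for (s, p, o) in triples
                if p == "news:for_paper" and s in specific}
    # Look up the subject's values through the paper-specific predicates
    return {paper_of[p]: o for (s, p, o) in triples
            if s == subject and p in paper_of}

print(last_use_by_paper(TRIPLES, "ex:obama"))
```

The point of the sketch is only that the generic question "when did each paper last use this subject?" can be answered without a reifying individual, as long as every paper-specific predicate is described by schema triples like the two above.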

The only reason I am defending the nyt:last_use property as it is, is because I deem it both elegant and correct. We obviously disagree on the “elegant” part.

For an example that actually went wrong, see what linkedgeodata.org is doing. If you use the REST API to query for info about a specific point, like here:

http://linkedgeodata.org/triplify/near/51.033333,13.733333/1000

it contains a triple like this:

node:367589550#id lgeo:distance "954" .

The subject is a location, and it has the distance 954. This is calculated with regard to the request you sent, i.e. if you move the requested location around, the distance will change (I pointed them to the bug a few weeks ago, so I don’t feel bad about disclosing it publicly anymore 🙂).

This again is elegant (in my opinion), but it is wrong, and there is no way it can be remedied, i.e. turned into an acceptable semantics. Here the model is wrong, whereas in the NYT example we are just speaking about a syntactic, or rather representational, issue that can be resolved with a formally specifiable semantics.
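
To make the problem concrete, here is a small Python sketch. The node identifier and the value 954 come from the triple above; the second distance value and the merge helper are invented for illustration. It shows that once the responses to two different requests are merged, the same subject carries two distances and nothing in the data says which request each one belongs to.

```python
# Each response is a set of (subject, predicate, object) triples.
# The first distance is from the example above; the second is hypothetical.
response_a = {("node:367589550#id", "lgeo:distance", "954")}  # request near point A
response_b = {("node:367589550#id", "lgeo:distance", "120")}  # request near another point B

def merge(*graphs):
    """Merge graphs by taking the union of their triples."""
    merged = set()
    for g in graphs:
        merged |= g
    return merged

merged = merge(response_a, response_b)
distances = sorted(o for (s, p, o) in merged if p == "lgeo:distance")
print(distances)  # two values for one node, with the request context lost
```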

So, this is just to explain my misunderstanding, and why I defended the property as is. I do understand and appreciate your point and regard it as a valid alternative.

Best,
denny

By: glenn mcdonald

glenn mcdonald — Wed, 18 Nov 2009 23:09:56 +0000

Denny:

Yes, that’s a formalization of the transformation from this wrong model to the right one. But it would only work if each newspaper used a unique predicate, which is exactly counter to the idea of common ontologies.

But again, getting the model right is easy, and has really no bad effects on anything, so I see no practical point in defending the existing model, at all. Much less in proposing, to an organization earnestly and admirably trying to just get the Linked Data basics right, some hypothetical solution combining a needlessly bad data-model with an inference infrastructure they don’t have and have probably never even heard of!

By: Denny Vrandecic

Denny Vrandecic — Wed, 18 Nov 2009 22:27:51 +0000

Glenn,

that is what I tried to argue with the foaf:schoolHomepage example. You can reify the relation as you suggest, but you don’t have to.

Basically, the following rules resolve the formal relations between your suggestion (using the prefix glenn:) and the current NYT export (using the prefix nyt:):

nyt:last_use(x,y) = glenn:about(z,x) & glenn:last_use(z,y) & glenn:publication(z,glenn:nytimes)

nyt:first_use(x,y) = glenn:about(z,x) & glenn:first_use(z,y) & glenn:publication(z,glenn:nytimes)

Sorry, it isn’t pretty (the rules are given in datalog syntax, and I am not sure if they can be represented in OWL2 (would need to ask the OWL2 crowd), but they work in RIF. z would be the newly introduced, reifying individual in your solution).

This solution shows that both modeling approaches can have equivalent semantics, thus I find it hard to regard either one as an error (just as I don’t say that the semantics of foaf:schoolHomepage is an error. It’s a choice)

Adding this axiom to a SPARQL endpoint with an appropriate entailment regime will yield the same results whether you use your RDF or the NYT one (apart from URI naming).
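
As a rough illustration of the rules above, here is a Python sketch that applies them in one direction only, from the reified glenn: form to the flat nyt: form. The individual ex:z1, the subject ex:obama, the dates, and the function derive_nyt are hypothetical. The reverse direction would have to invent a fresh individual z for every flat triple, which this sketch does not attempt.

```python
# Reified data in the style of the glenn: proposal; ex:z1 plays the role of z.
REIFIED = {
    ("ex:z1", "glenn:about", "ex:obama"),
    ("ex:z1", "glenn:publication", "glenn:nytimes"),
    ("ex:z1", "glenn:first_use", "2005-01-04"),
    ("ex:z1", "glenn:last_use", "2009-11-17"),
}

def derive_nyt(triples):
    """Apply nyt:P(x,y) <- glenn:about(z,x) & glenn:P(z,y) & glenn:publication(z,glenn:nytimes)."""
    # (z, x) pairs from glenn:about
    about = {(s, o) for (s, p, o) in triples if p == "glenn:about"}
    # reifying individuals z whose publication is the NYT
    nyt_refs = {s for (s, p, o) in triples
                if p == "glenn:publication" and o == "glenn:nytimes"}
    derived = set()
    for (z, p, y) in triples:
        if p in ("glenn:first_use", "glenn:last_use") and z in nyt_refs:
            for (z2, x) in about:
                if z2 == z:
                    derived.add((x, p.replace("glenn:", "nyt:"), y))
    return derived

for triple in sorted(derive_nyt(REIFIED)):
    print(triple)
```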

Sorry for the technicality.

By: glenn mcdonald

glenn mcdonald — Wed, 18 Nov 2009 21:58:18 +0000

See http://fgiasson.com/blog/index.php/2009/11/16/when-linked-data-rules-fail/#comment-286267, above!

By: Denny Vrandecic

Denny Vrandecic — Wed, 18 Nov 2009 21:47:08 +0000

Glenn,

“No, this person/reference thing really is a modeling error.”

Sorry, now I am confused. Can you please explain what the error is?

By: glenn mcdonald

glenn mcdonald — Wed, 18 Nov 2009 21:22:11 +0000

Evan, Fred:

Fred suggests ditching the first_use/latest_use entirely, and just deriving them by querying the articles’ dates. But this is really two separate issues:

1. It would definitely be cooler to expand the model to include all the individual articles, with their individual attributes. But obviously this is non-trivial work for you.

2. But even if you did that, it’s very reasonable to have your data-model also go ahead and represent, directly, some extra things (article count, first use, latest use) that *could* be queried, precisely so that they don’t have to be queried. There’s a minor data-redundancy argument against doing this, but it only really applies if you’re *maintaining* the dataset that way, so in this case where the RDF is being (I assume) generated out of some other database of record, it doesn’t matter.
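
A small sketch of point 2, assuming the individual articles are available as records with a subject and a publication date. The record layout, the subject ex:obama, the dates, and the function summarize are all hypothetical. The summary values are computed once at export time so that consumers do not have to query every article to get them.

```python
from datetime import date

# Hypothetical article records; in practice these would come from the
# database of record that the RDF is generated from.
articles = [
    {"about": "ex:obama", "published": date(2005, 1, 4)},
    {"about": "ex:obama", "published": date(2009, 11, 17)},
    {"about": "ex:obama", "published": date(2007, 6, 2)},
]

def summarize(articles, subject):
    """Derive article count, first use and latest use for one subject."""
    dates = [a["published"] for a in articles if a["about"] == subject]
    if not dates:
        return None  # no articles mention this subject
    return {
        "article_count": len(dates),
        "first_use": min(dates).isoformat(),
        "latest_use": max(dates).isoformat(),
    }

print(summarize(articles, "ex:obama"))
```

Because the summary is regenerated from the articles on every export, the redundancy never has to be maintained by hand.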

By: Denny Vrandecic

Denny Vrandecic — Wed, 18 Nov 2009 21:19:08 +0000

Hi Fred!

I think we do 🙂

Cheers,
denny

By: glenn mcdonald

glenn mcdonald — Wed, 18 Nov 2009 21:13:46 +0000

Denny:

No, this person/reference thing really is a modeling error. That doesn’t inherently make it a bad *publishing* decision. It’s certainly fairly common to make deliberate modeling errors in order to live with simpler structures that are adequate for some particular purpose. The NYTimes model, as is, is adequate in isolation.

But to the extent that this data is supposed to be mixable with other data (it’s Linked Data, not just Open Data), and especially to the extent that the Times is leading by example, and thus establishing an example that other newspapers might follow, then this data-model is *not* just going to live in isolation, and thus this particular error matters.

And it’s eminently and easily fixable, after all. Doing the model correctly is not appreciably harder or worse for the Times, and it makes the data more usable, not less. So there’s really no tradeoff here.

By: Fred

Fred — Wed, 18 Nov 2009 21:10:30 +0000

Hi Denny!

Okay, I think this certainly gives a good conclusion to this conversation and some kind of “agreement” 🙂

Do we agree? 🙂

Thanks,

Take care,

Fred