rdfa – Garbage Collection

time vs data

There is a bug in HTML5 to remove time and replace it with data.

I think in general that <data> isn’t a bad idea. I agree that only being able to talk about <time> was a bit odd, and that many other values have the same effective use case. As mentioned by Ian Hixen:

dimensionless numbers (2.3)
numbers with units (5kg)
enumerated values, like months (February) or days of the week (Monday)
durations

However, I don’t think that data as it stands right now is done. As this example of the old <time> and new <data> shows there is a fairly heavy semantics loss from the change.

Published <time pubdate datetime="2009-09-15T14:54-07:00">on 2009/09/15 at 2:54pm</time>

Assigned data to RDF via magic human transformation step simply to talk about what data is there. Displayed in Turtle.

 @prefix magic: <https://gavin.carothers.name/vocabs/magic#> . 
<> magic:pubdate "2009-09-15T14:54-07:00"^^xsd:dateTime .

The HTML contains a reasonable amount of data. We know that the contents of datatime is an ISO datatime, we know that the relationship between that date time and the page is it’s publication date.

Published <data value="2009-09-15T14:54-07:00">on 2009/09/15 at 2:54pm</data>

Again magic human transformation to Turtle.

@prefix magic: <https://gavin.carothers.name/vocabs/magic#> . 
[] magic:unknown "2009-09-15T14:54-07:00" .

This time around don’t know much at all. We know that “2009-09-15T14:54-07:00” may some how be related to the current page. We don’t know how the string is formated, nor how the string is related to the page if it is at all.

As currently proposed <data> sure looks like a step backwards, but maybe it can be a step forward.

Just some invalid markup:

<data type="dateTime" value="2009-09-15T14:54-07:00" property="pubdate">

Just some RDFa:

<data datatype="xsd:dateTime" content="2009-09-15T14:54-07:00" property="magic:pubdate">

Just some microdata plus some datatype:

<data type="dateTime" value="2009-09-15T14:54-07:00" itemprop="pubdate">

Adding a generic data element without data typing doesn’t seem like a good idea. With datatyping I can see it working better then creating an element for every kind of data.

schema.org as RDFa (Part I)

schema.org an initiative by Google claims once again that RDFa is too complicated. That’s not really true. Here in fact are the first examples from schema.org in RDFa. There is a bonus as well for using RDFa rather then Microdata… you can test that RDFa is valid and gives you what you expect TODAY. Microdata and schema.org? No validation (STILL!), and no public parsers.

A movie

The first example from Schema.org is about marking up a movie:

schema.org

<div itemscope itemtype ="http://schema.org/Movie">
  <h1 itemprop="name"&g;Avatar</h1>
  <div itemprop="director" itemscope itemtype="http://schema.org/Person">
  Director: <span itemprop="name">James Cameron</span> (born <span itemprop="birthDate">August 16, 1954)</span>
  </div>
  <span itemprop="genre">Science fiction</span>
  <a href="../movies/avatar-theatrical-trailer.html" itemprop="trailer">Trailer</a>
</div>

RDFa

<div vocab="http://schema.org/" typeof="Movie">
 <h1 property="name">Avatar</h1>
 <div rel="director">Director:
   <span typeof="Person"><span property="name">James Cameron</span>
   (born <time property="birthDate" datetime="1954-08-16">August 16, 1954</time>)
   </span>
 </div>
 <span property="genre">Science fiction</span>
 <a rel="trailer" href="../movies/avatar-theatrical-trailer.html">Trailer</a>
</div>

That wasn’t very complicated. In fact compared with the example on Schema.org, why do some attributes need fully qualified URIs/IRIs and some don’t? How does microdata know that director is refering to schema.org’s director and not some other one?

Next is Spinal Tap!

<div vocab="http://schema.org/" typeof="Event">
 <div property="name">Spinal Tap</div>
 <span property="description">One of the loudest bands ever
 reunites for an unforgettable two-day show.</span>
 Event date:
 <time property="startDate" datetime="2011-05-08T19:30">May 8, 7:30pm</time>
</div>

Okay not even bothering with the Microdata/schema.org version. There are a few differences but not much. Where is all this complexity that RDFa introduces?

An offer, for a blender

<div vocab="http://schema.org/" type="Offer">
 <span property="name">Blend-O-Matic</span>
 <span property="price">$19.95</span>
 <link rel="availability" href="http://schema.org/InStock"/>Available today!
</div

Yeah, still not seeing why RDFa is more complicated.

Okay, that’s all I feel like dealing with before breakfast. Will look a few more later.

In the mean time, gratuitous baby pictures!

Update:

By specific request, RDFa for geo tagging:

schema.org

<div itemscope itemtype="http://schema.org/Place"
 <h1>What is the latitude and longitude of the <span itemprop="name">Empire State Building</span>?<h1>
 Answer:
 <div itemprop="geo" itemscope itemtype="http://schema.org/GeoCoordinates">
 Latitude: 40 deg 44 min 54.36 sec N
 Longitude: 73 deg 59 min 8.5 dec W
 <meta itemprop="latitude" content="40.75" />
 <meta itemprop="latitude" content="73.98" />
 </div>
</div>

schema.org as RDFa

<div vocab="http://schema.org/" typeof="Place"
 <h1>What is the latitude and longitude of the <span property="name">Empire State Building</span>?<h1>
 Answer:
 <div rel="geo">
 Latitude: 40 deg 44 min 54.36 sec N
 Longitude: 73 deg 59 min 8.5 dec W
 <span typeof="GeoCoordinates">
 <meta property="latitude" content="40.75" />
 <meta property="longitude" content="73.98" />
 </span>
 </div>
</div>

RDFish RDFa

 <div prefix="schema: http://schema.org/ dc: http://purl.org/dc/terms/ pos: http://www.w3.org/2003/01/geo/wgs84_pos#" 
  typeof="schema:Place pos:SpatialThing"
 <h1>What is the latitude and longitude of the <span property="dc:title schema:name">Empire State Building</span>?<h1>
 Answer:
 Latitude: 40 deg 44 min 54.36 sec N
 Longitude: 73 deg 59 min 8.5 dec W
 <meta property="pos:latitude schema:latitude" content="40.75" />
 <meta property="pos:longitude schema:longitude" content="73.98" />
 </div>
</div>

As with the earlier conversions, these are 5 minute jobs without really spending much time thinking about them. But these WORK, can be validated today and are still very simple. The RDFish version above does start to use some RDFa features that are considered confusing. It uses 3 vocabularies rather then just one. To do this it does use the much feared PREFIX. I’ve covered my opinion of prefixes before. I still stand by my statement that prefixes are simply not that complicated. Also, the RDFish version does drop the added GeoCoordinates instance, and intermediate schema:geo property, didn’t really see why they were there. Other possible improvements include adding datatypes to the properties in RDFa, but that’s not really necessary in this case.

RDF Database Expectations

Background

I use RDF databases to store 100% of O’Reilly Media’s product metadata. The catalog pages, shopping cart, electronic media distribution, product registration process, ONIX distribution, and most internal product reporting is based on RDF. The following are observations of what is necessary from a RDF database in order to successfully and easily build a similar system. As for what the clients need to be able to do… working on that. The features required are listed in descending order of priority to me.

SPARQL

An RDF or semantic database that does not support at least SPARQL 1.0 is not interesting. Writing queries in Prolog, XQuery, or another DSL is not acceptable. Getting folks to understand graphs and RDF is hard enough without also having to teach them languages that don’t work easily with graphs.

Correct

When running a simple query that works fine on many other implementations I don’t expect to find INCORRECT results. Throwing errors and saying something is unimplemented isn’t great but far better then returning results that are just WRONG.

SPARQL +

SPARQL doesn’t really do enough without extensions. The features I’ve found to be most useful are LET, and GROUP BY. LET can be used to “fake” bind parameters, create synthetic values for reports and make complex queries much simpler. Without GROUP BY nasty post processing is often necessary to produce summary reports from SPARQL queries. Other helpful functions are the XPath functions, a good set of always useful tools that I already know from years of work in XSLT and XQuery.

Named Graphs

Named graphs allow me to treat the RDF database as a document store. This rapidly reduces the complexity of loading and managing ETL operations. The SPARQL 1.1 Uniform HTTP Protocol for Managing RDF Graphs makes me very happy, and maps neatly on top of solutions that I’d already implemented before I even knew the SPARQL 1.1 Working group existed. See Tenuki for my own implementation of graph updates over HTTP. ChangeSets Talis style are useful too, but have been more complicated to generate then I had expected.

Concurrent

I expect to be able to write updates to graphs and read from graphs at the same time. I’ve encountered limitations related to Multi Reader Single Writer locks at the graph level, dataset level, and server level.

Parses RDF/XML

Twice now I’ve come across products that fail to parse perfectly valid RDF/XML. We aren’t talking complex RDF/XML structures either, just simple Literals and XMLLiterals that contain non ASCII data, or XML mixed content.

Installable

A database server should not require me to write software in order to use it. Simple command line clients and simple start stop scripts should not be too much to ask.

SPARQL EXPLAIN

SPARQL query optimizers tend to be odd beasts. I have found that it’s really easy to go from a query that runs in no measurable time to one that will, for all practical purposes, never complete. Understanding why with EXPLAIN is hard, without EXPLAIN impossible. Profiling would of course be better, but I’ll take what I can get.

Documented

If your RDF database supports a feature but doesn’t document the feature anywhere, it doesn’t support the feature. I’m should not need to read source code to find out what SPARQL syntax and extensions the database supports. If your products is a closed source RDF database Documentation should really be at the top of this list as I can’t figure it out for myself by reading the code.

License

I get that databases are big money. I know Oracle owns MySQL now. It doesn’t matter. Rails, Django, Pylons, Wicket, insert favorite SQL based web framework here would not have existed without a good enough SQL database like MySQL or Postgres. A semantic web framework will be hobbled by only high cost commercial backends. A commercial database means that we can’t contribute fixes even if we want to.

Conclusion

I’ll deal, and do deal with a lack of most of these. But each time one of these features is missing it’s harder and harder for me to sell the idea of using RDF and semantic databases to management. The benefits of using RDF do in fact make up for missing tons of these features, but if we want RDF to accepted as a model for day to day development on a par with SQL databases these need to be addressed.