As the last post indicated, I’m part of a team at loc.gov working on an application that serves up page views like this for historic newspapers, almost a million of them in fact. For each page view there is another URL for a view of the OCR text gleaned from that image, such as this. Yeah, kind of yuckster at the moment, but we’re working on that.
Perhaps it’s obvious, but the goal of making the OCR HTML view available is so that search engine crawlers can come and index it. Then when someone searches Google for a person’s name, say Dr. Herbert D. Burnham, they’ll land on page 3 of the 08/25/1901 issue of the New York Tribune. And this can happen without the searcher needing to know anything about the Chronicling America project beforehand. Classic SEO…
Imagine you were minting close to a million URIs for historic newspaper pages such as:
http://chroniclingamerica.loc.gov/lccn/sn85066387/1898-01-01/ed-1/seq-1/
for pages like:
The web page allows you to zoom in quite close and see lots of detail in the page:
Now let’s say I want to describe this Newspaper Page in RDF. I need to decide what subject URI to hang the description off of. Should I consider this Newspaper Page resource an information resource, or a real world resource? The answer to this question determines whether I can hang my description of the page off the above URI, for example:
<http://chroniclingamerica.loc.gov/lccn/sn85066387/1898-01-01/ed-1/seq-1/> dcterms:issued "1898-01-01"^^<http://www.w3.org/2001/XMLSchema#date> .
Or if I need to mint a new URI for the page as a real world thing:
<http://chroniclingamerica.loc.gov/lccn/sn85066387/1898-01-01/ed-1/seq-1#page> dcterms:issued "1898-01-01"^^<http://www.w3.org/2001/XMLSchema#date> .
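One practical consequence of the hash URI option is worth spelling out: a client strips the fragment before it makes an HTTP request, so the #page URI is still fetched from the fragment-less document URI, and the fragment only distinguishes the name. Here is a minimal sketch of that split using Python's standard urllib.parse.urldefrag; the variable names are mine, purely for illustration.

```python
from urllib.parse import urldefrag

# Hash URI from the example above, naming the page as a real world thing.
page_uri = "http://chroniclingamerica.loc.gov/lccn/sn85066387/1898-01-01/ed-1/seq-1#page"

# urldefrag splits a URI into the part a client sends to the server
# and the fragment, which stays on the client side.
document_uri, fragment = urldefrag(page_uri)

print(document_uri)  # the URI that actually gets requested
print(fragment)      # the local name within that document
```

So the two candidate subjects are different names, even though only one of them is ever dereferenced directly.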
AWWW 1 provides some guidance:
It’s been great to see RDFa being picked up by Web 2.0 publishers like Digg and MySpace. You can use the RDFa Distiller to extract the RDFa from a given web page u by constructing a URI like:
http://www.w3.org/2007/08/pyRdfa/extract?format=turtle&uri=u
Which translates kind of nicely into a command line utility to add to your ~/bin:
#!/bin/sh
curl "http://www.w3.org/2007/08/pyRdfa/extract?format=turtle&uri=$1"
So with that little shell script in hand I can now look at the RDFa in something like Yo La Tengo’s page on MySpace:
ed@rorty:~$ rdfa http://www.myspace.com/yolatengo
@prefix myspace: <http://x.myspacecdn.com/modules/sitesearch/static/rdf/profileschema.rdf#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xhv: <http://www.w3.org/1999/xhtml/vocab#> .
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://www.myspace.com/YO LA TENGO> a myspace:MusicProfile ;
    myspace:profileType "Music" .

<http://www.myspace.com/yolatengo> xhv:stylesheet <http://x.myspacecdn.com/modules/common/static/css/global_j03fjftp.css>,
        <http://x.myspacecdn.com/modules/common/static/css/header/profileheader008.css>,
        <http://x.myspacecdn.com/modules/common/static/css/myspace_jvtnwmp4.css>,
        <http://x.myspacecdn.com/modules/common/static/css/profile_adl4r-y8.css>,
        <http://x.myspacecdn.com/modules/profiles/static/css/musicv2_wo4zzzd-.css> ;
    myspace:addToFriends <http://friends.myspace.com/index.cfm?fuseaction=invite.addfriend_verify&friendID=91362837> ;
    myspace:friendCount "33993" ;
    myspace:headline "\"<b>YO LA TENGO IS MURDERING THE CLASSICS</b>\""^^rdf:XMLLiteral ;
    myspace:photo <http://viewmorepics.myspace.com/index.cfm?fuseaction=user.viewAlbums&friendID=91362837> ;
    myspace:sendMessage <http://messaging.myspace.com/index.cfm?fuseaction=mail.message&friendID=91362837&MyToken=62964687-f06b-4b8b-8227-ba97f133a029> ;
    myspace:viewPictures <http://viewmorepics.myspace.com/index.cfm?fuseaction=user.viewAlbums&friendID=91362837> .
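One caveat with the shell script: it drops the page URI into the query string as-is, so a page URI that carries its own ? or & would get mixed up with the distiller's parameters. Below is a rough sketch of building the same distiller URI with the page URI percent-encoded. The function name distiller_url and the fmt parameter are my own, and I'm assuming the distiller decodes the uri parameter the way any ordinary query string is decoded.

```python
from urllib.parse import urlencode

# Distiller endpoint, taken from the URI pattern shown earlier.
DISTILLER = "http://www.w3.org/2007/08/pyRdfa/extract"


def distiller_url(page_uri, fmt="turtle"):
    """Build the distiller URI for page_uri (hypothetical helper).

    urlencode percent-encodes the page URI, so any '?' or '&' inside it
    cannot be mistaken for the distiller's own query parameters.
    """
    return DISTILLER + "?" + urlencode({"format": fmt, "uri": page_uri})


if __name__ == "__main__":
    print(distiller_url("http://www.myspace.com/yolatengo"))
```

For a simple URI like the Yo La Tengo page the unencoded form happens to work, which is why the two-line script gets away with it.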