Introduction to the Semantic Web (CityGML+ARML Workshop)

Preface

Because the CityGML+ARML Workshop has been compressed from two days down to one, we will have to cover several topics very quickly. One of those topics is the Semantic Web. There is no time in a six-hour Workshop to present a thorough "Introduction to the Semantic Web, Linked Open Data, RDF, OWL, and SPARQL." Likewise, this blog post will not be a complete Semantic Web Primer. Instead, I will document the concepts that I will discuss and demonstrate in the first session, so that Workshop attendees can review them later at a more leisurely pace.

Introduction

The hypothesis that we will explore in the Workshop is that CityGML can be used not only as data, but also as an index to external data sets. This can be accomplished by embedding links to external semantic data sets in the XML. These links are gracefully ignored by CityGML applications, but can be explored using the query languages XQuery and SPARQL. By doing so, CityGML becomes a visually browsable, 3D, "Universal Index of the Physical World" (TM).

In order to understand why we chose to embed semantic links, rather than some other technology like XML, SOAP, or relational databases, it is helpful to be familiar with the features and characteristics of semantic technology. Unfortunately, we only have time to touch briefly upon these concepts, but the reader is encouraged to explore further.

What we mean by the "Semantic" Web

What makes the Semantic Web "semantic"? Semantic implies meaning. The goal of the Semantic Web is to make information discoverable and understandable by computers. A Semantic Web supports artificial intelligence (inductive and deductive reasoning by computers). The best way to show what semantic means is to look at something that is not semantic. XML is not semantic. For instance, consider the following XML snippet (I switched angle brackets for parentheses so it will render in the blog):


(Name)George Washington(/Name)

Seems straightforward, right? Wrong! Does the element "Name" mean the name of a President? Of a University? Of a bridge? Of an owner? Of an author? Of an ancestor? Of a descendant? Of a decedent? Of a party in interest? Of a book? Of a movie? Of a song? Of a racehorse? Of a boat? Of a hotel? Of a hotel conference room?

In fact, to a computer, the above snippet just looks like this:

(||||)George Washington(/||||)

In other words, the tag "Name" means absolutely nothing to the computer. It is just a string (||||).

XML is not a data modeling language. It is a data exchange language. It works best when the sender and the receiver have agreed beforehand on the message format. For this purpose, an XML Schema is used. XML can basically only define relationships through containment (of one element within another). In that way, we can figure out what the element "Name" means:


(Owner)
(Name)George Washington(/Name)
(/Owner)

versus

(President)
(Name)George Washington(/Name)
(/President)

versus

(Author)
(Name)George Washington(/Name)
(/Author)

etc...
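The three fragments above can be checked with a few lines of Python. This is only a minimal sketch using the standard library's ElementTree parser; the helper name describe is made up for illustration. It shows that the parser hands back "Name" as an opaque string, and that the only context available is the tag of the enclosing element:

```python
import xml.etree.ElementTree as ET

# Three documents that differ only in the element wrapped around Name.
documents = [
    "<Owner><Name>George Washington</Name></Owner>",
    "<President><Name>George Washington</Name></President>",
    "<Author><Name>George Washington</Name></Author>",
]

def describe(xml_text):
    """Return (parent tag, child tag, text) for the Name element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    name = root.find("Name")
    return (root.tag, name.tag, name.text)

for doc in documents:
    # The child tag and text are identical each time; only the parent differs.
    print(describe(doc))
```

Nothing in the parsed result says what an Owner, a President, or an Author is. That knowledge still has to live in the application or in a schema agreed upon ahead of time.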

When an XML file (or an XML schema) contains a lot of information or contains subtle relationships, it results in an enormous amount of nesting of elements (containment), and the files become huge, unwieldy, and no longer human readable. Furthermore, so many elements become optional and nillable, that the schema ceases to function as a consistently enforceable grammar. This is a common criticism of the National Information Exchange Model (NIEM). XML has other failings as a data modeling language, too. For one thing, it is very verbose. Both the data and the metadata (the element tag names) appear in the same document. In fact, the metadata appears twice in the same document (opening and closing tags)! But probably the biggest problem with XML is that it is not agile. Sender and receiver must agree on the same schema, which makes applications that rely on them very brittle, and pegs them down to release cycles.

Resource Description Framework (RDF)

The Resource Description Framework (RDF) was invented in 1999 to address these shortcomings. The hyperlink might have been the invention that made the World Wide Web "take off," but the technology that Tim Berners-Lee intended to form the backbone of the Internet was RDF (see note). That is because RDF is semantic; it works well in the distributed, networked World Wide Web environment; it is agile; and it is robust.

The foundation of RDF is the Triple. A triple is a statement that contains three parts: a Subject, a Predicate, and an Object. The Subject and Object are nodes representing concepts (usually nouns), and the predicate is the edge representing the relationship between them (usually a verb). For instance:


(Alex) ----(likes)----> (Donuts)
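To make the idea concrete, here is a minimal sketch in plain Python (no RDF library) that stores triples as three-part tuples and finds them by pattern. The ex: names other than Alex, likes, and Donuts are invented for illustration, and match is a hypothetical helper, not part of any RDF toolkit:

```python
# A triple is just (subject, predicate, object).
triples = {
    ("ex:Alex", "ex:likes", "ex:Donuts"),
    ("ex:Alex", "ex:likes", "ex:Coffee"),   # invented for illustration
    ("ex:Sam", "ex:likes", "ex:Donuts"),    # invented for illustration
}

def match(graph, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )

# Who likes donuts?
print([s for s, _, _ in match(triples, p="ex:likes", o="ex:Donuts")])
# prints ['ex:Alex', 'ex:Sam']
```

Asking a question is a matter of leaving one or more of the three positions open, which is essentially what a query language over triples does.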

RDF can be serialized (written down) in a number of formats, including XML, JSON, and plain text. A collection of triples defining a set of concepts and the relationships between them is called an ontology. Two particular RDF ontologies, RDF Schema and OWL, create the familiar concepts of classes and properties, and other important reasoning concepts, such as symmetry, transitivity, exclusivity, cardinality, intersection, union, etc.

RDF has its own query language, called SPARQL. SPARQL looks a lot like SQL, but with very important differences. First of all, SPARQL is RESTful. A query is transmitted as simple text (typically GET or POST) over HTTP, and responses come back as simple text (XML or JSON) over HTTP. Secondly, it operates on RDF triples, which are inherently agile and open. There are no primary-foreign key relationships known only to a DBA. Radical conceptual changes can be accomplished by merely inserting or deleting a single triple. For instance, adding the single triple


ex:Dog rdfs:subClassOf ex:Mammal

will automatically make all Dogs inherit properties from Mammal. Deleting the single triple will remove the inheritance. You do not need a DBA to define Dog/Mammal table relationships, add a dog_fl column to the Mammal table, etc. Object-oriented middleware code does not need to be modified to say "Class Dog extends Mammal," etc.
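The effect of that one triple can be sketched in a few lines of plain Python. This is a toy illustration, not a real reasoner: types_of is a hypothetical helper, ex:Rex is an invented individual, and the subclass relationship is written with the standard RDF Schema property rdfs:subClassOf.

```python
def types_of(graph, individual):
    """Return every class of `individual`, following rdfs:subClassOf upward."""
    found = set()
    frontier = [o for s, p, o in graph if s == individual and p == "rdf:type"]
    while frontier:
        cls = frontier.pop()
        if cls in found:
            continue  # already visited; also guards against cycles
        found.add(cls)
        frontier.extend(
            o for s, p, o in graph if s == cls and p == "rdfs:subClassOf"
        )
    return found

graph = {("ex:Rex", "rdf:type", "ex:Dog")}           # ex:Rex is invented
subclass = ("ex:Dog", "rdfs:subClassOf", "ex:Mammal")

print(sorted(types_of(graph, "ex:Rex")))  # only ex:Dog
graph.add(subclass)
print(sorted(types_of(graph, "ex:Rex")))  # ex:Dog and ex:Mammal
graph.discard(subclass)
print(sorted(types_of(graph, "ex:Rex")))  # back to only ex:Dog
```

No table, column, or class definition changed between the three calls. The only thing that changed was the presence of one triple in the data.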

Furthermore, RDF deals extremely well with polymorphism, inheritance, "mixins," and one-offs. For instance, there is no problem with a Dog being both an Animal and a Pet, and there is no problem with a particular dog being "just plain mean." You don't have to add a "meanness" property to the Dog class before you can add a triple saying that a particular dog is mean. There is absolutely no room to discuss all the subtleties of ontological engineering here.

Uniform Resource Identifiers

RDF uses Uniform Resource Identifiers (URIs) to unambiguously identify (reference) things. For instance, alex.karman@rev-mac.com unambiguously references me. It uniformly points to me. Notice it does not uniquely point to me -- I have other email addresses, too. But it uniformly points to me. Going back to our ambiguous "George Washington" example, for instance, the following Wikipedia articles are easily disambiguated:

In fact, there is an entire semantic database, DBpedia, that mirrors Wikipedia. For every human-readable HTML page, there is a computer-readable RDF page. For instance, our Courthouse Square facility is in Leesburg. But which Leesburg? There is one in Virginia, one in Florida, and one in Ohio. Each Leesburg has its own Wikipedia article. Well, our Leesburg is the one in Virginia. That can be disambiguated with two URLs, one for humans and one for machines:

The second URL returns RDF triples. The URI for Leesburg, VA is the string "http://dbpedia.org/resource/Leesburg,_Virginia" . But for geographic locations, there is an even better source of URIs, called GeoNames, which has over 5 million disambiguated and georeferenced place names. Visit http://sws.geonames.org/4769125 (human readable) or http://sws.geonames.org/4769125/about.rdf (machine readable).
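One common way for a program to ask for the machine-readable version of a URI is HTTP content negotiation: it sends the same URI with an Accept header naming an RDF media type. The sketch below only builds such a request with Python's standard library and does not send it. The helper name rdf_request is made up, and whether a given server honors the header is an assumption to verify against that server:

```python
import urllib.request

def rdf_request(uri, media_type="text/turtle"):
    """Build (but do not send) an HTTP GET that asks for an RDF serialization."""
    return urllib.request.Request(uri, headers={"Accept": media_type})

req = rdf_request("http://dbpedia.org/resource/Leesburg,_Virginia")
print(req.get_method(), req.full_url, req.get_header("Accept"))

# To actually fetch the triples (needs network access, and assumes the
# server supports content negotiation for this media type):
# with urllib.request.urlopen(req) as response:
#     print(response.read().decode("utf-8")[:500])
```

The point is that the identifier stays the same for people and for machines. Only the requested representation differs.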

Linked Open Data

Because RDF and SPARQL are hypermedia (they are RESTful and operate over ubiquitous, robust HTTP), a movement, called Linked Open Data (LOD), has arisen to make most resources on the World Wide Web, and especially Government data, available as RDF rather than as traditional unstructured text. Linked Open Data has four principles:

  • Use URIs for the names of things.
  • Use HTTP URIs, so people or machines can look up those names.
  • Provide useful information there, using RDF.
  • Include links to other URIs, so people or machines can visit more things.

LOD really started taking off after W3C formed the Semantic Web Education and Outreach Interest Group (SWEO) in 2007. Today, there are more than 300 LOD data sets, containing over 31 billion RDF triples and roughly 500 million links between data sets. There is a visual representation of the "LOD Cloud" here.

Other important LOD sets include the following:

RDF databases with "SPARQL Endpoints" include Parliament, Virtuoso, Oracle, MarkLogic and AllegroGraph. Google has adopted RDF in a big way, with the Google Knowledge Graph API, the adoption of JSON-LD for Google Mail, and with Rich Snippets.
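Talking to a SPARQL endpoint is plain HTTP. As a minimal sketch, the query text is URL-encoded into a "query" parameter and sent with GET. The code below only builds the URL and does not contact any server. The endpoint address is DBpedia's public one and is assumed to be available, the query itself is just an illustration, and sparql_get_url is a hypothetical helper:

```python
from urllib.parse import urlencode

ENDPOINT = "https://dbpedia.org/sparql"  # assumed public endpoint

# Illustrative query: list a few properties of Leesburg, Virginia.
QUERY = """
SELECT ?property ?value WHERE {
  <http://dbpedia.org/resource/Leesburg,_Virginia> ?property ?value .
} LIMIT 10
"""

def sparql_get_url(endpoint, query):
    """Encode a SPARQL query as the `query` parameter of an HTTP GET URL."""
    return endpoint + "?" + urlencode({"query": query.strip()})

url = sparql_get_url(ENDPOINT, QUERY)
print(url)

# A client would then issue a GET on this URL, typically with the header
# "Accept: application/sparql-results+json" to ask for JSON results.
```

Because the whole request is a URL, it can be pasted into a browser, bookmarked, or issued from any language that can make an HTTP call.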

Google Rich Snippets Example

Many websites include embedded RDF in their source code. Called RDFa (for "RDF in Attributes"), these references are defined as attributes on "span" and "div" tags. They are gracefully ignored by browsers, but Google Snippets picks them up. For example, open Google and search on the following:


movies at tysons corner

Do you see the well-formatted table in the search results with the movie names and showtimes? That is a "Google Snippet." How did Google do that? It read the RDFa (and other microdata) in the AMC Theatres website, that's how! Go to http://www.w3.org/2012/pyRdfa/Overview.html and enter the URL https://www.amctheatres.com/movie-theatres/amc-tysons-corner-16 to see for yourself. The website found the following RDFa embedded in the AMC Theatres source code (angle brackets changed to parentheses):

@prefix fb: (http://ogp.me/ns/fb#) .
@prefix ns1: (business:contact_data:) .
@prefix ns2: (place:location:) .
@prefix og: (http://ogp.me/ns#) .

(https://www.amctheatres.com/movie-theatres/amc-tysons-corner-16) ns1:country_name "USA"@en;
ns1:locality "Mclean"@en;
ns1:phone_number "703-734-6212"@en;
ns1:postal_code "22102"@en;
ns1:region "VA"@en;
ns1:street_address "7850e Tysons Corner Ctr"@en;
og:description "Movie times, online tickets and directions to AMC Tysons Corner 16 in Mclean, VA. Find everything you need for your local movie theater."@en;
og:image "https://cdn.amctheatres.com/theatres/images/Primary/Large/366.jpg"@en;
og:title "AMC Tysons Corner 16"@en;
og:type "amctheatreswebsite:theatre"@en;
og:url "http://www.amctheatres.com/movie-theatres/amc-tysons-corner-16"@en;
fb:app_id "229843697087461"@en;
ns2:latitude "38.9173280000"@en;
ns2:longitude "-77.2200090000"@en .

You can also view the source code in the browser and search for another, non-RDF-compliant microdata format, the Movie Theater format from schema.org.

Back to Our CityGML+ARML Workshop Demonstration

Procedure

We want to demonstrate the use of CityGML as an index. To do that, we will perform the following tasks:

  1. Embed a link in the CityGML document to an external RDF resource, using XLink.
  2. Search for and retrieve that link in the CityGML document, using XQuery.
  3. Query the external RDF resource, using SPARQL.
  4. Parse the JSON result set, using JavaScript.
  5. Render the result, using OpenLayers.
  6. Get more dynamic, using the Parliament semantic database and Python.
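As a rough preview of steps 1, 2, and 4, here is a minimal Python sketch. It is not the Workshop's implementation: it swaps in Python's ElementTree for XQuery and Python's json module for JavaScript. The XML fragment is a heavily simplified stand-in for CityGML with a made-up relatedResource element, and the JSON is a hand-written sample in the standard SPARQL JSON results layout rather than real query output.

```python
import json
import xml.etree.ElementTree as ET

# Fully qualified name of the xlink:href attribute.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical, simplified CityGML-like fragment (element names are
# illustrative only, not taken from the CityGML schema).
CITY_XML = """
<CityModel xmlns:xlink="http://www.w3.org/1999/xlink">
  <Building id="courthouse-square">
    <relatedResource xlink:href="http://dbpedia.org/resource/Leesburg,_Virginia"/>
  </Building>
</CityModel>
"""

def find_links(xml_text):
    """Return every xlink:href value in the document, in document order."""
    root = ET.fromstring(xml_text.strip())
    return [el.attrib[XLINK_HREF] for el in root.iter() if XLINK_HREF in el.attrib]

# Hand-written sample in the SPARQL JSON results layout (not real output).
SAMPLE_RESULT = """
{"head": {"vars": ["label"]},
 "results": {"bindings": [
   {"label": {"type": "literal", "xml:lang": "en", "value": "Leesburg, Virginia"}}
 ]}}
"""

def binding_values(result_text, variable):
    """Pull the plain values bound to one variable out of a result set."""
    data = json.loads(result_text)
    return [
        row[variable]["value"]
        for row in data["results"]["bindings"]
        if variable in row
    ]

print(find_links(CITY_XML))
print(binding_values(SAMPLE_RESULT, "label"))
```

The link found in the first step is what would be handed to a SPARQL endpoint in step 3, and the values extracted in the last step are what a map library would render in step 5.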

To save us from having to scroll so much, I will continue this in Part 2.