Books, Ontologies and Shared Dictionaries

Recently, a Slashdot poster suggested the need for a free, publicly accessible book database. The idea was a great one: individuals or smaller libraries could create a library catalog with a simple swipe of a handheld scanner. As it stands now, this information is stored in proprietary databases like Books in Print.

Leave aside for the moment the matter of whether a book database is adequate for a world where a large fraction of the reading material resides on websites and in ebooks. Leave aside the question of whether publicly available PDFs deserve ISBNs, or whether the convergence of media suggests the need for a more comprehensive database.

The real need, it seems to me, is a public/nonprofit database which provides a way of linking cultural artifacts (books, CDs, etc.) with commentary by users, readers or listeners. At the moment, two entities exist to capture people’s commentaries on books: the commercial Amazon posting system and the anarchic newsgroups.

The problem with Amazon’s ownership of reader comments is obvious: the comments become the intellectual property of the company that produced the user commentary system. It’s only a matter of time before hackers figure out a way to stream these comments to their own sites, and I suspect that Amazon (and other commercial websites) will license these databases of user comments to other media companies. Perhaps (hopefully) Amazon will turn them over to a content aggregator or search engine service like Google.

Last year, Bezos was reported to have suggested restricting access to reader comments, so that only “subscribers” or customers would be able to read them. Perhaps this idea was never seriously pursued, but it illustrates the pressures that corporations face to create new revenue streams. In this case, reader comments served as promotional material for the product, but on other sites, the commentaries in and of themselves possess value.

Contrast that with the distributed network of news servers that constitutes Usenet. After Google took ownership of Deja’s archived newsgroups, Google essentially became “Usenet” for most people (except for certain areas like alt.binaries). And Google’s ability to guess at a site’s popularity and importance means that it’s becoming easier to search for certain things. Proper nouns, for example, or phrases within quotation marks like “shall I compare thee to a summer’s day” are relatively easy to find. But try searching for Kafka’s “The Judgment,” and you are treated to thousands of random entries. Surely you can find noteworthy resources (such as this or that), but it is just as easy to miss certain essays and commentaries, especially if the keywords are hard to search by or if the subject is extremely popular. Success really depends on finding a fan who has made a fan site with links to his or her favorite essays. But if you couldn’t find other commentaries so easily, why would the creator of that site be able to?

There needs to be a way to structure searches into different kinds of texts, organized by length, date, popularity and by the content itself. Up to now, relational databases have been doing that, serving as the backend of websites and caching millions of unstructured pages.

A partial solution has emerged. A group called Dublin Core has devised a standard way of typing content in HTML or XML. That gives search engines some help in being able to do more sophisticated searches or to make more intelligent guesses about content.
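To make that concrete, here is a minimal sketch in Python of how a crawler might pick typed metadata out of a page. It assumes the page carries Dublin Core fields as `<meta name="DC.*">` elements; the sample page, its values, and the names `DublinCoreParser` and `extract_dublin_core` are invented for illustration.

```python
from html.parser import HTMLParser

# Hypothetical page. The DC.* names follow the Dublin Core convention for
# embedding metadata in HTML; the values are invented for this example.
SAMPLE_PAGE = """
<html><head>
<title>Notes on Kafka</title>
<meta name="DC.title" content="Notes on The Judgment">
<meta name="DC.creator" content="A. Reader">
<meta name="DC.type" content="Text">
<meta name="DC.date" content="2002-05-01">
<meta name="keywords" content="kafka, judgment">
</head><body><p>...</p></body></html>
"""


class DublinCoreParser(HTMLParser):
    """Collect <meta name="DC.*"> elements into a dictionary."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name") or ""
        # Keep only Dublin Core fields; ordinary keywords are ignored.
        if name.lower().startswith("dc."):
            self.fields[name[3:].lower()] = attr.get("content", "")


def extract_dublin_core(html_text):
    """Return the Dublin Core fields found in an HTML document."""
    parser = DublinCoreParser()
    parser.feed(html_text)
    return parser.fields


dc = extract_dublin_core(SAMPLE_PAGE)
print(dc)
```

With fields like these, a search engine no longer has to guess who wrote a page or what kind of document it is; it can filter on `creator` or `type` directly.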

That seems to work with static content. But what about comments on a bulletin board, where the document itself is merely a rendering of an SQL query? It is essentially impossible to do a search of all comments on every bulletin board about Kafka’s “The Judgment.” For one thing, pages showing results of database queries are often not visible to search engines. (If you search for “Robert Nagle” Slashdot, you will in fact not find any of my 30-some-odd posts.) Second, a search engine has no way of knowing that a bulletin board comment is in fact a comment or that it was written by “Robert Nagle.” It can only dump articles containing those keywords and use algorithms to determine which document would probably be the most interesting to human readers.

(Now it’s probably true that a good database could spit out XML tags with appropriate namespaces or meta tags for the result of the SQL query. But if each query required separate tagging, it is easy to imagine how unwieldy documents could become and how confusing the job of coding would be.)
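As a rough illustration of the idea in that parenthesis, the sketch below runs an SQL query against an in-memory SQLite database and wraps each row in namespaced XML. Everything here is an assumption made for the example: the `comments` table, its columns and rows, the `comment` and `body` element names, and the function `comments_as_xml`. Only the Dublin Core element namespace is a real identifier.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def comments_as_xml(conn, subject):
    """Render the comments on one subject as XML with Dublin Core tags."""
    rows = conn.execute(
        "SELECT author, subject, body FROM comments "
        "WHERE subject = ? ORDER BY id",
        (subject,),
    ).fetchall()
    root = ET.Element("comments")  # hypothetical wrapper element
    for author, subj, body in rows:
        item = ET.SubElement(root, "comment")
        ET.SubElement(item, f"{{{DC_NS}}}creator").text = author
        ET.SubElement(item, f"{{{DC_NS}}}subject").text = subj
        ET.SubElement(item, "body").text = body
    return ET.tostring(root, encoding="unicode")


# Hypothetical bulletin-board table with invented rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments "
    "(id INTEGER PRIMARY KEY, author TEXT, subject TEXT, body TEXT)"
)
conn.executemany(
    "INSERT INTO comments (author, subject, body) VALUES (?, ?, ?)",
    [
        ("Robert Nagle", "The Judgment", "The ending still puzzles me."),
        ("A. Reader", "The Judgment", "Compare it with The Metamorphosis."),
        ("A. Reader", "The Trial", "A very different book."),
    ],
)

xml_text = comments_as_xml(conn, "The Judgment")
print(xml_text)
```

Even in this toy case the tagging logic is tied to one query and one table layout, which hints at how quickly per-query tagging could become unwieldy.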

With the advent of more structured documents (i.e., XML), Tim Berners-Lee’s call for a “semantic web” seems less a fantasy than before. If sites using different dialects of XML can declare shared meanings (or ontologies), it doesn’t seem too far-fetched to construct “intelligent agents” that incorporate more sophisticated winnowing methods than current engines allow. Assuming that the data is structured to begin with (à la XML), properly tagged data can be easily located. But that requires people (or applications) to start tagging their data according to established ontologies. Databases (presumably hierarchical ones) would need to store these tags and manipulate them easily.

From Semantic Web Activity: Different Layers of Meaning.

But who would store the semantics of the content, and where would the actual “translation” from one XML dialect to another occur? Perhaps an “ontological dictionary” can allow other dictionaries to be understood as well. If tags in AutosXML share common meanings with tags in CinemaXML, and if CinemaXML tags share common meanings with BreakfastcerealsXML tags, then perhaps AutosXML can use information from a BreakfastcerealsXML document. That’s the hope anyway.
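One way to picture such a dictionary is as a graph of pairwise tag equivalences that an agent walks to get from one dialect to another. The Python sketch below does this with a breadth-first search. The dialect names come from the paragraph above, but the individual tags, the equivalences between them, and the functions `build_graph` and `translate` are hypothetical.

```python
from collections import deque

# Hypothetical pairwise equivalences, each a pair of (dialect, tag) entries.
EQUIVALENCES = [
    (("AutosXML", "manufacturer"), ("CinemaXML", "studio")),
    (("CinemaXML", "studio"), ("BreakfastcerealsXML", "brand")),
    (("AutosXML", "model"), ("CinemaXML", "title")),
]


def build_graph(pairs):
    """Build an undirected graph in which each equivalence is an edge."""
    graph = {}
    for a, b in pairs:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def translate(graph, source, target_dialect):
    """Return the shortest chain of equivalent tags from source to any tag
    in target_dialect, or None if no chain exists."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        dialect, _tag = path[-1]
        if dialect == target_dialect:
            return path
        # Sorted so that results are deterministic.
        for neighbor in sorted(graph.get(path[-1], ())):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


graph = build_graph(EQUIVALENCES)
chain = translate(graph, ("AutosXML", "manufacturer"), "BreakfastcerealsXML")
print(chain)
```

Here `manufacturer` reaches `brand` only by passing through `studio`, while `model` has no route to the cereal dialect at all. The sketch also treats every equivalence as exact, which real vocabularies rarely are; each hop in a chain can blur the meaning a little further.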

If everything goes as planned, content providers will type their data according to a well-established XML dialect. This will not be done manually, but probably at the application layer (and probably invisible to HTML browsers).

But where would these “ontological dictionaries” reside, and who would be responsible for updating them? Which “ontological dictionary” should the web surfer use, and who ultimately determines which dictionary is to be used? And if the maker of a website decides to modify or extend the ontological equivalence of two tags, how would the dictionary be updated? And realistically, won’t corporations and publishers favor XML dialects with fewer constraints and less granularity of meaning? That is the promise and the peril of XML.

Update: Here’s an article written after mine: If Ontology, Then Knowledge: Catching Up With WebOnt by Kendall Grant Clark






