Some of you may know that I’m the lead (read: only) developer for a semantic-web knowledge management platform called JengaNote. Its central purpose is to enable students to manage all the information they consume day-to-day so that they can learn faster, better and with a deeper understanding.
One of the central problems of the system is presenting models and interfaces for note organization. There are essentially two competing priorities: information normalization and flexibility. We’ve selected RDF as the method of information exchange because we believe it to be a compromise between these two.
There are several people who make a case against the very notion of an ontology within collaborative systems - David Weinberger is perhaps the most articulate amongst them. They state that it’s simply impossible to construct an ontology that will satisfy all conditions under which data would be used. Moreover, the effort to construct, maintain and enforce this ontology would grow with the total content but fall on a small group of maintainers, rather than being part and parcel of contributing content.
They suggest that the (now typical) method of folksonomy is likely the only sustainable way to maintain information meta-data, and that any other scheme for ontological construction (unless intuitively obvious to all users - as in FOAF or vCard) would simply fall apart or be so under-enforced as to be irrelevant.
A folksonomy, for the uninitiated, is the collection of keywords associated with a body of content. Most people, however, view it inversely - as the keywords associated with a given piece of content, aka “tags”. As far as I know, Del.icio.us was the first to use a folksonomy as the backbone of information organization - Flickr followed suit soon after. Now most information systems offer some sort of tag-cloud-like interface - including WordPress (the software powering this blog).
The idea is that any determination of a content item’s categorical membership is a probabilistic inference from the aggregate keywords for that piece of content. Usually, any user can add a tag that doesn’t yet exist. In some systems users can “bump” the relevance rating for a keyword - allowing the applicability of each keyword association to vary.
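To make the aggregation idea concrete, here is a minimal sketch in Python. It assumes each added tag or “bump” counts as one vote, and it normalizes the votes into a probability-like weight per tag. The function name, the vote model and the sample tags are my own illustration, not how any particular site actually does it.

```python
from collections import Counter

def tag_weights(votes):
    """Normalize raw (tag -> vote count) tallies so the weights sum to 1."""
    total = sum(votes.values())
    if total == 0:
        return {}
    return {tag: count / total for tag, count in votes.items()}

# Hypothetical tallies for a single photo: every user who adds
# or "bumps" a tag contributes one vote for it.
photo_votes = Counter()
for tag in ["steve jobs", "drawing", "steve jobs", "apple", "steve jobs"]:
    photo_votes[tag] += 1

weights = tag_weights(photo_votes)

# The most probable category for this item under the assumed vote model.
best_guess = max(weights, key=weights.get)
```

Nothing here knows what any tag means; the “inference” is purely a matter of counting.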
This system is particularly useful if keywords are the method of information retrieval (as in search systems) and either the content itself isn’t easily parsed for keywords (as in photos or video) or the domain’s vocabulary is naturally constrained (blogs).
Ontologies, however, aren’t merely for single-item content retrieval (like search). They are also useful for content association - items that share the same classification are similar by definition - and this isn’t necessarily true of a folksonomy. In a folksonomy this is usually handled by combining multiple tags, like “steve jobs” and “drawing”, to get drawings of Steve Jobs rather than photographs.
With tagging there is no implied semantic relationship between the tag and the object being tagged. Moreover, tags are (usually) constrained to simple literal text phrases. These two things severely hamper a vast array of information retrieval tasks that extend beyond search - which is a crutch the internet needs to cast aside as soon as possible.
Firstly, tagging is language-specific - English tags are only useful to English speakers. One could translate the individual keywords, but the lack of context (among other obstacles) often makes this impossible to automate - requiring humans either to translate the text or simply to recreate the tag cloud in each language; neither is a desirable outcome. Ontologies have no such limitation since they sort things semantically - that is, based on meaning rather than expression.
Secondly, tags have a unary relationship with the content. Either a piece of content has a tag or it does not - that is all that can be said confidently about the relationship between a tag and its content. Relevance is determined by the improbability of that tag for any random article of content and, for systems that implement voting, the number of people who “voted” that tag as relevant. Thus the meaning of the tag word is lost, as is the nature of its association with the content. There’s no difference between, say, a photo OF Steve Jobs and a photo BY Steve Jobs if they’re both tagged with “Steve Jobs”. Nor, for example, is there any way to know whether a picture marked “JPEG” is in that file format or was taken at some sort of conference for the Joint Photographic Experts Group.
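Here is a small sketch of the OF-versus-BY problem, contrasting bare tags with RDF-style (subject, predicate, object) statements. It uses plain Python tuples rather than a real RDF library, and the predicate names (`depicts`, `creator`, `fileFormat`, `takenAtEvent`) are made up for illustration rather than taken from any published vocabulary.

```python
# Tag view: both photos carry exactly the same bare keywords.
tags = {
    "photo1.jpg": {"Steve Jobs", "JPEG"},
    "photo2.jpg": {"Steve Jobs", "JPEG"},
}

# Triple view: the predicate says HOW the subject relates to the object.
# All predicate names below are hypothetical.
triples = {
    ("photo1.jpg", "depicts", "Steve Jobs"),
    ("photo2.jpg", "creator", "Steve Jobs"),
    ("photo1.jpg", "fileFormat", "JPEG"),
    ("photo2.jpg", "takenAtEvent", "JPEG"),
}

def subjects(predicate, obj):
    """Return every subject that has the given predicate/object pair."""
    return sorted(s for s, p, o in triples if p == predicate and o == obj)

# With tags, the two photos are indistinguishable.
tagged_jobs = sorted(name for name, t in tags.items() if "Steve Jobs" in t)

# With triples, a machine can tell them apart.
photos_of_jobs = subjects("depicts", "Steve Jobs")
photos_by_jobs = subjects("creator", "Steve Jobs")
```

The tag lookup returns both photos; the triple lookups return one each, because the relationship is stated rather than left for a human to guess.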
This second point is perhaps the most salient in light of the potential of Web 3.0. How can a machine meaningfully interpret these kinds of information?
This is not to say that the problems folksonomies seek to remedy aren’t very real. Given the complexity of ontological construction, elaboration and enforcement for even a highly constrained domain, it seems an intractable problem to construct a generalized solution that applies to virtually any domain scope and, perhaps more importantly, any contributor scale.
The answer to the problem is simple. You don’t try. Folksonomies have the singular advantage of taking content organization away from an institution and handing it back to contributors and, just as importantly, users. Folksonomies also have the advantage of adaptive organization rather than plenary information architecture - they only organize the content they have rather than attempting to organize all possible content. Content is tagged at its point of entry into the system, and the vocabulary of tags is totally unconstrained so that it can deal with new categories as they appear.
The obvious solution is to create a project similar to Wikipedia for ontologies, where domain experts and enthusiasts can generate a single definitive ontology, open for public consumption via web services for a variety of applications. It would use OWL for data storage and provide basic inference services and other niceties.
The issues with this wiki-ontology would be the same as those on Wikipedia, only amplified. Wikipedia suffers from facet divergence issues - that is, it’s difficult or impossible to integrate different perspectives on the same issue, especially if there are more than two or three such perspectives and they are mutually exclusive. The reality here is that only one of them is right - so Wikipedia appeals to democracy to define correctness. This is a serviceable solution for factual information, but not so useful for something entirely subjective such as an ontology (since an ontology is the formalization of notions of classification - a system constrained only by arbitrary self-definition - which (hopefully) regress to axioms).
Wikipedia, because it has the competing tasks of definitiveness and comprehensiveness, has established conventional “hacks” to solve the issue of disagreement - either by wedging controversy into appropriately titled article sections or simply using consensus as the version of reality they choose to represent.
A similar system could be implemented for this wiki-ontology - multiple versions of an ontology could exist in parallel and users could elect to use one ontology over another. Over time, outlying ontological constructs could be pruned and mutually compatible systems could be merged. Because we have a formalized semantic framework, the compatibility of ontological definitions is computable, so this process could be automated.
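As a rough illustration of what “computable compatibility” might look like, the sketch below models each ontology version as nothing more than a set of (subclass, superclass) pairs and calls two versions compatible if merging them never makes a class its own ancestor. That definition is a deliberate simplification of my own: a real OWL reasoner checks far more (disjointness, property restrictions and so on), and OWL itself would read mutual subclassing as equivalence rather than as an error. All class names are hypothetical.

```python
def ancestors(edges, start):
    """All classes reachable from `start` by following subclass-of edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for child, parent in edges:
            if child == node and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def compatible(a, b):
    """Assumed rule: the merged hierarchy must not contain a cycle."""
    merged = a | b
    classes = {c for edge in merged for c in edge}
    return all(c not in ancestors(merged, c) for c in classes)

def merge_if_compatible(a, b):
    """Return the merged ontology, or None if the versions conflict."""
    return a | b if compatible(a, b) else None

# Three hypothetical versions of a tiny media ontology.
v1 = {("Drawing", "Image"), ("Photo", "Image")}
v2 = {("Sketch", "Drawing")}   # extends v1 without conflict
v3 = {("Image", "Drawing")}    # inverts v1's hierarchy

merged_ok = merge_if_compatible(v1, v2)
merged_bad = merge_if_compatible(v1, v3)
```

Under this rule v1 and v2 merge cleanly, while v1 and v3 are flagged as divergent versions that would have to keep living in parallel until someone prunes one of them.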
As with Wikipedia, the proportion of people who contribute to the ontologies would be very small relative to the consumers. The key here is that firms that make use of the ontologies have a very strong incentive to commit to the project, because they would have influence over the ontological structures of other firms who make use of the system. Competing firms would do the same and, essentially, this wiki-ontology would become a platform for establishing industry-wide consensus/standard ontology systems - which is a win for everyone, especially the user, who can make use of mutually intelligible services.
This collaborative ontological system is compatible with existing “data standards” groups - which are really ontology groups in different clothes. Data Portability, for example, could extend its purview beyond the syntactic and tackle the semantic information tracked within various formats.