Semantic Web and Libraries
26th Library Systems Seminar
Rome, 17-19 April 2002
Ontologies have emerged in recent years as a new tool in the world of information science. This short paper is intended as an overview of the subject and aims to explain what an ontology is, how it can be used, and how it can be constructed. Our aim is to show the value of ontologies and to highlight the important role they can play in the field of information management, particularly with respect to the understanding and consolidation of semantically complex resources found in libraries, museums and archives.
What is an ontology?
The term ontology is derived from philosophy, where it has a variety of meanings. In the context of information science it has taken on a specific technical sense. Although there is a general consensus about what an ontology consists of and what it should do, some of the definitions provided by experts in the field are not, at first glance, easy to grasp:
"An ontology is an explicit specification of a conceptualization"
Thomas R. Gruber
"An ontology is a logical theory accounting for the intended meaning of a formal vocabulary"
"An ontology is a formal definition of a body of knowledge"
"An ontology is a catalog of the types of things that are assumed to exist in a domain of interest D from the perspective of a person who uses a language L for the purpose of talking about D" John F. Sowa
Despite this, several common themes do emerge:
1. An ontology is a formal representation which aims to be precise, explicit and unambiguous.
2. An ontology describes the entities and concepts relevant to a particular domain.
3. An ontology embodies a particular view of a domain - the same domain may be described in different ways by different ontologies.
These three points are worth commenting on in more detail:
1. While most writers describe ontologies in general terms as a formal description or specification of a domain, there is as yet no single universally accepted definition of what an ontology should consist of. Some writers describe ontologies as consisting of entities, properties and the relationships between them, but this is inadequate for the practical development of ontologies, which requires a clearly defined model. The CIDOC CRM, for example, is derived from the object-oriented data model and consists of a hierarchy of entities and properties. Other ontologies are based on models derived from knowledge representation. These "ontological models" (in effect, meta-ontologies for ontologies) are all based on the assumption that the relevant aspects of a domain conceptualisation can be captured using just a few simple constructs.
2. Ontologies are domain specific for a very practical reason: describing the entire universe involves too much work. If the development of an ontology is ever to finish, its scope and aims have to be clearly defined at the outset. An unrestrained, universal ontology would tend to extend indefinitely and would require unlimited resources - few organisations would be willing to foot the bill (though encouraging one's competitors to do so might be an interesting ploy).
3. Different writers attach varying degrees of importance to the idea that divergent, and potentially incompatible, ontologies may exist for the same domain. However, this point harks back to an important philosophical debate concerning the possibility of multiple conceptualisations of reality. Divergences may emerge merely as the result of describing different domains or different aspects of the same domain - such as population statistics viewed from the perspective of health care rather than life insurance - but these differences are, in a sense, trivial, since they do not reveal any fundamental disagreement. However, it can be argued that incompatible yet equally plausible ontologies may emerge even when the domain is strictly identical - suggesting that the underlying conceptualisations are not the same. One would expect conceptual divergences of this sort to emerge, for example, when rival theoretical hypotheses or religious convictions are at stake. This has the important consequence that uniting two or more divergent ontologies may in some instances turn out to be conceptually impossible. An ontology represents a series of propositions about entities which exist or do not exist within a given domain. Authors of incompatible ontologies would, presumably, attach different truth values to these propositions and might well be unwilling to compromise on their beliefs. Ontologies, as explicit formal representations, may serve a useful role in such cases by helping to reveal and clarify differences in conceptualisation which might otherwise remain concealed by common terminology - similar words may sometimes be employed even though very different ideas and concepts are being expressed.
To summarise, we can say that a particular domain may be understood - conceptualised - in different ways and that an ontology is an explicit, formal representation of a particular conceptualisation.
Using the CIDOC CRM ontology
So what can I do with an ontology? As an example, let us look at the ICOM / CIDOC Conceptual Reference Model (CRM) - an ontology developed for the cultural heritage sector. CIDOC has been working on the CRM since 1996. Building on previous work, and drawing on a wide range of experts from museums, archives, libraries and computer science, the CRM represents the crystallisation of years of accumulated experience and provides a valuable guide to best practice in the domain. It is currently under consideration as a new ISO standard.
The CIDOC Conceptual Reference Model can be defined as a "domain ontology" - in the sense described above - a formal analysis of the things, ideas and relationships which are fundamental to a particular field of activity. The CIDOC ontology is based on an object-oriented model and is composed of entities, organised into a hierarchy and related to each other through property links. This structure of entities and properties provides a framework for describing the complex interrelations that exist between objects, actors, events, places and concepts in the field of cultural heritage. Version 3.2 of the CRM contains over seventy entities and nearly a hundred property links.
Very impressive, you might be thinking, but what can I actually do with an ontology? Does it have any practical applications? What benefits are there to using it?
Perhaps the most immediate role of the CRM is simply as an aid to comprehension and dialogue. As its name indicates, the CRM is a reference document which can help to establish the conceptual common ground between different disciplines and domains. The need for clear and unambiguous communication is critical to IT projects in the cultural heritage sector which bring together domain experts such as art historians, archaeologists, and biologists, with IT specialists and other technicians. In order to design and build satisfactory information systems, technical experts are faced with the difficult task of coming to terms with all the complexities and subtleties of cultural heritage information, while domain experts need to explain their requirements in terms which IT specialists can understand, and evaluate the solutions which they propose. Misunderstandings in the design of information systems - as everyone knows - can turn out to be extremely costly.
By providing a rich and detailed analysis of the cultural heritage domain, the CRM can facilitate dialogue between cultural heritage experts and technical specialists. The entities and property relations of which it is composed are all clearly defined - through textual description, scope notes, examples and cross references, and thanks to their place within the formal structure. This multiple and "redundant" presentation is intended to be accessible to technicians and domain experts alike - cultural heritage professionals may see it as a formal representation of familiar concepts while IT specialists can view it as a high-level blueprint for an information system. The CRM provides, in effect, a basis for mutual comprehension.
Apart from its role as a purely conceptual reference, the CRM can also serve as a technical reference for use in comparing and evaluating information systems, data schema and the like. Comparing existing or projected information systems with the CRM helps to highlight divergences - both in scope and in structure - which can then be examined in more detail to see if they are justified or not.
The value of the CRM as a technical reference becomes particularly clear when it is used as a basis for data transfer between incompatible systems. The CRM schema can provide the semantic backbone needed to design a common data format which can be shared by a number of different systems - a technical lingua franca which allows data to be transferred from one system to another. If data need to be shared between a number of different systems, the use of a single intermediate reference format is a simple and efficient way to proceed; otherwise, every pair of systems needs its own transfer protocol, and the number of protocols grows quadratically as more systems are included.
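The scaling argument can be made concrete with a small worked count. The sketch below assumes that direct exchange needs one converter per direction for each pair of systems, and that a shared format needs one importer and one exporter per system; the function names are illustrative only.

```python
def pairwise_converters(n: int) -> int:
    """Converters needed when every system exchanges data directly with every other."""
    return n * (n - 1)


def hub_converters(n: int) -> int:
    """Converters needed when every system maps to and from one shared format."""
    return 2 * n


for n in (2, 3, 5, 10, 20):
    direct = pairwise_converters(n)
    shared = hub_converters(n)
    print(f"{n:>2} systems: {direct:>3} direct converters, {shared:>2} via a shared format")
```

Under these assumptions the two approaches break even at three systems; beyond that, the shared format needs fewer converters, and each new system adds only two more.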
You may be wondering at this point why something like the CRM is necessary. Doesn't XML do this already? Unfortunately, no. While XML does indeed provide a useful, system-independent means for formatting data, it cannot by itself define the semantic content. XML DTDs and schemas are intended to do this - but then they have to be defined. The arduous task of mapping one data element to another still has to be performed, which is where the CRM comes in. The CRM can be used for designing a common XML schema, avoiding the need for multiple, ad-hoc solutions. Once this reference XML schema is in place, it provides a target format allowing the integration of further data sources.
Providing an extensible basis for data transfer between heterogeneous systems is of enormous value since it not only facilitates data transfer between institutions, but also makes migration between systems far simpler and, possibly most important of all, provides a basis for stable, long-term archiving of data. The embarrassing fate of the 1986 Domesday project in the UK - an impressive multimedia database compiled by schools across the country, stored on 12-inch laser discs designed for the BBC microcomputer and now unreadable after just fifteen years - is a reminder of the value of recording important information in a system-independent format.
Finally, the CRM can be used as a technical specification for the design of new information systems. It is important to underline that, although it is possible to do so, the CRM is not intended to be implemented as is. Some adaptation is normally required - both pruning and extension. The CRM, it must be remembered, is intended to cover the entire field of cultural heritage information, at a level of detail acceptable for scientific research. This means that much of the model would be superfluous to any practical, discipline-specific implementation. Similarly, the degree of detail contained in the CRM would need to be enhanced in some areas to encompass institution-specific requirements. The CRM has been designed to make this process of adaptation as simple as possible by providing 'plug-in' points and guidelines for extensions which remain compatible with the overall structure. The CRM has been used successfully as the basis for the design and implementation of a number of database applications - such as Geneva City's Musinfo project. The advantages of using the CRM as a starting point for a technical specification are that it avoids much of the trial and error involved in modelling an information system, and allows for a more flexible design which can be more readily adapted to future needs.
Possibly the most ambitious application of the CRM is in the development of integrated query tools and mediation systems. At present, information stored in library catalogues, archives and museum collections remains largely isolated. Different resources need to be queried individually and cross-system links are rare. Combining and integrating information from multiple sources has the potential to add value to existing data - facilitating research and enhancing the quality of users' experience.
Physically combining data into a single system would be impossible, both technically and for organisational reasons, so mediation systems aim instead to federate information sources, making distributed queries possible without the need for a monolithic database. A typical mediation system acts as a single interface for users. It accepts and interprets queries and distributes them to participating systems. These systems reply to the mediator, which consolidates the results for the end user. In order for a query mediation system to function correctly, it has to be able to communicate with each participating system in a way that system can understand, and to interpret the results. Participating systems are unlikely to have identical data schemas and may well store different levels of detail about similar objects, so the mediation system needs to be a semantic polyglot.
Using the CRM as the basis for the mediation system's data schema makes distributed query systems much easier to design. By mapping each participating system's internal data representation to the canonical form provided by the CRM, it becomes possible to integrate and interpret data stored in otherwise incompatible systems. Many features of the CRM have been specifically designed with mediation in mind, allowing data to be combined from relatively rich and less detailed sources in a meaningful way and without loss of detail.
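The mapping idea can be sketched in a few lines. The example is deliberately simplified and makes several assumptions: the canonical field names, the two local record layouts and the sample records are all hypothetical, and the mediator filters records itself rather than translating the query and sending it to remote systems as a real one would.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
Canonical = Dict[str, str]  # shared form; field names are hypothetical, not CRM terms
Mapper = Callable[[Record], Canonical]


def from_library(rec: Record) -> Canonical:
    """Map a hypothetical library catalogue record to the canonical form."""
    return {"title": rec["ti"], "creator": rec["au"], "source": "library"}


def from_museum(rec: Record) -> Canonical:
    """Map a hypothetical museum record to the canonical form."""
    return {"title": rec["object_name"], "creator": rec["maker"], "source": "museum"}


class Mediator:
    """Single query interface over several independently structured sources."""

    def __init__(self) -> None:
        self._sources: List[Tuple[List[Record], Mapper]] = []

    def register(self, records: List[Record], mapper: Mapper) -> None:
        """Add a participating system together with its mapping to the canonical form."""
        self._sources.append((records, mapper))

    def query(self, creator: str) -> List[Canonical]:
        """Consolidate matching records from every source, in canonical form."""
        results: List[Canonical] = []
        for records, mapper in self._sources:
            for rec in records:
                canonical = mapper(rec)
                if canonical["creator"] == creator:
                    results.append(canonical)
        return results


# Invented sample data, for illustration only.
mediator = Mediator()
mediator.register([{"ti": "A Treatise on Tea", "au": "A. Example"}], from_library)
mediator.register(
    [
        {"object_name": "Silver teapot", "maker": "A. Example"},
        {"object_name": "Porcelain cup", "maker": "B. Other"},
    ],
    from_museum,
)
hits = mediator.query("A. Example")
for hit in hits:
    print(hit["source"], "-", hit["title"])
```

Each participating system needs only its own mapping function; the query logic is written once against the canonical form.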
In short, an ontology such as the CRM can be used in the context of IT projects to improve communication and help to avoid misunderstandings. As a reference for good practice it can be used to compare and evaluate existing systems. In a technical context the CRM can be used as a basis for data archiving, exchange and integration - an important contribution to the creation of a global network for cultural heritage information.
Making an ontology
The method described here - though intentionally simplified - is based on the same logical steps involved in producing the CRM, described above. In practice, the process tends to be far more iterative and messy, involving multiple revisions, discussion, hesitations and the consumption of much coffee and innumerable chocolate biscuits.
We take as our example a deliberately familiar domain - for English people at least - making a cup of tea. The example - need we add? - is intended to be amusing and should not be taken too seriously.
1. Establish the scope and aim of our ontology
The first step is to define the scope of the ontology and its purpose. This may not be as obvious as it first appears. In our example, which of the following best reflects our aims and concerns?
The reader will no doubt be able to imagine further additions to the list. Defining the scope is also important since it will have an impact on the participants who need to be involved in the process. Producing an ontology is a consensual process, so the relevant partners need to be involved from the outset.
2. Identify the entities that are specific to the domain
In the case of our tea making example, a number of ingredients are clearly involved - tea, milk, sugar, etc. - as well as tools, such as tea pots, kettles, cups and spoons. Less obvious are the processes, or methods, needed to transform the ingredients into something drinkable and the measurements and quantities which have to be respected. These are all entities in the sense of an ontology - entities may be concepts as well as physical things.
This first-pass identification is likely to omit entities which later turn out to be important, and will not encompass all the fine detail which may be necessary. For the sake of discussion, we may settle on the following list.
3. Organise entities into a hierarchy
The notion of hierarchy used here is derived from object-oriented programming - an "Is A" hierarchy starts from very general classes of objects which are progressively specialised into subcategories. Our list of entities can conveniently be grouped into four main branches: ingredients, utensils, actions and measurements.
It is common at this stage for intermediate entities to be introduced as a means of clarifying distinctions which were not initially apparent. The distinction between required and optional ingredients is a good example.
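An "Is A" hierarchy of this kind maps directly onto class inheritance. The sketch below uses the four branches named above and the intermediate required/optional distinction; which ingredients count as required or optional, and the class names themselves, are assumptions made for illustration.

```python
class Entity:
    """Root of the tea-making ontology."""


# The four main branches.
class Ingredient(Entity): pass
class Utensil(Entity): pass
class Action(Entity): pass
class Measurement(Entity): pass


# Intermediate entities introduced to clarify a distinction.
class RequiredIngredient(Ingredient): pass
class OptionalIngredient(Ingredient): pass


# Leaf entities (their placement is an assumption).
class Tea(RequiredIngredient): pass
class Water(RequiredIngredient): pass
class Milk(OptionalIngredient): pass
class Sugar(OptionalIngredient): pass
class Kettle(Utensil): pass
class TeaPot(Utensil): pass
class BoilWater(Action): pass


def lineage(cls: type) -> list:
    """Return the chain of 'Is A' ancestors from a class up to Entity."""
    return [c.__name__ for c in cls.__mro__ if c is not object]


print(" is a ".join(lineage(Sugar)))
```

Because the hierarchy is expressed as inheritance, a question such as "is sugar an ingredient?" reduces to `issubclass(Sugar, Ingredient)`.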
4. Define the entities
The hierarchy already gives us some understanding of the different entities in our ontology. However, it is also important to provide precise definitions and examples wherever possible. This may seem tedious but, by themselves, the names of entities are often insufficient to avoid misunderstandings - the task of formulating definitions helps to ensure that all participants share the same understanding of what each entity represents. Another important reason for providing definitions is that no obvious name may be found for some entities - this is particularly the case with some abstract concepts. Attempts to arrive at a satisfactory appellation can be both time-consuming and frustrating. Providing definitions helps to diminish the importance that is attached to names.
In the case of our example, our definitions would have to provide answers to questions like the following:
5. Properties of entities
Further clarification of the meaning of entities is provided by their properties and attributes. Entities are both described and to some extent defined by their properties. It is a defining characteristic of quadrupeds, for example, that they normally have four legs. As a general rule, two entities should not have exactly the same attributes - if they do, they may be better represented as two aspects of one class of objects.
In order to characterise sugar, for example, we might require the following properties, each of which can take one or several values.
cane, beet, maple
granulated, castor, lumps, cubes ...
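One way to record such properties is as enumerated value sets attached to the entity. In the sketch below the property names `sources` and `forms` are hypothetical labels chosen for illustration, and only the values listed above are included.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class SugarSource(Enum):
    CANE = "cane"
    BEET = "beet"
    MAPLE = "maple"


class SugarForm(Enum):
    GRANULATED = "granulated"
    CASTOR = "castor"
    LUMPS = "lumps"
    CUBES = "cubes"


@dataclass(frozen=True)
class Sugar:
    """A sugar characterised by two properties, each taking one or several values."""
    sources: FrozenSet[SugarSource]
    forms: FrozenSet[SugarForm]


# A hypothetical instance: cane sugar available granulated or in cubes.
table_sugar = Sugar(
    sources=frozenset({SugarSource.CANE}),
    forms=frozenset({SugarForm.GRANULATED, SugarForm.CUBES}),
)
print(sorted(form.value for form in table_sugar.forms))
```

Restricting each property to an enumerated set makes the permitted values explicit, which is part of what distinguishes an ontology from a free-text description.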
6. Identify the relationships
Having identified and defined the entities in our ontology, we can proceed to define the relationships between them. The water, the act of boiling water and the kettle all obviously have something to do with each other. Water is a passive participant in the boiling operation and is transformed; the kettle is used but not significantly altered.
Relationships are generally named using verbal forms. They work two ways and can be defined from two different points of view. For convenience in 'reading' relationships, two verbal forms may be used, one for each direction.
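A relationship with one verbal form per direction can be sketched as a small record type. The relationship names and the entity names used here are assumptions based on the boiling-water example, not definitions taken from any published ontology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    """A link between two kinds of entity, readable in both directions."""
    forward: str        # verbal form read from source to target
    inverse: str        # verbal form read from target to source
    source_entity: str
    target_entity: str

    def read_forward(self, source: str, target: str) -> str:
        return f"{source} {self.forward} {target}"

    def read_inverse(self, source: str, target: str) -> str:
        return f"{target} {self.inverse} {source}"


# Hypothetical relationships for the tea example.
USES = Relationship("uses", "is used by", "Action", "Utensil")
TRANSFORMS = Relationship("transforms", "is transformed by", "Action", "Ingredient")

print(USES.read_forward("boiling", "the kettle"))
print(TRANSFORMS.read_inverse("boiling", "the water"))
```

Storing both verbal forms on one object keeps the two readings in step: they describe the same link, seen from either end.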
7. Describe and define relationships
Naturally, relationships need to be described and defined in the same way as entities, and for much the same reasons.
8. Refinements and extensions
At this point it may seem as though the work is finished. However, this is really only the beginning, for now the task of refining, extending and improving the ontology begins - comparing it with other sources of information and testing it to ensure that it effectively covers the intended domain. Specific applications may require additional levels of detail, which the ontology will need to handle. In our example, sugar is the only form of sweetener that has so far been recognised, though others exist. Extending the ontology to include other products requires the creation of a new 'sweetener' entity, which can encompass sugar, saccharin, honey, etc.
The fragment below also includes two examples of "multiple inheritance" - a useful 'wrinkle' derived from object-oriented modelling. Despite the off-putting name, this simply means that Fructose, for example, can be regarded as both an example of a natural sweetener and a diet sweetener.
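Multiple inheritance of this kind can be sketched directly in Python. The sketch shows the Fructose case only; the placement of sugar, honey and saccharin under the two intermediate classes is an assumption made for illustration.

```python
class Sweetener:
    """New general entity introduced when extending the ontology."""


class NaturalSweetener(Sweetener): pass
class DietSweetener(Sweetener): pass


# Assumed placement of the sweeteners mentioned in the text.
class Sugar(NaturalSweetener): pass
class Honey(NaturalSweetener): pass
class Saccharin(DietSweetener): pass


# Multiple inheritance: Fructose belongs to both branches at once.
class Fructose(NaturalSweetener, DietSweetener): pass


for branch in (NaturalSweetener, DietSweetener, Sweetener):
    print(f"Fructose is a {branch.__name__}: {issubclass(Fructose, branch)}")
```

Fructose inherits from both branches without being duplicated, so a query for natural sweeteners and a query for diet sweeteners will both find it.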
The aim of this short article was to provide a rapid introduction to working with ontologies - hopefully, you should now have some idea about what they are, what they can do and how to make them. At the very least, you may have been surprised to realise how much you knew and how much there is to say about making a cup of tea. That sense of discovery is, after all, one of the major reasons for making an ontology.
Information Systems Consultant
Patrick Le Bœuf
Bibliothèque nationale de France
Université de Marne-la-Vallée