Classification schemes have a role in aiding information retrieval in a network environment, especially for providing browsing structures for subject-based information gateways on the Internet. Advantages of using classification schemes include improved subject browsing facilities, potential multi-lingual access and improved interoperability with other services. Classification schemes vary in scope and methodology, but can be divided into universal, national general, subject specific and home-grown schemes. What type of scheme is used, however, will depend upon the size and scope of the service being designed. A study is made of classification schemes currently used in Internet search and discovery services, particular reference being given to the following schemes: Dewey Decimal Classification (DDC); Universal Decimal Classification (UDC); Library of Congress Classification (LCC); Nederlandse Basisclassificatie (BC); Sveriges Allmänna Biblioteksförening (SAB); Iconclass; National Library of Medicine (NLM); Engineering Information (Ei); Mathematics Subject Classification (MSC) and the ACM Computing Classification System (CCS). Projects which attempt to apply classification in automated services are also described, including the Nordic WAIS/WWW Project, Project GERHARD and Project Scorpion.
Some Internet services concerned with giving access to other Internet sites use classification schemes for organising a browsing structure giving access to their selected resources. This is especially true of Internet subject services, which often use a browsable classified structure in addition to a searchable index.
The use of a classification scheme gives some advantages to an Internet service to the extent that it helps with browsing, enables the broadening and narrowing of searches, gives context to search terms being used, allows (under certain conditions) multi-lingual access to collections of material and allows the partitioning and manipulation of a large database. If an existing classification scheme is chosen, it will have a good chance of not becoming obsolete and will possibly be well-known to users.
Classification schemes can be defined by several categories, but can be broadly divided into:
- Universal schemes – examples include the Dewey Decimal Classification (DDC), the Universal Decimal Classification (UDC) and the Library of Congress Classification (LCC);
- National general schemes – universal in subject coverage but usually designed for use in a single country. Examples include the Nederlandse Basisclassificatie (BC) and the Sveriges Allmänna Biblioteksförening (SAB);
- Subject specific schemes – designed for use by a particular subject community. Examples include Iconclass for art resources, the National Library of Medicine (NLM) scheme for medicine and Engineering Information (Ei) for engineering subjects;
- Home-grown schemes – schemes devised for use in a particular service. An example from the Internet is the ‘ontology’ developed for the Yahoo! search service.
All of these classification types are used to some extent on the Internet. Universal schemes like DDC and UDC are used by many Internet services and are readily available in machine-readable form. Subject services, however, appear more likely to use a subject specific scheme.
The type of classification scheme chosen for use in an Internet service should depend upon the scope of the service which is planned. A subject service, where possible, could use a well-known, international, subject specific scheme. Another service, which either has a more general brief or is in a subject area where there is no agreed ‘standard’ classification system in use, could use or adapt a universal scheme.
For the widest interoperability, more than one classification scheme could be used or conversion programs designed. Alternatively, a universal scheme could be used to ‘glue’ different subject services together while the actual services themselves would be classified in a different, relevant, subject-specific scheme.
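The ‘glue’ arrangement described above can be sketched as a simple crosswalk table. This is a minimal illustration only: the service names, the local class codes and the UDC-style universal classes they map to are all invented for the example, and a real conversion program would need a far richer mapping.

```python
from collections import defaultdict

# Hypothetical crosswalk: (service, local subject-specific code) -> universal class.
# Every name and code here is an invented example, not taken from a real scheme.
CROSSWALK = {
    ("medical-gateway", "MED-CARDIO"): "61",
    ("engineering-gateway", "ENG-BIOMED"): "61",
    ("engineering-gateway", "ENG-CIVIL"): "62",
}


def services_for(universal_class):
    """Return, per service, the local codes that fall under a universal class."""
    hits = defaultdict(list)
    for (service, local_code), universal in CROSSWALK.items():
        # A shorter universal notation is treated as a broader class.
        if universal.startswith(universal_class):
            hits[service].append(local_code)
    return dict(hits)


if __name__ == "__main__":
    print(services_for("61"))
```

Each service keeps its own subject-specific classification; only the small crosswalk has to be agreed between services for a user to be routed from a universal class to the relevant local classes.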
Classification is a time-consuming and expensive process, so research has been carried out into the automatic classification of Internet resources. Various projects have investigated how subject terms collected from a search of a database can be converted into classification notation. Two projects, the Nordic WAIS/WWW project (Lund) and Project GERHARD (Oldenburg), used UDC for the conversion, while OCLC’s Project Scorpion is looking at DDC. Other projects are looking at neural networks and at automatic conversions between classification schemes.
Automatic classification processes are also important if large robot-generated services want to add a browsing structure for their documents.
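The general idea of converting subject terms into classification notation can be illustrated with a very small sketch. It is not the method used by any of the projects named above: the term table, the UDC-style class numbers and the simple voting rule are assumptions made purely for illustration.

```python
import re
from collections import Counter

# Hypothetical term-to-class table; the class numbers are illustrative only.
TERM_TO_CLASS = {
    "geometry": "514",
    "triangle": "514",
    "algebra": "512",
    "optics": "535",
    "lens": "535",
}


def classify(text):
    """Suggest the class whose index terms occur most often in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    votes = Counter(TERM_TO_CLASS[w] for w in words if w in TERM_TO_CLASS)
    if not votes:
        return None  # no known subject term found
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(classify("Triangle geometry and a little algebra"))
```

A production system would need a much larger vocabulary, handling of multi-word terms and some form of weighting, but the output in each case is a notation that can place the document in a browsing structure.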
This report investigates the use of classification schemes to aid retrieval in a network environment, specifically with regard to the Internet. The library community, over many years, had appeared to favour subject indexing systems (the use of a controlled vocabulary to assign indexing terms to documents) over the use of traditional classification schemes (grouping documents into a hierarchical structure of subject categories). During the first period of the development of networked information services, many specialists, especially those from the computing community, also questioned the value of library subject description systems in principle, pointing to the accomplishments of full-text indexing software.
The increasing use of the Internet and the World Wide Web (WWW) for the storage and retrieval of vast amounts of information has, however, changed this perception. Two distinct ways of finding resources on the Internet emerged (Dodd 1996, p. 276). One approach consisted of the development of robot based search engines which could be used for powerful keyword searches of the contents of the WWW. These are extremely useful tools, although they have a tendency to return large amounts of irrelevant information. The other approach started with producing ‘hotlists’ which would encourage users to browse the WWW. The production of hierarchical browsing tools sometimes led to the adoption of library classification schemes to provide the subject hierarchy. At least one general discovery service, Yahoo! <URL:http://www.yahoo.com/>, devised its own ‘home-grown’ classification scheme (or ontology) to give structured hierarchical access to the resources which it had indexed. Quality-controlled subject services, which gave access only to selected Internet resources, also understood that a browsing structure based on subject classification would be a desirable complement to a search engine type service. Most subject services of this type, and almost all of the Electronic Libraries (eLib) Programme access to network resources services and the proposed DESIRE test-bed services, currently use a classification scheme which can be browsed. A list of Internet sites that use library classification systems or subject headings can be found in Beyond bookmarks (McKiernan 1996) <URL:http://www.iastate.edu/~CYBERSTACKS/CTW.htm>.
This report will describe the advantages of resource classification for subject-based information gateways on the Internet, analyse the advantages and disadvantages of different types of classification systems, and then review some important individual schemes.
The use of classification schemes offers one solution to providing improved access to WWW resources. Web sites have been created to act as a guide to other Web sites selected according to some pre-specified criteria, e.g. they are judged to be good quality resources or relevant to a particular subject-area. Some of these sites typically consist of an alphabetical list of subjects, and selected Web resources are listed below each one.
Examples include Argus Clearinghouse <URL:http://www.clearinghouse.net/> and the WWW Virtual Library <URL:http://www.w3.org/pub/DataSources/bySubject/Overview2.html>. In this context, it can be understood why classification schemes have begun to be used to give added-value subject access to Web sites. A site that organises knowledge with a classification scheme demonstrates several advantages over sites which do not (cf. Svenonius 1983):
- Browsing: classified subject lists can easily be browsed in an online environment. Browsing is particularly helpful for inexperienced users or for users not familiar with a subject and its structure and terminology. In addition, the structure of the classification scheme can be displayed in different ways as a navigation aid. The classification notation does not even need to be displayed on the screen, so an inexperienced user can have the advantage of using a hierarchical scheme without the distraction of the notation itself.
- Broadening and narrowing searches: classification schemes are hierarchical and therefore can be used to broaden (i.e. for improved recall) or narrow a search when required. Questions can be limited to individual parts of a collection (filtering) and the number of false hits be reduced (i.e. for improved precision).
- Context: the use of a classification scheme gives context to the search terms used. For example, the problem of homonyms (words which have the same form and spelling but a different meaning) can be partly overcome.
- Potential to permit multilingual access to a collection: since classification systems often use notations independent from a specific language, indices in different languages can offer multilingual access to the same resources without any further changes to the collection. A searcher could enter search terms in a given language and those terms would then relate to the relevant parts of the classification system (as a switching language) and be used to retrieve resources in any given language on the subject.
- The partitioning and manipulation of a database: large classified lists can be divided logically into smaller parts if required.
- The use of an agreed classification scheme could enable improved browsing and subject searching across databases.
- An established classification system is not usually in danger of obsolescence. The larger schemes now undergo continuous revision, although they are normally also formally published in numbered editions. Some classifications may have to be changed when a new edition of a scheme is published, but it is unlikely that every single resource will have to be re-classified.
- They have the potential to be well-known: regular users of libraries will be familiar with at least part of one or more of the traditional library schemes. Members of a subject community are likely to be familiar with their (subject-specific) schemes as well. Use of an Internet service which uses them will therefore have an advantage over one that uses its own classification or none.
- Many classification schemes are available in machine-readable form.
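The broadening and narrowing of searches listed above follows directly from hierarchical notation, and can be sketched in a few lines. The example assumes a UDC-style decimal notation in which a shorter number denotes a broader class; the class numbers, captions and resource titles are illustrative and are not quoted from an authoritative edition of any scheme.

```python
# Illustrative captions for a few UDC-style classes (assumed, not authoritative).
CAPTIONS = {
    "5": "Mathematics and natural sciences",
    "51": "Mathematics",
    "514": "Geometry",
    "53": "Physics",
}

# Hypothetical classified resources: (title, notation).
RESOURCES = [
    ("Euclid online", "514"),
    ("Number theory notes", "511"),
    ("Optics tutorial", "535"),
    ("General science portal", "5"),
]


def search(notation):
    """Return titles classified at or below the given class."""
    return [title for title, code in RESOURCES if code.startswith(notation)]


def broaden(notation):
    """Move one level up the hierarchy by dropping the last digit."""
    return notation[:-1] if len(notation) > 1 else notation


if __name__ == "__main__":
    query = "514"
    print(CAPTIONS[query], search(query))              # narrow search
    wider = broaden(query)
    print(CAPTIONS[wider], search(wider))              # broadened one level
```

Only the captions need to be shown to the user; the notation can remain hidden and simply drive the broadening, narrowing and filtering behind the interface.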
Classification schemes, however, can sometimes be subject to criticism:
- The division of logical collections of material: classification schemes often split up collections of related material. This can be partly overcome with good cross-references.
- The illogical subdivision of classes: some popular schemes do not always subdivide classes in a logical manner (Buchanan 1979, pp. 32-34; Rowley 1987, pp. 188-189). This can make them difficult to use for browsing purposes.
- Assimilating new areas of interest: classification schemes, since they are usually updated through formal processes by organised bodies, often have difficulty in reacting to new areas of study.
There are several different types of classification systems, varying in scope, methodology and other characteristics. Detailed descriptions cannot be given here, but it is useful to know these different types when trying to understand the terminology of this report and when decisions about which scheme to use are required.
Classification systems – by facet:
- by subject coverage: general or subject specific
- by language: multilingual or individual language
- by geography: global or national
- by creating/supporting body: representative of a long-term committed body or a home-grown system developed by a couple of individuals
- by user environment: libraries with container publications or documentation services carrying small focused documents (e.g. abstract and index databases)
- by structure: enumerative or faceted
- by methodology: a priori construction according to a general structure of knowledge and scientific disciplines or using existing classified documents
(The categories are not mutually exclusive; a classification can fit into more than one category.)
The facet structure above shows what types of classification scheme are theoretically possible. In reality, the most frequently used types of classification schemes are: a) universal; b) national general; c) subject specific schemes, most often international; d) home-grown systems; e) local adaptations of all types.
The term ‘universal scheme’ is used for schemes which aim to include all subjects and which are geographically global and multilingual in scope. Part 2 of the report deals with some of the most well-known individual schemes as examples.
The first practical universal classification schemes were developed in the late-nineteenth-century as a response to the problem of organising libraries in the context of rapidly growing knowledge and an increase in the numbers of printed books. Universal schemes aim to be both comprehensive and also to expand and contract to fit the state of knowledge at any time.
The most widely-used universal classification schemes are those which were developed for the use of libraries since the late-nineteenth-century, notably the Dewey Decimal Classification (DDC), the Universal Decimal Classification (UDC) and the classification scheme devised by the Library of Congress (LCC).
Use of a universal, multidisciplinary classification scheme in an Internet context results in the following advantages (in addition to the general advantages of using a classification scheme, see 1.2 above):
- They can cover all subject areas: The use of an agreed universal classification scheme as a global top-level structure could enable improved browsing and subject searching across services and collections from different subject areas. In theory, the use of an agreed universal scheme at many sites would allow for the widest interoperability. But it should be remembered that this is normally not the most important criterion when choosing a scheme for a certain service (cf. 4. Conclusions).
- They are widely supported: For the universal schemes, there is a global interest in support, development and survival of the scheme. DDC, UDC and LCC have been repeatedly revised since their first publication and are updated by responsible international bodies.
- They might be known to more users than other types of classifications: regular users of libraries will be familiar with at least part of one or more of these schemes. Use of an Internet service which uses them will therefore have an advantage over one that uses its own classification or none.
- They have an especially good potential to permit multilingual access to a collection: DDC was first published in English and UDC in French, but both have been widely translated. Full editions of UDC have been made available in English, German, Russian and Spanish, and abridged versions are available in other languages (Langridge 1973, p. 89; McIlwaine and Buxton 1995, pp. 7-8). DDC has been translated into 30 languages and is currently used in 135 countries (Thompson, Shafer & Vizine-Goetz 1997). This means that the tools already exist for multilingual access to Internet sites organised with these schemes.
- The major universal classification schemes are now all available in machine-readable form (see parts 2.1 – 2.3)
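The use of classification notation as a switching language for multilingual access can be sketched as follows. The per-language indexes, the UDC-style class numbers and the resource titles are assumptions invented for this illustration; a real service would draw its indexes from the published translations of the scheme.

```python
# Hypothetical subject indexes in three languages, all pointing at the same
# language-independent notation (illustrative UDC-style numbers).
INDEX = {
    "en": {"mathematics": "51", "physics": "53"},
    "de": {"mathematik": "51", "physik": "53"},
    "sv": {"matematik": "51", "fysik": "53"},
}

# Hypothetical resources in different languages, classified once.
RESOURCES = [
    {"title": "Introduction to Algebra", "lang": "en", "code": "512"},
    {"title": "Einführung in die Geometrie", "lang": "de", "code": "514"},
    {"title": "Grundkurs Physik", "lang": "de", "code": "53"},
]


def find(term, query_lang):
    """Map a term to its class, then retrieve resources in any language."""
    code = INDEX[query_lang].get(term.lower())
    if code is None:
        return []
    return [r["title"] for r in RESOURCES if r["code"].startswith(code)]


if __name__ == "__main__":
    # A Swedish query retrieves English and German resources on mathematics.
    print(find("matematik", "sv"))
```

Adding a further language requires only a new index onto the existing notation; the classified collection itself is left unchanged.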
Universal classification schemes, however, are subject to several criticisms:
- False ontology: there is a general concern that universal schemes impose a false order upon knowledge. For example, it was believed in the early 1970s that DDC still reflected its origins in a small North American university library (Foskett 1973, p. 39). The structure of enumerative schemes (most universal schemes are basically enumerative) is often perceived as subjective, and critics find many examples of inconsistency and illogicality. For this reason, library classification theory had begun to move away from enumerative schemes in the mid-twentieth-century. Examples of the alternative ‘faceted’ or ‘analytico-synthetic’ classification schemes are Ranganathan’s Colon classification (Ranganathan 1965) and Bliss’s Bibliographic classification, both of which are hardly ever used. Later editions of DDC and UDC are, however, faceted to a limited extent.
- Bad at assimilating new areas of interest: universal classification schemes often have a special difficulty in reacting quickly to new areas of study because they are updated with the time-consuming participation of broad international multidisciplinary bodies. Researchers on the University of Illinois Digital Library Initiative project comment that most digital repositories contain “concepts and vocabularies too new or dynamic for controlled-vocabulary-based human indexing” (Schatz et al. 1996, p. 33). Similarly, all classification systems are poor at handling new concepts and vocabularies, but universal classification schemes tend to have more disadvantages in this area when compared with subject-specific schemes.
Most of the advantages and disadvantages of universal classification schemes apply also to national general schemes (cf. 2.4. National general schemes), but they have additional characteristics that make them perhaps not the best choice for an Internet service that claims to be relevant for a wider user group than one limited to certain national boundaries.
Some of those characteristics are discussed here, relating to use of the scheme in the Internet environment:
- Although national general schemes offer coverage of all subject areas, they are in general not well known outside of their place of origin. For an international audience, a universal scheme would probably serve better.
- Support for a national scheme will be broad in the country itself, and a national institution has the responsibility for development. Support for the scheme outside of this national user group is limited (e.g. use of the Nederlandse Basisclassificatie by German libraries which use the Pica system).
- Within the country the national scheme may be better known than universal schemes, e.g. the BC is used by Pica libraries in the Netherlands (mostly academic libraries), and SAB is used by almost all the public libraries in Sweden.
When the choice was made in the Koninklijke Bibliotheek to use the Nederlandse Basisclassificatie for an Internet subject service (the Nederlandse Basisclassificatie Web), this was done mainly because the subject specialists already used the scheme for classification of printed works. If NBW outgrows its national boundaries, for instance in the DESIRE context, or by the participation of non-Dutch institutions, the conversion to another scheme will deserve serious consideration, to make wider interoperability possible.
- Multilingual capability is not a primary concern for national schemes, apart from countries with multiple languages.
- National schemes are likely to have a geographic bias, e.g. the classification of languages in the BC is not only Eurocentric, but biased towards the Dutch context: Frisian – as a language spoken by a minority in Holland – has a separate class, while Asiatic languages have only three: Japanese, Chinese and ‘other’ Asiatic languages. This bias could be a serious drawback in an international context.
Most subject specific schemes have been devised with a particular user-group in mind. Typically they have been developed for use with indexing and abstracting services, special collections or important journals and bibliographies in a scientific discipline. They have the potential to provide a structure and terminology much closer to the discipline, and can be more up-to-date than universal schemes.
Examples of specific schemes are Engineering Information (Ei) for engineering, the National Library of Medicine (NLM) Classification for medicine and the British Catalogue of Music Classification. In subject areas like medicine, agricultural science and engineering, where there are international and widely recognised schemes available, subject services normally will prefer these or use them in combination with a universal scheme.
Subject specific schemes do have some drawbacks:
- They make co-operation between subject services from different subject areas more difficult. Elaborate conversion programs will be needed in order to exchange resources or to point to them in another service.
- If they have a very small user-base it can be very difficult for the numerous users from other subject areas to learn the structure of the scheme.
- Collections of subject-specific resources are likely to include some fringe topics which will not be adequately covered within the specialist scheme itself (Langridge 1991, p. 16).
It is therefore advisable that only well-established subject specific classification schemes should be used to describe Internet resources.
Some Web sites have tried to organise knowledge on the Internet by devising their own classification scheme. Yahoo!, created in 1994, lists Web sites using their own universal classification scheme or ‘ontology’, which contains 14 main categories. Each Web site collected for Yahoo! is listed under one of 20,000 categories or sub-categories (Steinberg 1996), the scheme being developed over time by the 20 people doing the classification work.
A study by Vizine-Goetz (1996a) showed that out of Yahoo!’s 50 most popular categories, all but four mapped perfectly to explicit DDC or LCC numbers or ranges. The results “… indicate that DDC and LCC have sufficiently wide topic coverage for classifying Internet resources”. The structure of Yahoo! would require encoding to take advantage of the relationships between classes, which are handled by notations in traditional library schemes and are an important prerequisite for automatic routines and improved navigation.
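One way in which a home-grown hierarchy might be encoded so that automatic routines can use the relationships between classes is sketched below. The category tree is invented for illustration and is not Yahoo!’s actual ontology, and the dotted numbering is only one of many possible notations.

```python
# Hypothetical category tree; names are examples, not a real service's ontology.
TREE = {
    "Arts": {},
    "Science": {
        "Mathematics": {"Geometry": {}},
        "Physics": {},
    },
}


def assign_notation(tree, prefix=""):
    """Give every category a dotted code that records its place in the tree."""
    codes = {}
    for position, name in enumerate(sorted(tree), start=1):
        code = f"{prefix}.{position}" if prefix else str(position)
        codes[code] = name
        codes.update(assign_notation(tree[name], code))
    return codes


def is_descendant(code, ancestor):
    """True if 'code' lies below 'ancestor' in the hierarchy."""
    return code.startswith(ancestor + ".")


if __name__ == "__main__":
    for code, name in assign_notation(TREE).items():
        print(code, name)
```

With such codes in place, a program can decide whether one category is broader or narrower than another without any knowledge of the category names.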
Home-grown schemes do have some theoretical advantages over library universal classification schemes:
- Home-grown schemes are relatively flexible and easy to change. For example, in 1995 Yahoo! was adding categories and making other changes to the ontology every day (Steinberg 1996).
- Home-grown schemes can very quickly absorb new areas of interest. Universal and enumerative schemes cannot just add new classification numbers when they are required; attention has to be given to keeping the numeric arrangement logical and easy to understand. This process can be very drawn-out.
On the other hand, home-grown schemes have a number of disadvantages:
- They amplify the problems of classification subjectivity and can lead to a lack of consistency. Steinberg (1996) notes that Yahoo!’s more or less consistent point of view “comes from having the same 20 people classifying every site, and by having those people crammed together in the same building where they are constantly engaged in a discussion of what belongs where”. Other people using the same scheme or ontology might come to very different solutions.
- They are unlikely to be as well-known to users as universal classification schemes.
- If the scheme is self-devised, it might need frequent revision with little chance of co-operation. The economic cost of this will fall entirely on the originator of the scheme.