The big picture
This chapter provides an overview of how metadata produced at a local institution can be shared and combined with metadata created at other institutions. Creating shareable metadata and opening your metadata to other services will bring new visitors to your site and increase awareness of your collections. All metadata should be created with shareability in mind, and this chapter will provide ideas for how to make metadata more interoperable. Additionally, this chapter will cover the basics of the Semantic Web and will explain how cultural heritage resources can be described in RDF. Topics include sharing metadata through metasearch, the Semantic Web and RDF, crosswalking and mapping, and OAI-PMH.
The previous chapters explained how to make metadata at your own institution and how to meet the needs of your own patrons. However, we haven’t yet discussed how all of the metadata created at your institution can fit with metadata created at other institutions. From a user’s point of view, each metadata record is like a piece of a jigsaw puzzle in their search for resources relevant to their work. How all these records fit together determines the big picture of resources available for their research topic.
If researchers are studying a specific topic, they need to know that your institution may hold a collection relevant to that topic. Previously, unless the collection was well advertised, researchers may not have known that it existed. However, now that you’ve invested resources in describing your collection, it’s equally important to make sure that access to the collection is available through subject portals and aggregations.
Digital collections exist at many institutions and users are no longer geographically limited to collections in their surrounding areas. Users can explore collections held far away as long as the resource has been digitized, and the user is able to find the resource. But how can the user find a resource in your collection if they don’t know that the collection exists? The solution is to share metadata about your resources in as many places as possible.
Why should we share our metadata? Increased exposure of our resources’ metadata increases exposure of our collections. Providing access to our resources in a variety of portals helps to ensure that researchers looking for resources are able to find them. If a researcher finds an item in our collection, and had not realized that our collection was relevant to their research, they are more likely to think of us in the future. Use of our digital resources broadens our user base and helps our institution remain relevant in the future.
Users familiar with Google and other portals often bypass the “front” webpage of our collection entirely. If a resource in our collection is relevant to their search, they’re likely to jump into the collection at that page and start exploring our resources. But, in order for the first page they see in the collection to make sense, the item needs good metadata to help contextualize the resource and direct the user to other related items.
Libraries have developed methods to share metadata with subject portals and national and international portals. An example of this type of portal is OCLC’s OAIster (pronounced ‘oyster’) (http://oaister.worldcat.org/). OAIster collects millions of metadata records created by various cultural heritage institutions, allowing users to search across all resources simultaneously. These records are brought together through use of OAI-PMH into a single database.
OAI-PMH divides the world into data providers and service providers. Data providers create and expose their metadata. Service providers collect the metadata, normalize it so that it matches metadata from other harvested collections, and make it searchable through a portal. Although metadata records are only as current as the most recent harvest, OAI-PMH makes refreshing of harvested records fairly simple. Advantages are that the aggregator has control of the database and servers and of results displays, and records can be ranked according to relevance. More details on this protocol follow later in the chapter.
Another common method used to make metadata searchable is metasearch, commonly implemented using the ANSI/NISO Z39.50 protocol. The metasearch method performs a live search across multiple databases simultaneously and returns the results to the user. This is a common approach to simultaneously searching many databases that libraries subscribe to without having direct access to the metadata. Drawbacks are that searches are processed at each host’s database, and the library doesn’t have control over how results are returned. Another drawback is that results aren’t integrated across databases, and there is no way to determine relevance between two different result sets. Advantages are that the metadata is up-to-date, and the library does not need to maintain a large database of metadata records.
Another way that libraries can share data with the world is through the Semantic Web and linked data. Current descriptive metadata practices rely on computers transmitting the data and humans reading and understanding it. The Semantic Web, by contrast, aims to make the data understandable by computers, so that machines can interpret resources and find connections that humans might miss, given the limitations of human-readable metadata. One of the main structures of the Semantic Web is RDF and its use of statements. In this framework, a resource is described using several statements instead of a single metadata record. An RDF statement, called a “triple,” has three parts: a subject, a predicate, and an object. Related to the cultural heritage digital collection metadata we’ve been using in this workbook, the subject is the resource being described (e.g. this book), the predicate is the metadata property (e.g. “creator”), and the object is the metadata string (e.g. “Jackson”). All three parts of the statement can be represented by a uniform resource identifier (URI), and the predicate is drawn from a controlled vocabulary. Ideally, the object is also from a controlled vocabulary.
A visual representation of RDF looks like Figure 7.1:
For an example, let’s look at the following image (Figure 7.2) and associated metadata:
In these examples, the subject is the image, defined by a URI (02231177), the predicate is the property name (title or type), and the object is the value of the property (“Wright’s Trading Post, Albuquerque, NM,” or “image”). See the following exercise (Figure 7.3).
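The triple structure in these examples can be sketched with plain Python tuples. The resource URI below is hypothetical (the record gives only the local identifier 02231177), while the predicate URIs are the real Dublin Core element URIs:

```python
# Each RDF statement is a (subject, predicate, object) triple.
# The subject URI is illustrative; the predicates are Dublin Core elements.
triples = [
    ("http://example.org/resource/02231177",        # subject: the image
     "http://purl.org/dc/elements/1.1/title",       # predicate: the property
     "Wright's Trading Post, Albuquerque, NM"),     # object: the value
    ("http://example.org/resource/02231177",
     "http://purl.org/dc/elements/1.1/type",
     "image"),
]

# A resource's description is simply the set of statements about it:
def describe(subject):
    return {pred: obj for subj, pred, obj in triples if subj == subject}
```

Unlike a single metadata record, nothing binds these statements together except the shared subject URI, which is what allows statements made by different institutions to be merged.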
When RDF is stored in a database, queries can be performed using SPARQL (pronounced “sparkle”), a query language for retrieving and manipulating data stored in RDF. In the Semantic Web, RDF can be used with OWL (Web Ontology Language). An ontology is a representation of knowledge within a domain through a specific set of concepts and their relationships to each other. OWL is a specific computer language for making ontological statements that can be used with RDF.
RDF is a very powerful language for machines, and allows machines to “understand” the metadata used to describe resources. However, as you can see, it’s also very complex and time consuming to identify all resources at this level. The Semantic Web and linked data are still under development as a proof of concept, and may or may not become major future developments of the World Wide Web. Efforts such as those by the Dublin Core Metadata Initiative/Resource Description and Access Task Group may aid in this development by creating RDF vocabularies that will be adopted by the library and other metadata communities (Hillmann et al., 2010). Professionals working in small cultural heritage institutions do not currently need to be overly concerned with the Semantic Web, but should be aware of how metadata is used in this environment and watch its development.
All descriptive metadata will eventually be seen by humans interested in resources, so human understandability is important. A metadata record should clearly explain the resource being described. Additionally, although your targeted user group may be subject specialists, when your records are exported to an aggregator, the user of a service provider’s portal may not be a subject expert. You should ensure that non-subject specialists can also understand the record.
Shreeves et al. (2006) provide the following “six C’s and lots of S’s of sharable metadata.” The “C’s” are: context, consistency, coherence, content, communication, and conformance to standards. The number of standards involved is the “lots of S’s.”
In an aggregated environment, the context your individual records may have received from being part of a larger collection is lost to a user. In order to make metadata records shareable, you should provide some contextual information in the metadata. An example of this was articulated by Robin Wendler as the “on a horse” problem (Wendler, 2004). Wendler was working with metadata records from the Theodore Roosevelt collection, and noticed that some records describing images of Roosevelt did not contain the word “Roosevelt.” A particular image of Roosevelt sitting on a horse was titled “on a horse,” but the record did not contain his name. Since the original record was part of the Roosevelt collection, the metadata creator assumed that the users would know that the image was of Roosevelt. However, in the aggregated environment, there are no clues as to where the record came from, and the record needs to explicitly say that the image is of Theodore Roosevelt.
Metadata from individual repositories should be created with consistency in mind. If all records are created with the same practices, an aggregator can make uniform changes across all harvested records to match practices in their own environment. For example, if an aggregator prefers the first name first in the author field, but you’ve created all of your records with the last name first, the aggregator can flip the order of the name on all records. However, if your practices haven’t been consistent and some records have first name first while others have last name first, it will be much more difficult for the aggregator to make all of your records conform to their practices. Important things to remember when trying to make records consistent are to use all fields consistently, use controlled vocabulary consistently, and use consistent encoding schemas. For example, you should choose a controlled vocabulary like LCSH or AAT and use this vocabulary for all metadata records. A common understanding of all fields is important, so that all records are using a field in the same way. This is why standards, such as Dublin Core, VRA Core, EAD, or CDWA, are so important in this community. If one collection uses the “source” field to describe the original resource, and another collection incorrectly uses it to describe the collection, users will not be searching the same information when they limit their search to a specific field. It’s also important to consistently use all appropriate fields. If your collection doesn’t use a relevant field to describe resources, but the users limit a search to that field, not all resources in your collection may be included in the results.
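As a minimal sketch of the uniform change described above, an aggregator could flip consistently entered “Last, First” names with a few lines of code (the function name is illustrative):

```python
def flip_name(name):
    """Convert a consistent 'Last, First' string to 'First Last'."""
    if "," in name:
        last, first = (part.strip() for part in name.split(",", 1))
        return f"{first} {last}"
    return name  # no comma: nothing reliable to flip

# This works only because every record follows the same convention;
# mixed conventions would make the transformation unsafe.
```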
Metadata records should be able to stand alone and be coherent to a non-specialist. As soon as a metadata record leaves your local environment, you cannot assume that users will have any specialized knowledge of the subject. For example, a local battle may be well known to an historic society’s community, but a person on the other side of the country may not know anything about it. When creating a metadata record that mentions the battle, be sure to give extra information for individuals who may not know anything about it. As the linked data environment becomes more mature through consistent use of URIs, the data behind the URI will provide this extra information.
The content of a metadata record needs to be optimized for sharing. Granularity of the description is an important part of the content, and should be considered before exporting records. For example, in an image collection, most institutions take the time to describe every image. However, individual books are a collection of images held together with structural metadata, and each page could have an associated metadata record. If every page’s metadata record was exported to an aggregator, there would be lots of extraneous metadata records. The aggregator will probably only be interested in metadata at the top level of the book, unless the individual pages hold significant information. Metadata creators should also ensure that metadata relevant at only a local level is not exported to a service provider. Many times, this type of metadata is called administrative metadata, and contains information about local digitization equipment and practices, when the resource was digitized, and other information. Although this metadata may be valuable at a local level, it also creates lots of noise in the aggregated environment. Another important aspect of content is awareness of the influence of controlled vocabulary in the aggregated environment. Even though your collection may be using a controlled vocabulary, as soon as metadata leaves your environment and co-mingles with other metadata records, there is no longer a controlled vocabulary for all records. For example, if collection X uses LCSH for subjects, but collection Y uses AAT for subjects, when the records are mixed in an aggregated portal, the benefits of a single controlled vocabulary will be lost. Subject terms from multiple thesauri will be searchable by users. As linked data becomes more common, URIs identifying the controlled vocabulary will travel with the record, and aggregators will be able to identify the multiple controlled vocabularies in their portals.
Finally, communicating with service providers is an essential step in the sharing process. This can be as simple as posting your metadata best practices, use of controlled vocabularies, and encoding standards on your website so that service providers can understand how your records are created. If this is documented and accessible, they can easily make changes to optimize your metadata for the highest level of interoperability within their portal.
Sharing records also requires that your institution follows and conforms to national standards so that aggregators have access to, and can understand, your records. Data structure, data content, controlled vocabularies, and technical standards for encoding and transmitting are essential parts of interoperable metadata.
Appropriate data structure ensures that all repositories are using commonly understood fields and field definitions. Examples of data structure include Dublin Core, MARC and MARCXML, EAD, VRA Core 4.0, and CDWA. Data content standards ensure that all text strings are created consistently across multiple repositories. Examples include AACR2, CCO, and DACS. Controlled vocabulary and encoding standards are helpful to ensure consistency within your own collection, but, as mentioned above, controlled vocabulary may lose significance in an aggregated environment. Examples of controlled vocabulary include LCSH, AAT, LCNAF, and Dublin Core type vocabulary. Encoding standards include ISO 8601 (YYYY-MM-DD) for encoding dates, and ISO 639–2 (three-letter codes) for encoding languages.
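A small sketch of these encoding standards in practice, using Python’s standard library (the language mapping shown is a tiny illustrative sample, not the full ISO 639-2 code list):

```python
from datetime import date

# ISO 8601 calendar dates use the YYYY-MM-DD pattern.
iso_date = date(1912, 1, 6).isoformat()   # '1912-01-06'

# ISO 639-2 assigns three-letter codes to languages.
iso_639_2 = {"English": "eng", "Spanish": "spa", "Navajo": "nav"}
```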
Technical standards ensure that metadata can be transmitted and that machines can understand certain fields for optimum indexing. Examples of technical standards include XML, OAI-PMH, and date encoding standards.
Steven Miller identifies five ways to improve metadata interoperability (Miller, 2011, p. 245):
In your local institution, you might create local metadata fields, or have legacy records created with metadata fields specific to your institution and/or project. While this is appropriate in a local context, you will need to turn these records into metadata records following a common standard, such as Dublin Core, VRA Core, or CDWA. You might also need to translate records created with one of these standards into another standard. The process of doing this is called mapping or crosswalking. Although these two terms are often used interchangeably, mapping is often associated with determining how fields in one metadata schema are related to fields in another metadata schema, while crosswalking is running individual records through stylesheets to transform them into another metadata standard. Also, official tables showing relationships between metadata schemas, often maintained by an agency such as the Library of Congress, are referred to as crosswalks. In the Semantic Web, different approaches are possible due to the uncoupling of individual statements from an entire record. Maps showing the relationships between properties may become more useful than the current method of the best match between element definitions in different schemas (Dunsire et al., 2011, p. 6).
When creating local metadata fields, be sure to consider how these fields will map to common standards (Han et al., 2009, p. 235). At the very least, your local field definitions should map to Dublin Core fields. Use local value strings rather than local field naming conventions, and consider granularity when creating field names. For example, you may be tempted to create individual metadata fields for a journal citation in an electronic repository.
Obviously, this is meaningless in an aggregated environment, and does not follow the best practices for sharing mentioned above. Instead of using local field naming conventions to identify the location, use local value strings, such as “Volume 23, no. 1, pages 14–20.” When mapping to the Dublin Core Source field, this will appear as:
If the local fields are needed for internal processing, be sure that when mapping to Dublin Core the entire citation information is mapped to the source field. This follows our best practices and is meaningful to a user who finds the record in an aggregated database.
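A sketch of this mapping, assuming hypothetical local field names (volume, issue, pages) for the journal citation:

```python
# Local record with illustrative field names; only the combined value
# string is exported, mapped to the Dublin Core source field.
local_record = {"volume": "23", "issue": "1", "pages": "14-20"}

def to_dc_source(record):
    """Collapse local citation fields into a single dc:source value."""
    return "Volume {volume}, no. {issue}, pages {pages}".format(**record)
```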
Additionally, you should determine which fields are local and only helpful in the local environment. For example, many institutions include administrative information in their records, such as the equipment used for the digitization project. While this may be useful locally, in an aggregated environment it isn’t helpful, and creates additional noise in the results.
If you need to create more metadata fields outside the standard you’re using for your records (many times this is done to record local information), you should take field names from other definitions. This will ensure that you’re using a common element, and will make mapping to other schemas much easier. For example, if you want to record the nationality of an artist, but the Dublin Core standard you’re following doesn’t support this level of description, you can look instead to VRA Core 4.0 to see how they encode information about nationality.
If you’re interested in more information about crosswalking, the Getty Institute maintains a Metadata Standards Crosswalk showing how elements in each standard relate to each other at http://www.getty.edu/research/publications/electronic_publications/intrometadata/crosswalks.html.
You may have previously heard the phrase “massaging the metadata.” Metadata librarians often use this phrase when they transform metadata from one scheme to another. Mapping is the intellectual activity of deciding how metadata elements in different schemas relate to each other, but the metadata must go through some transformation to make this happen. Most of the time, this is done using XSLT, a language for transforming XML documents into XHTML or other XML documents. This is useful for mapping between metadata schemas or for displaying XML in a viewer-friendly way. In addition to XSLT, a language called XPath is used to navigate through an XML document and point to a specific XML element or attribute.
XSLT is written in XML, and starts with <xsl:stylesheet> or <xsl:transform>. The XSLT namespace resides at http://www.w3.org/1999/XSL/Transform. There are lots of XSLT tutorials online, if you’re interested in gaining more hands-on experience with XSLT. A good place to start is the w3schools.com XSLT Tutorial.
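Running a full XSLT transform requires an XSLT processor, but the XPath idea can be tried directly: Python’s standard xml.etree module supports a limited XPath subset for pointing at elements. A minimal sketch using an invented oai_dc record:

```python
import xml.etree.ElementTree as ET

# An invented Simple Dublin Core record wrapped in the oai_dc container.
record = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Wright's Trading Post, Albuquerque, NM</dc:title>
  <dc:type>image</dc:type>
</oai_dc:dc>"""

ns = {"dc": "http://purl.org/dc/elements/1.1/"}
root = ET.fromstring(record)

# XPath-style navigation: the first dc:title anywhere under the root.
title = root.find(".//dc:title", ns).text
```

Mapping to another schema is then a matter of reading values with paths like this and writing them into the target structure, which is exactly what an XSLT stylesheet automates.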
OAI-PMH was developed in 1999 as a low-barrier technology intended to help users find e-prints located in various repositories. OAI-PMH uses a harvested approach rather than the federated approach of metasearch. In a harvested approach, metadata is collected from selected databases and combined into one place, allowing users to search across metadata produced at different institutions in a single database and interface. Users of OAI-PMH are either data providers or service providers. Data providers create metadata and expose it to harvesters. Service providers return metadata records to the users and include links back to the original repository, or data provider. In this approach, content still resides in the original repository and service providers point back to it.
OAI-PMH is based on the HyperText Transfer Protocol (HTTP) and XML. This means that XML metadata records are transferred across the internet we’re all familiar with, and the XML records can be viewed on a browser.
OAI-PMH requires that all metadata be exposed in Simple Dublin Core as a minimum, using the oai_dc XML schema. OAI-PMH also supports the exposure of other formats. Because of this, records for individual resources may be represented in several different metadata records and “languages,” such as Dublin Core and VRA Core, but the identifiers all point back to the same resource.
OAI-PMH starts with a base URL of the repository. In an OAI request, the base URL is followed by a question mark and one of six verbs. The six OAI verbs are: Identify, ListSets, ListMetadataFormats, GetRecord, ListRecords, and ListIdentifiers. For example, the New Mexico Digital Collections (http://econtent.unm.edu/) has an OAI-PMH base URL of http://econtent.unm.edu:81/cgi-bin/oai.exe. If you enter this URL into a web browser, an XML record will be returned that says you have an “Illegal OAI verb.” This is because you haven’t specified a verb yet. One of the easiest requests is for identification information about the repository. You can request this by adding “?verb=Identify” to the end of the URL (http://econtent.unm.edu:81/cgi-bin/oai.exe?verb=Identify).
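The base-URL-plus-verb pattern can be sketched as a small helper (the function is illustrative; the base URL is the one given above):

```python
from urllib.parse import urlencode

BASE_URL = "http://econtent.unm.edu:81/cgi-bin/oai.exe"

def oai_request(verb, **arguments):
    """Build an OAI-PMH request: base URL, '?', the verb, then arguments."""
    return BASE_URL + "?" + urlencode({"verb": verb, **arguments})

identify_url = oai_request("Identify")
record_url = oai_request("GetRecord",
                         identifier="oai:econtent.unm.edu:abqmuseum/0",
                         metadataPrefix="oai_dc")
```

Note that urlencode percent-encodes the colons and slash in the identifier, which OAI-PMH requires for argument values.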
1. Identify. Provides a description of the archive. (Example: http://econtent.unm.edu:81/cgi-bin/oai.exe?verb=Identify)
2. ListMetadataFormats. Provides the types of metadata formats that records in the archive use. (Example: http://econtent.unm.edu:81/cgi-bin/oai.exe?verb=ListMetadataFormats)
The next verb may include a ResumptionToken. A ResumptionToken is used when many XML records are returned in a list, and the data provider’s server may get overwhelmed by providing all of the records at the same time. Data providers can determine their own limit before providing a ResumptionToken. Generally 200–500 records are returned before a ResumptionToken is listed. ResumptionTokens are listed at the end of the results list and can be added to the original verb request. An example of a request with a resumption token is under #5 (ListRecords).
3. ListSets. Provides the structure of the archive/repository. In most repositories, like ContentDM and DSpace, sets correspond to collections, although this doesn’t have to be the case. (Example: http://econtent.unm.edu:81/cgi-bin/oai.exe?verb=ListSets. All sets are returned in this example, and no ResumptionToken is needed.)
The next verb requires use of a metadataPrefix. As mentioned previously, OAI-PMH requires use of Dublin Core, but supports metadata in other formats. As long as each resource is described using Dublin Core, a data provider can also provide additional metadata records for the same resource in a different metadata language. For example, a resource could be described using both Dublin Core and VRA Core records. Because of this, when you request a record or records from a repository, you must specify which metadata type you’d like the record to be returned in.
In addition to the metadataPrefix, the following verb also requires use of an OAI identifier. The OAI identifier is not the resource identifier (e.g. dc:identifier). Instead, the OAI identifier identifies the XML record. You can find it in the header of the XML record.
4. GetRecord. Provides individual records (based on OAI identifier). Requires MetadataPrefix and Identifier. (Example: http://econtent.unm.edu:81/cgi-bin/oai.exe?verb=GetRecord&identifier=oai:econtent.unm.edu:abqmuseum/0&metadataPrefix=oai_dc)
From and Until let you specify a date range based on when the records were created or last modified. This is taken from the <datestamp> field in the XML header. Note that it is different from the dc:date field. The datestamp is recorded in YYYY-MM-DD format, and any level of granularity is accepted. If you would like all records with 2009 datestamps, you can use from=2009-01-01 and until=2009-12-31.
5. ListRecords. Returns complete records (number limited by data provider’s repository; additional records returned by ResumptionToken). Optional Set, From and Until. May include ResumptionToken. (Examples: http://econtent.unm.edu:81/cgi-bin/oai.exe?verb=ListRecords&metadataPrefix=oai_dc; http://econtent.unm.edu:81/cgi-bin/oai.exe?verb=ListRecords&resumptionToken=:::oai_dc:1000; http://econtent.unm.edu:81/cgi-bin/oai.exe?verb=ListRecords&metadataPrefix=oai_dc&from=2007-01-01&until=2009-12-31.)
6. ListIdentifiers. Provides a list of OAI identifiers. Requires MetadataPrefix. Optional Set, From and Until. May include ResumptionToken. (Example: http://econtent.unm.edu:81/cgi-bin/oai.exe?verb=ListIdentifiers&metadataPrefix=oai_dc.)
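Put together, selective harvesting with ListRecords reduces to a simple loop: issue the request (optionally with set, from, and until), collect the records, and re-issue with only the resumptionToken until none is returned. A sketch with the network fetch abstracted away; the fetch callable and the parsed-response shape are assumptions for illustration:

```python
def harvest(fetch, metadata_prefix="oai_dc", **selective):
    """Yield records from ListRecords responses, following resumptionTokens.

    `fetch` is any callable that takes an OAI argument dict and returns a
    parsed response: {"records": [...], "resumptionToken": str or None}.
    `selective` may include set, from_ and until ('from' is a Python
    keyword, so it is accepted as 'from_' and renamed).
    """
    args = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    args.update({key.rstrip("_"): value for key, value in selective.items()})
    while True:
        response = fetch(args)
        yield from response["records"]
        token = response.get("resumptionToken")
        if not token:
            break
        # A resumed request carries only the verb and the token.
        args = {"verb": "ListRecords", "resumptionToken": token}
```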
You can experiment with any open data provider. Many institutions open their ContentDM servers to harvesters. If you know of a ContentDM digital collection, you can include cgi-bin/oai.exe after the home URL to find the OAI-PMH base URL, in most cases. For example, New Mexico’s Digital Collections (http://econtent.unm.edu) uses ContentDM, and the OAI-PMH base URL is http://econtent.unm.edu:81/cgi-bin/oai.exe (notice the addition of “:81” to the URL). From this URL, you can construct a request using the verbs above, for example, http://econtent.unm.edu:81/cgi-bin/oai.exe?verb=Identify. Some data providers do not share their metadata with harvesters, and their repository will not respond to an OAI request. Other data providers do not have a predictable OAI-PMH base URL, and you may need to email them and ask for it. More information about OAI-PMH can be found in Using the Open Archives Initiative Protocol for Metadata Harvesting (Cole and Foulonneau, 2007).
Although OAI-PMH was designed as a low-barrier method to sharing metadata records, many repositories choose not to share their records. Some of the reasons these repositories cite include a technical infrastructure that does not support OAI-PMH, metadata in an unshareable state, and the closed culture of various institutions.
Many institutions build their own digital object repositories without knowing how easy it is to share metadata through OAI-PMH. A wealth of information exists on the Open Archives website (http://www.openarchives.org/pmh/), and can be useful for explaining how to expose your institution’s metadata through OAI-PMH.
Some institutions do not design their metadata elements with shareability in mind, and, as a result, do not feel ready to share their metadata with other institutions. Although this may be true of your metadata, it might be worth the effort of re-examining the metadata with shareability in mind, to determine what changes need to be made to bring it up to standard. Increased visibility of your metadata, and findability through other services, will bring more visitors to your site and make users more aware of your collections. You may even reach new users.
Additionally, libraries, archives, and museums all come from different backgrounds. Libraries generally share non-unique copies of books, and, as a result, cataloging has been seen as a community effort, especially when technology can help the process. However, archives and museums are used to dealing with unique resources, and cooperative cataloging isn’t useful in this environment. The unique resources available through these institutions require individual cataloging, and metadata is institution specific. Now that resources are discoverable through the internet, there is no reason for any cultural heritage institution not to share metadata with potential users.
High quality metadata is important in the local environment because it helps your users find your resources. But the power of metadata can really be harnessed in the larger environment. Your collection may hold several thousand resources, but across the world, cultural heritage institutions hold billions of resources. Ensuring that your metadata is interoperable in this context will help researchers find information they need, and bring more attention to your collection.
In this book, we’ve given you a basic understanding of the theory and practice of metadata, examined Dublin Core, CDWA, VRA Core 4.0, and EAD, and presented a larger picture of how all of these metadata standards can fit together. Metadata creation may initially seem like an overwhelming task with different standards and vocabularies to choose from, but, as soon as you narrow your choices and see how each standard fits in your local environment, the correct path will become more obvious. Well-designed metadata can help your users find the information they need, understand the context of the resource and collection, and provide information about how to use the resource.
OAI for Beginners – the Open Archives Forum online tutorial http://www.oaforum.org/tutorial/
W3Schools.com. XSLT Tutorial http://www.w3schools.com/xsl/default.asp
First two description fields do not describe the image. Other description fields describe formats of digital files, and scanning specifications. This information, while helpful in a local environment, will create noise in an aggregated environment.
1. Use the Identify verb. For example, http://econtent.unm.edu:81/cgi-bin/oai.exe?verb=Identify.
2. Use the ListRecords verb with metadataPrefix, from, and until. For example: http://econtent.unm.edu:81/cgi-bin/oai.exe?verb=ListRecords&metadataPrefix=oai_dc&from=2012-01-01&until=2012-06-30.
3. Use ListSets. For example: http://econtent.unm.edu:81/cgi-bin/oai.exe?verb=ListSets.