Features of the IDR: A techno status report

The purpose of this posting is to share with the institutional repository community some of the ways we have been "hacking" DSpace, ETD-db, and DigiTool for the sake of open access publishing. Many of the things I outlined a number of months ago are coming to fruition.

DSpace, ETD-db, and DigiTool are institutional repository-like applications. Each of these applications has its own strengths and weaknesses, but the biggest problems facing their implementation here at Notre Dame include:

  • they are "applications" not libraries/modules
  • they will always have a particular look & feel
  • they operate as "information silos"
  • they implement specific searching/browsing interfaces

To overcome these issues, as well as in an effort to improve upon the functionality of an institutional repository, we here at Notre Dame have hacked (and we mean that in a good way) each of these applications to create something different. Specifically, we have exploited OAI-PMH to first harvest and cache the content from each application, and then we provide services against the cache. Some of these processes are itemized and described below.

Be forewarned. The things listed below are implemented in a "sandbox". Response times will be slow, the links will change, and your mileage will vary.


0. Enhanced Dublin Core - For better or for worse, we created sets of facet/term combinations (think "subjects") and inserted them into DSpace fields. When we harvest the content we note its physical structure, parse it accordingly, and cache it in a specific place. A good record to see how some of this has been implemented is here:

http://dspace.library.nd.edu:8080/dspace/handle/2305/142?mode=full&submit_simple=Show+full+item+record


1. Harvest & cache - Each application has an OAI-PMH interface. We use Net::OAI::Harvester to harvest the Dublin Core metadata from each of the applications and cache the content in a (MyLibrary) database. As the content is retrieved we do a (tiny) bit of normalization, but we also supplement the metadata with facet/term combinations describing where the content came from, what format it is, and some sort of subject/descriptor.
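
The harvest-and-supplement step can be sketched roughly as follows. Our own harvester is written with Net::OAI::Harvester; the Python below is only an illustration. The sample response, the harvest() and dc_values() functions, and the facet names "source" and "format" are all made up for this example, and it parses a canned response rather than calling a live OAI-PMH interface.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# A hand-written sample response (hypothetical record). A real harvest would
# fetch <base URL>?verb=ListRecords&metadataPrefix=oai_dc instead.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:123</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>  A Sample   Thesis </dc:title>
          <dc:creator>Doe, Jane</dc:creator>
          <dc:format>application/pdf</dc:format>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def dc_values(dc, name):
    """Return every dc:<name> value with runs of whitespace collapsed."""
    return [" ".join(e.text.split()) for e in dc.iterfind(f"dc:{name}", NS) if e.text]


def harvest(xml_text, source):
    """Turn a ListRecords response into a list of cache-ready dictionaries."""
    root = ET.fromstring(xml_text)
    cache = []
    for record in root.iterfind(".//oai:record", NS):
        dc = record.find("oai:metadata/oai_dc:dc", NS)
        if dc is None:  # deleted records carry a header but no metadata
            continue
        # Supplement the metadata with facet/term pairs: where the record
        # came from and what format it is.
        facets = [("source", source)]
        facets += [("format", f) for f in dc_values(dc, "format")]
        cache.append({
            "id": record.findtext("oai:header/oai:identifier", namespaces=NS),
            "title": dc_values(dc, "title"),
            "creator": dc_values(dc, "creator"),
            "facets": facets,
        })
    return cache


records = harvest(SAMPLE, "ETD-db")
```

In this sketch the "cache" is just a list in memory. In our implementation the records and their facet/term combinations go into the (MyLibrary) database instead.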

2. Name authority - For a sub-set of our implementation, the Excellent Undergraduate Research, we needed to include photographs and short biographies of students in browsable displays. To accomplish this we first created facet/term combinations (name authorities) in our database. These authorities defined a key which pointed to a directory containing a JPEG image and more detailed information regarding the author. By combining these facet/terms, the files in the directory, and the records in DSpace we are able to create a browsable list of authors complete with pictures and bios. For example, see:

  http://dewey.library.nd.edu/morgan/idr/undergrad/?cmd=authors

3. Standards-based search - Each application provides search, but not in a "standard" way, nor against each other. By indexing our cache and providing an SRU interface to the index we can overcome these problems. Moreover, using such an approach we can swap out our indexer at will. We began using swish-e as our indexer. We moved to Plucene (slow), and we will probably switch again. Fun:

  http://dewey.library.nd.edu/morgan/idr/sru/client.html
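
Because SRU is nothing more than a set of agreed-upon URL parameters, a client does not need to know which indexer sits behind the interface. A minimal sketch of building a searchRetrieve request follows. The parameter names come from the SRU 1.1 specification, but the base URL, the sru_url() function, and the sample query are placeholders and not our actual endpoint.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def sru_url(base_url, cql_query, start=1, maximum=10):
    """Build an SRU 1.1 searchRetrieve URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,          # a CQL query string
        "startRecord": start,        # SRU counts records from 1
        "maximumRecords": maximum,
    }
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint; substitute the address of a real SRU server.
url = sru_url("http://example.org/sru", 'dc.title = "open access"', maximum=5)

# The query string round-trips through standard URL parsing.
parsed = parse_qs(urlsplit(url).query)
```

The server answers with XML no matter which indexer is behind it, which is what makes swapping indexers painless for clients.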

4. Syndicating content with "widgets" - The name of the game is, "Put your content where the users are; don't expect the users to come to your site." We created a number of one-line JavaScript "widgets" allowing Web masters to insert content from our institutional repository into their pages. For example, the Aerospace Engineering department might want to list the recently "published" theses/dissertations, or an author might want to do something similar. See:

  http://dewey.library.nd.edu/morgan/idr/widgets/widget-03.html
  http://dewey.library.nd.edu/morgan/idr/etd/widgets/widget-03.html
  http://dewey.library.nd.edu/morgan/idr/etd/?cmd=widgets
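
One way a server-side script could produce such a widget is sketched below, assuming the widget is delivered as a small piece of JavaScript that writes an HTML list into the host page. The render_widget() function and the sample item are invented for illustration and are not our production code.

```python
import html
import json


def render_widget(items):
    """Return JavaScript that writes a linked list of items into the host page.

    items is a sequence of (title, url) pairs, for example the most recently
    "published" theses from the cache.
    """
    rows = "".join(
        '<li><a href="{}">{}</a></li>'.format(
            html.escape(url, quote=True), html.escape(title)
        )
        for title, url in items
    )
    markup = "<ul>" + rows + "</ul>"
    # json.dumps yields a valid JavaScript string literal. Escaping "</" keeps
    # the literal from closing the host page's <script> element early.
    literal = json.dumps(markup).replace("</", "<\\/")
    return "document.write(" + literal + ");"


# Hypothetical item, for illustration only.
js = render_widget([("Wings & Things", "http://example.org/etd/1")])
```

A Web master would then need only one line in a page, a script element whose src attribute points at the address serving this output.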

5. Pseudo peer-review - Faculty want to be recognized (and cited) by their peers. Right now the standard for this practice is publication in peer-reviewed journals and citation counts from ISI. Google is a trusted resource. ("If you don't believe me, then why do you use it so frequently?") As more and more content is made available on the Web, things like Google PageRank and links from remote documents may supplement peer-review and citation counts. Using Google APIs it is possible to retrieve a PageRank and lists of linking documents, and this has been implemented against some of the browsable lists in our ETD collection. See:

  http://dewey.library.nd.edu/morgan/idr/etd/?cmd=term&id=25
  http://dewey.library.nd.edu/morgan/idr/etd/?cmd=term&id=50

Unfortunately, all of our documents have PageRanks of 0, 3, or 4, and all of the linking documents come from within our own domain. Over time I think this may change.

6. Thumbnail browsing - We are using DigiTool to store images for art history classes. Not only is this content fraught with copyright issues, but the content is based on pictures, not words. As content is harvested from DigiTool's (non-standard) OAI-PMH interface we are able to "calculate" the location of thumbnail images on the remote server. Consequently we are able to provide (standard) search interfaces to the content and display pictures of hits as well as descriptions.

7. Batch loading - One problem with institutional repositories is getting content. It is hard to acquire and hard to key in. To get around this problem we searched our bibliographic indexes for things written by local authors. We saved these records to EndNote, a bibliographic citation manager. We then exported the EndNote content as an XML file, parsed the file, and saved the results in directories importable by DSpace. After importing we were able to cache the content and provide browsable interfaces against it. Using this technique it took two people less than one day to import more than 600 citations. For example:

  http://dewey.library.nd.edu/morgan/idr/?cmd=term&id=43331
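
The parse-and-save step might look something like the sketch below. It assumes a simplified EndNote-style XML export (real exports carry many more fields) and writes one directory per citation in DSpace's simple archive format: a dublin_core.xml file plus a contents file. The sample record, the function names, the item_NNN directory names, and the choice of Dublin Core fields are all assumptions made for this example.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# A simplified, hand-written EndNote-style export (hypothetical record).
SAMPLE = """<xml><records>
  <record>
    <contributors><authors><author>Doe, Jane</author></authors></contributors>
    <titles><title>A Sample Article</title></titles>
    <dates><year>2005</year></dates>
  </record>
</records></xml>"""


def text_of(element):
    """Collect an element's text, including text nested in child elements."""
    return " ".join("".join(element.itertext()).split())


def to_dublin_core(record):
    """Map one citation record onto DSpace's dublin_core.xml structure."""
    dc = ET.Element("dublin_core")

    def add(element, qualifier, value):
        node = ET.SubElement(dc, "dcvalue", {"element": element, "qualifier": qualifier})
        node.text = value

    for title in record.iterfind("titles/title"):
        add("title", "none", text_of(title))
    for author in record.iterfind("contributors/authors/author"):
        add("contributor", "author", text_of(author))
    for year in record.iterfind("dates/year"):
        add("date", "issued", text_of(year))
    return dc


def write_items(xml_text, out_dir):
    """Write one importable item directory per record; return the paths."""
    root = ET.fromstring(xml_text)
    paths = []
    for number, record in enumerate(root.iterfind("records/record"), start=1):
        item_dir = os.path.join(out_dir, f"item_{number:03d}")
        os.makedirs(item_dir, exist_ok=True)
        ET.ElementTree(to_dublin_core(record)).write(
            os.path.join(item_dir, "dublin_core.xml"),
            encoding="utf-8",
            xml_declaration=True,
        )
        # Citation-only items have no files to attach, so contents stays empty.
        open(os.path.join(item_dir, "contents"), "w").close()
        paths.append(item_dir)
    return paths


with tempfile.TemporaryDirectory() as tmp:
    paths = write_items(SAMPLE, tmp)
    written = ET.parse(os.path.join(paths[0], "dublin_core.xml")).getroot()
    fields = [(v.get("element"), v.get("qualifier"), v.text) for v in written]
    has_contents = os.path.exists(os.path.join(paths[0], "contents"))
```

The resulting directories can then be handed to DSpace's batch importer.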

There are a few other "kewl" features of our implementation, but the ones outlined above give you the gist of where we are going and what is possible as long as you have the data. "I don't need the interface; just give me the data." Our experiments have not been 100 percent successful. We still have some problems with

  • controlled vocabularies,
  • normalization,
  • scalability,
  • getting content in the first place,
  • priority setting, and
  • links to the "real" content as opposed to "splash" screens.

Despite these issues, we believe things are moving forward and in the right direction.

Fun with institutional repositories.

 
