We had another fantastic conference at Code4Lib this year. The conference seemed better organized, with a larger attendance than last year; we basically filled the main conference hall at the Georgia Center to capacity. And, with very few exceptions, the keynote speakers and the other presentations were very good, if not excellent. Indexing was a strong theme at this year's conference, along with the theme of "the OPAC as we know it is dead". A few people I talked with commented that while all of the presentations were very good, and the projects being talked about are great, we aren't really sharing either the knowledge of how to implement things or the workload between institutions, and we should be. So, while NCSU, CDL, MIT, and Princeton are doing wonderful things, they haven't really opened up their software development to the community. It isn't that the desire is to keep projects bottled up as much as it is a lack of experience doing development with an open source mentality.
Lucene and SOLR
The preconference workshop was an introduction to using the Apache software SOLR as a web services front end to the ubiquitous Lucene indexing software. Erik Hatcher of "Lucene in Action" fame was the facilitator, and he is chock full of anecdotes and technology related tales of success and woe. This was basically an 8 hour workshop designed with a "theory" portion and a "hands on" portion. Given that the crowd was very mixed, with some folks who knew Lucene inside and out and others who were doing indexing for the first time, the workshop was probably handled in the best way possible. But, because there were so many questions at different levels of expertise, along with a lot of "show and tell" from folks who were using Lucene and SOLR in production applications, there was very little step by step instruction. A lot of knowledge was assumed, and a lot of communication had taken place on the code4lib listserv prior to the conference to get folks up to speed, so perhaps the more introductory parts had happened even before the workshop began. This was a little disappointing because I didn't have a lot of time prior to the conference to scan the listserv, and absolutely no time to participate on IRC to see what was being accomplished.
During the first half of the workshop, Erik spent much of the time discussing the infrastructure of SOLR and its configuration files (referring to pre-workshop materials he had passed out electronically in the prior weeks), and then related that to a Ruby on Rails application he had built called "Flare". The Ruby application was the front end to the middleware of SOLR interfacing with Lucene, and for demonstration purposes, Erik had indexed over 3.7 million bib records from his institution's catalog. The Flare application communicates with SOLR to produce a front end display which allows for faceted browsing and searching of the records. Erik had also built some other nice features into Flare that allow the user to keep track of previous browses, refine browsing, and perform searching. SOLR actually does the analysis of the data fed to it, and can return XML data organized into facets, perform keyword or fielded searching, etc.
The mechanism for working with SOLR is pretty simple. Data, in whatever format, needs first to be converted into a form of XML that is compatible with SOLR. The SOLR configuration files need to be tweaked so that SOLR knows which fields to index, which fields are appropriate for keyword searching, and which fields refer to faceted vocabulary. If very large numbers of records need to be indexed, then the memory and caching parameters of SOLR need to be tweaked as well. All of this takes place inside XML configuration files. Once the configuration has been completed, and the program to convert the data into the appropriate XML format is in place, one can either feed the XML files to SOLR over an HTTP connection or tell SOLR to index a set of files on the drive.
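To make the conversion step a little more concrete, here is a minimal sketch in Python of turning records into the `<add><doc><field>` XML update message that SOLR accepts. This is not what was shown at the workshop: the function name `records_to_solr_add_xml` and the field names (`id`, `title`, `subject`) are my own placeholders, and real field names would have to match whatever is declared in the SOLR schema configuration.

```python
import xml.etree.ElementTree as ET


def records_to_solr_add_xml(records):
    """Serialize a list of dicts as one SOLR <add> update message.

    Each dict becomes a <doc>; each key/value becomes a <field name="...">.
    The field names are whatever the caller passes in (hypothetical here).
    """
    add = ET.Element("add")
    for record in records:
        doc = ET.SubElement(add, "doc")
        for name, value in record.items():
            # A list value is written as repeated fields (a multi-valued field).
            values = value if isinstance(value, list) else [value]
            for v in values:
                field = ET.SubElement(doc, "field", name=name)
                field.text = str(v)
    return ET.tostring(add, encoding="unicode")


# Hypothetical sample record, for illustration only.
sample = [
    {
        "id": "bib-0001",
        "title": "Lucene in Action",
        "subject": ["Indexing", "Search engines"],
    }
]
solr_xml = records_to_solr_add_xml(sample)
print(solr_xml)
```

The resulting string would then be sent to SOLR's update handler over HTTP, followed by a commit; the exact URL depends on how the servlet container is set up, so I have left the network call out of the sketch.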
Once the data has been indexed in Lucene via SOLR, web based queries can be sent to SOLR to retrieve data sets according to a set of criteria, just like any web based CGI application. SOLR is a Java application, however, and so it needs to run in a servlet container like Tomcat. SOLR, because of Lucene, is able to add and update records during production without the need to completely rebuild the index. A unique record identifier is required to update or delete records. It was also pointed out that SOLR itself should run in a protected environment such that only certain applications have access to it; otherwise anyone could instruct SOLR to add, delete or modify records.
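As a rough illustration of what those web based requests look like, the following Python sketch builds a faceted search URL and a delete-by-identifier message. The base URL is a placeholder for wherever SOLR happens to be deployed, and the helper names and the `subject` field are hypothetical; `q`, `rows`, `facet` and `facet.field` are standard SOLR request parameters.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Assumed location of the SOLR web application; adjust for your container.
SOLR_BASE = "http://localhost:8983/solr"


def build_select_url(query, facet_fields=(), rows=10):
    """Build a SOLR search URL, optionally asking for facet counts."""
    params = [("q", query), ("rows", rows)]
    if facet_fields:
        params.append(("facet", "true"))
        # One facet.field parameter per field we want counts for.
        params.extend(("facet.field", f) for f in facet_fields)
    return SOLR_BASE + "/select?" + urlencode(params)


def build_delete_xml(record_id):
    """Build the XML message that deletes one record by its unique id."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "id").text = record_id
    return ET.tostring(delete, encoding="unicode")


url = build_select_url("title:lucene", facet_fields=["subject"])
delete_xml = build_delete_xml("bib-0001")
print(url)
print(delete_xml)
```

Because the delete message is just another HTTP request, this also shows why SOLR should sit behind a firewall or a trusted application: anything that can reach the update handler can remove records.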
While SOLR is very efficient, and is fully featured as an indexing front end, the one weakness that I perceive is the lack of open standards support. For example, SOLR does not support SRW/SRU out of the box. To add standards based web services to SOLR, another layer would need to be written on top of SOLR, or SOLR itself would need to be enhanced to provide that functionality. This is rather disappointing, but it doesn't completely rule out using SOLR as an indexer. It looks like we can quickly prototype applications which need a robust indexer with SOLR, and use it for production applications that do not immediately require a standards based web service interface. Perhaps some SOLR experimentation is in order.
State of Emergency
Karen Schneider of Florida State University gave a very good talk as the opening keynote speaker regarding the relationship between librarianship, technology, and the control that we have over our collections in the digital age. I thought her talk was provocative in a positive sense. She began by outlining the current state of affairs in libraries and how that has redefined what it means to be a librarian. Leading indicators include the fact that in many respects we have given away our collections to vendors. This includes our serials content and in some instances even our monographic content. It also includes the fact that we usually don't build or manage our own tools of the trade. Yet, we still function like a monopoly when the competition is thriving under our noses. I assume that she was referring to online tools such as Google Scholar or even Amazon.
Given this state of affairs, Karen reminded us that as librarians, we exist as the "memory keepers" for our civilization. Google, Amazon, Flickr and other commercial sites do not exist for the greater good. She pointed out that Google's philosophy is "don't be evil", but they don't define the good either. As librarians, we promote the good by collecting and preserving the memories of our culture so that future generations will always have access to that cultural heritage. Commercial vendors are not in the business of preserving culture because it isn't in their best interest. They are not altruistic, they are capitalistic. And so, what can we do about this situation?
Karen mentioned five things that she thinks we can fix: digital preservation, standards adoption, the state of the OPAC, the domination of vendors over library tools, and scholarly awareness of key issues regarding the library. On all of these issues, Karen is optimistic. There are signs that the impetus and the skills required to make a positive change are taking shape. For example, the PINES Evergreen project in Georgia is a success story, along with Ross Singer's Umlaut project, the Scriblio web site project, and the Apache SOLR application being demonstrated at the conference. In other words, she believes that we are at the beginning of a renaissance of library software development which is beginning to shift the "balance of power" from vendors back to libraries and to reset the direction of the profession as a whole, putting the emphasis back on the core of librarianship. However, Karen warns that this message needs to be emphasized soon and forcefully. Another sign of hope that she sees is that we are beginning to take tentative steps to decouple the monolithic applications into components and to incorporate data re-use into our development.
The picture isn't completely rosy, though. There are some "worrisome trends" and overgeneralizations. Those trends include the apathy toward open source software, the apathy toward true standards adoption, lack of emphasis on usability, and lack of publicity for successful projects such as Evergreen / PINES. Library administrations have legitimate, but sometimes unfounded, concerns such as the unfinished nature of open source software ("Is it fully baked?"), support issues, spending staff time, environmental scans, and overall cost. Karen gave her insights as an administrator into how administrators think, and how best to present these issues to skeptical administrators. Education and facts are key if a productive discussion is to take place. But, in the end, given the trends in library culture, she recommends that every library, no matter what size, should have at least one developer on staff. Given the lack of polish on many vendor products these days, and even for those that are "fully baked" (Endeca), it seems imperative that more developers become a regular mainstay of library staff, if for nothing else than to take full advantage of the vendor supplied software.
MyResearch Portal
An example of independent software development was next to take the podium. Andrew Nagy from Villanova University presented a project that they are working on using Java, XML, XSLT (with XQuery) and a database backend. The application is designed to be an OPAC-like front end to facilitate the research needs of the school. Nagy expressed the general dissatisfaction that they had with vendor supplied products and the desire to build an "ILS agnostic" web interface which could be used to access any of the resources that the library provides access to. The interface has faceting capabilities and they are building in new bells and whistles as they go. Early on, when they were using Berkeley DB as a back end, they ran into some severe scalability issues. Later they also experimented with SleepyCat and Oracle. And, after many optimizations, they were getting individual queries down to 0.9 seconds, but there is still an overload if they have upwards of 50 simultaneous users. Now, they are considering SOLR as an indexing back end. A question was asked regarding whether or not they had considered making their interface standards based, and the only reply was that it was an issue to be considered. Personally, I thought this was a pertinent question if we are really serious about decoupling services from the current monolithic ILS.
Free the Data / Smart Subjects
Emily Lynema from NCSU, building on what Tito Sierra has presented at other conferences, talked a bit about the hows and whys of what NCSU has done with their catalog data in order to separate it from the ILS. Emily emphasized the importance of being able to use the data that the library has spent many years building up in order to create specialized services on the web front end, making creative use of XML and web services such that the presentation of data to the user is much more flexible. She demonstrated the incorporation of ISBN REST requests, an RSS layer, and OpenSearch integration with the library's Quick Search functionality as well as faceted browsing. One of their concerns was to attempt to highlight as many paths to information as possible, making sure that the user was aware of all of their options right up front. Finally, they wanted to be able to push the OPAC content out to mobile devices so that patrons who were out in the stacks could dynamically query the catalog for information. This makes heavy use of XML streams and XSL stylesheets to transform the data into a device compatible display. All in all, it was a very interesting demonstration of creative use of data outside the context of the traditional OPAC. However, once again, the question of open standards came up, and Emily had a hard time answering the question because they hadn't explored the issue. As great as these applications are, if we aren't taking advantage of standards, it makes me wonder how portable they would be to different environments.
Later in the conference Tito Sierra demonstrated how NCSU has been experimenting with subject recommendations in their web interfaces in a project they call "Smart Subjects". What they've done is very interesting. They harvested data from course descriptions and faculty publication repositories and then created indexes of that material. If an item a person is looking at has certain unique phrases or words in it, their application will attempt to suggest other areas of interest that are related to that phrase/term. This produces a browseable interface of related subject areas in order to assist the patron in finding related items. The part I found most interesting was the harvesting of publication repositories and course descriptions for subject associations. Tito mentioned that they are also considering scanning the article table of contents data in order to boost the relevance of their recommendations.
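To illustrate the general idea (and only the general idea; this is a toy sketch of my own, not NCSU's implementation), the following Python builds a small word index from harvested text and uses it to suggest subject areas for an item. The subject names, sample text, stopword list and function names are all made up for the example, and a real system would presumably match phrases and weight terms rather than count single words.

```python
from collections import Counter, defaultdict

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = frozenset({"the", "of", "and", "in", "to", "a"})


def build_term_index(harvested):
    """Map each lowercase word to the subject areas whose text contains it.

    `harvested` maps a subject area to a list of harvested text snippets
    (for example course descriptions or publication titles).
    """
    index = defaultdict(set)
    for subject, texts in harvested.items():
        for text in texts:
            for word in text.lower().split():
                index[word].add(subject)
    return index


def suggest_subjects(item_text, index):
    """Rank subject areas by how many distinct item words they share."""
    votes = Counter()
    for word in set(item_text.lower().split()):
        if word in STOPWORDS:
            continue
        for subject in index.get(word, ()):
            votes[subject] += 1
    return [subject for subject, _ in votes.most_common()]


# Hypothetical harvested data, for illustration only.
harvested = {
    "Textile Engineering": ["Introduction to polymer fiber science"],
    "Computer Science": ["Algorithms and data structures"],
}
index = build_term_index(harvested)
suggestions = suggest_subjects("Polymer fiber composites", index)
print(suggestions)
```

Adding another harvested source, such as article tables of contents, would simply mean feeding more text into the index for each subject area.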
Herding Cats
By far, my favorite presentation was Mike Rylander's talk on project management and development workflow. Mike is one of the developers for the Evergreen project in Georgia and this is the second time I've heard him talk. Mike addressed some of the core issues that libraries will face if they move forward with major development projects, and he drew on his own experience from developing Evergreen. Here are some tips for development timeline goals:
Long Term
Medium Term
Short Term
Here are some do's and don'ts regarding the development cycle:
Do's:
Don'ts:
Library Data APIs
Talis Corporation had a presence at the conference, and were "gold" sponsors of the event. They were given a chance to present their vision of the future of library services and the technology revolution that needs to take place in libraries if they are to remain competitive in the information landscape. Richard Wallis gave a lively presentation, the theme of which was separation of services. As we are all aware, the future of the monolithic library system is bleak. Talis proposes that we take the Amazon.com approach. Instead of storing all of our data locally in one system for which we pay exorbitant licensing fees, they suggest an approach not unlike OCLC. Various types of records (bibliographic, holdings, etc.) should be stored in a remote repository on top of which multiple services could be built. These services would include the user interface (completely customizable), web services queries, and general interface 'augmentation'. Instead of an interface completely written by the ILS vendor, the interface would take a component approach. The bibliographic data would come from one data store, the holdings data from another store, dust jacket images from yet another source, authority information from another source, faceted data from yet another. It reminded me of a sophisticated and highly orchestrated mashup scheme, or a service oriented architecture approach not unlike what Peter Murray presented at the ILS symposium in Windsor, Canada. Actually, this type of approach does seem to be the future of library web architectures, but Talis didn't give any figures on what the infrastructure for these services would cost, and that's where I'm a little bit skeptical.
Faculty Pages - The BibApp
Eric Larsen and Nate Vack from the University of Wisconsin at Madison gave a very interesting presentation on an innovative approach to providing campus faculty with a uniquely integrated interface highlighting the research that has been done at a given institution. They call the prototype "BibApp" and it incorporates over 12,000 citations and 2,700 papers ready to be archived. The biggest problem with the sustainability of this application is that they are pulling data from a proprietary source (Sherpa Data). However, the interface has a lot of functionality built into it, highlighting the research the faculty has done, linking between research projects, providing citations to completed research, faceted browsing, and search capabilities (along with a photo of the faculty member on the appropriate pages). This is very reminiscent of what we have been trying to do here with the digital repository project, but obviously their access to data is much more liberal than what we have. They admit several problems with the application, though, not the least of which is the proprietary data. They are also having difficulty with authority control, which causes citations to be matched with the wrong authors. They understand that these problems are a deal killer, but they've been given funds to continue working on the prototype and perhaps the University of Wisconsin will make this a reality within the next year.
Obstacles to Agility
Joan Starr from the California Digital Library gave another excellent presentation on project management and academic culture. She explored reasons why, in an academic environment, we are not "agile" enough to complete major development projects, and possible remedies to this problem. There are several obstacles that need to be highlighted: the academic culture itself, which tends to be a "non-participatory democracy" (plenty of critique with no commitment to doing the work), the non-representative nature of the academic environment, the problem of committees for committees (too many groups not getting enough done), the problem of ownership, diffuse decision making, the problem of projects going on ad nauseam, lack of advocacy, lack of morale when projects fail, slow hiring policies, steep administrative hierarchies, and lack of flexibility for grant funded positions. While this seemed to be a rather negative assessment, Joan pointed out that we can remedy these issues if we are willing to change the culture and be more proactive. I think that the overall admonition was that we need to have more foresight at a local and institutional level, with a willingness to take risks. We also need to be willing, in our development projects, to "release early and often" in order to encourage participation both from within and outside our institutions.
The Tech Heads
Several presentations at the conference were done so quickly or without adequate context that they were difficult to grasp. The ideas were innovative and helpful, but the presenters seemed to be speaking mainly to a group of experienced "tech heads" who had done previous work with the technology in question. Dan Chudnov presented using his usual humorous style and sound bite PowerPoint slides. His presentation was entitled "Fun with ZeroConfMetaOpenSearch", which speaks volumes for the lucidity of his presentation. Essentially, Dan's focus was that libraries need to be more like iTunes in order to stay relevant. ZeroConf is a technology pioneered by Apple several years ago to take the complexity out of computer networking. Dan demonstrated the way iTunes uses this technology to "advertise" collections of iTunes content on a given network, and his suggestion was that we can combine this technology with "MetaOpenSearch" (unAPI, OpenURL, Metasearch, etc.) to automatically advertise library services to devices that connect to the library's network. My understanding was that we would need to somehow piggy back on the Apple protocol or devise a new protocol that would become the library "service finder" protocol.
Ed Summers presented a primer on the Atom Publishing Protocol, which is very similar to RSS but more robust. Atom allows for a hierarchical grouping of resources within specified collections. Collections live in workspaces, and workspaces correspond to services. Atom allows dynamic publishing of web based resources, not completely unlike WebDAV. HTTP commands such as GET, PUT, POST and DELETE can be used to create and modify resources. The protocol, as I understand it, is expressed in XML and has a defined standard.
Both last year's and this year's conference had ample time for "lightning" talks, which are very short (5 min?) talks about whatever topic the person wishes to share. There were far too many to comment on this year, but they included topics such as using Apache's mod_filter module for mashups, a brilliant new spelling routine for CDL's XTF software, a PHP/PEAR module for parsing MARC records called File_MARC, and an application for processing web stats called AWStats, which looked interesting.
Library in a Box
Bess Sadler from the University of Virginia library highlighted an initiative she and some others at the conference were participating in to try and provide a low or no cost ILS to foreign libraries that cannot afford to purchase or support a long term ILS license. Some characteristics they are interested in include ease of installation, full internationalization, cutting edge technology and sustainability. This project is funded via the Open Society Institute and the eIFL (electronic information for libraries) organization.
Intellectual Property and Disclosure
Michael Doran from the University of Texas at Arlington gave an engaging presentation on the issue of intellectual property and open source software development at academic institutions. Michael had some interesting experiences developing two pieces of software to work in conjunction with Endeavor's Voyager ILS product. In both instances, he was asked to speak with the intellectual property committee at UTA. On the first occasion, he was granted permission to distribute his software, but on the second occasion, the University decided to try and license his software like they would an engineering patent. It turned out badly. Michael shared the lessons he learned about IP disclosure statements and processes, and advised anyone doing open source development to look into the legal ramifications in order to avoid any entanglements.
Final Thoughts
While I was at the conference I spent some time talking with a few groups of people about the possibility of capitalizing on the growing technical skill level in academic libraries, and attempting the development of a new open source OPAC. I referred to the experience Eric Morgan has had with the Rochester XC group, which most people had heard of and were very interested in. Overall, I received positive feedback, but folks also seemed tentative regarding their own ability to participate in such a project. Many of the librarians had the mentality that librarians do library work and programmers do programming work, and that to complete a project of this magnitude we would need to hire professional programmers. I'm certainly not opposed to that notion, but given the excellent presentations at the conference, and the innovative things that people are doing in the library IT community, I wondered why this wouldn't be possible. Obviously, this is a complicated issue, and would require the support of the library administration at any institution that would participate. Part of the issue may be a need to overcome the fear of failure. But, as was pointed out at the conference by Karen Schneider and others, the Evergreen project is really a great example of what can be done with a modest amount of money and folks who are dedicated to the task. If we can do that with the ILS as a whole, why can't we do that with the public face of the ILS?
My sincere hope is that academic libraries will continue to see the benefit in expanding their own IT infrastructure and increasing the overall technical skills of the library staff. And, I truly hope that Karen is correct in that we are beginning to see the start of a renaissance of library software development. The signs certainly are there, and the motivation seems high. And, I'm hoping that more and more academic libraries are willing to move in that direction, take some risks, and reap the long term benefits.
Site Last Modified:
Tuesday, March 13, 2007