Perspectives on Solr / Zebra / SRU / Z39.50
Exact and quick retrieval through indexed data is a very important goal, especially in the integrated library system domain.
1. Z39.50/SRU SUPPORT.
If I have any specific concern about new development complicating long-term development, it is the possibility of breaking or neglecting necessary Z39.50/SRU server support in the process of adding excessively generic Solr/Lucene indexing. Z39.50 and SRU are important library standards for record sharing, which is vital to the good functioning of the library community.
I recommend taking the issue of Z39.50/SRU support seriously and considering JZKit as a possible solution for providing Z39.50/SRU support on top of Solr/Lucene.
2. AVOIDING FEATURE REGRESSION OR BLOCKS TO FUTURE DEVELOPMENT.
Popular implementations of Solr/Lucene in library automation systems have made the mistake of sacrificing the precision needed for serious library research in return for high recall with poor relevancy, of the kind often found in Google, which may merely satisfy casual queries.
I share the concern that working with Zebra is too much like working with a black box into which one cannot peer. I make no claim that existing Z39.50/SRU Zebra support in Koha is ideal, but merely that it should not be too easily sacrificed for something else with its own problems which are merely less familiar to us. I suggest that we retain the existing Z39.50/SRU Zebra support in Koha while adding other options which may improve local indexing.
The full use of Bib-1 position, structure, and completeness attributes for Z39.50, or the ordered prox CQL operator for SRU, would allow the precise queries needed for serious research. The lack of a completeness operator in CQL is a serious deficiency for SRU. Index Data may still need to develop support in Zebra for the ordered prox CQL operator, which will most likely require paying to support that effort when the Koha community wants it. Zebra certainly has bugs, as does all software.
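As a minimal sketch of what such precise queries look like, the following builds a Z39.50 query in Prefix Query Format (PQF) using Bib-1 attribute types 1 (use), 3 (position), 4 (structure) and 6 (completeness), and an SRU query using CQL 1.2-style prox modifiers. The function names and the choice of the dc.subject index are hypothetical, the term is assumed to contain no double quotes, and whether a given server honours these attributes or modifiers is exactly the open question discussed above.

```python
# Bib-1 attribute values used below (attribute type in parentheses).
USE_SUBJECT = 21                 # (1) Use: subject heading
POSITION_FIRST_IN_FIELD = 1      # (3) Position: first in field
STRUCTURE_PHRASE = 1             # (4) Structure: phrase
COMPLETENESS_COMPLETE_FIELD = 3  # (6) Completeness: complete field


def pqf_exact_heading(term: str) -> str:
    """Build a PQF query matching `term` as the complete content of a subject field.

    Hypothetical helper; assumes `term` contains no double quotes.
    """
    attrs = [
        (1, USE_SUBJECT),
        (3, POSITION_FIRST_IN_FIELD),
        (4, STRUCTURE_PHRASE),
        (6, COMPLETENESS_COMPLETE_FIELD),
    ]
    prefix = " ".join(f"@attr {atype}={value}" for atype, value in attrs)
    return f'{prefix} "{term}"'


def cql_ordered_prox(index: str, first: str, second: str, distance: int = 1) -> str:
    """Build a CQL query requiring `first` to precede `second` within `distance` words.

    Hypothetical helper; the modifier syntax follows CQL 1.2 as I understand it.
    """
    return (
        f'{index} = "{first}" '
        f"prox/unit=word/distance<={distance}/ordered "
        f'{index} = "{second}"'
    )


print(pqf_exact_heading("Philosophy History"))
print(cql_ordered_prox("dc.subject", "philosophy", "history"))
```

Note that the CQL form can express order and distance but, as stated above, has no counterpart to the Bib-1 completeness attribute.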
Ultimately, I see no manageable way to have a free software library automation system without paying for some support for something from Index Data even if that would merely be Z39.50/SRU client programming libraries.
Solr/Lucene may now be a good choice for internal indexing in Koha. Lucene was not considered fairly during 2005 testing for Koha because the Perl bindings at that time were notoriously slow. Solr and Lucene have long had the mind share and development advantage of being Apache Foundation projects, which Zebra will never match, hence the forthcoming inclusion of Solr/Lucene indexing in the next major versions of Pazpar2 and Zebra. However, Solr/Lucene has had problems which should not go unconsidered in evaluating or actually implementing Solr/Lucene-based indexing in Koha.
2.1. HISTORICAL LACK OF PRECISION IN SOLR/LUCENE.
Solr/Lucene may have been a poor choice during the 2004 – 2006 period of sponsoring Perl Zoom and developing Zebra in Koha. Lucene had originally been developed for full text indexing of unstructured documents. Solr had originally been merely an easy to configure front end to a subset of Lucene functionality. Solr became a popular choice for the simplest free software OPACs. I have always tried to subject choices taken in Koha to personal reconsideration and made a modest investigation of the capabilities of Lucene and Solr/Lucene in 2007. I consulted widely and attended some conferences asking questions of the most expert implementers of library automation systems who had been using Lucene or Solr/Lucene. I tried to consult with people working to solve real problems rather than merely relying upon possibly incomplete documentation. In 2007, Solr provided no support for indexing to serve important concepts used for obtaining precision in library systems.
2.1.1. ASPECTS OF PRECISION HISTORICALLY UNSUPPORTED BY SOLR/LUCENE.
Hierarchy, where some content is subsidiary to other content and content derives meaning from its place in the hierarchy, had no support in Solr circa 2007. Field to subfield relationships are an example of hierarchy in MARC records. Namespace hierarchies are examples of hierarchy in XML records and are accessible by XPath queries. Hierarchy is a fundamental feature of classification and retrieval for easily including wanted record sets and excluding unwanted record sets.
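The loss of field to subfield hierarchy can be illustrated with a small sketch. The record layout, the sample headings and both function names are hypothetical and stand in for no particular indexer; the point is only that pooling subfields from every occurrence of a field produces a hit which a field-scoped match correctly rejects.

```python
from typing import Dict, List, Tuple

# A field is (tag, [(subfield code, value), ...]); a record is a list of fields.
Field = Tuple[str, List[Tuple[str, str]]]

# Hypothetical record with two separate 650 subject headings.
record: List[Field] = [
    ("650", [("a", "Philosophy"), ("z", "France")]),
    ("650", [("a", "Art"), ("x", "History")]),
]


def flat_match(rec: List[Field], tag: str, wanted: Dict[str, str]) -> bool:
    """Treat every subfield of every `tag` field as an independent keyword."""
    pool = {(code, value) for t, subs in rec if t == tag for code, value in subs}
    return all((code, value) in pool for code, value in wanted.items())


def field_scoped_match(rec: List[Field], tag: str, wanted: Dict[str, str]) -> bool:
    """Require all wanted subfields to occur inside one occurrence of the field."""
    return any(
        t == tag and all((code, value) in subs for code, value in wanted.items())
        for t, subs in rec
    )


query = {"a": "Philosophy", "x": "History"}
print(flat_match(record, "650", query))          # hit, although no heading says this
print(field_scoped_match(record, "650", query))  # no hit
```

The flat match reports a record about Philosophy in France and the history of Art as if it were about the history of philosophy.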
Sequential order, where the order of separate record sub-elements is relevant to meaning, had no support in Solr circa 2007. Philosophy – History, meaning ‘history of philosophy’, is an entirely different subject from History – Philosophy, meaning ‘philosophy of history’. Note the inversion of word order between the individual controlled vocabulary elements and the corresponding English phrase with the same meaning. The sequential order of fields within a record or of MARC subfields within a particular field are examples of sequential order in MARC records. The sequential order of namespaces within a record and the order of repeated elements within the same namespace are examples of sequential order in XML records accessible by XPath queries. Sequential order is a fundamental feature of meaning in language and is not necessarily reducible to phrase strings, where interceding terms may or may not be present and word order may be inverted as in the example given.
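A minimal sketch of the distinction, assuming a heading is represented simply as the list of its elements in order (the function names and the heading with an interceding element are hypothetical): an unordered match cannot tell the two subjects apart, while an ordered match that tolerates interceding elements can.

```python
from typing import Sequence


def unordered_match(query: Sequence[str], heading: Sequence[str]) -> bool:
    """True if every query element occurs somewhere in the heading."""
    return set(query) <= set(heading)


def ordered_match(query: Sequence[str], heading: Sequence[str]) -> bool:
    """True if the query elements occur in the heading in the same order.

    Interceding elements are allowed; each `in` test consumes the iterator
    up to the element it finds, so later terms must appear after earlier ones.
    """
    remaining = iter(heading)
    return all(term in remaining for term in query)


history_of_philosophy = ["Philosophy", "History"]
philosophy_of_history = ["History", "Philosophy"]
with_interceding = ["Philosophy", "France", "History"]

query = ["Philosophy", "History"]
print(unordered_match(query, philosophy_of_history))  # hit on the wrong subject
print(ordered_match(query, philosophy_of_history))    # correctly rejected
print(ordered_match(query, with_interceding))         # still matched
```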
2.1.2. ALTERNATIVES FOR PRECISION USING LUCENE.
In 2005, work at Bibliothèque de l’Université Laval (originators of RAMEAU) had developed LIUS (Lucene Index Update and Search) to overcome some difficulties of Lucene including fielded indexing of the very simplest flat field metadata found in some general purpose document types and XPath indexing for XML documents, http://sourceforge.net/projects/lius/ . Laval now uses Solr/Lucene based Constellio, http://www.constellio.com/ .
In 2007, I had been informed by a programmer of library automation systems working in the pharmaceutical industry, if I remember his job correctly, that hierarchical indexing and sequential indexing could be done in Lucene but that there was no support for such indexing in Solr. Precision is very important for both scientific and business purposes in the pharmaceutical industry. Despite valid criticism of some business practises within the pharmaceutical industry, lives are often at stake in their work.
We should treat the quality of information retrieval in library automation systems as if lives are at stake. Lives will sometimes be at stake in the research which people do.
2.2. CONSEQUENCES OF LACK OF PRECISION.
Sadly, the concept of precision has not been one which registered in the minds of those developing the popular free software OPACs using Solr/Lucene or some of their non-free equivalents. Examples of the consequences, from which Koha is not exempt, are using only $a in faceting despite the presence of other important subfields; jumbling all the subfields from all similar fields independently; and returning irrelevant results because subfields have been treated as mere independent keywords devoid of contextual meaning, even in the context of a query using an authority controlled field.
Human nature, to which Koha is not immune, may have some impetus to oversimplify for an expected advantage. Oversimplification in a library automation system, undertaken to improve speed or robustness, could eliminate the ability of the user to access the real complexity and richness of relationships in bibliographic records. Such oversimplification exists to a large extent in every actual library automation system.
I may be raising a false alarm about the possibility that some feature advance may complicate or block better improvements in the future. Yet, I prefer to take a vigilant stance rather than be sorry later for not having raised a concern.
2.3. CURRENT SUITABILITY OF SOLR/LUCENE.
I note significant improvements identified in the Solr/Lucene changelog from version 1.3 in 2008 and later.
The DataImportHandler was added in version 1.3. DataImportHandler has options for XPath based indexing.
Solr still seems to have no support for ordered proximity searches. Perhaps XPath based indexing would address the problem. A possible workaround modifying the Lucene code in SolrQueryParser to return SpanNearQuery instead of PhraseQuery may be a very undesirable remedy, breaking one feature to fix another.
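The difference between an unordered and an ordered proximity test over a positional index can be sketched as follows. This is a simplified model under my own assumptions, not Lucene's actual slop arithmetic for PhraseQuery or SpanNearQuery; the function names, the max_gap parameter and the sample text are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def positions(tokens: List[str]) -> Dict[str, List[int]]:
    """Build a tiny positional index: lower-cased token -> list of positions."""
    index: Dict[str, List[int]] = defaultdict(list)
    for pos, token in enumerate(tokens):
        index[token.lower()].append(pos)
    return index


def near(index: Dict[str, List[int]], first: str, second: str,
         max_gap: int, in_order: bool) -> bool:
    """True if the two terms occur with at most `max_gap` tokens between them.

    With in_order=True, `first` must also precede `second`.
    """
    for p in index.get(first, []):
        for q in index.get(second, []):
            if p == q:
                continue
            gap = abs(q - p) - 1
            if gap <= max_gap and (q > p or not in_order):
                return True
    return False


doc = positions("history and philosophy of science".split())
print(near(doc, "philosophy", "history", 1, in_order=False))  # matches either order
print(near(doc, "philosophy", "history", 1, in_order=True))   # rejected: wrong order
print(near(doc, "history", "philosophy", 1, in_order=True))   # accepted
```

A query parser that only exposes the unordered form gives no way to ask the second question, which is the limitation described above.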
Whether the improvements in Solr/Lucene are sufficient to overcome the past limitations which I have identified would require experimentation.
3. SUPPORT MODELS FOR NEEDED PROGRAMMING LIBRARIES.
It is good that companies such as Knowledge Integration, http://www.k-int.com/ , developers of JZKit, http://www.k-int.com/jzkit , are providing some free software competition and complementary work to what is available from Index Data.
Note that the JZKit developer, Ian Ibbotson, is using Yaz as a Z39.50 client, http://k-int.blogspot.com/2008/05/exposing-solr-services-as-z3950-server.html , leaving a dependency on Index Data for client-side Z39.50 services.
There is some prevarication at Index Data against fully embracing free software in everything they do. Inevitably they need revenue to be sustainable. The following thought about a possible shortage of Index Data development time and its consequences is merely speculative but not uninformed. Index Data may not have enough developers with sufficient experience to advance the underlying programming libraries which we use at the pace needed to meet the amount of work which the library community hopes to have from them. Contracting for Index Data development when there is not enough development time to go around might amount to bidding for the priority of the development which you need as much as sharing the cost of development with others.
Would working with Knowledge Integration, which has even fewer developers, be significantly different in terms of development costs? Does Knowledge Integration need less money than Index Data for a given amount of work to be sustainable?
Consider that JZKit seems to have no documentation worthy of the name. The source code repository contains about four pages of outlines for documentation with only one sentence of actual content, http://www.k-int.com/developer/downloads . There are some comments in the source code which I understand are used as documentation for JZKit. Yet the comments are too few and incomplete to be of sufficient use to me and, from what I have noted, to others as well. There are some virtually empty example configuration files which could be used as a basis for speculating about how configuration works. JZKit supposedly has a mailing list but I have not found it.
Index Data does provide documentation even if we have often found it inadequate for our needs in Koha development. I suspect that sufficient documentation at Knowledge Integration, as at Index Data, requires a support contract and, as we know, comes with no guarantee of completeness. Writing clear and thorough documentation is hard work. Writing documentation is the last thing which programmers generally want to do. Lack of good documentation is a common characteristic of free software.
If JZKit were also found to lack some feature or to need better functionality, would the situation be any different for Knowledge Integration development than for Index Data development? See the unfortunate position of Knowledge Integration on GPL contributions or AGPL 3 contributions in the case of JZKit, http://www.k-int.com/developer/participate .
The library community needs to find the means of working more cooperatively to ensure a steady availability of development resources at companies such as Index Data and Knowledge Integration for sustainable shared development.
I hope that Index Data may eventually be won over from their sometimes prevaricative position towards free software. Yet they need to be sustainable by some means. I do not find the position of Knowledge Integration to be any different and note that they do not have a link to the source code repository for OpenHarvest, http://www.k-int.com/developer/downloads . Index Data does have a long history of supporting free software for libraries. Index Data also makes an extraordinary, almost unbelievable promise in their support contracts to fix any bug within a set number of days.
The issue of how to share the cost of support contracts for programming libraries provided by companies such as Index Data or Knowledge Integration across multiple Koha support companies or even outside of the Koha community needs to be considered.
by Thomas Dukleth