{"id":195,"date":"2011-07-26T13:28:13","date_gmt":"2011-07-26T11:28:13","guid":{"rendered":"http:\/\/www.extradrm.com\/?p=195"},"modified":"2013-05-25T11:24:06","modified_gmt":"2013-05-25T09:24:06","slug":"solr-zebra-sru-z39-50","status":"publish","type":"post","link":"https:\/\/www.extradrm.com\/?p=195","title":{"rendered":"Perspectives on SolR \/ Zebra \/ SRU \/ Z39.50"},"content":{"rendered":"<p><strong>Exact and quick retrieval by indexing data<\/strong> is a very important goal, especially in the integrated library system domain.<\/p>\n<p><strong>1.\u00a0 Z39.50\/SRU SUPPORT.<\/strong><\/p>\n<p>If I would have any specific concern about the prospect of new development complicating long term development, it would be about the possibility of breaking or neglecting necessary Z39.50\/SRU server support in the process of adding excessively generic Solr\/Lucene indexing.\u00a0 Z39.50 and SRU are important library standards for record sharing, which is vital to the good functioning of the library community.<\/p>\n<p>I recommend taking the issue of Z39.50\/SRU support seriously and considering JZKit as a possible solution for Z39.50\/SRU support using Solr\/Lucene.<\/p>\n<p><strong>2.\u00a0 AVOIDING FEATURE REGRESSION OR BLOCKS TO FUTURE DEVELOPMENT.<\/strong><\/p>\n<p>Popular implementations of Solr\/Lucene in library automation systems have made the mistake of sacrificing the precision needed for serious library research in return for the high recall with poor relevancy often found in Google, which may merely satisfy casual queries.<\/p>\n<p>I share the concern that working with Zebra is too much like working with a black box into which one cannot peer.\u00a0 I make no claim that existing Z39.50\/SRU Zebra support in Koha is ideal but merely that it should not be too easily sacrificed for something else with its own problems which are merely less familiar to us.\u00a0 I suggest that we retain the existing Z39.50\/SRU Zebra support in Koha while adding other options 
which may improve local indexing.<\/p>\n<p>The full use of Bib-1 position, structure, and completeness attributes for Z39.50 or the ordered prox CQL operator for SRU would allow the precise queries needed for serious research.\u00a0 The lack of a completeness operator in CQL is a serious deficiency for SRU.\u00a0 Index Data may still need to develop support in Zebra for the ordered prox CQL operator which will most likely require paying to support that effort when it would be appreciated in the Koha community.\u00a0 Zebra certainly has bugs as does all software.<\/p>\n<p>Ultimately, I see no manageable way to have a free software library automation system without paying for some support for something from Index Data even if that would merely be Z39.50\/SRU client programming libraries.<\/p>\n<p>Solr\/Lucene may now be a good choice for internal indexing in Koha. Lucene was not considered fairly during 2005 testing for Koha because the Perl bindings at that time were notoriously slow.\u00a0 Solr and Lucene have long had the mind share and development advantage of being Apache Foundation projects which Zebra will never match, hence the forthcoming inclusion of Solr\/Lucene indexing for the next major versions of Pazpar2 and Zebra .\u00a0 However, Solr\/Lucene has had problems which should not go unconsidered in evaluating or actually implementing Solr\/Lucene based indexing in Koha.<\/p>\n<p><strong>2.1.\u00a0 HISTORICAL LACK OF PRECISION IN SOLR\/LUCENE.<\/strong><\/p>\n<p>Solr\/Lucene may have been a poor choice during the 2004 &#8211; 2006 period of sponsoring Perl Zoom and developing Zebra in Koha.\u00a0 Lucene had originally been developed for full text indexing of unstructured documents.\u00a0 Solr had originally been merely an easy to configure front end to a subset of Lucene functionality.\u00a0 Solr became a popular choice for the simplest free software OPACs.\u00a0 I have always tried to subject choices taken in Koha to personal reconsideration and made a 
modest investigation of the capabilities of Lucene and Solr\/Lucene in 2007.\u00a0 I consulted widely and attended some conferences asking questions of the most expert implementers of library automation systems who had been using Lucene or Solr\/Lucene.\u00a0 I tried to consult with people working to solve real problems rather than merely relying upon possibly incomplete documentation.\u00a0 In 2007, Solr provided no support for indexing to serve important concepts used for obtaining precision in library systems.<\/p>\n<p><strong>2.1.1.\u00a0 ASPECTS OF PRECISION HISTORICALLY UNSUPPORTED BY SOLR\/LUCENE.<\/strong><\/p>\n<p>Hierarchy, where some content is subsidiary to other content and content derives meaning from its place in the hierarchy, had no support in Solr circa 2007.\u00a0 Field to subfield relationships are an example of hierarchy in MARC records.\u00a0 Namespace hierarchies are examples of hierarchy in XML records and are accessible by XPath queries.\u00a0 Hierarchy is a fundamental feature of classification and retrieval for easily including wanted record sets and excluding unwanted record sets.<\/p>\n<p>Sequential order, where the order of separate record sub-elements is relevant to meaning, had no support in Solr circa 2007.\u00a0 Philosophy &#8211; History, meaning &#8216;history of philosophy&#8217;, is an entirely different subject from History &#8211; Philosophy, meaning &#8216;philosophy of history&#8217;.\u00a0 Note the inversion of word order between individual controlled vocabulary elements and the corresponding English phrase with the same meaning.\u00a0 The sequential order of fields within a record or MARC subfields within a particular field are examples of sequential order in MARC records.\u00a0 The sequential order of namespaces within a record and the order of repeated elements within the same namespace are examples of sequential order in XML records accessible by XPath queries.\u00a0 Sequential order is a fundamental feature of meaning in 
language and is not necessarily reducible to phrase strings where intervening terms may or may not be present and word order may be inverted as in the example given.<\/p>\n<p><strong>2.1.2.\u00a0 ALTERNATIVES FOR PRECISION USING LUCENE.<\/strong><\/p>\n<p>In 2005, work at Biblioth\u00e8que de l&#8217;Universit\u00e9 Laval (originators of RAMEAU) developed LIUS (Lucene Index Update and Search) to overcome some difficulties of Lucene including fielded indexing of the very simplest flat field metadata found in some general purpose document types and XPath indexing for XML documents, http:\/\/sourceforge.net\/projects\/lius\/ .\u00a0 Laval now uses Solr\/Lucene based Constellio, http:\/\/www.constellio.com\/ .<\/p>\n<p>In 2007, I was informed by a programmer of library automation systems working in the pharmaceutical industry, if I remember his job correctly, that hierarchical indexing and sequential indexing could be done in Lucene but that there was no support for such indexing in Solr.\u00a0 Precision is very important for both scientific and business purposes in the pharmaceutical industry.\u00a0 Despite valid criticism of some business practises within the pharmaceutical industry, lives are often at stake in their work.<\/p>\n<p>We should treat the quality of information retrieval in library automation systems as if lives are at stake.\u00a0 Lives will sometimes be at stake in the research which people do.<\/p>\n<p><strong>2.2.\u00a0 CONSEQUENCES OF LACK OF PRECISION.<\/strong><\/p>\n<p>Sadly, the concept of precision has not been one which registered in the minds of those developing the popular free software OPACs using Solr\/Lucene or some of their non-free equivalents.\u00a0 Examples of the consequences, from which Koha is not excluded, are using only $a in faceting despite the presence of other important subfields; jumbling all the subfields from all similar fields independently; and returning irrelevant results because subfields have 
been treated as mere independent keywords devoid of contextual meaning even in the context of a query using an authority controlled field.<\/p>\n<p>Human nature, to which Koha is not immune, may have some impetus to oversimplify for an expected advantage.\u00a0 Oversimplification in the context of a library automation system could eliminate the ability for the user to access the real complexity and richness of relationships in bibliographic records to improve speed or robustness.\u00a0 Such oversimplification exists to a large extent in every actual library automation system.<\/p>\n<p>I may be raising a false alarm about the possibility that some feature advance may complicate or block better improvements in the future. Yet, I prefer to take a vigilant stance rather than be sorry later for not having raised a concern.<\/p>\n<p><strong>2.3.\u00a0 CURRENT SUITABILITY OF SOLR\/LUCENE.<\/strong><\/p>\n<p>I note significant improvements identified in the Solr\/Lucene changelog from version 1.3 in 2008 and later.<\/p>\n<p>The DataImportHandler was added in version 1.3.\u00a0 DataImportHandler has options for XPath based indexing.<\/p>\n<p>Solr still seems to have no support for ordered proximity searches. 
Perhaps XPath based indexing would address the problem.\u00a0 A possible workaround, modifying the Lucene code in SolrQueryParser to return SpanNearQuery instead of PhraseQuery, may be a very undesirable remedy, breaking one feature to fix another.<\/p>\n<p>Whether the improvements in Solr\/Lucene are sufficient to overcome the past limitations which I have identified would require experimentation.<\/p>\n<p><strong>3.\u00a0 SUPPORT MODELS FOR NEEDED PROGRAMMING LIBRARIES.<\/strong><\/p>\n<p>It is good that companies such as Knowledge Integration, http:\/\/www.k-int.com\/ , developers of JZKit, http:\/\/www.k-int.com\/jzkit , are providing some free software competition and complementary work to what is available from Index Data.<\/p>\n<p>Note that the JZKit developer, Ian Ibbotson, is using Yaz as a Z39.50 client, http:\/\/k-int.blogspot.com\/2008\/05\/exposing-solr-services-as-z3950-server.html , leaving a dependency on Index Data for Z39.50 client side services.<\/p>\n<p>There is some prevarication at Index Data against fully embracing free software in everything they do.\u00a0 Inevitably they need revenue to be sustainable.\u00a0 The following thought about a possible shortage of Index Data development time and the consequences is merely speculative but not uninformed.\u00a0 Index Data may not have enough sufficiently experienced developers working for them to further the development of the underlying programming libraries which we use and so meet the amount of work which the library community hopes to have from them.\u00a0 Contracting for Index Data development in the absence of sufficient development time to go around might result in bidding for the priority of the development which you need as much as sharing the cost of development with others.<\/p>\n<p>Would working with Knowledge Integration, which has even fewer developers, be significantly different in terms of development costs?\u00a0 Does Knowledge Integration need less 
money for a given amount of work than Index Data does to be sustainable?<\/p>\n<p>Consider that JZKit seems to have no documentation worthy of the name.\u00a0 The source code repository contains about four pages of outlines for documentation with only one sentence of actual content, http:\/\/www.k-int.com\/developer\/downloads .\u00a0 There are some comments in the source code which I understand are used as documentation for JZKit.\u00a0 Yet the comments are too few and incomplete to be of sufficient use to me and, from what I have noted, to others as well.\u00a0 There are some virtually empty example configuration files which could be used as a basis for speculating how configuration works.\u00a0 JZKit supposedly has a mailing list but I have not found it.<\/p>\n<p>Index Data does provide documentation even if we have often found it inadequate for our needs in Koha development.\u00a0 I suspect that sufficient documentation at Knowledge Integration, as at Index Data, requires a support contract and, as we know, comes with no guarantee of completeness.\u00a0 Writing clear and thorough documentation is hard work.\u00a0 Writing documentation is the last thing which programmers generally want to do.\u00a0 Lack of good documentation is a common characteristic of free software.<\/p>\n<p>If JZKit also needed some missing feature or better functionality, would the situation be any different for Knowledge Integration development than for Index Data development?\u00a0 See the unfortunate position of Knowledge Integration on GPL contributions or AGPL 3 contributions in the case of JZKit, http:\/\/www.k-int.com\/developer\/participate .<\/p>\n<p>The library community needs to find the means of working more cooperatively to ensure a steady availability of development resources at companies such as Index Data and Knowledge Integration for sustainable shared development.<\/p>\n<p>I hope that Index Data may eventually be won over from 
their sometimes prevaricative position towards free software.\u00a0 Yet they need to be sustainable by some means.\u00a0 I do not find the position of Knowledge Integration to be any different and note that they do not have a link to the source code repository for OpenHarvest, http:\/\/www.k-int.com\/developer\/downloads .\u00a0 Index Data does have a long history of supporting free software for libraries.\u00a0 Index Data also makes an extraordinary, almost unbelievable promise in their support contracts to fix any bug within a set number of days.<\/p>\n<p>The issue of how to share the cost of support contracts for programming libraries provided by companies such as Index Data or Knowledge Integration across multiple Koha support companies or even outside of the Koha community needs to be considered.<\/p>\n<p>by Thomas Dukleth<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Exact and quick retrieval by indexing data is a very important goal, especially in the integrated library system domain. 1.\u00a0 Z39.50\/SRU SUPPORT. 
If I would have any specific concern about the prospect of new&#46;&#46;&#46;<\/p>\n","protected":false},"author":1,"featured_media":2847,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[33,2,13],"tags":[347,183,184],"youtube_video":null,"_links":{"self":[{"href":"https:\/\/www.extradrm.com\/index.php?rest_route=\/wp\/v2\/posts\/195"}],"collection":[{"href":"https:\/\/www.extradrm.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.extradrm.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.extradrm.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.extradrm.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=195"}],"version-history":[{"count":0,"href":"https:\/\/www.extradrm.com\/index.php?rest_route=\/wp\/v2\/posts\/195\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.extradrm.com\/index.php?rest_route=\/wp\/v2\/media\/2847"}],"wp:attachment":[{"href":"https:\/\/www.extradrm.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=195"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.extradrm.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=195"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.extradrm.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=195"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}