Solr: XML Schema guide
schema.xml is usually the first file you configure when setting up a new Solr installation. The schema.xml file contains all of the details about which fields your documents can contain, and how those fields should be dealt with when adding documents to the index, or when querying those fields. I will only cover the most commonly-used configuration elements.
The schema.xml file declares the following:
what kinds of fields there are
which field should be used as the unique/primary key
which fields are required
how to index and search each field
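To make these four points concrete, here is a rough skeleton of how they map onto the file. It is only a sketch, assuming the older schema.xml layout used throughout this post; the schema name, the version attribute, and the field names id and name are placeholders and may differ in your Solr release.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<!-- name and version are assumptions; copy them from the example schema of your release -->
<schema name="example" version="1.2">
  <types>
    <!-- what kinds of fields there are, and how each kind is indexed and searched -->
  </types>
  <fields>
    <!-- the fields themselves, including which ones are required -->
  </fields>
  <!-- which field is the unique/primary key (assumes a field called id is declared above) -->
  <uniqueKey>id</uniqueKey>
  <!-- query defaults (assumes a field called name is declared above) -->
  <defaultSearchField>name</defaultSearchField>
  <solrQueryParser defaultOperator="OR"/>
</schema>
```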
The XML consists of a number of parts. We’ll look at these:
A- Field Types
<types>
  <fieldType name="int" class="solr.TrieIntField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
  ...
</types>
The example Solr schema.xml comes with a number of pre-defined field types, and they’re quite well-documented. You can also use them as templates for creating new field types.
The commonly used ones are:
text
A generically useful text field. It is described in the documentation as:
A text field that uses WordDelimiterFilter to enable splitting and matching of words on case-change, alpha numeric boundaries, and non-alphanumeric chars, so that a query of “wifi” or “wi fi” could match a document containing “Wi-Fi”. Synonyms and stopwords are customized by external files, and stemming is enabled.
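For orientation, a field type of this kind looks roughly like the abridged sketch below. It is modelled on the text type from the example schema, but the exact filter chain and attribute values vary between Solr versions, so treat it as an illustration rather than a copy of the shipped definition. The file names stopwords.txt, synonyms.txt and protwords.txt are the external files the description refers to.

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Stopwords come from an external file -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <!-- Splits on case changes and non-alphanumeric characters, and also indexes
         the joined form, so "Wi-Fi" yields "wi", "fi" and "wifi" once lowercased -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Stemming; words listed in protwords.txt are left unstemmed -->
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Synonyms come from an external file and are expanded at query time -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
</fieldType>
```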
string
Useful when you have a text field that you don’t want tokenized, such as an ID. It is described in the documentation as:
The StrField type is not analyzed, but indexed/stored verbatim. – StrField and TextField support an optional compressThreshold which limits compression (if enabled in the derived fields) to values which exceed a certain size (in characters).
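In the example schema this type is a one-line declaration along the following lines. The sortMissingLast attribute is taken from that example and is optional; check the schema shipped with your version for the exact attributes.

```xml
<!-- Values are indexed and stored exactly as supplied; no analyzer is applied -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
```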
date
Useful for dates. It is described in the documentation as:
The format for this date field is of the form 1998-12-31T23:42:59Z, and is a more restricted form of the canonical representation of dateTime http://www.w3.org/TR/xmlschema-2/#dateTime
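As a small illustration of that format, the sketch below shows a document being added with a date value. The field names id and published and the value doc-1 are hypothetical, and the snippet assumes published has been declared in the schema with a date field type.

```xml
<add>
  <doc>
    <field name="id">doc-1</field>
    <!-- UTC timestamp in the restricted dateTime form, ending in Z -->
    <field name="published">1998-12-31T23:42:59Z</field>
  </doc>
</add>
```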
float and int
The Solr Wiki also has some information on field types.
<fields>
  <field name="id" type="string" indexed="true" stored="true" required="true" />
  <field name="name" type="textgen" indexed="true" stored="true"/>
  ...
</fields>
The documentation provides a list of valid attributes:
name: mandatory – the name for the field
type: mandatory – the name of a previously defined type from the <types> section
indexed: true if this field should be indexed (searchable or sortable)
stored: true if this field should be retrievable
compressed: [false] if this field should be stored using gzip compression (this will only apply if the field type is compressible; among the standard field types, only TextField and StrField are)
multiValued: true if this field may contain multiple values per document
omitNorms: (expert) set to true to omit the norms associated with this field (this disables length normalization and index-time boosting for the field, and saves some memory). Only full-text fields or fields that need an index-time boost need norms.
termVectors: [false] set to true to store the term vector for a given field. When using MoreLikeThis, fields used for similarity should be stored for best performance.
termPositions: Store position information with the term vector. This will increase storage costs.
termOffsets: Store offset information with the term vector. This will increase storage costs.
default: a value that should be used if no value is specified when adding a document.
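A few of these attributes in combination might look like the sketch below. Apart from id, the field names are hypothetical, and the snippet assumes that field types called string, text and date are defined in the types section.

```xml
<fields>
  <!-- Required identifier, indexed and stored verbatim -->
  <field name="id" type="string" indexed="true" stored="true" required="true"/>
  <!-- Hypothetical field: a document may carry several tags -->
  <field name="tags" type="string" indexed="true" stored="true" multiValued="true"/>
  <!-- Hypothetical field: term vectors with positions and offsets, at extra storage cost -->
  <field name="description" type="text" indexed="true" stored="true"
         termVectors="true" termPositions="true" termOffsets="true"/>
  <!-- Hypothetical field: set to the indexing time when the document supplies no value -->
  <field name="timestamp" type="date" indexed="true" stored="true" default="NOW"/>
</fields>
```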
The Solr Wiki has more information on fields, including dynamic fields.
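As a brief illustration, a dynamic field matches field names by pattern, so documents can use fields that were never declared one by one. The two patterns below follow the style of the example schema and assume the string and date field types exist.

```xml
<!-- Any field whose name ends in _s (for instance a hypothetical author_s) is treated as a string -->
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
<!-- Any field whose name ends in _dt is treated as a date -->
<dynamicField name="*_dt" type="date" indexed="true" stored="true"/>
```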
uniqueKey
Equivalent to the primary key of the document.
Field to use to determine and enforce document uniqueness. Unless this field is marked with required="false", it will be a required field.
defaultSearchField
Field for the QueryParser to use when an explicit fieldname is absent.
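In the older schema.xml layout, the unique key and the default search field are each a single element placed after the fields section. A minimal sketch, assuming the id and name fields declared earlier:

```xml
<!-- Documents added with an id that already exists replace the earlier document -->
<uniqueKey>id</uniqueKey>
<!-- A query such as q=solr, with no field prefix, is run against the name field -->
<defaultSearchField>name</defaultSearchField>
```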
Used for determining whether multiple terms are ANDed or ORed together by default.
SolrQueryParser configuration: defaultOperator="AND|OR"
For example, with the following query:
q=extradrm solr tutorials
a setting of defaultOperator="AND"
will produce the following Solr boolean query:
q=extradrm AND solr AND tutorials
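In schema.xml, the setting that gives this result is a single element, sketched here for the older schema layout this post describes:

```xml
<!-- Terms in a query are joined with AND unless the query itself says otherwise -->
<solrQueryParser defaultOperator="AND"/>
```

With defaultOperator="OR" instead, the same query would be treated as extradrm OR solr OR tutorials.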