Loading data from Mysql to Solr with a Data Import Handler

Solr is used not as a primary data store because it is a Search Platform whose primary purpose is giving the ability to do complex searches with blazing performance. This means that you usually have your data in a primary data store, like mysql Server and you need to move data to a Solr server to power up your searches.

I am using SolR 4.9.1 under windows vista with wampserver for this tutorial with a single core collection1 ….

In our case we used MySQL so we added in the contrib / dataimporthandler / lib :
mysql-connector-java-5.1.32-bin.jar

mysql-jar

 

Nota Bene : If you were using SQL Server then you will need to  add sqljdbc4.jar that contains classes needed to connect to a SqlServer database from java jdbc.

First step is to use the sample mysql database used by Apache SolR tutorials (let us start wampserver and create a database called test with root user an without password :

CREATE TABLE `item` (
`ID` int(11) NOT NULL AUTO_INCREMENT,
`NAME` varchar(450) DEFAULT NULL,
`MANU` varchar(450) DEFAULT NULL,
`WEIGHT` float DEFAULT NULL,
`PRICE` float DEFAULT NULL,
`POPULARITY` int(11) DEFAULT NULL,
`INSTOCK` tinyint(4) DEFAULT NULL,
`INCLUDES` varchar(450) DEFAULT NULL,
`last_modified` TIMESTAMP,
PRIMARY KEY (`ID`)
);

CREATE TABLE `category` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`description` varchar(450) DEFAULT NULL,
`last_modified` TIMESTAMP,
PRIMARY KEY (`id`)
);

CREATE TABLE `feature` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`description` varchar(450) DEFAULT NULL,
`item_id` int(11) DEFAULT NULL,
`last_modified` TIMESTAMP,
PRIMARY KEY (`id`),
KEY `fk_feature_1` (`item_id`),
CONSTRAINT `fk_feature_1` FOREIGN KEY (`item_id`) REFERENCES `item` (`ID`) ON DELETE NO ACTION ON UPDATE NO ACTION
);

CREATE TABLE `item_category` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`CATEGORY_ID` int(11) DEFAULT NULL,
`item_id` int(11) DEFAULT NULL,
`last_modified` TIMESTAMP,
PRIMARY KEY (`id`),
KEY `fk_item_category_1` (`CATEGORY_ID`),
KEY `fk_item_category_2` (`item_id`),
CONSTRAINT `fk_item_category_1` FOREIGN KEY (`CATEGORY_ID`) REFERENCES `category` (`id`) ON DELETE NO ACTION ON UPDATE NO ACTION,
CONSTRAINT `fk_item_category_2` FOREIGN KEY (`item_id`) REFERENCES `item` (`ID`) ON DELETE NO ACTION ON UPDATE NO ACTION
);

Adding some Sample Data in MySQL:

insert into item values(id,'item 1','item 1 manu','1.2','100.23','1',1,'includes item 1',CURRENT_TIMESTAMP);

insert into feature values(id,'feature item 1',1,CURRENT_TIMESTAMP);
insert into category values(id,'music',CURRENT_TIMESTAMP);
insert into item_category values(id,1,1,CURRENT_TIMESTAMP);

After that we have to prepare collection xml config file :
in solrconfig.xml add (windows)

..

<lib dir="C:\solr\contrib\dataimporthandler\lib\" regex=".*\.jar" />
<lib dir="C:\solr\dist\" regex="solr-dataimporthandler-\d.*\.jar" />

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
          <str name="config">db-data-config.xml</str>
      </lst>
</requestHandler>    
...

Now create in config collection1 folder : db-data-config.xml (From Solr version 4.4 and +++). This Data Import configuration file will specify how you want to import data inside your Solr.

<dataConfig>
    <dataSource driver="com.mysql.jdbc.Driver" type="JdbcDataSource"
       url="jdbc:mysql://localhost:3306/test" user="root" password="" />
<document>
<!-- this entity is the 'root' entity. -->
    <entity name="item" query="select * from item" deltaQuery="select id from item where last_modified > '${dataimporter.last_index_time}'">
	<field column="NAME" name="name" />
	<!-- This entity is nested and reflects the one-to-many relationship between an item and its multiple features. Note the use of variables; ${item.ID} is the value of the column 'ID' for the
	current item ('item' referring to the entity name) -->
	<entity name="feature"
		query="select DESCRIPTION from FEATURE where ITEM_ID='${item.ID}'" deltaQuery="select ITEM_ID from FEATURE where last_modified > '${dataimporter.last_index_time}'"
		parentDeltaQuery="select ID from item where ID=${feature.ITEM_ID}">
	<field name="features" column="DESCRIPTION" />
	</entity>

	<entity name="item_category"
		query="select CATEGORY_ID from item_category where ITEM_ID='${item.ID}'"
		deltaQuery="select ITEM_ID, CATEGORY_ID from item_category where last_modified > '${dataimporter.last_index_time}'"
		parentDeltaQuery="select ID from item where ID=${item_category.ITEM_ID}">
		<entity name="category" query="select DESCRIPTION from category where ID = '${item_category.CATEGORY_ID}'"
			deltaQuery="select ID from category where last_modified > '${dataimporter.last_index_time}'"
			parentDeltaQuery="select ITEM_ID, CATEGORY_ID from item_category where CATEGORY_ID=${category.ID}">
			<field column="description" name="cat" />
		</entity>
	</entity>
    </entity>
</document>
</dataConfig>

In case it was an sql server, I like to show a more simple sample DIH handler :

<dataConfig>  
    <dataSource type="JdbcDataSource"
            driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
            url="jdbc:sqlserver://10.0.10.104;databaseName=thesaurus;"
            user="sa"
            password="zzzzzz"
            batchSize="5" /> 
             
    <document name="TestDocument">  
        <entity name="TestEntity" query="SELECT * FROM tag">  
            <field column="Id" name="id" />
            <field column="Term" name="term" />  
            <field column="Name" name="name" />  
        </entity>  
    </document>  
</dataConfig> 

All the tests after insert or update will be queried under Admin SolR UI / collection1 …

Before we continue, Here are some useful URLs for Import / indexing MySQL database with Apache Solr – Delta Import to make the tests more quicker :

Clear or delete Solr index: http://localhost:8983/solr/update?stream.body=<delete><query>*:*</query></delete>&commit=true
Retrieve all: http://localhost:8983/solr/select?q=*:*&omitHeader=true
Index db: http://localhost:8983/solr/collection1/dataimport?command=full-import

http://localhost:8983/solr/collection1/dataimport?command=full-import&clean=false
Reload core: http://localhost:8983/solr/admin/cores?action=RELOAD&core=collection1
Query for word Sample: http://localhost:8983/solr/select?q=Sample&wt=json&qf=first_name%20last_name&defType=edismax

For indexing the whole database use :

http://localhost:8983/solr/collection1/dataimport?command=full-import

Now let us test SolR delta import by updating a row in items mysql table :

update item set `last_modified` = CURRENT_TIMESTAMP , price=91 where `ID` =1;
update users set `last_modified` = CURRENT_TIMESTAMP , user_name='Haddad' where `user_id` =1;

and let us do solr delta import with our dih :

http://localhost:8983/solr/collection1/dataimport?command=delta-import&optimize=false

It is important to note that some problems may occur with delta import if the server time is not set properly or on ID in DIH if the proper version of Solr is not the good one
DIH file for version lucene 1.4 is slightly different from 4.4 +++ versions

Import data with standard Admin UI solr web interface
solr_admin_ui

Cheers

extradrmtech

Since 30 years I work on Database Architecture and data migration protocols. I am also a consultant in Web content management solutions and medias protecting solutions. I am experienced web-developer with over 10 years developing PHP/MySQL, C#, VB.Net applications ranging from simple web sites to extensive web-based business applications. Besides my work, I like to work freelance only on some wordpress projects because it is relaxing and delightful CMS for me. When not working, I like to dance salsa and swing and to have fun with my little family.