Solr 7 beginner tutorial: dataimport from PostgreSQL database

 
Link to part two of this tutorial: delta-import and more on debugging Solr engine.

This tutorial was written for beginner Java developers and is meant to quickly set you up with Solr.

As such, it covers the following subjects: installing Solr, creating and configuring a core, setting up dataimport handlers to import documents from database, Java coding.

As usual, I'll try my hardest to make it example- and practice-driven, considering you can read all theory on the matter of Solr i.e. here:

So, here's what we'll do: first, we'll install a database and import lots of sample data.

Why? You can import documents (basic entities in Solr, compared to RDBMS those would be rows in table) to Solr from a multitude of sources, but the most popular use case is to store text data  (i.e. blog posts) in traditional database and import it to Solr to allow for quick search. Solr's index will really shine in scenarios where you want to do multi-criterion search (i.e. I need a book with title like this and this year of publishing that was published in Norway). Here's a nice link that describes key differences between traditional RDBMS index and Lucene index (which is a search engine underlying Solr).

Then we'll install Solr and create a core for our collection of sample data, define its schema (i.e. fields and their types and so on) and write a custom data import handler. The outcome of all this labor is that we'll be able to simply navigate to http://localhost:8983 and import thousands of documents from database upon a button click.

This tutorial assumes Arch Linux or some other Linux distribution is installed on your PC. You can do it on any other operating system though, it's just easier this way. To Windows users like me I'd recommend installing Oracle's VirtualBox and importing one of the many Arch-based distros, like Manjaro.

Log in screen of fresh Manjaro 17 VM

Log in with password osboxes.org. Let's kick off our session!

Database

I thought I'd give PostgreSQL a try, since it's the only RDBMS I've heard of in positive context. The installation is dead-simple, though initial configuration can be quite daunting if it's you first time with Postgres. I referred to Arch Wiki for setup recommendations. Open up the console (Terminal is the application's name)

Now that the database is set up, we can grab and import some kind of data for testing purposes. What I found after a couple of minute's worth of Google trawling is AdventureWorks Database - a Microsoft product sample for an online transaction processing (OLTP) database. The AdventureWorks Database supports a fictitious, multinational manufacturing company called Adventure Works Cycles.
Some nice person even uploaded a set of scripts that can import this to PostgreSQL -> here.
First, you'll have to download the original OLTP database from the link on the page I linked. Then clone the repository to your home directory:
All the contents of OLTP script archive have to be extracted to AdventureWorks-for-Postgres directory:

Now, launch the data transformation and importing scripts. You'll have to create a user with the same name as your UNIX user to allow importing from files in your home directory.
The huge collection of all sorts of entities is imported to PostgreSQL database Adventureworks. There's employees, transactions, orders and much more.
Let's check it out:
One thing worth noting is the schema of the table we're going to use for import. If you're familiar with the notion of schema in databases, Solr structures the data according to schemas (almost) just like any database. We'll have to know the schema of any dataset we'll want to import to Solr. The schema of the table person.person we'll use later can be found below:
We can get to the point now and install Solr.

Solr

Installing and creating a core

Download and unpack the latest version of Solr. At the time of writing this tutorial the latest version was 7.2.1. After unpacking it (to your home directory, for example) you'll have a fully functional full-text search engine ready to go:


What do those directories contain?
  •  /bin contains Solr executable.
  • /server/solr contains the cores.
  • /dist contains JAR libraries designed to enable extra features in Solr.
  • /contrib holds the additional files for the libraries mentioned above.  Namely, /contrib/{featureName}/lib contains the additional libraries needed to add this feature (i.e. JDBC driver lib for data import feature).
  • when in doubt about organizing your Solr collections, take a look at /example directory. Examples are made according to best practices.
Time to start examining the subject of cores. lsof program is additionally needed on Solr startup to determine if it's up or not.
Let's create a core to hold all the employees in Adwentureworks company: You can access your local Solr instance on http://localhost:8983/solr/
Note the 'data driven schema' warning. It refers to Solr's ability to infer the schema of collection (collection is synonymous with core in our one-server setup, read more about it here). If we were to simply continue using our core, the schema for the adwentureworks_person core (i.e. a set of fields each document has, their types and so on) would be determined at run time upon adding new documents. For instance, after creating a core we could add a new person to it (no need to actually do it, this is just an example): Solr would infer that collection adwentureworks_person requires fields "id", "name" and "title" typed number, text and text correspondingly, and add those fields to the schema. Looks convenient!... but this approach is rarely used in production as the schema is the blood contract between data layer and backend app.
That's why we'll disable this feature: We can get to the schema part of our tutorial now.

Schema 

Schema is the first thing we're going to configure in our collection. Schema defines:
  • what fields can document have,
  • which ones of them are mandatory,
  • what types do they have 
  • and how to perform indexing and searching on each of those documents.
The schema for our core is located in /server/solr/adwentureworks_person/conf/managed-schema. It used to be named schema.xml in old Solr versions, but introduction of Schema API (this is a must-read link) added a new type of schema, the managed schema, that can be edited at runtime via HTTP requests remotely.

For detailed explanation of Schema editing interface, see here. If you ever decide to edit you schema on the go without messing with config files, you'll need this.

Schema API also made editing fields from your browser possible! Like this:

Back in the days when Schema API wasn't introduced yet, you had to manually connect to server running Solr, rummage around the hard drive to find schema.xml and type in new fields/change existing fields. That's what most tutorials around the Internet suggest you do, in fact.

This tutorial will go the novel way, so we're going to configure the schema from browser entirely.

Now, as for the field types and properties, the official Solr manual (you should read this one, too) has the most concise description, as usual.

The most essential field types are as follows:
  • text_general: use it for all kinds of normal text data (i.e. names, addresses, book titles etc.)
  • string: use it for text that you don't want tokenized, for example, IDs.
  • pint, pints, plong, plongs: integer and long types (signed 32 and 64 bit respectively).
    p stands for pointer - meaning the internal Solr representation of number will allow for faster searching.
    s postfix means the field is multivalued. (Have you read the link from the paragraph above yet? Then you know what multivalued means.)
  • pfloat, pfloats, pdouble, pdoubles: 32- and 64-bit IEEE float and double types.
  • boolean: equals BoolField. As quoted from Solr docs: Contains either true or false. Values of 1, t, or T in the first character are interpreted as true. Any other values in the first character are interpreted as false.
Here's the full list of all types that are shipped with Solr 7: link.

Enough theory, let's create our own schema now! I'll show how to add only two fields from Postgres person.person schema: firstname and lastname. You can add all of them, if you'd like.
  1. Create a stored, indexed and required firstname field.
  2. Rinse and repeat for lastname
  3. There already exists an ID field, so we don't have to create another one:

That's all! You can check your managed-schema and see the fields you just added:
No need to restart Solr, the changes were applied immediately.

Data import.

Unfortunately, no one from Solr team thought of implementing data import handler editing from the browser, that much can be gleaned from the bare page of dataimport:

To get our table data into Solr, we'll finally have to edit some files.

  1. But first, you'll have to download PostgreSQL JDBC driver (postgresql-version.jar, I used postgresql-42.2.0.jar) and move it to contrib/dataimporthandler/lib/ (you'll have to create lib folder):
  2. Edit your solrconfig.xml in server/solr/adwentureworks_person/conf/ : insert the following lines there:
    Attention! XML tags are case-sensitive. Therefore mistyping tags <requestHandler>, <dataConfig>, <dataCource> etc. will lead to Solr not "seeing" those at all, i.e. data import won't work. Pay attention to capital letters.


  • Create a file named data-config.xml (as defined in the previous step) in server/solr/adwentureworks_person/conf/ directory with these contents:
  • Make sure dist/ directory contains libraries for dataimport: solr-dataimporthandler-version.jar and solr-dataimporthandler-extras-version.jar
  • Restart Solr:

  • The Web interface will contain data import section. Click Execute and voila!

    If your import query finishes immediately, something is wrong. When troubleshooting, logs help to locate the problem, and in Solr they can be accessed from browser on http://localhost:8983/solr/#/~logging. Look for errors in logs, double-check you configuration and restart Solr when the configuration is fixed to reapply changes.

    The error regarding dataimport is highlighted in red here.

    You can refer to detailed docs on data import in Solr here.

    Let's perform a test query. If import worked, we'll see plenty of data:

    Java.

     Initially, I wanted to include Java samples in this post, but since it's already enormously long I'll leave it for the next part. Stay tuned!

     Part 2 of the tutorial is ready with delta-import and more debugging hints! Sadly, still no Java code - I found it so simple and straightforward that I there's literally nothing to write about.

    Comments