Notes on DIH Architecture: Solr’s Data Import Handler

What the world really needs are some awesome examples of extending DIH (Solr's DataImportHandler), beyond the classes and unit tests that ship with Solr.  That's a tall order given DIH's complexity, and sadly this post ain't it either!  After doing a lot of searches online, I don't think anybody's written an "Extending DIH Guide" yet – everybody still points to the Solr wiki quick start, FAQ, source code and unit tests.

However, in this post, I will review a few concepts to keep in mind.  And who knows, maybe in a future post I’ll have some concrete code.

When I make notes, I highlight the things that are different from what I'd expect, and why, so I'm going to start with that.  Sure, DIH has an XML config where you tell it about your database or filesystem or RSS feed and map those things into your Solr schema, so no surprise there.  But the layering of that configuration really surprised me.  (And it turns out there are good reasons for it.)

What I Expected to find (which is incorrect)
  • Top level: define connections to a database, website or filesystem, including your SQL join query, filesystem search pattern, or URL parameters.  I had seen the DataSource tags and assumed that’s what they were.
  • Mid level: define your document, the logical unit of retrieval, using the <document> tag.
  • Low level: the fields you want to copy or map into particular Solr fields with the <field> tag.

How the DIH config is actually structured (a minimal sketch follows this list):
  • Top level: you often define a DataSource here, though not always.  And even if you do define it here, it's not where you put your SQL query!  Also, different data source types return radically different Java types – readers, streams, or complex nested data structures.  I found it really surprising that the same method defined in an interface would return such fundamentally different kinds of data.
  • Intermediate level: the <document> tag sits near the top, and in a few cases there's no DataSource tag at all.
  • Mid level: a combination of Entity Processors and Transformers.  The entities are what define the SQL commands, filesystem parameters, etc.  They can also be nested, and they carry the "rootEntity" attribute, which does NOT always correspond to the highest <entity> tag in the tree.
  • Low level: I was pretty close on the <field> tag, though it's more powerful than I thought.
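
To make that layering concrete, here's a minimal sketch of a data-config.xml for a single database table (the driver, URL, table, and column names are all made up for illustration).  Notice that the connection details live in the dataSource tag, the SQL lives down in the entity, and the field tags do the final mapping:

  <dataConfig>
    <!-- top level: connection details only, no SQL here -->
    <dataSource type="JdbcDataSource" driver="org.postgresql.Driver"
                url="jdbc:postgresql://localhost/mydb" user="solr" password="secret"/>
    <document>
      <!-- the entity, not the dataSource, carries the actual query -->
      <entity name="item" query="select id, name from item">
        <!-- low level: map result set columns onto Solr schema fields -->
        <field column="id" name="id"/>
        <field column="name" name="name"/>
      </entity>
    </document>
  </dataConfig>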


Entities Are King

The <entity> tag is really the star of the show!  In fact, if you're thinking about extending DIH for a custom repository, there's some chance you might just create a new EntityProcessor instead of a DataSource.  For an example, see the MailEntityProcessor.

It's jarring at first to see a SQL query subordinate to the <document> tag.  I think of a "document" as an individual row in a database or search engine, whereas a SQL query is a whole set of rows, a result set.  Even if you allow for joins to ancillary tables to pick up a few additional fields, surely the main table (a bunch of rows) is logically above a document (an individual row) – it's almost like a foreach loop where the top iteration code has accidentally been swapped with the code that acts on each item.

It turns out that for entities flagged as "root", each document does still correspond to an individual record – DIH obviously isn't stuffing all of your records into a single document!  Think of it more like the SQL defining a cursor that will instantiate a document for each row it encounters.  I think this was done for consistency with the SQL appearing in the NON-root entities, where it really does make sense to sit down at this level.

Non-root entities can run additional SQL queries, using values from the root entity's current row, to query additional tables (or fetch other types of data to add to the record).  So for example, if "book" is your main entity, another entity can query for authors and add one or more of them to the book.  This is particularly powerful for one-to-many and many-to-many relationships: if you did a normal join, the main record would be repeated, and a book with 3 authors would wind up creating 3 Solr records instead of 1.
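
As a rough sketch of that book/author layout (table and column names invented), the inner entity runs once per book row, pulling the current book's id via the ${book.id} variable; with a multiValued author field in the schema, all of the authors land on the one book document:

  <document>
    <entity name="book" query="select id, title from books">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
      <!-- runs once for each book row, adding however many authors it finds -->
      <entity name="author"
              query="select name from authors where book_id = '${book.id}'">
        <field column="name" name="author"/>
      </entity>
    </entity>
  </document>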

An interesting example of the top entity NOT being the root is the wiki example for scanning your filesystem.  From an XML standpoint, the top entity contains the information about which directory to scan and what filename pattern to look for.  But you wouldn't want a bare directory listing to become an actual searchable document in Solr; this is handled by setting rootEntity="false".

What you really want is the contents of each file to become a record in Solr.  That is handled by a nested entity (again from an XML standpoint) that is flagged rootEntity="true".
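
Sketched out along the lines of the wiki example (directory, pattern, and field names are made up), the outer entity just walks the directory and is marked rootEntity="false", while the nested entity produces the actual documents:

  <dataConfig>
    <dataSource name="fileReader" type="FileDataSource"/>
    <document>
      <!-- outer entity: a directory listing, NOT a searchable document -->
      <entity name="files" processor="FileListEntityProcessor" rootEntity="false"
              dataSource="null" baseDir="/data/feeds" fileName=".*\.xml" recursive="true">
        <!-- inner entity: one Solr document per record inside each file -->
        <entity name="record" processor="XPathEntityProcessor" rootEntity="true"
                dataSource="fileReader" url="${files.fileAbsolutePath}" forEach="/record">
          <field column="title" xpath="/record/title"/>
        </entity>
      </entity>
    </document>
  </dataConfig>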

Another example would be if a single XML file contains multiple records; in a complex situation you could nest XML processors, but still only have the one that should correspond to Solr documents be marked as root.
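
For instance, a single file holding many records might be handled like this (the file path and element names are invented); forEach tells the XPathEntityProcessor which repeating element should become a Solr document:

  <dataConfig>
    <dataSource type="FileDataSource"/>
    <document>
      <entity name="item" processor="XPathEntityProcessor"
              url="/data/catalog.xml" forEach="/catalog/item">
        <field column="id" xpath="/catalog/item/id"/>
        <field column="title" xpath="/catalog/item/title"/>
      </entity>
    </document>
  </dataConfig>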

Eventually some of the entities will contain <field> tags.  This is where you map database or XML data into a Solr schema.  You can also invoke transformers for each field.  And what’s really nice is that field transformers have access to all the bound variables as well as the other field values you’ve just calculated, so you can build up rather complex fielded data.  The Tika processor can also handle other document types such as MS Word or Excel, which might also be living in your CMS database.
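
For illustration (column and field names invented), a couple of the stock transformers look like this: DateFormatTransformer reshapes a timestamp column, and TemplateTransformer builds a brand-new value out of variables already bound on the current row:

  <entity name="item" query="select id, name, added_ts from item"
          transformer="DateFormatTransformer,TemplateTransformer">
    <field column="id" name="id"/>
    <field column="name" name="name"/>
    <field column="added_ts" name="added_date" dateTimeFormat="yyyy-MM-dd HH:mm:ss"/>
    <!-- a new column assembled from other values on the same row -->
    <field column="label" name="label" template="${item.name} (id ${item.id})"/>
  </entity>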


Deeper Nesting

And the story gets even more interesting.  What if you had XML data that you wanted to handle in DIH, but it was stored in a database instead of in the filesystem or a simple RSS feed?  And what if, in order to query that XML content from the database, you actually had to do joins to pull in metadata from other tables?  Not a problem – DIH has you covered!  The top-level entity would set up the query against the main database table, and you could have other entities that do joins to other tables.  And eventually, when you fetch the XML content, you can further nest CLOB and XML entity processors under there to pull out additional key values with XPath.  So DIH can combine SQL, XPath, regex cleanup, and any scripting language you'd like, all in one config file, often without needing to write any Java code – now THAT'S COOL.  I realize there are stand-alone ETL products that can do even more, but DIH is open source and already integrated into Solr, so it's quite convenient to leverage.
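
A rough sketch of that whole stack (every table, column, and XPath here is invented): a JDBC entity joins in the metadata table, ClobTransformer turns the CLOB column into a string, and a nested XPathEntityProcessor reads that column through a FieldReaderDataSource:

  <dataConfig>
    <dataSource name="db" type="JdbcDataSource" driver="org.postgresql.Driver"
                url="jdbc:postgresql://localhost/cms" user="solr" password="secret"/>
    <dataSource name="xmlField" type="FieldReaderDataSource"/>
    <document>
      <entity name="page" dataSource="db" transformer="ClobTransformer"
              query="select p.id, p.xml_body, m.category
                     from pages p join page_meta m on m.page_id = p.id">
        <field column="xml_body" clob="true"/>
        <field column="category" name="category"/>
        <!-- parses the XML sitting in the parent row's xml_body column -->
        <entity name="body" dataSource="xmlField" processor="XPathEntityProcessor"
                dataField="page.xml_body" forEach="/page">
          <field column="title" xpath="/page/title"/>
        </entity>
      </entity>
    </document>
  </dataConfig>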

I think this tremendous flexibility to represent nested business logic in entities and all the machinery to transform data is why DIH looks a bit daunting at first.  DIH has enough smarts to replace the “pipelines” of other products like FAST ESP.   As a bonus you’re actually encouraged to generate additional queries to pull in more data; although this was technically feasible in ESP, the practice was discouraged and you really had to “roll your own”.


Extending DIH

If you're a Java coder and you really can't assemble what you need from the existing components, then maybe it's time to break out the compiler.  There are quite a few Entity types and Transformers listed in the Wiki.

Also, depending on what you need, extending DIH might be overkill.  Erik Hatcher wrote a nice blog post about Indexing with SolrJ that might be easier.

If I still haven’t talked you out of this, then here goes!

The main dataimporthandler code lives under the contrib modules area in Solr, and everything's in the org.apache.solr.handler.dataimport package.  There's also a second contrib module called dataimporthandler-extras for Tika and email; it's kept separate in order to partially segregate dependencies.

As I said above, you might consider extending the EntityProcessor code instead of the DataSource code.  I’m still thinking about the implications of each.
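
Whichever route you take, the config wires in custom code the same way it wires in the stock classes: the processor, transformer, and dataSource type attributes all accept fully qualified class names.  So a hypothetical custom processor and transformer (the class names and the serverUrl parameter below are made up) would be referenced like this:

  <entity name="tickets"
          processor="com.example.dih.TicketEntityProcessor"
          transformer="com.example.dih.TicketCleanupTransformer"
          serverUrl="https://tickets.example.com/api">
    <field column="subject" name="subject"/>
  </entity>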

There are four places to extend DIH from:
  • DataSource
  • EntityProcessor (use EntityProcessorBase)
  • Transformer
  • Evaluator

There are three main families of DataSources to consider, which I'll group by their return type:
  • Binary Data – returns InputStream
  • Textual Data – returns Reader
  • Record-based Data – returns Iterator<Map<String, Object>>

Some other classes to be aware of:
  • ContextImpl tells you everything about the inbound request, your schema, etc.
  • EventListener – tap into everything that’s going on, create callbacks, etc.
  • VariableResolver and Custom Evaluators – in case you have some legacy template language (you wouldn’t extend VariableResolver, but good to be aware of it)
  • DIHWriter, SolrWriter and DocBuilder get your data into Solr (you wouldn’t extend these, but also good to know about)
  • MockDataSource is under the main code branch (vs. test), and returns Record-like data (Iterator<Map<String, Object>>) – this might be helpful if that's the sort of data you're working with
  • AbstractDataImportHandlerTestCase is over in the unit test source code and is a good place to start when writing your tests.  You might also look at the other tests that make use of it to get some ideas.

I’d love to hear what you’re doing with DIH, especially if you’ve actually extended it.

