Using Solr (Lucene) for search with Ruby on Rails
Well at this point we’ve used just about all the common search solutions over at Vodpod.com.
We started with Ferret, which was simple to get going and worked pretty well, but doesn’t scale out to a cluster very easily.
Then we moved on to Sphinx using the Ultrasphinx plugin from Evan Weaver. This setup works great. Indexing is super fast, and the searchd daemon works from our cluster very well. It turns out to just be easier to do a full reindex pretty often (rather than trying to keep the index up-to-date), and Sphinx makes this easy. I recommend Sphinx very highly, and I guess the fact that they’re using it on Craigslist is probably a pretty good endorsement as well!
And if we only had site search, we would have left it at Sphinx. However, we also use search to generate our related videos list next to each video. We used Sphinx for this too, and it worked pretty well, but we wanted to boost videos by date to favor more recent stuff. Unfortunately, Sphinx doesn’t seem to support date boosting when doing an OR query, so this wasn’t possible for our related videos.
So recently we decided to implement Apache Solr to see if we could get better related videos. Having left Java behind when I moved to Rails a few years back, I was in no hurry to start running Java again! And indeed, where implementing Ferret and Sphinx were single afternoon projects, getting Solr running took more like a week! Ah well, such are the joys of the Java world.
Getting Solr Running
First, don’t make the mistake I did and install the Ubuntu packages. At least, don’t get Solr 1.3.0 (when you read this there may be an updated release). You want the Solr Nightly Builds. The reason is that the nightly builds contain a new component called the DataImporter (the DataImportHandler class). This component allows you to build your index by querying directly from MySQL. The standard Solr approach is to build your index by POSTing XML docs to the Solr server. Ick! (I assume this approach is used by acts_as_solr, which is why I avoided that plugin).
So unpack the Solr nightly build. It conveniently includes the Jetty web server as a container, so you don’t need Tomcat. I’m running Jetty in production, so I’ve avoided Tomcat altogether. To build, you’re gonna need the Java JDK (get the JDK with the compiler, NOT the JRE). You also need Ant.
Go ahead and run “ant example”. If you’ve got everything, it will build a sample Solr installation in your “example” directory. Go in there and run: “java -jar start.jar”. Now Solr is running, and you can interact with a simple admin interface at http://localhost:8983/solr/admin.
Index Your Data
Now that Solr is running, you’ll want to get set up to index the data from your database. First of all, get the JDBC driver for MySQL. All you need from that is the mysql-connector-XX.jar file. Drop this file into <solr>/lib. Now compile Solr again and it will pack that jar into your WAR file.
Now, working in your example directory, edit example/solr/conf/solrconfig.xml. You need to add this handler to activate the DataImporter component.
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">data-config.xml</str>
</lst>
</requestHandler>
Define your Index
Getting Solr to index your data takes two basic steps. The first is to define the schema of your index. This just defines all the possible fields and their types in the index. The second step is to define your SQL query and map the result columns to the fields in the index schema.
There’s a bunch of details to the schema. Look in example/solr/conf/schema.xml for instructions. I deleted the example fields defined in there and added my own:
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="title" type="text" indexed="true" stored="false" required="true" termVectors="true" />
<field name="tags" type="text" indexed="true" stored="false" required="false" termVectors="true" />
<field name="created_at" type="string" indexed="true" stored="false" required="false" />
Now, you need to create the config file for the DataImporter (as specified in the requestHandler block above). So create a file in the same dir called “data-config.xml”. This is what mine looks like:
<dataConfig>
<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://dbserver/myrailsdb" user="dbuser" password="dbpassword" batchSize="10000" />
<document>
<entity name="video" pk="id" query="select videos.id,videos.title,videos.created_at from videos">
<field column="id" name="id" />
<field column="title" name="title" />
<field column="created_at" name="created_at" />
</entity>
</document>
</dataConfig>
So dataSource defines your connection to the db. Then entity defines your SQL query. And finally each field maps a column from the result set to a field in your index as defined in schema.xml.
Get Indexing!
Ok, with all this in place, restart Solr, then go hit:
http://localhost:8983/solr/admin/dataimport.jsp
This gives you a primitive interface to the DataImporter. Go ahead and click Full Import to build your index. You should be tailing the Solr log to look for any errors. Once your index is built, you can use the regular Solr admin page to run queries against it. If you aren’t familiar with Lucene, you’re gonna want to understand the difference between indexed and stored fields (indexed fields are searchable, but you only get stored fields back in your results). I recommend storing everything to start so you can see the results immediately when you query. In production, however, I’m only storing IDs, and then looking up each result record in the database.
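If you’d rather kick off the import from a script (say, from cron) than click through the JSP page, the same handler can be driven over plain HTTP. Here is a minimal sketch, assuming the handler is registered at /dataimport as in the solrconfig.xml block above and that Solr is on localhost:8983. The helper names dataimport_url and run_dataimport are made up for this example.

```ruby
require 'net/http'
require 'uri'

# Assumed base URL of the example Solr server; adjust for your setup.
SOLR_BASE = 'http://localhost:8983/solr'

# Build the URL for a DataImportHandler command such as
# 'full-import' or 'status'. The '/dataimport' path matches the
# requestHandler name registered in solrconfig.xml.
def dataimport_url(command, base = SOLR_BASE)
  uri = URI.parse("#{base}/dataimport")
  uri.query = URI.encode_www_form('command' => command)
  uri
end

# Send the command to Solr and return the raw response body.
def run_dataimport(command, base = SOLR_BASE)
  Net::HTTP.get(dataimport_url(command, base))
end
```

Calling run_dataimport('full-import') starts the import, and run_dataimport('status') can be polled afterwards to see how far along it is. Keep tailing the Solr log either way.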
Using Solr from Rails
Creating the client code to talk to Solr turns out to be super easy. Basically you use HTTP GET to call the server, and pass the wt=ruby argument to get Ruby-formatted results. Then just call “eval” on the results, and you get a Hash like this:
{'response' => {'numFound' => total results, 'docs' => [array of doc Hashes]}}
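To make that concrete, here is a rough sketch of what such a client could look like. It assumes the standard /select handler on localhost:8983; the method names (solr_select_url, parse_solr_response, solr_search) are invented for illustration and are not from any plugin.

```ruby
require 'net/http'
require 'uri'

# Build a select URL that asks Solr for Ruby-formatted output (wt=ruby).
def solr_select_url(query, rows = 10, base = 'http://localhost:8983/solr')
  uri = URI.parse("#{base}/select")
  uri.query = URI.encode_www_form('q' => query, 'rows' => rows, 'wt' => 'ruby')
  uri
end

# Turn the response body into a Hash. eval runs whatever the server
# sends back, so only do this against a Solr server you control.
def parse_solr_response(body)
  eval(body)
end

# Fetch and parse in one step.
def solr_search(query, rows = 10)
  parse_solr_response(Net::HTTP.get(solr_select_url(query, rows)))
end
```

With this in place, solr_search('title:cats')['response']['docs'] would give you the array of doc Hashes. If eval makes you nervous, wt=json plus a JSON parser is the safer route.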
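Each of my ‘doc’ Hashes includes an ‘id’ field which is the ID of my Rails record. To be cool, I look up all these records with one query:
# Collect the ids in rank order; to_i keeps anything non-numeric out of the SQL.
ids = solr_data['response']['docs'].collect { |doc| doc['id'].to_i }
# FIELD() is MySQL-specific and returns the rows in the same order Solr ranked them.
videos = ids.empty? ? [] : Video.find_by_sql("select * from videos where id IN (#{ids.join(',')}) order by FIELD(id,#{ids.join(',')})")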
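The FIELD() trick only works on MySQL. If you are on another database, one alternative is to fetch the rows in any order and put them back into Solr’s ranking in Ruby. This is just a sketch: order_by_rank is a hypothetical helper, and the Struct stands in for an ActiveRecord model.

```ruby
# Reorder records to match the ranked list of ids from Solr.
# Records can be anything that responds to #id. Ids with no matching
# record (e.g. deleted since the last reindex) are skipped.
def order_by_rank(records, ranked_ids)
  by_id = {}
  records.each { |record| by_id[record.id.to_s] = record }
  ranked_ids.map { |id| by_id[id.to_s] }.compact
end

# Stand-in for a Video model, for demonstration only.
Row = Struct.new(:id, :title)
rows = [Row.new(1, 'a'), Row.new(2, 'b'), Row.new(3, 'c')]
ranked = order_by_rank(rows, ['3', '1', '9'])
puts ranked.map(&:id).inspect  # prints [3, 1]
```

Comparing ids as strings means it doesn’t matter that Solr hands them back as strings while the database returns integers.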
Conclusion
I’m happy with the results we get from Solr (although you need to read my next post about how I hacked Solr to get better results). Indexing is also quite fast (about 7 minutes for 1 million docs, although this seems to slow down if the server is under load). However, search time varies from quite fast (10ms) to unacceptably slow (over 1 sec), although this is largely due to the complicated date boosting we do in our queries. So I ended up caching the results in my db to avoid having to call Solr for every action. However, our Sphinx setup runs so well that for now we are keeping it for site search.
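For the curious, the caching idea boils down to something like the sketch below. This is not our actual code: RelatedCache is a hypothetical class, and a plain Hash stands in for the database table. The point is only that the slow Solr call runs when there is no fresh entry for a video.

```ruby
# Hypothetical cache for related-video ids. A Hash stands in for the
# database table; entries older than ttl seconds are recomputed.
class RelatedCache
  # clock is injectable so the expiry logic can be exercised without waiting.
  def initialize(ttl, clock = -> { Time.now })
    @ttl = ttl
    @clock = clock
    @store = {}
  end

  # Return the cached ids for video_id, or run the block (the Solr
  # call) and remember its result along with the time it was stored.
  def fetch(video_id)
    entry = @store[video_id]
    now = @clock.call
    return entry[:ids] if entry && now - entry[:at] < @ttl
    ids = yield
    @store[video_id] = { ids: ids, at: now }
    ids
  end
end
```

Usage would look like cache.fetch(video.id) { ids_from_solr }, where the block is whatever makes the Solr request. The ttl decides how stale a related list is allowed to get.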
I use acts_as_solr to search a site with photos (~300000 records). Indexing is quite fast, but not as fast as your numbers. For a video site I need similar-videos functionality, which made me dig deeper (at the time of writing, acts_as_solr does not seem to support the syntax for mlt). Your article is an eye opener :) Thanks!
More on-topic information would be nice, and then you’d have my full respect.
I couldn’t install or use the acts_as_solr plugin. Please tell me how to do it.
Everything seems to be fine, but for some reason I can’t follow the link.
I use the acts_as_solr plugin on Windows, but when I run “rake solr:start” or “rake solr:start_win”, I get this error:
rake aborted!
uninitialized constant Net