Hacking Solr with Rails - part 2

scottp March 9th, 2009

In our last exciting episode, we talked about getting Solr running with Rails. My specific requirement was to generate a list of videos related to another video.

Fortunately, Solr makes this super easy with its MoreLikeThis handler. Building on a class from the Lucene library, that handler makes it really easy to say, “given doc X in my index, return my other docs like that one”. It does this by building a query based on your subject doc. The query it builds is smart about obeying the tf/idf of each term to get you good results.

However, the MoreLikeThis handler doesn’t give you any control over the ordering of your results - you just get relevance order. I wanted to boost more recent videos over older ones. I didn’t want to sort my results by date, because that potentially puts bad matches at the top. Instead I just wanted to boost more recent results so they generally appeared towards the top.

Now Solr includes a general search handler called the DisMaxRequestHandler which supports this cool bf argument, by which you can provide a boost function. Essentially the function gets executed for every matching document and its result is added to the relevance score of the doc to arrive at its overall score. This works really nicely, and there’s even an example in the documentation showing how to boost by recency.

Unfortunately, the MoreLikeThis handler doesn’t have that boost function support. So I decided to hack it in. Very simply, I just copied the code that handles the “bf” argument from the DisMaxRequestHandler into the MoreLikeThisHandler. Under the covers, both handlers build a Lucene query to apply to your index, so the mod to MoreLikeThis just required building the query parts from the boost function and adding them to the “more like this” query. Then I just had to play around somewhat with my boost function to get results in the order I liked.
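To show why a boost behaves differently from a date sort, here is a toy model in plain Ruby. It is not Solr code: the documents, scores and the 0.5 weight are invented, and it assumes the boost is simply added to the relevance score. The recip helper follows the same a / (m*x + b) shape as Solr’s recip function.

```ruby
# Toy model of recency boosting; not Solr code. All data here is invented.
Doc = Struct.new(:title, :relevance, :age_days)

# Same shape as Solr's recip(x, m, a, b) function: a / (m*x + b).
def recip(x, m, a, b)
  a.to_f / (m * x + b)
end

# Assumption: the boost is added to the relevance score with a weight.
def boosted_score(doc, weight = 0.5)
  doc.relevance + weight * recip(doc.age_days, 1, 1000, 1000)
end

docs = [
  Doc.new("old, strong match", 2.0, 900),
  Doc.new("new, strong match", 1.9, 10),
  Doc.new("new, weak match",   0.3, 1)
]

# Sorting purely by date puts the weak match on top.
by_date = docs.sort_by { |d| d.age_days }.map(&:title)

# Boosting nudges the newer strong match ahead, but the weak match stays last.
by_boost = docs.sort_by { |d| -boosted_score(d) }.map(&:title)
```

The weight is the knob I mean by “playing around”: a larger weight makes recency count for more relative to relevance.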

I don’t pretend to have made this mod in the cleanest or best way possible. But it is nice that Solr/Lucene allow this kind of hacking, even though it’s a lot harder than hacking something like Rails! I’ve attached my modified MoreLikeThisHandler.java file to this post just in case it comes in handy for anyone else.

MoreLikeThisHandler.java

Using Solr (lucene) for search with Ruby on Rails

scottp March 9th, 2009

Well at this point we’ve used just about all the common search solutions over at Vodpod.com.

We started with Ferret, which was simple to get going and worked pretty well, but doesn’t scale out to a cluster very easily.

Then we moved on to Sphinx using the Ultrasphinx plugin from Evan Weaver. This setup works great. Indexing is super fast, and the searchd daemon works from our cluster very well. It turns out to just be easier to do a full reindex pretty often (rather than trying to keep the index up-to-date), and Sphinx makes this easy. I recommend Sphinx very highly, and I guess the fact that they’re using it on Craigslist is probably a pretty good endorsement as well!

And if we only had site search, we would have left it at Sphinx. However, we also use search to generate our related videos list next to each video. We used Sphinx, and this works pretty well, but we wanted to boost videos by date to favor more recent stuff. Unfortunately, Sphinx doesn’t seem to support date boosting when doing an OR query, so this wasn’t possible for our related videos.

So recently we decided to implement Apache Solr to see if we could get better related videos. Having left Java behind when I moved to Rails a few years back, I was in no hurry to start running Java again! And indeed, where implementing Ferret and Sphinx were single afternoon projects, getting Solr running took more like a week! Ah well, such are the joys of Java world.


Mysql Replication Adapter bug with Rails query cache

scottp January 13th, 2009

[Note: this post relates to our using Rails 2.0.2. The fix provided below may not work in other versions of Rails.]
We recently started using the Mysql Replication Adapter from the RapLeaf guys to enable us to off-load some queries to our mysql slaves.

The adapter is easy to install, configure and use, which is why we chose it over other possibilities.

However, today we found a little bug, and we traced it to the interaction of the Rails 2 query cache with the mysql_replication adapter. The short version is that when using the mysql_replication adapter, the query cache doesn’t get purged even if within your Rails action you perform an insert/update/delete operation. The normal adapter has the behavior that any operation to modify data clears the query cache. The bug we were seeing looked something like this:

  items = user.items
  # items is empty because user is new
  user.items << Item.new(...)
  ...
  items = user.items(true)
  ## ACK!! items is STILL empty. We should see that item we just created.

So even though we were passing force_reload=true to our association finder, the query was still hitting the query cache which had previously returned an empty set.

The mysterious alias_method_chain

Turns out that the problem is the result of different approaches to class extension used by Rails core versus the Rapleaf adapter. Rails relies heavily on the alias_method_chain approach. This approach basically keeps the old method ‘A’ around under a new name, then installs a new method under the name ‘A’. The new method is responsible for invoking the old one. The result is that now when I call method ‘A’ on the class, it invokes the new method first, which in turn calls the old method.
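Here is the pattern spelled out in plain Ruby, without Rails. This is only a sketch of what alias_method_chain arranges for you. The Greeter class and its method names are made up for illustration.

```ruby
# Plain-Ruby illustration of the alias_method_chain pattern (no Rails needed).
class Greeter
  def greet
    "hello"
  end
end

# Reopen the class and wrap the existing method.
class Greeter
  # New behaviour, which delegates to the original implementation.
  def greet_with_shout
    greet_without_shout.upcase + "!"
  end

  alias_method :greet_without_shout, :greet  # keep the original under a new name
  alias_method :greet, :greet_with_shout     # point the public name at the wrapper
end

greeting = Greeter.new.greet
```

Callers still just call greet. They get the wrapper, and the wrapper decides when to call the original.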

Rails uses this technique to implement the query cache. In particular, it method-chains each of the “insert”, “update”, and “delete” methods on the MysqlAdapter class to call a method to purge the query cache.

Now the Rapleaf code uses what I consider a more “traditional” approach (or maybe you’d call it a Java-like approach) to class extension: it subclasses the MysqlAdapter class. In the subclass, a number of methods are overridden. In particular, the same “insert”, “update”, and “delete” methods are overridden.

So what happens when you call MysqlReplicationAdapter.insert ? Turns out you just call the specialized version of the method, and the query-cache method chaining stuff in the parent class gets ignored. Basically the two versions of class extension aren’t cooperating.

Now the proper fix is probably to re-code the replication adapter to use method chaining instead of subclassing. But that’s not a trivial task! Fortunately, I was able to more easily fix the problem by just replacing the “insert/update/delete” methods of the replication adapter with these methods:

      def insert_sql(sql, name = nil, pk = nil, id_value = nil, sequence_name = nil) #:nodoc:
        ensure_master
        super sql, name
        id_value || @connection.insert_id
      end

      def update_sql(sql, name = nil) #:nodoc:
        ensure_master
        super
        @connection.affected_rows
      end

      def delete_sql(sql, name = nil)
        update_sql(sql, name)
      end

These XX_sql methods are sort of the internal versions of the public-facing methods. Fortunately it’s the public-facing ones that query_cache method-chains, and so these internal ones can be safely overridden.
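To convince myself of the difference, here is a toy model in plain Ruby. It is a sketch, not the real adapter code: ToyAdapter, its cache hash and the return values are all invented. It assumes the overriding subclass does not call super, which is the situation described above.

```ruby
# Toy model of the bug: a chained public method vs. an overridden internal one.
class ToyAdapter
  attr_reader :cache

  def initialize
    @cache = { "SELECT * FROM items" => [] }  # pretend a query result is cached
  end

  # Public method, which delegates to the internal one.
  def insert(sql)
    insert_sql(sql)
  end

  # Internal method that would actually talk to the database.
  def insert_sql(sql)
    :base_insert
  end

  # Query-cache style chaining: clear the cache, then run the original insert.
  def insert_with_cache_clear(sql)
    @cache.clear
    insert_without_cache_clear(sql)
  end
  alias_method :insert_without_cache_clear, :insert
  alias_method :insert, :insert_with_cache_clear
end

# Overrides the public method without calling super: the chain is bypassed.
class OverridesPublic < ToyAdapter
  def insert(sql)
    :replication_insert
  end
end

# Overrides only the internal method: the chained public method still runs.
class OverridesInternal < ToyAdapter
  def insert_sql(sql)
    :replication_insert
  end
end

broken = OverridesPublic.new
broken.insert("INSERT ...")

fixed = OverridesInternal.new
fixed.insert("INSERT ...")
```

After the insert, broken.cache still holds the stale entry while fixed.cache is empty, yet both subclasses got to run their own insert logic.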

Better configuration for your Rails app survives reload!

scottp January 9th, 2009

Historically we’ve added ad-hoc config for our rails app by jamming simple Ruby constants in our environment.rb files. Like this:

RECAPTCHA_PUBLIC_KEY = 'foobar'

Ick! We’ve quickly grown to have more than ten of these, and they are really yucky. So recently we’ve tried cleaning these up by adding our own simple configuration class:

module Remixation
  class Configuration
    cattr_accessor :recaptcha_key
  end
end

Ah…nice. Now we can change our config files to:

Remixation::Configuration.recaptcha_key = 'foobar'

This seems cleaner. It’s easier to manage our various config values since they are all defined in one class. It also means we can update values without generating the dreaded Ruby “redefined constant” warning.

Initially I had put this class under lib, but we quickly ran into a problem. Every time the app reloaded (which in dev is every request) our values would get wiped out. This is because the class def gets reloaded, but our environment statements to set the values don’t get re-run.
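The effect is easy to reproduce outside Rails. The sketch below fakes a reload by removing the constant and defining the class again. ToyConfig and define_config are invented names, and a singleton attr_accessor stands in for cattr_accessor so that it runs in plain Ruby.

```ruby
# Simulates a dev-mode class reload wiping class-level config values.
def define_config
  # Drop the old class object, as a reload would.
  Object.send(:remove_const, :ToyConfig) if Object.const_defined?(:ToyConfig)

  # Define a fresh class with a class-level accessor.
  klass = Class.new do
    class << self
      attr_accessor :recaptcha_key
    end
  end
  Object.const_set(:ToyConfig, klass)
end

define_config
ToyConfig.recaptcha_key = 'foobar'       # what environment.rb does, once
before_reload = ToyConfig.recaptcha_key

define_config                            # the "reload": a brand new class object
after_reload = ToyConfig.recaptcha_key   # nil, because the assignment was never re-run
```

The value lives on the old class object, so once that object is replaced nothing is left to remember it.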

Our eventual solution involved moving our class into RAILS_ROOT/config/remixation, then adding these lines to our environment.rb config block:

config.load_paths += %W( #{RAILS_ROOT}/config/remixation )
config.load_once_paths += %W( #{RAILS_ROOT}/config/remixation )

These tell Rails (v2.1) NOT to reload classes from this new directory. I’d be happy to hear if people have found more elegant ways of handling their app config.

Rails - your pages are still slow! (part 2)

scottp May 21st, 2008

Here at Vodpod we like to think that we’re clever, but that doesn’t mean we’re smart!

Back in January we posted about how to modify Rails to fix a problem with asset tags. The problem was that asset tags coming from different servers would get different ‘asset codes’ appended to them, and thus look like distinct files to the browser. We solved that problem so that the browser sees a consistent URL for files that are the same. I even set myself up with this nice claim at the end:

“In between deploys the browser can happily cache our assets.”

Ah, how misguided we were back four months ago. The problem is that our static files didn’t have the proper cache expiry headers to allow the browser to cache them. Within a session, if you’re lucky, your browser may cache a static file or two. But without explicitly telling the browser that the file doesn’t need to be refetched any time soon, it will pretty quickly go back and ask for that file again.

The solution is simple: you need to add the ‘Expires’ and ‘Cache-Control’ headers to your static resources. These tell the browser that your resource will not change for X amount of time, and thus the browser can safely cache it until then. Now since Rails is gonna change the cache-buster code every time we deploy, this expiry time can in effect be infinite.
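To make the two headers concrete, here is a small Ruby sketch that builds them for a given lifetime. It is purely illustrative: expiry_headers is a made-up helper, and the 30-day lifetime is just an example value.

```ruby
require 'time'  # adds Time#httpdate and Time.httpdate

# Builds the two caching headers for a resource that may be cached
# for max_age_seconds from the moment it is served.
def expiry_headers(now, max_age_seconds)
  {
    "Expires"       => (now + max_age_seconds).httpdate,  # absolute HTTP date
    "Cache-Control" => "max-age=#{max_age_seconds}"       # relative lifetime in seconds
  }
end

thirty_days = 30 * 24 * 60 * 60
served_at   = Time.utc(2008, 5, 21, 12, 0, 0)
headers     = expiry_headers(served_at, thirty_days)
```

Expires gives an absolute date and Cache-Control gives a lifetime in seconds, so together they cover both older and newer caches.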

Now this seems like the kind of thing you would hope that Rails would give you out-of-the-box. The problem is that Rails doesn’t expect to serve static files; that is supposed to be done by your web server. We use Apache, and Apache makes it very easy to add these headers. But it doesn’t do it automagically; you gotta configure it.

Step 1. Make sure you’ve got mod_expires available. I had to compile the .so and configure httpd.conf:

LoadModule expires_module modules/mod_expires.so

Step 2. Tell Apache to add the expiry headers

ExpiresActive On
ExpiresByType image/png "access plus 1 month"
ExpiresByType image/gif "access plus 1 month"
ExpiresByType text/css "access plus 1 month"
ExpiresByType application/x-javascript "access plus 1 month"

You may have to play with the mime types depending on your setup. These directives are for Apache 2.2.x. Check the manual if you’re running something different.

Nice and speedy. This only affects load times for repeat visitors, but assuming you’ve got a lot of those, this should reduce overall connections quite a bit. I strongly recommend people read through the official documentation of the cache control headers: http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.9.
It’s lovely nighttime reading!

Another Rails app

scottp May 21st, 2008

I’ve launched a new rails app, a little side project. It’s a site for finding excellent toys for boys.

This one is very niche - it’s only gonna be useful if you need to buy a present for your son, grandson or nephew. I’ve got 3 sons myself so finding them good, quality toys is a regular concern for me. I’m trying to aggregate comments and reviews of the best toys around.

Loading a Rails session by hand - Flash uploads

scottp March 18th, 2008

Here is some code I worked out a while ago for loading a Rails session manually inside an action.

This is useful for example when doing uploads from Flash, since the Flash runtime does not pass on the Rails session id cookie properly. So instead you have to pass the session id as a normal parameter, and load the session by hand.

opts = {:session_key => 'session', :session_id => params[:session]}
opts = opts.merge(request.class.const_get('DEFAULT_SESSION_OPTIONS'))
sess_opts = opts.inject({}) { |options, (k,v)| options[k.to_s] = v; options }
real_session = CGI::Session.new(request, sess_opts)

Update

Here’s another approach on solving this problem by patching CGI::Session:

http://blog.inquirylabs.com/2006/12/09/getting-the-_session_id-from-swfupload/

Rails - why your pages load so slowly!

scottp January 16th, 2008

A while back we spent some time optimizing the load speed for our main video page. One thing we noticed was that Rails has this habit of tacking on a “cache buster” integer to the end of static asset paths when you use one of the asset tag helpers like “javascript_include_tag”. The problem was that the cache buster integer changed as we visited the same page.

Well if you go look at the source code, the reason is clear:


      asset_id = rails_asset_id(source)
      source << '?' + asset_id


      def rails_asset_id(source)
        ENV["RAILS_ASSET_ID"] ||
          File.mtime("#{RAILS_ROOT}/public/#{source}").to_i.to_s rescue ""
      end

So in its wisdom, Rails uses the mtime of the file to generate the ‘asset tag id’. This has the nice effect that whenever the file changes, you get a new id for the file and the browser knows it needs to reload it. The problem is that we run a cluster of web servers, and the mtime of a file depends on when it was written to each machine, so each server generates a different id. So the same file actually appears as different files from each server.
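You can see the effect with two identical files in plain Ruby. This is only a sketch: asset_id_for is a simplified stand-in for the Rails helper, and the file names, timestamps and the “r1234” value are invented.

```ruby
require 'tmpdir'

# Simplified stand-in for the Rails helper: a fixed id wins, else use the mtime.
def asset_id_for(path, override = nil)
  override || File.mtime(path).to_i.to_s
end

ids_by_mtime = nil
ids_with_override = nil

Dir.mktmpdir do |dir|
  # Two copies of the same file, as if deployed to two servers.
  copies = %w[server_a.js server_b.js].map { |name| File.join(dir, name) }
  copies.each { |path| File.write(path, "alert('hi');") }

  # Pretend the deploy reached the second server five minutes later.
  File.utime(Time.at(1_200_000_000), Time.at(1_200_000_000), copies[0])
  File.utime(Time.at(1_200_000_300), Time.at(1_200_000_300), copies[1])

  ids_by_mtime      = copies.map { |path| asset_id_for(path) }
  ids_with_override = copies.map { |path| asset_id_for(path, "r1234") }
end
```

Same bytes, different ids when the mtime is used, and a single shared id as soon as something outside the filesystem supplies it.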

The net result is to kill page load times, especially if your page has multiple Javascript files. The browser keeps seeing a different file, so it can’t just use the one it has cached.

Fortunately Rails gives you an out: you can set the asset ID yourself using the RAILS_ASSET_ID environment variable. But how do we set this value? Well, we want it to change whenever the file changes. A good proxy is the latest SVN revision.

So we wrote a rake task to dump the SVN revision to a file:


  desc "Writes latest svn update number to config/svn_version for use as asset tag id"
  task(:svn_version => :environment) do
    lines = `svn log -r HEAD`
    if lines =~ /(r\d+)/
      # Block form closes the file even if the write raises
      File.open("config/svn_version", "w") { |f| f.write($1) }
    end
  end

Now we just add some code to environment.rb to read this file on startup into our constant:

# Setup the ENV["RAILS_ASSET_ID"] so that our resources look the same on every machine. This
# assumes that rake remix:svn_version has been run on each machine
if File.exist?("config/svn_version")
  File.open("config/svn_version", "r") {|f| ENV["RAILS_ASSET_ID"] = f.readline }
end
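The write-then-read round trip can be tried without svn or Rails. In this sketch the log text is an invented sample of what `svn log -r HEAD` prints, and the helper names are hypothetical.

```ruby
require 'tmpdir'

# Invented sample of `svn log -r HEAD` output.
SAMPLE_LOG = <<~LOG
  ------------------------------------------------------------------------
  r1234 | someuser | 2008-01-16 10:00:00 -0800 (Wed, 16 Jan 2008) | 1 line

  Fix player layout
  ------------------------------------------------------------------------
LOG

# Pulls the first "r<digits>" token out of the log text, or nil if none.
def revision_from_log(log_text)
  log_text[/r\d+/]
end

# What the rake task does: write the revision to a file if one was found.
def write_version(path, log_text)
  rev = revision_from_log(log_text)
  File.open(path, "w") { |f| f.write(rev) } if rev
  rev
end

# What startup does: read the revision back, or nil if the file is missing.
def read_version(path)
  File.exist?(path) ? File.read(path).strip : nil
end

asset_id = nil
Dir.mktmpdir do |dir|
  path = File.join(dir, "svn_version")
  write_version(path, SAMPLE_LOG)
  asset_id = read_version(path)
end
```

Every server that runs the task against the same checkout ends up with the same string, which is the whole point.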

Finally, we added a call to our remix:svn_version task to our deploy scripts. Now whenever we deploy, we rev the asset id and the browser knows to reload files that (may) have changed. In between deploys the browser can happily cache our assets.

Blog Importer

phil January 10th, 2008

There’s a new feature on Vodpod that allows you to keep your Vodpod account in sync with your blog. Simply click on “Add a video to this pod” from your pod’s home page, then fill in the field that asks for your blog’s web address:

Blog Importer Screenshot

After that any videos that you post to the specified blog will be imported into your pod. To turn off the importing, you can go into your pod settings and delete the pod from the list of imports. Blog importing will work with any blog that advertises an RSS feed, which should be most of them.

Rails and Memcache timeouts

scottp December 14th, 2007

Lately we’ve been having trouble with our rails processes hanging when talking to memcached. The symptom was Mongrel killing our request after 60 secs, and the stack trace would show we were hung up trying to talk to memcache.

This seems to only happen when our db is slow in responding. Seems like perhaps dead sockets to memcache were not getting cleaned up.

Anyway, I found that the Rapleaf guys had seen similar trouble. They described adding timeouts to the memcache-client to fix the problem.

So we tried something similar and it did in fact seem to help. I have attached the modified memcache-client code to this post. The modifications are mostly at the top. I added a facade TCPTimeoutSocket class which wraps the normal TCPSocket class with timeouts.
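For anyone who just wants the idea, here is a minimal sketch of such a facade. It is not the attached code: the class name and the choice of Ruby’s Timeout module are mine for illustration, and it wraps only a few of TCPSocket’s methods.

```ruby
require 'socket'
require 'timeout'

# Minimal facade: same calls as a socket, but each one gives up after a deadline.
class TimeoutSocketSketch
  def initialize(host, port, timeout_secs)
    @timeout = timeout_secs
    # Connecting can hang too, so it gets the same deadline.
    @sock = Timeout.timeout(@timeout) { TCPSocket.new(host, port) }
  end

  # Reads one line, raising Timeout::Error if the server stays silent.
  def gets
    Timeout.timeout(@timeout) { @sock.gets }
  end

  # Writes data, raising Timeout::Error if the write blocks too long.
  def write(data)
    Timeout.timeout(@timeout) { @sock.write(data) }
  end

  def close
    @sock.close
  end
end
```

A caller rescues Timeout::Error and treats the server as unavailable instead of waiting for Mongrel to kill the request.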

memcache.rb
