In between watching the “balloon boy” news today, we managed to launch an upgrade to our search infrastructure. For quite some time we’ve been running Sphinx for search, and a small Solr implementation just to return videos related to another video.
Today we’ve upgraded to running all of our search functions on Solr. Although we’re big fans of Sphinx, we decided it was more work to support two separate systems, and Solr just proved more flexible overall.
That said, converting over to Solr took about 5 times as long as our original Sphinx implementation. That was a function of both Solr’s complexity and the complexity of adapting our existing code. Solr really is a more powerful search solution, but it requires significantly more investment at the beginning to get running.
Vodpod has, for a while now, allowed users to set up “video imports”, which monitor some external resource and bring into the user’s account any videos it finds. Most of them are based on RSS feeds, since that’s a pretty universal way that web services expose content. Until now, the only way to monitor these feeds was by polling, at most once per hour. Now, all feeds that support RSS Cloud (currently only WordPress) are no longer polled, but have updates pushed as they happen. This means that if you have a WordPress blog that’s being monitored by Vodpod for new videos, you no longer have to wait up to an hour for new videos that you post on your blog to be sucked into Vodpod. Just make the post, and the video will likely be in your Vodpod account before you can even type vodpod.com into your browser.
This is really a win for everyone involved. We get to stop polling over 5,000 of our feeds, WordPress gets to stop handling all those polls (even if north of 95% of them are 304s), and our users get instant instead of unpredictably-delayed results.
With the release of Vodspot comes the public beta of our new API service. I’m here to talk a little bit about some of the technology that powers the new API core.
The API is in a unique position in the Vodpod system. The old API was built into our main Vodpod Rails application, which has dozens of required libraries and a hefty memory footprint. We wanted the API to be as small, fast, and modular as possible. In fact, since we have no need for templating, URL routing, formatting helpers, or even a full controller architecture, using a large MVC framework would have been overkill.
Ramaze was a natural choice, for three reasons:
- It’s built on Rack, a Ruby webserver interface which is quickly becoming the way to glue together Ruby web applications. Rack provides pretty much all the request processing we need to handle HTTP endpoints, and allowed us to write dedicated middleware for error handling and statistics.
- Ramaze is compact. We didn’t need to pull in a templating engine, layout system, or file server to get the job done. It doesn’t get in your way when it comes to requiring files or setting up the database, which means we were able to build the API as both a library for inclusion in other applications (such as Laminate), and as a full-fledged web application.
- It provides a slew of useful builtins without much cruft. We took advantage of easy-to-configure Rack middleware, support for basically all Ruby HTTP servers (we chose Thin), centralized logging, controller aspects, and awesome Memcache support.
For database work, we chose Sequel, a Ruby ORM with great support for sharding and other multiple-DB setups. Sequel is a perfect match for our API because it models queries as Datasets: extensible objects encapsulating all the information about a query with chainable modification methods. That lets us take our basic Models and filter, sort, combine, and otherwise transmogrify the SQL queries in several ways, all before any queries are made.
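To make the "chainable, nothing runs until you ask" idea concrete, here is a minimal sketch in plain Ruby. ToyDataset is a made-up stand-in written for this post, not Sequel's actual implementation, and the table and column names are assumptions for illustration.

```ruby
# Toy stand-in for a Sequel-style dataset; this is NOT the Sequel library.
class ToyDataset
  def initialize(table, wheres = [], order = nil, limit = nil)
    @table, @wheres, @order, @limit = table, wheres, order, limit
  end

  # Each modifier returns a NEW dataset, so the base dataset stays reusable.
  def where(condition)
    ToyDataset.new(@table, @wheres + [condition], @order, @limit)
  end

  def order(column)
    ToyDataset.new(@table, @wheres, column, @limit)
  end

  def limit(count)
    ToyDataset.new(@table, @wheres, @order, count)
  end

  # SQL is only assembled on demand; nothing here touches a database.
  def sql
    query = "SELECT * FROM #{@table}"
    query += " WHERE #{@wheres.join(' AND ')}" unless @wheres.empty?
    query += " ORDER BY #{@order}" if @order
    query += " LIMIT #{@limit}" if @limit
    query
  end
end

videos = ToyDataset.new("videos")  # hypothetical table name
recent_python = videos.where("tag = 'python'").order("created_at DESC").limit(5)

puts recent_python.sql
puts videos.sql  # the base dataset is unchanged by the chain above
```

The point of the design is that every step hands back another dataset, so filters and sorts can be layered on in any order before a single query is issued.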
We took advantage of the Sequel plugin system to write custom serializers for XML and JSON support, using the Ruby JSON library and LibXML2 bindings. We also implemented much of the request processing code as methods on Sequel datasets: HTTP and Laminate function parameters like sort and tags are implemented across all our models as dataset methods.
Moreover, Sequel’s powerful query language let us accomplish things that we had to write in SQL by hand using ActiveRecord: many-through-many associations across n tables, filterable unions on multiple copies of the same dataset, and arbitrarily grouped tagging joins. All of this makes for more modular, reusable code.
These models are glued together by the API core, which provides the functions essential to Vodspot: videos(), user(), and so forth. All responses from the core are Sequel Models or Arrays (with a few extensions), supporting #to_json, #to_xml, and #to_hash. The appropriate serialization method is used by the caller, depending on their needs.
Laminate, our templating system for Vodspot, is actually bound directly to the API core via rufus-lua. Return values are transformed into Lua tables with #to_hash and handed to the templating engine for rendering.
HTTP requests are where Ramaze comes in: when configured as an app, the API sets up a Ramaze controller on top of the API core. It essentially transforms URLs like /users/spencer/collection/spencerpod/videos?tags=python&offset=5 into calls like videos('spencer', 'spencerpod', :tags => 'python', :offset => 5). We’re using the aspect helper to handle user identification via api_key and auth_key, and to check the cache. We use Ramaze’s provides function to format responses depending on the path extension, and to populate the cache.
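The URL-to-call mapping described above can be sketched without Ramaze at all. This is a rough illustration using only Ruby's standard library; the dispatch helper and the stub videos function are assumptions made up for the example, not the controller code we actually run.

```ruby
require "uri"

# Stub standing in for the API core's videos() function (hypothetical body).
def videos(user, collection, options = {})
  [user, collection, options]
end

# Turn a request URL into a videos() call.
def dispatch(url)
  uri = URI.parse(url)
  # "/users/spencer/collection/spencerpod/videos" -> five path segments
  _, user, _, collection, action = uri.path.split("/").reject(&:empty?)
  raise ArgumentError, "unsupported path: #{uri.path}" unless action == "videos"

  options = {}
  URI.decode_www_form(uri.query.to_s).each do |key, value|
    # Purely numeric values become Integers, so offset=5 arrives as 5.
    options[key.to_sym] = value.match?(/\A\d+\z/) ? value.to_i : value
  end
  videos(user, collection, options)
end

result = dispatch("/users/spencer/collection/spencerpod/videos?tags=python&offset=5")
p result
```

A real controller would also handle the other endpoints and the path extension; this sketch only covers the one URL shape quoted above.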
Finally, some lightweight Rack middleware responds to errors that are raised anywhere in the application, formatting them based on the request path.
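For a feel of how small such middleware can be, here is a sketch of the general shape. Rack middleware is just an object that wraps another app and responds to call(env); the class name, the 500 status, and the exact error payloads below are assumptions for illustration rather than our production code.

```ruby
require "json"

# Rack-style middleware: wraps an app and formats any error it raises.
class ErrorFormatter
  def initialize(app)
    @app = app
  end

  def call(env)
    @app.call(env)
  rescue StandardError => e
    # Choose the error format from the request path's extension.
    if env["PATH_INFO"].to_s.end_with?(".xml")
      body = "<error>#{e.message.encode(xml: :text)}</error>"
      [500, { "Content-Type" => "application/xml" }, [body]]
    else
      body = JSON.generate("error" => e.message)
      [500, { "Content-Type" => "application/json" }, [body]]
    end
  end
end

# A stand-in app that always fails, just to exercise the middleware.
failing_app = lambda { |_env| raise "video not found" }
app = ErrorFormatter.new(failing_app)

json_status, _json_headers, json_body = app.call("PATH_INFO" => "/videos/1.json")
xml_status, _xml_headers, xml_body = app.call("PATH_INFO" => "/videos/1.xml")
puts json_body.first
puts xml_body.first
```

Because the rescue sits at the outermost layer, code deeper in the stack can simply raise and never worry about response formats.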
There’s our API architecture in a nutshell. You can see it in action at http://api.vodpod.com/v2.
With today’s announcement of our new VodSpot 2.0 product, we are also introducing a new page templating system we call Laminate. We built Laminate for the purpose of allowing our VodSpot users to modify the template pages used to generate their video sites.
Laminate is very similar in purpose and motivation to the Liquid Template system. Both systems aim to offer an HTML-based templating system that is safe to execute user-written templates. However, where Liquid introduces its own syntax, Laminate takes a different approach.
Laminate works by binding the Lua language runtime into the Ruby runtime. Lua is not super widely known, but it was purpose-built to be an embedded language. It sees heavy use today as the scripting language for World of Warcraft and Adobe Lightroom.
Even better, Lua is a very simple language. Here’s a basic “print hello world ten times” program:
for i=1,10 do
  print "hello world"
end
By embedding Lua into Ruby, we get a full-featured programming language that also executes in a nice sandbox where it can’t do anything malicious. Inside one of our Laminate templates, this would just look like:
{{ for i=1,10 do }}
<h3>hello world</h3>
{{ end }}
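One way to picture what happens to a template like that: the text outside the {{ }} markers becomes "append this string" statements, and the text inside is passed through as Lua. The sketch below is a guess at the general technique in plain Ruby, not Laminate's actual compiler; compile_template and lua_quote are hypothetical names, and the generated Lua would still need to be handed to a Lua runtime to execute.

```ruby
LUA_ESCAPES = { "\\" => "\\\\", "\"" => "\\\"", "\n" => "\\n", "\r" => "\\r" }.freeze

# Quote literal template text as a Lua string.
def lua_quote(text)
  "\"" + text.gsub(/[\\"\n\r]/) { |ch| LUA_ESCAPES[ch] } + "\""
end

# Hypothetical compiler: turn {{ ... }} template text into Lua source.
def compile_template(template)
  lua = ["local out = {}"]
  # The capture group makes split keep the code between {{ and }},
  # so odd-numbered chunks are Lua and even-numbered chunks are HTML.
  template.split(/\{\{(.*?)\}\}/m).each_with_index do |chunk, index|
    if index.odd?
      lua << chunk.strip
    elsif !chunk.empty?
      lua << "out[#out + 1] = #{lua_quote(chunk)}"
    end
  end
  lua << "return table.concat(out)"
  lua.join("\n")
end

template = "{{ for i=1,10 do }}\n<h3>hello world</h3>\n{{ end }}"
lua_source = compile_template(template)
puts lua_source
```

Run through a Lua interpreter, the generated source would build up the heading ten times and return the joined result.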
So in comparison to Liquid, I think Laminate offers a more powerful templating language. That may be good or bad depending on your circumstances. For our needs we felt that in order to offer truly powerful customization to our VodSpot users, including the ability to build whole new functions and access data that we might not even have envisioned, we wanted to offer a “full” programming language to the template writer. Note that this does come at a possible cost in reliability. Liquid is pretty dependable since it’s 100% Ruby, while running arbitrary Lua inside your Ruby interpreter definitely poses some additional risks.
Laminate is built on the Lua->Ruby integration library Rufus-Lua which was created by John Mettraux. A big thanks goes out to John who spent tons of time extending Rufus-Lua with new features for us.
We are providing Laminate open source under the MIT license. You can check it over here:
http://github.com/scottpersinger/laminate/tree/master
The README up there has much more information, including installation instructions. Please note that I am still working with John to integrate into Rufus-Lua changes that Laminate relies upon. We should get everything worked out over the next day or two.
Finally, you can check out the Laminate wiki that we created to support our VodSpot product. It’s got a bunch more useful information and shows how we are using Laminate in a real product.
Today we’re launching a major upgrade to our VodSpot product. VodSpot lets you build your own dedicated video site very easily by feeding it the video collection you create at Vodpod.com. Until today, VodSpot was a “click to customize” type of deal. You chose from a preset set of templates, and then each template had configuration options for colors and fonts, plus some “modules” you could drop onto certain pages. The whole thing was very easy to use, but the result was pretty limited in terms of customization.
With today’s launch, we’re introducing a new templating system called Laminate. With this system, your VodSpot template contains 90% straight HTML and Javascript which you are free to change however you like. What this means is that VodSpots are truly open now to complete customization. You can change any page, any element of your site. You can add completely new pages. You can even import data from additional sources (like a second Vodpod collection) and show that data on your pages. You can integrate with secondary commenting systems, and so on. The Laminate templates are built on the Vodpod API 2.0 which we are also launching today. This is a major upgrade to the old Vodpod API. If you really want to host a site yourself, the new API makes it even easier to leverage videos you collect into your Vodpod account.
To go along with the new template system, we’re launching a set of new templates. These templates are standardized now on using jQuery, which makes it very easy to add new cool page interactions by dropping in new jQuery plugins.
When you go create a VodSpot now, there is a new “Template Editor” option in the VodSpot dashboard. This is a full-blown code editor right in your browser that lets you edit any of the files that make up the template for your site. We think the new system is pretty cool, and we’re excited to see what kinds of things people build with it.
We’ve got a bunch of live VodSpot sites already including:
http://video.techcrunch.com -> Techcrunch video site
http://tpmtv.talkingpointsmemo.com -> Videos for the talkingpointsmemo.com news site
Sign up for a VodSpot at: http://vodspot.tv
Learn more about the new template system over here: http://wiki.vodpod.com/
We are planning to launch an upgrade to the Vodpod API in September. The new API will be faster, more complete, and hopefully easier to use. Along with the new API we are going to start releasing client libraries, although the schedule for these is not quite set.
I want to get the word out now however, so that if anyone is building new stuff on the Vodpod API, you should consider trying to work with the new version. We will have the new version up for sandbox access pretty soon, so drop me a line (using “scott”) if you want to get early access.
Update: And here it is.
Ok, now I’m reasonably pissed off. We recently upgraded from Rails 2.0.2 all the way to Rails 2.3.2 – admittedly a pretty big jump. It took quite a bit of time to get everything working again.
And then today we noticed that the widgets serving from vodpod.com didn’t seem to be getting cached properly. I went and looked at the cache headers, and sure enough:
Cache-Control: max-age=600, private
Doh! Private? Yeah, that’s gonna kill the caching. Now, private is the default in Rails, so I knew we had code that overrode that setting. It looked like this:
expires_in 10.minutes, :private => false
Hmm..that used to work. So I go check the Rails 2.3.2 docs for “expires_in”:
expires_in 3.hours, :public => true
W! T! F! “public => true”! I go check the source, and sure enough, :private is totally ignored.
Just to be clear here, this is BAD BAD BAD BAD!! Reversing the option logic for a public API, just for fun. No deprecation warning, nuthin. This is bad form, is guaranteed to break people’s apps, and does it silently to boot. Whoever checked this change in should get banned from Rails commits for a release cycle.
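If you have a pile of old-style calls and want to keep them working, a small translation step is one option. The sketch below is an assumption-laden illustration: normalize_expiry_options is a hypothetical helper, and cache_control_header is a simplified model of the behavior described above (only :public is looked at), not the Rails source.

```ruby
# Hypothetical helper, not part of Rails: translate the old-style
# :private => false option into the newer :public => true form.
def normalize_expiry_options(options)
  options = options.dup
  if options.key?(:private)
    is_private = options.delete(:private)
    options[:public] = !is_private unless options.key?(:public)
  end
  options
end

# Simplified model of the resulting header: public only when :public is set,
# so a lone :private key has no effect at all.
def cache_control_header(seconds, options = {})
  visibility = options[:public] ? "public" : "private"
  "max-age=#{seconds}, #{visibility}"
end

old_style = { :private => false }

before = cache_control_header(600, old_style)
after  = cache_control_header(600, normalize_expiry_options(old_style))
puts before
puts after
```

In a real app you would pass the normalized hash on to expires_in rather than to a model function like this one.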
In our last exciting episode, we talked about getting Solr running with Rails. My specific requirement was to generate a list of videos related to another video.
Fortunately, Solr makes this super easy with its MoreLikeThis handler. Building on a class from the Lucene library, that handler makes it really easy to say, “given doc X in my index, return my other docs like that one”. It does this by building a query based on your subject doc. The query it builds is smart about obeying the tf/idf of each term to get you good results.
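To give a rough idea of what "smart about tf/idf" means, here is a toy version of the term selection step in plain Ruby. The tiny index, the interesting_terms name, and the particular idf formula (one common variant) are all assumptions for illustration; the real Lucene class has many more knobs.

```ruby
# Toy index: document id => list of terms (made-up data).
docs = {
  "a" => %w[skateboard trick video park],
  "b" => %w[cooking video pasta],
  "c" => %w[skateboard park video],
  "d" => %w[news video politics],
}

# Rank the subject doc's terms by tf * idf and keep the top few.
def interesting_terms(docs, subject_id, max_terms)
  # Document frequency: in how many docs does each term appear?
  doc_freq = Hash.new(0)
  docs.each_value { |terms| terms.uniq.each { |term| doc_freq[term] += 1 } }

  scores = docs[subject_id].tally.map do |term, tf|
    idf = 1 + Math.log(docs.size.to_f / (doc_freq[term] + 1))
    [term, tf * idf]
  end
  # Highest score first; ties broken alphabetically for a stable result.
  scores.sort_by { |term, score| [-score, term] }.first(max_terms).map(&:first)
end

terms = interesting_terms(docs, "a", 3)
puts terms.join(" OR ")
```

Notice that "video", which appears in every doc, drops out: terms that are rare across the index say more about what makes the subject doc distinctive.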
However, the MoreLikeThis handler doesn’t give you any control over the ordering of your results – you just get relevance order. I wanted to boost more recent videos over older ones, but I didn’t want to sort my results by date, because that potentially puts bad matches at the top. Instead I just wanted to boost more recent results so they generally appeared towards the top.
Now Solr includes a general search handler called the DisMaxRequestHandler which supports this cool bf argument, by which you can provide a boost function. Essentially the function gets evaluated for every document and its result is added to the relevance score of the doc to arrive at its overall score. This works really nicely, and there’s even an example in the documentation showing how to boost by recency.
Unfortunately, the MoreLikeThis handler doesn’t have that boost function support. So I decided to hack it in. Very simply, I just copied over the code from the DisMaxRequestHandler that handles the “bf” argument into the MoreLikeThisHandler. Under the covers, both handlers build a Lucene query to apply to your index, so the mod to MoreLikeThis just required building the query parts from the boost function and adding them to the “more like this” query. Then I just had to play around somewhat with my boost function to get results in the order I liked them.
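Here is a small Ruby sketch of why the "playing around" matters. It uses the reciprocal shape a / (m*x + b) that Solr's recip function computes, applied to a recency rank where 1 means newest. The candidate docs, their relevance scores, and the 0.5 weight are made-up numbers for illustration, not values from our index.

```ruby
# recip(x, m, a, b) = a / (m*x + b): close to a/b for small x, decaying as x grows.
def recip(x, m, a, b)
  a.to_f / (m * x + b)
end

# Hypothetical candidates: relevance score plus recency rank (1 = newest).
candidates = [
  { id: "old_great_match", relevance: 1.00, recency_rank: 5000 },
  { id: "new_good_match",  relevance: 0.90, recency_rank: 10 },
  { id: "new_poor_match",  relevance: 0.20, recency_rank: 1 },
]

BOOST_WEIGHT = 0.5 # assumed tuning knob; too high and recency swamps relevance

ranked = candidates.map do |doc|
  boost = BOOST_WEIGHT * recip(doc[:recency_rank], 1, 1000, 1000)
  # The boost is added to the relevance score, like an extra query clause.
  doc.merge(score: doc[:relevance] + boost)
end.sort_by { |doc| -doc[:score] }

ranked.each { |doc| puts format("%-16s %.3f", doc[:id], doc[:score]) }
```

With this weight the recent good match moves ahead of the older great match, while the recent poor match still stays at the bottom, which is the behavior sorting by date alone cannot give you.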
I don’t pretend to have made this mod in the cleanest or best way possible. But it is nice that Solr/Lucene allow this kind of hacking, even though it’s a lot harder than hacking something like Rails! I’ve attached my modified MoreLikeThisHandler.java file to this post just in case it comes in handy for anyone else.
Well at this point we’ve used just about all the common search solutions over at Vodpod.com.
We started with Ferret, which was simple to get going and worked pretty well, but doesn’t scale out to a cluster very easily.
Then we moved on to Sphinx using the Ultrasphinx plugin from Evan Weaver. This setup works great. Indexing is super fast, and the searchd daemon works from our cluster very well. It turns out to just be easier to do a full reindex pretty often (rather than trying to keep the index up-to-date), and Sphinx makes this easy. I recommend Sphinx very highly, and I guess the fact that they’re using it on Craigslist is probably a pretty good endorsement as well!
And if we only had site search, we would have left it at Sphinx. However, we also use search to generate our related videos list next to each video. We used Sphinx for this too, and it worked pretty well, but we wanted to boost videos by date to favor more recent stuff. Unfortunately, Sphinx doesn’t seem to support date boosting when doing an OR query, so this wasn’t possible for our related videos.
So recently we decided to implement Apache Solr to see if we could get better related videos. Having left Java behind when I moved to Rails a few years back, I was in no hurry to start running Java again! And indeed, where implementing Ferret and Sphinx were single afternoon projects, getting Solr running took more like a week! Ah well, such are the joys of Java world.
[Note: this post relates to our using Rails 2.0.2. The fix provided below may not work in other versions of Rails.]
We recently started using the Mysql Replication Adapter from the RapLeaf guys to enable us to off-load some queries to our mysql slaves.
The adapter is easy to install, configure and use, which is why we chose it over other possibilities.
However, today we found a little bug, and we traced it to the interaction of the Rails 2 query cache with the mysql_replication adapter. The short version is that when using the mysql_replication adapter, the query cache doesn’t get purged even if within your Rails action you perform an insert/update/delete operation. The normal adapter has the behavior that any operation to modify data clears the query cache. The bug we were seeing looked something like this:
items = user.items
# items is empty because user is new
user.items << Item.new(...)
...
items = user.items(true)
## ACK!! items is STILL empty. We should see that item we just created.
So even though we were passing force_reload=true to our association finder, the query was still hitting the query cache which had previously returned an empty set.
The mysterious alias_method_chain
Turns out that the problem is the result of different approaches to class extension used by Rails core versus the RapLeaf adapter. Rails relies heavily on the alias_method_chain approach. This approach wraps an existing method ‘A’ in a class: the original method is renamed, and a new method is installed under the name ‘A’. The new method is then responsible for invoking the old one. The result is that now when I call method ‘A’ on the class, it invokes the new method first, which then calls the old method.
Rails uses this technique to implement the query cache. In particular, it method-chains each of the “insert”, “update”, and “delete” methods on the MysqlAdapter class to call a method to purge the query cache.
Now the RapLeaf code uses what I consider a more “traditional” approach (or maybe you’d call it a Java-like approach) to class extension, which is to subclass the MysqlAdapter class. In the subclass, a number of methods are overridden. In particular, the same “insert”, “update”, and “delete” methods are overridden.
So what happens when you call MysqlReplicationAdapter.insert ? Turns out you just call the specialized version of the method, and the query-cache method chaining stuff in the parent class gets ignored. Basically the two versions of class extension aren’t cooperating.
Now the proper fix is probably to re-code the replication adapter to use method chaining instead of subclassing. But that’s not a trivial task! Fortunately, I was able to more easily fix the problem by just replacing the “insert/update/delete” methods of the replication adapter with these methods:
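# Internal insert: call ensure_master first, then run the stock insert_sql.
def insert_sql(sql, name = nil, pk = nil, id_value = nil, sequence_name = nil) #:nodoc:
  ensure_master
  super sql, name
  id_value || @connection.insert_id
end

# Internal update: same idea, returning the number of affected rows.
def update_sql(sql, name = nil) #:nodoc:
  ensure_master
  super
  @connection.affected_rows
end

# Deletes go through the same path as updates.
def delete_sql(sql, name = nil)
  update_sql(sql, name)
end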
These XX_sql methods are sort of the internal versions of the public-facing methods. Fortunately it’s the public-facing ones that query_cache method chains, and so these internal ones can be safely overridden.
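The whole interaction can be reproduced in a few lines of plain Ruby. Everything below is a toy: ToyAdapter and the two subclasses are made-up stand-ins for the Rails and RapLeaf classes, and the method chaining is done by hand with alias_method rather than with alias_method_chain itself.

```ruby
# Toy stand-in for the base adapter, with a query cache and chained insert.
class ToyAdapter
  attr_reader :cache, :log

  def initialize
    @cache = { "SELECT * FROM items" => [] } # a stale, previously cached result
    @log = []
  end

  # Public method: delegates to the internal *_sql version.
  def insert(sql)
    insert_sql(sql)
  end

  def insert_sql(sql)
    @log << "default: #{sql}"
  end

  def clear_query_cache
    @cache.clear
  end

  # Hand-rolled equivalent of chaining insert with a cache-clearing wrapper.
  def insert_with_query_dirty(sql)
    clear_query_cache
    insert_without_query_dirty(sql)
  end
  alias_method :insert_without_query_dirty, :insert
  alias_method :insert, :insert_with_query_dirty
end

# Overrides the PUBLIC method: the chained wrapper never runs.
class BrokenReplicationAdapter < ToyAdapter
  def insert(sql)
    @log << "master: #{sql}"
  end
end

# Overrides only the INTERNAL method: the wrapper still runs first.
class FixedReplicationAdapter < ToyAdapter
  def insert_sql(sql)
    @log << "master: #{sql}"
  end
end

broken = BrokenReplicationAdapter.new
broken.insert("INSERT INTO items VALUES (1)")
puts "broken adapter cache cleared? #{broken.cache.empty?}"

fixed = FixedReplicationAdapter.new
fixed.insert("INSERT INTO items VALUES (1)")
puts "fixed adapter cache cleared? #{fixed.cache.empty?}"
```

Both subclasses send the write where they want it, but only the one that leaves the public method alone keeps the cache-clearing behavior that was chained onto the parent.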
