Background Processing in Rails
Probably the most important thing I’ve learned about using Mongrel is don’t be slow! Actions that take a long time (say, more than 5 seconds) will kill throughput, since every other Rails action sent to that Mongrel process is queued up behind the slow one.
So the general advice is to perform long-running tasks “in the background”. Ok, fine, we’ve done lots of that. But sometimes you have a task that essentially needs to be synchronous for your user, even though it takes a long time. In our case, whenever someone uploads a video to Vodpod, we want them to actually wait while we process the video so they can choose their favorite thumbnail. Now, many people suggest using BackgrounDRb for this, but that thing seems like overkill. It creates something like 3 daemon processes and requires DRb for communication. I wanted something simpler that would just use the db for communication.
So what we implemented is what I call “pseudo-synchronous” tasks. The basic flow is pretty easy:
- User makes the initial request
- Server creates a “background job” and stuffs it in the db
- Request returns
- User goes to a “progress” page, where a periodic Ajax call checks progress
- Server checks the progress of the background job in the db
- When the job is done, the page makes an Ajax call to show the completed data
So we run the job asynchronously to the Mongrel request processing, but use Ajax to indicate progress to the user.
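Here is a rough sketch of that flow in plain Ruby, just to make the moving parts concrete. Everything in it is made up for illustration: JobTable is an in-memory stand-in for the jobs table in the db, a thread stands in for whatever runs the job, and poll_until_complete plays the part of the periodic Ajax call. It is not our production code.

```ruby
# Hypothetical in-memory stand-in for the background jobs table in the db.
class JobTable
  def initialize
    @rows = {}
    @lock = Mutex.new
    @next_id = 0
  end

  # "Server creates a background job and stuffs it in the db."
  def create
    @lock.synchronize do
      @next_id += 1
      @rows[@next_id] = { status: "new", message: nil }
      @next_id
    end
  end

  def update(id, status, message = nil)
    @lock.synchronize { @rows[id] = { status: status, message: message } }
  end

  # "Server checks progress of the background job in the db."
  def find(id)
    @lock.synchronize { @rows.fetch(id).dup }
  end
end

# Stand-in for the periodic Ajax call: ask for the status until it is done.
def poll_until_complete(table, id, interval: 0.01, timeout: 2.0)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    job = table.find(id)
    return job if job[:status] == "complete"
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) > deadline
      raise "job #{id} did not finish in time"
    end
    sleep interval
  end
end

table = JobTable.new
job_id = table.create              # the initial request returns right after this

worker = Thread.new do             # the slow work happens outside the request
  table.update(job_id, "running")
  sleep 0.05                       # pretend to process a video
  table.update(job_id, "complete", "thumbnails ready")
end

result = poll_until_complete(table, job_id)
worker.join
```

The only thing the "request" side and the "worker" side share is the table, which is the whole point of using the db for communication.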
Now here’s the trick. Rather than having a separate process to run the background job, we use fork to clone our mongrel process and have the child process run the background task. The advantages of forking include:
- fast – fork happens very quickly at the OS level. There’s no app startup time
- easy – All our current state is preserved, so we don’t need to pass arguments to some script
- local – we know the child process runs on the same machine, so if we need access to some local resource, like a file, we know it will be there. In a clustered environment it can be tricky to make sure that background processing has access to particular resources
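The “easy” point can be seen in a tiny sketch, assuming a Unix-like system where fork is available. The method name child_view_of and the sample hash are made up for illustration. The child starts with a copy of the parent’s in-memory state, and whatever it changes stays in its own copy.

```ruby
# Fork a child, let it modify its copy of `state`, and send that copy back
# to the parent over a pipe so we can compare the two.
def child_view_of(state)
  reader, writer = IO.pipe
  pid = fork do
    reader.close
    state[:seen_by] = "child"            # changes the child's copy only
    writer.write(Marshal.dump(state))
    writer.close
    exit!(0)                             # leave without running at_exit handlers
  end
  writer.close
  child_state = Marshal.load(reader.read)
  reader.close
  Process.wait(pid)
  child_state
end

parent_state = { video_id: 42, seen_by: "parent" }
child_state = child_view_of(parent_state)
```

After this runs, child_state still carries the video_id the parent set up, while parent_state is unchanged by the child’s write.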
There’s one big disadvantage to using fork – the child process basically wrecks our ActiveRecord database connection. AR stores the database connection in a static variable, and so the child process re-uses that connection. This causes problems since AR is not designed to have multiple processes using the same connection.
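A database connection is hard to demo without a database, so here is an analogy with a plain file, assuming a Unix-like system. The method name read_from_parent_and_child is made up for illustration. After fork, parent and child share the same underlying open file, including its read position, so a read in the child moves the parent’s position too. A shared db socket misbehaves for the same basic reason: two processes are consuming one stream.

```ruby
require "tempfile"

# Open a file, fork, read 5 bytes in the child, then read 5 bytes in the parent
# from the very same (inherited) file handle.
def read_from_parent_and_child(text)
  Tempfile.create("shared") do |file|
    file.write(text)
    file.flush
    file.rewind
    reader, writer = IO.pipe
    pid = fork do
      reader.close
      writer.write(file.sysread(5))   # unbuffered read on the inherited handle
      writer.close
      exit!(0)
    end
    writer.close
    child_chunk = reader.read
    reader.close
    Process.wait(pid)
    parent_chunk = file.sysread(5)    # continues where the child left off
    [child_chunk, parent_chunk]
  end
end

child_chunk, parent_chunk = read_from_parent_and_child("hello world")
```

The parent never read the first five bytes itself, yet its next read starts after them.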
To get around the ActiveRecord problem, we have to have the child process create its own db connection, and we have to have the parent process close and re-open its connection. Altogether the code for forking the child and managing the connections looks like this:
class BackgroundJob < ActiveRecord::Base
  # Spawn a new background process to execute this background job immediately. We have
  # to muck with re-creating our ActiveRecord connections because AR doesn't normally survive fork.
  # I wonder what else craps out...
  def spawn
    self.reload
    if self.status == nil || self.status == STATUS_NEW
      dbconfig = ActiveRecord::Base.remove_connection
      pid = fork do
        begin
          # Monkey-patch Mongrel to not remove its pid file in the child
          require 'mongrel'
          Mongrel::Configurator.class_eval("def remove_pid_file; puts 'child no-op'; end")
          ActiveRecord::Base.establish_connection(dbconfig)
          run
        ensure
          ActiveRecord::Base.remove_connection
        end
      end
      Process.detach(pid)
      ActiveRecord::Base.establish_connection(dbconfig)
    end
  end
end
I’ve coded the fork call onto the BackgroundJob model class. This makes it super easy to create the background job and run it. Now my initial action looks like this:
background_job = BackgroundJob.create(Video, 'static_process_file', @video.id, ftp_file_name, current_user.id)
background_job.spawn
The method that the child will call is the “static_process_file” method on the Video model class. Note that I don’t actually have to use a static method at all; I could just as well pass in a Proc, or an object and a method to run. This makes it really easy to take some long-running code you’ve got and split it off into the background process.
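To illustrate the “class and method, or Proc, or object and method” idea, here is a minimal sketch of the dispatch on its own, with no database and no fork. JobSketch and VideoSketch are hypothetical names, not the real BackgroundJob or Video classes, and the “running” and “error” statuses are assumptions; only “new” and “complete” appear in the code above.

```ruby
# Hypothetical stand-in for the dispatch part of a background job.
class JobSketch
  attr_reader :status, :message

  # target can be a class, a Proc, or any object; method_name is what to call on it.
  def initialize(target, method_name = :call, *args)
    @target = target
    @method_name = method_name
    @args = args
    @status = "new"
    @message = nil
  end

  # Call the target and record the outcome where a status check could find it.
  def run
    @status = "running"
    @message = @target.public_send(@method_name, *@args).to_s
    @status = "complete"
  rescue StandardError => e
    @status = "error"
    @message = e.message
  end
end

# Hypothetical model with a class-level ("static") method.
class VideoSketch
  def self.static_process_file(video_id, file_name, user_id)
    "processed #{file_name} for video #{video_id} (user #{user_id})"
  end
end

class_job = JobSketch.new(VideoSketch, :static_process_file, 42, "clip.mov", 7)
class_job.run

proc_job = JobSketch.new(proc { |n| n * 2 }, :call, 21)
proc_job.run
```

Because fork keeps the parent’s state, the child can run a Proc that only exists in memory, which is something a separate worker process reading jobs from the db could not do as easily.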
Now my Ajax-called action is easy:
def get_job_status
  job = BackgroundJob.find(params[:id])
  render :text => {:status => job.status, :message => job.message}.to_json
end
When I get job.status == “complete” then I have another Ajax call to retrieve the results of the background job (in my case a set of thumbnails extracted from a video).
I’m not 100% comfortable with the fork due to the problems with the db connection. I’m not sure I would want to run that code really frequently. In my case it only runs perhaps a hundred times per day. Nonetheless, we have this code running in production and I haven’t seen any problems. If anyone else has a better work-around for the db connection issue I’d love to hear it.
Update
Thanks to some awesome comments, I’ve updated the code to fix two problems. Tom suggested using “Process.detach” to prevent the child process from hanging around as a zombie. I’ve also added a bit of code to monkeypatch Mongrel so that the child process doesn’t remove the parent’s PID file. Obviously you want to remove this code if you’re not using Mongrel.
And an even bigger bonus, Tom created a whole plugin to handle the process forking. So check it out!
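For anyone curious what Process.detach actually does, here is a small standalone sketch, assuming a Unix-like system. The method name run_detached is made up for illustration. Process.detach starts a thread that waits on the child, so the child gets reaped when it exits instead of lingering as a zombie, and the thread’s value is the child’s exit status.

```ruby
# Fork a child that exits with the given code, detach it, and report
# the exit status collected by the watcher thread.
def run_detached(exit_code)
  pid = fork { exit!(exit_code) }
  watcher = Process.detach(pid)   # thread that waits on the child for us
  watcher.value.exitstatus        # Thread#value joins, then returns Process::Status
end

status = run_detached(3)
```

In the spawn method above we never look at the watcher thread; we only care that somebody waits on the child.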
Update 2
Ruby-god Tom Anderson has a tricky “exit!” call at the end of his fork handler (in the child). This call ends the process without invoking at_exit handlers, which is what Mongrel uses to remove its PID file. This is probably safer than monkeypatching Mongrel as I’ve done. Not sure if there might be other at_exit handlers you would *want* to run, but given how the child process just has copies of resources from the parent, probably avoiding all handlers is a good idea.
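The difference between exit and exit! is easy to check in isolation. This is only a sketch, assuming a Unix-like system; the method name at_exit_ran_in_child? is made up, and the at_exit block stands in for something like Mongrel’s PID-file cleanup.

```ruby
# Fork a child that registers an at_exit handler and then leaves either
# with exit (handlers run) or exit! (handlers are skipped).
# Returns true if the handler ran in the child.
def at_exit_ran_in_child?(hard_exit)
  reader, writer = IO.pipe
  pid = fork do
    reader.close
    at_exit do
      writer.write("handler ran")
      writer.close
    end
    hard_exit ? exit!(0) : exit(0)
  end
  writer.close
  output = reader.read            # empty string if the child wrote nothing
  reader.close
  Process.wait(pid)
  output == "handler ran"
end

ran_with_exit = at_exit_ran_in_child?(false)
ran_with_exit_bang = at_exit_ran_in_child?(true)
```

With a plain exit the handler fires in the child; with exit! it never does, which is why the parent’s PID file survives.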
How does this compare with setting config.active_record.allow_concurrency = true in environment.rb? This was mentioned in http://wiki.rubyonrails.org/rails/pages/HowToRunBackgroundJobsInRails
Is setting allow_concurrency true a substitute for messing about with connections or is it a precursor?
I have also seen some old comments that setting allow_concurrency true can lead to a leak of database connections: http://blog.moertel.com/articles/2006/08/24/database-connection-leak-in-typo-4-0-3-problem-solved
So far the recommendations seem to conflict, so in brief: did you set allow_concurrency or not? Have you had this running long enough to see if there is a leak of db connections?
Yep, as you say, it’s a bit of a kludge, but it seems to be the best we have at the moment. I’ve seen a similar solution presented here: http://blog.ardes.com/articles/2006/12/11/testing-concurrency-in-rails-using-fork
One thing I’ve found necessary is adding a trap directive just prior to the fork, to ensure that we don’t get hundreds of defunct sub-processes that never completely terminate (and prevent ‘cap restart’ among other things):
Signal.trap("CLD") { Process.wait2 }
Ed – I haven’t tried running the fork code with allow_concurrency. I’m definitely afraid to set that flag for our normal rails processes.
I have in the past run some daemon processes (using Rails Cron), and had to set that flag, but I have the distinct impression that that flag is not safe for your normal rails processes.
Joseph – thanks for the tip. Yeah, we’ve had some zombie mongrels. I’ll add your trap and update our code.
I’m glad I found this post. I think it will solve my problem of sending multiple emails on a request without having to resort to more complicated solutions like ap4r. I really wish rails would build something like this in.
One thing that bugs me is that you are removing and establishing the connection for the main process as well as the child. Ideally you’d like to just reconnect in the child process and let the parent have the original connection without disconnecting. I tried many things to get this to work but was unable to do so with the interfaces exposed by ActiveRecord::Base. Your method seems to be the most reliable way so far.
Also, have you considered defining your spawn method to do a yield instead of run then adding it to application.rb? I’m thinking along the lines of doing this in my controllers:
I’ll give that a try and see how it works. In any case, thanks for the fork ideas.
Oh, one more thing. I think it’s a good idea if you detach the main process from the child too, like this:
pid = fork do
…
end
Process.detach(pid)
The fork thing is working so well I decided to make a plugin out of it. You can download it from my blog if you want to try it.
My immediate situation is that I have a Rails app which offers one really long-running operation (6 or more hours!). At present, while this is running, everything is blocked.
As a quick and dirty hack, could I use e.g. mongrel clustering? Say, set up a cluster of 2 or more mongrel processes to run the same app. If a long-running operation is triggered then one mongrel process will be monopolised, but the others will be available to service requests.
Hmmm, round robin would derail this. 1 request in n would be sent to the busy process. Ho humm ..
Background processing is working great for me, thanks for the post!
There is just one minor problem. My mongrel pid files are getting deleted by my forked processes. Is there a way for a forked process to exit without removing the pid files of the parent process? Currently, I have to manually kill mongrels to stop my cluster.
Dave, you might want to try the spawn plugin which doesn’t call the at_exit handlers (where mongrel removes the pid files).
http://rubyforge.org/projects/spawn/
My experience on several high-traffic, medium sized sites has been that it can be very useful to have a server or two that are purely for asynchronous tasks.
With Rails, you’d probably want multiple Mongrel processes each consuming from your queue of asynchronous jobs.
Once you have servers set up like this, it’s amazing how much processing you can defer. Even some types of denormalization can be done asynchronously. You also can then throttle the amount you process depending on things like your database load. And slow web service calls are a natural fit here as well.
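As a very rough sketch of the “several workers consuming from one queue” idea: the version below uses threads and Ruby’s in-process Queue purely as stand-ins for separate Mongrel processes and a shared job queue. The method name drain_jobs and the :stop marker are made up for illustration.

```ruby
# Run every job (anything that responds to call) on a small pool of workers
# and collect the results. Result order depends on scheduling.
def drain_jobs(jobs, worker_count: 2)
  queue = Queue.new
  jobs.each { |job| queue << job }
  worker_count.times { queue << :stop }   # one stop marker per worker

  results = Queue.new
  workers = Array.new(worker_count) do
    Thread.new do
      while (job = queue.pop) != :stop
        results << job.call
      end
    end
  end
  workers.each(&:join)

  Array.new(results.size) { results.pop }
end

squares = drain_jobs([-> { 1 * 1 }, -> { 2 * 2 }, -> { 3 * 3 }])
```

Throttling based on database load would amount to changing worker_count, or pausing the workers, from outside.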
you might want to check out workling for very lightweight background processing in rails: http://playtype.net/past/2008/2/6/starling_and_asynchrous_tasks_in_ruby_on_rails/. currently, you can easily switch from spawn to starling. i’ve heard of people using a sparrow backend, too. you can even run the code inline, which is nice for debugging.