Heroku no longer using a global request queue
February 17 2011
I recently moved a larger Rails app over to Heroku and have been overjoyed with the ease of use and performance. All is going great except for one problem: occasional H12 timeout errors on requests that normally take less than 200ms. An H12 error on Heroku means that a dyno did not return a response to the request within 30 seconds. The app was not overloaded and had plenty of spare dynos available to serve these requests. After a bit of testing, I found out why I was seeing these errors.
Contrary to Heroku documentation 1, Heroku is no longer using a setup that waits until a dyno is available before handing a request off to it. Their current setup immediately routes a request from the mesh layer to a dyno in either a random or round-robin fashion. As a result, if a dyno is handling a long-running request, short requests can get queued behind it and be seriously delayed or ultimately time out. This behavior is confirmed by Heroku:
If you have 2 dynos, and 1 is running “forever”, then 50% of your requests to your app will timeout. This is expected behavior on Heroku today. 2
You’re correct, the routing mesh does not behave in quite the way described by the docs. We’re working on evolving away from the global backlog concept in order to provide better support for different concurrency models, and the docs are no longer accurate. The current behavior is not ideal, but we’re on our way to a new model which we’ll document fully once it’s done. 3
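To make the difference concrete, here is a minimal Ruby sketch that compares the two routing models on the two-dyno case from the quote above. It is a toy simulation, not Heroku's routing code: the `simulate` method, the `:round_robin` and `:global_queue` strategy names, and the request timings are all assumptions made up for illustration.

```ruby
# Toy model: each request is [arrival_time, duration] in seconds.
# Returns how long each request waits before a dyno starts working on it.
def simulate(requests, dyno_count, strategy)
  free_at = Array.new(dyno_count, 0.0) # time at which each dyno becomes idle
  requests.each_with_index.map do |(arrival, duration), i|
    dyno = case strategy
           when :round_robin  then i % dyno_count # assigned on arrival
           when :global_queue then free_at.each_index.min_by { |d| free_at[d] }
           else raise ArgumentError, "unknown strategy: #{strategy}"
           end
    start = [arrival, free_at[dyno]].max # queue behind whatever the dyno is doing
    free_at[dyno] = start + duration
    start - arrival
  end
end

# One 30 second request at t=0, then a 200ms request every second.
requests = [[0.0, 30.0]] + (1..10).map { |t| [t.to_f, 0.2] }

puts "round robin  max wait: #{simulate(requests, 2, :round_robin).max}s"
puts "global queue max wait: #{simulate(requests, 2, :global_queue).max}s"
```

With round-robin assignment, every other short request lands behind the slow one, while in the global-queue model the idle dyno picks them all up without waiting.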
The admin section of the app I recently moved over to Heroku is used daily by 20 or so employees. Their workflow has them making a few longer-running requests to the app for report generation, sending emails, and file uploads. Most of these requests don’t take longer than 5-10 seconds, and that was never a problem, but now it is. If the app has 5 dynos and one request takes 15 seconds, in the first second 20% of the requests to the app will have a 15 second delay. The next second, 20% of the app’s requests will have a 14 second delay, and so on. The other 4 dynos may be available, but that one dyno will have a large and growing backlog. A simple request to the front page of the site that should take ~200ms could take over 15s.
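The arithmetic in that example can be spelled out with a few lines of Ruby. This is only a sketch under the assumption that requests are spread evenly across dynos; `delay_schedule` is a name I made up, and the delays it reports are lower bounds, since it ignores the backlog that builds up behind the slow request.

```ruby
# For each whole second after a slow request starts, report the share of
# incoming requests routed to the busy dyno and the least time they wait.
def delay_schedule(dyno_count, slow_seconds)
  share = 1.0 / dyno_count # assumes an even split across dynos
  (0...slow_seconds).map do |elapsed|
    { elapsed: elapsed, share: share, min_delay: slow_seconds - elapsed }
  end
end

delay_schedule(5, 15).first(3).each do |row|
  puts format("t=%2ds: %.0f%% of requests wait at least %2ds",
              row[:elapsed], row[:share] * 100, row[:min_delay])
end
```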
At this point I have 3 options if I want to remain on Heroku:
1. Optimize these report generators to the point they all take less than 1s (easier said than done).
2. Have the request send the report to Delayed::Job, which saves the report output to S3 (introduces more lag for the employee).
3. Duplicate the app on Heroku and send all admin requests to this second app that the public never hits.
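The Delayed::Job option boils down to moving the slow work out of the request cycle. Below is a rough, self-contained sketch of that pattern. The `ReportJob` struct, the in-memory `queue`, and the `storage` hash are hypothetical stand-ins for a real job class, the delayed_jobs table, and an S3 bucket, so this runs without any gems and is not the app's actual code.

```ruby
# Hypothetical job object. Delayed::Job can run any object that responds to
# #perform, so a similar struct could be passed to Delayed::Job.enqueue.
ReportJob = Struct.new(:report_id, :storage) do
  def perform
    # Stand-in for the slow report generation and the upload to S3.
    storage["reports/#{report_id}.csv"] = "report #{report_id} contents"
  end
end

queue = []   # stand-in for the delayed_jobs table
storage = {} # stand-in for an S3 bucket

# Web request: enqueue the job and return right away, keeping the dyno free.
queue << ReportJob.new(42, storage)

# Worker process: drain the queue outside the request cycle.
queue.shift.perform until queue.empty?

puts storage.keys.inspect
```

The trade-off is the one noted in the list: the employee has to wait for a worker to pick up the job and then fetch the finished report, instead of getting it in the response.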
Heroku is a great service and the purpose of this post is not to speak badly of them, but to highlight the current backlog queue situation and offer an explanation to anyone else researching the same strange behavior. I do hope it encourages Heroku to update their documentation and focus on getting a new backlog queue in place.
Update: Heroku has updated their documentation to mostly reflect how their backlog works. It still mentions the mesh holding a request until a dyno becomes available.