This week's topic is Django, the Python Web framework, and how to scale it. To deliver the goods, Jonathan Freeman interviewed two experts at the ticketing service Eventbrite with extensive experience putting Django to work: Simon Willison, Director of Architecture, and John Shuping, Principal Systems Engineer. Shuping has been at Eventbrite for years, helping it grow and transform, while Willison joined the team last year when Lanyrd, a company he co-founded, was acquired. If you're interested in learning more, be sure to check out the Lanyrd blog and Eventbrite's engineering blog.
Jonathan Freeman: How are you using Django?
Simon Willison: We have two code bases we can talk about. We have the Lanyrd code base and the Eventbrite code base. As you know, Lanyrd was acquired by Eventbrite just over seven months ago. It's very interesting to contrast the two. Lanyrd is a greenfield development: We started out with Django three and a half years ago from a blank slate. Eventbrite's code base is much older, seven years old, and Eventbrite adopted Django three years ago, but there's still some code at Eventbrite that is in a sort of custom Python framework that predates Django.
They operate at very different scales. Eventbrite sold over $1 billion worth of tickets last year, so there's a huge amount of traffic and cash transactions going through that stack. Lanyrd is much smaller and doesn't have to deal with payment operations, so there's more flexibility. We can take risks with the code base because we're not dealing with people's cash transactions.
Freeman: "Scale" can mean a lot of things. What do you mean by "scale"?
Willison: There's more than one thing you need to scale. There's scaling up the amount of traffic that your site's handling, which is a very straightforward definition. The more hits per second you can handle, the better. There's also scaling the complexity of your code base. As code bases get larger and more complex, there are tricks you use to manage that complexity. Finally, there's scaling your engineering team. The Lanyrd team was six people, the Eventbrite engineering team is well over 80 now. There are things that you have to do differently when you have that many engineers working on the same code base.
Freeman: What do you see as being most important when scaling for traffic?
Willison: When scaling for traffic, the big thing -- this is something you get from Django and PHP and most Web frameworks these days -- is the concept of the shared nothing architecture. The application servers are dumb boxes that have databases and caches in the background that they're talking to, but fundamentally you can handle more traffic at the application layer by deploying more application servers. That moves the scaling challenges to the database and to the caching layer.
Scaling databases is always difficult. At Eventbrite, we make very heavy use of MySQL and MySQL replication. I'm not sure how many slaves we're running now?
John Shuping: For our core primary database, it's two masters and 10 slaves. There's no ability built into Django to route requests to a particular database for writes versus reads. One thing we've done at Eventbrite is instrument Django in a way that we can route inserts and updates to the masters and selects to the slaves.
Willison: The Django ORM has a few low-level hooks for letting you switch to a different database connection, but out of the box it won't handle sending writes to one database and reads to another. We've written custom code at Eventbrite for that.
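Django's database-router interface provides the low-level hooks Willison mentions. A minimal sketch of a read/write splitter might look like the following; the alias names (`master`, `slave1`..`slave3`) are illustrative stand-ins for entries in the `DATABASES` setting, not Eventbrite's actual configuration:

```python
import random

class MasterSlaveRouter:
    """Minimal sketch of a Django database router: writes go to the
    master, reads are spread across replicas. Alias names are
    hypothetical and must match keys in settings.DATABASES."""

    SLAVES = ["slave1", "slave2", "slave3"]

    def db_for_read(self, model, **hints):
        # SELECTs go to a randomly chosen replica.
        return random.choice(self.SLAVES)

    def db_for_write(self, model, **hints):
        # INSERT/UPDATE/DELETE always go to the master.
        return "master"

    def allow_relation(self, obj1, obj2, **hints):
        # All aliases point at the same logical database.
        return True
```

Listing the class in Django's `DATABASE_ROUTERS` setting makes the ORM consult it for every query; pinning logic like Eventbrite's would layer on top of `db_for_read`.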
Another trick is we have separate slaves for things like long-running reporting queries. Expensive SQL queries aren't running on anything serving production traffic.
Freeman: What about database problems in production that required immediate changes?
Shuping: One really interesting thing came up probably three years ago. As you are buying a ticket to an event, it writes to the database and your subsequent page may want to read that info from the database to generate your confirmation email. We noticed that if a slave lagged by a half-second behind a master, you could end up writing to a master and reading from a slave shortly thereafter and it not actually being there.
So we devised two layers of protection around this database slave lag. The first is within Django; it's what we call DB pinning. Basically, it means if your code writes to the master, then any subsequent reads that it does for say, two seconds, are going to go to the master.
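The pinning rule Shuping describes can be sketched in a few lines of plain Python. Here a dict stands in for the memcache cluster, and the function and variable names, like the two-second window, are illustrative:

```python
import time

PIN_SECONDS = 2  # window during which reads follow a write to the master

class PinningCache:
    """Stand-in for memcache: maps a user/guest token to a pin expiry."""
    def __init__(self):
        self._store = {}

    def pin(self, token):
        self._store[token] = time.time() + PIN_SECONDS

    def is_pinned(self, token):
        return self._store.get(token, 0) > time.time()

def choose_connection(cache, token, is_write):
    """Route a query: writes pin the user to the master, and reads
    within the pin window also hit the master to avoid stale replicas."""
    if is_write:
        cache.pin(token)
        return "master"
    return "master" if cache.is_pinned(token) else "slave"
```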
We also use a set of HAProxy load balancers in front of our database slaves that the Django config is actually pointing to. The load balancers are looking at all the slaves and doing a real-time health check on them. If it detects that one of the slaves has more than a two-second lag, we take it out of the pool of available slaves and it doesn't serve any traffic until it catches back up.
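The health check itself boils down to reading replication lag and comparing it to the threshold. A sketch of that decision, assuming the check consumes a row from MySQL's `SHOW SLAVE STATUS` (represented here as a dict):

```python
MAX_LAG_SECONDS = 2  # matches the two-second threshold described above

def slave_is_healthy(status_row):
    """Decide whether a replica may serve traffic. In MySQL's
    SHOW SLAVE STATUS output, Seconds_Behind_Master is NULL (None)
    when replication is broken, so that case must fail the check too."""
    lag = status_row.get("Seconds_Behind_Master")
    return lag is not None and lag <= MAX_LAG_SECONDS
```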
Freeman: With DB pinning, are you storing in the session? It sounds sort of like sticky sessions.
Shuping: We definitely do not do sticky sessions, which, to Simon's point, makes it really easy to scale horizontally. But you're right, for DB pinning, we use memcache. We have a cluster of four memcache servers that we're consistently hashing across, storing DB pinning tokens, keyed by your guest cookie or whatever it may be, in memcache.
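Consistent hashing is what lets a memcache cluster like that lose or gain a node without remapping most keys. A self-contained sketch of the idea, with hypothetical node names; production clients such as pylibmc implement this (ketama) for you:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Map keys to cache nodes via a hash ring with virtual nodes,
    so adding or removing a server only moves a small slice of keys."""

    def __init__(self, nodes, replicas=100):
        self.ring = []       # sorted virtual-node hashes
        self.node_for = {}   # virtual-node hash -> real node
        for node in nodes:
            for i in range(replicas):
                h = self._hash(f"{node}:{i}")
                bisect.insort(self.ring, h)
                self.node_for[h] = node

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_node(self, key):
        # Walk clockwise to the first virtual node at or after the key.
        h = self._hash(key)
        idx = bisect.bisect(self.ring, h) % len(self.ring)
        return self.node_for[self.ring[idx]]
```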
Freeman: On the Lanyrd side, you switched from MySQL to PostgreSQL, right?
Willison: Yes, that's right. We made that switch about a year ago for a few reasons. People have written huge amounts of stuff about MySQL versus PostgreSQL and so on, but the one killer feature we cared about is that in PostgreSQL you can add a new column to the table without locking up the whole table. You can't do this in MySQL. We were getting to the point in Lanyrd that some of our larger tables were large enough that it became painful adding new columns to those tables. The big benefit we got from PostgreSQL is that, having moved over, it was much easier to make modifications to our database tables.
[Note: Lanyrd's transition from MySQL to PostgreSQL was done in two hours with the site up, but in a read-only mode.]
Willison: Eventbrite runs on MySQL and uses a technique called pt-online-schema-change to add new columns to MySQL tables. This means you can modify your tables at runtime without any downtime to the site. But there are features in MySQL you can't use, such as foreign key constraints at the database level because those aren't compatible with the way we do replication and the way we do online schema changes.
Shuping: To jump ahead to your point about maintenance, we have an operational goal of not having any planned maintenance or downtime or read-only mode; pt-online-schema-change is one key thing that contributes to us being able to do that.
Willison: This is another difference between Lanyrd and Eventbrite. At Lanyrd we can throw the site into read-only mode for an hour and it doesn't cause problems, whereas Eventbrite sells tickets. An hour of not being able to sell tickets to people is not good for us and not good for the event organizers that rely on our platform.
Shuping: One random, supertechnical anecdote you guys reminded me of: I think Django was historically more oriented toward PostgreSQL. We learned we needed to know how the replication is done: Is it row-based replication or statement-based replication? PostgreSQL operates in a row-based mode by default, and while investigating those slave-lag issues, we found that MySQL defaults to statement-based replication when in fact Django prefers row-based. By switching MySQL to row-based replication, we minimized the lag issues.
Freeman: What else helps you scale?
Willison: I'd say one of the most crucial techniques, and the technique that certainly gives you the most payback for time invested, is to have a good feature-flag system. We built our own at Lanyrd, and Eventbrite uses an open source feature flag system (Gargoyle). This gives you an enormous amount of power and flexibility in managing a complex site.
Feature flags essentially are a way of saying features can be turned on or off across the whole site, or they can be turned on for specific users. You can say our internal users can access this feature at the moment while we're testing it, or our external testing group, or turn a feature on for these couple of large-scale events. We use feature flags at both Lanyrd and Eventbrite, and we use them every day. Almost every piece of code that we write ends up hidden behind a feature flag at one point or another.
It's fantastic for scaling your development process as well. It means that rather than your developers working in month-long feature branches and having a complete nightmare when they try and merge that into master, feature branches tend to exist up until the point at which a feature flag is in place. Then they can be merged back into the main code base. Your code is constantly being exercised, and it's visible to all the other engineers in the company.
It also means that launches are much less stressful. Without flags, launching a big new feature means a big deploy to turn it on, with all the normal worries around making a major code deploy. With feature flags, you deploy your code the night or the week before. Then on launch day, all you have to do is flick the feature on, and it becomes available to your audience. It also means that rolling back is much easier because you can turn the feature off; you don't have to do a major rollback procedure if something goes wrong.
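A feature-flag system really can be that small. The sketch below captures the core checks Willison describes: site-wide toggles plus per-user enables. It is an in-process stand-in, not Gargoyle's API; Gargoyle adds persistence, an admin UI, and richer targeting conditions on top of this idea:

```python
class FeatureFlags:
    """Minimal in-process feature-flag store: a flag is active either
    globally or for an explicit set of users."""

    def __init__(self):
        self._flags = {}  # name -> {"global": bool, "users": set}

    def enable(self, name, user=None):
        flag = self._flags.setdefault(name, {"global": False, "users": set()})
        if user is None:
            flag["global"] = True   # turn on for the whole site
        else:
            flag["users"].add(user)  # turn on for one user/group member

    def disable(self, name):
        # Rollback is just flipping the flag off -- no redeploy needed.
        self._flags.pop(name, None)

    def is_active(self, name, user=None):
        flag = self._flags.get(name)
        if flag is None:
            return False
        return flag["global"] or user in flag["users"]
```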
Shuping: We've added something on top of that called the "experiments framework." You can enable a feature flag for some percentage of users, so we've instrumented this in a way that we can conduct A/B tests behind feature flags. If we have a new feature and want to test it against an old feature, we enable this flag in such a way that it does an A/B test for us and allows us to go and see the result later.
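A common way to put a percentage of users behind a flag deterministically is to hash the user id together with the experiment name, so each user lands in the same variant on every request without storing any state. This is a generic sketch of that bucketing technique, not Eventbrite's experiments framework:

```python
import hashlib

def bucket(experiment, user_id, percent_in_b):
    """Assign a user to variant 'A' or 'B'. Hashing experiment and
    user together keeps assignment stable per experiment while giving
    the same user independent assignments across experiments."""
    h = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    return "B" if int(h, 16) % 100 < percent_in_b else "A"
```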
Freeman: You can then adjust the feature or roll it out to more users, based on the results.
Willison: From talking to people who run large-scale websites, this technique is increasingly common. If someone's not running feature flags on their project, it's one of the first things I suggest they do. It makes such a huge difference to your productivity, your teamwork, and your level of confidence that you're able to control the code you're rolling out.
Freeman: For people who are thinking about building this into their process, what sort of investment does it take?
Willison: It can be really simple. At Lanyrd we had a custom implementation; the initial version took half a day to build, and it benefited us every day from then onward. For Eventbrite, I don't know how long it took us to implement the first version, but again, it was using that open source package. It can be a case of dropping in Gargoyle and starting to use it, especially if you have a standard Django setup.
Freeman: That's great. That touches on both how to scale your code base and your engineering team. Are there any other tools or techniques that come to mind?
Willison: A topic you mentioned earlier is Celery. Both Lanyrd and Eventbrite use Celery. I think some kind of offline processing should be part of the default stack for any website that you build these days. The moment something takes more than a few seconds to run, it should be running asynchronously.
Shuping: I could ramble about Celery for the whole 30 minutes. A lot of my time at Eventbrite has been making Celery awesome. It really helps get the job done while keeping the actual page response times fast for the end-user. We use RabbitMQ as our message broker, and we use multiple clusters of it. On each Web server, we have a local HAProxy running that routes tasks across those multiple RabbitMQ clusters onward to Celery servers.
Freeman: Can you give me an example or two of longer-running tasks you guys see frequently?
Shuping: Your order confirmation email. When you buy a ticket, you don't actually have to wait while the Web server generates the PDF of your ticket. We farm that off to Celery, and you get the page that says, "Here's your order confirmation number." By that time, your email has already come in because Celery's done it in the background.
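The shape of that flow: the request handler enqueues the task and returns immediately, while a background worker does the slow work. This runnable sketch uses a stdlib queue and thread as a stand-in for the Celery/RabbitMQ pipeline described above; names are illustrative:

```python
import queue
import threading

task_queue = queue.Queue()
sent_emails = []  # stands in for the mail server

def send_confirmation_email(order_id):
    # In production this renders the ticket PDF and sends the email.
    sent_emails.append(f"confirmation for order {order_id}")

def worker():
    """Background worker: the Celery-server analogue."""
    while True:
        func, args = task_queue.get()
        func(*args)
        task_queue.task_done()

def handle_order(order_id):
    """The 'view': enqueue the slow work and respond right away."""
    task_queue.put((send_confirmation_email, (order_id,)))
    return f"Here's your order confirmation number: {order_id}"

threading.Thread(target=worker, daemon=True).start()
```

With Celery the same split is a `@task`-decorated function invoked with `.delay()`, and the broker (RabbitMQ at Eventbrite, Redis at Lanyrd) replaces the in-process queue.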
Freeman: That's on the Eventbrite side. On the Lanyrd side, you're using Celery also, but backed by Redis.
Willison: That's right. We use Redis as our message broker because it's much more lightweight. We didn't need to do the full, clustered RabbitMQ setup. I think if we'd had time we might have gone with the RabbitMQ setup, but Redis is shockingly powerful out of the box for this kind of thing.
Freeman: Is Varnish part of your stack?
Willison: At Eventbrite we've used it for a few small things and we're looking at increasing our usage. At Lanyrd we primarily used it to deal with unexpected traffic spikes. If you're not signed in, we serve your request via Varnish with a 60-second expiration. If you are logged in, requests skip the caching layer entirely. That means if somebody with 3 million followers tweets a link to a Lanyrd page, we don't even notice the uptick in traffic.
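The caching rule Willison describes, expressed as runnable pseudocode. In production this decision lives in Varnish's VCL rather than application code; the 60-second TTL matches the figure above, and everything else is illustrative:

```python
import time

CACHE_TTL = 60   # seconds, matching the 60-second expiration above
_cache = {}      # url -> (expires_at, body)

def serve(url, logged_in, render):
    """Logged-in requests bypass the cache entirely; anonymous
    requests are served from cache until the entry expires."""
    if logged_in:
        return render(url)
    entry = _cache.get(url)
    now = time.time()
    if entry and entry[0] > now:
        return entry[1]          # cache hit: backend never sees it
    body = render(url)
    _cache[url] = (now + CACHE_TTL, body)
    return body
```

Under a traffic spike of anonymous visitors, the backend renders each hot page at most once per minute no matter how many requests arrive.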
Freeman: I had read that Disqus uses Varnish in its architecture, and it really does a lot, but it sounds like it's not a key component to what you're doing.
Willison: It's not yet. Disqus is doing really interesting stuff with it, and we're keen on doing more with Varnish. It's an incredibly powerful tool once you get into the details of VCL and its different capabilities.
Shuping: On the caching note, I'd love to give a shout-out for Redis. On the Eventbrite side, we use a ton of Redis. We do a lot of back-end analytics in Hadoop to calculate things like events recommended for you based on your previous purchases. Instead of letting Django query Hadoop directly, which might be slow, we have Hadoop upload the results and keep them updated in Redis. The site ends up querying Redis, which is superfast and we love it. We're definitely making use of Redis Sentinel, which is the automated master-slave failover system. We've written our own Redis Sentinel client that also supports sharding of clusters, which we rely on heavily.
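The pattern is worth spelling out: the batch job writes precomputed results into the key-value store, and the web tier only ever reads from it. A sketch with an in-memory stand-in for Redis (real code would use the redis-py client, and key names here are hypothetical):

```python
class FakeRedis:
    """In-memory stand-in for a Redis connection so the pattern is
    runnable without a server; mirrors the get/set calls used below."""
    def __init__(self):
        self._data = {}
    def set(self, key, value):
        self._data[key] = value
    def get(self, key):
        return self._data.get(key)

def publish_recommendations(r, user_id, event_ids):
    """The Hadoop batch job's side: push precomputed results."""
    r.set(f"recs:{user_id}", ",".join(event_ids))

def get_recommendations(r, user_id):
    """The Django site's side: only ever read the fast store."""
    raw = r.get(f"recs:{user_id}") or ""
    return raw.split(",") if raw else []
```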
One thing I wanted to touch on from way back was the scale question -- requests per second on the systems side. What's interesting about Eventbrite is that we don't always know when there's going to be an onsale for a huge event where 500,000 people are going to show up at our site to buy tickets for the Black Eyed Peas or whatever it may be. On the ops side, we're using Puppet and everything's automated. We actually have, within our ops team, our own Django site to manage our instances. We basically have our site built out to four times the capacity at any moment, so if a big onsale comes along and a ton of people hit the site, we can absorb that traffic and handle it.
Freeman: So you keep your capacity padded out to four times. Is that ever not enough? Do you ever have to scale out beyond that?
Shuping: No. That gives us the maximum performance on the order side for where we're built out now. We've devised a system to handle excess traffic called the "waiting room." It's part of our core Django application, running on a separate set of servers. Basically what happens is we keep track of how many people are registering for a particular event at any given time. If it exceeds a certain threshold, we let the first 5,000 people start typing their credit card number in. When customer 5,001 comes in, we send them to the waiting room. It's a static page that's AJAXing to a separate set of servers asking, "Is it my turn? Is it my turn? Is it my turn?" It effectively lets all these people get in line, literally, for the available tickets. This lets us absorb the extra load to a different set of servers.
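Stripped to its essentials, the waiting-room gate is a per-event counter with a threshold. The 5,000 figure comes from Shuping above; the function names and in-process counter are illustrative (a real deployment would keep this count in a shared store such as memcache or Redis):

```python
MAX_ACTIVE_BUYERS = 5000          # threshold quoted in the interview
active_buyers = {}                # event_id -> buyers currently in checkout

def try_enter_checkout(event_id):
    """Admit a buyer if the event is under the threshold; otherwise
    send them to the waiting room, where the page polls this same
    check until a slot opens up."""
    count = active_buyers.get(event_id, 0)
    if count < MAX_ACTIVE_BUYERS:
        active_buyers[event_id] = count + 1
        return "checkout"
    return "waiting_room"

def leave_checkout(event_id):
    """Called when a buyer completes or abandons their purchase."""
    active_buyers[event_id] = max(0, active_buyers.get(event_id, 0) - 1)
```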
Freeman: So you're really not dealing with elastic clusters. You have a steady setup, and for some overflow, it spills over.
Shuping: Yeah, most people do autoscaling, but we take advantage of that more in QA and stage environments. We'll say, "Let's build up this cluster and see what happens if we do this," but we don't do autoscaling in production.
This story, "Expert interview: How to scale Django" was originally published by InfoWorld.