While most of us have built really cool websites, realistically speaking, few developers have had to worry about the complexities of managing and scaling incredibly large websites. It's one thing to put up a site for a small company so it has a great presence; it's another to figure out how to scale your site so it won't buckle under the load of thousands of users.
I was fortunate enough to chat with the folks at flash-sale site Gilt.com, which has received quite a bit of press over the years and seen tremendous growth. It's opportunities like these that allow us to probe the teams that manage these sites and learn how they handle their day-to-day business and technology.
In this interview, Eric Bowman, VP Architecture at Gilt Groupe, takes us through some of the background behind the site and the technology decisions that keep the service running smoothly.
Q Could you give us a quick intro about yourself?
I've been with Gilt since August 2011, and became VP/head of architecture and platform engineering in December 2011. During my time here, we've transitioned from Java to Scala, adopted Gerrit for code review, implemented a continuous delivery system we call Ion Cannon, introduced a typesafe client and microservice architecture, rolled out Play 2 for our frontend, created a public API, and rolled out platform engineering as an organizational architecture. I'm incredibly proud of what the team has accomplished. Before Gilt I was an architect at TomTom in Amsterdam and was responsible for their online map, search and traffic APIs, and products. Prior to that I was an architect working on service delivery for 3's global 3G launch offering, and a long time ago my first "real" job was building The Sims 1.0.
Q Could you set an expectation for our readers of the scale/size of Gilt.com so they get a better feel for the breadth of effort needed to build a large-scale site?
The flash sales model presents a unique technical challenge because so much of the traffic comes in these incredible pulses as new sales go live. Over the course of a few seconds, our traffic can increase by as much as 100x, which really puts stress on every part of the system, all at once. Essentially, we need to have the eCommerce infrastructure almost at Amazon scale for at least 15 minutes every day. Most days this happens exactly at noon EST, and until a couple of years ago, noon was a stressful time every day. Nowadays it's usually a non-event--in part because our software is great, and in part because we have better visibility into system performance and behavior.
In order to accommodate the pulse, we tend to over-provision on the hardware side. Our customer-facing production environment at the moment consists of about 40 physical servers running a couple hundred micro-services and a few dozen user-facing applications. On top of that we have about another 100 servers for development, testing, data warehousing, analytics and development infrastructure.
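To make the over-provisioning arithmetic concrete, here is a rough sketch in Python. Only the 100x multiplier comes from the interview; the baseline request rate, the per-server capacity, and the headroom factor are hypothetical numbers chosen for illustration, not Gilt's figures.

```python
import math


def servers_needed(baseline_rps, pulse_multiplier, per_server_rps, headroom=1.0):
    """Return how many servers are needed to absorb the peak request rate.

    All inputs are illustrative assumptions; `headroom` is an optional
    safety factor applied on top of the computed peak.
    """
    peak_rps = baseline_rps * pulse_multiplier
    return math.ceil(peak_rps * headroom / per_server_rps)


# Hypothetical site: 200 requests/s off-peak, 1,000 requests/s per server.
off_peak = servers_needed(200, 1, 1000)      # quiet part of the day
at_pulse = servers_needed(200, 100, 1000)    # the noon pulse at 100x
```

Under these made-up inputs, hardware sized for the pulse sits mostly idle for the rest of the day, which is the trade-off that over-provisioning accepts.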
Q When it comes to large web properties, most developers are curious about the technology that runs under the hood. Could you share what you're using and what prompted some of the technology choices you've made?
Gilt was originally a Ruby on Rails application with a PostgreSQL backend. We had serious problems scaling Rails to handle the noon pulse, and the core customer-facing systems were ported very quickly to Java and a coarse-grained, services-oriented architecture starting in 2009. We kept the Java extremely low-tech: JDBC, hashmaps and JSP.
The performance and scalability of low-tech tools on the JVM is astonishing. As we grew the tech organization in Gilt, though, it became increasingly hard for teams to contribute code. A downside of the low-tech approach was that the contracts between systems were not well defined, and the most critical code bases grew monolithic over time. We gradually transitioned away from a pure servlet-based service implementation towards JAX-RS, and in parallel increasingly toward Scala. Our service stack is similar to Dropwizard, which came a few years after we built our internal platform, built on Jersey and Jackson 2. We like Dropwizard, but found that the way apps are configured--at runtime, in particular--wasn't very compatible with our infrastructure, which uses ZooKeeper for discovery and per-environment configuration.
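As a loose illustration of what discovery and per-environment configuration through a coordination service can look like, the Python sketch below uses an in-memory dictionary as a stand-in for a ZooKeeper-style tree of paths. The path layout, the service name, and the helper functions are all hypothetical; the interview does not describe how Gilt's platform is actually organized.

```python
# In-memory stand-in for a ZooKeeper-like tree (path -> value).
# The layout below is an assumption made for illustration only.
TREE = {
    "/config/default/checkout/timeout_ms": "500",
    "/config/production/checkout/timeout_ms": "200",
    "/services/production/checkout/instance-1": "10.0.0.1:8080",
    "/services/production/checkout/instance-2": "10.0.0.2:8080",
}


def get_config(tree, env, service, key):
    """Look up a setting for `env`, falling back to the shared default."""
    for scope in (env, "default"):
        path = f"/config/{scope}/{service}/{key}"
        if path in tree:
            return tree[path]
    raise KeyError(f"no value for {service}/{key} in {env} or default")


def discover(tree, env, service):
    """Return the addresses registered for a service in one environment."""
    prefix = f"/services/{env}/{service}/"
    return sorted(addr for path, addr in tree.items() if path.startswith(prefix))


prod_timeout = get_config(TREE, "production", "checkout", "timeout_ms")
staging_timeout = get_config(TREE, "staging", "checkout", "timeout_ms")
checkout_hosts = discover(TREE, "production", "checkout")
```

The point of the sketch is that the same application artifact can be deployed anywhere and pick up its settings and peers at runtime from the environment it lands in.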
We've also moved from Ant to sbt over the last year. At TomTom I grew fond of Maven, and spent some time trying to introduce it at Gilt. At the point everything was falling into place, I had a change of heart and did some experiments with sbt. We found that sbt provides a fantastic developer experience and has an incredibly powerful extension model. Switching to sbt has enabled a degree of tooling customization that previously seemed impossible, and a lot of great features have fallen out of the sbt adoption, such as deep integration with our continuous delivery system and automatic dependency upgrading–things we couldn't even imagine with tools like Ant or Maven. It was an interesting case where the limitations of the tools limited our imagination, and an important lesson for me personally in how to recognize and avoid that antipattern.
On the database side, we depend heavily on PostgreSQL, MongoDB and Voldemort. PostgreSQL has been part of the Gilt stack from the beginning, and has been amazing. We were one of the sponsors supporting the development of hot standby, a key feature in PostgreSQL 9.0 that enables true replication. We run it on Fusion-io cards, and the performance has been phenomenal. Our CTO, Michael Bryzek (also a Gilt cofounder), recently released a really nice open source schema upgrade mechanism for PostgreSQL. In general we've been moving away from a monolithic database application towards individual private databases per service. PostgreSQL is really nice for this, and its stable, predictable performance makes it straightforward to predict system behavior and to provision smartly.
Both MongoDB and Voldemort have become increasingly important in the last year or so. Voldemort has been part of Gilt's stack for some time, though usage of Voldemort didn't grow at all until this year. Despite less momentum than some other NoSQL solutions, we find Voldemort to be incredibly reliable and straightforward to reason about, and gradually we've introduced it in a few more places. We've wrapped it and incorporated it into our core platform, which makes it straightforward to use in new services; it's easily embeddable, so almost no additional infrastructure is needed to run a reliable Dynamo-style key-value store. We've also looked at a number of other solutions in the space, including Riak, and we're pretty excited by all the activity in the field--particularly around multi-master databases with strong semantics on conflict resolution.
MongoDB has also become increasingly important at Gilt over the past couple of years. Today we run our core user and authentication services on MongoDB (absolutely critical data with very high throughput and low latency requirements), and it has been running very smoothly for a long time. MongoDB gets a hard time in the community sometimes, but we've found it to be astonishingly fast when run on high-spec hardware, and we've been reluctant to consider other options because of its raw performance in our use case.
Q Focusing on scalability specifically, what were your expectations when Gilt was launched and how did you prepare for launch traffic and post-launch growth?
Gilt grew faster than anyone could have imagined, and everyone was caught off guard by how difficult it was to scale the Rails stack. Like any successful startup, Gilt was assembled just-in-time, and the bulk of the effort was spent trying to react to what the market fed back in the face of so much public interest in what Gilt was trying to do. Startups in this situation have to maneuver a knife-edge of "just enough" architecture. If you overthink or over-engineer too much or too soon, it's impossible to move fast enough to capture the market you are going after. But if you don't architect enough, you can't actually adapt once you've got something running. Gilt's core tech stack has always embraced simplicity as a key feature, and maintaining that over time has been a worthy challenge.
Q Post-launch and obviously using hindsight, which decisions do you feel were spot on and which do you wish you could have a do-over on?
The decision to use PostgreSQL was spot-on. Despite the scaling issues with Rails, it's an amazing framework for moving fast--until you need to scale. So that wasn't necessarily a wrong decision, and a lot of Gilt's internal systems today are written in Rails. If we were starting over today, we'd probably build on top of Play 2.
Q Of the technologies you've leveraged, which specifically helped in terms of scalability?
In terms of technology, we've found the JVM to be very scalable. Over time a number of tools have come into play to make scaling easier. Scala, ZooKeeper, RabbitMQ, Apache Camel and Kafka come to mind as important for scalability.
However, scalability at Gilt has had less to do with specific technologies, and more to do with architecture and approach. We've never been afraid to rethink things almost from the ground up, and scaling is a multidimensional problem that covers technology, infrastructure, architecture, and organizational structure. We've iterated along all four of those axes.
Q Being a commerce-oriented company, safeguarding customer data I'm sure is priority #1. From a security perspective, how have you had to adapt your infrastructure to adjust to the constantly changing security landscape?
We take security and our customers' privacy very seriously, obviously. I don't want to go into too much detail here, but a few things stand out. We take PCI compliance extremely seriously, and everyone participates on some level in the PCI review process. We've architected our systems using a bulkhead approach to PCI compliance, which physically limits what needs to be PCI-compliant, and also reduces risk in the event of a number of possible breach scenarios we model. We've found that a micro-services architecture and continuous delivery make it relatively inexpensive for us to stay cutting-edge in terms of security-related best practices, and so we try hard to do so.
Q Along those lines, what has been the most challenging aspect of security to manage?
The biggest challenge by far is coming up with a realistic model of what the risks really are, and then making the right decisions to mitigate those risks. Despite lip service about how security is everyone's problem, in practice it's hard for developers to keep security in mind all the time. We've focused more on an architecture that is forgiving and partitioned so that we don't compromise security, and we reduce the scope of any particular potential mistake.
Q How has open-source software played a role at Gilt, both from a technology and financial perspective?
Gilt is built almost entirely using open-source software. We actively encourage our engineering teams to use and contribute back to open source, and we have really low-friction guidelines for how to contribute back to open source. We have a number of open source projects we host on our GitHub repo, and we constantly feed pull requests upstream to some of the most important open source projects we use. We also actively support open source efforts, from funding feature development in PostgreSQL, to sponsoring Scala conferences like Scala Days.
From a financial perspective, open source has helped us keep our costs down, and also helped us move faster. Besides the obvious benefit of not paying licensing costs, open source provides a more subtle advantage, in that when you run into an issue, whether a trivial one or a catastrophic one, you can both look at the source code and potentially fix the problem. I started developing in a closed-source environment, and I do not miss those days where everything was a black box, and licenses were enormously restrictive in terms of what you could do with--or, in some cases, even say about--commercial software.
Q When you looked at a specific OSS-based technology, what was your decision-making process for determining its viability, applicability to your needs and the longer-term management of the technology?
We try to actively follow the latest developments across a number of open source projects, and read blogs, and generally we all love technology and tend to get excited. So sometimes it's hard to avoid the irrational exuberance that can follow if you read too many blog posts and believe all the hype. We have an internal peer-review system that encourages a lightweight planning and architecture discipline, which works pretty well. We also use things like the ThoughtWorks Tech Radar to help temper over-exuberance, and also to gain insight via another lens of what's working well across the industry.
Our approach also depends on how critical the software is. At Gilt we talk a lot about "Voluntary Adoption," which actively encourages our developers to adopt the best tools for the job. In practice this means that individual teams have a lot of leeway in terms of leveraging whatever open source libraries they want, and when this goes well, it helps to keep things simple--and also helps us move faster. Usually the benefits of these libraries are clear; we tend to leave it up to individual teams to do the right level of analysis around the tradeoffs and costs of a particular solution. It is a struggle to avoid too much fragmentation across the organization, and we actively work to understand when teams have needed to use fairly exotic libraries, and incorporate them into the core platform in a way that tries to minimize upgrade pain and "dependency hell."
For more critical shared components and systems we tend to use a combination of consensus, peer review, and stress testing to make decisions. Sometimes we look at a system and it's so obviously superior to the other options that consensus is easy and we move quickly to adopt. ZooKeeper is an example of this. In other cases when the choice is less clear, we tend to just spin up the various alternatives and run them until they break, and try to understand why they failed, if they failed. For example, when evaluating messaging systems, we found that pumping a billion messages as fast as possible through several contenders was a pretty good way to eliminate poor choices via a "last man standing" criterion.
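The "last man standing" idea can be sketched as a small harness: push the same load through every contender and keep only the ones that neither reject nor lose messages. The Python below is a toy version under stated assumptions: `BoundedBroker` is a hypothetical in-memory stand-in for a real messaging system, and the message counts are far smaller than the billion mentioned above.

```python
import queue


class BoundedBroker:
    """Hypothetical stand-in for a messaging system with a fixed buffer."""

    def __init__(self, name, capacity):
        self.name = name
        self._q = queue.Queue(maxsize=capacity)

    def publish(self, msg):
        self._q.put_nowait(msg)  # raises queue.Full once the buffer overflows

    def consume(self):
        return self._q.get_nowait()


def survives(broker, total, burst):
    """Send `total` messages in bursts, draining after each burst.

    Returns False if the broker rejects a message or delivers one out of order.
    """
    sent = 0
    try:
        while sent < total:
            n = min(burst, total - sent)
            for i in range(n):
                broker.publish(sent + i)
            for i in range(n):
                if broker.consume() != sent + i:
                    return False
            sent += n
    except queue.Full:
        return False
    return True


def last_standing(brokers, total, burst):
    """Return the names of the contenders that survived the full run."""
    return [b.name for b in brokers if survives(b, total, burst)]


contenders = [BoundedBroker("small-buffer", 100), BoundedBroker("large-buffer", 10_000)]
winners = last_standing(contenders, total=50_000, burst=1_000)
```

A real evaluation would also record why each contender failed (latency, memory, lost messages), since understanding the failure mode matters as much as the elimination itself.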
For now our approach is fairly lightweight and agile, and we hope to keep it that way. Microservices and a unique approach to deployment make it straightforward for us to try things out in production to see how they work, without much risk. Ultimately how a system works in production is the most important criterion, and not one you can divine through documents and meetings. We try stuff and use what works.
Q OSS relies heavily on community contributions. How does Gilt give back to the OSS community?
On the Java and Scala side, we've not contributed as much or as quickly as we'd like, due to some specifics of how our build works that make it hard to build some of our most core software outside Gilt's infrastructure. We are actively working on improving this, and we have a backlog of Java and Scala software we look forward to open sourcing in the next half year or so.
We've also funded specific features in PostgreSQL, for example, and we regularly sponsor conferences around open source topics--primarily Scala in the recent past. We are also big supporters of the technology groups where we have our main engineering centers (New York and Dublin), opening up our offices to host local meetups, gatherings of technology groups, and even free all-day courses.
We also include open source in our hiring process, in that we tend to prefer developers who are or have been involved in open source. In general we see it as a great sign that a developer is used to code that gets read, and also has priorities aligned with how we try to develop at Gilt.
Eric, I'd like to thank you for taking the time to talk with us about Gilt. We appreciate your transparency and your comfort in sharing many of the architectural underpinnings of such a highly trafficked property. The complexity and diversity of the technologies you use show that scaling a site requires more than just choosing a stack and running with it. It also demonstrates that it's important to look objectively at all of the options available and to choose (and sometimes adapt) the products that can help your business be successful.