HAProxyConf 2019 – How Booking.com Powers a Global ADN with HAProxy by Marcin Deranek
- Articles, Blog

HAProxyConf 2019 – How Booking.com Powers a Global ADN with HAProxy by Marcin Deranek

Hello everyone. Good morning. Over the years booking.com has grown from
a small start-up to one of the largest travel e-commerce companies in the world. Such growth is good for the business, but
at the same time it is a challenge for our engineering teams to scale the infrastructure. Most of the systems designed at booking.com
last a few years, enough for that they need to be reengineered or re-architected to accommodate
the growth. Our load balancing platform is no exception. So during this talk I’m going to share with
you how we scale our load balancing infrastructure from a pair of load balancers to hundreds
of load balancers handling billions of requests a day internally and externally. So in 2008 when I joined booking.com, our
load balancing infrastructure looked like this. We had a pair of Linux boxes running IPVS,
so it’s IP virtual server. It’s a load balancing solution in the kernel
provided as a module; and for failover we use Keepalived. So there was, like, an automatic failover
that could take place. Later on, we started actually using Puppet
in the company, so these servers were managed by Puppet. Even, I think, our git history still remembers
some puppetry around these boxes. So it worked pretty well. We had some problems, for example, ARP caching
was one of them. When the [???] basically was caching ARPs
and we did failover and the traffic was still going to the old device for some of the site. But overall we were happy with it. The main problem we saw with this was layer
4. Actually these load balancers were working
fine, but we wanted to do more; and that’s why we ended up with five appliances, where
these were layer 7 load balancers; and layer 7 opened up a whole world of new possibilities,
which we liked a lot. For example, you could alter the payloads,
inject headers, remove headers, as well it gave us, like, a possibly of routing traffic. For example, in case of something, some condition,
you can send it here or there. Again these devices were active-standby, so
one was, like, doing the job. The other was, like, sitting there and doing
nothing. We didn’t have any orchestration around that. Later on, we started using Puppet as puppet
proxy, but unfortunately only a certain set of objects on the device. So, all those objects related to the load
balancing itself, so like virtual servers, pools and nodes we were able to manage; but
like device-specific properties like users, let’s say timezone of the device, we had to
set them manually. So set all of these devices when they came
from…when we purchase them. They had to be manual, and upgrades as well. Especially, at a certain point in time, we
had like a lot of vulnerabilities discovered in the SSL etc., so we had to kind of do lots
of these upgrades. So these processes were manual. Over time, we developed some scripts and these
scripts allowed us to automate some parts of the job, but still there had to be some
operator behind it doing the thing; and, unfortunately, these devices were vertically scalable. So when our traffic was growing for the site,
we had to buy bigger devices and over time we ended up with this. Essentially, we had like a lot of different
pairs of different generations of hardware. The way how it worked was, usually, the traffic
for the main site was growing, so we had to buy a new pair of devices. We were buying those, the most powerful, the
most expensive ones for the main site and those which were currently used for the main
site were repurposed for some other lower traffic sites. So at the time, we ended up with all these
generations, lots of this hardware lying around; and we hit, at a certain point, that we had
already kind of reached the sort of top-of-the-line where there was no room to grow up. And actually, the VPN solution from F5 we
didn’t actually, we didn’t like it a lot. So we started…we did some research, and
we thought, there must be a better way of dealing with the problem. We came across the Equal-Cost-Multi-Path routing. Equal-Cost-Multiple-Path routing is a strategy
for routing network packets along multiple paths of equal cost. When forwarding a packet, router decides which
next hop to use based on the results of the hash algorithm. In our case, we use per-flow load balancing
to ensure single flow remains on the same path for its lifetime. In this case, the flow is identified by source
IP address, source port and the destination IP address and the destination port. Interesting things happen when the amount
of nodes changes. In this case, some of the flows will get reshuffled;
they’re going to change the path. So depending on the different implementation
of the hash algorithm, ideally you want only 1-Nth of the flows to change their paths because
let’s say you have four nodes and let’s say one of them died or you take it for maintenance
and then, only 1-Nth, so the flows that would go through the node number five would get
reshuffled to the other ones. And actually, some vendors actually provide
this. Unfortunately, some other ones, at least at
the time when we tested it, many more flows get reshuffled. So depending on this hash algorithm implementation
you might want to check it before you go on. You want to buy some, likely, you go on shopping. ECMP was like, we like it a lot because it
allows you to do load balancing at the line rate. I suspect it’s mainly due to that there’s
no session table, right? So there’s…so usually…it was like the
problem of F5, like a limit of this amount of sessions etc., whenever the sessions go
up, but the things get slower and slower. In this case, you have a constant speed because
for every packet you apply the hash for the packet. So this is how our architecture in one availability
zone in a single environment looks like. So we have two tiers, we have a layer 3 tier
and a layer 7. Traffic comes from the Internet. It hits one of the fabric layer switches. A single IP address, by default, is assigned
to a single fabric layer switch, even though when this switch dies the other switches will
take over automatically. So this one is like as a primary, but, like,
this one dies and the flows will get up and handled by different fabric layers switches. Then when the traffic hits fabric layer switch,
then the fabric layer switch will round the hash algorithm on the flow and will have,
because it has five alternative paths, and it will select for what traffic to one of
the top-of-rack switches. And then, again, when the traffic hits one
of the top-of-rack switches based on the hash algorithm of the flow, it will have two alternative
paths because we have two HAProxy load balancers behind it. Then it will, based on the hash result, it
will forward it to one of the F5…sorry, the HAProxy load balancers. In our architecture, we have different load
balancer groups. So, on the diagram there, they are presented
as blue and white HAProxy logos. The load balancer group is a set of load balancers,
which is configured in exactly the same way and they handle the same set of sites, the
same set of IP addresses. So, creating such groups allows us to isolate
certain sets of sites from each other. So let’s say that something bad happens to
a certain load balancer group, it will only affect this load balancer group and not the
others. Obviously, it’s possible that a single IP
address, even though we don’t do it by default, is handled by two load balancer groups, but
usually we do it when we transition, let’s say, one site from one load balancer group
to the other. For temporary, it will run through two groups
and two groups will handle the traffic. So, in this example, we use the, for a different
site traffic goes to a different fabric switch. Again, the fabric switch will have five alternative
paths. We use exactly the same top-of-rack switch
infrastructure. And when it hits top-of-rack switch, top-of-rack
switch will forward to the blue HAProxy load balancer. We use Equal-Cost-Multipath routing in combination
with Anycast. Anycast a process for routing network traffic
where sender traffic is sent to a destination that is closest to it in terms of network
topology. Closest could be defined as lowest cost, lowest
distance, smallest amount of hops, or potentially lowest measured latency. In practice it means that, essentially, a
client goes to the endpoint, to the load balancing platform, which provides the lowest latency. We prefer locality. You always go local, unless the local is not
available, you go somewhere else, which is the closest. So, in this example we have a client in availability
zone 1 and because of the lowest cost, it will go to the load balancing platform, which
provides the same service as other availability zones, and then it will be handled by the
service in the same availability zone. So it provides the lowest possible latency. Additionally, when, let’s say, the load balancing
platform goes down or it’s not available or the load balancing platform is down there,
then it will choose the next closest availability zone, servicing with the cost 2, in our example. And it will be handled by load balancing platform
in availability zone 2 and the servers there. This will provide us automatic failover and
resiliency of our service. So, this is our architecture and the data
plane. To manage that we’ve built our own software. We call it Balancer API. This software is used to manage our whole
load balancing platform in the whole company. Obviously, we have different clusters to handle
different environments, but this is like the same piece of software. At the center of this software we have Balancer
API, which essentially stores the whole configuration in the database and the configuration is stored
as objects with attributes, and as well with a relationship between those objects. So for example, we have load balancer objects,
which corresponds to the physical load balancers. We have virtual servers, which correspond
to HAProxy frontends. We have pools, which correspond to HAProxy
backends. Servers. Business logic is a snippet of HAProxy configuration
which can be assigned to either a virtual server or pool; and some of this business
logic can be context aware. So depending to which…in which context it
gets assigned, to which virtual server it is assigned, it might do slightly different
things. For example, if it gets assigned to virtual
servers it can actually deal with different…it will actually use different sets of pools,
a different set of backends. So basically, you build, like, one logic and
you can assign it to different virtual servers and will do certain things depending on which
context it is assigned. It will actually figure it out itself. Maps, access lists and files, or data files. We store data there and they can be used by
business logic. IP addresses can be assigned to virtual servers
and SSL certificates as well we manage there, or can be assigned to IP addresses and there
are other values like smaller objects. So, how do servers and load balancers gets
configured? So this diagram describes basically the whole
flow of how the server essentially ends up in the configuration, how the server gets
from the point where it gets provisioned to the time, to the point, where it actually
gets traffic. So our servers are kept in the server inventory
where, basically, some of them are racks available for use; and somebody basically comes and
wants to provision one server to be a web server to handle traffic for a specific service. So somebody comes to the user interface and
clicks that I want to provision this server as a web server. At this point, the server inventory will send
an inventory update to Balance API. So this is our brain for our load balancing
platform. In this inventory update, we will get the
name of the server and some attributes like IP address of the server, and also, as attributes
like name of the potential class that the server belongs to, the roles it belongs and
the availability zone. Every time we provision a server we don’t
want to manually go to every pool and say like add this server here, add this server
there. The way how we solve it is by when we create
a pool we give it a specific set of attributes like a role, availability zone, and potential
clusters, and these attributes you can think about like a set of rules. If this newly provisioned server, an inventory
update is sent to the Balancer API, and these attributes of the back of the pool will match
the attributes of the server. The server will get automatically added to
the pool. So we want two things happen automatically. So, every time some new server gets provisioned,
it will automatically get the pool, the relevant pool it’s supposed to be in. When the server gets added to the relevant
pools in the Balancer API, then the server gets provisioned: operating system gets installed
and then Puppet kicks in to configure the server. At this point, Puppet will request configuration
from Puppet master and Puppet master, on behalf of the server, will request server membership
for specific pools the service is in and then it will configure Zookeeper agent, which will
run on the server. So the role of the Zookeeper agent is to make
sure the service provided by the server is okay. So the server is healthy, so the server can
handle live traffic. So Zookeeper agent will run on the server
and then once it makes sure the server is healthy, it will register to Zookeeper under
specific namespace, which was configured by Puppet. This namespace is watched by Zookeeper to
Balancer. It’s a separate daemon, which acts as a proxy
between Zookeeper and Balancer API. So essentially, proxies the information from
Zookeeper into the Balancer API. So once it will see the internal node or the
specific server in a specific namespace, it will enable this server in the Balancer API
for specific pools. So at this point, server gets enabled. Initially, when the server gets inserted through
the server inventory the server is disabled by default, for a good reason. We don’t want to any server to get traffic
by default, right? We only want it to get traffic when we confirm
it’s healthy, everything’s fine. At this point, the information gets propagated
through Zookeeper and then Zookeeper to Balancer to enable it. At this point, server is ready to receive
live traffic. Similarly, when the, say, health check fails
or server goes down, then the internal node gets removed from Zookeeper and then this
information gets propagated to the Zookeeper to Balancer, and Zookeeper to Balancer will
remove…sorry, will disable the server in Balancer API. So this is like how the information is stored
in Balancer API, but how this information gets propagated to the load balancer? So, on every load balancer we run a Balancer
agent. The role the Balancer agent is essentially
to configure all the components and get configuration from Balancer API and configure relevant components
of our load balancing platform. In this case, Balancer API will check for
configuration changes and will first configure HAProxy, HAProxy software. So it will configure it, the configuration
can take place in two ways. If the configuration change can be applied
over the socket at runtime, it will do that; and then you will update the configuration
file. If the configuration is not supported over
the socket or there are too many changes, for example, a hundred whatever, so it will
fall back to the reload. So update the file and then schedule reload
of the HAProxy. Once the HAProxy gets configured with all
the relevant services, Anycast healthchecker will get configured. The role of the Anycast healthchecker is to
essentially make sure that all the services which are configured on HAProxy are healthy. So, actually, it will check all the services
behind single IP address. Once the older services, because behind a
single IP address we can have multiple services; let’s say we can advertise we can provide
HTTP/HTTPS or, in some cases, we provide SSH for example for, like, for Gitlab, for example. So in this case once the Anycast healthchecker
will confirm all services behind the single IP address are healthy, it will reconfigure
a BRD, BGP Routing Daemon, and BRD will start announcing this IP, this prefix or this IP
address to the upstream. In our case, it’s a top-of-rack switch. Then when the top-of-rack switch will receive
this advertisement and also the traffic will start flowing for this specific IP address
to HAProxy load balancer. So this is our load balancing…then HAProxy
will forward this traffic to the backend server. So this is our load balancing platform. We need, as well, a good visibility of that. So we have extra daemons running on the load
balancer to make sure we have good visibility what is happening. So we have created haproxystats demon, which
periodically dumps the information from HAProxy over stats socket and will submit it to Graphite. Similarly, Collectd is running on the box
itself as well to collect operating system specific metrics and submit them to Graphite. These metrics can be used, for example, to
build some nice dashboards or to set up some, let’s say Bosun monitoring if certain metrics
go below a certain threshold, etc. Additionally to Graphite metrics, we send
access and error logs to Rsyslog. Over UDP, obviously, to not block; and then
Rsyslog, local Rsyslog will forward these log messages to a central logging system and
then eventually they’re going to end up in Elasticsearch cluster where they can be analyzed
and then we can build some visualization, do troubleshooting, etc. Because then we have our own per-request logs. Based on the data stored in Elasticsearch,
essentially we can see different things. It’s up to you to decide what you like to
see there. HAProxy has plenty of different features so
what you want to actually dump there it’s up to you. There’s like lots of data. So, in our example, you can see that, like
you can compare different TLS, what is the TLS handshake time for different TLS versions. The variability for TLS 1 and 1.1 comes from
the fact that actually the majority of traffic these days uses TLS 1.2 and TLS 1.3 and we
don’t have enough samples for TLS 1 and 1.1 and that’s…the graph is a bit jumpy. Additionally, you want to make, like, educated
decisions based on the data you store there: whether you want to enable TLS resumption
or not and how much gain you can get from enabling that. At booking.com we aim to provide load balancing
as a service. So that’s why for the balancer we create multiple
user interfaces, so people who don’t know much about load balancing can actually use
our service, our load balancer platform. So the first user interface we created is
a visualisation tool, you have a screenshot on the screen. This user interface allowed people to actually
see how the configuration looks like. So basically visualize it and as we click
on the relationship, so then they can see how the objects are connected. So if they can find object, then can kind
of follow the relationships; and as well, we allow them to change attributes. So they can actually modify the configuration
at runtime. So they can go there to the UI and then just
send like a changing ratio or whatever. So they can do it. Obviously we limit the scope of the actions
to the roles they maintain. They can only manage their own services. We also created the “Create Pool” and “Create
Site” user interfaces. I’ll talk about it in a second, which they
allow them to create a configuration for the newly built services. Additionally, we created a failover traffic
user interface, failover traffic is for…basically easy failover traffic between different availability
zones in case of, like, an outage or bad rollout or servers being down. So in the very easy way, basically they just
click one button and that traffic will just get away from certain availability zones. We also provide them access to logs and metrics. So if they want to troubleshoot problem regarding
their service they can go directly to the data or if they…we provide a set of dashboards,
like generic dashboards, but if they want to build their own dashboards with custom
stuff, we provide access to metrics directly. So for for the role owner to create their
own service, first they need to…they need to create a pool. So first they have to specify like a role…the
most important attributes is like a role, availability zone, and potential clusters. So these three attributes will describe which
servers are going to end up there. So once the user selects that, in the Members
field, he will see which currently present server in the server inventory will actually
end up in that pool. So if he’s happy the results, if actually
this server should be there, basically he will create a…click Create Pool button and
the pool gets created. Once he has a pool, then he can proceed with
the Create Site user interface where he first has to select the role for which he wants
to create a site; by site I mean that’s like multiple virtual servers which are configured
in multiple availability zones at the same time. So basically the available…let’s say through
our infrastructure. Then once he selects a role, he can select
the pools he wants to use for that site, some other managed for attributes, and then click
the Validate button to actually validate if he’s happy with the changes which are going
to take place. If he is happy with that, he clicks the Create
Site button and his site gets created in a short while/ He basically can just use it
right away. One of the biggest features, which we developed
on top of HAProxy, which was actually developed as context-aware business logic, is smart
traffic routing. Smart traffic routing is business logic which
has three requirements in mind. So, the first requirement is that we want
traffic to be delivered whenever possible. Obviously we prefer always to go to the local
server. Due to the lowest latency, we always prefer
a local availability zone or to go to the local, let’s say, local servers; but if that’s
not possible, use the servers wherever they are, which I don’t care which availability
zone. Please deliver traffic to those servers if
they are healthy. It’s always better to serve slow responses
rather than returning errors to the customers. We aim to never fail. The second requirement is that we always want
even traffic distribution between different availability zones. This requirement comes from the fact that
if you’re dealing with lots of servers, then actually every…if Availability Zone 1 in
our example goes down and then Anycast will send it to the next closest availability zone. In this case, it will be Availability Zone
2. So that means this availability zone’s servers
will receive twice as much traffic. If you would evenly distributed it between
the remaining availability zones, these servers would…all servers will only receive 50%
of the traffic. So if you’re dealing with a lot of servers,
50% versus 100% means a lot of hardware. So this requirement tries to actually more
efficiently use your hardware and have lower capacity requirements for every availability
zone. And the last requirement is that we want our
clients to get sticky to a specific availability zone. This requirement comes from the fact that…is
due to the session data because when the client goes to…hits specific servers in specific
availability zones, session data gets stored locally and, as well, it gets synchronized
to the other availability zones. If he will flip between different availability
zones on per-request basis, there is a race condition between the data being copied…his
session data being copied from the previous request from a remote availability zone and
then the session data being stored for the current request, and then his session data
might get clobbered or some data might be overwritten. I mean our software can deal with that, but
we would rather avoid it rather than deal with it. In normal circumstances, traffic gets evenly
split between all three availability zones. Then it hits load balancing platform in that
availability zone and then hits servers. So this is like normal circumstances. In case of load balancing platform failure
in Availability Zone 1, thanks to Anycast traffic from that availability zone will hit
the load balancing platform in Availability Zone 2. Then load balancing platform in Availability
Zone 2 will see: This is traffic meant for Availability Zone 1 and servers in Availability
Zone 1 are up. In this case, it will forward traffic back
to Availability Zone 1. So in this case, the even traffic distribution
requirement is met. Obviously, we have extra latency for the traffic
going through Availability Zone 1, but everyone is happy I guess. In case of servers being down, similarly to
the previous diagram, traffic again will flow to the load balancing platform in Availability
Zone 2, but then load balancing platform in Availability Zone 2 will see that servers
Availability Zone 1 are down and then it will split this traffic evenly between all remaining
data centers. The traffic which goes directly to availability
zones 2 and 3 remains as is. We only now are basically splitting traffic
between availability zone 2 and 3, and as well we try to keep clients sticky to specific
availability zones. So these are the failure of scenarios and
a more common scenarios is this, which is similar to the previous one, where, actually,
load balancing platform is up but the servers are down. This could be due to outage being in Availability
Zone 1 or servers being…or a bad rollout there or somebody making…doing a failover
when he basically marks all these servers are down. In this case, the load balancing platform
in Availability Zone 1 will distribute this traffic among the remaining availability zones. So what are our current and future plans? Currently, we are in process of evaluating
into QuickAssist technology to do SSL acceleration. So we want HAProxy to do SSL offloading in
hardware. We already have a proof of concept and we
are, yeah, we are basically needing to run some tests and get some numbers to actually
see how much benefit we could get from the hardware acceleration. Our platform is HTTP/2 ready, but we ran into
some problems and we first need to sort it out before we enable it globally. Obviously, we’re looking forward for HTTP/3,
which is an upcoming new protocol; and, as well, one of the features we’re asking HAProxy
for is TCP Fast Open on the backend side because TCP on the frontend was available for quite
a while, but this is something we would love to try it because we have some setups when
actually the latency between the HAProxy and the backend server is relatively high and
this would allow us to significantly reduce connect time for all consecutive requests. And as well we want to provide path-based
routing in the Balancer user interface for role owners so role owners can actually define
the routing rules for their service, so how they want for a specific path where they want
traffic to go to. This would allow them to basically specify
their own through the user interface without much knowledge of, like, Balancer API voodoo. Through a simple way to define: for this path
go here, for this path go there etc. In our load balancing platform we use the
following open-source software. I have to mention that the bottom three, so
anything below Bird Internet Routing Daemon, was exclusively created for our load balancing
platform. You have to keep in mind that some of these
can be like partially obsolete or might be obsolete soon. This is due to the fact that when we created
our load balancing platform we were dealing with HAProxy 1.5, which was a while ago and
some of these functionalities…actually, you know, HAProxy moving forward and basically
listens to its customers and implementing some features which people are using.

About Ralph Robinson

Read All Posts By Ralph Robinson

Leave a Reply

Your email address will not be published. Required fields are marked *