For my last gig, I had the opportunity to take part in the largest women's sports broadcast in Australia, the Women's World Cup 2023. It was amazing!
My job was a bit daunting: just make sure there was no repeat of 2018 (major outage, Prime Minister had to apologize, etc.). You know, no pressure.
So let's rewind a bit to the story before I started. At that time, the company was young and had set up a CDN (Content Delivery Network) in-house, as they owned the distribution network. Everything was working fine, serving out the normal traffic to customers. Then along comes the Men's World Cup 2018, and it's one of those events where a few things happen and then snowball into a nightmare.
Customers were no longer able to watch matches. It goes on for a few days, and then the Prime Minister gets on telly and apologizes. In the end, it gets simulcast on free-to-air television.
So I guess, no, lots and lots of pressure to not have it go south this time.
My job started two years out from the event. By this stage, the existing CDN was still somewhat capable, but as happens, it was out of date in both hardware and software. The first major hurdle was that the hardware (servers) was on very long lead times due to Covid and a few global logistics issues. Great, big project, and we might have issues with hardware.
In order to keep it all on track, the decision was made to rebuild the existing older hardware with the new stack. Step one was to take a regional cache offline and have all traffic served from the remaining regions. This worked well thanks to the work done by my co-workers. Now I had some servers to work with, so I went on to build a custom kickstart installer that would be used for both the old and the new servers. It was fairly easy work to get a minimal setup with just enough of a system to stand up a firewall and set up some basic access. Next, I needed some configuration management. Ansible was selected, as some of the team had prior knowledge of it. So I went through the steps of creating all the basic modules you need to set up the firewalls and users as per company policies.
Now onto the really interesting bits that I had never done before. Building a CDN, and not just building it but containerizing it.
A TLS terminator was needed, and for that, we selected H2O from Fastly, as HTTP/3 (QUIC) was just coming out and it sounded like something that we could use. Didn't that come back to haunt us later (keep reading for how).
Then the cache software. This is Varnish Cache. While building and deploying Varnish, I was interested to work out how you can give large amounts of RAM to a container and still be caught out by the most unexpected things. What caught me out was the "Transient" storage. Its job is to hold what should be short-term objects like 404 response pages. What you expect from the name is that an object goes into the cache for a short amount of time and then gets purged. Here's where it got me: unless you give a command-line option to limit its size, it will grow until it exhausts memory. Well, not quite. I suspect in an OS install it goes until some % of memory is used, but in container land, it just goes until it won't work anymore and Varnish crashes. At least Podman restarts the container at this point, so the outage is only a blip, but that also wipes the cache.
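To make the Transient point concrete, here is a rough sketch of how the varnishd command line could be put together so that both the main cache and the Transient store are capped. The `-s malloc,<size>` and `-s Transient=malloc,<size>` options are standard varnishd storage arguments; everything else (the function name, the 512 MiB Transient cap, the 25% headroom, the listen address and VCL path) is my own assumption for illustration, not what we ran in production.

```python
from typing import List


def varnishd_args(container_mem_mib: int,
                  transient_mib: int = 512,
                  headroom_fraction: float = 0.25) -> List[str]:
    """Build a varnishd argument list with bounded main and Transient storage.

    All sizes and the headroom fraction are illustrative assumptions.
    """
    # Leave headroom: Varnish uses memory beyond the storage size
    # (per-object overhead, thread stacks, workspaces).
    headroom_mib = int(container_mem_mib * headroom_fraction)
    cache_mib = container_mem_mib - headroom_mib - transient_mib
    if cache_mib <= 0:
        raise ValueError("container memory too small for this split")
    return [
        "varnishd",
        "-F",                                        # run in the foreground (container entrypoint)
        "-a", ":6081",                               # listen address (assumed)
        "-f", "/etc/varnish/default.vcl",            # VCL path (assumed)
        "-s", f"malloc,{cache_mib}m",                # main cache, capped
        "-s", f"Transient=malloc,{transient_mib}m",  # Transient, capped instead of unbounded
    ]


if __name__ == "__main__":
    # Example: a container limited to 8 GiB.
    print(" ".join(varnishd_args(8192)))
```

The idea is simply that the container limit, not the host's memory, is the budget, and that Transient has to come out of that budget explicitly.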
By this stage, we are down to under a year to kick-off. The old infrastructure is up to date and has been running the new code for a few months. The new hardware is starting to come in, and the changes required to suit it are being integrated into the mix. All is well. We even start to deploy to some sites, and then it hits... One day we notice that our edge caches are bouncing in and out. We start looking into the logs and find it's something with our TLS terminator. Then we see weird request headers just prior to the issue occurring. What we discovered is that a malicious header sent by some script kiddie can take it all down. Actually not all of it, just the TLS terminator, and in such a way that the process stays up but all requests black hole. Luckily, the DNS health checks we run see this and drop the node from the cluster. This just moves the traffic to another node, so while not good, it is not as bad as it could be (what do you know, it's not DNS).
Now the nice part of this was that it was found on Monday, fixed on Tuesday, and passed for release and released on Wednesday.
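The reason the health checks caught this is worth spelling out: a check that only asks "is the process running?" or "does the port accept connections?" would have passed, because the terminator was still up. The check has to make a real request and insist on an answer within a deadline. Below is a minimal sketch of that idea. It is not our actual health check; the function name and the timeout are made up, and it uses plain HTTP to keep it short where a real probe against a TLS terminator would use `http.client.HTTPSConnection`.

```python
import http.client


def node_is_healthy(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True only if the node answers an HTTP request within `timeout`.

    A node that accepts the connection but never replies (a black hole)
    counts as unhealthy, the same as one that refuses the connection.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("HEAD", "/")
        response = conn.getresponse()
        # Any non-5xx answer proves the request path is alive (assumed policy).
        return response.status < 500
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, and malformed replies.
        return False
    finally:
        conn.close()
```

Whatever runs the probe then only has to pull the node out of DNS when it returns `False` a few times in a row.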
So now you're asking, why did I call it "CDN for everyone"?
Throughout this journey, I got to do a lot of things and see how CDNs actually work under the covers, including those of most of the big players in the game. I saw a lot of good things and some not so good. I got to see how much what is sent from upstream sources (origins), and equally from their sources, matters.
My biggest single takeaway was to have your own origin and control how and where things are cached. If you do not or cannot cache your website, API, or content in-house, then do not expect it to be cached correctly in a 3rd party CDN.
Lately, I have seen a bit of a trend where companies are serving their products using a framework, but no one knows how the caching is done. This means that content is cached either not at all or badly (normally the first). Then, if they serve via a 3rd party, once again they either manually set up caching rules there that are not easily replicated to another 3rd party, or they do nothing to set cache control headers. This just leaves you with the cost of serving everything to everyone with next to no cache benefit.
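As a sketch of what "setting cache control headers" can look like when the origin owns the decision, here is a small function that picks a `Cache-Control` value per request path. The directives themselves (`no-store`, `max-age`, `s-maxage`, `immutable`) are standard HTTP; the path rules, the lifetimes, and the function name are purely hypothetical and would need to match your own site.

```python
import re

# Assumed convention: build tools put a hex content hash in static file names,
# e.g. /static/app.3f9c2a1b7d.js, so such files never change once published.
_FINGERPRINTED = re.compile(r"\.[0-9a-f]{8,}\.(?:js|css|woff2|png|jpg|svg)$")


def cache_control_for(path: str) -> str:
    """Return the Cache-Control header value the origin should send for `path`."""
    if path.startswith("/api/"):
        # Per-user or fast-changing data: no cache should store it.
        return "no-store"
    if _FINGERPRINTED.search(path):
        # Content-hashed assets: cache for a year everywhere.
        return "public, max-age=31536000, immutable"
    # Everything else (pages, unhashed assets): short browser lifetime,
    # slightly longer in shared caches such as a CDN edge.
    return "public, max-age=60, s-maxage=300"
```

Because the policy travels with the response, any cache in front of the origin, yours or a 3rd party's, can follow it without its own hand-built rule set.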
By setting up an in-house origin, you get to control the cache and therefore the experience the end-user has of your site. Moreover, the way the origin cache acts is the same way the edges will, and it also saves on configuration for those edges.