Wednesday 21 February 2024

CDN is for everyone

For my last gig, I had the opportunity to take part in the largest women's sport broadcast in Australia, the Women's World Cup 2023. It was amazing!


My job was a bit daunting: just make sure there was no repeat of 2018 (major outage, Prime Minister had to apologize, etc.). You know, no pressure.


So let's roll back a bit in the story, to before I started. At that time, the company was young and had set up a CDN (Content Delivery Network) in-house, as they owned the distribution network. Everything was working fine, serving out the normal traffic to customers. Then along comes the Men's World Cup 2018, and it's one of those events where a few things happen and then snowball into a nightmare.


Customers were no longer able to watch matches. It goes on for a few days, and then the Prime Minister gets on telly and apologizes. In the end, it gets simulcast on free-to-air television.


So I guess, no, lots and lots of pressure to not have it go south this time.


My job started two years out from the event. By this stage, the existing CDN was still somewhat capable but, as happens, out of date in both tech and software. The first major hurdle was that the tech side (servers) was on very long lead times due to Covid and a few global logistics issues. Great: big project, and we might have issues with hardware.


In order to keep it all on track, the decision was made to rebuild the existing older hardware with the new stack. Step one was to take one region's cache offline and have all traffic served from the remaining regions. This worked well thanks to the work done by my co-workers. Now I had some servers to work with, so I went on to generate a custom kickstart installer that would be used for both the old and new servers. It was fairly easy work to get a minimal setup with just enough of a system to stand up a firewall and set up some basic access. Next, I needed some configuration management. Ansible was selected as some of the team had prior knowledge of it, and I went through the steps of creating all the basic modules you need to set up the firewalls and users as per company policies.


Now onto the really interesting bits that I had never done before: building a CDN, and not just building it but containerizing it.


A TLS terminator was needed, and for that we selected H2O from Fastly, as HTTP/3 (QUIC) was just coming out and it sounded like something we could use. And didn't it come back to haunt us later (keep reading for how).


Then the cache software: Varnish Cache. While building and deploying Varnish, I was interested to work out how you can give large amounts of RAM to a container and still be caught out by the most unexpected things. What caught me out was the "Transient" cache. Its job is to store what should be short-term objects, like 404 response pages. What you expect from the name is that it goes into the cache for a short amount of time and then gets purged. Here's where it got me: unless you give a command-line option to limit its size, it will grow until it exhausts memory. Well, not quite. I suspect on an OS install it goes until some percentage of memory is used, but in container land it just goes until it won't work anymore and Varnish crashes. At least Podman restarts the container at this point, so the outage is only a blip, but that also wipes the cache.
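

For anyone hitting the same thing: the fix is to define the Transient storage backend explicitly when starting varnishd, which caps it instead of leaving it unbounded. Something along these lines, with the sizes made up for illustration rather than taken from our actual config:

varnishd -F -a :6081 -f /etc/varnish/default.vcl -s malloc,32g -s Transient=malloc,512m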


By this stage, we are down to under a year to kick-off. The old infrastructure is up to date and has been running the code for a few months. The new hardware is starting to come in, and the changes required to suit it are being integrated into the mix. All is well. We even start to deploy to some sites, and then it hits... One day we notice that our edge caches are bouncing in and out. We start looking into the logs and find it's something to do with our TLS terminator, then see some weird request headers just prior to the issue occurring. What we discovered is that a malicious header sent by some script kiddie could take it all down; well, not all of it, just the TLS terminator, in such a way that the process stays up but all requests black-hole. Luckily, the DNS health checks we run see this and drop the node from the cluster. That just moves the traffic to another node, so while not good, it is not as bad as it could be (what do you know, it's not DNS).


Now, the nice part of this was that it was found on Monday, fixed on Tuesday, and passed and released on Wednesday.


So now you're asking: why did I call it "CDN is for everyone"?


Throughout this journey, I got to do a lot of things and see how CDNs actually work under the covers, including those of most of the big players in the game. I saw a lot of good things and some not so good. I got to see how much what is being sent from upstream sources (origins), and equally from their sources, matters.


My biggest single takeaway was to have your own origin and control how and where things are cached. If you do not or cannot cache your website, API, or content in-house, then do not expect it to be cached correctly in a 3rd party CDN.


Lately, I have seen a bit of a trend where companies are serving their products using a framework, but no one knows how the caching is done. This means it's either not caching at all or caching badly (normally the former). Then, if they serve via a 3rd party, once again they either manually set up caching rules there that are not easily replicated to another 3rd party, or they do nothing to set cache-control headers. This just leaves you with the cost of serving everything to everyone with next to no cache benefit.


By setting up an in-house origin, you get to control the cache and therefore the experience the end-user has of your site. Moreover, the way the origin cache acts is the same way the edges will, and it also saves on configuration for those edges.
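

To make that concrete, here is a minimal sketch of what I mean by an origin that controls its own caching. This is illustrative Go, not any real production code, and the paths and max-age values are made up:

package main

import (
	"log"
	"net/http"
)

// A tiny origin that tells every cache in front of it, in-house edge or
// 3rd party CDN, exactly how to behave by setting explicit cache headers.
func main() {
	http.HandleFunc("/api/fixtures", func(w http.ResponseWriter, r *http.Request) {
		// Shared caches may hold this for 30s, browsers for 10s.
		w.Header().Set("Cache-Control", "public, max-age=10, s-maxage=30")
		w.Header().Set("Content-Type", "application/json")
		w.Write([]byte(`{"fixtures": []}`))
	})

	http.HandleFunc("/account", func(w http.ResponseWriter, r *http.Request) {
		// Per-user content must never be cached downstream.
		w.Header().Set("Cache-Control", "private, no-store")
		w.Write([]byte("account page"))
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}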

Friday 16 February 2024

Container Size Matters

I have an issue

I have 2 containers and I need them to talk to each other. Sounds easy, right?
But these containers only have ports for "listening" on, and so cannot be set up directly to talk to one another. The normal go-to here is to use NetCat. It's good, and you can pipe data between the two apps.

The problem comes when you want to put it into a container. NetCat requires a full OS to work, so you start with a base container of some description: maybe Alpine, maybe Ubuntu, or my favourite, Rocky Linux. But this gives you a 100MB image at best, or worse, 500MB.

It's just not going to cut it.

The solution is easier than you think: native Golang.

But why Golang?

Golang has the ability to compile to native executable code.

Take a look at this Dockerfile:

FROM golang:1.20.0 as exporter
ENV GO111MODULE=on
WORKDIR /app
COPY . .
RUN go get ./
RUN CGO_ENABLED=0 GOOS=linux go build -o bin/dump1090-netcat ./

FROM scratch
COPY --from=exporter /app/bin/dump1090-netcat /usr/local/bin/
CMD ["/usr/local/bin/dump1090-netcat"]
First we start off with a full "fat" container for doing the build in.
When the executable is built, you need to make sure it is built with the "CGO_ENABLED=0" flag so that it will not try to dynamically link to any libs in the container. Without this, the final stage won't work.

The last part is "FROM scratch". This tells Docker that there is no file structure, nor a base image, from here on out.
Then we copy in the binary from the exporter stage and set it as the command.

The resulting image is only as big as the executable but has all the functionality.
In this case, I pass several environment variables to the container indicating the source IP and port and the destination IP and port; it connects to the source and sends to the destination.
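
For the curious, the guts of it boil down to something like this. A minimal sketch rather than the actual dump1090-netcat source, and the environment variable names are made up for illustration:

package main

import (
	"io"
	"log"
	"net"
	"os"
)

func main() {
	src := os.Getenv("SOURCE_ADDR") // e.g. "dump1090:30003"
	dst := os.Getenv("DEST_ADDR")   // e.g. "collector:30003"

	srcConn, err := net.Dial("tcp", src)
	if err != nil {
		log.Fatalf("connect to source %s: %v", src, err)
	}
	defer srcConn.Close()

	dstConn, err := net.Dial("tcp", dst)
	if err != nil {
		log.Fatalf("connect to destination %s: %v", dst, err)
	}
	defer dstConn.Close()

	// Both containers only listen, so we dial out to each of them and
	// shuffle bytes from one to the other, just like netcat with a pipe.
	if _, err := io.Copy(dstConn, srcConn); err != nil {
		log.Fatalf("copy: %v", err)
	}
}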

No pipes required 
(code) (container)

ADS-B and Tracking Flights

After 2 and a bit years working for Optus Sport constructing a CDN (Content Delivery Network), my contract finished up. With the Australian job market slowing down for Christmas, I took the opportunity to learn some new skills, and what better way than a project I always wanted to do 😊

For many years I had been watching aircraft using a piece of software called Dump1090 to receive ADS-B transponder data. This was OK to start with, as it had a very simplistic web interface that overlays the location and direction of the aircraft.

But I always found it lacking: no track data. I wanted more, and so it was time to go on a bit of a software adventure.

Now, I am no software dev (yes, I still say that even though I write more code than most; just ask some of my old managers).

The Radio

First was to set up an old laptop with the latest version of dump1090, and as I don't like installing lots of things, I went with setting up a Docker container.

# Build stage: switch the repos to plain http, then build dump1090 from source with the rtl-sdr headers
FROM rockylinux:9 as builder
RUN sed 's/https/http/g' -i /etc/yum.repos.d/* \
  && dnf install epel-release -y \
  && sed 's/https/http/g' -i /etc/yum.repos.d/* \
  && dnf install rtl-sdr-devel git make gcc -y \
  && git clone https://github.com/antirez/dump1090.git \
  && cd dump1090  \
  && make

# Runtime stage: only the rtl-sdr userspace package is needed to run the binary
FROM rockylinux:9
COPY --from=builder /dump1090 /dump1090
RUN sed 's/https/http/g' -i /etc/yum.repos.d/* \
  && dnf install epel-release -y \
  && sed 's/https/http/g' -i /etc/yum.repos.d/* \
  && dnf install rtl-sdr -y
EXPOSE 8081 30003
WORKDIR /dump1090

You can see I like using multi-stage container builds so that I can get consistent builds and make them a lot smaller.

This image was then pushed up to Docker Hub as dump1090.
When this is run with /dev/bus/usb mounted into it, it will have the ability to receive ADS-B signals.

Database connector

Next on the list was how to store the data...
I settled on a Postgres database backend and a single table, as Dump1090 outputs a format called "Base Station mode". In this mode, all the data is formatted as a comma-separated string with 22 fields. Nice and easy for a little Golang container ( code ) ( container )
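
The connector itself is only a handful of lines. Here is a rough sketch of the idea, not the real code behind the links above, with the table and column names made up for illustration:

package main

import (
	"bufio"
	"database/sql"
	"log"
	"net"
	"strings"

	_ "github.com/lib/pq" // Postgres driver
)

func main() {
	db, err := sql.Open("postgres", "postgres://adsb:adsb@db/adsb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// dump1090 streams Base Station messages on port 30003.
	conn, err := net.Dial("tcp", "dump1090:30003")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	scanner := bufio.NewScanner(conn)
	for scanner.Scan() {
		line := scanner.Text()
		fields := strings.Split(line, ",")
		if len(fields) != 22 {
			continue // not a complete 22-field row, skip it
		}
		// fields[1] is the message type and fields[4] the hex_ident in the
		// usual Base Station layout; the rest is kept raw for the views.
		if _, err := db.Exec(
			`INSERT INTO basestation (message_type, hex_ident, raw) VALUES ($1, $2, $3)`,
			fields[1], fields[4], line,
		); err != nil {
			log.Printf("insert failed: %v", err)
		}
	}
}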

APIs

This is good but has a major downside: the second field is the message type, and not all fields have values on every row. This makes it so that the only way to find a flight callsign for a location is to cross-reference the hex_ident (ICAO id) from messages where the message type is 3, including the timestamp, then match that to the message type 1 rows from the same table... this all gets messy after a while.

After working all this out, I decided to move all the data joins, and as much as possible everything else I could, into database views. I mean, the DB is good at data manipulation, and it can cache better, sort better, etc. So why not just use it?
This has led to a bit of an explosion in the number of views as I went through and tested different things, but it's fast, with over 11GB of data in 5.3 million rows.

Now to the real APIs. There are only 3, and that could be shrunk further (a rough sketch of one follows the list).
  1. api/adsb/position returns a list of aircraft seen in the last 10 seconds, with latitude, longitude, and altitude, and is used for both the marker icons on the map and the table data for flights

  2. api/adsb/points/{flight}/{hex_ident} returns all the locations the aircraft was seen at over the last 2 hours, used for displaying the track when an aircraft is selected

  3. api/adsb/track/{hex_ident} returns the current track of the aircraft, used to set the rotation of the aircraft marker icon
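
As an example of how thin these end up being once the views do the work, here is a rough sketch of the position endpoint. The view name "recent_positions" and its columns are made up, and this is not the actual code behind the site:

package main

import (
	"database/sql"
	"encoding/json"
	"log"
	"net/http"

	_ "github.com/lib/pq" // Postgres driver
)

type position struct {
	Flight   string  `json:"flight"`
	HexIdent string  `json:"hex_ident"`
	Lat      float64 `json:"latitude"`
	Lon      float64 `json:"longitude"`
	Altitude int     `json:"altitude"`
}

func main() {
	db, err := sql.Open("postgres", "postgres://adsb:adsb@db/adsb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}

	// Aircraft seen in the last 10 seconds; the time filter and joins all
	// live in the database view, so the handler just reads and encodes it.
	http.HandleFunc("/api/adsb/position", func(w http.ResponseWriter, r *http.Request) {
		rows, err := db.Query(`SELECT flight, hex_ident, latitude, longitude, altitude FROM recent_positions`)
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		defer rows.Close()

		out := []position{}
		for rows.Next() {
			var p position
			if err := rows.Scan(&p.Flight, &p.HexIdent, &p.Lat, &p.Lon, &p.Altitude); err != nil {
				http.Error(w, err.Error(), http.StatusInternalServerError)
				return
			}
			out = append(out, p)
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(out)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}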

Web UI

This was the most interesting part to me, as I am not a visual person... I just don't understand how to make things look nice, but I can make them functional (I hope).
In the past I have used VueJS and it just kinda clicked for me, and that was when I was working with a ReactJS dev team; I still don't understand React.
Since my last adventure in UI, VueJS has moved on to Vue 3, but this gave me no end of issues with the Leaflet component and making it reactive to change. So I went back to old faithful v2 and it all started to work.
Getting aircraft markers to show up was easy thanks to the API call that just gave them to me. Then the table was also easy. Making it auto-resize based on the number of rows was not too difficult, until I found that window resizing broke it, as I had made a static size for the rows. Easy fix with an iteration over
const row = this.$el.querySelectorAll("tr")
This gave me a way to find the height of each row and make the table dynamic.

Next was to make a way to select an aircraft and show its track, with bonus points for highlighting its row in the table.

The first part was easy: just call the API based on the aircraft call-sign and hex_ident. Then I realised that I needed to remove the layer when the flight disappeared.
Turns out the easiest way is to have the API return null if there are no points newer than 10 seconds. This makes for interesting watching when the flight is in a bad reception area and it keeps "flashing", but I can live with that.

Next, the row highlight. Initially I was finding the array index from the list of flights, but this was only looked up when the track was toggled. As flights moved in the table, rows would be highlighted even though they were not the one showing its track.
So, next thought: let's use the call-sign. Sounds good, right? The problem is that some flights transmit different call-signs, as they are legs of differing flights, or a flight gets a connection to another flight, etc. This led to the flight you selected "disappearing" from the table, as its call-sign went unseen for a while.
To fix this, I changed over to hex_idents, as they are stable.


Here is the result https://adsb.globelock.org/



(code) (container)

Where to from here

Just before I started this adventure, I was working on GPS via LoRa (not LoRaWAN), and I was finding that the packet size was just killing the radio.
My plan is to create an ADS-B transponder that will send raw packets via LoRa to a base station. This should help, as ADS-B is already a very compact communication protocol.