One of the most interesting career challenges I've faced was something trivial in the world of stateless services but challenging in stateful ones - enabling WebSocket instances to scale horizontally.
This challenge comes in many flavors, and ours had some constraints:
- It had to respect our internal framework - listening for model events
- We had to apply IAM filtering
- It had to use SocketIO
- SocketIO plugins like RabbitMQ were not an option - the team judged them too costly
- The Redis plugins were not a fit either
- We had to support multi-tab
- No new infrastructure could be involved
Basically, our WebSocket servers ran old versions of SocketIO and made very poor use of its benefits, to say the least. It could just as well have been a plain WebSocket server.
To scale it horizontally, we decided to use Redis PubSub - simply letting the client-facing server instances communicate with each other through Redis PubSub.
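The mechanics look roughly like the sketch below. This is not our original code (which wasn't even written in Go, and the channel name and helper are made up); it just illustrates the Redis PubSub fan-out: every instance publishes events to a shared channel, every instance subscribes to it, and each one delivers the messages it receives to its own local sockets.

```go
// Minimal sketch of the fan-out idea using go-redis (assumed channel name
// and helper, not the original production code).
package main

import (
	"context"
	"log"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// Every instance subscribes to the shared channel...
	sub := rdb.Subscribe(ctx, "ws-events")
	go func() {
		for msg := range sub.Channel() {
			// ...and forwards each payload to its own local sockets.
			deliverToLocalConnections(msg.Payload)
		}
	}()

	// When a model event happens on this instance, publish it so that
	// every other instance (and this one) can deliver it.
	if err := rdb.Publish(ctx, "ws-events", `{"recipient":"user-42","event":"updated"}`).Err(); err != nil {
		log.Fatal(err)
	}
	select {} // keep the process alive for the subscriber goroutine
}

// deliverToLocalConnections is a placeholder for pushing the payload down
// the WebSocket connections held by this instance.
func deliverToLocalConnections(payload string) {
	log.Println("delivering:", payload)
}
```

The appeal is that it needs nothing beyond Redis, which matched our constraints - but note that with a single shared channel every instance sees every message.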
End of project, and I learned something very important: choosing to scale WebSockets horizontally was a bad idea in itself. Problems intrinsic to distributed systems showed up, such as:
- Observability
- Redis PubSub delivery issues and network bandwidth
- No re-balancing of connections - hot replicas vs. cold replicas by number of connections
But I had so many limitations that a thought crossed my mind:
What if I could just use the underlying environment - a.k.a. Kubernetes - for this kind of thing? Some refined load balancing? Proper routing of messages via a proxy? (TBH, a simple RabbitMQ would have done the job so far.)
I never stopped dreaming about a better design for this problem.
Two years later I stumbled upon this article, which described - beautifully - how they solved scalability for a stateful WebSocket app.
It pushed me into a crazy journey: if I had solved this issue, and they had solved the same issue with similar ideas, how many other people are out there solving this very same challenge?
Introducing websocket-operator
In this blueprint - a not-yet-production-deployed OSS project - I've mixed two main ingredients:
- The need to learn Go more deeply and to write a Kubernetes Operator.
- A problem I had already solved - but now with no limitations.
The Operator consists of three main components - and yes, they are very similar to those from the article.
- LoadBalancer
  - An end-user-exposed API that accepts connections and routes them to the proper proxy sidecars.
  - It applies a distributed load-balancing algorithm shared with the Proxy SideCar (see the sketch after this list).
  - It uses Kubernetes service discovery to check for available IPs to load-balance to.
- Proxy SideCar
  - Intercepts WS messages and proxies them via HTTP to another Proxy SideCar that may hold the connection for that recipient.
  - It shares the same algorithm as the LoadBalancer, so they can find each other.
- Controller
  - Injects the SideCar into Deployments/Pods carrying the ws.operator/enabled label.
  - Re-balances connected users.
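The trick both the LoadBalancer and the Proxy SideCar rely on is running the same deterministic algorithm over the same endpoint list. The exact algorithm isn't pinned down here, so the sketch below assumes a rendezvous-hashing style function (the /ws-operator/forward path and endpoint values are made up) just to show how a sidecar could pick the owning replica for a recipient and forward a message over HTTP:

```go
// Sketch of a shared, coordination-free routing function (assumed
// rendezvous hashing, not necessarily the project's actual algorithm).
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"net/http"
)

// score derives a deterministic weight for a (recipient, endpoint) pair.
func score(recipient, endpoint string) uint64 {
	sum := sha256.Sum256([]byte(recipient + "|" + endpoint))
	return binary.BigEndian.Uint64(sum[:8])
}

// owner picks the endpoint with the highest score. Any replica running the
// same function over the same endpoint list reaches the same answer.
func owner(recipient string, endpoints []string) string {
	var best string
	var bestScore uint64
	for _, e := range endpoints {
		if s := score(recipient, e); s >= bestScore {
			best, bestScore = e, s
		}
	}
	return best
}

// forward sends an intercepted message to the sidecar that owns the
// recipient. The /ws-operator/forward path is purely illustrative.
func forward(recipient string, payload []byte, endpoints []string) error {
	target := owner(recipient, endpoints)
	resp, err := http.Post("http://"+target+"/ws-operator/forward", "application/json", bytes.NewReader(payload))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	return nil
}

func main() {
	// In the operator, this list would come from Kubernetes service discovery.
	endpoints := []string{"10.0.0.11:8080", "10.0.0.12:8080", "10.0.0.13:8080"}
	fmt.Println(owner("user-42", endpoints))
}
```

Because the owner is a pure function of the recipient ID and the live endpoint list, no shared registry of connections is needed - any replica can compute where a recipient lives.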
Roadmap
This is pretty much inspired by the article's Signaling Servers design and has some interesting features on the roadmap:
- Pluggable Hashing Algorithm for Routing
  - Plug in your own algorithm to load-balance connections (see the interface sketch after this list).
- Pluggable Routing
  - v0.0.1 will assume WS messages are JSON - but they could be raw binaries, or just plain text with its own protocol.
- Support Broadcasting
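To make the "pluggable" items a bit more concrete, the extension points could end up looking like small Go interfaces along these lines (hypothetical names, not the project's actual API):

```go
package wsoperator

// Hasher is a hypothetical extension point for the routing hash:
// plug in your own strategy for mapping a recipient to a replica.
type Hasher interface {
	// Owner returns which endpoint should hold the recipient's connection.
	Owner(recipientID string, endpoints []string) (string, error)
}

// Codec is a hypothetical extension point for pluggable routing:
// v0.0.1 would assume JSON, but a custom Codec could handle raw
// binary frames or a bespoke text protocol.
type Codec interface {
	// Recipient extracts the routing key from a raw WebSocket frame.
	Recipient(frame []byte) (string, error)
}
```

A custom Codec would cover the raw-binary and plain-text cases mentioned above, while a custom Hasher swaps out the load-balancing strategy.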
Takeaway
- There's intellectual value in reinventing the wheel
- Do not scale stateful apps unless it's really needed
- And if you do need to, reconsider it again. You might be safer just oversizing the infrastructure
- OK, so you really need it. Study, investigate, research and, well, feel free to benchmark this plug-and-play solution.