Lucas Weis Polesello · 3 min read

One of the most interesting career challenges I've faced was something trivial in the world of stateless services but challenging in the stateful one - enabling WebSocket instances to scale horizontally.

This challenge comes in many flavors and ours had some constraints:

  • It had to respect our internal framework - listening for model events
  • We had to apply IAM filtering
  • Had to use SocketIO
  • SocketIO plugins like the RabbitMQ adapter were ruled out - the team judged them too costly.
  • The Redis plugins were not a fit either.
  • We had to support multi-tab
  • No infrastructure involved

Basically, our WebSocket servers ran old versions of SocketIO and made very poor use of its benefits, to say the least - it could just as well have been a plain WebSocket server.

To scale it horizontally, we decided to use Redis PubSub - simply letting the server instances forward client messages to each other over a shared channel.
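To make the pattern concrete, here's a minimal sketch of that fan-out in Go. Our original servers were SocketIO, so this is only an illustration of the idea, not our actual code: every instance publishes outgoing messages to a shared channel and subscribes to the same channel, and the hypothetical deliverLocally callback writes to whichever local connections belong to the recipient. The go-redis client and the ws:messages channel name are assumptions.

```go
package wsscale

import (
	"context"

	"github.com/redis/go-redis/v9"
)

// publish is called by whichever instance received the client or model event.
func publish(ctx context.Context, rdb *redis.Client, payload string) error {
	return rdb.Publish(ctx, "ws:messages", payload).Err()
}

// fanIn runs on every WebSocket server instance: it subscribes to the shared
// channel and hands each message to deliverLocally, which writes to the
// recipient's socket only if this instance holds that connection.
func fanIn(ctx context.Context, rdb *redis.Client, deliverLocally func(payload string)) {
	sub := rdb.Subscribe(ctx, "ws:messages")
	defer sub.Close()

	for msg := range sub.Channel() {
		deliverLocally(msg.Payload)
	}
}
```

The obvious downside - and part of why this bit us later - is that every instance receives every message, so network bandwidth grows with the number of replicas.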


End of Project and I learned something very important

And that is: choosing to scale WebSocket horizontally was a bad idea in itself. Problems intrinsic to distributed systems showed up, such as:

  • Observability
  • Redis PubSub delivery issues and network bandwidth
  • No re-balancing of connections - hot replicas vs cold replicas by number of connections.

But with so many limitations, a thought crossed my mind:

What if I could just use the underlying environment - aka Kubernetes - for this kind of stuff? Some refined load balancing? Proper routing of messages via a proxy? (TBH a simple RMQ would've done the job so far.)

Considering that, I never stopped dreaming about a better design for this problem.

As a consequence, two years later I stumbled upon this article, which described - beautifully - how they solved scalability for a stateful WebSocket app.

It motivated me into a crazy journey: if I solved this issue, and they solved the same issue with similar ideas, how many other people are out there solving this exact challenge?

Introducing websocket-operator

(Diagram: websocket-operator proof-of-concept architecture)

In this blueprint - a not-yet-production-deployed OSS project - I've mixed two main ingredients:

  • The need to learn Go and Kubernetes Operators more deeply.
  • A problem I'd already solved - but now with no limitations.

The Operator consists of three main components - and yes, they are very similar to those from the article.

  • A LoadBalancer
    • An end-user-exposed API that accepts connections and routes them to the proper proxy-sidecars.
    • It applies a distributed load-balancing algorithm - shared with the Proxy SideCar (see the sketch after this list).
    • It uses Kubernetes service discovery to find the available IPs to load-balance across.
  • Proxy SideCar
    • Intercepts WS messages and proxies them via HTTP to the Proxy SideCar that may hold the connection for the recipient.
    • It shares the same algorithm with the LoadBalancer - so they can find each other.
  • Controller
    • Injects the SideCar into Deployments/Pods carrying the ws.operator/enabled label.
    • Re-balances connected users.
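To illustrate the shared algorithm mentioned above, here's a minimal sketch of how the LoadBalancer and the Proxy SideCars could agree on which pod owns a user's connection, using rendezvous (highest-random-weight) hashing over the pod IPs returned by service discovery. The function name and the choice of rendezvous hashing are my assumptions for illustration - the operator's actual algorithm may differ.

```go
package hashing

import (
	"crypto/sha256"
	"encoding/binary"
)

// Owner returns the pod IP that should hold userID's connection.
// Because the LoadBalancer and every Proxy SideCar run the same pure
// function over the same pod list, any of them can locate the recipient's
// pod without extra coordination.
func Owner(userID string, podIPs []string) string {
	var best string
	var bestScore uint64
	for _, ip := range podIPs {
		h := sha256.Sum256([]byte(userID + "|" + ip))
		score := binary.BigEndian.Uint64(h[:8])
		if best == "" || score > bestScore {
			best, bestScore = ip, score
		}
	}
	return best
}
```

A nice property of rendezvous hashing is that removing a pod only re-homes the users that were mapped to that pod, which keeps re-balancing cheap.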

Roadmap

This is pretty much inspired by the article's Signaling Server design and has some interesting features on the roadmap:

  • Pluggable Hashing Algorithm for Routing
    • Plug in your own algorithm to load-balance connections
  • Pluggable Routing
    • v0.0.1 will assume the WS connection exchanges JSON messages - but they could be raw binaries, or simple text with their own protocol (see the interface sketch after this list).
  • Support Broadcasting
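As a rough idea of what those pluggable pieces could look like, here is a sketch of two Go interfaces - one for the hashing/load-balancing strategy and one for extracting the recipient from a raw frame. The names Balancer, Decoder, and RecipientOf are hypothetical, not the project's actual API.

```go
package router

// Balancer abstracts the hashing algorithm used to map a recipient to a pod,
// so a custom strategy (consistent hashing, sticky sessions, ...) can be
// plugged in.
type Balancer interface {
	// Owner returns the pod address that should hold recipientID's connection.
	Owner(recipientID string, pods []string) string
}

// Decoder abstracts how the recipient is extracted from a WebSocket frame.
// v0.0.1 assumes JSON payloads, but raw binary or custom text protocols
// could satisfy the same contract.
type Decoder interface {
	// RecipientOf parses a raw message and returns the target user ID.
	RecipientOf(raw []byte) (string, error)
}
```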

Takeaway

  • There's intellectual value in reinventing the wheel
  • Do not scale stateful apps unless it's really needed
  • And if you do need it, reconsider again. You might be safer just oversizing the infrastructure
  • OK, you really need it. Study, investigate, research - and, well, feel free to benchmark this plug-and-play solution.