
2 posts tagged with "blogpost"


Lucas Weis Polesello · 4 min read

Back in 2022, @Luma had a major outage that caused hours of downtime, angry customers and a lot of engineering effort to bring the product back to normal. One of the discoveries made at that time was a hard truth: we had modeled our internal IAM object poorly.

inb4: PHI

In the healthcare tech industry - and outside of it - Protected Health Information (PHI) is of major importance. Applied to Luma's platform - very briefly - it means that a patient's data should belong only to the facilities the patient has interacted with or allowed to have it. This protects patients' health information and access to their data, and it's enforced by US law.

IAM Model

Our IAM model revolves around something called the session object - a pretty simple concept: it concentrates the user's token, settings, groups, facilities and other metadata about the user. We use this session throughout all of Luma's backend components to properly apply the PHI filtering rule. One of the bad decisions back then was to simply pull all facilities and groups into a single JSON object and cache it. But then you would probably ask...
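
For illustration, the cached session looked conceptually like the sketch below. The shape and field names are my own assumptions for this post, not Luma's actual schema - the important part is that full group and facility documents were embedded instead of just their ids.

```ts
// Illustrative only - field names are hypothetical, not Luma's real schema.
interface CachedSession {
  userId: string;
  token: string;
  settings: Record<string, unknown>;
  // The costly part: entire group/facility documents embedded in the JSON
  // instead of just their ids.
  groups: Array<{ _id: string; name: string; members: string[] }>;
  facilities: Array<{ _id: string; name: string; address: string }>;
}

// The whole object was then serialized and cached, conceptually:
// await redis.set(`session:${userId}`, JSON.stringify(session));
```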

What's the problem with it?

As Luma grew in scale, we started onboarding bigger and bigger customers, each with their own account setup, leading to N different use cases. Combine some very creative account setups with huge customers and we ended up with something unexpected: session objects storing up to 2.6MB of pure JSON text.

And yes, you read that right - the size of an entire PDF!!

Now imagine that for each job we actually pulled cached sessions up to 100 times - or even more. Luma produces an average of 2K jobs per second, spiking up to 10K.

That's A LOT of network usage - easily surpassing 2GB per second, aka more than 15Gbps. For reference, a cache.m6g.8xlarge - a fair-sized cache instance - has about this much bandwidth.
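
A rough back-of-envelope with the numbers above (rounded, purely illustrative) shows how quickly this adds up:

```ts
// Back-of-envelope - illustrative, rounded numbers only.
const sessionSizeMB = 2.6;       // worst-case cached session payload
const sessionReadsPerSec = 800;  // assumed: a fraction of 2K jobs/s, each pulling the session at least once
const gbPerSecond = (sessionSizeMB * sessionReadsPerSec) / 1024; // ≈ 2 GB/s
const gbps = gbPerSecond * 8;    // ≈ 16 Gbps - the entire bandwidth of a fair-sized cache node
```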

Infrastructure Impact

All of a sudden, Luma was facing this scenario:

  • Slow and unstable HTTP APIs
  • We had to oversize our infrastructure more than usual to handle the same load - a fairly low ~400 RPS overall
  • We had to split our broker instance into smaller, focused ones
  • We had to increase our cache size
  • We had to create something called pubsub-service (yes and yes, no bueno) to offload slow publishes from other services.
  • Alongside this, we also created a jobless feature which forced backend components to publish only the jobId and route the job content itself via Redis.

The business impact? Money being thrown away and unsatisfied customers.

Given the session's mission-critical importance, the risks were just too high - until March 1st, 2025. After good months of investigation, searching code and running AST analysis (codemod) across almost the entire codebase and libraries, it looked like we had a solution in mind.

A simple feature flag that would do Query.select(_id) when querying Groups - building the session with less data.
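
In Mongoose-style pseudocode, the gated query looked roughly like the sketch below. The flag name, featureFlags helper and Group model are made up for illustration - the real change is only the Query.select('_id') part, which keeps ids in the session instead of full Group documents.

```ts
import { Model } from 'mongoose';

// Hypothetical pieces for this sketch - not Luma's real code.
declare const Group: Model<{ members: string[] }>;
declare const featureFlags: { isEnabled(flag: string, tenantId: string): Promise<boolean> };

async function loadSessionGroups(tenantId: string, userId: string) {
  // Hypothetical per-tenant flag guarding the new behavior.
  const leanGroups = await featureFlags.isEnabled('lean-session-groups', tenantId);

  const query = Group.find({ members: userId });
  if (leanGroups) {
    query.select('_id'); // only the ids end up in the cached session
  }
  return query.lean().exec();
}
```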

Although the change was simple and backed by a lot of research, we were still cautious: we set up lots of product metrics and log metrics to understand whether rolled-out customers would see any negative impact - like messages not going out, notifications being missed or, even worse, an entire product breakdown.

Rollout and Implementation

  • We needed to ensure our libraries were at least at a certain version
  • We enabled the flag for each customer tenant id

Outcome

It took us nearly a month to roll out 90 backend services and a week to enable the flag for all customers. But the results speak for themselves.

  • 60% network usage reduction. It's been weeks since we last had any alerts about network bandwidth
  • Stable API latencies. We are now able to downsize our infrastructure back to normal levels - we estimate we can cut REST layer resources by 1/4
  • Almost zeroed the PubSub bandaid's network usage. We are now unblocked to remove bandaid solutions like pubsub-service and the sessionless code

REST Layer p75 to p95 stability and latencies drop after 03/11 - first customers.

rest-latencies

Comparing pre-rollout weeks (03/17) to post-rollout (03/17) - check the end of the graph.

monthly-bandwidth-usage

Post-rollout Monday (busiest day of the week): busy-monday-after-rollout

Pre-rollout Monday (busiest day of the week): busy-monday-pre

TakeAway Points

There's much more coming in the future - but we are happy to have finally cleared the road for bigger-impact optimizations.

Building a good product, finding market fit, and prioritizing customers and market requirements is the art of business, but I deeply believe there's a bond between the business and this not-so-celebrated kind of work.

At the end of the day, delivering a reliable, stable and ever-growing platform requires revisiting past decisions - and behind a healthy, stable platform are a great patient experience and efficient staff.

Lucas Weis Polesello · 3 min read

One of the most interesting career challenges I've faced was something trivial in the world of stateless services but challenging in the stateful one: enabling WebSocket instances to scale horizontally.

This challenge comes in many flavors and ours had some constraints:

  • It had to respect our internal framework - listening for model events
  • We had to apply IAM filtering
  • Had to use SocketIO
  • SocketIO plugins like the RabbitMQ adapter were not an option - the team judged them too costly.
  • Redis plugins were not a fit either.
  • We had to support multi-tab
  • No infrastructure involved

Basically, our WebSocket servers ran old versions of SocketIO and made very poor use of its benefits, to say the least. It could just as well have been a plain WebSocket server.

To scale it horizontally, we decided to use Redis PubSub - simply relaying client-server messages across instances via Redis PubSub.
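
Conceptually, the fan-out worked like the minimal sketch below (socket.io plus ioredis; the channel name, message shape and auth handling are simplified assumptions, not the real implementation):

```ts
import { Server } from 'socket.io';
import Redis from 'ioredis';

// Minimal sketch - channel name and message shape are made up.
const io = new Server(3000);
const pub = new Redis();
const sub = new Redis(); // a Redis connection in subscriber mode must be dedicated

const CHANNEL = 'ws:events';

// Every instance subscribes; whichever instance actually holds the target
// user's socket(s) emits the event.
sub.subscribe(CHANNEL);
sub.on('message', (_channel, raw) => {
  const { userId, event, payload } = JSON.parse(raw);
  io.to(`user:${userId}`).emit(event, payload);
});

io.on('connection', (socket) => {
  const userId = socket.handshake.auth.userId as string; // simplified auth
  socket.join(`user:${userId}`); // multi-tab support: every tab joins the same room
});

// Any instance can publish, regardless of where the socket lives.
export function notifyUser(userId: string, event: string, payload: unknown) {
  return pub.publish(CHANNEL, JSON.stringify({ userId, event, payload }));
}
```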

thumbnail-ws-operator-blogpost

By the end of the project I had learned something very important

And that is: choosing to scale WebSockets this way was a bad idea in itself. Problems intrinsic to distributed systems showed up, like:

  • Observability
  • Redis PubSub deliverability issues and network bandwidth
  • No re-balancing of connections - hot replicas vs cold replicas by number of connections.

But I had faced so many limitations that a thought crossed my mind:

What if I could just use the underlying environment - aka Kubernetes - for this kind of stuff? Some refined load balancing? Proper routing of messages via a proxy? (TBH, a simple RMQ would've done the job so far.)

Because of that, I never stopped dreaming about a better design for this problem.

As a consequence, two years later I stumbled upon this article, which described - beautifully - how they solved scalability for a stateful WebSocket app.

It motivated me into a crazy journey: if I solved this issue, and they solved the same issue with similar ideas, how many other people are out there solving the same challenge?

Introducing: websocket-operator

ws-operator-poc-diagram

In this blueprint - a not-yet-production-deployed OSS project - I've mixed two main ingredients:

  • The need to learn more advanced Go and the Kubernetes Operator pattern.
  • A problem I've already solved - but now with no limitations

The Operator consists of three main components - and yes they are very similar to those from the article.

  • A LoadBalancer
    • An end-user-exposed API that accepts connections and routes them to the proper proxy sidecars.
    • It applies a distributed load balancing algorithm - shared with the Proxy SideCar (a rough sketch of the idea follows this list)
    • It uses Kubernetes service discovery to find the available IPs to load-balance across.
  • Proxy SideCar
    • Intercepts WS messages and proxies them via HTTP to the Proxy SideCar that holds the connection for the intended recipient.
    • It shares the same algorithm as the LoadBalancer - so they can find each other
  • Controller
    • Injects the SideCar into Deployments/Pods with the ws.operator/enabled label
    • Re-balances connected users.
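
The glue between the LoadBalancer and the Proxy SideCars is a deterministic routing function: given the same user id and the same set of pod IPs, every component computes the same owner. The operator itself is written in Go and this is not its actual algorithm - the snippet below is just a language-agnostic sketch of that idea (rendezvous hashing), shown in TypeScript for brevity:

```ts
import { createHash } from 'node:crypto';

// Rendezvous (highest-random-weight) hashing: as long as every component sees
// the same list of pod IPs, they all agree on which pod "owns" a given user.
// Illustrative only - not the websocket-operator's real routing code.
function ownerFor(userId: string, podIPs: string[]): string | undefined {
  let best: string | undefined;
  let bestScore = -1n;
  for (const ip of podIPs) {
    const digest = createHash('sha256').update(`${userId}:${ip}`).digest();
    const score = digest.readBigUInt64BE(0); // first 8 bytes as an unsigned score
    if (score > bestScore) {
      bestScore = score;
      best = ip;
    }
  }
  return best;
}

// LoadBalancer: route a new connection for `userId` to ownerFor(userId, discoveredIPs).
// Proxy SideCar: forward a message for `recipient` to ownerFor(recipient, discoveredIPs).
```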

Roadmap

This is pretty much inspired by the article's signaling-server design and has some interesting features on the roadmap:

  • Pluggable Hashing Algorithm for Routing
    • Plug your own algorithm to load balance connections
  • Pluggable Routing
    • v0.0.1 will assume WS clients are exchanging JSON messages - but they could exchange raw binaries, or just plain text with their own protocol.
  • Support Broadcasting

TakeAway

  • There's intellectual value in reinventing the wheel
  • Do not scale stateful apps unless you really need to
  • And if you do need it, reconsider again. You might be safer just oversizing infrastructure
  • OK, you really need it. Study, investigate, research - and, well, feel free to benchmark this plug-and-play solution.