2 posts tagged with "luma"

The Tale of How We Compensated Years of Tech Debt in a Month: Performance, Cost and Optimizations

March 31, 2025 · 4 min read

SRE | Senior Software Engineer @ LumaHealth

Back in 2022, Luma had a major outage that caused hours of downtime, angry customers and a lot of engineering effort to bring the product back to normal. One of the discoveries made at that time was a hard truth: we had modeled our internal IAM object poorly.

inb4: PHI

In the healthcare tech industry - and outside of it - Protected Health Information (PHI) is of major importance. Applied to Luma's platform, and being very brief, the rule states that a patient's data should be accessible only to the facilities the patient has interacted with or has allowed to access it. This protects patients' health information and access to their data. It is enforced by US law.

IAM Model

Our IAM model is something called the session object. It is a pretty simple concept: it concentrates the user's token, settings, groups, facilities and more metadata about the user. We use this session throughout all backend components of Luma to properly apply this PHI filtering rule. One of the bad decisions back then was to simply pull all facilities and groups into a JSON object and cache it. But then you would probably ask...

What's the problem with it?

As Luma grew in scale, we started onboarding bigger and bigger customers, each with their own account setup, leading to N different use cases. Add up some very creative account setups and huge customers and we ended up creating something unexpected: session objects storing up to 2.6MB of pure JSON text.

And yes, you did read it right - an entire PDF!!

Now imagine that for each job we actually pulled cached sessions up to 100x - or even more. Luma produces an average of 2K jobs per second, spiking up to 10K.

That's A LOT of network usage - easily surpassing 2GB per second, aka more than 15Gbps. For reference, a cache.m6g.8xlarge, which is a fair-sized cache instance, has this much bandwidth.
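As a back-of-the-envelope check of the numbers above, the sketch below converts the quoted 2GB per second into Gbps and derives the average payload per session fetch that would produce it. The constant names are made up for illustration, decimal units are assumed (1GB = 10^9 bytes), and the 2K jobs per second and 100 fetches per job figures are taken at face value.

```typescript
// Figures quoted in the post (decimal units assumed: 1 GB = 1e9 bytes).
const JOBS_PER_SECOND = 2000;          // average job rate
const SESSION_FETCHES_PER_JOB = 100;   // "up to 100x" cache pulls per job
const OBSERVED_BYTES_PER_SECOND = 2e9; // ~2GB per second of cache traffic

// Bytes per second -> gigabits per second.
function toGbps(bytesPerSecond: number): number {
  return (bytesPerSecond * 8) / 1e9;
}

const fetchesPerSecond = JOBS_PER_SECOND * SESSION_FETCHES_PER_JOB;
const impliedAvgSessionBytes = OBSERVED_BYTES_PER_SECOND / fetchesPerSecond;

console.log(`${toGbps(OBSERVED_BYTES_PER_SECOND)} Gbps`);               // 16 Gbps
console.log(`${fetchesPerSecond} session fetches per second`);          // 200000
console.log(`${impliedAvgSessionBytes / 1e3} KB per fetch on average`); // 10 KB
```

Under these assumptions, an average session of only about 10KB is already enough to reach 2GB per second, so sessions in the MB range make things worse very quickly.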

Infrastructure Impact

All of a sudden Luma had this scenario:

- Slow and unstable HTTP APIs
- We had to oversize more than usual to handle the same load - a very low overall ~400 RPS
- We had to split our broker instance into smaller, more focused ones
- We had to increase our cache size
- We had to create a thing called pubsub-service (yes and yes, no bueno) to offload slow publishes from services
- With this, we also created a jobless feature, which forced backend components to publish only the jobId and route the job content itself via Redis
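The jobless idea in the last item resembles what is often called the claim-check pattern: keep the heavy payload in a side store and send only a reference through the broker. The sketch below is an illustration, not Luma's code. `payloadStore` is an in-memory `Map` standing in for Redis, `brokerQueue` is an array standing in for RabbitMQ, and `publishJobless` and `consumeJobless` are hypothetical names.

```typescript
type Job = { jobId: string; payload: unknown };

// In-memory stand-ins (assumptions): a Map for Redis, an array for the broker.
const payloadStore = new Map<string, string>();
const brokerQueue: string[] = [];

// Store the full job body on the side and publish only its id.
function publishJobless(job: Job): void {
  payloadStore.set(job.jobId, JSON.stringify(job.payload));
  brokerQueue.push(job.jobId); // the broker message is just the id
}

// Consume one id from the broker and resolve the body from the side store.
function consumeJobless(): Job | undefined {
  const jobId = brokerQueue.shift();
  if (jobId === undefined) return undefined; // nothing queued
  const raw = payloadStore.get(jobId);
  if (raw === undefined) return undefined; // body expired or never stored
  payloadStore.delete(jobId);
  return { jobId, payload: JSON.parse(raw) };
}

publishJobless({ jobId: "job-1", payload: { session: "x".repeat(1000) } });
const messageBytes = brokerQueue[0].length; // what actually crosses the broker
const received = consumeJobless();
console.log(messageBytes, received?.jobId);
```

The broker stays light, but every job now costs an extra round trip to the store and there are two systems to keep consistent, which fits the post's description of this as a bandaid.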

The Business Impact? Money being thrown away and unsatisfied customers

Given the mission-critical importance of the session, the risks were just too high until the 1st of March 2025. After several months of investigation, code searching and attempts to run AST analysis (codemod) across almost the entire codebase and its libraries, it looked like we had a solution in mind.

A simple feature flag - that would do `Query.select(_id)` when querying `Groups` - building the session with less data.
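To make the effect concrete, here is a minimal sketch of such a flag. It does not use Mongoose or Luma's real schema: `Group`, `Session`, `buildSession` and the `slimGroups` flag are hypothetical, the data is fake, and the projection is simulated with a plain `map` over in-memory objects.

```typescript
// Hypothetical shapes; Luma's real schema is not shown in the post.
type Group = { _id: string; name: string; facilities: string[]; settings: Record<string, string> };
type GroupRef = Pick<Group, "_id">;
type Session = { token: string; groups: Array<Group | GroupRef> };

// With the flag on, keep only `_id` for each group (what selecting just
// `_id` would return); with it off, embed the full documents as before.
function buildSession(token: string, groups: Group[], slimGroups: boolean): Session {
  return {
    token,
    groups: slimGroups ? groups.map((g) => ({ _id: g._id })) : groups,
  };
}

// Fake data: 1,000 groups with 50 facilities each.
const groups: Group[] = Array.from({ length: 1000 }, (_, i) => ({
  _id: `group-${i}`,
  name: `Group number ${i}`,
  facilities: Array.from({ length: 50 }, (_, f) => `facility-${i}-${f}`),
  settings: { timezone: "America/Los_Angeles", locale: "en-US" },
}));

// Compare the serialized size of the cached object with and without the flag.
const fullBytes = JSON.stringify(buildSession("t", groups, false)).length;
const slimBytes = JSON.stringify(buildSession("t", groups, true)).length;
console.log({ fullBytes, slimBytes });
```

The trade-off is that any code that used to read other group fields straight from the session has to load them some other way.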

Although the change was simple and backed by a lot of research, we were still cautious: we set up lots of product metrics and log metrics to understand whether rolled-out customers would see negative impact - like messaging not going out, notifications being missed or, even worse, an entire product breakdown.

Rollout and Implementation

- Ensure our libraries were at least a certain version
- Enable the flag for each customer tenant ID

Outcome

It took us nearly a month to roll out 90 backend services and a week to enable the flag for all customers. But the results are very significant.

- 60% network usage reduction. It has been weeks since we had any alert about network bandwidth.
- Stable API latencies. We are now able to downsize our infrastructure back to normal levels - we estimate we can downsize REST layer resources by 1/4.
- Almost zeroed PubSub bandaid network usage. We are now unblocked to remove bandaid solutions like pubsub-service and the sessionless code.

REST layer p75 to p95 latencies became stable and dropped after 03/11, when the first customers were enabled.

[Image: rest-latencies]

Comparing pre-rollout weeks (03/17) to post-rollout (03/17) - check the end of the graph.

[Image: monthly-bandwidth-usage]

Post-rollout Monday (busiest day of the week)

[Image: busy-monday-after-rollout]

Pre-rollout Monday (busiest day of the week)

[Image: busy-monday-pre]

TakeAway Points

There's much more coming in the future - but we are happy to finally unblock the road for bigger impact optimizations.

Building a good product, finding market fit and prioritizing customers and market requirements is the art of business, but I deeply believe there is a bond between business and this not-so-celebrated kind of stuff.

At the end of the day, delivering a reliable, stable and ever-growing platform requires revisiting past decisions. Behind a great patient experience and efficient staff is a healthy and stable platform.

Redis is more than a Cache - Delaying Jobs

March 15, 2024 · 4 min read

Lucas Weis Polesello

SRE | Senior Software Engineer @ LumaHealth

My current company - Luma Health Inc - has an Event-Driven Architecture where all of our backend systems interact via async messaging/jobs. Thus our backbone is sustained by an AMQP broker - RabbitMQ - which routes the jobs to interested services.

Since our jobs are very critical, we cannot tolerate failures AND must design the system to be resilient because, well... we don't want a patient not being notified of their appointment, appointments not being created when they should be, or patients showing up at facilities that were never notified the patient had something scheduled.

Besides infra and product reliability, some use cases may need postponing - maybe we are reaching out to an external system that is offline or not responding. Maybe some error needs a retry - who knows?

The fact is, delaying/retrying is a very frequent requirement in Event-Driven Architectures. For this, a service responsible for doing it was created - and it worked fine.

But as the company sold bigger contracts and grew in scale, this system became overstressed and unreliable.

The Unreliable Design

Before giving the symptoms, let's talk about the organism itself - the service's old design.

The design was really straightforward: if our service handlers asked for a postpone OR we failed to send the message to RabbitMQ, we just inserted the job's JSON object into a Redis Sorted Set, using as the score the timestamp at which it was meant to be retried/published again.

To publish the postponed messages back into RabbitMQ, a job would be triggered every 5 seconds, doing the following:

1. Read from a set key containing all the existing sorted set keys - basically the queue names
2. Fetch jobs using zrangebyscore from 0 to the current timestamp BUT limited to 5K jobs
3. Publish each job to RabbitMQ and remove it from the sorted set
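The polling steps above can be sketched roughly as follows. This is an illustration rather than the original service: `FakeSortedSet` is an in-memory stand-in for a single Redis Sorted Set, `pollOnce` and `publish` are hypothetical names, and the first step (looking up the queue keys) is left out.

```typescript
// Minimal in-memory stand-in for one Redis Sorted Set (an assumption for
// this sketch; the real service talks to Redis).
class FakeSortedSet {
  private items: { score: number; member: string }[] = [];

  zadd(score: number, member: string): void {
    this.items.push({ score, member });
    this.items.sort((a, b) => a.score - b.score);
  }

  // Like ZRANGEBYSCORE min max LIMIT 0 count: reads without removing.
  zrangebyscore(min: number, max: number, count: number): string[] {
    return this.items
      .filter((i) => i.score >= min && i.score <= max)
      .slice(0, count)
      .map((i) => i.member);
  }

  zrem(member: string): void {
    this.items = this.items.filter((i) => i.member !== member);
  }

  zcard(): number {
    return this.items.length;
  }
}

const BATCH_LIMIT = 5000; // the 5K cap from the post

// One polling tick of the old design: fetch a batch of due jobs into
// memory, then publish and remove them one by one.
function pollOnce(set: FakeSortedSet, now: number, publish: (job: string) => void): number {
  const due = set.zrangebyscore(0, now, BATCH_LIMIT); // whole batch held in memory
  for (const job of due) {
    publish(job);
    set.zrem(job);
  }
  return due.length;
}

const delayed = new FakeSortedSet();
delayed.zadd(1000, JSON.stringify({ id: "a" }));
delayed.zadd(2000, JSON.stringify({ id: "b" }));
delayed.zadd(9000, JSON.stringify({ id: "c" })); // not due yet

const published: string[] = [];
const publishedCount = pollOnce(delayed, 5000, (job) => published.push(job));
console.log(publishedCount, delayed.zcard());
```

Two properties of this shape matter for what follows: the whole batch sits in memory between the read and the removals, and two instances running the same tick could both read the same batch before either removes anything.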

The Issues

This solution actually scaled up until 1-2 years ago, when we started having issues with it - the main ones being:

- It could not catch up with a huge backlog of delayed messages
- It would eventually OOM or SPIKE up to 40GB of memory
  1. Because jobs were fetched into memory, and some instability or even some internal logic could end up shoveling too much data into Redis, the service just died 💀
- We could not scale horizontally, because the service consumed and fetched objects into memory before deleting them

The Solution

The solution was very simple: we implemented something that I like to call the streaming approach.

Using the same data structure, we are now:

- Running a zcount from 0 to the current timestamp
  - This counts the number of due jobs, returning N
- Creating an Async Iterator that runs N times, using the zpopmin method from Redis
  - zpopmin basically returns AND removes the member with the lowest score, i.e. the earliest timestamp

The processor for the SortedSet:

[Image: process-delayed-jobs-code]

The Async Iterator:

[Image: job-scan-iterator]
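The original post shows the real code as screenshots. What follows is a sketch of the same idea, not a transcription: `InMemorySortedSet` is an in-memory stand-in for Redis with async methods that mimic a client, and `dueJobs`, `processDelayedJobs` and `demo` are hypothetical names.

```typescript
// Minimal in-memory stand-in for one Redis Sorted Set (an assumption for
// this sketch). Methods are async to mimic a network client.
class InMemorySortedSet {
  private entries: { score: number; member: string }[] = [];

  async zadd(score: number, member: string): Promise<void> {
    this.entries.push({ score, member });
    this.entries.sort((a, b) => a.score - b.score);
  }

  // Like ZCOUNT min max: how many members have a score in [min, max].
  async zcount(min: number, max: number): Promise<number> {
    return this.entries.filter((e) => e.score >= min && e.score <= max).length;
  }

  // Like ZPOPMIN: remove and return the member with the lowest score.
  async zpopmin(): Promise<{ score: number; member: string } | undefined> {
    return this.entries.shift();
  }
}

// Yield due jobs one at a time: count them once, then pop that many times.
async function* dueJobs(set: InMemorySortedSet, now: number): AsyncGenerator<string> {
  const n = await set.zcount(0, now);
  for (let i = 0; i < n; i++) {
    const popped = await set.zpopmin();
    if (popped === undefined) return; // the set was drained in the meantime
    yield popped.member;
  }
}

// Publish each job as it is popped; only one job is held in memory at a time.
async function processDelayedJobs(
  set: InMemorySortedSet,
  now: number,
  publish: (job: string) => Promise<void>,
): Promise<number> {
  let processed = 0;
  for await (const job of dueJobs(set, now)) {
    await publish(job);
    processed++;
  }
  return processed;
}

async function demo(): Promise<string[]> {
  const set = new InMemorySortedSet();
  await set.zadd(1000, "job-a");
  await set.zadd(2000, "job-b");
  await set.zadd(9000, "job-c"); // not due yet
  const sent: string[] = [];
  await processDelayedJobs(set, 5000, async (job) => { sent.push(job); });
  return sent;
}

const demoResult = demo();
demoResult.then((sent) => console.log(sent));
```

Because each pop removes the job as it returns it, two workers never receive the same job. One caveat of this sketch: zpopmin takes the lowest score whether or not it is due, so if several workers count and pop at the same time, one of them could pop a job that is not due yet. Comparing the popped score with the current timestamp would be one way to guard against that.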

And that's all!

This simple algorithm change annihilated the need for:

- Big in-memory fetches, which made our memory allocation big
- The 5K limit on fetches, which made our throughput lower

Results - which I think the screenshots can speak for themselves

We processed the entire backlog of 40GB of pending jobs pretty quickly.

[Image: aws-memory-consumption]

From a constant usage of ~8GB, we dropped down to ~200MB.

[Image: memory-delayed-jobs]

We are now - trying to play it safe and still oversizing - allocating 1/4 of the resources. The git diff was our first resource cut - we went even further.

[Image: git-diff-memory-delayed-jobs]

Money-wise: we are talking about savings of at least 1K USD/month AND more in the future if we can downsize our ElastiCache instance.

Take Away Points

- Redis is a distributed data structure database - more than just a simple key-value store.
- You can achieve great designs using Redis.
- Be careful, because the way you design a solution with Redis may be costly in the future.

Final Thoughts

There could be a lot of discussion about whether this is a great way of postponing jobs, whether Redis is the right storage, whether we should really postpone jobs for small network hiccups, whether we should leverage DelayedExchange from Rabbit instead, etc... But at the end of the day, to succeed as a company we need to solve problems in our daily routine. Some problems are worth solving, some are not. It's always one step at a time.
