
4 posts tagged with "redis"


· 5 min read
Lucas Weis Polesello

As engineers we are used to manipulating data in well-known structures like arrays, queues, sets and maps, since they solve day-to-day issues - for example, when you need to manipulate database relationships on the client side, you reach for Sets/Maps to prevent O(n²) searches.

But one of the not-so-common data structures is the Bitmap - very math-focused and tailored to specific scenarios - so it's usually forgotten.

Today we are going to showcase one scenario where it might be interesting to use it.

The Problem with Access Control

In many SaaS platforms the ACL model revolves around many-to-many relationships. As an example, imagine software that solves for real-time communication, such as Slack, but with a bit more spiciness (as if Slack needed it?) - one more layer of control.

You would have a User U that belongs to Groups G. Each Group has its own set of Teams T. Every message sent to a Team T should be broadcast to all online Users that have access to Team T.

This gives you a problem: you need to ensure that messages CANNOT be routed to the wrong recipients, even as other Teams join the same conversation.

The Challenge Is About Scale

Up to a certain scale you can handle this with plain database queries - it's fairly simple. A well-indexed query would rarely exceed 50ms, and even a poorly optimized one wouldn't be a huge problem. At that scale, maybe even the volume of notifications is not a problem!

But We Like To Pretend Windmills are Dragons

The real challenge is: what if - by luck or the greatest salesman ever - your company reaches a huge milestone and scale comes knocking at the back door?

The Scenario:

All of a sudden you are handling 30K+ notifications per second. Your engineering team, knowing the product's domain, designed almost all modeling and infrastructure with this in mind. Yet hammering the database 30K+ times per second is far from ideal - injecting $$$ would solve it, but on this blog we like to pretend people care about green solutions.

Then one day, a Slack thread arrives:

Engineer 1: Since the last Go-live our database system is just crawling man... I don't know how long we can sustain this.
Engineer 2: Yeah... We jumped from 10K Messages/Sec to more than 30K - no wonder it just got hammered
Engineer 1: I think it's time to optimize this but that might be hard - maybe some caching layer?
Engineer 2: Yeah, caching could be the way - but how?
Engineer 1: I have no idea to be honest. It might be too much data for Redis?
Engineer 3: Yo, I am pretty sure someone has solved this...
CTO: I think we can use Bitmaps for that?

BitMaps and Access Control

Finally we reach the main part of this article. Bitmaps are data structures specialized in binary operations - fast and space-efficient. As such, they efficiently tell us whether something is ON/OFF or TRUE/FALSE.

The Design

Each Tenant would have all of its Teams laid out as Redis Bitmaps, keyed like:

teams:<tenantId>:<tid>
online-users:<tenantId>
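
To make the layout concrete, here is a minimal sketch of how the bits would be maintained - assuming a Node.js client such as ioredis and assuming every user is assigned a small integer index that doubles as their bit offset (both are assumptions, not part of the original design):

import Redis from "ioredis";

const redis = new Redis();

// Hypothetical helpers: flip a user's bit in a team bitmap and in the presence bitmap.
async function grantTeamAccess(tenantId: string, teamId: string, userIdx: number) {
  // User belongs to the team -> bit ON in teams:<tenantId>:<tid>
  await redis.setbit(`teams:${tenantId}:${teamId}`, userIdx, 1);
}

async function setOnline(tenantId: string, userIdx: number, online: boolean) {
  // Presence -> bit ON/OFF in online-users:<tenantId>
  await redis.setbit(`online-users:${tenantId}`, userIdx, online ? 1 : 0);
}

// Usage: user #42 of tenant "acme" joins team "billing" and comes online.
await grantTeamAccess("acme", "billing", 42);
await setOnline("acme", 42, true);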

(Assuming you have de-normalized these relations because of their read pattern.) Instead of doing:

SELECT
ug.user
FROM
groups ug
WHERE
ug.teams IN (<list of teams>)
AND ug.tenantId = <tenantId>

You would simply use a small Lua script:

local tenantId = ARGV[1]
local tmp_or_key = 'tmp:teams'
local notify_key = 'tmp:notification'
local online_key = 'online-users:' .. tenantId

-- OR all team bitmaps (passed as KEYS) into a temporary key
redis.call('BITOP', 'OR', tmp_or_key, unpack(KEYS))

-- AND the merged team bitmap with the online-users bitmap: only online members remain
redis.call('BITOP', 'AND', notify_key, tmp_or_key, online_key)

-- Extract those Ids
local max_index = redis.call('STRLEN', notify_key) * 8
local notified_users = {}
for i = 0, max_index - 1 do
local bit = redis.call('GETBIT', notify_key, i)
if bit == 1 then
table.insert(notified_users, i)
end
end

-- Optional
redis.call('DEL', tmp_or_key)

return notified_users
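
Invoking it is a plain EVAL where KEYS are the team bitmaps and ARGV[1] is the tenant id. A hedged sketch with ioredis (script path, key names and ids are illustrative):

import { readFileSync } from "node:fs";
import Redis from "ioredis";

const redis = new Redis();
// The Lua script above, saved to disk (path is illustrative).
const script = readFileSync("./broadcast-targets.lua", "utf8");

async function usersToNotify(tenantId: string, teamIds: string[]): Promise<number[]> {
  const keys = teamIds.map((tid) => `teams:${tenantId}:${tid}`);
  // KEYS = team bitmaps to OR together; ARGV[1] = tenantId.
  return (await redis.eval(script, keys.length, ...keys, tenantId)) as number[];
}

// e.g. a message sent to teams "billing" and "support" of tenant "acme"
const recipients = await usersToNotify("acme", ["billing", "support"]);

One design note: the script uses fixed tmp:* key names, so concurrent invocations would step on each other; in practice you would likely want to make those temporary keys unique per call.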

The Outcome

So revisiting what changed in the infrastructure:

Before, you were doing 30K database operations per second - each going all the way through parsing, planning, optimizing, checking the buffer cache and maybe reading from disk. It's even worse if your database is write-specialized, like those with LSM storage engines.

Instead of doing all of this, you are now just flipping bits - an operation that is extremely cheap and CPU-cache friendly.

The current scenario?

  • Your database is free of abusive reads.
  • It can focus on the write-heavy nature of Chat systems
  • Notifications are now even faster - what took (luckily) under 50ms is now under 1ms
  • Downgrade that database man and save some money!

Takeaway Points

  1. You might be asking yourself if you couldn't do the same with plain Sets. The answer is: yes. The trade-offs? RAM storage ($$$), CPU performance and overall latency
  2. You could easily fit more than 10 Million Users within less than 5GB of data using Roaring Bitmaps!

· 4 min read
Lucas Weis Polesello

Back in 2022 @Luma had a major outage which caused hours of downtime, angry customers and a lot of engineering effort to bring the product back to normal. One of the discoveries made at that time was a hard truth: we had modeled our internal IAM object poorly.

inb4: PHI

In the healthcare tech industry - and outside of it - Patient Health Information (PHI) is of major importance. Applied to Luma's platform, it means - very briefly - that a patient's data should be visible only to facilities they have interacted with or allowed to have it. This protects the patient's health information and access to their data. It's a US-government-enforced law.

IAM Model

Our IAM model is built around something called the session object - a pretty simple concept: it concentrates the user's token, settings, groups, facilities and other metadata. We use this session throughout all of Luma's backend components to properly apply the PHI filtering rule. One of the bad decisions back then was to simply pull all facilities and groups into that JSON object and cache it. But then you would probably ask...

What's the problem with it?

As Luma grew in scale, we started onboarding bigger and bigger customers, each with their own account setup leading to N different use cases. Add up some very creative account setups and huge customers and we ended up with something unexpected: session objects storing up to 2.6MB of pure JSON text.

And yes, you did read it right - an entire PDF!!

Now imagine that for each job we actually pulled the cached session up to 100x - or even more. Luma produces an average of 2K jobs per second, spiking up to 10K.

That's A LOT of network usage - easily surpassing 2GB per second, aka more than 15Gbps. For reference, a cache.m6g.8xlarge - a fair-sized cache instance - has about this much bandwidth.

Infrastructure Impact

All of a sudden Luma has this scenario:

  • Slow and unstable HTTP APIs
  • Had to oversize more than usual to handle the same load - a very low ~400 RPS overall
  • Had to split our Broker instance into smaller, focused ones
  • Had to increase our cache size
  • We had to create something called pubsub-service (yes and yes, no bueno) to offload slow publishes from services.
  • Along with it, we also created a jobless feature which forced backend components to publish only the jobId and route the job content itself via Redis.

The Business Impact? Money being thrown away and unsatisfied customers

Given the session's mission-critical importance, the risks were just too high - until the 1st of March 2025. After good months of investigation, searching code and running AST analysis (codemod) over almost the entire codebase and libraries, it sounded like we had a solution in mind.

A simple feature flag that would do Query.select(_id) when querying Groups - building the session with less data.
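
In Mongoose terms the change boils down to something like this (model, schema and flag names here are illustrative, not Luma's actual code):

import { Schema, model } from "mongoose";

// Illustrative model; the real Group schema is much richer.
const Group = model("Group", new Schema({ tenant: String, name: String, facilities: [String] }));

async function loadGroupsForSession(tenant: string, leanSessionFlag: boolean) {
  if (leanSessionFlag) {
    // After: only _id is pulled, so the cached session stays tiny.
    return Group.find({ tenant }).select("_id").lean();
  }
  // Before: full group documents were pulled into the session object.
  return Group.find({ tenant }).lean();
}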

Although the change was simple and backed by a lot of research, we were still cautious: we set up plenty of product and log metrics to understand whether rolled-out customers would see any negative impact - like messaging not going out, notifications being missed or, even worse, an entire product breakdown.

Rollout and Implementation

  • We needed to ensure our libraries were at least a certain version
  • Enable flag for each customer tenant id

Outcome

It took us nearly a month rolling out 90 backend services and a week enabling the flag for all customers. But the results speak for themselves.

  • 60% network usage reduction. It's been weeks since we last had any alerts about network bandwidth
  • Stable API latencies. We are now able to downsize our infrastructure back to normal levels - we estimate we can cut REST layer resources by 1/4
  • Almost zeroed the PubSub bandaid's network usage. We are now unblocked to remove bandaid solutions like pubsub-service and the sessionless code

REST Layer p75 to p95 stability and latencies drop after 03/11 - first customers.

rest-latencies

Comparing pre-rollout weeks(03/17) to post-rollout (03/17) - Check the end of the graph monthly-bandwidth-usage

Post-Rollout Monday (busiest day of the week): busy-monday-after-rollout
Pre-Rollout Monday (busiest day of the week): busy-monday-pre

Takeaway Points

There's much more coming in the future - but we are happy to finally unblock the road for bigger impact optimizations.

Building a good product, finding market fit, prioritizing customers and market requirements is an art of business, but I deeply believe there is a bond between the business and this not-so-celebrated kind of work.

At the end of the day - delivering a reliable, stable and ever-growing platform requires revisiting past decisions - behind a healthy and stable platform is a great patient experience and efficient staff.

· 5 min read
Lucas Weis Polesello

Some weeks ago we had an incident caused mainly by how we delay job execution. One queue showed abnormal behavior over the weekend, which our monitoring systems caught, but we expected the system to be self-healing and to work through this small hiccup. On Tuesday we noticed something was off with our integration systems - code that had been running for more than a year stopped working right when we rolled it out for a big new customer.

I got involved early in the incident, given how often I touch that component of the backend, and we started investigating so we could get back to the customer as fast as possible with some good news.

After some hours of investigation we concluded that our long-polling mechanism had a hot loop on a certain queue - it was endlessly re-delaying the same jobs. This edge case was first triggered during the weekend, when we saw the abnormal behavior. Because jobs were always re-delayed, the backlog of postponed jobs for this queue never actually drained, so the component got stuck forever trying to drain it - delaying all the other queues on the platform.

In a quick half-hour fix we shipped code to stop retrying jobs after more than N retries, unblocking the hot loop.

What about the future?

All that to say what? What does this have to do with Redis?

As described in the previous post of this series, the job-delaying mechanism lives within Redis, where we use a Sorted Set to pull the next job. We fan out to a certain number of targets in parallel, but we only restart this processing after all possible queue targets have been drained up to some upper-bound timestamp (the score in the Sorted Set).

With this fragility in mind and December's code freeze approaching, I suggested rewriting this design completely.

Current Design

Whenever a service requests a postpone - either intentionally, by calling the method to postpone it, or unintentionally, due to some IO failure, protocol error or some ephemeral channel churn - the job flows through this DelayedJobsStorage design.

The current production design is a simple Redis Sorted Set where the score is the timestamp and the member is the job serialized into a string.
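
In Redis terms the whole mechanism boils down to one ZADD per postponed job, roughly like this (a sketch with ioredis; the delayed:<queue> key name is an assumption):

import Redis from "ioredis";

const redis = new Redis();

// Postpone a job: the member is the serialized job itself,
// the score is the timestamp at which it should be re-enqueued.
async function postpone(queue: string, job: object, delayMs: number): Promise<void> {
  const retryAt = Date.now() + delayMs;
  await redis.zadd(`delayed:${queue}`, retryAt, JSON.stringify(job));
}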

On the infrastructure side, we use a cache.r7g.12xlarge ElastiCache instance (~300GB 🚨) to sustain our load SPIKES - like instabilities or big-customer data.

On the other end, there's a separate backend component, Scheduler, that is responsible for triggering individual jobs at a certain interval. This component triggers a service that runs a long-polling logic up to the current timestamp.

delayed-jobs-old-design

New Design

I've created some abstractions to make it easier to implement new storage backends like S3, VFS and many more. Redis now runs mainly as a reference-to-data component and only stores entire jobs if they are due within the next ~2 seconds or some failure happens.

The storage decision weighs a few considerations (a rough sketch follows this list):

  • How long are we planning to postpone?
  • How big (byte size) is this job?
  • Is this storage backend rolled out for certain queue OR tenant _id?
  • Is this storage available? If not, fall back to Redis
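
Here is how that decision could look - interfaces, thresholds and helper names are illustrative, not the actual implementation:

interface SerializedJob { queue: string; tenantId: string; payload: string; }

interface DelayedJobStorage {
  store(job: SerializedJob, retryAt: number): Promise<string>; // returns a reference key
  healthy(): Promise<boolean>;
}

// Illustrative thresholds only.
const SHORT_DELAY_MS = 2_000;
const SMALL_JOB_BYTES = 8 * 1024;

async function pickStorage(
  job: SerializedJob,
  delayMs: number,
  redisStorage: DelayedJobStorage,
  s3Storage: DelayedJobStorage,
  rolledOut: (queue: string, tenantId: string) => boolean,
): Promise<DelayedJobStorage> {
  // Very short delays and small jobs stay fully in Redis - the indirection isn't worth it.
  if (delayMs <= SHORT_DELAY_MS || Buffer.byteLength(job.payload) <= SMALL_JOB_BYTES) {
    return redisStorage;
  }
  // Respect the per-queue / per-tenant rollout flag.
  if (!rolledOut(job.queue, job.tenantId)) {
    return redisStorage;
  }
  // If the alternative backend is unavailable, fall back to Redis.
  return (await s3Storage.healthy()) ? s3Storage : redisStorage;
}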

We also changed the core logic behind the backend service that re-enqueues delayed jobs:

  • Removed the coupling with Scheduler:
    • Instead of waiting for a scheduled job to trigger the backend service, it triggers itself at a certain interval.
  • Circuit Breaking:
    • Jobs keep being re-enqueued up to a certain timeout, preventing one cycle from starving resources OR locking the entire service in a hot loop.
  • Concurrent Processing with Semaphores:
    • We would always have up to N concurrent processing target queues - decoupled from other executions.
    • Speeds up draining time.
  • Better visibility:
    • We implemented proper monitoring such as latencies for each re-enqueue, amount of re-enqueues and failure/successful re-enqueues.
  • Horizontally Scale by Controlling the ZPOPMIN:
    • Pop the least score up to an upper-bound timestamp.
    • This enables us to run multiple instances, which we couldn't do in the past due to the previous singleton design.

new-design

redis-is-more-than-cache-data-reference-new-design-read_

Example

Let's take an example: storing a job for 10 minutes that is medium-sized due to the context it needs to carry. We would store the job data in S3 and then insert a key into Redis formatted as s3::some-queue::384192393192::<some-generated-uuid>. When re-enqueueing the job, the abstraction recognizes the protocol as s3 and uses the s3 implementation, which knows where to fetch the job by this uuid.
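
A hedged sketch of how that reference could be parsed and resolved - the key format follows the example above, while the bucket, object layout and helper names are assumptions:

import Redis from "ioredis";
import { S3Client, GetObjectCommand } from "@aws-sdk/client-s3";

const redis = new Redis();
const s3 = new S3Client({});

// e.g. "s3::some-queue::384192393192::<some-generated-uuid>"
function parseReference(ref: string) {
  const [protocol, queue, retryAt, id] = ref.split("::");
  return { protocol, queue, retryAt: Number(retryAt), id };
}

async function loadJob(ref: string): Promise<string | null> {
  const { protocol, queue, id } = parseReference(ref);
  if (protocol === "s3") {
    // Bucket name and object layout are illustrative.
    const res = await s3.send(
      new GetObjectCommand({ Bucket: "delayed-jobs", Key: `${queue}/${id}` }),
    );
    return (await res.Body?.transformToString()) ?? null;
  }
  // Fallback: the job body itself lives in Redis.
  return redis.get(ref);
}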

What to expect from this Design

  • Infrastructure downsizing: from 300GB to ~10GB
    • Currently we have 4 million keys stored, consuming up to 200GB of RAM
    • In the new design this would be ~64MB - counting each uuid as 128 bits (16 bytes × 4 million keys) and omitting the other small appended data
    • Roughly, we expect to downsize to 10GB - brief postpones can still occur and load it up; experimentation will tell us the proper size
  • Independence: decoupled backend components, preventing service A from stopping service B from working properly
  • Speed: N replicas will drain queues much faster under abnormal load
  • Horizontal Scalability/Reliability: multiple consuming replicas.
  • Tail-Latency Aware: latency-sensitive components such as WebSocket notifications, chat messages and all real-time communication - heavily used in the platform - will be decoupled

Concerns

  • RPS - How many jobs are being delayed per sec and their latencies? p75, p90, p99
  • S3 speed for large objects -> Will it impact our re-enqueueing speed?
  • Proper Redis sizing -> How far can we push lower?
  • Redis is still a single point of failure -> What if Redis just dies?

· 4 min read
Lucas Weis Polesello

My current company - Luma Health Inc - has an Event-Driven Architecture where all of our backend systems interact via async messaging/jobs. Thus our backbone is sustained by an AMQP broker - RabbitMQ - which routes the jobs to interested services.

Since our jobs are very critical, we cannot tolerate failures AND should design the system to be more resilient - because, well... we don't want a patient not being notified of their appointment, appointments not being created when they should be, or patients showing up at facilities that were never notified the patient had something scheduled.

Besides infra and product reliability, some use cases simply need postponing - maybe reaching out to an external system that is offline or not responding, maybe some error which needs a retry - who knows?

The fact is, delaying/retrying is a very frequent requirement in Event-Driven Architectures. So a service responsible for it was created - and it worked fine.

But as the company sold bigger contracts and grew in scale, this system became stressed out and unreliable.

The Unreliable Design

Before getting to the symptoms, let's talk about the organism itself - the service's old design.

The design was really straightforward - if our service handlers asked for a postpone OR we failed to send the message to RabbitMQ, we just inserted the job's JSON object into a Redis Sorted Set, using the score as the timestamp at which it should be retried/published again.

To publish the postponed messages back into RabbitMQ, a job was triggered every 5 seconds, doing the following:

  1. Read from a set key containing all the existing sorted set keys - basically the queue name
  2. Fetch jobs using zrangebyscore from 0 to current timestamp BUT limit to 5K jobs.
  3. Publish the job to RabbitMQ and remove it from sorted set
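
In code, that loop looked roughly like this - a reconstruction for illustration (key names and the publish callback are assumptions), not the original source:

import Redis from "ioredis";

const redis = new Redis();

async function drainOnce(publish: (queue: string, job: string) => Promise<void>) {
  // 1. A set key holds all existing sorted-set keys (one per queue).
  const queues = await redis.smembers("delayed:queues");
  for (const queue of queues) {
    // 2. Fetch everything due so far into memory, capped at 5K jobs.
    const due = await redis.zrangebyscore(`delayed:${queue}`, 0, Date.now(), "LIMIT", 0, 5000);
    for (const job of due) {
      // 3. Publish to RabbitMQ, then remove from the sorted set.
      await publish(queue, job);
      await redis.zrem(`delayed:${queue}`, job);
    }
  }
}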

The Issues

This solution actually scaled until 1-2 years ago, when we started having issues with it - the main ones being:

  1. It could not catch up to a huge backlog of delayed messages
  2. It would eventually OOM or SPIKE up to 40GB of memory
    1. Because everything was fetched into memory - and some instability or even internal logic could end up shoveling too much data into Redis - the service just died 💀
  3. We could not scale horizontally - due to consuming and fetching objects into memory before deleting them.

The Solution

The solution was very simple: we implemented something I like to call the streaming approach.

Using the same data structure, we are now:

  1. Running a zcount from 0 to current timestamp
    • Counting the amount of Jobs -> returning N
  2. Creating an async iterator that runs N times, using Redis's zpopmin command
    • zpopmin returns AND removes the member with the lowest score - i.e. the earliest timestamp

The processor for the SortedSet process-delayed-jobs-code

The Async Iterator job-scan-iterator
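
For readers who can't see the screenshots, here is a rough sketch of the idea - again with ioredis and illustrative key names, not the actual production code:

import Redis from "ioredis";

const redis = new Redis();

// Async iterator: pop due jobs one at a time instead of loading them all into memory.
async function* dueJobs(queue: string): AsyncGenerator<string> {
  const key = `delayed:${queue}`;
  // 1. How many jobs are due right now?
  const count = await redis.zcount(key, 0, Date.now());
  for (let i = 0; i < count; i++) {
    // 2. ZPOPMIN returns AND removes the member with the lowest score.
    const [member, score] = await redis.zpopmin(key);
    if (!member) break; // set drained by another consumer
    if (Number(score) > Date.now()) {
      // Not actually due (e.g. added concurrently): put it back and stop.
      await redis.zadd(key, score, member);
      break;
    }
    yield member;
  }
}

async function drainQueue(queue: string, publish: (job: string) => Promise<void>) {
  for await (const job of dueJobs(queue)) {
    await publish(job);
  }
}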

And that's all!

This simple algorithm change annihilated the need for:

  1. Big in-memory fetches - which made our memory allocation huge
  2. The 5K fetch limit - which capped our throughput

Results - I think the screenshots speak for themselves

We processed the entire backlog of 40GB of pending jobs pretty quickly aws-memory-consumption

From a constant usage of ~8GB we dropped down to ~200MB. memory-delayed-jobs We are now - playing it safe and still oversizing - safely allocating 1/4 of the resources. The git diff was our first resource dump - we went even further.

git-diff-memory-delayed-jobs

Money-wise, we are talking about at least 1K USD/month - and more in the future if we can downsize our ElastiCache instance.

Takeaway Points

  • Redis is a distributed data-structure database - more than just a simple key-value store.
  • You can achieve great designs using Redis
  • Be careful because the way you design a solution with Redis may be costly in the future

Final Thoughts

There could be a lot of discussion about whether this is a great way of postponing jobs, whether Redis is the right storage, whether we should really postpone jobs for small network hiccups, whether we should leverage RabbitMQ's Delayed Exchange - etc... But at the end of the day, to succeed as a company we need to solve the problems in our daily routine. Some problems are worth it, some are not. It's always one step at a time.