Some weeks ago, we had an incident caused mainly by how we delay job execution. One queue showed abnormal behavior over the weekend, which our monitoring systems caught, but we expected the system to be self-healing and able to push through this small hiccup. On Tuesday we noticed something was off with our integration systems - code that had been running for more than a year stopped working right when we rolled it out for a new big customer.
I got involved early in the incident because of how often I touch that component of the backend, and we started an investigation so we could get back to the customer as fast as possible with some good news.
After some hours of investigation we concluded that our long-polling mechanism had a hot loop on a certain queue - which in turn kept re-delaying jobs. This edge case was first triggered during the weekend when we saw the abnormal behavior. Because the jobs were always re-delayed, the set of postponed jobs for this queue would never actually drain, so this component would get stuck forever trying to drain postponed jobs - delaying every other queue on the platform.
As a quick half-hour fix, we shipped code that stops retrying jobs after N retries, unblocking the hot loop.
What about the future?
All that to say what? What does it have to do with Redis?
As described in the previous post of this series - the job-delaying mechanism lives within Redis, where we use a Sorted Set to pull the next job. We fan out to a certain number of queues in parallel - but we only restart this processing after all possible queue targets have been drained up to some upper-bound timestamp (the timestamp being the score in the Sorted Set).
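To make that concrete, here is a minimal sketch of that write path using redis-py; the key name, serialization, and client setup are assumptions for illustration, not our actual schema.

```python
import json
import time

import redis

r = redis.Redis()

# Hypothetical key name; the real Sorted Set and serialization format differ.
DELAYED_KEY = "delayed_jobs"

def postpone(job: dict, delay_seconds: float) -> None:
    """Add the serialized job to the Sorted Set, scored by its due timestamp."""
    due_at = time.time() + delay_seconds
    r.zadd(DELAYED_KEY, {json.dumps(job): due_at})
```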
With this fragility in mind and the December code freeze approaching, I suggested rewriting this design completely.
Current Design
Whenever a service requests a postpone - which can be either intentional (calling the method to postpone the job) or unintentional (due to some IO failure, protocol error, or ephemeral channel churn) - the job flows through this `DelayedJobsStorage` design.
The current production design is a simple Redis Sorted Set where the timestamp is the score and the member is the job serialized into a string.
On the infrastructure side, we use a cache.r7g.12xlarge ElastiCache instance (~300GB) to sustain our load spikes - caused by things like instabilities or big customers' data.
On the other end, there's a separate backend component, `Scheduler`, that is responsible for triggering individual jobs at a certain interval. This component triggers a service that runs long-polling logic up to the current timestamp.
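Continuing the sketch above, the drain side of the current design looks roughly like this; `re_enqueue` stands in for the real hand-off back to the queue and is purely hypothetical.

```python
def re_enqueue(job: dict) -> None:
    ...  # hypothetical: push the job back onto its real queue

def drain_due_jobs(batch_size: int = 100) -> None:
    """Re-enqueue every job whose score (due timestamp) is <= now, lowest score first."""
    now = time.time()
    while True:
        due = r.zrangebyscore(DELAYED_KEY, "-inf", now, start=0, num=batch_size)
        if not due:
            break  # everything up to the upper-bound timestamp has been drained
        for raw in due:
            re_enqueue(json.loads(raw))   # hand the job back to its queue
            r.zrem(DELAYED_KEY, raw)      # drop the entry only after a successful re-enqueue
```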
New Design
I've created some abstractions to make it easier to implement new storage backends like `S3`, `VFS`, and more.
Redis now mainly runs as a reference-to-data component and only stores entire jobs if they are due within the next ~2 seconds or some failure happens.
The storage decision is made with a few considerations in mind (see the sketch after this list):
- How long are we planning to postpone?
- How big (byte size) is this job?
- Is this storage backend rolled out for a certain queue or `tenant_id`?
- Is this storage available? If not, fall back to Redis
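A sketch of how that routing decision could look; the thresholds, flag names, and helper objects are assumptions for illustration, and the split between S3 and VFS by payload size is an assumed example rather than a fixed rule.

```python
# Hypothetical stand-ins for the real rollout flags and backend health checks.
class Rollouts:
    def is_enabled(self, flag: str, queue: str, tenant_id: str) -> bool:
        return True

class S3Backend:
    def is_available(self) -> bool:
        return True

rollouts = Rollouts()
s3_backend = S3Backend()

SHORT_DELAY_SECONDS = 2.0        # assumed: brief postpones stay entirely in Redis
LARGE_JOB_BYTES = 64 * 1024      # assumed threshold separating the external backends

def pick_backend(job_bytes: int, delay_seconds: float, queue: str, tenant_id: str) -> str:
    """Decide where the job payload lives; Redis then holds only a reference."""
    if delay_seconds <= SHORT_DELAY_SECONDS:
        return "redis"   # not worth an external round trip for a ~2s postpone
    if not rollouts.is_enabled("delayed-jobs-s3", queue=queue, tenant_id=tenant_id):
        return "redis"   # backend not rolled out for this queue/tenant yet
    if not s3_backend.is_available():
        return "redis"   # backend unavailable: fall back to Redis
    if job_bytes >= LARGE_JOB_BYTES:
        return "s3"      # assumed split: big payloads go to S3
    return "vfs"         # assumed split: smaller payloads go to a VFS-backed store
```

Falling back to Redis keeps the old behavior as the safety net, so a storage outage degrades capacity rather than correctness.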
We also changed the core logic of the backend service that re-enqueues delayed jobs (a condensed sketch follows this list):
- Removed the coupling with `Scheduler`:
  - Instead of waiting for a scheduled job to trigger the backend service, it now triggers itself at a certain interval.
- Circuit Breaking:
  - Jobs keep being re-enqueued up to a certain timeout, preventing one cycle from starving resources or locking the entire service in a hot loop.
- Concurrent Processing with Semaphores:
  - We always process up to N target queues concurrently, decoupled from other executions.
  - This speeds up draining time.
- Better visibility:
  - We implemented proper monitoring, such as latency for each re-enqueue, the number of re-enqueues, and failed/successful re-enqueues.
- Horizontal Scaling by Controlling ZPOPMIN:
  - Pop the lowest score up to an upper-bound timestamp.
  - This enables us to run multiple instances, which we could not do in the past due to the singleton design we had.
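A condensed sketch of that new drain loop, assuming redis-py's asyncio client, per-queue Sorted Sets, and made-up values for the semaphore size, circuit-breaker budget, and trigger interval; `re_enqueue` is again a hypothetical helper.

```python
import asyncio
import time

import redis.asyncio as redis  # redis-py's asyncio client

r = redis.Redis()

MAX_CONCURRENT_QUEUES = 8      # assumed semaphore size
DRAIN_BUDGET_SECONDS = 30.0    # assumed circuit-breaker budget per cycle
CYCLE_INTERVAL_SECONDS = 1.0   # assumed self-trigger interval

sem = asyncio.Semaphore(MAX_CONCURRENT_QUEUES)

async def re_enqueue(member: bytes) -> None:
    ...  # hypothetical: resolve the reference (Redis/S3/VFS) and push the job to its queue

async def drain_queue(queue: str, upper_bound: float, deadline: float) -> None:
    """Pop jobs due up to upper_bound for one queue until drained or the deadline hits."""
    async with sem:                                     # at most N queues processed concurrently
        key = f"delayed::{queue}"                       # assumed per-queue Sorted Set key
        while time.time() < deadline:                   # circuit breaker: never loop forever
            popped = await r.zpopmin(key)               # lowest score (earliest due) first
            if not popped:
                return
            member, score = popped[0]
            if score > upper_bound:
                await r.zadd(key, {member: score})      # not due yet: put it back and stop
                return
            await re_enqueue(member)

async def run_forever(queues: list[str]) -> None:
    """Self-triggering cycle: no external Scheduler component involved."""
    while True:
        deadline = time.time() + DRAIN_BUDGET_SECONDS
        upper_bound = time.time()
        await asyncio.gather(*(drain_queue(q, upper_bound, deadline) for q in queues))
        await asyncio.sleep(CYCLE_INTERVAL_SECONDS)
```

Putting a not-yet-due member back with ZADD is a simplification, but the key point stands: ZPOPMIN is atomic, so two replicas never pop the same member, which is what makes running multiple consumers safe.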
Example
Let's use an example: storing a job for 10 minutes that has a medium size due to the context it needs to carry. We would store the job data in S3 and then insert a key into Redis formatted as `s3::some-queue::384192393192::<some-generated-uuid>`.
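A sketch of that write path with boto3 and redis-py; the bucket name, key layout, and the assumption that the numeric segment is the due timestamp are all illustrative, not the real conventions.

```python
import json
import time
import uuid

import boto3
import redis

s3 = boto3.client("s3")
r = redis.Redis()

BUCKET = "delayed-jobs"  # assumed bucket name

def postpone_via_s3(job: dict, queue: str, delay_seconds: float) -> str:
    """Upload the payload to S3 and keep only a reference key in the Redis Sorted Set."""
    job_id = str(uuid.uuid4())
    s3.put_object(Bucket=BUCKET, Key=f"{queue}/{job_id}", Body=json.dumps(job).encode())
    due_at = int(time.time() + delay_seconds)
    reference = f"s3::{queue}::{due_at}::{job_id}"      # protocol::queue::timestamp::uuid (assumed layout)
    r.zadd(f"delayed::{queue}", {reference: due_at})
    return reference
```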
When trying to re-enqueue the job, the abstraction easily recognizes the protocol as `s3`. It then uses the `s3` implementation, which knows where to fetch a certain job by this `uuid`.
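And the matching read path, continuing the hypothetical sketch above: split the reference on `::`, dispatch on the protocol prefix, and let the `s3` implementation fetch the payload by its `uuid`.

```python
def load_job(reference: str) -> dict:
    """Resolve an s3:: reference back into the full job payload."""
    protocol, queue, _due_at, job_id = reference.split("::")
    if protocol != "s3":
        raise ValueError(f"unsupported storage protocol: {protocol}")
    obj = s3.get_object(Bucket=BUCKET, Key=f"{queue}/{job_id}")
    return json.loads(obj["Body"].read())
```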
What to expect from this Design
- Infrastructure downsizing: from `300GB` to `~10GB`
  - Currently we have 4 million keys stored, consuming up to `200GB` of RAM.
    - In the new design this would be `~64MB` - if we take each `uuid` as 128 bits (16 bytes x 4 million keys = 64MB) and omit the other small appended data.
  - Roughly, we expect to downsize to `10GB`, since brief postpones can still occur and overload it. Experimentation will lead us to decide the proper size.
- Independence: decoupled backend components prevent service A from stopping service B from working properly
- Speed: N replicas will consume queues much faster under abnormal load
- Horizontal Scalability/Reliability: multiple consuming replicas
- Tail-Latency Aware: latency-sensitive components - such as WebSocket notifications, chat messages, and all the real-time communication heavily used in the platform - will be decoupled
Concerns
- RPS - how many jobs are being delayed per second, and what are their latencies (p75, p90, p99)?
- S3 speed for large objects -> Will it impact our re-enqueueing speed?
- Proper Redis sizing -> How much lower can we push it?
- Redis is still a single point of failure -> What if Redis just dies?