The Zalgo Effect is a term used to describe the unexpected outcomes of mixing sync and async JavaScript code.
It means that if you mix these two approaches, something weird will happen.
It's one of those things you don't really understand until you see it in a real production system.
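To make the idea concrete, here is a minimal sketch of the classic callback form of Zalgo. The readConfig function and its cache are hypothetical names made up for illustration; they are not part of the service described in this post.

```javascript
// Hypothetical example: `readConfig` and its cache are made up for illustration.
const cache = new Map();

// Zalgo: the callback fires synchronously on a cache hit
// and asynchronously on a cache miss.
function readConfig(key, callback) {
  if (cache.has(key)) {
    callback(cache.get(key)); // sync path
    return;
  }
  setTimeout(() => {
    const value = `value-of-${key}`;
    cache.set(key, value);
    callback(value); // async path
  }, 10);
}

// Records whether the callback ran before or after the line that follows the call.
function trace(key, log, done) {
  readConfig(key, () => {
    log.push('callback');
    if (done) done();
  });
  log.push('after call');
}

const firstCall = [];
const secondCall = [];
trace('db', firstCall, () => {
  trace('db', secondCall);
  console.log(firstCall); // [ 'after call', 'callback' ]
  console.log(secondCall); // [ 'callback', 'after call' ]
});
```

The same call produces a different ordering depending on hidden state, which is exactly what makes this kind of code hard to reason about.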
So what does it have to do with Resource Leakage?
One day, our SRE team received a couple of PagerDuty alerts saying our services were restarting and unable to work properly due to Error: No channels left to allocate, i.e. RabbitMQ connections were maxing out their channel allocation. (See the RabbitMQ reference on Channels and Connections.)
It was clear some code was leaking channels. No one knew what the cause could be, but thank God I had studied the Zalgo Effect in the Node.js Design Patterns book, and something clicked.
How was I so sure Zalgo was the culprit?
The service throwing that error was only responsible for fanning out a couple of messages to a lot of other services, so it was as easy as creating a Queue object and running N promises concurrently to publish some message.
Checking the RabbitMQ Management UI showed me that we created N channels for that connection.
But why did it only happen in some scenarios?
That's where the Zalgo Effect comes in.
Our code was built back in ~2015, on Node 4, when the callback style was mainstream. Our engineers created the Queue abstraction, which handled almost 50% of our Event-Driven Architecture by itself, and they had to write it as a class with async initialization, which is not easy to do with callbacks.
So the code assumed the following:
1) Assert the exchange, queues and other necessary resources via something we could call consumeChannel. The consume channel is created whenever the connection is made.
2) Our confirmChannel, i.e. the channel we used to publish events, was lazily created, mixing async and sync code.
So the problem lives in 2).
Imagine the following:
- We call assertConfirmChannel.
- It checks whether the channel exists or not.
- If it does not, it creates the channel via a Promise and returns control to the event loop.
- If it does, it returns the channel.
What happens if two concurrent promises reach the same if before the first promise has resolved? We try to create the channel twice and the second assignment overwrites the first, so both channels stay open but only the last one is used.
This is where the code was leaking channels.
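The race can be reproduced with a small sketch. This is a hypothetical reconstruction, not the original service code: FakeConnection is a stand-in that only counts channels, its createConfirmChannel method borrows the name of the amqplib call, and LeakyQueue is an assumed shape for the Queue class.

```javascript
// Stand-in for a broker connection (hypothetical); it counts every channel it opens.
class FakeConnection {
  constructor() {
    this.openChannels = 0;
  }

  async createConfirmChannel() {
    // Simulated network round trip.
    await new Promise((resolve) => setTimeout(resolve, 5));
    this.openChannels += 1;
    return { id: this.openChannels, publish: async () => true };
  }
}

// Assumed shape of the leaky lazy initialization.
class LeakyQueue {
  constructor(connection) {
    this.connection = connection;
    this.confirmChannel = null;
  }

  async assertConfirmChannel() {
    if (!this.confirmChannel) {
      // Control returns to the event loop here while confirmChannel is still null,
      // so every concurrent caller passes the check above.
      this.confirmChannel = await this.connection.createConfirmChannel();
    }
    return this.confirmChannel;
  }

  async publish(message) {
    const channel = await this.assertConfirmChannel();
    return channel.publish(message);
  }
}

// Fans out n publishes concurrently and reports how many channels were opened.
async function demoLeak(n) {
  const connection = new FakeConnection();
  const queue = new LeakyQueue(connection);
  await Promise.all(Array.from({ length: n }, (_, i) => queue.publish(`event-${i}`)));
  return connection.openChannels;
}

demoLeak(5).then((count) => console.log(`channels opened: ${count}`)); // channels opened: 5
```

Each of the five concurrent calls sees an empty confirmChannel, so five channels are opened and four of them are never referenced again.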
Fixing the problem
Well, the fix we actually shipped was simply to call one Promise, await it, and then fan out the other promises.
We kept it that simple because of the risks and because the code is being refactored into a new style.
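As a rough sketch of that shipped workaround, assuming a compact stand-in for the leaky publisher (all names here are hypothetical), the first publish is awaited alone and only then are the rest fanned out:

```javascript
// Compact stand-in for the leaky publisher (hypothetical names throughout).
let openChannels = 0;
let confirmChannel = null;

async function createConfirmChannel() {
  // Simulated network round trip.
  await new Promise((resolve) => setTimeout(resolve, 5));
  openChannels += 1;
  return { publish: async () => true };
}

async function publish(message) {
  if (!confirmChannel) {
    // Still racy on its own: the check and the assignment are split by an await.
    confirmChannel = await createConfirmChannel();
  }
  return confirmChannel.publish(message);
}

async function fanOut(messages) {
  if (messages.length === 0) return openChannels;
  const [first, ...rest] = messages;
  // Publish one message alone so the channel exists before any concurrency starts.
  await publish(first);
  // Every remaining call now finds confirmChannel already set.
  await Promise.all(rest.map((message) => publish(message)));
  return openChannels;
}

fanOut(['a', 'b', 'c', 'd']).then((count) => console.log(`channels opened: ${count}`)); // channels opened: 1
```

This leaves the race inside publish untouched and only avoids triggering it, which is why it counts as a workaround rather than a real fix.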
How can I fix it if I see something like that?
If you want a real solution, here's what the V2 would look like: the idea is to create the Promise and assign it to a variable, instead of awaiting it.
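A minimal sketch of that idea, under the same assumptions as before: FakeConnection is a hypothetical stand-in that counts channels, and the Queue class shape and the confirmChannelPromise field are illustrative names rather than the original code.

```javascript
// Stand-in for a broker connection (hypothetical); it counts every channel it opens.
class FakeConnection {
  constructor() {
    this.openChannels = 0;
  }

  async createConfirmChannel() {
    // Simulated network round trip.
    await new Promise((resolve) => setTimeout(resolve, 5));
    this.openChannels += 1;
    return { id: this.openChannels, publish: async () => true };
  }
}

class Queue {
  constructor(connection) {
    this.connection = connection;
    this.confirmChannelPromise = null;
  }

  // Not async on purpose: the check and the assignment run in the same
  // synchronous block, so no other caller can slip in between them.
  assertConfirmChannel() {
    if (!this.confirmChannelPromise) {
      this.confirmChannelPromise = this.connection.createConfirmChannel();
    }
    return this.confirmChannelPromise;
  }

  async publish(message) {
    const channel = await this.assertConfirmChannel();
    return channel.publish(message);
  }
}

// Fans out n publishes concurrently and reports how many channels were opened.
async function demoFix(n) {
  const connection = new FakeConnection();
  const queue = new Queue(connection);
  await Promise.all(Array.from({ length: n }, (_, i) => queue.publish(`event-${i}`)));
  return connection.openChannels;
}

demoFix(5).then((count) => console.log(`channels opened: ${count}`)); // channels opened: 1
```

One caveat of this simple form: if the creation fails, the rejected promise stays cached and every later caller receives the same rejection.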
This easily fixes the problem by storing the promise in a variable and checking for its existence.
For a more robust style, where you actually need to initialize a couple of resources, you could do the following:
- Create a function that executes the entire initialization and returns a Promise.
- Keep a reference to that Promise.
- If the same initialization is requested again, just return the same Promise.
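The three steps above could be sketched as follows. FakeConnection, the Queue shape, the ready method and the 'events' exchange are hypothetical; the fake channel only mirrors the names of amqplib's assertExchange and publish. Clearing the reference on failure so a later call can retry is an addition of this sketch, not something described above.

```javascript
// Stand-in for a broker connection (hypothetical); it counts channels and exchanges.
class FakeConnection {
  constructor() {
    this.openChannels = 0;
    this.assertedExchanges = 0;
  }

  async createConfirmChannel() {
    // Simulated network round trip.
    await new Promise((resolve) => setTimeout(resolve, 5));
    this.openChannels += 1;
    return {
      assertExchange: async () => {
        this.assertedExchanges += 1;
      },
      publish: async () => true,
    };
  }
}

class Queue {
  constructor(connection) {
    this.connection = connection;
    this.initPromise = null;
  }

  // Step 1: one function runs the entire initialization.
  async initialize() {
    const channel = await this.connection.createConfirmChannel();
    await channel.assertExchange('events', 'fanout');
    return channel;
  }

  // Steps 2 and 3: keep a reference and hand the same promise to every caller.
  ready() {
    if (!this.initPromise) {
      this.initPromise = this.initialize().catch((error) => {
        // Assumption of this sketch: drop the reference so a later call can retry.
        this.initPromise = null;
        throw error;
      });
    }
    return this.initPromise;
  }
}

// Requests the initialized channel n times concurrently.
async function demoInit(n) {
  const connection = new FakeConnection();
  const queue = new Queue(connection);
  const channels = await Promise.all(Array.from({ length: n }, () => queue.ready()));
  return {
    openChannels: connection.openChannels,
    assertedExchanges: connection.assertedExchanges,
    sameChannel: channels.every((channel) => channel === channels[0]),
  };
}

demoInit(5).then((result) => console.log(result));
// { openChannels: 1, assertedExchanges: 1, sameChannel: true }
```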
OK, but why does it fix the problem?
The idea is to make sure the check and the assignment run synchronously, while the promises themselves settle on their own timing. We need to think in terms of the synchronous block of code being executed to reason about our promise usage.