Node.js Cluster usage and design pitfalls: use Redis!

Hello,

As some of you know from reading the about section, I worked for the CNRS on a big project called ISTEX. We had to deal with big-data problems (17M documents). For this, we had many servers with solid virtual machines running on them.


We first developed an API on top of Elasticsearch with Node.js, without using any kind of multi-processing. This was very stable, but it performed poorly. When we wanted to switch to a multi-process solution, we first turned to the native Node.js Cluster module. With it, you can easily fork parts of your code. But every fork owns its own process, in every sense: memory, state, connections.

Here is the easiest way to use Node.js Cluster, as shown in the official docs:

const cluster = require('cluster');

if (cluster.isMaster) { // The master is the main node process
  console.log('I am master');
  cluster.fork(); // Worker 1
  cluster.fork(); // Worker 2
} else if (cluster.isWorker) { // Workers are the child processes
  console.log(`I am worker #${cluster.worker.id}`);
}

 

Maybe you can already see the problems that arise if you designed your application in a single-threaded way, without Cluster in mind.

One of our main problems was IP authentication: since our users were mostly institutions, it was our primary way of authenticating them.

So, to give you the details: every fork instantiated its own copy of the IP address list. And since the IPs were provided through an XLS file with some non-standardized ranges (declared through the “licences nationales” website), we had a very long list to feed our IP filtering module: more than a million IPs to whitelist, each with the owner’s details… multiplied by 32 processes (one for every core on the virtual machine). Needless to say, the API took a long time to boot and was very RAM-hungry (almost 4 GB dedicated to IP lists), as sketched below.
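To make the duplication concrete, here is a minimal sketch of the anti-pattern (the loadWhitelist helper and its dummy data are hypothetical stand-ins for parsing the XLS export): because the list is built at module load, the master and every worker each keep their own full copy in RAM.

const cluster = require('cluster');
const os = require('os');

// Hypothetical stand-in for parsing the XLS export: builds a big
// in-memory array of IPs with the owner's details.
function loadWhitelist() {
  const list = [];
  for (let i = 0; i < 1000000; i++) {
    list.push({
      ip: `10.${(i >> 16) & 255}.${(i >> 8) & 255}.${i & 255}`,
      owner: `institution #${i}`
    });
  }
  return list;
}

// Runs at module load in EVERY process, so the master and each of
// the 32 workers hold a duplicate of the whole list.
const whitelist = loadWhitelist();

if (cluster.isMaster) {
  for (let i = 0; i < os.cpus().length; i++) cluster.fork();
} else {
  console.log(`worker ${cluster.worker.id}: ${whitelist.length} IPs in RAM`);
}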

I can already see you coming: why didn’t we use MongoDB in the first place? Well, because at first we only had one process and we wanted to be as fast as possible. Using Mongo would have increased the latency of every request by about 10ms (and we check the IP on every request, for security reasons).

So what did we do? We redesigned the app to compute CIDR ranges with npm modules like ip-subnet-calculator, reducing the list from millions of IPs to ~3000 entries with prefixes, as sketched below. And we used Redis as a convergence point for every fork, because for some obscure reason, cluster.isMaster never worked as intended for us.
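Here is a minimal sketch of that reduction step, using the ip-subnet-calculator API (calculate(ipStart, ipEnd) returns the optimal set of subnets covering a range); the addresses below are purely illustrative:

const IpSubnetCalculator = require('ip-subnet-calculator');

// Collapse one declared range from the XLS into optimal CIDR blocks.
const subnets = IpSubnetCalculator.calculate('192.168.0.0', '192.168.3.255');

// Each result exposes the low address and the prefix size.
const cidrs = subnets.map(s => `${s.ipLowStr}/${s.prefixSize}`);
console.log(cidrs); // [ '192.168.0.0/22' ]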

The Node.js Redis client automatically returns hashes as plain JavaScript objects. We stored the ~3000 entries in Redis, with every fork reading the same base, and on every update we incremented an “update timestamp” stored alongside them. And because access can be revoked or granted at any time, flushing the Redis base and refilling it is even faster than searching for the keys to delete (<4ms). Today the API can serve a document’s fulltext in TXT format in less than 30ms in most cases, with IP filtering taking less than a millisecond, because Redis is an in-memory NoSQL store (unlike MongoDB).
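As one plausible reading of that flush-and-refill scheme, here is a sketch using the classic callback-style redis client (the key names ipfilter:cidrs and ipfilter:updated_at, and the one-second polling, are assumptions for illustration):

const redis = require('redis');
const client = redis.createClient();

// Refill side: atomically flush the whitelist, reload it, and bump
// the shared timestamp so every fork knows it must resync.
function refillWhitelist(cidrs, done) {
  client.multi()
    .del('ipfilter:cidrs')
    .sadd('ipfilter:cidrs', cidrs) // e.g. [ '192.168.0.0/22', ... ]
    .set('ipfilter:updated_at', Date.now())
    .exec(done);
}

// Worker side: keep a local cache and reload it only when the
// timestamp has changed, so the per-request check stays in-process.
let lastSeen = 0;
let localCidrs = [];
function syncWhitelist() {
  client.get('ipfilter:updated_at', (err, ts) => {
    if (err || Number(ts) === lastSeen) return;
    client.smembers('ipfilter:cidrs', (err2, cidrs) => {
      if (err2) return;
      localCidrs = cidrs;
      lastSeen = Number(ts);
    });
  });
}
setInterval(syncWhitelist, 1000);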

You can test it here. Even without IP access, you will be able to see the IP filtering in action 🙂
