We'll explore how to use cluster (from the standard library), GraphicsMagick, and streams to process images efficiently without slowing down or blocking the Node.JS event loop.
Skip to the end if you just want a link to the full source!
Background and motivation
When I first showed tsatter.com to the world it slowed some people's computers to a crawl and some really big images even crashed people's browsers. Huh. Apparently showing a bunch of original (file) sized, user submitted gifs on the front page isn't a good idea. Who would've thought.
Scaling up the server CPU enough to run both Node.JS and the image processing smoothly on the same processor core would need a ridiculously powerful processor. Actually I'm not sure if such a thing even exists, especially when there starts to be enough traffic for at least one image to be in the middle of processing all the time. Besides, multicore processors are everywhere nowadays.
With at least one core dedicated entirely to image processing, that core can slow down all it wants without affecting the server in practically any way. This also completely separates the image processing code from everything else, so replacing the worker with something that runs on a different machine altogether would be almost trivially easy.
At the time of writing, the client-side code is still unfinished, but all the relevant parts for this blog post are done, so I thought I'd take a break from coding and write this thing out of the way. Hopefully by the time you're reading this the front-end is ready as well and the whole feature set deployed live.
Challenges we'll face
Efficiency
Node.JS handles everything in a single thread by default and processing images is processor heavy and time consuming. If we process the images in the same thread that Node.JS uses for communicating with clients, every time an image is being processed it will slow down how fast the server processes requests. It might even stop responding for a while. This is unacceptable if we want the website to appear snappy from the user's point of view. You will just lose the user if the first time page load takes 3 seconds.
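Thankfully, the Node.JS standard library has a module called cluster that makes setting up a worker and IPC surprisingly easy. Strictly speaking, cluster forks a separate process rather than a thread, but I'll loosely call it the worker thread in this post.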
File types and sizes
We don't want to waste any more time than we absolutely have to on URLs that don't point to images. We also cannot just download the whole file and then check its size or type. What if it's multiple gigabytes?
Blindly trusting the content-length header from the server's response is not a good idea either, since the header could be intentionally or unintentionally wrong.
Luckily, streams come to the rescue.
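To make the streaming idea concrete, here's a minimal sketch of counting bytes as the chunks arrive and giving up once a limit is passed. The names limitStream and onTooBig are made up for this example, and the EventEmitter at the end is only a stand-in for a real download stream.

```javascript
var EventEmitter = require('events');

// Sketch: watch any stream-like object that emits 'data' events and report
// once more than maxBytes have arrived. limitStream and onTooBig are
// hypothetical names, not part of any library.
function limitStream(stream, maxBytes, onTooBig) {
    var received = 0;
    var tripped = false;

    stream.on('data', function(chunk) {
        if (tripped) {
            return; // Ignore chunks that were already in flight when we gave up
        }
        received += chunk.length;
        if (received > maxBytes) {
            tripped = true;
            onTooBig(received);
        }
    });
}

// Stand-in for a download stream: two chunks of 6 bytes against a 10 byte limit
var fake = new EventEmitter();
limitStream(fake, 10, function(bytes) {
    console.log('gave up after ' + bytes + ' bytes');
});
fake.emit('data', Buffer.alloc(6));
fake.emit('data', Buffer.alloc(6)); // logs: gave up after 12 bytes
```

In a real download, onTooBig is where you would abort the request and delete the partially written file.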
Downloading
So what we want to achieve in this chapter is:
- Find out the file type as soon as possible
- Make sure we don't download images that are too big
Like I said earlier, we shouldn't trust content-length headers alone for the size information. But that doesn't mean we can't use them at all. I think the best use for them is discarding some URLs before we even start a download.
By the way here's the stackoverflow answer where I got the download size stuff from. I then added file type checking.
So let's check the headers with a HEAD request using the always useful request library. I promise we'll get to the really interesting stuff soon.
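var fs = require('fs');
var request = require('request');

var download = function(url, callback) {
    //Ask only for the headers first
    request({
        url: url,
        method: 'HEAD'
    }, function(err, headRes) {
        if(err) {
            console.log(err);
            return callback(err);
        }
        //Header values are strings, so convert before comparing.
        //A missing header gives NaN, which never exceeds maxSize.
        var size = parseInt(headRes.headers['content-length'], 10);
        if (size > maxSize) {
            console.log('Resource size exceeds limit (' + size + ')');
            return callback('image too big');
        }
        ...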
Note that we haven't started saving it to a file yet so no abort or unlink is necessary at this point.
As some of you might've guessed, I'm using the Node.JS callback style here, where the callback's first argument is the error argument, which contains the error when there is one, and null when no error occurred.
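Here's a rough, self-contained sketch of the same header check written as its own error-first callback function. checkDeclaredSize and the declaredSize field are names I made up for this example. It treats a missing header as "unknown" rather than as an error, on the assumption that the size check during streaming will catch anything too big.

```javascript
// Sketch: decide from HEAD response headers whether a download is worth
// starting. checkDeclaredSize is a hypothetical helper. headers is a plain
// object with lower-cased keys, which is how Node hands them over.
function checkDeclaredSize(headers, maxSize, callback) {
    var raw = headers['content-length'];

    // No header at all: not an error, we simply don't know the size yet
    if (raw === undefined) {
        return callback(null, { declaredSize: null });
    }

    var declared = parseInt(raw, 10);
    if (isNaN(declared) || declared < 0) {
        return callback(new Error('invalid content-length: ' + raw));
    }
    if (declared > maxSize) {
        return callback(new Error('image too big (' + declared + ' bytes)'));
    }
    callback(null, { declaredSize: declared });
}

// Error-first style: check err before touching the result
checkDeclaredSize({ 'content-length': '2048' }, 1024, function(err, info) {
    if (err) {
        return console.log('skipping:', err.message);
    }
    console.log('ok to download', info.declaredSize);
});
```

Passing Error objects instead of plain strings is a design choice of this sketch, since it keeps a stack trace around for logging.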
We've decided to download, what's next?
We should start keeping count of how much we have downloaded, and try to deduce the file type.
Deducing the file type is actually pretty easy using magic numbers. We just get a bunch of file type signatures (magic numbers) from, for example, here, and compare them against the first few bytes of the stream. If a match is found, we make a note of the file type and continue downloading. Otherwise we quit and remove the few bytes we've already downloaded.
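//Magic numbers as hex strings. A file matches when it starts with one of them.
var fileTypes = {
    'png': '89504e47',
    //Every JPEG starts with ffd8ff. The fourth byte varies (e0 for JFIF, e1 for Exif).
    'jpg': 'ffd8ff',
    'gif': '47494638'
};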
size = 0; //declared in the previous code block
//Generate a random 10 character string
var filename = getName();
var filepath = imagesPath + filename;
//Open up a file stream for writing
var file = fs.createWriteStream(filepath);
var res = request({ url: url });
var checkType = true;
var type = '';
res.on('data', function(data) {
    //Keep track of how much we've downloaded
    size += data.length;
    if(checkType && size >= 4) {
        //Compare the first four bytes, as hex, against the known signatures
        var hex = data.toString('hex', 0, 4);
        for(var key in fileTypes) {
            if(fileTypes.hasOwnProperty(key)) {
                if(hex.indexOf(fileTypes[key]) === 0) {
                    type = key;
                    checkType = false;
                    break;
                }
            }
        }
        if(!type) {
            //If the type didn't match any of the file types we're looking for,
            //abort the download and remove target file
            res.abort();
            //fs.unlink requires a callback in current Node versions
            fs.unlink(filepath, function() {});
            return callback('not an image');
        }
    }
    if (size > maxSize) {
        console.log('Resource stream exceeded limit (' + size + ')');
        res.abort(); // Abort the response (close and cleanup the stream)
        fs.unlink(filepath, function() {}); // Delete the file we were downloading the data to
        //imageTooBig contains a path to a placeholder image for bigger images.
        //Also set shouldProcess to false, we don't want to process the placeholder
        //image later on
        return callback(null, {path: imageTooBig, shouldProcess: false});
    }
}).pipe(file); //Pipe request's stream's output to a file.
//When download has finished, call the callback.
res.on('end', function() {
    callback(null, {filename: filename, shouldProcess: true, type: type});
});
I encourage you to read the comments for better info on what each line does. If something is still unclear, feel free to ask in the comments section at the end of the article.
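One edge case the type check above doesn't cover is a first chunk shorter than four bytes, in which case the bytes being compared would come from a later chunk. Here's a minimal sketch of a helper that collects the first four bytes across chunks before deciding. createTypeSniffer is a hypothetical name, and the three-byte JPEG signature is my assumption for covering both JFIF and Exif files.

```javascript
// Sketch: detect the image type from the first bytes of a download, even when
// the first chunk is shorter than the longest signature.
var signatures = {
    png: '89504e47',
    jpg: 'ffd8ff', // assumption: a 3 byte prefix, covers ffd8ffe0 and ffd8ffe1
    gif: '47494638'
};
var NEEDED_BYTES = 4;

// createTypeSniffer is a hypothetical helper. Call it once per download.
function createTypeSniffer() {
    var head = Buffer.alloc(0);

    // Feed each chunk in. Returns undefined while undecided, the type name on
    // a match, or null when the bytes match none of the signatures.
    return function sniff(chunk) {
        if (head.length < NEEDED_BYTES) {
            head = Buffer.concat([head, chunk]).subarray(0, NEEDED_BYTES);
        }
        if (head.length < NEEDED_BYTES) {
            return undefined;
        }
        var hex = head.toString('hex');
        for (var type in signatures) {
            if (hex.indexOf(signatures[type]) === 0) {
                return type;
            }
        }
        return null;
    };
}

var sniff = createTypeSniffer();
console.log(sniff(Buffer.from([0x89, 0x50])));       // prints: undefined
console.log(sniff(Buffer.from([0x4e, 0x47, 0x0d]))); // prints: png
```

Inside the 'data' handler you would call sniff(data) on each chunk and abort as soon as it returns null.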
File downloaded, let's process it
The minifying function is pretty straightforward. As to how I came up with it, I googled up the most common ways to reduce file size for all three image types (png, gif, jpg). Most of the results were about ImageMagick so I looked up the GraphicsMagick equivalents, since it's supposed to be faster in most operations.
For gifs I decided to just grab the first frame (hence the + '[0]' for path), since at tsatter.com I will be setting up a system where mouseovering a gif starts playing the original one.
I also decided to resize the images to 500x500px, but if you don't want that, you can just remove the .resize(...) line from each case. By the way, the '>' at the end of the resize line means that it won't resize the image if it's already smaller than the wanted size.
var thumbnailDimensions = {
    width: 500,
    height: 500
};
var minifyImage = function(obj, callback) {
    //The downloaded original file was saved without extension
    //Here we save the new processed file with the extension.
    var origPath = imagesPath + obj.filename;
    var path = origPath + '.' + obj.type;
    var filename = obj.filename + '.' + obj.type;
    switch(obj.type) {
        case 'jpg':
            gm(origPath)
                .interlace('Plane')
                .quality(85)
                .resize(thumbnailDimensions.width, thumbnailDimensions.height + '>')
                .noProfile()
                .write(path, function(err) {
                    if(err) {
                        console.log('err');
                        console.log(err);
                    }
                    else {
                        callback(filename);
                    }
                });
            break;
        case 'png':
            gm(origPath)
                .colors(256)
                .quality(90)
                .bitdepth(8)
                .resize(thumbnailDimensions.width, thumbnailDimensions.height + '>')
                .noProfile()
                .write(path, function(err) {
                    if(err) {
                        console.log('err');
                        console.log(err);
                    }
                    else {
                        callback(filename);
                    }
                });
            break;
        case 'gif':
            gm(origPath + '[0]')
                .resize(thumbnailDimensions.width, thumbnailDimensions.height + '>')
                .noProfile()
                .write(path, function(err) {
                    if(err) {
                        console.log('err');
                        console.log(err);
                    }
                    else {
                        callback(filename);
                    }
                });
            break;
    }
};
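As an aside on the '>' flag used in the resize calls above, here's a small sketch of the arithmetic it implies: leave the image alone if it already fits, otherwise scale it down while keeping the aspect ratio. fitWithin is a made-up helper, and GraphicsMagick's exact rounding may differ from Math.round.

```javascript
// Sketch of the "shrink only if larger" behaviour of the '>' geometry flag.
// fitWithin is a hypothetical helper, not a gm or GraphicsMagick function.
function fitWithin(width, height, maxWidth, maxHeight) {
    // Already small enough: nothing to do
    if (width <= maxWidth && height <= maxHeight) {
        return { width: width, height: height, resized: false };
    }
    // Use the tighter of the two ratios so both sides end up inside the box
    var scale = Math.min(maxWidth / width, maxHeight / height);
    return {
        width: Math.max(1, Math.round(width * scale)),
        height: Math.max(1, Math.round(height * scale)),
        resized: true
    };
}

console.log(fitWithin(1000, 500, 500, 500)); // width 500, height 250, resized
console.log(fitWithin(300, 200, 500, 500));  // unchanged, not resized
```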
The result? Even without the resize the file size usually drops by over 50% without too much decrease in quality. This could still be improved a lot, so if you happen to know more about this, please let me and others know in the comments below!
You could also use all the other features of the gm library here. For example, I've been thinking about fusing a barely noticeable, see-through shape that looks like a "play" button straight into the gif frame for tsatter.com. Or you could add your website's watermark. Or maybe programmatically generate comics out of a bunch of images? I don't know.
Putting it all together
Alright, for the last part we're going to set up the functions for transmitting messages between the main thread and the worker thread. Luckily in Node.JS and cluster this is really easy. Both threads have a function for sending messages to the other, and both threads can set up a function that receives messages from the other. We send and receive normal JavaScript objects. I promised it was going to be easy!
Worker thread
Let's introduce the final piece of code that belongs on the worker thread side of the system. Remember, this last excerpt and all the ones before it belong in their own file that is dedicated to the worker process.
process.on('message', function(msg) {
    download(msg.url, function(err, obj) {
        if(err) {
            return;
        }
        var resObj = {
            src: msg.url,
            type: obj.type
        };
        if(obj.shouldProcess === true) {
            minifyImage(obj, function(filepath) {
                resObj.thumbnail = filepath;
                process.send(resObj);
            });
        }
        else {
            //Not processed: download() handed back the placeholder image
            //in obj.path, so use that as the thumbnail
            resObj.thumbnail = obj.path;
            process.send(resObj);
        }
    });
});
- process.on(...) sets up a new listener function for messages from the main thread.
- process.send(object) sends messages to the main thread.
Real simple.
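One thing the handler above doesn't do is tell the main thread when a download fails. If you want that, a minimal sketch could build a reply for every job, whether it succeeded or not. buildReply and the id, ok and error fields are my own assumptions for this example and are not part of the tsatter.com code.

```javascript
// Sketch: build a reply for every job, success or failure, so the main side
// can always tell what happened to a URL. buildReply and the id/ok/error
// fields are hypothetical.
function buildReply(msg, err, result) {
    if (err) {
        return {
            id: msg.id,
            src: msg.url,
            ok: false,
            // The download code passes plain strings as errors, so accept both
            error: err instanceof Error ? err.message : String(err)
        };
    }
    return {
        id: msg.id,
        src: msg.url,
        ok: true,
        type: result.type,
        thumbnail: result.thumbnail
    };
}

// In the worker this would be passed to process.send(...)
console.log(buildReply({ id: 1, url: 'http://example.com/a.gif' }, 'not an image'));
console.log(buildReply({ id: 2, url: 'http://example.com/b.png' }, null,
    { type: 'png', thumbnail: 'abcdefghij.png' }));
```

The id field only helps if the main thread attaches one when it sends the job, for example a running counter.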
Main thread
The code in this section goes in the file or files that you run directly using node (or whatever is required or otherwise included in those files).
We set up a reference to the worker process and set which file we run as the worker process.
var cluster = require('cluster');
//Node 16 and later also offer this under the name cluster.setupPrimary
cluster.setupMaster({
    exec: __dirname + '/imageProcessWorker.js'
});
Next, we command cluster.fork() to spawn a new worker process. For this article we only really need one worker, but if you have more than two cores you could spawn more of them and set up a way to decide which worker needs more work, but this is outside the scope of this post.
Also, we set up a message handler that receives objects from the worker thread. As an example, in the message handler at tsatter.com I save the file paths I receive from the worker thread to a database and inform clients about a new media delivery being ready.
var worker = cluster.fork();
worker.on('message', messageHandler);
function messageHandler (msg) {
    //Do something with the information from worker thread.
}
And as the very last code excerpt, processUrl is the function that I call from my actual server code. worker.send(...) is then used to send an object to the worker thread for processing.
var processUrl = function(url) {
    worker.send({
        url: url
    });
};
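If you do end up spawning several workers, one simple way to spread the URLs between them is plain round-robin. Here's a minimal sketch under that assumption. createDispatcher is a made-up name, and it doesn't try to figure out which worker is the least busy.

```javascript
// Sketch: hand jobs to several workers in round-robin order.
// createDispatcher is a hypothetical helper. Each worker only needs a
// send(message) method, which the objects returned by cluster.fork() have.
function createDispatcher(workers) {
    if (workers.length === 0) {
        throw new Error('need at least one worker');
    }
    var next = 0;
    return function processUrl(url) {
        var worker = workers[next];
        next = (next + 1) % workers.length; // wrap around after the last one
        worker.send({ url: url });
        return worker;
    };
}

// With cluster this could look like:
//   var processUrl = createDispatcher([cluster.fork(), cluster.fork()]);
// Here two fake workers stand in so the sketch runs on its own.
var fakeWorkers = ['A', 'B'].map(function(name) {
    return {
        send: function(msg) {
            console.log('worker ' + name + ' got ' + msg.url);
        }
    };
});
var processUrl = createDispatcher(fakeWorkers);
processUrl('http://example.com/1.png'); // worker A
processUrl('http://example.com/2.png'); // worker B
processUrl('http://example.com/3.png'); // worker A again
```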
Conclusion: a short summary of the program flow
- processUrl is called with a URL. processUrl sends it to the worker thread using worker.send(...)
- URL is received at the worker thread at process.on('message', ...)
- Downloading is considered and possibly attempted at the download function.
  - If it fails or we just don't want to download it at all, we just stop all processing and never notify the main thread of anything.
- If the download succeeded and the shouldProcess variable is set to true, the minifyImage(...) function is called
  - If the image doesn't need processing, we skip the rest of the steps and just send it to the main thread using process.send(...)
- After minifying is done we send the results to the main thread using process.send(...)
- Results are received in the messageHandler function, and something is hopefully done with them!
The End
Here are the promised links to the full source:
- Main thread: https://github.com/Tsarpf/Tsatter/blob/master/nodeapp/environment/server/imageProcessor.js
- Worker thread: https://github.com/Tsarpf/Tsatter/blob/master/nodeapp/environment/server/imageProcessWorker.js
That should be it. Hopefully it's of some use to someone. Thanks for reading!