Working with binary data in Node.js

Recently I spent some time working with streams of binary data in Node.js. Here are some lessons I learned as a result.

Binary length

Node.js makes it easy to work with strings because, under the hood, JavaScript strings are sequences of 16-bit code units (UTF-16), which represent ASCII and most other characters as a single unit each. Consequently:

'Cafe'.length === 'Café'.length;
// true

When working with binary data things get a little bit more complicated:

Buffer.byteLength('Cafe') === Buffer.byteLength('Café');
// false

This is because, in UTF-8 (the default encoding), it takes 1 byte to represent 'e' but 2 bytes to represent 'é'. It is therefore important to use the byte length, rather than the string length, when encoding your data.
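Checking the byte lengths themselves makes the difference visible (the values assume the default 'utf8' encoding):

Buffer.byteLength('Cafe');
// 4

Buffer.byteLength('Café');
// 5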

Encoding the data

When working with streams of binary data it is necessary to somehow identify individual messages within the stream.

The method I used was to encode the data in a length-value format (a simplified version of the type-length-value encoding) consisting of a fixed-size length field followed by a variable-size block of data.

Here is an example of creating a message:

var message = 'ALL YOUR BASE',
    length = Buffer.byteLength(message),
    // 4 bytes = 32 bits for the length field
    buffer = Buffer.alloc(4 + length);

buffer.writeUInt32BE(length, 0);
buffer.write(message, 4);

buffer;
// <Buffer 00 00 00 0d 41 4c 4c 20 59 4f 55 52 20 42 41 53 45>
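Note that the first 4 bytes, 00 00 00 0d, hold the length (13) of the 13-byte message that follows.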

The receiver would then read the first 4 bytes of the stream to find out the message length in bytes, and then read that number of bytes from the stream to get the message data.

// example assumes that 'buffer' contains the entire message
// don't assume this in real code ;-)
var length = buffer.readUInt32BE(0),
    message = buffer.slice(4, length + 4).toString();

message;
// 'ALL YOUR BASE'

The process would then be repeated as new data arrives.
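As a rough sketch of that loop (assuming, for simplicity, that the buffer ends exactly on a message boundary; readMessages is a hypothetical helper, not part of any API):

// hypothetical helper: extract each complete message from a buffer
// that is assumed to end exactly on a message boundary
function readMessages(buffer) {
    var messages = [],
        offset = 0;

    while (offset + 4 <= buffer.length) {
        var length = buffer.readUInt32BE(offset);
        messages.push(buffer.slice(offset + 4, offset + 4 + length).toString());
        offset += 4 + length;
    }

    return messages;
}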

Chunking

You will of course have noticed the warning in the example above. In reality, given a reasonable amount of data, the receiver will get this data in chunks. These chunks are buffers filled with binary data, but there is no guarantee that a chunk contains the entire message, the entire length field, or even a whole character (in the case of multi-byte characters)!

Consider the following example in which our simulated server converts incoming chunks into a string on receipt. This is a technique frequently seen in examples of Node.js TCP servers.

var assert = require('assert'),
    events = require('events');

// message data
var message = 'naïveté',
    buffer = Buffer.from(message);

// split the buffer in the middle of the two-byte 'ï' character
var chunk1 = buffer.slice(0, 3),
    chunk2 = buffer.slice(3);

// simulate a server using a basic event emitter
var server = new events.EventEmitter();

// convert incoming chunks to a string on receipt
// a common technique in examples on the Internet! ;-)
server.on('data', function (chunk) {
    this._data = this._data || '';
    this._data += chunk.toString();
});

server.on('end', function () {
    assert(this._data === message);
});

// send data
server.emit('data', chunk1);
server.emit('data', chunk2);
server.emit('end');

If you run this example you will see that the assertion fails: each half of the split 'ï' is invalid UTF-8 on its own, so the receiver corrupts the message when it decodes the chunks separately.
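You can demonstrate the corruption in isolation; decoding the first three bytes of 'naïveté' on their own yields a U+FFFD replacement character in place of the truncated 'ï':

Buffer.from('naïveté').slice(0, 3).toString();
// 'na�'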

To avoid this corruption the receiver must get all message data before converting it to a string.

// collect incoming chunks
server.on('data', function (chunk) {
    this._chunks = this._chunks || [];
    this._chunks.push(chunk);
});

// convert the chunks to string once all data is available
server.on('end', function () {
    this._data = Buffer.concat(this._chunks).toString();
    assert(this._data === message);
});

It is left as an interesting exercise for the reader to implement a receiver that would accept a stream of messages. ;)