Node.js fs.ReadStream Performance Issues

fs.createReadStream allocates large Buffer objects on the fly, which can have a significant impact on performance. Instead, acquiring buffers from a pool and returning them when done can boost performance by an average of 67% across a wide range of file sizes.

$ bench-runner -g "file -> null"

About fs.ReadStream

Internally, fs.createReadStream uses the fs.ReadStream class. In its _read() implementation, ReadStream allocates Buffer objects as it reads through the file. These are later discarded by the downstream Writables and picked up by GC. With larger files, this can mean a lot of time spent allocating Buffer objects only to have them garbage collected, which by itself adds up.

_read()
ReadStream.prototype._read = function(n) {
  ...
  if (!pool || pool.length - pool.used < kMinPoolSpace) {
    // discard the old pool.
    allocNewPool(this._readableState.highWaterMark);
  }
  ...
allocNewPool()
function allocNewPool(poolSize) {
  pool = Buffer.allocUnsafe(poolSize);
  pool.used = 0;
}

Buffer Pool

A buffer pool would hold multiple resources such as Buffer objects. Clients such as fs.ReadStream would borrow buffers and return them when no longer needed. The pool would have a cap on the total number of resources it manages. An eviction scheme could be used to reduce allocated capacity when load is low.
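
A minimal sketch of such a pool (the names and eviction policy here are illustrative, not an existing Node API): it hands out fixed-size Buffer objects, caps how many it will create, and keeps released buffers on a free list for reuse.

// Illustrative buffer pool: fixed-size buffers, capped capacity, free list.
class BufferPool {
  constructor( bufferSize, maxBuffers ) {
    this._bufferSize = bufferSize;
    this._maxBuffers = maxBuffers;
    this._free = [];
    this._allocated = 0;
  }

  // Borrow a buffer: reuse a free one, allocate if under the cap, else null.
  acquire() {
    if ( this._free.length > 0 ) return this._free.pop();
    if ( this._allocated < this._maxBuffers ) {
      this._allocated++;
      return Buffer.allocUnsafe( this._bufferSize );
    }
    return null;  // caller falls back to a plain allocation
  }

  // Return a buffer to the free list for reuse.
  release( buf ) {
    this._free.push( buf );
  }

  // Simple eviction: drop free buffers down to `keep` when load is low.
  shrink( keep = 0 ) {
    const dropped = Math.max( this._free.length - keep, 0 );
    this._allocated -= dropped;
    this._free.length = this._free.length - dropped;
  }
}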

A Buffer Pool Scheme for Node Streams

A buffer pool implementation for Node streams will need some thought, since borrowing and release of Buffer objects occur at different points in the stream.

Reference Counting Buffer Objects

One possibility is to maintain a reference count on Buffer objects created by the pool and to expose _ref() and _unref() methods on the Buffer. The pool calls _ref() before loaning out buffer objects. When the _refCount drops to 0, a release() method is invoked to return the Buffer to the pool.

release() is responsible for returning the buffer back to the pool.
function monkey_patch_buffer_object( obj, release ) {
  obj._refCount = 0;
  obj._ref = () => obj._refCount++;
  obj._refRelease = release ? release : () => {};
  obj._unref = () => {
    if ( obj._refCount > 0 ) {
      obj._refCount--;
      if ( obj._refCount === 0 ) {
        obj._refRelease( obj );
      }
    }
  };
}
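
For example, a pool might wire this up as follows when loaning out a buffer (hypothetical wiring; pool, acquire(), and release() are assumptions from the pool sketch above, not Node APIs):

// The pool patches the buffer so the final _unref() returns it to the pool,
// and takes the initial reference itself before loaning the buffer out.
var buf = pool.acquire();
monkey_patch_buffer_object( buf, ( b ) => pool.release( b ) );
buf._ref();    // pool's reference, per the scheme described above
// ... downstream: Readable.push() calls _ref(), the Writable calls _unref() ...
buf._unref();  // when the count reaches 0, release() puts the buffer back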

Appropriate reference counting logic is needed in Readable.prototype.push and Writable.prototype.write:

Readable
Readable.prototype.push = function ( chunk, encoding ) {
    ...
   if ( chunk && typeof chunk._ref === 'function' ) {
      chunk._ref();
    }
    ...
};
Writable
Writable.prototype.write = function ( chunk, encoding, cb ) {
  ...
  // once the writable is done with the chunk, release our reference to it
  if ( chunk && typeof chunk._unref === 'function' ) {
    chunk._unref();
  }
  ...
};
_writev(), which handles multiple buffers at once, needs a closer look to ensure _unref() is invoked correctly on each chunk.
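
One possible shape (a sketch only, not actual Node internals): wrap the batched write so that every chunk in the batch is unref'd once the write completes.

// Wrap a stream's _writev so each { chunk, encoding } entry in the batch
// is unref'd after the batched write finishes.
function wrapWritev( writev ) {
  return function ( chunks, cb ) {
    writev.call( this, chunks, ( err ) => {
      for ( const entry of chunks ) {
        if ( entry.chunk && typeof entry.chunk._unref === 'function' ) {
          entry.chunk._unref();
        }
      }
      cb( err );
    } );
  };
}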

The Test

In this test, we simulate the use of a buffer pool by reusing a single Buffer object over the lifetime of the read. While this corrupts the data being read, it is a quick way to establish the potential performance gain from using a pool.

We read files of varying sizes and write them to a null writer. The test settings are:

  • encoding: Chunks are encoded as Buffer objects to avoid conversion overhead.

  • highWaterMark: Set to 64KiB by fs.ReadStream as an internal default.

  • file size: Ranges from 1KiB to 1GiB, doubling at each step.
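
The pipeline for each file size looks roughly like this (an illustrative sketch; the actual harness lives in the stream-benchmarks repo referenced below):

const fs = require( 'fs' );
const { Writable } = require( 'stream' );

// Null writer: accept chunks and discard them.
const nullWriter = new Writable( {
  write( chunk, enc, cb ) { cb(); }
} );

// 'testfile.bin' stands in for one of the generated test files.
fs.createReadStream( 'testfile.bin' ).pipe( nullWriter );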

About the Graph
  • Performance: higher is better.

  • Both axes are log10.

About the Graph
  • Shows percentage boost in performance with a simulated buffer pool.

  • X axis is log10.

  • Y axis is linear.


Observations

  • The simulated buffer pool boosts performance across the tested file sizes by an average of 67%.

Node.js Transform Streams Performance

What’s the impact of adding one or more Transform streams to a Node.js pipeline?

$ bench-runner -g "passthrough.*false.*default.*buffer"

The Test

Here we look at the impact on our baseline memory to null stream.

streams passthrough
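
The pipeline chains PassThrough streams between the reader and the writer, roughly like this (an illustrative sketch; reader and writer stand in for the memory reader and null writer used in the baseline test):

const { PassThrough } = require( 'stream' );

// Insert `count` PassThrough transforms between reader and writer.
function buildPipeline( reader, writer, count, opts ) {
  let tail = reader;
  for ( let i = 0; i < count; i++ ) {
    tail = tail.pipe( new PassThrough( opts ) );
  }
  return tail.pipe( writer );
}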

The test settings are:

  • encoding: Chunks are encoded as Buffer objects. As we noted earlier, doing so won’t incur any conversion overhead, and in the absence of other computational effort in the stream, performance should not change as chunk size is varied.

  • chunk size: Ranges from 2KiB to 512KiB, doubling at each step.

  • highWaterMark:

    • default: Use the default highWaterMark (16KiB)

    • low: Force the highWaterMark to be always lower than the chunk size

    • high: Force the highWaterMark to be always higher than the chunk size

About the Graphs
  • count: The number of PassThrough transform streams.

  • Performance: higher is better.

  • Both axes are log10

Default High Water Mark

default hwm

See the post on stream performance with Buffer objects for a discussion of why the traces dip around 16KiB.

High Water Mark > Chunk Size

high hwm

High Water Mark < Chunk Size

low hwm

In Object Mode

object mode

Thoughts

  • Across the board, performance drops as more Transform streams are added. This isn’t surprising, since both Readable and Writable use process.nextTick for synchronous callbacks (see the nextTick & setImmediate post).

  • Bottomline: Keep your pipeline depth small.

Node.js Stream Performance with Strings

How fast can we push the same String object to a null writer? What is the impact of varying string length, high water mark, and object mode?

$ bench-runner -g "iter.*string"

The Test

In this test, we push string objects from a memory reader to a null writer, using the following settings:

  • encoding: Chunks are encoded as string objects.

  • highWaterMark:

    • default: Use the default highWaterMark (16KiB).

    • low: Force the highWaterMark to be always lower than the string length.

    • high: Force the highWaterMark to be always higher than the string length.

  • string length: From 2KiB to 512KiB in steps of powers of 2.

About the Graphs
  • Performance: Higher is better.

  • Both axes are log10

string

Observations

  • Pushing string objects incurs computational overhead due to the internal conversion to Buffer objects. Therefore, unlike the previous case, chunk size matters and performance is not constant as string size varies.

  • Chunk Size vs highWaterMark

    • Performance is slightly better when chunk size < highWaterMark (high plot) for lower chunk sizes.

    • Conversely, performance is slightly lower when chunk size > highWaterMark, for lower chunk sizes.

    • In the case of the default highWaterMark, the trace follows high up to 16KiB after which it follows the low plot.

  • Operating the stream in object mode restores performance back to those seen with pushing Buffer objects since strings are no longer encoded/decoded.

Node.js Stream Performance with Buffer Objects

How fast can we push the same Buffer object to a null writer? What is the impact of varying chunk size, high water mark, encoding and object mode?

$ bench-runner -g "iter.*buffer"

Memory Reader → Null Writer

Our base case is a trivially simple pipeline consisting of a reader that repeatedly pushes the same chunk of memory (a memory reader) to a writer that simply discards it (a null writer). By reusing memory, the memory reader avoids any memory allocation overhead.

Memory Reader

MemoryReader is a Readable stream which uses a generator/iterator to push Buffer objects. The generator returns the same memory chunk for every call to next(), avoiding the overhead of allocating memory.

// Push the generator's next value; a null value ends the stream.
function next( stream ) {
  var next = stream._generator.next();
  return stream.push( next.done ? null : next.value );
}

GeneratorReader.prototype._read = function ( n ) {
  while ( next( this ) ) {}
};

Null Writer

NullWriter is a Writable stream which accepts chunks and does nothing with them.

NullWriter.prototype._write = ( chunk, enc, cb )  => cb();

The Test

A memory reader is configured to iteratively push a chunk of data to a null writer.

streams baseline
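
Putting the two together, a self-contained version of the baseline pipeline might look like this (constructor-options style rather than the prototype style above; sizes and the chunk count are illustrative):

const { Readable, Writable } = require( 'stream' );

const chunk = Buffer.allocUnsafe( 64 * 1024 );   // reused for every push
let remaining = 1000;

// Memory reader: push the same Buffer until the chunk budget runs out.
const reader = new Readable( {
  highWaterMark: 128 * 1024,
  read() { this.push( remaining-- > 0 ? chunk : null ); }
} );

// Null writer: discard every chunk.
const writer = new Writable( {
  highWaterMark: 128 * 1024,
  write( c, enc, cb ) { cb(); }
} );

reader.pipe( writer );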

We use the following settings:

  • encoding: Chunks are encoded as buffers to avoid conversion overhead.

  • chunk size: Ranges from 2KiB to 512KiB, doubling at each step.

  • highWaterMark:

    • default: Use the default highWaterMark (16KiB).

    • low: highWaterMark = chunk size / 2, always lower than chunk size.

    • high: highWaterMark = chunk size * 2, always higher than chunk size.

About the Graphs
  • Performance: higher is better.

  • Both axes are log10.


Observations

  • Since we’re pushing Buffer objects, with no other significant computational activity on the data, performance is invariant of chunk size.

  • Chunk Size vs highWaterMark

    • Performance is best when highWaterMark > chunk size (high trace).

    • Conversely, performance is always poor when highWaterMark < chunk size (low trace), since buffers are sliced and copied to match downstream high water mark requirements.

    • When highWaterMark is default, performance follows the high trace up to the default value of 16KiB, after which it drops and follows the low trace.

  • In object mode, performance is invariant of high water mark and chunk size, with a marginal boost over the best non-object mode case. In this mode, all buffer slicing/copying is bypassed and buffers are simply queued and handed downstream.

Thoughts

  • The relationship between chunk size and high water mark matters.

  • Since there’s no parameter negotiation process nor any pipeline-wide configuration, settings must be tracked and matched per stream in the pipeline. This is particularly true when operating in object mode.

Node.js Streams Performance

In this series of posts, we look at the performance of Node.js streams. We want to see how fast a stream component, or a pipeline of stream nodes, can run as the stream parameters encoding, chunk size, high water mark, and object mode are varied. We also look at performance under varying pipeline depth.

Methodology

We use benchmark.js as the primary method for determining performance. We also profile tests with Node’s profiling tools to ensure the results are valid (the bulk of the time is spent in the test function, rather than setup/teardown, garbage collection, etc.).

Table 1. Test Machine

Model                 Mac Pro
Processor             Quad-Core Intel Xeon E5, 3.7 GHz
L2 Cache (per Core)   256 KB
L3 Cache              10 MB
Memory                64 GB
OS                    macOS Sierra 10.12.5

Reproducing Tests

Install bench-runner:

$ npm install -g bench-runner

Clone this repository:

$ git clone https://github.com/venkatperi/stream-benchmarks
$ cd stream-benchmarks

Run bench-runner (this will take some time):

$ bench-runner -g <see test for filter>

Node.js nextTick & setImmediate Performance

How fast would our code run if it deferred one or more times with process.nextTick() or setImmediate() (see event loop for background)?

$ bench-runner -g "nextTick|setImmediate"

The Test

We stack up one or more calls and time it. Simple.

// Defer n times via f (process.nextTick or setImmediate), then invoke arg.
const fnapply = ( f, n, arg ) =>
      f( n > 1
          ? () => fnapply( f, n - 1, arg )
          : () => arg() );
About the Graphs
  • Performance: higher is better.

  • Both axes are log10

process.nextTick

suite( 'nextTick', () => {
  for ( let i = 1; i <= maxCount; i++ ) {
    bench( i, ( cb ) =>
      fnapply( process.nextTick, i, cb ) );
  }
} );

nextTick

setImmediate
suite( 'setImmediate', () => {
  for ( let i = 1; i <= maxCount; i++ ) {
    bench( i, ( cb ) =>
      fnapply( setImmediate, i, cb ) );
  }
} );

setImmediate

Thoughts

  • process.nextTick runs faster, as it should. But we aren’t comparing the two here.

  • Deferring is part of life in Node’s cooperative world. However, if you have a deep defer stack in the middle of your job, expect a bit of a slowdown, a.k.a. increased latency.

Node.js Buffer Allocation Performance

How fast can Node.js allocate a 2KiB Buffer? What about a 128MiB Buffer?

$ bench-runner -g "buffer allocation"

The Test

We allocate Node.js Buffer objects of increasing sizes from 2KiB to 128KiB, doubling at each step. Buffers are allocated via Buffer.allocUnsafe().
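
In bench-runner terms, the suite might look roughly like this (a sketch; the sizes mirror the description above, and the actual benchmark code may differ):

suite( 'buffer allocation', () => {
  for ( let size = 2 * 1024; size <= 128 * 1024; size *= 2 ) {
    bench( `${size / 1024}KiB`, () => Buffer.allocUnsafe( size ) );
  }
} );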

About the Graph
  • Higher performance means faster allocation.

  • Both axes are log10.

performance

Observations

  • Node allocates smaller Buffers faster, clearly. Under the hood, Node calls malloc(3). While numerous wars have been waged over the performance of malloc() vs alternatives, the fact remains that allocating larger buffers takes more time.

Thoughts

  • If you need large Buffer objects on a regular basis, you may hit a significant performance bottleneck if you allocate on demand and let GC reclaim them when you’re done. Instead, consider using a buffer pool.

bench-runner

bench-runner is a mocha-like benchmark.js runner for Node.js.

Install with npm:

$ npm install bench-runner -g
$ mkdir benches
$ $EDITOR benches/string.js #open with your favorite editor

In your editor:

suite( 'find in string', () => {
  bench( 'RegExp.test', () => /o/.test( 'Hello World!' ) );
  bench( 'String.indexOf', () => 'Hello World!'.indexOf( 'o' ) > -1 );
  bench( 'String.match', () => !!'Hello World!'.match( /o/ ) );
} );

Back in the terminal:

$ bench-runner -f fastest
[find in string]
  RegExp.test x 11,841,755 ops/sec ±3.00% (89 runs sampled)
  String.indexOf x 30,491,086 ops/sec ±0.45% (92 runs sampled)
  String.match x 8,287,739 ops/sec ±2.57% (88 runs sampled)
fastest: String#indexOf

A Groovy lib for Xcode pbxproj files

See github
// snippet from the test (spock) spec:

    def "object types and keys are ok"( String key, String klass ) {
        expect:
        proj.objects[ key ].class.simpleName == klass

        where:
        key                        | klass
        'F98F991811A4A86400D21E1F' | 'PBXBuildFile'
        '0597689803D6472D00C9149F' | 'PBXFileReference'
        '059768A803D6494200C9149F' | 'PBXFileReference'
        '05CA34F70433CFDF00C9149F' | 'PBXFileReference'
        'F98F991611A4A85000D21E1F' | 'PBXFileReference'
        'F98F991411A4A85000D21E1F' | 'PBXFrameworksBuildPhase'
        '0597688C03D6465000C9149F' | 'PBXGroup'
        '0597689703D646C100C9149F' | 'PBXGroup'
        'F98F991511A4A85000D21E1F' | 'PBXNativeTarget'
        '0597689003D6465000C9149F' | 'PBXProject'
        'F98F991311A4A85000D21E1F' | 'PBXSourcesBuildPhase'
        '05B1F3D8089068690080B6E2' | 'XCBuildConfiguration'
        'F98F991711A4A85100D21E1F' | 'XCBuildConfiguration'
        '05B1F3D7089068690080B6E2' | 'XCConfigurationList'
        'F98F991911A4A8B900D21E1F' | 'XCConfigurationList'
    }

    def "file references are ok"( def key, def path ) {
        expect:
        proj.objects[ key ].path == path

        where:
        key                        | path
        "0597689803D6472D00C9149F" | "keymgr.c"
        "059768A803D6494200C9149F" | "keymgr.h"
        "05CA34F70433CFDF00C9149F" | "testcases/basic-eh-app.cc"
        "F98F991611A4A85000D21E1F" | "libkeymgr.dylib"
    }

    def "verify product group -- PBXRef links"( def index, def path ) {
        given:
        def children = proj.objects[ "0597688C03D6465000C9149F" ].children

        expect:
        children[ index ].path == path

        where:
        index | path
        1     | "keymgr.c"
        2     | "keymgr.h"
        3     | "testcases/basic-eh-app.cc"
    }