Performance
Quick stats
Quick stats in optimal conditions:
- jobs executed per second: ~183,000
- average latency from add_job to job execution start: 4.16ms (max: 13.84ms)
- jobs queued per second from single add_jobs batch call: ~202,000
- time to start and immediately shut down the worker: 68ms
The above stats were achieved with this configuration:
const preset = {
  worker: {
    connectionString: "postgres:///graphile_worker_perftest",
    fileExtensions: [".js", ".cjs", ".mjs"],
    concurrentJobs: 24,
    maxPoolSize: 25,
    // Batching options (see below)
    localQueue: { size: 500 },
    completeJobBatchDelay: 0,
    failJobBatchDelay: 0,
  },
};
Performance statement
graphile-worker is not intended to replace extremely high performance
dedicated job queues for Facebook scale; it's intended to give regular
organizations the fastest and easiest to set up job queue we can achieve without
needing to expand your infrastructure beyond Node.js and PostgreSQL. But this
doesn't mean it's a slouch by any means: it achieves an average
latency from triggering a job in one process to executing it in another of under
5ms, and a well-specced database server can queue around 172,000 jobs per second
from a single PostgreSQL client, and can process around 196k jobs per second
using a pool of 4 Graphile Worker instances, each with concurrency set to 24.
For many organizations, this is more than they'll ever need.
Horizontal scaling
graphile-worker is horizontally scalable to a point. Each instance has a
customizable worker pool; this pool defaults to size 1 (only one job at a time
on this worker), but depending on the nature of your tasks (i.e. assuming
they're not compute-heavy) you will likely want to set this higher to
benefit from Node.js' concurrency. If your tasks are compute-heavy you may
still wish to set it higher and then use Node's child_process or
worker_threads to share the compute load over multiple cores without
significantly impacting the main worker's run loop.
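As a rough illustration of that last point, a compute-heavy task could hand its work to a worker thread so the main run loop stays responsive. This is a minimal sketch rather than anything shipped with Graphile Worker: `runInThread`, the inline worker source and the `sumTask` task are hypothetical names, and it assumes a CommonJS task file.

```javascript
const { Worker } = require("node:worker_threads");

// Source for the worker thread: a stand-in for real CPU-heavy work.
// With `eval: true`, Node runs this string as a CommonJS script.
const workerSource = `
  const { parentPort, workerData } = require("node:worker_threads");
  let sum = 0;
  for (let i = 1; i <= workerData.n; i++) sum += i;
  parentPort.postMessage(sum);
`;

// Hypothetical helper: runs the computation off the main thread and
// resolves with the value the worker posts back.
function runInThread(n) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, { eval: true, workerData: { n } });
    worker.once("message", resolve);
    worker.once("error", reject);
    worker.once("exit", (code) => {
      if (code !== 0) reject(new Error(`Worker thread exited with code ${code}`));
    });
  });
}

// Hypothetical task: the main run loop only awaits the result, so other
// concurrent jobs on this instance are not blocked by the computation.
async function sumTask(payload, helpers) {
  const total = await runInThread(payload.n);
  helpers.logger.info(`Sum of 1..${payload.n} is ${total}`);
}

module.exports = sumTask;
```

Spawning a thread per job has its own overhead; for sustained load a reusable pool of threads would likely be a better fit, but that is beyond this sketch.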
Batching
Graphile Worker is limited by the performance of the underlying Postgres database; when you hit this limit, performance will start to go down (rather than up) as you add more workers. To mitigate this, batching functionality can be enabled via the configuration. If you have a high throughput job queue (even intermittently) we recommend that you enable batching; the settings to use will depend on your setup, but consider these as a starting point:
const concurrentJobs = 10; // Set this to whatever suits
const preset = {
  worker: {
    concurrentJobs,
    // Sensible default for local queue size: one more than concurrency
    localQueue: { size: concurrentJobs + 1 },
    completeJobBatchDelay: 250,
    failJobBatchDelay: 250,
  },
};
Read on for more details.
localQueue
The local queue feature is disabled by default, but it can have a significant impact on high throughput queues, both reducing database load and increasing throughput - sometimes by an order of magnitude!
The local queue enables each pool to pull down a configurable number of jobs up front so its workers can start a new job the moment their previous one completes without having to request a new job from the database. This batching also reduces load on the database since there are fewer total queries per second, and table scans are allowed to return additional results. However, it's a trade-off since more jobs are checked out (locked) but not necessarily actively being worked on, so:
- if a worker doesn't exit gracefully (e.g. it crashes or is forcefully killed), more jobs will remain locked and unable to execute until the 4 hour limit expires. (Mitigation: Graphile Worker Pro.)
- execution latency may increase if jobs exist in one worker's local queue whilst another worker sits idle. (Mitigation: preset.worker.localQueue.ttl determines how long tasks may sit in the local queue without being worked on.)
If your tasks are somewhat slow (taking many tens of seconds or more) and your
throughput is very low or you need high priority tasks to be executed ASAP then
you should set your localQueue size to either a low number (2+) or disable it
entirely (-1). When doing so, you can leave completeJobBatchDelay and
failJobBatchDelay enabled.
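The two options described here might look like the following presets. This is only a sketch: the variable names are illustrative, and the 250 delays are simply reused from the starting point suggested earlier.

```javascript
// Option 1: keep the local queue, but small, so few jobs sit locked
// in a worker that is not yet running them.
const smallLocalQueuePreset = {
  worker: {
    concurrentJobs: 1,
    localQueue: { size: 2 },
    completeJobBatchDelay: 250,
    failJobBatchDelay: 250,
  },
};

// Option 2: disable the local queue entirely (-1), while still batching
// job releases via the complete/fail delays.
const noLocalQueuePreset = {
  worker: {
    concurrentJobs: 1,
    localQueue: { size: -1 },
    completeJobBatchDelay: 250,
    failJobBatchDelay: 250,
  },
};
```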
completeJobBatchDelay / failJobBatchDelay
These settings cause job releasing (complete/fail) to become asynchronous, allowing multiple completes/fails in a small window of time to be released via the same roundtrip to the database, significantly reducing load on the database and WAL churn.
The trade-off is that jobs will not be released immediately, so in the event of a catastrophic failure (worker crash or forced termination) more jobs may be left in the locked state than otherwise. So long as you ensure that your workers always exit cleanly/gracefully, these delays can significantly reduce database load and improve throughput with minimal additional risk.
In general, we'd advise all users to enable these settings, even if they are set
to 0 for minimal delay. 250 seems a reasonable default - release jobs at
most once every quarter of a second.
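To make the idea concrete, here is a generic delay-based batcher. It is an illustration of the technique only, not Graphile Worker's implementation: `makeBatcher` and the `flush` callback are hypothetical names, and in a real system `flush` would stand in for the single database roundtrip that releases every collected job.

```javascript
// Collects items and hands them to `flush` in one call, at most once per
// `delayMs` window. The first item in a window starts the timer.
function makeBatcher(delayMs, flush) {
  let pending = [];
  let timer = null;

  return function add(item) {
    pending.push(item);
    if (timer === null) {
      timer = setTimeout(() => {
        const batch = pending;
        pending = [];
        timer = null;
        flush(batch); // one call for everything collected in the window
      }, delayMs);
    }
  };
}

// Usage: three "completions" close together end up in a single flush.
const flushed = [];
const releaseJob = makeBatcher(10, (batch) => flushed.push(batch));
releaseJob("job-1");
releaseJob("job-2");
releaseJob("job-3");
```

The sketch also shows where the trade-off comes from: anything still in `pending` when the process dies is never flushed, which is why a clean shutdown matters when these delays are enabled.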
Running the performance tests
To test performance, you can check out the repository and then run
yarn perfTest. This runs three tests:
- a startup/shutdown test to see how fast the worker can start up and exit if there are no jobs queued (this includes connecting to the database and ensuring the migrations are up to date)
- a load test: by default this will run 200,000 trivial jobs with a parallelism of 4 (i.e. 4 node processes) and a concurrency of 24 (i.e. 24 concurrent jobs running on each node process), but you can configure this in perfTest/run.js. (These settings were optimized for an Intel i9-14900K with efficiency cores disabled and running both the tests and the database locally.)
- a latency test: determining how long between issuing an add_job command and the task itself being executed.
perfTest results:
Executed on this machine, running both the workers and the database (and a tonne of Chrome tabs, electron apps, and what not).
With batching
Jobs per second: ~184,000
const preset = {
  worker: {
    connectionString: "postgres:///graphile_worker_perftest",
    fileExtensions: [".js", ".cjs", ".mjs"],
    concurrentJobs: 24,
    maxPoolSize: 25,
    // Batching options (see below)
    localQueue: { size: 500 },
    completeJobBatchDelay: 0,
    failJobBatchDelay: 0,
  },
};
Timing startup/shutdown time...
... it took 68ms
Scheduling 200000 jobs
Adding jobs: 988.425ms
... it took 1160ms
Timing 200000 job execution...
Found 999!
... it took 1156ms
Jobs per second: 183895.49
Testing latency...
[core] INFO: Worker connected and looking for jobs... (task names: 'latency')
Beginning latency test
Latencies - min: 3.24ms, max: 18.18ms, avg: 4.28ms
Without batching
Jobs per second: ~15,600
const preset = {
  worker: {
    connectionString: "postgres:///graphile_worker_perftest",
    fileExtensions: [".js", ".cjs", ".mjs"],
    concurrentJobs: 24,
    maxPoolSize: 25,
    // Batching disabled (default)
    localQueue: { size: -1 },
    completeJobBatchDelay: -1,
    failJobBatchDelay: -1,
  },
};
Timing startup/shutdown time...
... it took 77ms
Scheduling 200000 jobs
Adding jobs: 992.368ms
... it took 1163ms
Timing 200000 job execution...
Found 999!
... it took 12892ms
Jobs per second: 15606.79
Testing latency...
[core] INFO: Worker connected and looking for jobs... (task names: 'latency')
Beginning latency test
Latencies - min: 3.40ms, max: 14.13ms, avg: 4.47ms
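For a quick comparison of the two runs, the arithmetic below derives the queueing rate and the batching speedup from the logged figures. It is just a convenience sketch; the numbers are copied from the output above and the variable names are illustrative.

```javascript
// Figures copied from the perfTest output above.
const jobCount = 200000;
const addJobsMs = 988.425; // "Adding jobs" timing from the batching run
const withBatching = 183895.49; // jobs per second
const withoutBatching = 15606.79; // jobs per second

// Jobs queued per second by the single batch insert.
const queuedPerSecond = jobCount / (addJobsMs / 1000);

// How much faster execution was with batching enabled.
const speedup = withBatching / withoutBatching;

console.log(`Queued per second: ~${Math.round(queuedPerSecond)}`);
console.log(`Batching speedup: ~${speedup.toFixed(1)}x`);
```

On these figures the queueing rate comes out at a little over 202,000 jobs per second and the speedup at roughly 11.8x, in line with the "order of magnitude" improvement described for the local queue.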
TODO: post perfTest results in a more reasonable configuration, e.g. using an RDS PostgreSQL server and a worker running on EC2.