API · Locust

function: `execute(jobDefinition)`

locust.execute(jobDefinition)

jobDefinition <Object>

returns: <Promise<Object>> a promise that resolves to a jobResult

// example.js
const { execute } = require('locust');
const { execSync } = require('child_process');
const job = {
  // start tells Locust how to launch a new instance (here, this same script)
  start: () => execSync('./example.js'),
  url: 'http://localhost:3001',
  config: {
    name: 'collect-data',
    depthLimit: 1,
  },
  connection: {
    redis: {
      port: 6379,
      host: 'localhost'
    },
    chrome: {
      browserWSEndpoint: 'ws://localhost:3000'
    },
  }
};

(() => execute(job))()

Starts a Locust job. The first run processes the entrypoint url; each subsequent run processes the first job in the queue.

object: `jobDefinition`

jobDefinition <Object>
- url <string> the entrypoint url for the job
- beforeAll <Function>
  optional
- before <Function>
  optional
- after <Function>
  optional
- start <Function>
- extract <Function>
  optional
- config <Object> Defines settings that determine global behavior of Locust
  - name <string> a unique name to identify the job
  - logLevel <Number>
    optional
    RFC5424 log level - logging is disabled if omitted
  - concurrencyLimit <Number> the maximum number of concurrent jobs
  - depthLimit <Number> the maximum link depth from the entrypoint url; once it is reached, Locust stops processing additional jobs across all instances of this job
  - delay <Number>
    optional
    wait time in milliseconds before starting a job after popping it from the queue
- filter <Function|Object>
  optional
  filter links by a hostname or function
- connection <Object>
  - redis <Object> configuration for ioredis to connect to Redis
    - host <string>
    - port <Number>
  - chrome <Object> configuration for Puppeteer to connect to Chrome
    - browserWSEndpoint <string> web socket address of a Chrome instance

Configuration object that defines how to connect to Chrome and Redis and how the system behaves.
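
As an illustration of how these fields fit together, the sketch below fills in the optional config settings alongside the required ones. Every value (the job name, limits, delay, and the `node ./crawl.js` command) is an assumption chosen for the example, not a documented default.

```javascript
const { execSync } = require('child_process');

// Hypothetical job definition; all values are illustrative.
const jobDefinition = {
  url: 'http://localhost:3001',
  // Assumed command: re-runs the script that calls execute().
  start: () => execSync('node ./crawl.js'),
  config: {
    name: 'collect-data',
    logLevel: 6,          // RFC 5424 severity 6 = Informational
    concurrencyLimit: 2,  // at most two jobs at once
    depthLimit: 3,        // stop three links away from the entrypoint
    delay: 500,           // wait 500 ms after popping a job
  },
  connection: {
    redis: { host: 'localhost', port: 6379 },
    chrome: { browserWSEndpoint: 'ws://localhost:3000' },
  },
};
```

The hooks (`beforeAll`, `before`, `after`, `extract`) and `filter` are left out here because they are optional.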

object: `jobResult`

jobResult <Object>
- cookies <Object>
- data <?Object> Return value of the jobDefinition.extract function if one was defined
- links <Array>
- response <Object>

Object containing the result of the job including the raw response, extracted links, cookies, and extracted data.

object: `jobData`

jobData <Object>
- url <string> address for the job
- depth <Number> page distance of the job from the entrypoint url in the jobDefinition

Minimal job representation used primarily within the Redis queue

object: `snapshot`

snapshot <Object>
- state <'ACTIVE'|'INACTIVE'> current state of Redis queue
- queue <Object> each value contains an array of urls
  - processing <Array<string>>
  - done <Array<string>>
  - queued <Array<string>>

A snapshot of the Redis queue at a given point in time
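
To make the shape concrete, here is a made-up snapshot together with a small helper that counts urls per state. The helper `countByState` is hypothetical and not part of Locust; it only shows how a hook might read the object.

```javascript
// Example snapshot shaped like the description above; the urls are invented.
const snapshot = {
  state: 'ACTIVE',
  queue: {
    processing: ['http://localhost:3001/a'],
    done: ['http://localhost:3001/'],
    queued: ['http://localhost:3001/b', 'http://localhost:3001/c'],
  },
};

// Hypothetical helper: counts the urls in each queue state.
const countByState = ({ queue }) => ({
  processing: queue.processing.length,
  done: queue.done.length,
  queued: queue.queued.length,
});

console.log(countByState(snapshot)); // → { processing: 1, done: 1, queued: 2 }
```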

object: `response`

response <Object>
- ok <Boolean>
- status <Number> HTTP response code
- statusText <string> HTTP response message
- headers <Object>
- url <string> url after following redirects or any page navigation
- body <string> html content of the page

Response from the HTTP request after navigating to the url in jobData or url in the jobDefinition
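
The jobResult and response shapes described above can be combined into a compact log entry. The sketch below uses a hypothetical helper, `summarize`, and an invented jobResult; neither is provided by Locust.

```javascript
// Hypothetical helper: condenses a jobResult into a few loggable fields.
const summarize = (jobResult) => ({
  url: jobResult.response.url,
  status: jobResult.response.status,
  ok: jobResult.response.ok,
  linkCount: jobResult.links.length,
  // data is nullable, so treat null and undefined the same way
  hasData: jobResult.data != null,
});

// Invented jobResult for illustration only.
const exampleResult = {
  cookies: {},
  data: null,
  links: ['http://localhost:3001/a', 'http://localhost:3001/b'],
  response: {
    ok: true,
    status: 200,
    statusText: 'OK',
    headers: {},
    url: 'http://localhost:3001/',
    body: '<html></html>',
  },
};

console.log(summarize(exampleResult));
```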

function: `beforeAll`

jobDefinition.beforeAll(browser, snapshot, jobData)
- browser <Puppeteer.Browser> Puppeteer browser instance
- snapshot <Object> A snapshot of the Redis queue at the time the job was popped from the Redis queue
- jobData <Object> Current job's data

User defined hook to run once before the first job is processed
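
A minimal sketch of a beforeAll hook, assuming all it needs to do is log some context once. `browser.version()` is a standard Puppeteer method; the message format is invented, and the return value exists only so the sketch can be exercised in isolation (Locust is not assumed to use it).

```javascript
// Sketch of a beforeAll hook: logs crawl context before the first job.
const beforeAll = async (browser, snapshot, jobData) => {
  // Puppeteer's Browser#version resolves to a string naming the browser build.
  const version = await browser.version();
  const message = `Starting at ${jobData.url} (depth ${jobData.depth}) on ${version}`;
  console.log(message);
  return message; // returned only to make the sketch easy to check
};
```

It would be attached to the job as `beforeAll` in the jobDefinition.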

function: `before`

jobDefinition.before(page, snapshot, jobData)
- page <Puppeteer.Page> Puppeteer page instance
- snapshot <Object> A snapshot of the Redis queue at the time the job was popped from the Redis queue
- jobData <Object> Current job's data

User defined hook to run before every job is processed
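
One plausible use of this hook is per-page setup. The sketch below assumes the hook runs before the page loads the job's url, and uses two standard Puppeteer calls, `page.setUserAgent` and `page.setViewport`; the user agent string and viewport size are made up.

```javascript
// Sketch of a before hook: configures the page for each job.
const before = async (page, snapshot, jobData) => {
  await page.setUserAgent('locust-example/1.0');        // assumed user agent
  await page.setViewport({ width: 1280, height: 800 }); // assumed viewport
};
```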

function: `after`

jobDefinition.after(jobResult, snapshot, stopQueue)
- jobResult <Object>
- snapshot <Object> A snapshot of the Redis queue after the new links for the current job were added to the queue
- stopQueue <Function> Callback function to send a global stop signal. In-flight jobs are not stopped; however, no further jobs are started

User defined hook to run after every job is processed

...
  after: async (jobResult, snapshot, stop) => {

    // Stop the queue once five jobs have finished
    if (snapshot.queue.done.length >= 5)
      await stop()

  }
...

function: `start`

jobDefinition.start()

User defined hook to define how to invoke a new instance of Locust within the parent context (e.g. AWS Lambda, system process)
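
For the system-process case, one possible variation launches the next instance without blocking the current one. This is a sketch under assumptions: `makeStart` is a hypothetical factory, `./example.js` is an assumed entry script, and whether a blocking (`execSync`) or non-blocking (`spawn`) launch is appropriate depends on the parent context.

```javascript
const { spawn } = require('child_process');

// Hypothetical factory: returns a start hook that launches `command`
// as a detached child process and does not wait for it to finish.
const makeStart = (command, args) => () => {
  const child = spawn(command, args, { detached: true, stdio: 'ignore' });
  child.unref(); // allow the parent to exit independently of the child
  return child.pid;
};

// process.execPath is the path of the Node binary running this script.
const start = makeStart(process.execPath, ['./example.js']);
```

The resulting `start` function would be set as `start` in the jobDefinition.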

function: `extract`

jobDefinition.extract($, page, browser, jobData)
- $(selector) <Function> convenience function to get the text of an element on the page
  - selector <string> CSS selector e.g. ul li .description
  - returns: <Promise<string>> the text content of the first element at the selector
  - throws BrowserError: when there is no element found at the selector
- page <Puppeteer.Page> Puppeteer current page instance
- browser <Puppeteer.Browser> Puppeteer browser instance
- jobData <Object> Current job's data

User defined hook to extract data from the page
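
Because `$` throws when no element matches, an extract hook may want to tolerate missing elements. The sketch below uses only the first argument; `textOrNull` is a hypothetical helper and the selectors are examples.

```javascript
// Hypothetical helper: returns the element text, or null when $ throws.
// A stricter version could rethrow anything that is not a BrowserError.
const textOrNull = async ($, selector) => {
  try {
    return await $(selector);
  } catch (e) {
    return null;
  }
};

// Sketch of an extract hook: the returned object becomes jobResult.data.
const extract = async ($) => ({
  title: await textOrNull($, 'title'),
  description: await textOrNull($, 'ul li .description'),
});
```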

function: `filter`

jobDefinition.filter(links)
- links <Array<string>> Array of links extracted from the href attributes of <a> elements on the page

Filter which links are added to the queue from the page
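
A sketch of a filter function, assuming the hook returns the array of links that should be queued (the return value is not spelled out above). The hostname in `ALLOWED_HOST` is an assumption for the example.

```javascript
// Assumed hostname to keep; change to suit the crawl.
const ALLOWED_HOST = 'localhost';

// Sketch of a filter hook: keeps http(s) links on ALLOWED_HOST.
const filter = (links) => links.filter((link) => {
  try {
    const { protocol, hostname } = new URL(link);
    return (protocol === 'http:' || protocol === 'https:') && hostname === ALLOWED_HOST;
  } catch (e) {
    return false; // not an absolute url, so skip it
  }
});

// Keeps only 'http://localhost:3001/a'.
console.log(filter([
  'http://localhost:3001/a',
  'mailto:someone@example.com',
  'https://example.com/',
  '/relative',
]));
```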

object: `filter`

filter <Object>
- allowList <Array<string>> list of hostnames to allow
- blockList <Array<string>> list of hostnames to block

Filters which links are added to the queue from the page based on the hostname. Both lists can be used in conjunction.
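
For comparison with the function form, a filter object might look like the following. The hostnames are invented, and how Locust resolves a hostname that appears in both lists is not stated above, so the example keeps the lists disjoint.

```javascript
// Hypothetical hostnames; set as `filter` in the jobDefinition.
const filter = {
  allowList: ['localhost', 'docs.example.com'],
  blockList: ['ads.example.com'],
};
```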

class: `GeneralJobError`

locust.error.GeneralJobError(message, url)
- message <string>
- url <string>

Thrown when Locust encounters an error that causes it to abort

// example.js
const { execute, error: { GeneralJobError } } = require('locust');
const job = require('./job');

(async () => {
  try {

    await execute(job)

  } catch (e) {

    if (e instanceof GeneralJobError)
      return console.log(e.message);

    throw e;

  }
})()

class: `QueueEndError`

locust.error.QueueEndError(message, url)
- message <string>
- url <string>

Returned when a global queue end condition is met, e.g. no queued jobs remain or the depth limit has been reached

class: `QueueError`

locust.error.QueueError(message, url)
- message <string>
- url <string>

Returned when a transient condition is met where another job cannot be started, e.g. the concurrency limit has been reached
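
The difference between the two queue errors matters to the caller: one signals that the crawl is over, the other that a later attempt may succeed. The sketch below shows one way to branch on them. `classifyOutcome` and its labels are hypothetical, and the two classes defined here are stand-ins so the snippet runs without Locust installed; in real use they would come from `require('locust').error`.

```javascript
// Hypothetical helper: maps a value produced by execute() to a short label.
const classifyOutcome = (outcome, { QueueError, QueueEndError }) => {
  if (outcome instanceof QueueEndError) return 'finished';    // nothing left to do
  if (outcome instanceof QueueError) return 'retry-later';    // transient condition
  return 'result';                                            // an ordinary jobResult
};

// Stand-in classes for demonstration only (not Locust's own classes).
class QueueError extends Error {}
class QueueEndError extends Error {}
const errors = { QueueError, QueueEndError };

console.log(classifyOutcome(new QueueEndError('queue drained'), errors)); // → finished
console.log(classifyOutcome(new QueueError('at concurrency limit'), errors)); // → retry-later
console.log(classifyOutcome({ links: [] }, errors)); // → result
```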

class: `BrowserError`

locust.error.BrowserError(message, url, response)
- message <string>
- url <string>
- response <Object>

Thrown when Chrome encounters an error
