Locust

API

function: execute(jobDefinition)

locust.execute(jobDefinition)

  • jobDefinition <Object>

returns: <Promise<Object>> Returns a promise that resolves to a jobResult

// example.js
const { execute } = require('locust');
const { execSync } = require('child_process');
const job = {
  start: () => execSync('./example.js'),
  url: 'http://localhost:3001',
  config: {
    name: 'collect-data',
    depthLimit: 1,
  },
  connection: {
    redis: {
      port: 6379,
      host: 'localhost'
    },
    chrome: {
      browserWSEndpoint: 'ws://localhost:3000'
    },
  }
};

(() => execute(job))()

Starts a Locust job. On the first run, the job runs against the entrypoint url; on subsequent runs, the first queued job is run.

object: jobDefinition

  • jobDefinition <Object>
    • url <string> the entrypoint url for the job
    • beforeAll <Function>
      optional
    • before <Function>
      optional
    • after <Function>
      optional
    • start <Function>
    • extract <Function>
      optional
    • config <Object> Defines settings that determine global behavior of Locust
      • name <string> a unique name to identify the job
      • logLevel <Number>
        optional
        RFC5424 log level - logging is disabled if omitted
      • concurrencyLimit <Number> the maximum number of concurrent jobs
      • depthLimit <Number> the maximum link depth from the entrypoint url - when met, Locust will stop processing additional jobs across all instances of this job
      • delay <Number>
        optional
        wait time in milliseconds before starting a job after popping it from the queue
    • filter <Function|Object>
      optional
      filter links by a hostname or function
    • connection <Object>
      • redis <Object> configuration for ioredis to connect to Redis
        • host <string>
        • port <Number>
      • chrome <Object> configuration for Puppeteer to connect to Chrome
        • browserWSEndpoint <string> web socket address of a Chrome instance

Configuration object that defines how to connect to Chrome and Redis and how the system behaves.
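
As a sketch of how these fields fit together, a fuller jobDefinition might look like the following. The hook bodies, the hostname in the filter, and the specific limit and delay values are illustrative assumptions, not defaults taken from Locust.

```javascript
// job.js: a hypothetical jobDefinition using several of the optional fields above.
const { execFile } = require('child_process');

const job = {
  url: 'http://localhost:3001',
  // Assumption: each new instance is started by re-running a local script.
  start: () => execFile('node', ['./example.js']),
  // Illustrative hook: collect the page title via the $ helper.
  extract: async ($) => ({ title: await $('title') }),
  // Only follow links on this hostname (example value).
  filter: { allowList: ['localhost'] },
  config: {
    name: 'collect-data',
    concurrencyLimit: 2,
    depthLimit: 1,
    delay: 500, // milliseconds to wait after popping a job from the queue
  },
  connection: {
    redis: { host: 'localhost', port: 6379 },
    chrome: { browserWSEndpoint: 'ws://localhost:3000' },
  },
};

module.exports = job;
```

Keeping the definition in its own module lets the same object be passed to execute from whichever entrypoint the start hook invokes.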

object: jobResult

  • jobResult <Object>
    • cookies <Object>
    • data <?Object> Return value of the jobDefinition.extract function if one was defined
    • links <Array>
    • response <Object>

Object containing the result of the job including the raw response, extracted links, cookies, and extracted data.

object: jobData

  • jobData <Object>
    • url <string> address for the job
    • depth <Number> page distance of the job from the entrypoint url in the jobDefinition

Minimal job representation used primarily within the Redis queue

object: snapshot

  • snapshot <Object>
    • state <'ACTIVE'|'INACTIVE'> current state of Redis queue
    • queue <Object> each value contains an array of urls
      • processing <Array<string>>
      • done <Array<string>>
      • queued <Array<string>>

A snapshot of the Redis queue at a given point in time
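
To make the shape concrete, here is a small sketch that reduces a snapshot to counts. The `summarize` helper and the sample urls are hypothetical and not part of Locust.

```javascript
// Hypothetical helper: summarize a snapshot object of the shape described above.
const summarize = (snapshot) => ({
  active: snapshot.state === 'ACTIVE',
  queued: snapshot.queue.queued.length,
  processing: snapshot.queue.processing.length,
  done: snapshot.queue.done.length,
});

// Sample snapshot with made-up urls, for illustration only.
const sample = {
  state: 'ACTIVE',
  queue: {
    processing: ['http://localhost:3001/a'],
    done: ['http://localhost:3001'],
    queued: ['http://localhost:3001/b', 'http://localhost:3001/c'],
  },
};

console.log(summarize(sample));
```

A helper like this could be called from a hook that receives a snapshot to log crawl progress.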

object: response

  • response <Object>
    • ok <Boolean>
    • status <Number> HTTP response code
    • statusText <string> HTTP response message
    • headers <Object>
    • url <string> url after following redirects or any page navigation
    • body <string> html content of the page

Response from the HTTP request after navigating to the url in jobData or url in the jobDefinition
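
As a minimal sketch of reading these fields, the hypothetical `describeResponse` helper below turns the response inside a jobResult into a one-line log message. The sample object is invented for illustration.

```javascript
// Hypothetical helper: one-line description of the response in a jobResult.
const describeResponse = ({ response }) =>
  response.ok
    ? `OK ${response.status} ${response.url}`
    : `FAILED ${response.status} ${response.statusText} ${response.url}`;

// Made-up jobResult fragment containing only the fields used above.
const sampleResult = {
  response: {
    ok: false,
    status: 404,
    statusText: 'Not Found',
    url: 'http://localhost:3001/missing',
  },
};

console.log(describeResponse(sampleResult));
```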

function: beforeAll

  • jobDefinition.beforeAll(browser, snapshot, jobData)
    • browser <Puppeteer.Browser> Puppeteer browser instance
    • snapshot <Object> A snapshot of the Redis queue at the time the job was popped from the Redis queue
    • jobData <Object> Current job's data

User defined hook to run once before the first job is processed

function: before

  • jobDefinition.before(page, snapshot, jobData)
    • page <Puppeteer.Page> Puppeteer page instance
    • snapshot <Object> A snapshot of the Redis queue at the time the job was popped from the Redis queue
    • jobData <Object> Current job's data

User defined hook to run before every job is processed
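
The beforeAll and before hooks could be sketched as follows. The log messages and the viewport size are arbitrary example choices; `browser.version()` and `page.setViewport()` are standard Puppeteer methods.

```javascript
// Sketch of the two hooks; the viewport size is an arbitrary example value.
const hooks = {
  // Runs once before the first job: log which Chrome build is connected.
  beforeAll: async (browser, snapshot, jobData) => {
    console.log(`crawl starting at ${jobData.url} on ${await browser.version()}`);
  },
  // Runs before every job: set the viewport, then log queue progress.
  before: async (page, snapshot, jobData) => {
    await page.setViewport({ width: 1280, height: 800 });
    console.log(`depth ${jobData.depth}: ${jobData.url}, ${snapshot.queue.queued.length} queued`);
  },
};
```

Both functions would be assigned to the matching jobDefinition properties, for example by spreading `hooks` into the definition object.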

function: after

  • jobDefinition.after(jobResult, snapshot, stopQueue)
    • jobResult <Object>
    • snapshot <Object> A snapshot of the Redis queue after the new links for the current job were added to the queue
    • stopQueue <Function> Callback function to send a global stop signal. In-flight jobs are not stopped; however, no further jobs are started

User defined hook to run after every job is processed

...
after: async (jobResult, snapshot, stop) => {
  // Send the global stop signal once five jobs have completed
  if (snapshot.queue.done.length >= 5)
    await stop();
}
...

function: start

  • jobDefinition.start()

User defined hook to define how to invoke a new instance of Locust within the parent context (e.g. AWS Lambda, system process)
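
For the system process case, one possible start hook is sketched below. It assumes the job is launched from a single Node.js script and that starting another copy of that script is how a new instance should be invoked. Other parent contexts such as AWS Lambda would need a different body.

```javascript
const { spawn } = require('child_process');

// Hypothetical start hook: launch another run of the current script
// as a detached child process.
const start = () => {
  const child = spawn(process.execPath, [__filename], {
    detached: true,
    stdio: 'ignore',
  });
  // Allow the parent process to exit without waiting for the child.
  child.unref();
};
```

Spawning asynchronously means the current job does not block while the next instance runs, unlike a synchronous call such as execSync.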

function: extract

  • jobDefinition.extract($, page, browser, jobData)
    • $(selector) <Function> convenience function to get the text of an element on the page
      • selector <string> CSS selector e.g. ul li .description
      • returns: <Promise<string>> the text content of the first element at the selector
      • throws BrowserError: when there is no element found at the selector
    • page <Puppeteer.Page> Puppeteer current page instance
    • browser <Puppeteer.Browser> Puppeteer browser instance
    • jobData <Object> Current job's data

User defined hook to extract data from the page
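
Because `$` throws when no element matches, an extract hook may want a fallback for optional fields. The sketch below uses a hypothetical `textOr` helper, and the selectors are examples only.

```javascript
// Hypothetical helper: return the trimmed text at a selector, or a fallback
// when $ throws (the reference above says $ throws BrowserError when no
// element is found). Note that this catches any error, not only BrowserError.
const textOr = async ($, selector, fallback = null) => {
  try {
    return (await $(selector)).trim();
  } catch (e) {
    return fallback;
  }
};

// Sketch of an extract hook that tolerates missing elements.
const extract = async ($) => ({
  title: await textOr($, 'title'),
  description: await textOr($, 'ul li .description'),
});
```

The returned object is what ends up in the data field of the jobResult.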

function: filter

  • jobDefinition.filter(links)
    • links <Array<string>> Array of links extracted from the href attributes of <a> elements on the page

Filter which links are added to the queue from the page
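
A filter function might look like the sketch below. It assumes the function returns the array of links to keep, which the reference above does not state explicitly. The hostname and the `.pdf` rule are example criteria.

```javascript
// Sketch of a filter function. Assumption: the return value is the list of
// links that should be added to the queue.
const filter = (links) =>
  links.filter((link) => {
    try {
      const { hostname, pathname } = new URL(link);
      // Keep same-host links and skip PDF downloads (example criteria).
      return hostname === 'localhost' && !pathname.endsWith('.pdf');
    } catch (e) {
      // new URL throws on values that are not absolute urls; drop them.
      return false;
    }
  });
```

A function gives finer control than the object form, for example filtering on the path as well as the hostname.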

object: filter

  • filter <Object>
    • allowList <Array<string>> list of hostnames to allow
    • blockList <Array<string>> list of hostnames to block

Filters which links are added to the queue from the page based on the hostname. Both lists can be used in conjunction.

class: GeneralJobError

  • locust.error.GeneralJobError(message, url)
    • message <string>
    • url <string>

Thrown when Locust encounters an error that causes it to abort

// example.js
const { execute, error: { GeneralJobError } } = require('locust');
const job = require('./job');

(async () => {
  try {

    await execute(job)

  } catch (e) {

    if (e instanceof GeneralJobError)
      return console.log(e.message);

    throw e;

  }
})()

class: QueueEndError

  • locust.error.QueueEndError(message, url)
    • message <string>
    • url <string>

Returned when a global queue end condition is met, e.g. no queued jobs remain or the depth limit has been reached

class: QueueError

  • locust.error.QueueError(message, url)
    • message <string>
    • url <string>

Returned when a transient condition is met where another job cannot be started, e.g. the concurrency limit has been reached

class: BrowserError

  • locust.error.BrowserError(response)
    • message <string>
    • url <string>
    • response <Object>

Thrown when Chrome encounters an error

Last updated on 6/4/2020 by Ani Channarasappa
Copyright © 2020 Ani Channarasappa