Locust

Distributed web data discovery and collection framework
npm install @achannarasappa/locust
Features

  • Configuration-driven jobs
  • Distributed execution model
  • Handles client-side JavaScript execution
  • Data extraction using CSS selectors
  • Depth-based stop condition, with support for custom stop conditions
  • Robust dev tooling to build and test jobs

Use Cases

  • Web indexing (i.e. web crawling)
  • Web data extraction (i.e. web scraping)
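
The depth-based and custom stop conditions listed under Features can be pictured with a minimal, self-contained sketch. It walks an in-memory link graph breadth-first. The names `site`, `crawl`, `depthLimit`, and `shouldStop` are hypothetical and chosen for illustration; they are not Locust's API.

```javascript
// Hypothetical in-memory "site": each URL maps to the links found on that page.
const site = {
  '/': ['/a', '/b'],
  '/a': ['/a/1'],
  '/b': [],
  '/a/1': ['/a/1/x'],
  '/a/1/x': [],
};

// Breadth-first crawl that stops following links once depthLimit is reached,
// or as soon as the custom shouldStop predicate returns true.
function crawl(links, startUrl, { depthLimit, shouldStop = () => false }) {
  const visited = [];
  const seen = new Set([startUrl]);
  const queue = [{ url: startUrl, depth: 0 }];

  while (queue.length > 0) {
    const { url, depth } = queue.shift();
    visited.push(url);
    if (shouldStop({ url, depth, visited })) break;
    if (depth >= depthLimit) continue; // do not enqueue links beyond the limit

    for (const next of links[url] || []) {
      if (!seen.has(next)) {
        seen.add(next);
        queue.push({ url: next, depth: depth + 1 });
      }
    }
  }
  return visited;
}
```

With `depthLimit: 1` the sketch visits the start page and its direct links only; passing a `shouldStop` predicate such as `({ visited }) => visited.length >= 2` ends the crawl earlier.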
Configuration Driven

Define what data to collect and where to find it, rather than how to retrieve it.

Job Reference • Scraping Guide
const job = {
  // Fields to collect; $ is called with a CSS selector
  extract: async ($) => ({
    title: await $('title'),
  }),
  // Page the job starts from
  url: 'http://ecommerce-site.com',
};

module.exports = job;
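
To make the "what, not how" idea concrete, here is a minimal sketch of how a runner could consume such a declarative job. It is not Locust's implementation: `runJob`, `fakePage`, and the `.price` selector are hypothetical, and the page is replaced by a plain selector-to-text lookup so the example runs without a browser.

```javascript
// Declarative job: states what to extract and where, nothing about fetching.
const job = {
  url: 'http://ecommerce-site.com',
  extract: async ($) => ({
    title: await $('title'),
    price: await $('.price'), // hypothetical selector, for illustration only
  }),
};

// Hypothetical stand-in for a rendered page: selector -> text content.
const fakePage = {
  title: 'Example Store',
  '.price': '$9.99',
};

// Toy runner: builds the $ helper and hands it to the job's extract function.
async function runJob(jobDefinition, page) {
  const $ = async (selector) => (selector in page ? page[selector] : null);
  const data = await jobDefinition.extract($);
  return { url: jobDefinition.url, data };
}
```

Calling `runJob(job, fakePage)` resolves to the job's URL together with the extracted fields. The job itself never mentions how the page was obtained, which is the point of the configuration-driven style.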
Built for Serverless

Jobs run independently in separate threads, processes, or cloud functions, with Redis as a centralized queue. Just define how to start a new job.
Architecture Reference • Deployment Guide
Lifecycle Reference
const job = {
  // ...
  // The start functions below are alternatives; keep only one in a real job.
  // AWS Lambda (aws-sdk v2; .promise() is what actually sends the request)
  start: () => new (require('aws-sdk').Lambda)().invoke({
    FunctionName: 'locust-job',
    InvocationType: 'Event',
  }).promise(),
  // Google Cloud Functions
  start: () => require('child_process').execSync('gcloud functions call locust-job'),
  // Linux/Windows process (assumes this job is saved as ./job.js)
  start: () => require('child_process').execSync('node -e "require(\'@achannarasappa/locust\').execute(require(\'./job\'))"'),
  // Node.js
  start: () => require('@achannarasappa/locust').execute(job),
  // ...
};

module.exports = job;
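
The execution model described in this section (independent job runs sharing one queue, each able to start further runs) can be sketched without any infrastructure. The example below is an illustration under stated assumptions, not Locust's internals: `createQueue` is an in-memory stand-in for Redis, and `runOnce` and `links` are hypothetical names.

```javascript
// In-memory stand-in for the shared Redis queue (assumption: a real deployment
// would use Redis so that separate processes or functions can share it).
function createQueue(initialUrls) {
  const pending = [...initialUrls];
  const done = [];
  return {
    pop: () => pending.shift(),
    push: (url) => pending.push(url),
    markDone: (url) => done.push(url),
    size: () => pending.length,
    done: () => [...done],
  };
}

// One job run: take a URL from the queue, "process" it, enqueue the links it
// discovered, and call start() once per link so another run picks each one up.
async function runOnce(queue, links, start) {
  const url = queue.pop();
  if (url === undefined) return; // nothing left to do
  queue.markDone(url);
  for (const next of links[url] || []) {
    queue.push(next);
    await start();
  }
}
```

A usage sketch: with `const queue = createQueue(['/'])` and `const start = () => runOnce(queue, links, start)`, a single `await start()` drains the queue, because every run starts new runs for the links it finds. Here `start` runs in-process; swapping it for a function that invokes a cloud function or spawns a process changes where the runs execute, not the logic. The sketch does not deduplicate URLs, so it assumes a link graph without cycles.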
Client-side JavaScript Execution

Crawl or scrape single-page applications (SPAs), including those built with AngularJS, React, and Vue.js.
Lifecycle Reference • SPA Example
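const job = {
  extract: async ($, page) => {
    // Wait for the SPA to render the profile before reading it.
    // Assumes page is a Puppeteer Page, where waitForSelector replaces the deprecated waitFor.
    await page.waitForSelector('.profile');
    return {
      firstName: await $('.profile > .first_name'),
      lastName: await $('.profile > .last_name'),
    };
  },
};

module.exports = job;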
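
Waiting for a selector matters because a SPA fills in its content after the initial page load. The sketch below shows the general idea as a polling loop with a timeout. It is not how any particular browser library implements waiting, and `waitUntil` and `createFakeSpa` are hypothetical helpers that stand in for a real page.

```javascript
// Poll until check() returns a truthy value or the timeout elapses.
async function waitUntil(check, { timeout = 1000, interval = 10 } = {}) {
  const deadline = Date.now() + timeout;
  for (;;) {
    const result = await check();
    if (result) return result;
    if (Date.now() >= deadline) {
      throw new Error(`Condition not met within ${timeout} ms`);
    }
    await new Promise((resolve) => setTimeout(resolve, interval));
  }
}

// Hypothetical stand-in for a SPA that renders its profile after 50 ms.
function createFakeSpa() {
  const dom = {};
  setTimeout(() => {
    dom['.profile > .first_name'] = 'Ada';
    dom['.profile > .last_name'] = 'Lovelace';
  }, 50);
  return { query: (selector) => dom[selector] };
}
```

Reading `spa.query('.profile > .first_name')` immediately after `createFakeSpa()` yields `undefined`, while wrapping the same lookup in `waitUntil` resolves once the content has been rendered.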
Powerful Devtools

A comprehensive set of command-line tools to accelerate development of Locust jobs:
  • Generator
  • Runner
  • Validator
CLI Reference
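
As a rough picture of what validating a job definition can involve, here is a small sketch that checks the fields used in the examples on this page (`url`, `extract`, `start`). The function `validateJob` and its rules are assumptions made for illustration; they do not describe the checks performed by the Locust CLI.

```javascript
// Hypothetical validator: returns a list of problems found in a job definition.
function validateJob(job) {
  if (job === null || typeof job !== 'object') {
    return ['job must be an object'];
  }
  const errors = [];
  if (typeof job.url !== 'string' || !/^https?:\/\//.test(job.url)) {
    errors.push('url must be an http(s) URL string');
  }
  if (typeof job.extract !== 'function') {
    errors.push('extract must be a function');
  }
  if (job.start !== undefined && typeof job.start !== 'function') {
    errors.push('start must be a function when provided');
  }
  return errors;
}
```

An empty result means the job passed these checks; otherwise each entry names one problem, which is the kind of feedback that is useful before a job is deployed.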
Copyright © 2020 Ani Channarasappa