robots-aware-crawl
Crawl a small set of pages with depth/page limits, optional robots.txt handling, and normalized page output.
Why install it
Research and ingestion agents often need more than a single fetch. This tool packages the basic crawl loop, link normalization, and robots-aware filtering into one reusable primitive.
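For intuition, the sketch below shows the kind of breadth-first loop this wraps: a queue of (url, depth) items bounded by page and depth limits, with discovered links resolved and normalized against the page URL. It is illustrative only, not this tool's implementation; robots.txt consultation and domain/pattern filtering are omitted, and it assumes Node 18+ for the built-in fetch.

```ts
// Illustrative breadth-first crawl loop with page and depth limits
// (a sketch, not the tool's actual implementation).
type QueueItem = { url: string; depth: number };

async function crawl(startUrls: string[], maxPages: number, maxDepth: number) {
  const queue: QueueItem[] = startUrls.map((url) => ({ url, depth: 0 }));
  const visited = new Set<string>();
  const pages: { url: string; depth: number; links: string[] }[] = [];

  while (queue.length > 0 && pages.length < maxPages) {
    const { url, depth } = queue.shift()!;
    if (visited.has(url) || depth > maxDepth) continue;
    visited.add(url);

    let html: string;
    try {
      html = await (await fetch(url)).text();
    } catch {
      continue; // the real tool records failures under `errors`
    }

    // Naive link extraction and normalization: resolve each href against the
    // page URL and drop fragments so duplicates collapse to one entry.
    const links = [...html.matchAll(/href="([^"]+)"/g)]
      .map((m) => { try { return new URL(m[1], url); } catch { return null; } })
      .filter((u): u is URL => u !== null)
      .map((u) => { u.hash = ""; return u.toString(); });

    pages.push({ url, depth, links });
    for (const link of links) queue.push({ url: link, depth: depth + 1 });
  }
  return pages;
}
```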
Inputs
- start_urls: seed URLs to begin crawling from
- allowed_domains: optional list of domains the crawler may visit
- max_pages: maximum number of pages to fetch
- max_depth: maximum crawl depth from the start URLs
- same_origin_only: restrict links to the same origin as the seed URL
- respect_robots: whether to consult robots.txt
- include_patterns: optional regex patterns a URL must match to be crawled
- exclude_patterns: optional regex patterns that disqualify a URL
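The parameters above imply a request shape along these lines. This is a sketch: which fields are optional, and their defaults, are assumptions rather than the tool's published schema.

```ts
interface CrawlInput {
  start_urls: string[];        // seed URLs to begin crawling from
  allowed_domains?: string[];  // domains the crawler may visit
  max_pages?: number;          // maximum number of pages to fetch
  max_depth?: number;          // maximum crawl depth from the start URLs
  same_origin_only?: boolean;  // restrict links to the seed URL's origin
  respect_robots?: boolean;    // whether to consult robots.txt
  include_patterns?: string[]; // regexes a URL must match to be crawled
  exclude_patterns?: string[]; // regexes that disqualify a URL
}
```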
Outputs
- pages: fetched pages with title, excerpt, depth, and discovered links
- visited_count: number of pages actually fetched
- skipped: skipped URLs with reasons
- errors: fetch or parse failures encountered during the crawl
- metadata: crawl summary information
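Read together, those fields suggest a result shape like the following sketch. Nested field names such as `url` and `reason` are assumptions beyond what the list states.

```ts
interface CrawlOutput {
  pages: {
    url: string;      // assumed; the list guarantees title, excerpt, depth, links
    title: string;
    excerpt: string;
    depth: number;
    links: string[];
  }[];
  visited_count: number;                       // pages actually fetched
  skipped: { url: string; reason: string }[];  // skipped URLs with reasons
  errors: { url: string; error: string }[];    // fetch or parse failures
  metadata: Record<string, unknown>;           // crawl summary information
}
```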
Local development
The source code for this tool can be found here.
Test:
npm test
Build:
npm run build
Example invocation
printf '%s' '{"start_urls":["https://example.com/docs"],"max_pages":5,"max_depth":1,"respect_robots":true}' | node dist/index.js
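The same request can be issued programmatically. The sketch below assumes the tool reads a single JSON document on stdin and writes a single JSON document to stdout, as the printf example above suggests; only visited_count from the documented output fields is used.

```ts
// Spawn the built entry point, send the request on stdin, parse the result.
import { execFileSync } from "node:child_process";

const request = {
  start_urls: ["https://example.com/docs"],
  max_pages: 5,
  max_depth: 1,
  respect_robots: true,
};

const stdout = execFileSync("node", ["dist/index.js"], {
  input: JSON.stringify(request),
  encoding: "utf8",
});

const result = JSON.parse(stdout);
console.log(`visited ${result.visited_count} pages`);
```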