pup

HTML query CLI for selecting nodes with CSS selectors and emitting matching markup, text, attributes, or JSON.

$ brew install https://raw.githubusercontent.com/EricChiang/pup/master/pup.rb

Language

Stars: 8,399

Category: Data Processing

Agent

Ready

Agent Compatibility

JSON Output

Agent Skill

MCP Support

AI Analysis

pup is a small HTML parsing CLI that reads markup from stdin or a file, applies CSS selectors, and prints the matching nodes. It is useful for lightweight scraping, inspection, and preprocessing when you already have the HTML and do not need a browser session.

What It Enables

Extract specific HTML fragments, text content, attribute values, or match counts from fetched pages or saved documents.
Turn selected nodes into a simple JSON structure for downstream parsing in shell pipelines.
Pretty-print messy markup or narrow a large page down to the subsection another tool or script should inspect next.
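
The extraction modes listed above can be sketched as small shell wrappers. This is a minimal illustration, not the project's recommended usage: the function names and the sample HTML are hypothetical, while `text{}`, `attr{}`, `json{}`, and the `-n` count flag are pup's own display functions and options.

```shell
#!/bin/sh
# Hypothetical sample page; in practice this would come from curl or a saved file.
SAMPLE_HTML='<html><body><h1 class="title">Hello</h1><a href="/a">A</a><a href="/b">B</a></body></html>'

# Text content of every node matching a selector.
page_text() { pup "$1 text{}"; }

# Value of one attribute for every node matching a selector.
page_attr() { pup "$1 attr{$2}"; }

# Number of nodes matching a selector (-n prints a count instead of markup).
page_count() { pup -n "$1"; }

# Matching nodes as JSON for downstream tools.
page_json() { pup "$1 json{}"; }

# Demo, run only when pup is installed.
if command -v pup >/dev/null 2>&1; then
  printf '%s' "$SAMPLE_HTML" | page_text 'h1.title'
  printf '%s' "$SAMPLE_HTML" | page_attr 'a' 'href'
  printf '%s' "$SAMPLE_HTML" | page_count 'a'
  printf '%s' "$SAMPLE_HTML" | page_json 'a'
fi
```

Each wrapper reads HTML from stdin, so the same functions work on a fetched page, a saved fixture, or the output of an earlier pup call.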

Agent Fit

Stdin or file input, explicit flags, and selector-based queries make it easy to compose with curl, saved fixtures, and follow-up shell steps.
json{} provides real machine-readable output, but the schema is limited to node tags, attributes, text, comments, and nested children.
Best as a lightweight HTML extraction primitive inside a larger fetch and parse workflow, not as a complete web interaction surface.
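
A fetch-then-parse workflow along these lines might look like the sketch below. The function names and file paths are hypothetical and the curl options are one reasonable choice among several; `-f` is pup's flag for reading from a file instead of stdin.

```shell
#!/bin/sh
# fetch_page URL FILE: save a page once so later queries do not refetch it.
fetch_page() { curl -fsSL "$1" -o "$2"; }

# query_file FILE SELECTOR: run a selector against a saved document.
query_file() { pup -f "$1" "$2"; }

# query_url URL SELECTOR: stream a fetched page straight into pup via stdin.
query_url() { curl -fsSL "$1" | pup "$2"; }
```

For example, `fetch_page https://example.com page.html && query_file page.html 'title text{}'` separates the network step from the parsing step, which keeps retries and selector experiments cheap. Anything that needs JavaScript execution or a logged-in session has to be handled by the fetch step, since pup only sees the bytes it is given.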

Caveats

It only processes static HTML you provide; it does not fetch pages, run JavaScript, or maintain login state.
The project README has at least one stale note about the shape of json{} output, so the source code is the better reference for edge cases.