markdown-chunk
Split markdown or plain text into deterministic chunks with heading context and offsets.
Why install it
Chunking is a repeated preprocessing step for retrieval, summarization, and memory ingestion. This tool gives you a portable chunk contract instead of one-off chunking logic per agent.
Inputs
- text: markdown or plain text to chunk
- strategy: heading, paragraph, or hybrid
- max_chars: target maximum characters per chunk
- overlap: trailing character overlap carried into the next chunk when a split occurs
- source_id: optional source identifier copied into emitted chunks
Outputs
- chunks: ordered chunk list with text, heading path, offsets, char count, and stable ID
- metadata: summary information including chunk count, strategy, overlap, and fallback order
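To make the chunk contract concrete, here is a minimal sketch of what one emitted chunk could look like. The field names, the `make_chunk` helper, and the hash-based ID scheme are assumptions for illustration only; the tool's actual output keys and ID derivation may differ.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    # Hypothetical field names mirroring the output description above.
    chunk_id: str
    text: str
    heading_path: tuple
    start: int
    end: int
    char_count: int
    source_id: str = ""


def make_chunk(text, heading_path, start, source_id=""):
    """Build a chunk whose ID depends only on deterministic inputs."""
    end = start + len(text)
    # Assumed ID scheme: hash of source, offsets, and text, truncated to 16 hex chars.
    key = f"{source_id}|{start}|{end}|{text}".encode("utf-8")
    chunk_id = hashlib.sha256(key).hexdigest()[:16]
    return Chunk(chunk_id, text, tuple(heading_path), start, end, len(text), source_id)


# "Hello world" starts at offset 9 in "# Intro\n\nHello world".
example = make_chunk("Hello world", ["Intro"], start=9, source_id="doc-1")
```

Because the ID in this sketch is derived from the content and offsets alone, re-running it on the same input yields the same ID, which is the property a "stable ID" needs for deduplication and re-ingestion.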
How it chunks
The tool splits at the coarsest structural boundary first and falls back to the next finer one only when a piece would still exceed max_chars, in this order:
- headings
- paragraphs
- sentences
- fixed-size windows
When content overflows max_chars and a split occurs, the configured number of trailing overlap characters is carried into the next chunk for continuity.
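The fallback order and the overlap behavior can be sketched roughly as follows. This is a simplified illustration, not the tool's implementation: it skips the heading level, uses naive regular expressions for paragraph and sentence boundaries, and applies overlap only at the fixed-window level. The function names `split_windows` and `split_with_fallback` are hypothetical.

```python
import re


def split_windows(text, max_chars, overlap):
    """Last resort: fixed-size windows where neighbors share `overlap` characters."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    step = max_chars - overlap
    windows = []
    start = 0
    while start < len(text):
        windows.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this window already reached the end of the text
        start += step
    return windows


def split_with_fallback(text, max_chars, overlap=0):
    """Try paragraphs, then sentences, then fixed-size windows."""
    if len(text) <= max_chars:
        return [text] if text else []
    chunks = []
    for paragraph in re.split(r"\n\s*\n", text):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        if len(paragraph) <= max_chars:
            chunks.append(paragraph)
            continue
        # Paragraph is too long: fall back to sentence boundaries.
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph):
            if len(sentence) <= max_chars:
                chunks.append(sentence)
            else:
                # Sentence is still too long: fall back to fixed windows.
                chunks.extend(split_windows(sentence, max_chars, overlap))
    return chunks
```

With `max_chars=4` and `overlap=1`, `split_windows("abcdefghij", 4, 1)` produces windows in which the last character of one window is repeated as the first character of the next.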
Local development
The source code for this tool can be found here
Test:
python -m unittest discover -s tests -p 'test_*.py'
Example invocation
python -u markdown_chunk/__main__.py < input.json
With input.json containing:
{
"text": "# Intro\n\nHello world",
"strategy": "hybrid"
}
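A request that sets every documented input could be assembled as in the sketch below. The values for max_chars, overlap, and source_id are illustrative only, and the README does not state which defaults the tool uses when these fields are omitted.

```python
import json

# Keys follow the Inputs list; the values are example choices, not defaults.
payload = {
    "text": "# Intro\n\nHello world",
    "strategy": "hybrid",
    "max_chars": 800,
    "overlap": 80,
    "source_id": "doc-1",
}

# Serialize to the JSON document that would be piped to the tool on stdin.
payload_json = json.dumps(payload, indent=2)
```

Writing `payload_json` to input.json gives a file that can be passed to the invocation shown above.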