Extraction Pipeline

Soul extraction transforms a corpus of tweets into a structured soul.md document through a two-pass LLM pipeline.

Input

Standard path — up to 3,200 tweets via the X API (most recent, sorted by engagement)
Archive path — full Twitter data export for deeper extraction (demo/premium)

Retweets are filtered out. Remaining tweets are sorted by engagement (likes + retweets) so the most representative content is prioritized. Replies are kept but flagged — they reveal relationship patterns and debate positions.
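The filtering and ordering rules above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the `Tweet` type, its field names, and `prepare_tweets` are hypothetical, and real X API payloads are shaped differently.

```python
from dataclasses import dataclass


@dataclass
class Tweet:
    # Hypothetical, simplified record; field names are assumptions.
    text: str
    likes: int = 0
    retweets: int = 0
    is_retweet: bool = False
    is_reply: bool = False  # replies are kept, and this flag travels with them


def prepare_tweets(tweets: list[Tweet]) -> list[Tweet]:
    """Drop retweets, keep flagged replies, sort by likes + retweets (highest first)."""
    originals = [t for t in tweets if not t.is_retweet]
    return sorted(originals, key=lambda t: t.likes + t.retweets, reverse=True)
```

Because replies are only flagged and never removed, a later prompt can still treat them differently from standalone tweets.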

Pass 1: Categorize (Haiku)

Tweets are batched into groups of ~500 and processed in parallel (up to 5 concurrent). Each batch goes through Claude Haiku with a categorization prompt that extracts:

Themes — specific recurring topics (not “crypto” but “MEV resistance”, “DAO governance failures”)
Values — what they defend, promote, care about consistently
Positions — strong opinions with the actual stance, not just the topic
Communication patterns — blunt vs diplomatic, questions vs declarations
Relationships — communities, allies, tribes, arguments
Boundaries — what they reject, block, refuse to engage with
Decision signals — priorities, tradeoffs, sacrifice patterns

Haiku is chosen for categorization because it's fast, cheap, and good enough for pattern extraction. The synthesis step (which requires judgment and prose quality) uses a more capable model.
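One way to get batches of roughly 500 with at most 5 in flight is a semaphore around each batch call. The sketch below assumes an asyncio-based worker; `categorize_batch` is a hypothetical placeholder for the Haiku request, and the exact batch size is an assumption, since the text only says "~500".

```python
import asyncio

BATCH_SIZE = 500      # "~500" in the text; the exact value is an assumption
MAX_CONCURRENCY = 5   # up to 5 batches processed at once


def make_batches(items: list, size: int = BATCH_SIZE) -> list[list]:
    """Split items into consecutive chunks of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


async def categorize_batch(batch: list[str]) -> dict:
    """Hypothetical stand-in for the Haiku categorization call."""
    await asyncio.sleep(0)  # placeholder for network I/O
    return {"tweet_count": len(batch)}


async def categorize_all(tweets: list[str]) -> list[dict]:
    """Run every batch, never more than MAX_CONCURRENCY at the same time."""
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)

    async def run(batch: list[str]) -> dict:
        async with semaphore:
            return await categorize_batch(batch)

    # gather returns results in the same order as the batches
    return await asyncio.gather(*(run(b) for b in make_batches(tweets)))
```

At this batch size, the standard path's ceiling of 3,200 tweets produces at most seven batches, so the whole categorization pass completes in two waves.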

Pass 2: Synthesize (Sonnet)

All batch analyses are merged and fed to Claude Sonnet with a synthesis prompt. The prompt instructs Sonnet to write in second person (“you”), be radically specific, and produce a document where every statement is falsifiable.
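The source does not describe how batch analyses are merged. A plausible minimal approach, shown here purely as an assumption, is to concatenate each category across batches and drop case-insensitive duplicates before building the synthesis prompt. The category keys and `merge_analyses` are hypothetical names.

```python
# Category keys mirror the Pass 1 list; the exact key names are assumptions.
CATEGORIES = [
    "themes", "values", "positions", "communication_patterns",
    "relationships", "boundaries", "decision_signals",
]


def merge_analyses(analyses: list[dict[str, list[str]]]) -> dict[str, list[str]]:
    """Combine per-batch results, keeping the first occurrence of each item."""
    merged: dict[str, list[str]] = {c: [] for c in CATEGORIES}
    seen: dict[str, set[str]] = {c: set() for c in CATEGORIES}
    for analysis in analyses:
        for category in CATEGORIES:
            for item in analysis.get(category, []):
                key = item.strip().lower()  # case-insensitive duplicate check
                if key and key not in seen[category]:
                    seen[category].add(key)
                    merged[category].append(item.strip())
    return merged
```

Preserving first-seen order keeps items from the highest-engagement batches near the top, assuming batches are built from the engagement-sorted list.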

The output must fit within 10KB (the onchain storage limit). If the initial synthesis draft exceeds this limit, Sonnet is asked to condense it while preserving structure. As a last resort, the document is truncated at the nearest clean line break.
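The size check, condensation step, and fallback truncation could look roughly like this. It is a sketch under stated assumptions: 10KB is taken to mean 10,240 UTF-8 bytes (the source does not say), and `condense` is a hypothetical callable standing in for the Sonnet condensation request.

```python
from typing import Callable

MAX_BYTES = 10 * 1024  # assumes 10KB = 10,240 bytes; the source does not specify


def truncate_at_line_break(text: str, max_bytes: int = MAX_BYTES) -> str:
    """Cut text to fit max_bytes, backing up to the last newline inside the limit."""
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return text
    cut = data[:max_bytes]
    newline = cut.rfind(b"\n")
    if newline != -1:
        cut = cut[:newline]
    # errors="ignore" drops a partial multi-byte character if no newline was found
    return cut.decode("utf-8", errors="ignore")


def fit_to_limit(draft: str, condense: Callable[[str], str],
                 max_bytes: int = MAX_BYTES) -> str:
    """Return the draft as-is if it fits, else condense it, else truncate."""
    if len(draft.encode("utf-8")) <= max_bytes:
        return draft
    condensed = condense(draft)  # hypothetical model call
    return truncate_at_line_break(condensed, max_bytes)
```

Measuring encoded bytes rather than characters matters here, because non-ASCII text takes more than one byte per character in UTF-8.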

Cost

Component	Model	Typical Cost
Categorization (all batches)	Haiku 3.5	~$0.08
Synthesis	Sonnet 4	~$0.15
Condensation (if needed)	Sonnet 4	~$0.10
Total (excluding condensation)		~$0.23

Quality Loop

The generated soul.md is presented in an editable markdown editor before minting. Users can refine, cut, or rewrite any section. The same editor is available for post-mint updates via tba.execute().

Quality validation: feed the soul.md to any LLM as a system prompt, ask questions the person hasn't publicly answered, and check whether the responses pass the vibe check. They should be not just factually plausible but tonally and ethically correct. A good soul.md produces a consistent identity regardless of which model reads it.
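The cross-model check can be organized as a grid of questions by models, collected for side-by-side human review. The sketch below is illustrative only: `collect_responses` is a hypothetical helper, and each entry in `models` is an assumed callable that takes a system prompt and a question and returns the model's reply. The judgment itself stays with the reader.

```python
from typing import Callable

# (system_prompt, question) -> reply; a hypothetical wrapper around any LLM client
AskFn = Callable[[str, str], str]


def collect_responses(soul_md: str, questions: list[str],
                      models: dict[str, AskFn]) -> dict[str, dict[str, str]]:
    """Ask every model every question, using soul_md as the system prompt."""
    return {
        question: {name: ask(soul_md, question) for name, ask in models.items()}
        for question in questions
    }
```

Reading the answers to one question across all models makes drift in tone or values easy to spot, which is the consistency property described above.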