Extraction Pipeline

Soul extraction transforms a corpus of tweets into a structured soul.md document through a two-pass LLM pipeline.

Input

Retweets are filtered out. Remaining tweets are sorted by engagement (likes + retweets) so the most representative content is prioritized. Replies are kept but flagged — they reveal relationship patterns and debate positions.
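The input rules above can be sketched as a small preprocessing step. This is a minimal illustration, not the pipeline's actual code: the `Tweet` class, its field names, and `prepare_corpus` are hypothetical, chosen only to mirror the description.

```python
from dataclasses import dataclass


@dataclass
class Tweet:
    # Hypothetical record shape; the real corpus format is not specified.
    text: str
    likes: int = 0
    retweets: int = 0
    is_retweet: bool = False
    is_reply: bool = False  # replies stay in the corpus, carrying this flag


def prepare_corpus(tweets):
    """Drop retweets, then sort by engagement (likes + retweets), highest first."""
    originals = [t for t in tweets if not t.is_retweet]
    return sorted(originals, key=lambda t: t.likes + t.retweets, reverse=True)
```

Keeping the reply flag on the record, rather than splitting replies into a separate list, lets a later prompt treat them differently without losing their engagement ranking.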

Pass 1: Categorize (Haiku)

Tweets are batched into groups of ~500 and processed in parallel (up to 5 concurrent). Each batch goes through Claude Haiku with a categorization prompt that extracts:

  1. Themes — specific recurring topics (not “crypto” but “MEV resistance”, “DAO governance failures”)
  2. Values — what they defend, promote, care about consistently
  3. Positions — strong opinions with the actual stance, not just the topic
  4. Communication patterns — blunt vs diplomatic, questions vs declarations
  5. Relationships — communities, allies, tribes, arguments
  6. Boundaries — what they reject, block, refuse to engage with
  7. Decision signals — priorities, tradeoffs, sacrifice patterns
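The batching and concurrency limits described above could look roughly like this. It is a sketch under assumptions: `categorize_batch` stands in for whatever coroutine sends one batch to Haiku with the categorization prompt, and the function names here are illustrative.

```python
import asyncio

BATCH_SIZE = 500     # approximate batch size given in the text
MAX_CONCURRENT = 5   # concurrency cap given in the text


def make_batches(items, size=BATCH_SIZE):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


async def categorize_all(tweets, categorize_batch, max_concurrent=MAX_CONCURRENT):
    """Run categorize_batch over every batch, at most max_concurrent at a time.

    `categorize_batch` is a hypothetical async callable: batch -> analysis.
    Results come back in batch order.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_one(batch):
        async with semaphore:
            return await categorize_batch(batch)

    return await asyncio.gather(*(run_one(b) for b in make_batches(tweets)))
```

A semaphore is one simple way to cap in-flight requests; a worker pool or a rate limiter would serve the same purpose.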

Haiku is chosen for categorization because it's fast, cheap, and good enough for pattern extraction. The synthesis step (which requires judgment and prose quality) uses a more capable model.

Pass 2: Synthesize (Sonnet)

All batch analyses are merged and fed to Claude Sonnet with a synthesis prompt. The prompt instructs Sonnet to write in second person (“you”), be radically specific, and produce a document where every statement is falsifiable.
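How the batch analyses are merged is not specified. One plausible approach, assuming each batch analysis is a dictionary keyed by the seven categories from Pass 1 (the key names below are hypothetical), is to concatenate the entries per category and drop exact duplicates:

```python
# Hypothetical key names mirroring the seven categories from Pass 1.
CATEGORIES = [
    "themes",
    "values",
    "positions",
    "communication_patterns",
    "relationships",
    "boundaries",
    "decision_signals",
]


def merge_analyses(batch_analyses):
    """Combine per-batch findings into one dict, preserving first-seen order."""
    merged = {category: [] for category in CATEGORIES}
    for analysis in batch_analyses:
        for category in CATEGORIES:
            for item in analysis.get(category, []):
                if item not in merged[category]:  # skip exact duplicates only
                    merged[category].append(item)
    return merged
```

Near-duplicates (the same theme phrased two ways) pass through unchanged in this sketch; reconciling them is left to the synthesis model.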

The output must fit within 10KB (the onchain storage limit). If the first synthesis draft exceeds this, Sonnet is asked to condense it while preserving structure. As a last resort, the document is truncated at the nearest clean line break.
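The size-limit fallback can be sketched as follows. The sketch assumes "10KB" means 10,240 bytes of UTF-8 and that "nearest clean line break" means the last newline before the limit; `condense` is a hypothetical callable standing in for the Sonnet condensation request.

```python
MAX_BYTES = 10 * 1024  # assumption: 10KB is measured as 10,240 UTF-8 bytes


def truncate_at_line_break(text, max_bytes=MAX_BYTES):
    """Cut text at the last newline that keeps it within max_bytes."""
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return text
    clipped = data[:max_bytes]
    cut = clipped.rfind(b"\n")
    if cut == -1:
        # No line break available: hard cut, dropping any partial character.
        return clipped.decode("utf-8", errors="ignore")
    return clipped[:cut].decode("utf-8")


def fit_to_limit(draft, condense, max_bytes=MAX_BYTES):
    """Return the draft if it fits, else condense it, else truncate."""
    if len(draft.encode("utf-8")) <= max_bytes:
        return draft
    condensed = condense(draft)  # hypothetical: asks the model for a shorter draft
    return truncate_at_line_break(condensed, max_bytes)
```

Measuring in encoded bytes rather than characters matters here, since non-ASCII text takes more than one byte per character.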

Cost

| Component | Model | Typical Cost |
| --- | --- | --- |
| Categorization (all batches) | Haiku 3.5 | ~$0.08 |
| Synthesis | Sonnet 4 | ~$0.15 |
| Condensation (if needed) | Sonnet 4 | ~$0.10 |
| Total (without condensation) | | ~$0.23 |

Quality Loop

The generated soul.md is presented in an editable markdown editor before minting. Users can refine, cut, or rewrite any section. The same editor is available for post-mint updates via tba.execute().

Quality validation: feed the soul.md to any LLM as a system prompt, ask questions the person hasn't publicly answered, and check if the responses pass the vibe check — not just factually plausible but tonally and ethically correct. A good soul.md produces consistent identity regardless of which model reads it.
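The validation loop can be approximated with a small harness that gathers answers from several models for side-by-side review. This is only a sketch: `collect_responses` and the shape of `models` are hypothetical, and judging tone and ethics remains a human step.

```python
def collect_responses(soul_md, questions, models):
    """Ask every model every question, using soul_md as the system prompt.

    `models` maps a label to a hypothetical callable
    (system_prompt, question) -> answer, wrapping whichever LLM client is used.
    Returns {label: [answer, ...]} with answers in question order.
    """
    return {
        label: [ask(soul_md, question) for question in questions]
        for label, ask in models.items()
    }
```

Reading the answers grouped by question, across models, makes it easier to spot where identity drifts from one model to another.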
