Before you start

What you need

An AI tool that can read files on disk and sustain long sessions (the examples in this guide assume Claude Code), your file archive, and several hours of AI processing time for a large corpus.

Organise first (a little)

You don't need a perfectly organised archive, but the pipeline works better if your files have some folder structure. If everything is in one flat directory, spend 30 minutes creating rough groupings — by topic, time period, or source. The pipeline processes folders as batches, so structure helps.
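
If your archive is flat, a quick way to see what rough groupings already exist is to count files per top-level folder. This is a Python sketch under my own naming; its output also doubles as the folder tree with counts that the Stage 1 triage prompt asks you to paste.

```python
from collections import Counter
from pathlib import Path

def folder_counts(root: str) -> Counter:
    """Count files under each top-level folder of the archive."""
    counts: Counter = Counter()
    root_path = Path(root)
    for path in root_path.rglob("*"):
        if path.is_file():
            parts = path.relative_to(root_path).parts
            # Files sitting directly in the root get their own bucket.
            counts[parts[0] if len(parts) > 1 else "(root)"] += 1
    return counts

if __name__ == "__main__":
    for folder, n in folder_counts(".").most_common():
        print(f"{folder}: {n} files")
```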

Time and cost

Processing a large corpus through an LLM takes time and tokens. For a ~30k file vault, expect the full pipeline to take several hours of AI processing time. You can reduce this significantly by being aggressive at the triage stage — the less you process, the faster and cheaper it is. Start with your highest-value content and expand from there.

Stage 0

Folder Annotation

This is the most important preparatory step, and the one most people skip. Without it, every downstream agent is guessing about what it's reading.

The problem

A folder of markdown files could be your original writing, imported articles, course notes, client deliverables, or reference material. A folder called notes might contain your personal reflections or someone else's lecture notes you saved. Without annotation, AI agents will misattribute content — treating saved articles as your opinions, or reference material as your original work.

The technique

Walk through your folder structure and, for each major folder (or subfolder where the purpose isn't obvious from the name), create a short annotation file explaining:

- what the folder contains
- whether it's your original writing or imported/reference material
- the rough time period it covers
- its relevance to building a personal profile (high/medium/low/skip)

You can do this manually, or have AI sample files and propose annotations for you to confirm. The key is that you — the human — verify the annotations, because only you know the provenance of your files.

Sample prompt

I'm annotating my personal file archive so that AI agents can correctly
interpret the content in later processing stages.

For this folder, sample 5-10 files and tell me:
1. What this folder appears to contain
2. Whether it seems like my original writing or imported/reference material
3. The rough time period (if detectable from dates in filenames or content)
4. Relevance to building a personal profile (high/medium/low/skip)

Folder path: [path]

After reviewing the AI's assessment, correct anything wrong and save the annotation. If you use Claude Code, a CLAUDE.md file in the folder works perfectly — Claude reads these automatically.
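
If you'd rather not create every stub by hand, a small script can drop a template into each top-level folder for you to fill in and verify. A Python sketch, assuming the CLAUDE.md convention; the template fields mirror the sample prompt above:

```python
from pathlib import Path

TEMPLATE = """\
# Folder annotation (human-verified)
Contents: TODO
Original writing or imported/reference material: TODO
Rough time period: TODO
Profile relevance (high/medium/low/skip): TODO
"""

def write_annotation_stubs(root: str, filename: str = "CLAUDE.md") -> list:
    """Create an annotation template in every immediate subfolder of
    root that doesn't already have one; return the paths created."""
    created = []
    for folder in sorted(Path(root).iterdir()):
        stub = folder / filename
        if folder.is_dir() and not stub.exists():
            stub.write_text(TEMPLATE)
            created.append(stub)
    return created
```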

Output: An annotation file in each significant folder, explaining what's in it and whose content it is.

Stage 1

Inventory & Triage

Not all files contribute equally to a personal context document. A journal entry about a life decision is worth a hundred utility bills. This stage builds a prioritised manifest of what to process.

The technique

Crawl your file structure and classify every folder (or file group) into tiers:

- Tier 1 (process thoroughly): rich in personal context
- Tier 2 (extract facts): contains useful biographical or professional facts
- Tier 3 (skim): minor relevance, might contain a useful detail
- Tier 4 (skip): no personal-context value

What to skip

Be aggressive here. Anything with no personal-context value belongs in Tier 4; the less you process, the faster and cheaper the pipeline runs. The appendix at the end of this guide can help you decide.

Sample prompt

I'm building a personal context document — a comprehensive profile of
who I am, distilled from my personal files. I need to triage my file
archive by relevance.

Here's my folder structure with file counts:
[paste folder tree with counts]

For each folder, classify as:
- TIER 1 (process thoroughly): Rich in personal context
- TIER 2 (extract facts): Contains useful biographical/professional facts
- TIER 3 (skim): Minor relevance, might contain a useful detail
- TIER 4 (skip): No personal context value

Also flag any folders where you'd need to see sample files before
classifying.

Output: A prioritised manifest — a list of every folder with its tier classification and processing order.
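
One workable shape for the manifest is a small CSV, ordered so Tier 1 folders come first and Tier 4 never enters the queue. A Python sketch; the Entry type and column layout are my own choices:

```python
import csv
from dataclasses import dataclass

@dataclass
class Entry:
    folder: str
    tier: int      # 1 = process thoroughly ... 4 = skip
    note: str = ""

def processing_order(entries):
    """Drop Tier 4 folders and return the rest, highest priority first."""
    return sorted((e for e in entries if e.tier < 4), key=lambda e: e.tier)

def save_manifest(entries, path: str) -> None:
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["folder", "tier", "note"])
        for e in processing_order(entries):
            writer.writerow([e.folder, e.tier, e.note])
```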

Stage 2

Folder-Level Extraction

This is where the bulk of the work happens. Each folder (or subfolder) is processed as an independent batch, and an AI agent extracts every personal-context fact it can find.

The technique

Process each folder from your manifest as a separate AI session. For large folders, break them into sub-batches that fit comfortably in a context window. The key is that each batch is independent — which means you can parallelise aggressively.
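
Sub-batching can be mechanical: accumulate files until their combined size reaches a rough budget, using bytes on disk as a cheap proxy for tokens. A Python sketch; the 200,000-byte default is an assumption you should tune to your model's context window:

```python
from pathlib import Path

def sub_batches(folder: str, budget: int = 200_000):
    """Group a folder's files so each batch stays under a rough
    size budget (bytes, as a cheap proxy for context tokens)."""
    batches, batch, size = [], [], 0
    for path in sorted(Path(folder).rglob("*")):
        if not path.is_file():
            continue
        n = path.stat().st_size
        if batch and size + n > budget:
            # Current batch is full; start a new one.
            batches.append(batch)
            batch, size = [], 0
        batch.append(path)
        size += n
    if batch:
        batches.append(batch)
    return batches
```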

For Tier 1 content (thorough processing)

Have the agent read every file and extract everything that reveals personal context: facts, opinions, decisions, relationships, skills, timeline events, emotional states, patterns of thought.

For Tier 2 content (fact extraction)

Pull out concrete facts: dates, roles, company names, qualifications, locations, tools used, people mentioned. Don't summarise the prose — just extract the data points.

For Tier 3 content (skim)

Look at file names, folder structure, and a sample of content. What topics recur? What did you choose to save? This reveals interests without needing to read every file.

Parallelisation

If your AI tool supports sub-agents (Claude Code does), launch one agent per folder simultaneously. This is the single biggest time-saver in the pipeline. Instead of processing 15 top-level folders sequentially (hours), process them in parallel (the time of the slowest one).
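
The fan-out itself is a few lines in Python. process_folders runs one agent per folder concurrently; claude_agent is a hypothetical runner that assumes Claude Code's headless mode (claude -p), so substitute whatever CLI or API your tool actually exposes:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def process_folders(folders, run_agent, max_workers: int = 8) -> dict:
    """Run one extraction agent per folder concurrently.
    run_agent(folder) must block until that folder's summary is
    produced; returns {folder: summary}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs correctly.
        return dict(zip(folders, pool.map(run_agent, folders)))

def claude_agent(folder: str) -> str:
    # Assumes Claude Code's headless mode; the prompt is abbreviated here.
    prompt = f"Read every file in {folder} and extract personal context."
    result = subprocess.run(["claude", "-p", prompt],
                            capture_output=True, text=True)
    return result.stdout
```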

Sample prompt

I'm building a personal context document about myself. You are processing
one folder from my personal file archive.

Folder: [folder name]
Context: [paste the folder's annotation from Stage 0]

Read every file in this folder and extract everything that reveals
personal context about me:

- Biographical facts (birth, education, locations, family)
- Professional history (roles, companies, what I actually did)
- Projects I've worked on and their outcomes
- Skills, tools, and technologies I've used (with rough proficiency)
- Beliefs, values, opinions, principles
- Interests and passions (especially recurring ones)
- Key relationships and collaborations
- Timeline events with dates (moves, career changes, milestones)
- How I communicate and think (voice, style, decision-making patterns)
- Emotional states, struggles, turning points

Rules:
- Be specific. Include names, dates, places, and concrete details.
- Distinguish between MY writing/views and reference material I saved.
- If a file is clearly someone else's work, note what it reveals about
  my interests (I chose to save it) but don't attribute the views to me.
- Preserve chronological information — note when things happened.
- Include direct quotes where they capture my voice or a strong opinion.
- Flag any contradictions or evolution you notice within this folder.

Output: One summary document per folder (or sub-batch), containing all extracted personal-context facts. Expect 50–100 of these across a large archive.

Stage 3

Thematic Synthesis

The folder summaries from Stage 2 are organised by source, but your me.md needs to be organised by theme. This stage reorganises, deduplicates, and merges.

The problem

The same fact will appear in multiple folder summaries. A project might be mentioned in your journals, your project folder, your CV, and your annual review. A career move might appear in admin records (change of address), personal notes (the decision process), and project files (the work itself). Simply concatenating summaries would produce a repetitive, disorganised mess.

The technique

Take all folder summaries and synthesise them into thematic sections. The suggested themes match the target document structure: timeline, career and professional history, projects, skills and tools, beliefs and values, interests, relationships, voice and thinking style, and current context.

Key challenges

Deduplication (the same fact surfaces in many summaries), contradictions between sources that must be resolved or flagged, and keeping chronology straight when a topic spans years.

Sample prompt

I'm building a personal context document. Below are summaries extracted
from different folders of my personal file archive. There is significant
overlap and repetition across these summaries.

Synthesise all of these into a coherent draft for the following section:
[SECTION NAME, e.g. "Timeline" or "Career & Professional History"]

Rules:
- Merge duplicate information. Don't repeat the same fact twice.
- When the same topic appears at different time periods, show the
  evolution chronologically.
- Prefer specific details (dates, names, numbers) over vague statements.
- Preserve direct quotes that capture my voice or strong opinions.
- Flag any contradictions you can't resolve so I can clarify.
- Aim for [X] words for this section.
- Prioritise information from journals and personal reflections over
  admin records for subjective content (beliefs, interests, emotions).

Source summaries:
[paste all folder summaries]

Run this once per thematic section. You can parallelise these too — each section synthesis is independent.

Output: 8–12 thematic section drafts, each covering one aspect of your personal context with information merged from all sources.

Stage 4

Final Composition

Assemble the thematic sections into a single, coherent document and compress it to fit your context budget.

The technique

Feed all section drafts to AI and have it compose the final document. This isn't just concatenation — it's editing for flow, removing cross-section redundancy, and enforcing the word budget.

Sample prompt

Below are thematic section drafts for my Personal Context Document
(me.md). Assemble these into a single, coherent document.

Guidelines:
- Target length: 4,000–6,000 words. This is a hard constraint.
- Every sentence must earn its place. If removing it wouldn't change
  how an AI assists me, cut it.
- Use clear markdown structure with ## headers for major sections
  and ### for subsections.
- Lead each section with the most important information.
- The Timeline section should read as a chronological narrative,
  not a list of bullet points.
- Don't editorialize or add interpretation. Stick to facts and
  direct quotes from my own writing.
- End with a "Current Context" section covering my present situation,
  active projects, and priorities — this is the most immediately
  useful section for AI interactions.
- Preserve specific details (dates, names, places, numbers) —
  these are what make the document useful rather than generic.
- Where my views or situation have evolved over time, show the
  trajectory, not just the current state.

Section drafts:
[paste all thematic section drafts]

Output: Your me.md — a single markdown document, 4,000–6,000 words, containing a comprehensive personal context profile distilled from your entire digital life.

After the pipeline

Review and correct

Read your me.md carefully. With multiple summarisation layers, facts can drift. Check that:

- dates, names, and places are accurate
- views attributed to you are actually yours, not drawn from reference material you saved
- any contradictions the synthesis flagged are resolved or explained

Edit directly. The pipeline gives you a strong first draft; your knowledge of your own life makes it accurate.

Where to put it

Put it wherever your AI tools will actually read it: referenced from a project-level instruction file (such as a CLAUDE.md for Claude Code), or pasted into the custom instructions of the assistants you use.

Keeping it current

Your me.md will decay. The "Current Context" section goes stale within weeks, so set yourself a refresh schedule.

You don't need to re-run the full pipeline for routine updates — just edit the relevant sections. Only re-run the pipeline if you've accumulated a large amount of new source material (e.g., a year of journals).

Writing principles for your prompts

These principles should guide the AI at every stage of the pipeline. Include them in your prompts or in a project-level instruction file.

Specificity over generality

Vague

"Experienced with several programming languages."

Useful

"Primary stack: Go and PostgreSQL (7 years). Frontend: Svelte (3 years, previously React). Comfortable with Python for scripting."

Specific details let the AI calibrate. Vague statements tell it nothing.

The why, not just the what

Flat

"Moved to Berlin in 2019."

Revealing

"Moved to Berlin in 2019 — wanted to be closer to the startup ecosystem after years of remote work."

Motivation reveals values and decision-making patterns.

Evolution, not just current state

Static

"I believe in test-driven development."

Shows growth

"Converted to TDD around 2020 after a production incident that tests-first would have caught. Now non-negotiable for anything that handles money."

Evolution shows how you think and learn, not just what you currently believe.

Honest proficiency levels

Don't let the AI inflate your expertise. If you're intermediate at something, the document should say so. An AI that thinks you're an expert will skip explanations you need.

Tensions and contradictions

Real people are contradictory. You might value minimalism but hoard side projects. You might preach work-life balance but work 60-hour weeks when excited. Instruct the AI to preserve these tensions — they give a more accurate model of how you actually behave.

Write for an AI audience

The document will be consumed by a system that uses it to calibrate responses. That means density matters more than polish: concrete, specific statements an AI can act on beat narrative written to entertain a human reader.

Appendix: source material by value

Use this to guide your triage decisions in Stage 1.

High-value (Tier 1)

Journals, personal reflections, notes on decisions, and any other original writing about your life.

Medium-value (Tier 2)

CVs, annual reviews, project records, and client deliverables, all dense in concrete biographical and professional facts.

Low-value (Tier 3)

Saved articles, course and lecture notes, and other reference material. The content isn't yours, but what you chose to save reveals your interests.

Skip (Tier 4)

Utility bills and similar routine paperwork with no personal-context value.