Open Knowledge FormatBuild an AI-maintained knowledge base — from beginner to enterprise
A knowledge base stored as plain Markdown files that humans and AI agents can read, write, and use together.
Preface
Welcome to The Open Knowledge Format (OKF) Handbook. This book walks you through building and operating an AI-maintained knowledge base with OKF — a knowledge base stored as Markdown files with YAML frontmatter that both humans and AI agents can use, without any SDK or special database.
🌐 อ่านฉบับภาษาไทยได้ที่ปุ่ม ไทย มุมขวาบน (Thai version available — click ไทย in the top bar)
Table of Contents
Who this book is for
- Beginners who want a searchable, vendor-neutral personal/team knowledge base
- Developers / data teams who want AI agents to reach internal knowledge systematically
- Architects / platform teams who need a shared system across sessions and teams, on-prem
No prior OKF knowledge needed — just basic command line and Git.
Conventions
- Commands you type go in code blocks · technical terms (
concept,frontmatter,bundle,MCP) stay in English to match the source docs and code · blockquotes are warnings/tips
Versioning note: OKF is a v0.1 spec (published 2026-06-12 by Google Cloud). The core spec requires only the
typefield — most "best practices" here come from the LLM-wiki community and Google's reference implementation.
Source project (code + all tools): https://github.com/supachai-j/open-knowledge-format-starter
Start at What is OKF
Table of Contents
Part 1 — Meet OKF
Part 2 — Getting Started
Part 3 — Core Concepts
- Bundle, Concept, and Concept ID
- Frontmatter (metadata)
- Linking into a knowledge graph
- Reserved files: index.md and log.md
Part 4 — Everyday Operations
- Ingest: adding knowledge to the wiki
- Query and Search
- Adding and editing concepts
- Validate and Visualize
- Worked example: a bookstore KB
Part 5 — Authoring Well
Part 6 — Enterprise
- Architecture overview
- Self-hosted deployment
- Write models: PR-gated and Lease
- Search at scale and semantic
- Security and governance
Appendix
What is OKF?
Open Knowledge Format (OKF) is an open specification for storing an organization's "knowledge" as a directory of Markdown files with YAML frontmatter, so that both humans and AI agents can write, read, exchange, and use that knowledge — without needing any SDK, database, or specialized tooling.
If you can
cata file, you can read OKF · If you cangit clonea repo, you can share OKF
Background
OKF v0.1 was released on 12 June 2026 by the Data Cloud team at Google Cloud (Sam McVeety and Amir Hormati) as a way of turning the "LLM-wiki pattern" proposed by Andrej Karpathy into a portable, vendor-neutral standard.
The core idea of LLM-wiki is: instead of having an LLM re-discover raw documents from scratch every time a question is asked, let AI progressively synthesize knowledge into organized, pre-linked Markdown pages, then load only the relevant pages directly into context.
Core Components (see Part 3 for details)
| Term | Short definition |
|---|---|
| Bundle | A directory of all knowledge files — the unit of distribution |
| Concept | One unit of knowledge = one .md file (e.g. a table, a metric, a playbook) |
| Concept ID | The file's path within the bundle with .md stripped, e.g. tables/orders.md → tables/orders |
| Frontmatter | The YAML block at the top of a file (stores metadata such as type, title, tags) |
| Link | A Markdown link between concepts = builds relationships into a graph |
Example: a single concept file
File tables/orders.md:
---
type: BigQuery Table
title: Orders
description: One row per customer order
tags: [sales, orders]
timestamp: 2026-06-15T00:00:00Z
---
# Schema
| Column | Type | Description |
| :--- | :--- | :--- |
| order_id | STRING | Order identifier (unique) |
| customer_id | STRING | FK to [customers](customers.md) |
# Joins
Joined to [customers](customers.md) via `customer_id`
As you can see, this is plain Markdown that is immediately human-readable — just a small YAML header and links to other concepts.
3 Design Principles
- Minimally opinionated — frontmatter requires only a single
typefield; everything else is defined by the producer. - Producer and consumer are independently decoupled — a bundle written by hand, generated by an agent, or exported by a pipeline can be read by any tool.
- A graph, not just a tree — concepts link to each other via Markdown links, creating richer relationships than folder structure alone.
What OKF is NOT
- Not a fixed taxonomy — it does not prescribe which types must exist.
- Not an opinion on how knowledge must be stored, served, or searched.
- Not a replacement for domain-specific schemas (Avro, Protobuf, OpenAPI) — OKF references those; it does not absorb them.
Next, let's look at why you would choose OKF over traditional RAG → Why OKF?
Why OKF? (vs. RAG)
If you have ever built an AI question-answering system on top of organizational documents, you have probably used RAG (Retrieval-Augmented Generation) before. OKF is not here to kill RAG — it is here to address the weaknesses that RAG most commonly hits in production.
How traditional RAG works
Raw documents → chunk → embed → store in vector DB
↓ (at query time)
query → find nearest chunks → stuff into context → answer
The LLM re-discovers knowledge from scratch on every query — no accumulation, no understanding of which chunk relates to what, contradicts what, or is stale.
Common RAG failure modes
| Problem | What happens |
|---|---|
| Chunking artifacts | Sentences are cut mid-way — e.g. "employees may work from home, except during the first 90 days" may be reduced to just "employees may work from home" → confidently wrong answer |
| Knowledge decay | New documents keep arriving but old contradicting content stays → old content gets retrieved (a leading cause of RAG project failures) |
| Black box | Retrieval returns chunks that look similar but are not the best answer, and nobody can see why |
| No accumulation | A question requiring synthesis across 5 documents must be reassembled from scratch every time; nothing is built up permanently |
How OKF is different
OKF shifts from "pull raw chunks at query time" to "a wiki that AI maintains and pre-synthesizes":
- When a new source arrives, AI reads, summarizes, and merges it into the existing wiki — updating pages, fixing cross-references, and flagging conflicts when new information contradicts old.
- At query time, already-digested Markdown pages are loaded directly into context — no chunking, no vector math.
- Knowledge becomes a compounding asset: every new source added makes the wiki richer.
| Dimension | RAG (raw) | OKF (pre-synthesized) |
|---|---|---|
| Storage format | Specialized vector DB | Plain Markdown files + git |
| Human-readable? | No (UI / binary) | Yes, with any text editor |
| Version control | Difficult | Built-in (git diff / PR / blame) |
| Knows about conflicts / staleness? | No | Can flag and supersede |
| Vendor lock-in | High | None (it's just files) |
Is "RAG is dead" actually true?
No — most engineers view them as separate layers, not an either/or choice:
- Layer 1 — wiki (OKF): Core synthesized knowledge. Finding the answer here is the end of the search (fastest, highest signal).
- Layer 2 — raw documents + vector search: Used when the wiki does not yet cover the question (fall back to raw sources).
- Layer 3 — LLM general knowledge: Fills gaps that exist in neither the wiki nor the raw documents.
The OKF starter in this book supports both worlds: the wiki as the primary layer, plus hybrid search (BM25 + semantic) available as a next step when the wiki grows (see Part 6).
When to use / not use OKF
Good fit when: knowledge must be long-lived, reused, shared across multiple people or agents, auditable, and you want to avoid vendor lock-in.
May not be worth it when: you have a massive corpus of raw documents that will never be synthesized (straight RAG is more cost-effective), or the data is single-use and throwaway.
Ready to get started → Install
History and Evolution of the Knowledge Base
For over sixty years, humanity has asked the same question again and again: "How do we make machines remember and understand what we know?" The answer has changed with every era — from hand-crafted rules, to semantic graphs, to word indexes, to interlinked notes, and most recently to AI that synthesizes knowledge on our behalf. This chapter is the story of that journey, and its (provisional) destination: OKF.
Reference numbers
[n]point to the Bibliography
flowchart LR E1["1960s–80s
Expert systems"] --> E2["1990s–2000s
Semantic Web"] --> E3["1970s–2010s
Databases & IR"] --> E4["1995–2020
Wiki & PKM"] --> E5["2012–2020s
AI / RAG"] --> E6["2026
LLM-wiki & OKF"]
Act 1 — The Era We Tried to Hand-Feed Knowledge to Machines
In the 1960s at Stanford, a group of scientists believed that intelligence in a narrow domain could be captured by encoding expert knowledge as rules and letting a machine reason over them. They built DENDRAL (1965), which analyzed molecular structures from mass spectrometry data on par with a trained chemist [1] — the first time the world saw that "knowledge," not "search," was the real key to AI.
A few years later, MYCIN (early 1970s, Edward Shortliffe) used around 600 IF-THEN rules to diagnose bloodstream infections with accuracy matching senior physicians — and it could explain its own reasoning [1]. The heart of MYCIN was an architecture that lives with us to this day: separating the knowledge base (what is known) from the inference engine (how to reason):
IF infection-type = primary-bacteremia
AND culture-site = blood
AND portal-of-entry = gastrointestinal-tract
THEN there is suggestive evidence (CF = 0.4) that the organism is Bacteroides
But the dream hit a wall. In 1984, Douglas Lenat launched the Cyc project — an attempt to hand-encode all of human commonsense knowledge [1]. Decades and tens of millions of dollars later, the world learned an expensive lesson called the "knowledge acquisition bottleneck": hand-entering knowledge can never scale.
🧬 DNA inherited by OKF: Separating "knowledge (files)" from "the engine that uses knowledge (agent)" — OKF's producer/consumer architecture is a direct descendant of this idea.
Act 2 — The Era We Tried to Give "Meaning" to the Web
In 2001, the father of the web, Tim Berners-Lee, published an article in Scientific American dreaming of the Semantic Web [2] — a web that was not merely documents for humans to read, but data whose meaning machines could understand. He proposed storing knowledge as triples (subject–predicate–object), which are simply "edges in a graph":
@prefix ex: <http://example.org/> .
ex:TimBernersLee ex:invented ex:WorldWideWeb .
With RDF, OWL, and SPARQL, machines could infer new facts from the graph automatically [2]. It was beautiful in theory — but writing a correct ontology was far beyond what ordinary people could manage. By 2013, fewer than 2% of websites used semantic markup. What survived and flourished instead were the "easier to use" successors: linked data, schema.org (2011), and the knowledge graph [2].
🧬 DNA into OKF: Knowledge connected as a graph has enormous value — but OKF chose plain Markdown links (untyped) over strict RDF/OWL, so that people can actually write it.
Act 3 — The Era We Learned to Search Intelligently
While AI dreamed big, another line of work proceeded quietly — and changed the world. In 1970, Edgar Codd of IBM proposed the relational model, allowing data to be stored in tables and queried with SQL [3]. But it matched records exactly; it could not search free text and rank results by relevance.
The answer came from Karen Spärck Jones (1972), who proposed IDF — a simple but profound idea: a word that appears in fewer documents is a stronger signal than one that appears everywhere [3]. Combined with term frequency this became TF-IDF, and later BM25 (~1994), which remains the standard for lexical search to this day. Here is a tiny example — the word "the" lives in every document and is worthless for discrimination, while the word "dog," found in a single document, stands out sharply:
| Term | Documents | Distinctiveness |
|---|---|---|
| the | D1, D2, D3 | 0 (useless) |
| dog | D3 | high |
When Doug Cutting released Lucene (1999) and later Elasticsearch (2010), industrial-grade full-text search landed in everyone's hands [3]. The one limitation that lingered: it was lexical — searching "car" would miss "automobile" because it understood no meaning.
🧬 DNA into OKF: BM25 is still powerful and lightweight —
tools/okf-index.pyuses it as the primary search method.
Act 4 — The Era Knowledge Became Everyone's
In 1995, Ward Cunningham created WikiWikiWeb, the first website that anyone could edit [4]. Six years later, Wikipedia (2001) proved that humanity's collective knowledge could grow through open contribution [4].
But the most remarkable story had unfolded earlier, on the desk of a German sociologist. Niklas Luhmann had accumulated a Zettelkasten of some 90,000 index cards, each linked to others. He produced more than 50 books from it and left behind an immortal principle: "The value lies not in the individual note, but in the links between notes." [4]
The digital age rediscovered this principle around 2016–2020 through Notion, Roam, and Obsidian — all
built on Markdown + [[wikilinks]], which won the format wars for compounding reasons: humans can read
it without rendering, and machines can parse it without a special parser [4].
---
title: "Zettelkasten Principle"
tags: [pkm]
---
The value of the card box lies in the **links between notes** — see also [[Obsidian]]
🧬 DNA into OKF: This is the most direct lineage — Markdown + YAML frontmatter + links between concepts is exactly what OKF looks like.
Act 5 — The Era "Meaning" Became Geometry
In 2013, Tomas Mikolov's team at Google revealed something that looked like magic: word2vec converted words into vectors where "similar meaning = nearby position," making arithmetic of meaning possible [5].
flowchart LR K["king"] -- " − man + woman " --> Q["≈ queen"]
The years that followed accelerated relentlessly: Google Knowledge Graph (2012, "things, not strings"), FAISS (2017) searching billions of vectors, BERT (2018) giving the same word different meanings depending on context [5]. Then in 2020, RAG (Lewis et al.) arrived to fix the largest weakness of LLMs — fabrication — by retrieving real evidence to ground answers [5]:
flowchart LR D["Documents"] --> C["Chunk"] --> EM["Embed"] --> S["Vector store"] Q["Query"] --> R["Retrieve top-k"] S --> R --> G["LLM generate
grounded answer"]
And to combine the strengths of "exact term matching (BM25)" with "semantic matching (vector)," the world reached for Reciprocal Rank Fusion (RRF, 2009), producing the hybrid search that is today's default [5].
🧬 DNA into OKF: wiki = Layer 1 (pre-synthesized); RAG/vector = Layer 2 (mining raw sources);
okf-search.pyfuses BM25 + semantic with RRF.
Act 6 — The Present: When AI Takes Care of Knowledge Itself
In April 2026, Andrej Karpathy posted a short idea that ignited the entire field: "LLM wiki" [6] — instead of retrieving raw chunks on every query (as RAG does), have an agent compile raw sources into organized, interlinked, continuously maintained Markdown. Synthesize once at ingest; knowledge therefore compounds — the more you use it, the richer it gets, rather than starting from zero every time.
Two months later, on 12 June 2026, Google Cloud (Sam McVeety, Amir Hormati) made this pattern an open
standard under the name Open Knowledge Format (OKF) v0.1 [6] — a directory of Markdown files + YAML
frontmatter requiring only a type field, with separate producer/consumer roles, portable across clouds and
frameworks.
Before that, MemGPT/Letta (2023) had already demonstrated "LLM as OS" — managing memory in a tiered fashion (in-context = RAM, external = disk) and paving the way for agents with persistent memory [6].
Epilogue — And the Future That Is Coming
The direction ahead is a self-maintaining knowledge base: agents do not merely "query" knowledge — they
curate it — checking whether information has gone stale (via timestamp/log.md), reconciling conflicts
across concepts, and proposing updates for human approval before committing [6]. The next layer will likely be
a hybrid wiki + RAG architecture: the pre-synthesized wiki serves as a fast index, while RAG fills gaps
for data that changes too frequently to precompile — with agent memory as the runtime and multi-agent systems
dividing responsibilities for curating and consuming knowledge, moving toward what is beginning to be called
"compiled-knowledge generation".
Why OKF Is the Sum of All Six Acts
OKF does not reinvent everything — it fuses the best parts of every era into one. Six streams converge into a single confluence:
flowchart TD A["Expert systems
knowledge vs. engine"] --> OKF B["Semantic Web
knowledge as a connected graph"] --> OKF C["IR / BM25
fast and lightweight search"] --> OKF D["Wiki / PKM
Markdown + links, readable & versionable"] --> OKF E["AI / RAG
embeddings + AI synthesis"] --> OKF OKF["✦ Open Knowledge Format ✦"]
Next, dive into the Core Concepts Explained (with Examples) introduced throughout this chapter.
Core Concepts Explained (with Examples)
This chapter summarizes the foundational concepts of Knowledge Base systems introduced in the History chapter — each topic includes a short definition + concrete example + how it relates to OKF so you can apply it immediately.
1. Knowledge Representation
What it is: A way of storing "what is known" so a machine can process it. Classic forms: production rules (IF-THEN), semantic networks (concept graphs), frames (slot-filler).
Example (frame):
FRAME: Bacteroides
IS-A: Anaerobic-Gram-Negative-Rod
Gram-stain: negative (default)
Treatment: [metronidazole, clindamycin]
In OKF: Each concept (a .md file) is one unit of knowledge — the frontmatter holds queryable fields; the body holds detail.
2. Ontology
What it is: A machine-readable specification of the concepts and relationships in a domain (more than a taxonomy — it includes logic that enables inference).
Example (RDF triple): TimBernersLee — invented — WorldWideWeb (subject–predicate–object)
In OKF: Markdown links between concepts serve a similar role to triples but are untyped — the type of relationship lives in the prose (much lighter than OWL).
3. Inverted Index
What it is: A data structure that maps "term → list of documents containing that term" — the heart of full-text search.
Example:
| Term | Documents |
|---|---|
| cat | D1, D2 |
| dog | D3 |
Search "dog" → returns D3 immediately, without scanning every document.
In OKF: okf-index.py builds an in-memory inverted index to power BM25.
4. TF-IDF & BM25 (Relevance Ranking)
What it is: A scoring formula measuring how well a document matches a query — TF (how often the term appears in the document) × IDF (how rare the term is across the corpus = more distinctive); BM25 improves on this with length normalization and saturation.
Example: The word "the" appears in every document → IDF = log(3/3) = 0 → score 0 (does not help discriminate). The word "dog" appears in only one document → high IDF → distinctive.
In OKF: BM25 is the primary search method in okf-search.py (lightweight, no extra dependencies).
5. Embeddings (Semantic Vectors)
What it is: Converting text into a numeric vector where "similar meaning = similar vector".
Example (word2vec): king − man + woman ≈ queen — semantic relationships become vector arithmetic.
In OKF: okf-embed.py generates embeddings for concepts using a local model (Ollama) for semantic search.
6. Vector / Semantic Search
What it is: Searching by vector proximity (e.g., cosine similarity) → captures meaning and synonymy that keyword search misses.
Example: Searching "car" finds a document written with "automobile" because their vectors are close.
In OKF: An optional layer (opt-in) — if embeddings/Ollama are unavailable it automatically falls back to BM25.
7. RAG (Retrieval-Augmented Generation)
What it is: Retrieving relevant information and inserting it into the LLM's context at query time to ground the answer (reduce hallucination, enable citation).
Example (5 steps):
flowchart LR D["Documents"] --> C["chunk"] --> E["embed"] --> S["vector store"] Q["Query"] --> R["retrieve top-k"] S --> R --> G["LLM generate
grounded answer"]
In OKF: The wiki = Layer 1 (pre-synthesized; finding it in the wiki is sufficient); RAG = Layer 2 (mining raw documents when the wiki does not yet cover the topic).
8. Hybrid Search & RRF
What it is: Combining results from multiple retrieval methods (BM25 + semantic) using Reciprocal Rank Fusion: score(d) = Σ 1/(k + rank) (k=60).
Example: A document that ranks highly in both BM25 and semantic search floats to the top, with no single signal dominating.
In OKF: okf-search.py uses RRF to fuse BM25 + semantic results.
9. Knowledge Graph
What it is: A graph of entities (nodes) + typed relationships (edges) — "things, not strings" — enabling entity-level disambiguation and reasoning.
Example:
flowchart LR CU["Customer A"] -->|orders| O["Order 123"] O -->|contains| P["Product X"] CU -->|in segment| SEG["High-value"]
In OKF: The entire bundle forms a knowledge graph (concepts = nodes, links = edges) — viewable with okf-viz.py.
Summary Table
| Concept | What it captures | Example in OKF |
|---|---|---|
| Knowledge representation | Knowledge structure | concept + frontmatter |
| Ontology | Semantic relationships | Markdown links (untyped) |
| Inverted index / BM25 | Exact term matching | okf-index.py |
| Embeddings / vector search | Semantic meaning | okf-embed.py |
| RAG | Grounding LLM answers | wiki (L1) + RAG (L2) |
| Hybrid / RRF | Fusing multiple signals | okf-search.py |
| Knowledge graph | Entities + relationships | bundle + okf-viz.py |
See the Bibliography for the original papers and standards behind each concept.
Install
The OKF starter is a pure-Python tool (no mandatory dependencies) that works fully offline / air-gapped. Requirements:
- Python 3.13+ (every tool uses the stdlib only)
- Git (for version-controlling your bundle)
- (Optional) Ollama if you want local semantic search
- (Optional) Docker if you want to self-host the server (Part 6)
Option 1 — Install as a Claude Code skill (recommended)
This approach lets every project and every session create and maintain an OKF bundle without needing to be inside the source repository.
git clone https://github.com/supachai-j/open-knowledge-format-starter.git
cd open-knowledge-format-starter
./install.sh # install globally → ~/.claude/skills/okf (available to all projects)
./install.sh --project # install for this project only → ./.claude/skills/okf
./install.sh --dir <path> # install to a custom path
./install.sh --uninstall # remove the installation
install.sh bundles the skill (the SKILL.md file + all Python tools + the embedded viewer library)
into a self-contained skill folder. Afterwards, open Claude Code and type /okf or say
"init an OKF knowledge base here" to get started.
How the skill works: When invoked, Claude Code locates the skill and runs the scripts in its
scripts/directory — regardless of whether it was installed globally or per-project.
Option 2 — Use the repo directly
If you prefer to work inside the repository itself (or are not ready to install the skill yet), all tools
are available under tools/:
git clone https://github.com/supachai-j/open-knowledge-format-starter.git
cd open-knowledge-format-starter
python3 tools/okf-validate.py ./wiki # → ✓ CONFORMANT with OKF v0.1
python3 tools/okf-viz.py ./wiki # → wiki/viz.html (open in browser)
Option 3 — Use as a GitHub template
The source repository is configured as a template repository — click "Use this template" on GitHub to create your own new repo with the full structure already in place.
Verify the installation
python3 --version # should be 3.13 or higher
python3 tools/okf-validate.py --help 2>/dev/null || python3 tools/okf-validate.py ./wiki
If you see ✓ CONFORMANT with OKF v0.1, you are ready to go. Let's create your first knowledge base →
Create your first knowledge base
Build Your First Knowledge Base
Let's create your first OKF bundle in just a few minutes.
Step 1 — Scaffold the structure
Use okf-init.py to scaffold a new bundle in the current directory (or a directory you specify):
python3 tools/okf-init.py . # scaffold in the current directory
# or if installed as a skill:
python3 ~/.claude/skills/okf/scripts/okf-init.py ./my-kb
Output:
Scaffolding OKF bundle in /path/to/project
wrote: AGENTS.md
wrote: raw/README.md
wrote: wiki/index.md
wrote: wiki/log.md
wrote: wiki/getting-started.md
✓ done. Next: validate with okf-validate.py, then ingest sources from raw/.
Safe: If a non-empty
wiki/directory already exists, the script will not overwrite it unless you pass--force.
Step 2 — Check conformance
python3 tools/okf-validate.py ./wiki
# → ✓ CONFORMANT with OKF v0.1 (0 warning(s), 0 info)
Step 3 — View the graph
python3 tools/okf-viz.py ./wiki --name "My First KB"
# → wiki/viz.html (single file, open in a browser instantly — no internet required)
Open wiki/viz.html to see concepts as nodes connected by edges. Click any node to view its details.
Step 4 — Start capturing knowledge
- Drop your source files (PDFs, notes, exports) into the
raw/folder. - Tell the AI agent to "ingest raw/
into the wiki" — the agent will extract key points for your approval before writing anything. (Details in the Ingest chapter.)
Writing concepts by hand
OKF is plain Markdown — you can create a concept yourself. For example, create wiki/glossary/wau.md:
---
type: Metric
title: Weekly Active Users (WAU)
description: The number of unique customers who placed at least one order in the past 7 days
tags: [growth, metric]
timestamp: 2026-06-15T00:00:00Z
---
# Definition
Count unique `customer_id` values that have at least one order in the last 7-day window.
Then run okf-validate.py again to confirm it is still conformant.
Quick-reference command table
| Task | Command |
|---|---|
| Create a new bundle | okf-init.py <dir> |
| Check conformance | okf-validate.py ./wiki |
| View the graph | okf-viz.py ./wiki |
| Build the search index | okf-index.py build ./wiki |
| Search | okf-search.py "question" --bundle ./wiki |
Next, understand the project layout that was scaffolded for you.
Project Layout
The OKF starter uses a three-layer architecture with clearly separated responsibilities.
my-kb/
├── AGENTS.md ← Layer 3: schema + agent operating rules (read this first)
├── raw/ ← Layer 1: raw source material (read-only)
│ └── README.md
├── wiki/ ← Layer 2: OKF bundle (agent-managed)
│ ├── index.md ← reserved file: table of contents (progressive disclosure)
│ ├── log.md ← reserved file: change log
│ └── getting-started.md
└── tools/ ← tooling (validate, viz, index, ...)
What are the three layers?
| Layer | Folder | Who writes | Role |
|---|---|---|---|
| 1. Raw sources | raw/ | Human | Source of truth — read-only, agents never modify |
| 2. Wiki | wiki/ | Agent | Synthesized knowledge (OKF bundle) |
| 3. Schema | AGENTS.md | Human + agent jointly | Rules, conventions, and workflow that govern the agent |
Key point: wiki/ is the "bundle root", so Concept IDs are counted from wiki/. For example,
wiki/tables/orders.md has Concept ID = tables/orders.
AGENTS.md — The most important file
This is the first file an agent reads. It tells the agent:
- The structure of the bundle
- Frontmatter rules (which fields are required/recommended, the controlled vocabulary for
type) - Linking rules (use relative paths, never start with
/) - The ingest / query / validate workflow
This file is what turns the AI into a "disciplined wiki curator" rather than a generic chatbot. You and the agent will gradually develop this file together to fit your domain.
If you use a different agent, the filename might be CLAUDE.md (Claude Code) or GEMINI.md — the content is the same.
raw/ — Raw source material
Place your originals here: PDFs, meeting notes, exports, datasets.
- Immutable — agents read but never modify. This is the source of truth.
- Files in
raw/are not part of the bundle (the bundle iswiki/); they are the input to the ingest process. - In the real starter,
.gitignorewill not commit files insideraw/to prevent accidentally pushing private or large data (onlyraw/README.mdis kept) — if you want to version-control your actual sources, comment out theraw/*line.
wiki/ — OKF bundle
This folder is owned by the agent. It organizes concepts into categories. The initial scaffold contains:
wiki/
├── index.md ← root table of contents (contains okf_version)
├── log.md ← change log
├── getting-started.md ← example concept
└── (yours) tables/ references/ playbooks/ ...
Commonly seen categories (from the Google reference implementation):
tables/— tables (type: BigQuery Table)datasets/— datasets (type: BigQuery Dataset)references/— synthesized knowledge, e.g.references/metrics/,references/joins/(type: Reference)
Next, dive into the core concepts — start with Bundle, Concept, and Concept ID.
Bundle, Concept, and Concept ID
These three terms are the heart of OKF. Understand them and you understand half the spec.
Bundle (knowledge set)
A bundle is the directory containing all knowledge files — it is the unit of distribution. A single bundle can be shared in several forms:
- As a git repository (recommended — you get history, diffs, and code review for free)
- As a tarball / zip
- As a subdirectory inside a larger repository
In this project the bundle is the wiki/ folder.
Concept (unit of knowledge)
A concept is one unit of knowledge, represented as a single Markdown file. It may describe:
- Tangible things — a database table, an API endpoint
- Abstract things — a business metric, a playbook, a process
The golden rule: one file = one concept. Do not pack three topics into one file.
A concept has two parts:
---
type: Metric ← (1) Frontmatter — YAML block
title: ...
---
← (2) Body — free-form Markdown
# Definition
...
Concept ID (concept identifier)
Concept ID = the file's path inside the bundle with .md removed.
File (inside wiki/) | Concept ID |
|---|---|
tables/orders.md | tables/orders |
references/metrics/wau.md | references/metrics/wau |
playbooks/incident.md | playbooks/incident |
Key point: identity is tied to the file path, so…
⚠️ Renaming or moving a file = changing the Concept ID = breaking all inbound links. Choose stable filenames from the start. Use lowercase with hyphens (kebab-case), e.g.
weekly-active-users.md.
The advantage of using the path as identity: you get sovereign identity with no central registry — just look at the file's address to know which concept it is.
Tree structure, graph relationships
Folders provide a tree structure (parent-child), but concepts can link to each other freely via Markdown links, forming a graph that is richer than the folder hierarchy (see the Linking chapter).
wiki/
├── tables/orders.md ─────┐ (link "joins-with")
├── tables/customers.md ◄─┘
└── references/metrics/wau.md ──► tables/orders.md ("derived-from")
Next, examine the file header in detail → Frontmatter
Frontmatter (Metadata)
Frontmatter is a YAML block at the top of a file, delimited by ---, that stores metadata intended for search, filtering, and indexing.
---
type: BigQuery Table # required
title: Orders # recommended
description: หนึ่งแถวต่อหนึ่งคำสั่งซื้อ # recommended
resource: https://... # recommended (if a real resource exists to point to)
tags: [sales, orders] # recommended
timestamp: 2026-06-15T00:00:00Z # recommended (ISO 8601)
---
Required Fields
| Field | Description |
|---|---|
type | The kind of concept — this is the only required field in v0.1. Consumers use this value for routing, filtering, and display. |
Recommended Fields (in order of importance)
| Field | Description |
|---|---|
title | A human-readable name. If omitted, consumers may infer it from the filename. |
description | A one-line summary — this is what an agent reads to decide whether to load the file. Write it to be precise. |
resource | URI of the real-world resource that the concept describes (may be omitted for abstract concepts). |
tags | A YAML list for cross-domain categorisation, e.g. [sales, revenue]. |
timestamp | The time of the most recent significant edit (ISO 8601 format). |
💡
descriptionmatters more than you think — write it precisely, e.g. "Number of unique customers with ≥1 order in the last 7 days" is better than "About WAU"
Extension Fields
Producers may add any fields they like, and consumers must tolerate unknown keys (they must not reject the file).
For example, add owner, sla, or pii: true as your domain requires.
Controlled Vocabulary for type
Although the spec does not enforce a taxonomy, type values should be used consistently — otherwise data aggregation becomes impossible.
This starter uses the following set:
BigQuery Table · BigQuery Dataset · Metric · Reference · Playbook · API Endpoint · Concept · Entity
Convention from Google's reference bundle: Synthesised or reference knowledge (joins, metric definitions, glossary) is typically placed under
references/withtype: Reference, while concrete assets live undertables//datasets/.
Pitfalls
- Do not over-stuff frontmatter — keep it semantic (entity, intent, definition); do not dump keywords, because noise degrades search quality.
- Use field names consistently — if one file uses
author_nameand another useswritten_by, aggregation breaks.
Body Below Frontmatter
The content section is free-form Markdown, but structured markup (headings, bullets, tables) is preferred over long prose paragraphs. Conventional headings (use when appropriate, in this order):
# Overview → # Schema → # Common query patterns (code: sql) → # Joins → # Examples → # Citations
Next, see how to connect concepts together → Linking as a knowledge graph
Linking as a Knowledge Graph
Concepts connect to one another via standard Markdown links, turning the directory into a graph of relationships that is richer than the parent–child structure of folders alone.
Use File-Relative Links Only
See the [customers](../tables/customers.md) table for the key used in the join.
⚠️ Do not begin a link with
/— this is an important and subtle rule.OKF spec §5.1 recommends absolute-style links (
/tables/customers.md), claiming they are stable when files move. However, Google's actual enrichment agent forbids them because/-prefixed links break GitHub rendering, and every real Google bundle uses relative links throughout.We follow the real implementation: relative links only (
okf-validate.pywill warn if it finds a/-prefixed link).
Links Are Untyped
A link from concept A → B simply "asserts that a relationship exists", but the type of that relationship (parent–child, reference, joins-with, depends-on) lives in the surrounding prose, not in the link itself.
Joins with [customers](../tables/customers.md) via `customer_id` (many orders → one customer)
└─ untyped link └─ relationship type is expressed in this sentence
Consumers that build graph views treat every link as a directed, untyped edge.
Broken Links Are Permitted
A link pointing to a concept that does not yet exist is not an error — it represents knowledge not yet written (a placeholder / gap to be filled later). Consumers must tolerate broken links.
Rules for Good Linking
- ✅ Use relative paths only (
../tables/orders.md,customers.md) - ✅ Link only to concepts that actually exist (or intentionally leave placeholders)
- ❌ Do not over-link — one link per mention per section is enough
- ❌ Do not link from headings, inside code blocks, or in schema field-name lists
- ❌ Do not link a file to itself
Example: A Small Graph
references/joins/orders__customers.md
├──► tables/orders.md
└──► tables/customers.md
metrics/weekly-active-users.md ──► tables/orders.md
playbooks/incident-response.md ──► metrics/weekly-active-users.md
When you open viz.html you will see this graph interactively, including "Cited by" (backlinks — which concepts link to this one).
Next, see the two special files that help navigate the graph → Reserved files: index.md and log.md
Reserved Files: index.md and log.md
OKF reserves two filenames that carry special meaning at every level of the directory tree, and must not be used as concept files.
| File | Purpose |
|---|---|
index.md | Directory table of contents (progressive disclosure) |
log.md | Change history log |
index.md — Table of Contents for Progressive Disclosure
Allows people and agents to see what is available before opening any real file — this is crucial for preventing
context-window overflow during exploration. An agent reads index.md first to find relevant concepts, then drills into individual files.
Format: no frontmatter (except at the root index — see below); content is headings and lists in the form
* [Name](link) - short description
# Subdirectories
* [tables](tables/index.md) - BigQuery tables and schema/join information
* [references](references/index.md) - Synthesised knowledge (joins, metric definitions)
# Concepts
* [tables/orders](tables/orders.md) - One row per order
Producers may auto-generate
index.md; consumers may build it on the fly when reading if it is absent.
okf_version at the Root Index
At the bundle root only, index.md may carry frontmatter — and only one field: okf_version, which declares
the spec version this bundle conforms to:
---
okf_version: "0.1"
---
# Concepts
...
log.md — Change History
Records changes within that scope as a list sorted newest first, grouped under ISO date headings.
# Directory Update Log
## 2026-06-15
* **Update**: เพิ่มตาราง [Customer Metrics](references/metrics/cm.md)
* **Creation**: สร้าง [Dataplex Playbook](playbooks/dataplex.md)
## 2026-06-12
* **Initialization**: สร้างโครงสร้างไดเรกทอรีพื้นฐาน
Rules:
- Date headings must follow the ISO format
YYYY-MM-DD(no brackets) - Bold prefixes (
**Update**,**Creation**,**Deprecation**) are convention, not mandatory
💡 Consistent prefixes make grepping easy, e.g.
grep '^- ' log.mdto view the latest entries.
Why These Two Files Matter
index.md= navigator that lets an agent explore the bundle without loading all of it into contextlog.md= readable audit trail that shows how the wiki has evolved and helps an agent understand what was recently changed
okf-validate.pychecks thatindex.mdhas no frontmatter (exceptokf_versionat the root) and warns if any date heading inlog.mdis not in ISO format.
That concludes the core concepts. Next, put it all into practice → Ingest: Adding Knowledge to the Wiki
Ingest: Adding Knowledge to the Wiki
Ingest is the process of taking raw source material and synthesizing it into concepts in the wiki. This is the heart of making the wiki "richer."
The Most Important Principle: Ingest Must Be Human-Supervised
🚫 Do not run ingest automatically in the background (as a background daemon)
This is the most critical anti-pattern — a daemon that swallows everything it sees will accumulate noise as fast as it accumulates signal, and the wiki will rot silently until nothing in it can be trusted.
Keep ingest as a deliberate human command — the human decision "is this source worth synthesizing?" is the quality gate that eliminates an entire class of failures.
Ingest Steps (Human-Supervised)
- Read the source in
raw/ - Read
wiki/index.mdto see what concepts already exist - Extract 5–15 key points (claims/decisions/insights) worth keeping
- Show the extracted points + their proposed mapping to concepts for your approval — then wait before writing anything
- Reconcile contradictions — if new information conflicts with an existing concept, add a flag to the old file:
> **CONTRADICTION FLAG**: New findings supersede this value. See references/metrics/new-wau.md - Write/update the concept (starting from the template), update
tags+timestamp - Update the relevant
index.md - Add an entry to
log.mdunder today's date - Run validate before considering the work done
Using via AI Agent
If the skill is installed, simply say:
ingest raw/q3-strategy.pdf into the wiki
The agent will follow the steps above — extracting key points, showing them for your approval first, then writing the concept,
and updating index.md and log.md.
Why Reconcile Contradictions Every Time
Suppose a paper says "Model Z is best" but the wiki has a page that says "Model X is best."
- Traditional RAG: both pages coexist; the agent may retrieve the old page and answer incorrectly with confidence.
- OKF (agentic ingest): every time knowledge is added, it checks the surroundings for conflicts, supersessions, and confirmations, then writes those relationships explicitly — the old page gets flagged "superseded, see B" and the new page gets context "updated from A." Both pages are correct simultaneously — the wiki has a consistent "present tense."
Tips
- Ingest one source at a time and stay present to review it — read the summary, check the updates, guide the focus.
- A single source may touch 10–15 pages in the wiki (concepts + entities + indexes + log).
- Start ingest selectively — no need to pour everything in at once. The cold-start problem is smaller than you think, because the topics you care about most tend to be covered first.
Next: once you have knowledge in the wiki, how do you query and search it? → Query and Search
Query and Search
There are two ways to retrieve knowledge from the wiki, depending on its size.
Method 1 — Query via Index (Small Wiki)
For small wikis (up to ~150 pages), reading index.md is sufficient:
- Read
wiki/index.mdfirst to find relevant concepts. - Drill into those specific concept files.
- Answer only from the concepts loaded and always cite the Concept ID.
- If the coverage is incomplete, say so directly and offer to ingest additional sources.
Via agent: "What does the wiki say about WAU?" → agent reads the index → opens the concept → answers with citations.
Method 2 — Search with an Index (Large Wiki)
Beyond ~150 pages, scanning the index becomes slow. Build a search index instead:
# Build a BM25 index (once / rebuild when content changes)
python3 tools/okf-index.py build ./wiki
# Search
python3 tools/okf-search.py "how is WAU defined" --bundle ./wiki -k 8
Results (ranked by BM25):
mode: bm25-only (no embeddings — run okf-embed.py build)
2.675 metrics/weekly-active-users [Metric] Number of unique customers with orders...
2.463 tables/customers [BigQuery Table] ...
What is BM25: A keyword-based retrieval algorithm that scores relevance. Excellent for matching exact terms and codes (e.g. policy codes, column names). The tool is pure Python with no additional dependencies.
Hybrid Search (BM25 + Semantic)
If keyword search alone doesn't capture enough meaning, add a semantic layer using a local embedding model (Ollama):
ollama pull nomic-embed-text # once (on-prem)
python3 tools/okf-embed.py build ./wiki # build embeddings → wiki/.okf-embed.json
python3 tools/okf-search.py "active customers" --bundle ./wiki
# → mode: hybrid (bm25+semantic, RRF)
okf-search.py will fuse BM25 + semantic results using Reciprocal Rank Fusion (RRF) automatically.
Always safe: If embeddings haven't been built yet, or Ollama isn't running, search will fall back to BM25 automatically and report the mode — semantic is a pure upgrade, not a hard dependency.
(Architecture details for hybrid search are in the Search at Scale and Semantic Search chapter.)
Comparison
| Situation | Use |
|---|---|
| Small wiki (< ~150 pages) | Read index.md directly |
| Large wiki, matching keywords/codes | okf-search.py (BM25) |
| Need to match meaning/synonyms | Hybrid (BM25 + semantic) |
Next: how to add and edit concepts → Adding and Editing Concepts
Adding and Editing Concepts
Start from the Template
Use tools/concept-template.md as your starting point:
---
type: Concept # required — choose from the controlled vocabulary in AGENTS.md
title: <human-readable name>
description: <one-line summary — used to decide whether an agent will load this file>
resource: <URI of the real-world thing, or delete this line if purely abstract>
tags: [<tag1>, <tag2>]
timestamp: 2026-06-15T00:00:00Z # ISO 8601 UTC
---
# <Title>
Use structure: headings, short bullets, tables — rather than long paragraphs.
# Related
Link to other concepts using relative paths, e.g. [orders](../tables/orders.md).
The relationship type belongs in this sentence, not in the link text itself.
# Citations
Cite the sources in raw/ that were used to synthesize this concept.
Steps to Add / Edit a Concept
- Copy the template → name the file (kebab-case, stable).
- Set
typecorrectly (from the controlled vocabulary). - Write the body using structure + relative links.
- Update
tags+timestamp. - Update the
index.mdof that directory. - Add an entry to
wiki/log.md. - Run
okf-validate.py.
Conventional Section Order
Use when appropriate, in this order (derived from Google's enrichment prompt):
# Overview ← 1–3 paragraph intro: what it is, how it's used
# Schema ← column/field summary (nested RECORDs as sub-sections/tables)
# Common query patterns ← 1–3 SQL snippets (```sql code blocks)
# Joins ← which concepts this joins to, via which keys
# Citations ← references (format: [1] [Title](url))
Example Reference Concept (join)
File wiki/references/joins/orders__customers.md:
---
type: Reference
title: Orders → Customers join
description: How to join the orders table to customers via customer_id
tags: [join, sales]
timestamp: 2026-06-15T00:00:00Z
---
Join [orders](../../tables/orders.md) with [customers](../../tables/customers.md)
via `customer_id` (many orders → one customer).
# Common query patterns
```sql
SELECT c.email, COUNT(*) AS orders, SUM(o.total) AS ltv
FROM orders o JOIN customers c USING (customer_id)
GROUP BY c.email;
```
After Editing — Definition of Done
-
Concept has a non-empty
typeand a sharpdescription -
timestampupdated to current -
Relevant
index.mdupdated -
Entry added to
log.mdunder today's date -
python3 tools/okf-validate.py ./wikipasses
Next: validate and visualize the wiki → Validate and Visualize
Validate and Visualize
Validate — Check Conformance
Run this after every edit:
python3 tools/okf-validate.py ./wiki
# → ✓ CONFORMANT with OKF v0.1 (0 warning(s), 0 info)
Conformance Criteria (OKF v0.1)
A bundle passes when:
- Every
.mdfile that is not a reserved file has parseable YAML frontmatter - Every frontmatter has a non-empty
typefield - Reserved files (
index.md,log.md) that exist follow the defined structure
Result Levels
| Level | Meaning | Example |
|---|---|---|
| ✗ error | Not conformant (must fix) | Missing frontmatter / missing type / index.md has disallowed frontmatter |
| ! warn | Passes, but should fix | Link starts with / (breaks GitHub) / log heading is not ISO |
| · info | Not a problem | Broken link (spec §5.3 permits this) |
Consumers must not reject a bundle because of: missing optional fields, unknown
type, extra keys, broken links, or a missingindex.md— this is "permissive consumption," which keeps OKF usable even as bundles grow or get refactored.
Visualize — View the Knowledge Graph
python3 tools/okf-viz.py ./wiki --name "My Wiki"
# → wiki/viz.html (single file, open in a browser)
viz.html is a single self-contained HTML file — it embeds the library (Cytoscape + marked) and the bundle data
directly inside, fetching nothing from the network when opened. Suitable for air-gapped environments, file sharing, or committing alongside the bundle.
What the Viewer Shows
- Force-directed graph of all concepts, colored by
type, with edges drawn from links in the content - Detail panel for the selected concept: frontmatter + rendered body
- "Cited by" — backlinks (which other concepts link to this one)
- Search box (matches title/id/tags), type filter, and toggleable layouts
By default it embeds libraries from
tools/vendor/, enabling true air-gap use. To use a CDN instead, pass--cdn.
Try the Live Demo
Below is the viz.html of the example wiki included in this project (click a node to see its details; try searching and filtering by type):
Make It a Habit
Combine both commands after every editing session:
python3 tools/okf-validate.py ./wiki && python3 tools/okf-viz.py ./wiki
At an organizational level, CI runs validate on every PR and regenerates the viz automatically (see Part 6).
That wraps up the day-to-day usage section. Next, see guidelines for writing well → Authoring Guidelines and Anti-Patterns
Worked Example: a Bookstore KB
This chapter walks the whole path from zero to searchable — a small knowledge base for an online bookstore, with real commands. Use it as a template for your own domain.
1. Scaffold (init)
python3 tools/okf-init.py ./bookstore-kb
cd bookstore-kb
You get AGENTS.md + wiki/{index.md, log.md, getting-started.md} + raw/.
2. Add knowledge (simulated ingest)
Create concepts in the canonical layout — two tables, a metric, a join, a playbook. Example
wiki/tables/books.md:
---
type: BigQuery Table
title: Books
description: One row per book in the catalog.
tags: [catalog, books]
timestamp: 2026-06-16T00:00:00Z
---
# Schema
| Column | Type | Description |
| :--- | :--- | :--- |
| book_id | STRING | Book id (PK) |
| author_id | STRING | FK to [authors](authors.md) |
| price | NUMERIC | Price (USD) |
| stock | INT64 | Units in stock |
# Joins
Join [authors](authors.md) on `author_id` — see [Books → Authors](../references/joins/books__authors.md).
Low stock → [restock runbook](../playbooks/low-stock-runbook.md)
Do the same for authors, references/metrics/monthly-revenue, references/joins/books__authors,
and playbooks/low-stock-runbook, then update wiki/index.md + wiki/log.md.
In real use: drop a source into
raw/and have the agent run the supervised INGEST flow (Ingest chapter) — it extracts claims, shows them for approval, then writes the concepts and updates index/log for you.
The resulting KB is this graph:
flowchart LR D["datasets/bookstore"] --> B["tables/books"] D --> A["tables/authors"] B -->|author_id| A J["references/joins/books__authors"] --> B J --> A M["references/metrics/monthly-revenue"] --> D P["playbooks/low-stock-runbook"] --> B
3. Validate conformance
python3 tools/okf-validate.py ./wiki
# → ✓ CONFORMANT with OKF v0.1 (0 warning(s), 0 info)
4. Search
python3 tools/okf-index.py build ./wiki # build the BM25 index
python3 tools/okf-search.py "how to join books to authors" --bundle ./wiki
# → references/joins/books__authors [Reference] ranked #1
Add semantic (optional, on-prem):
ollama pull nomic-embed-text
python3 tools/okf-embed.py build ./wiki
python3 tools/okf-search.py "who wrote the novels" --bundle ./wiki
# → mode: hybrid (bm25+semantic, RRF) — finds authors even with no keyword overlap
5. Visualize
python3 tools/okf-viz.py ./wiki --name "Bookstore KB"
open wiki/viz.html # one file, opens offline
6. (Team) Share via MCP
To let the whole team/agents use it: commit wiki/ to an internal git server and bring up the
MCP server pointed at this bundle — agents can then okf_search (hybrid)
and propose changes via PR/lease.
Summary
okf-init → (ingest concepts) → okf-validate → okf-index/okf-search → okf-viz → [MCP share]
That's the full loop! Next, see Authoring Well to grow the KB with quality.
Writing Guidelines and Anti-patterns
These guidelines layer on top of the spec (which only enforces type) to make the wiki reliable for both humans and agents.
Golden Rules
- One concept per file — do not cram three topics into one file
- Write
descriptionfor the agent — this is the single line an agent reads to decide whether to load the file; make it specific - Structure beats paragraphs — headings, bullets (
**key** — value), tables; models extract information from structured Markdown more accurately - identity = path — use stable kebab-case filenames; renaming breaks inbound links
- Consistent
type— use a controlled vocabulary - Always cite sources — every synthesized claim should be traceable to a file in
raw/; add them under# Citations
Anti-pattern Table
| Anti-pattern | Why it is bad | Do this instead |
|---|---|---|
| Automated background ingest | Accumulates noise as fast as signal; the wiki decays silently | Make ingest an explicit human-triggered command with review |
Dumping raw PDFs into wiki/ | Unreliable retrieval; breaks synthesis | Synthesize into a concept Markdown file; keep the raw file in raw/ |
| Over-stuffed frontmatter | Noise reduces search precision | Keep tags semantic and minimal |
| Skipping heading levels (H1→H3) | Breaks document structure for the model | Maintain H1→H2→H3 order |
| Paragraphs interspersed in lists | The list splits into fragments in the parser's view | Use nested paragraphs, or close the list first |
Inconsistent type/field names | Tools cannot aggregate the data | Use a controlled vocabulary |
| Vague anchor text ("click here") | No topic signal for the LLM | Use descriptive link text |
Absolute paths starting with / | Breaks GitHub rendering | Use relative paths |
| Sacrificing readability for machines | The wiki must serve humans too | Structure for machines; clarity for humans |
Core Spec vs. Community Best Practices
Keep these separate:
- OKF v0.1 spec (very small): enforces only
type+ index/log rules + conformance rules - Best practices (in this book): mostly from the LLM-wiki community and the Google reference implementation
— e.g., confidence decay, hybrid search, grouping
references/— these are supplementary patterns, not requirements
Points Where Sources Disagree
- "RAG is dead" — most engineers say wiki = Layer 1, RAG = Layer 2 fallback; it is not either/or
- Absolute vs. relative links — the spec recommends absolute, but implementations use relative (we follow the implementation)
- Four-dimension freshness scoring — a guideline promoted by a vendor (Atlan), not part of the spec
Next, move to the enterprise level → Architecture Overview
Architecture Overview (Enterprise)
When a knowledge base must be shared across sessions and teams at the enterprise level on-prem (internal network, works even air-gapped), the core idea fits in one sentence:
Git is the source of truth · the internal MCP server is the central access layer
Every session and every agent — regardless of framework or model — connects to the same single internal MCP endpoint. Git history is "cross-session memory" by nature · reads are instant · writes go through a gate (PR or lease)
Diagram
flowchart TB
subgraph net["Internal network (on-prem / air-gap)"]
MCP["OKF MCP server
search · get · propose / commit"]
GIT["Git server (Gitea/GitLab)
= OKF repo · source of truth"]
IDX["Search index
BM25 (+ semantic)"]
CI["CI runner
okf-validate + viz"]
MCP -->|"pull / webhook"| GIT
MCP --> IDX
GIT -->|"PR/MR triggers"| CI
end
A1["Agent team A"] -->|"MCP: HTTP/SSE + token/mTLS"| MCP
A2["Agent team B"] --> MCP
CR["CI / cron"] --> MCP
Components
| # | Component | Self-host option | Role |
|---|---|---|---|
| 1 | Git server | Gitea / GitLab CE | Versioned source of truth (one repo per team, or monorepo + CODEOWNERS) |
| 2 | OKF MCP server | server/okf_mcp_server.py | Access layer that every agent connects to — read/search/propose; stdio + HTTP/SSE |
| 3 | Search index | tools/okf-index.py (+ embed) | Fast search once the wiki grows beyond ~150 pages |
| 4 | CI gate | Gitea Actions / GitLab CI | Blocks non-conformant merges + regenerates viz |
| 5 | Reverse proxy | Caddy / Traefik / nginx | TLS + auth (token / OIDC / mTLS) in front of MCP |
Read Path (normal case — fast, no lock)
- Agent calls
okf_search("how is WAU defined")→ receives ranked Concept IDs okf_get_concept("metrics/weekly-active-users")→ frontmatter + body loaded into contextokf_read_index()when progressive disclosure exploration is needed
Concurrent reads are unlimited with no contention.
Write Path
Two models to choose from (next chapter):
- PR-gated (default): write via branch + PR → CI checks → human/curator merges — safe, with audit/review
- Lease/lock: for heavy-write teams — per-concept lease prevents collision, writes directly to a shared branch
Meaning of "Cross-session / Cross-team"
- Cross-session: no per-session state — a new session
git pulls everything a previous session wrote; the wiki compounds automatically - Cross-team: monorepo +
CODEOWNERSper subtree or federated bundles (one repo per domain, MCP server mounts multiple bundles and namespaces them by bundle name, e.g.sales:tables/orders)
Next: deploy it → Self-host Setup
Self-host Setup
deploy/docker-compose.yml brings up the full internal stack on a single VM — no external exposure, works even air-gapped.
Stack Overview
gitea → source of truth (git server)
okf-mcp → access layer that agents connect to (built from deploy/Dockerfile)
proxy → Caddy: TLS + auth (token/OIDC/mTLS) in front of MCP
Steps
cd deploy
cp .env.example .env # set OKF_GIT_REMOTE, token, etc.
docker compose up -d
Required .env values:
OKF_GIT_REMOTE=http://gitea:3000/okf/knowledge.git # repo that MCP will clone/pull
OKF_GITEA_API=http://gitea:3000/api/v1 # for opening PRs automatically from propose_change
OKF_GITEA_TOKEN=<token>
OKF_GITEA_OWNER=okf
OKF_GITEA_REPO=knowledge
OKF_TOKEN=<long-random> # token agents must send to the proxy
OKF_READONLY=0 # 1 = read-only replica
Connecting an Agent to the MCP Endpoint
In Claude Code (or any MCP client), point to the internal URL:
{ "mcpServers": { "okf": {
"transport": "http",
"url": "https://okf.internal.example/mcp",
"headers": { "Authorization": "Bearer ${OKF_TOKEN}" }
} } }
Local / Dev Mode (no network required)
Test the server without standing up the full stack:
python3 server/okf_mcp_server.py # stdio transport
# or HTTP:
OKF_MCP_TRANSPORT=streamable-http OKF_MCP_PORT=8765 \
python3 server/okf_mcp_server.py
Tools the Server Exposes to Agents
| Tool | What it does |
|---|---|
okf_search(query, k, type) | Hybrid search (BM25 + semantic if available) |
okf_get_concept(id) | Returns the frontmatter + body of a concept |
okf_list_concepts(prefix) | Lists id/type/description |
okf_read_index(path) | Reads index.md (progressive disclosure) |
okf_propose_change(...) | Writes via branch + PR (PR mode) |
okf_acquire_lease / renew / release / list_leases | Acquires write rights (lease mode) |
okf_commit_concept(..., token) | Writes directly while holding a lease (lease mode only) |
CI Conformance Gate
Set branch protection on main to require this check — non-conformant bundles cannot be merged.
- Gitea Actions:
.gitea/workflows/conformance.yml - GitLab CI:
ci/.gitlab-ci.yml
Both run okf-validate.py (blocks on failure) + rebuild the index + regenerate viz.html.
Basic Security
- The MCP endpoint is behind a reverse proxy that handles TLS + auth (details in the Security and Governance chapter)
raw/is excluded from the bundle (.gitignoreprevents accidentally committing private data)- Air-gap:
viz.htmlbundles its libraries inline; semantic search uses on-premises Ollama — nothing leaves the network
Next: choose a write model → Write Models: PR-gated and Lease
Write Models: PR-gated and Lease
When multiple agents write knowledge concurrently, you must choose how to manage concurrency. The starter supports two models
(plus a third as a concept). Select with the OKF_WRITE_MODE env var.
Model 1 — PR-gated (default)
okf_propose_change(...) does not write to main — instead it:
- Creates branch
okf/<concept>-<id>frommain - Writes the file, commits, and pushes to the internal git server
- Opens a PR/MR via the git server API and returns the URL
- CI runs
okf-validate.py(+ regenerates viz) → a human/curator reviews and merges
flowchart LR AG["Agent"] -->|propose_change| BR["branch + commit"] BR --> PR["Pull Request"] PR --> CIv["CI: okf-validate"] CIv --> RV["review (human/curator)"] RV -->|merge| MAIN["main"] MAIN -->|webhook| RE["MCP pull + reindex"]
Full enterprise properties: audit trail (git log), review/diff, rollback (git revert), no write conflicts (merges applied one at a time in order), quality gate (CI + review)
Once merged to main → webhook → MCP server pull → reindex → all sessions see the update
Suitable when reviewing every change matters more than speed
Model 2 — Lease/lock (direct write)
Enable with OKF_WRITE_MODE=lease for heavy-write teams. Uses a lease (TTL-based write reservation) to prevent two agents
from editing the same concept simultaneously.
# per-agent flow:
okf_acquire_lease("tables/orders", ttl_seconds=300) # → {token, expires_at}
# ... make edits ...
okf_commit_concept("tables/orders", frontmatter, body, token=...) # server checks lease then write+commit+push
okf_release_lease("tables/orders", token=...) # release when done
flowchart LR ACQ["acquire_lease
(TTL)"] --> ED["edit concept"] ED --> CM["commit_concept
(verify token)"] CM --> PUSH["pull --rebase + push"] PUSH --> REL["release_lease"] OT["another agent requests the same concept"] -.->|locked| ACQ
Another agent requesting the same concept receives {error:"locked", held_by} → it works on something else instead.
Concurrency safety: the lease prevents duplicate concepts + git pull --rebase before push handles
commits to different files (auto-merge).
Lease properties:
- Advisory + self-expiring TTL — a crashed agent does not leave a concept permanently locked (an expired lease can be "stolen")
- Token-verified — only the lease holder can commit/renew/release
- Single-authority — one MCP server issues all leases using atomic file creation (
O_CREAT|O_EXCL); git is serialized with a lock
CLI for ops: python3 tools/okf-lease.py list · ... break <concept> (admin force-release)
Suitable when write throughput matters more than reviewing every change
Model 3 — Append-only proposals + curator (concept)
Agents drop proposals into inbox/; a single curator agent collects them, resolves conflicts, then merges into the wiki
— the highest quality gate, consistent with the supervised-ingest philosophy (not yet implemented in the starter)
Comparison
| Model | Speed | Review | Conflict | Best for |
|---|---|---|---|---|
| PR-gated | Moderate | Every change | None (sequential merge) | Governance, general teams |
| Lease/lock | Fast | Lightweight | Prevented by lease | Heavy-write teams |
| Curator | Slow | Maximum | Curator handles it | Quality above all else |
💡 Both modes can run against the same repo — a heavy-write team uses the server in lease mode; everyone else uses PR-gated, all pointing to the same git remote
Next: search as the wiki scales → Search at Scale and Semantic Search
Search at Scale and Semantic Search
As the wiki grows, scanning a flat index.md becomes slow. This chapter covers search that scales, entirely on-prem.
By Wiki Size
| Size | What to use |
|---|---|
| < ~150 concepts | index.md progressive disclosure is sufficient |
| > ~150 | Build a BM25 index (okf-index.py build) — okf_search uses it automatically |
| Recall is insufficient | Add a semantic + RRF layer (below) |
| Very large / multi-team | Federate bundles, one MCP server per domain, cache index in memory and rebuild via webhook |
Semantic Layer (on-prem)
Embed concepts using a self-hosted embedding model (via Ollama) — nothing leaves the network:
ollama pull nomic-embed-text # one-time
python3 tools/okf-embed.py build ./wiki # → wiki/.okf-embed.json
python3 tools/okf-search.py "active customers" --bundle ./wiki
# → mode: hybrid (bm25+semantic, RRF)
Reciprocal Rank Fusion (RRF)
okf-search.py combines BM25 (lexical) and semantic (vector) results using RRF:
fused(doc) = Σ_signals 1 / (RRF_K + rank) (RRF_K = 60)
Each signal ranks its own set of docs, then the reciprocal of each rank is summed — docs that rank well across multiple signals rise to the top, without any single signal dominating.
Why two signals:
- BM25 excels at exact keyword/code matching (policy codes, column names)
- semantic excels at meaning/synonym matching ("active customers" ↔ "weekly active users")
Automatic Fallback (important)
If embeddings have not been built or Ollama is not running → search automatically falls back to BM25 and reports the mode so you know. Semantic search is therefore a purely opt-in upgrade with no hard dependency.
This makes the system resilient: machines without Ollama can still search using BM25.
Scaling Further
- Federate bundles — separate repo per domain/team
- One MCP server per domain behind a single gateway
- Cache index in memory and rebuild incrementally when a webhook signals a merge
- Indexes/embeddings are generatable artifacts (built by CI / MCP server on demand) — no need to commit
(
.okf-index.json,.okf-embed.jsonare in.gitignore)
Next: security and governance → Security and Governance
Security and Governance
A self-hosted internal system must control who can read/write what, and must be auditable after the fact. OKF gets much of its governance for free from git.
Two-Layer Access Control
Git Layer
- Org/team membership on Gitea/GitLab
- Branch protection on
main— enforces passing CI + review before merge
MCP Layer
Place the HTTP transport behind a reverse proxy with:
- mTLS (service-to-service) or OIDC/SSO (for humans) + a bearer token per agent identity
- Map identity → role:
| role | allowed |
|---|---|
reader | search / get / list only |
proposer | + okf_propose_change (branch/PR only, cannot touch main) |
curator | may merge (via git server, not via MCP) |
Set OKF_READONLY=1 to run an MCP replica in read-only mode.
Caddy (proxy) Example — Bearer Token
okf.internal.example {
# internal CA / self-signed for air-gap:
# tls /etc/caddy/okf.crt /etc/caddy/okf.key
# or mTLS:
# tls { client_auth { mode require_and_verify trusted_ca_cert_file /etc/caddy/internal-ca.crt } }
@noauth not header Authorization "Bearer {$OKF_TOKEN}"
respond @noauth "Unauthorized" 401
reverse_proxy okf-mcp:8765
}
For SSO, replace the bearer check with forward_auth to an OIDC proxy (e.g. oauth2-proxy) and map identity → role.
Secrets and PII
raw/is not included in the bundle and is already in.gitignore— prevents accidentally pushing private or large data- Do not put credentials in concepts — concepts are knowledge, not a secrets store
- PR review helps catch sensitive data before it reaches
main(another reason PR-gated writes are good in organizations)
Air-Gap (Closed Networks)
viz.htmlembeds its libraries (Cytoscape + marked) inline — it does not fetch from a CDN at render time- Semantic search uses a self-hosted embedding model (Ollama) — nothing leaves the network
- All tools are Python stdlib only (except the MCP server, which uses the
mcppackage — installable from an internal mirror)
Audit & Rollback (from git)
| Task | Command |
|---|---|
| Who changed WAU? | git log --follow wiki/metrics/weekly-active-users.md |
| Revert incorrect knowledge | git revert <sha> → PR → merge |
| Change timeline | Read wiki/log.md or git log |
Governance for Free
Because the wiki is files in git, every change has a diff, blame, review, history, and rollback — just like normal software development. Knowledge maintenance becomes an engineering workflow the team already knows.
End of the organizational section. See the appendix for reference → Tool Reference (CLI)
Tool Reference (CLI)
All tools are pure Python stdlib, located in tools/ (or scripts/ if installed as a skill).
okf-init.py — Create a New Bundle
python3 tools/okf-init.py [target_dir] [--force] [--date YYYY-MM-DD]
Creates AGENTS.md + wiki/{index.md, log.md, getting-started.md} + raw/ · Does not overwrite a non-empty wiki/ unless --force
okf-validate.py — Check Conformance
python3 tools/okf-validate.py [wiki_dir]
Exits 0 if conformant, 1 if not · error = missing frontmatter/type, index.md rule violation · warn = /-prefixed links, non-ISO log dates · info = broken link
okf-viz.py — Generate Graph Viewer
python3 tools/okf-viz.py [bundle] [-o out.html] [--name "Name"] [--cdn]
Produces a single self-contained viz.html (embeds Cytoscape + marked) · --cdn = load libraries from CDN instead of embedding
okf-index.py — BM25 Search Index
python3 tools/okf-index.py build [bundle] [-o index.json]
python3 tools/okf-index.py query "question" [-k 8] [--type Metric]
okf-embed.py — Embeddings (Ollama)
python3 tools/okf-embed.py build [bundle]
python3 tools/okf-embed.py query "question" [-k 8]
env: OKF_OLLAMA_URL (default http://localhost:11434), OKF_EMBED_MODEL (default nomic-embed-text)
okf-search.py — Hybrid Search (BM25 + semantic, RRF)
python3 tools/okf-search.py "question" [--bundle ./wiki] [-k 8] [--type ...] [--bm25-only]
Automatically falls back to BM25 if embeddings are absent or Ollama is not running.
okf-lease.py — Lease/Lock Concurrency
python3 tools/okf-lease.py acquire <concept> --owner <id> [--ttl 300]
python3 tools/okf-lease.py renew <concept> --owner <id> --token <tok> [--ttl 300]
python3 tools/okf-lease.py release <concept> --owner <id> --token <tok>
python3 tools/okf-lease.py list
python3 tools/okf-lease.py break <concept> # admin force-release
env: OKF_LEASE_DIR (lease storage location), OKF_LEASE_TTL (default 300)
okf-selftest.sh — exercise the whole toolchain
bash tools/okf-selftest.sh
Runs 10 end-to-end checks (init → validate both a clean bundle and a deliberately-broken one → index → search → air-gap viz → lease → embed/hybrid if Ollama is up) · exits non-zero on any failure · suitable for CI.
install.sh — Install Skill
./install.sh # global → ~/.claude/skills/okf
./install.sh --project # project → ./.claude/skills/okf
./install.sh --dir <path> # custom
./install.sh --uninstall
server/okf_mcp_server.py — MCP Access Layer
python3 server/okf_mcp_server.py # stdio
OKF_MCP_TRANSPORT=streamable-http OKF_MCP_PORT=8765 python3 server/okf_mcp_server.py
Key env vars: OKF_REPO_DIR, OKF_BUNDLE, OKF_BASE_BRANCH, OKF_WRITE_MODE (pr|lease),
OKF_READONLY, OKF_GITEA_API/TOKEN/OWNER/REPO, OKF_AGENT_ID
Frequently Asked Questions (FAQ)
How is OKF different from Obsidian / Notion?
They are quite similar (Markdown + frontmatter + links), but OKF is specified — it defines the minimal rules necessary for interoperability (e.g. type is required, reserved files, conformance rules) without mandating a tool. You can open an OKF bundle in Obsidian/MkDocs/Hugo directly because it is just Markdown.
Do I need Google Cloud / BigQuery?
No. OKF is vendor-neutral. The reference examples use BigQuery but type can be anything. This starter is not tied to any cloud provider.
Do I need an AI agent?
No. You can write concepts by hand yourself (they are just Markdown). AI agents simply handle the heavy lifting — summarizing, cross-referencing, filing, and bookkeeping.
Should links be relative or absolute?
Relative only. Do not prefix with / because it breaks GitHub rendering (see Linking). Although the spec suggests absolute links, the actual Google implementation uses relative ones.
Is a broken link an error?
No — it represents "knowledge not yet written." okf-validate.py reports it as info, not an error.
Can I fully automate ingest?
Strongly not recommended — a background daemon will accumulate noise until the wiki rots. Ingest should be a human-triggered command with a review step.
How large does the wiki need to be before search is necessary?
Around ~150 pages. Before that, index.md is sufficient. Beyond that, use okf-search.py (BM25) and add semantic search when recall is insufficient.
Do I need Ollama to search?
No. BM25 works without Ollama. Semantic search is an optional upgrade. If Ollama is not running, search automatically falls back to BM25.
What do I do when multiple agents write concurrently?
Choose a write model: PR-gated (safe, with review) or lease/lock (faster, for write-heavy teams). See Write Models.
Does it work air-gapped (closed network)?
Yes — viz.html embeds its libraries, tools use stdlib only, semantic search uses on-prem Ollama, and git/MCP run internally.
Do .okf-index.json / .okf-embed.json need to be committed?
No — they are generatable artifacts and are already in .gitignore. CI/MCP server regenerates them on demand.
validate shows "missing type" — how do I fix it?
Every concept must have a non-empty type in its frontmatter. Add type: ... (chosen from the controlled vocabulary).
How do I migrate an existing wiki (Obsidian, etc.) into OKF?
In most cases: add a type field to each frontmatter, add index.md/log.md, convert links to relative paths, then run validate.
How do I verify the whole toolchain works?
Run bash tools/okf-selftest.sh — 10 end-to-end checks (init/validate/index/search/viz/lease + embed/hybrid if Ollama is up); exits non-zero on any failure. See the Worked Example for a hands-on full-loop walkthrough.
Glossary
| Term | Definition |
|---|---|
| OKF (Open Knowledge Format) | An open specification for storing knowledge as a directory of Markdown files + YAML frontmatter (v0.1, Google Cloud, 2026) |
| Bundle | A directory containing all knowledge files — the unit of distribution (in this project: wiki/) |
| Concept | One unit of knowledge = one .md file |
| Concept ID | The file's path within the bundle with .md stripped, e.g. tables/orders.md → tables/orders |
| Frontmatter | A YAML block at the top of a file, delimited by ---, storing metadata |
| Body | The Markdown content below the frontmatter |
| Link | A Markdown link between concepts = a relationship (untyped) |
| Citation | A link from a concept to an external source that supports a claim in the content |
| Reserved file | Reserved files: index.md (table of contents), log.md (change log) |
| Progressive disclosure | Showing the table of contents before opening the actual file — reduces context window overflow |
| Conformance | A bundle following the v0.1 rules (parseable frontmatter + type present + reserved files correctly structured) |
| type | The only required frontmatter field — specifies the kind of concept |
| Reference | A type used for synthesized knowledge (joins, metric definitions), typically under references/ |
| RAG | Retrieval-Augmented Generation — fetching raw document chunks at query time and stuffing them into context |
| LLM-wiki pattern | A concept (Karpathy) where AI synthesizes knowledge into a continuously maintained Markdown wiki, instead of re-fetching raw sources every time |
| Ingest | The process of taking raw sources and synthesizing them into wiki concepts (should be human-supervised) |
| Contradiction flag | A > **CONTRADICTION FLAG**: ... marker added when new information conflicts with existing content |
| BM25 | A keyword-based ranking algorithm that scores document relevance |
| Embedding | A vector representing the semantic meaning of text (generated by a model, e.g. via Ollama) |
| Semantic search | Search by semantic similarity (cosine distance between embeddings) |
| Hybrid search | Combining BM25 + semantic search |
| RRF (Reciprocal Rank Fusion) | A method for merging ranked results from multiple signals: Σ 1/(k + rank) |
| MCP (Model Context Protocol) | An open standard for AI agents to connect to external tools and data |
| MCP server | A service that exposes tools (search/get/propose) wrapping a bundle for agent connections |
| PR-gated | A write model using branch + Pull Request + CI + review |
| Lease/lock | A write model that reserves exclusive rights to a concept with a TTL, preventing write conflicts |
| Lease | A TTL-based reservation that expires automatically, verified by a token |
| Curator | (Model 3) A single agent that aggregates proposals and merges them into the wiki |
| CODEOWNERS | A file defining owners per subtree (used to split responsibility across teams in a monorepo) |
| Federated bundles | Multiple repos/bundles per domain; MCP servers mount multiple bundles, namespaced by bundle name |
| air-gap | An environment with no internet connection (closed network) |
| Gitea / GitLab CE | Self-hostable git servers used as the source of truth |
| AGENTS.md | A schema file that tells agents about the structure/rules/workflow (may be named CLAUDE.md/GEMINI.md) |
References
Compiled references for the History and Foundations chapters.
Numbers [1]–[6] correspond to the eras in the History chapter. Gathered through deep research (June 2026); primary sources are cited wherever possible.
[1] Era of Expert Systems & Knowledge Representation
- Shortliffe, E. H. (1976). Computer-Based Medical Consultations: MYCIN. Elsevier/North-Holland.
- Buchanan, B. G., & Shortliffe, E. H. (Eds.) (1984). Rule-Based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project. Addison-Wesley.
- Minsky, M. (1974). A Framework for Representing Knowledge. MIT AI Lab Memo 306.
- Feigenbaum, E. A., Buchanan, B. G., & Lederberg, J. (1971). On Generality and Problem Solving: A Case Study Using the DENDRAL Program. Machine Intelligence, 6.
- Lenat, D. B., & Guha, R. V. (1990). Building Large Knowledge-Based Systems: Representation and Inference in the Cyc Project. Addison-Wesley.
[2] Era of Ontologies & the Semantic Web
- Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The Semantic Web. Scientific American, 284(5), 34–43. https://www.lassila.org/publications/2001/SciAm.html
- Gruber, T. (1993). A Translation Approach to Portable Ontology Specifications. Knowledge Acquisition, 5(2).
- W3C (1999, rev. 2004, 2014). Resource Description Framework (RDF). https://www.w3.org/RDF/
- W3C (2004, rev. 2009). OWL Web Ontology Language Overview. https://www.w3.org/OWL/
- schema.org (2011). schema.org — shared structured-data vocabulary (Google, Bing, Yahoo). https://schema.org/
[3] Era of Databases & Information Retrieval
- Codd, E. F. (1970). A Relational Model of Data for Large Shared Data Banks. Communications of the ACM, 13(6), 377–387. https://dl.acm.org/doi/10.1145/362384.362685
- Spärck Jones, K. (1972). A Statistical Interpretation of Term Specificity and Its Application in Retrieval. Journal of Documentation, 28(1), 11–21.
- Robertson, S. E., Walker, S., Jones, S., Hancock-Beaulieu, M., & Gatford, M. (1994). Okapi at TREC-3. Proceedings of TREC-3.
- Apache Software Foundation. Apache Lucene (Doug Cutting, 1999). https://lucene.apache.org/
- Elastic (2010). Elasticsearch (Shay Banon). https://www.elastic.co/
[4] Era of Wikis & Personal Knowledge Management
- Cunningham, W. (1995). WikiWikiWeb — Portland Pattern Repository. https://en.wikipedia.org/wiki/WikiWikiWeb
- Wales, J., & Sanger, L. (2001). Wikipedia. Wikimedia Foundation. https://en.wikipedia.org/wiki/Wikipedia
- Luhmann, N. (1981). Kommunikation mit Zettelkästen. In Öffentliche Meinung und sozialer Wandel. https://zettelkasten.de/introduction/
- Obsidian.md (2020). About Obsidian. https://obsidian.md/about
[5] Era of AI: Embeddings, Vector Search & RAG
- Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781. https://arxiv.org/abs/1301.3781
- Singhal, A. (2012). Introducing the Knowledge Graph: things, not strings. Google Blog. https://blog.google/products/search/introducing-knowledge-graph-things-not/
- Johnson, J., Douze, M., & Jégou, H. (2017). Billion-scale Similarity Search with GPUs (FAISS). arXiv:1702.08734. https://arxiv.org/abs/1702.08734
- Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 (NAACL-HLT 2019). https://arxiv.org/abs/1810.04805
- Lewis, P., Perez, E., Piktus, A., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401 (NeurIPS 2020). https://arxiv.org/abs/2005.11401
- Cormack, G. V., Clarke, C. L. A., & Buettcher, S. (2009). Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. SIGIR 2009.
[6] Present & Future: LLM-Wiki, OKF & Agent Memory
- Karpathy, A. (2026). LLM Wiki [GitHub Gist]. https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f
- McVeety, S., & Hormati, A. (2026, June 12). How the Open Knowledge Format Can Improve Data Sharing. Google Cloud Blog. https://cloud.google.com/blog/products/data-analytics/how-the-open-knowledge-format-can-improve-data-sharing
- GoogleCloudPlatform (2026). Open Knowledge Format — Specification & Reference Implementations [GitHub]. https://github.com/GoogleCloudPlatform/knowledge-catalog/tree/main/okf
- Packer, C., Wooders, S., Lin, K., et al. (2023). MemGPT: Towards LLMs as Operating Systems. arXiv:2310.08560. https://arxiv.org/abs/2310.08560
Reliability Notes
- BM25 date — The Okapi at TREC-3 paper (1994) is the primary citation, but the formula had been developing since the late 1980s; "~1994" refers to the key publication, not the first appearance.
- OKF date — Most secondary sources cite 12 June 2026, while the Google Cloud blog page shows 13 June (likely a timezone difference) — 12 June is used as the canonical publication date.
- Karpathy's gist — Sources place it between 3–4 April 2026 (X post on the 3rd, gist following on the 4th).
- The historical content is synthesised from primary sources with encyclopaedias and standards documents used as context. Readers should verify against the originals when high accuracy is required.