---
title: "Data Infrastructure Capsules"
slug: "data-infrastructure-capsules"
summary: "How Capsules flatten relational, graph, document, and vector data into portable artifacts for small projects and model-cooperative workflows."
status: "draft"
version: "0.1"
updated: "2026-05-07"
audience:
  - "technical adopters"
  - "agent builders"
  - "small teams"
tags:
  - "capsule"
  - "data infrastructure"
  - "knowledge graphs"
  - "nosql"
  - "vector search"
canonical_path: "/research/data-infrastructure-capsules"
---

# Data Infrastructure Capsules

## Flattening the Small Data Stack for Model Cooperation

### Abstract

Small AI-enabled projects increasingly need several data shapes at once: a few relational tables, a handful of documents, a small graph, and vector records for semantic search. The conventional answer is to provision several services. That is correct at production scale, but it is often premature when the immediate need is model cooperation, review, handoff, and reproducibility.

A data infrastructure capsule packages database-shaped artifacts into one portable `.capsule` file. It can contain schema, sample records, graph triples, document records, vector embeddings, query examples, expected answers, and provenance. The capsule does not replace databases. It delays database commitment until the workflow is understood and gives humans and agents a shared, verifiable substrate before infrastructure is deployed.

This paper records the first public data infrastructure example lane: one combined capsule plus separate relational, graph, NoSQL/document, and vector capsules, all staged in the companion examples repository at [capsules-extra/capsule-examples/public/data-infra](https://github.com/virionai/capsules-extra/tree/main/capsule-examples/public/data-infra).

---

## 1. Problem

Many projects begin with a small but heterogeneous data need:

- customer and invoice tables
- tickets or nested documents
- relationships between actors, documents, policies, and actions
- semantically searchable text snippets

The technical stack implied by those needs might include Postgres, Neo4j, MongoDB, and a vector database. The project may eventually need all of that. But early in the work, the hardest problem is not scale. It is shared understanding.

The model needs enough context to cooperate. A human reviewer needs enough structure to trust the answer. Another agent needs enough provenance to continue the work. Provisioning infrastructure before the task is understood creates friction and makes the early artifact less portable.

---

## 2. Capsule pattern

A data infrastructure capsule carries four things together:

1. data-shaped files
2. query intent
3. expected answers or evaluation criteria
4. event-chain provenance

In the current example pack, the combined capsule contains:

- `data/relational/schema.sql`
- `data/relational/customers.json`
- `data/relational/invoices.json`
- `data/graph/triples.json`
- `data/nosql/tickets.json`
- `data/vector/documents.json`
- `queries/examples.json`
- `infra-map.json`

Each layer-specific capsule contains only one data shape. This lets a user experiment with one layer without carrying unrelated examples.
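
As a rough illustration, the file list above can be treated as a checklist. The sketch below assumes the combined capsule has already been unpacked into a plain directory; the `.capsule` container format itself is not described here, and `missing_files` is a hypothetical helper, not part of any Capsule tooling.

```python
from pathlib import Path

# Relative paths listed for the combined capsule in this section.
EXPECTED_FILES = [
    "data/relational/schema.sql",
    "data/relational/customers.json",
    "data/relational/invoices.json",
    "data/graph/triples.json",
    "data/nosql/tickets.json",
    "data/vector/documents.json",
    "queries/examples.json",
    "infra-map.json",
]


def missing_files(root: Path, expected: list[str] = EXPECTED_FILES) -> list[str]:
    """Return the expected relative paths that are absent under `root`.

    Assumption: `root` is a directory holding the unpacked capsule contents.
    """
    return [rel for rel in expected if not (root / rel).is_file()]
```

Calling `missing_files(Path("some/unpacked/capsule"))` returns an empty list when every listed file is present, which gives a reviewer or agent a cheap first check before any query is attempted.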

---

## 3. Why this works for small projects

The capsule is not pretending that JSON files are a production database. Instead, it creates a portable proving ground.

The same artifact can answer questions like:

- Which customer owns an overdue invoice?
- Which graph path connects a customer to a policy?
- Which support ticket has refund and billing tags?
- Which policy vector is closest to a query vector?

Because the expected answers are included, a human or agent can verify whether a model understood the data shape. Because the data and query examples are in the capsule, the task can move between machines, models, and users without provisioning a service.
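
The first three questions above can be sketched with nothing but in-memory records. Everything in this example is hypothetical: the record fields, identifiers, predicate names, and expected answers are invented for illustration and are not taken from the example capsules.

```python
from collections import deque

# Hypothetical sample records; field names are assumptions, not the capsule's schema.
customers = [
    {"id": "c1", "name": "Acme Co"},
    {"id": "c2", "name": "Birch Ltd"},
]
invoices = [
    {"id": "inv-1", "customer_id": "c1", "status": "paid"},
    {"id": "inv-2", "customer_id": "c2", "status": "overdue"},
]
tickets = [
    {"id": "t1", "tags": ["billing"]},
    {"id": "t2", "tags": ["refund", "billing", "urgent"]},
]
triples = [
    ("c2", "filed", "t2"),
    ("t2", "governed_by", "policy-refund"),
    ("c1", "filed", "t1"),
]


def overdue_customers(customers, invoices):
    """Join invoices to customers and keep the names with an overdue invoice."""
    names = {c["id"]: c["name"] for c in customers}
    return sorted({names[i["customer_id"]] for i in invoices if i["status"] == "overdue"})


def tickets_with_tags(tickets, required):
    """Return ids of tickets whose tag set contains every required tag."""
    return [t["id"] for t in tickets if set(required) <= set(t["tags"])]


def graph_path(triples, start, goal):
    """Breadth-first search over directed (subject, predicate, object) triples."""
    edges = {}
    for subject, _, obj in triples:
        edges.setdefault(subject, []).append(obj)
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in edges.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Expected answers travel with the data, so any model or reviewer can be checked.
expected = {
    "overdue_customers": ["Birch Ltd"],
    "refund_billing_tickets": ["t2"],
    "customer_to_policy": ["c2", "t2", "policy-refund"],
}
actual = {
    "overdue_customers": overdue_customers(customers, invoices),
    "refund_billing_tickets": tickets_with_tags(tickets, ["refund", "billing"]),
    "customer_to_policy": graph_path(triples, "c2", "policy-refund"),
}
results = {name: actual[name] == expected[name] for name in expected}
```

The `results` mapping records, per question, whether the computed answer matches the stored expectation. A model's answers could be compared against `expected` in the same way.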

---

## 4. Evolution path to hosted infrastructure

The combined capsule also documents how each layer can evolve:

| Capsule layer | Portable form | Local implementation | Cloud implementation |
|---|---|---|---|
| Relational | JSON + SQL schema | SQLite, DuckDB, local Postgres | Cloud SQL, AlloyDB, Neon, Supabase |
| Graph | triples JSON | Kuzu, RDF libraries, local Neo4j | Neo4j Aura, Neptune, managed graph service |
| Document | nested JSON | LowDB, PouchDB, local MongoDB | MongoDB Atlas, Firestore, DynamoDB |
| Vector | documents + embeddings | sqlite-vec, LanceDB, local vector index | hosted pgvector, Qdrant, Weaviate, Atlas Vector Search |

The capsule should not contain secrets. It should contain the contract: the expected shape, data sample, query examples, and capability requirements. A host can later supply the correct library, connection string, identity grant, and network path.

This is the important boundary: the capsule flattens infrastructure for review and portability, then becomes the handoff contract when infrastructure is justified.
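
To make the first step of that path concrete, here is a minimal sketch of moving the relational layer from its portable form into a local SQLite database using only the Python standard library. The schema and records are hypothetical stand-ins for `schema.sql` and the JSON files; the real capsule contents may differ.

```python
import json
import sqlite3

# Hypothetical schema and records, standing in for schema.sql and the JSON files.
SCHEMA = """
CREATE TABLE customers (id TEXT PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE invoices (
    id TEXT PRIMARY KEY,
    customer_id TEXT NOT NULL REFERENCES customers(id),
    status TEXT NOT NULL
);
"""
CUSTOMERS_JSON = '[{"id": "c1", "name": "Acme Co"}, {"id": "c2", "name": "Birch Ltd"}]'
INVOICES_JSON = (
    '[{"id": "inv-1", "customer_id": "c1", "status": "paid"},'
    ' {"id": "inv-2", "customer_id": "c2", "status": "overdue"}]'
)


def load_relational(conn: sqlite3.Connection) -> None:
    """Create the tables and insert the JSON records using named parameters."""
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO customers (id, name) VALUES (:id, :name)",
        json.loads(CUSTOMERS_JSON),
    )
    conn.executemany(
        "INSERT INTO invoices (id, customer_id, status) "
        "VALUES (:id, :customer_id, :status)",
        json.loads(INVOICES_JSON),
    )
    conn.commit()


# An in-memory database keeps the example free of files and connection strings.
conn = sqlite3.connect(":memory:")
load_relational(conn)
overdue = conn.execute(
    "SELECT c.name FROM customers c JOIN invoices i ON i.customer_id = c.id "
    "WHERE i.status = 'overdue' ORDER BY c.name"
).fetchall()
```

Only the `sqlite3.connect(...)` target would need to change to point at a file or, with a different driver, a hosted database. The schema, sample data, and query stay the same, which is the sense in which the capsule acts as the contract.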

---

## 5. Relationship to vector databases

The MongoDB Vector Search documentation describes vector search as a way to search data by semantic meaning, to combine vector search with full-text search, and to filter by fields in the collection. It also positions vector search as useful for RAG and agentic systems.

That maps cleanly to data infrastructure capsules. The capsule can carry a tiny deterministic vector example for review. When scale or live retrieval is needed, the same contract can move to a real vector backend.
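
A tiny deterministic vector example might look like the following sketch. The three-dimensional embeddings and document ids are hand-picked and hypothetical, and brute-force cosine similarity is used here only because it needs no backend; it is not necessarily how the example capsule or any vector database ranks results.

```python
import math

# Hypothetical hand-picked embeddings; real embeddings would come from a model.
documents = [
    {"id": "policy-refund", "embedding": [1.0, 0.0, 0.0]},
    {"id": "policy-privacy", "embedding": [0.0, 1.0, 0.0]},
    {"id": "policy-billing", "embedding": [0.6, 0.0, 0.8]},
]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest(query, docs):
    """Return the id of the document whose embedding is most similar to `query`."""
    return max(docs, key=lambda d: cosine(query, d["embedding"]))["id"]


query = [0.9, 0.1, 0.0]
closest = nearest(query, documents)
```

Because the vectors are fixed, the closest document is the same on every machine, so the accepted result can be stored in the capsule as an expected answer.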

The capsule remains useful even after the database exists because it can preserve:

- which embeddings were expected
- what query was run
- which retrieval result was accepted
- which agent/user made the decision
- what version of the schema or index was in force
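
One way to preserve those five facts is an append-only record in which each entry commits to the one before it. The sketch below is an assumption about what such a record could look like: the hash-chain layout, the field names, and the helpers `append_event` and `verify_chain` are all hypothetical and do not describe the actual Capsule provenance format.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry (assumption)


def _digest(prev_hash, event):
    """SHA-256 over a canonical JSON encoding of the previous hash and the event."""
    body = {"prev": prev_hash, "event": event}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_event(chain, event):
    """Append `event` to `chain`, linking it to the previous entry by hash."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({"prev": prev_hash, "event": event, "hash": _digest(prev_hash, event)})
    return chain


def verify_chain(chain):
    """Recompute every hash and check that each entry points at its predecessor."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev"] != prev_hash or entry["hash"] != _digest(entry["prev"], entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


# Hypothetical retrieval decision covering the five items listed above.
chain = []
append_event(chain, {
    "expected_embedding_ids": ["policy-refund"],
    "query": "refund policy for overdue invoices",
    "accepted_result": "policy-refund",
    "decided_by": "reviewer@example.com",
    "schema_version": "0.1",
})
```

With this layout, editing any earlier event changes its hash and makes `verify_chain` return `False`, so a later reader can tell whether the recorded decision history is intact.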

---

## 6. Test status

The companion examples lane currently contains:

- [data-infra-mini.capsule](https://github.com/virionai/capsules-extra/blob/main/capsule-examples/public/data-infra/data-infra-mini.capsule)
- [relational-mini.capsule](https://github.com/virionai/capsules-extra/blob/main/capsule-examples/public/data-infra/relational-mini.capsule)
- [knowledge-graph-mini.capsule](https://github.com/virionai/capsules-extra/blob/main/capsule-examples/public/data-infra/knowledge-graph-mini.capsule)
- [nosql-mini.capsule](https://github.com/virionai/capsules-extra/blob/main/capsule-examples/public/data-infra/nosql-mini.capsule)
- [vector-mini.capsule](https://github.com/virionai/capsules-extra/blob/main/capsule-examples/public/data-infra/vector-mini.capsule)

The local check script builds and evaluates the pack. In the latest run, all 5 capsules and all 54 checks passed.

---

## 7. Conclusion

Data infrastructure capsules are not a database replacement. They are a pre-infrastructure collaboration layer. They let small teams and agents carry enough data shape, query logic, and provenance to cooperate before committing to an infrastructure stack.

The pattern is especially useful when the task already depends on a model. If the model must understand the data anyway, the data contract should travel with the work.

---

## Sources reviewed

- MongoDB Vector Search overview: https://www.mongodb.com/docs/vector-search/
- MongoDB Atlas on Google Cloud: https://www.mongodb.com/products/platform/atlas-cloud-providers/google-cloud
- Capsule data infrastructure examples: https://github.com/virionai/capsules-extra/tree/main/capsule-examples/public/data-infra