Research · Medical journal · Local-first multimodal records

Local-first multimodal medical records: a portable file as the source of truth

A multimodal on-device LLM and a cryptographic portable-file format are the right substrate for between-visit patient data. They collapse the three problems most patient-data systems get wrong — vendor lock-in, model-context loss across handoffs, and trust establishment without an identity authority. This article walks the architectural narrative behind the medical-journal example: why each seam looks the way it does, and where it deliberately stops.

2026-05 · Protocol v0.6 · Medical journal · Gemma 4 Good

Section 01

The frame: where does between-visit patient data live today?

Between-visit patient data lives in one of three places. It lives in a vendor's cloud, where the EHR exposes an HL7 FHIR surface and a patient-facing portal that the patient logs into twice a year. It lives in a device-specific journal app — Apple Health, MyTherapy, a glucose-meter companion app — whose data shape was designed by whoever sells the device. Or it lives on the piece of paper the patient brings to the appointment because they didn't trust either of the other two to survive contact with the next clinician.

None of these compose cleanly with an on-device LLM that can interpret a photo of a rash, a voice memo describing a wheeze, or a free-text note typed at 11pm. The frame for this article is narrow: if the LLM is on-device and the patient owns the data, the transport format itself has to be the thing that's portable, signed, and verifiable. Not a SaaS brokering between them. Not a vendor-specific binary blob. A file. With cryptographic provenance the next clinician can check without asking permission.

Section 02

Why the file is the source of truth

In the medical-journal example, the encrypted .capsule on the patient's phone is the source of truth. localStorage holds ephemeral working state and the patient's private keys; everything else — the chain of events, the photo and audio bytes, the clinic's trust anchor — lives in the file. Every log triggers the same cycle: read the .capsule from app storage, decrypt content.enc using the patient's X25519 private key, open the inner ZIP, append the new event with hashes recomputed, re-seal the deterministic ZIP, re-sign the envelope, re-encrypt to [patient.pubkey, clinic.pubkey], write the file back.

This shape composes across the cases that matter. The patient changes devices: import the .pem, open the same file. The patient adds a second clinic at a referral: append clinic_b.pubkey to the recipient set at the next re-seal; the existing chain is preserved, the new clinic can read everything from this point forward, and the previous clinic still holds a copy of what they were already given. The patient holds a long-term record: the file is verifiable months later even if the vendor that built the journaling skill is gone — an outside engineer reading the envelope spec can write a reader. The data outlives every component touching it.

What this costs is honest write amplification. Every log decrypts the whole chain, appends, re-seals, and re-encrypts. For a six-lane journal accumulating a few entries a day, this is not a real cost; for a chain with hundreds of events per hour it would be. v0.6 accepts this for the demo because the cycle is what makes the file canonical. v0.7's parking-lot includes a lazy-seal mode — in-memory chain, periodic flush — that keeps the file-as-source-of-truth contract but amortizes the encrypt step. The tradeoff is explicit, not pretended away.

Section 03

Multimodal capture inside an LLM-mediated flow

The on-device model in the medical-journal example is Gemma 4 E4B running through Edge Gallery. Its job at capture time is small and specific. The patient says "I want to log this rash" or "I want to log this wheeze." The patient takes a photo or records a voice memo. Gemma 4 — vision for photos, audio understanding for voice memos — reads the bytes on-device and produces a structured finding plus one or two evidence-tied follow-up questions. For a photo: {location_hint, color, texture, edge_quality, raised, size_estimate, confidence} and "the edge looks raised on the upper-right; does that side itch more?" For audio: {transcript, classification, classification_confidence, duration_seconds} and "I hear a wheeze on the out-breath; is it worse lying down or after exercise?"

The split between bytes and interpretation is the contract. The bytes of the capture — the JPEG, the M4A — flow into payload/<id>.<ext> in the inner ZIP. The chain event holds a media_path field that is committed to the chain hash, so the original bytes and the entry that references them arrive on the clinician's device with a verifiable link between them. The interpretation — the transcript, the classification, the edge-quality assessment, the suggested color — flows into the chain event payload too, but every LLM-authored field is listed in untrusted_payload_fields. Bytes are patient-authored and verifiable; interpretation is LLM-authored and labeled untrusted at the wire level.

This is why the format committed to a payload/ directory and a chain-event field for media_path, and to untrusted_payload_fields as a chain-event slot rather than a host convention. It's not aesthetic. It's the seam that makes multimodal-LLM-mediated records sound. A foreign reader — the clinician, or a different LLM on a different device months later — can treat each side of the seam appropriately: render the photo, play the audio, surface the transcript, and wrap the transcript with untrusted-content framing when feeding it back to a model.

Section 04

The three-skill architecture on the device

The patient device runs Edge Gallery with Gemma 4 plus three skills, each doing one thing.

medical-journal — orchestrator, logger, sealer

Owns the chain. Calls get_clinic_recipient() from the installed clinic skill at seal time. Encrypts each sealed capsule to the patient and the clinic. Bundles the clinic skill into the exported capsule so the clinician's reader can confirm the round-trip identity. Returns control to Gemma after each log; Gemma decides whether to invoke clinical-probe next based on user intent.

clinical-probe — Gemma 4 intelligence

Three actions. intake_probe asks one or two clarifying questions when the user's free-text doesn't supply enough to populate a log event (text Gemma 4). multimodal_probe reads photo or audio bytes and returns a structured finding plus one or two evidence-tied follow-ups (Gemma 4 vision or audio). export_briefing reads the chain and produces a pre-visit summary; v0.1 ships this as deterministic JS (extending correlateTriggers) until the clinician-skill bridge is ready to consume a richer LLM-authored briefing in v0.7+. Stateless. Every LLM-authored field it returns is marked untrusted at the chain level.

clinic:<short> — trust anchor

Tiny. One action: get_clinic_recipient() returns {name, x25519_pubkey, originator_allowlist, version, issued_at}. All values are baked at build time into skill.json. No runtime network calls. Installed once via prescribe.chat/<short>.

These are sized to fit Edge Gallery's skill model exactly. None has to know about the others' internals. The format ties them together: the chain plus the manifest plus the envelope make it possible for medical-journal to receive a structured finding from clinical-probe and an X25519 recipient pubkey from clinic:<short>, then produce a sealed file the clinician can open offline. Each skill is small; the seam is the file.

Section 05

Trust establishment without an identity authority

There is no global identity registry, no certificate authority, no key server. The clinic publishes its X25519 public key at prescribe.chat/<short> — a static site, immutable JSON, four-character code, 1.6M slots. The patient pastes that URL into Edge Gallery's "Load skill from URL" dialog. The clinic skill is installed once. From that moment forward, every export the patient produces is encrypted to that clinic's key in addition to the patient's own.

Edge Gallery exposes only two skill-install paths — Load skill from URL and Import Local Skill. There are no deep links, no custom URL schemes, no "tap a link, the app opens." The trust step is intentionally manual. Anti-phishing runs in three layers, each surfacing the same identity from a different vantage:

The /about/<short> preview page

A human-readable HTML page served from the same static origin. Clinic name, clinic address, pubkey fingerprint, a copy-to-clipboard button for the install URL. This is the recommended entry point for a first-time patient: review who you're about to install, then copy the URL into Edge Gallery.

Edge Gallery's own install dialog

When Edge Gallery fetches prescribe.chat/<short>, it reads the skill manifest and shows the skill's declared name — for example, "Riverside Dermatology — RX7Q." The patient confirms install against the name displayed by the host, not by the URL.

medical-journal's first-use confirmation

Before the first encryption, medical-journal surfaces clinic name and pubkey fingerprint one more time inside the skill flow. Subsequent uses cache the consent. First use is gated, which is the only use that matters for catching a wrong skill installed once.

The out-of-band assumption is explicit: short codes come from the clinic through the same channels the clinic uses for everything else. A paper card at checkout, an SMS, an email, a QR scanned at the front desk. This isn't pretending to be an identity-binding protocol. It's making the trust step a discrete, visible, three-times-reinforced operation, and surfacing what is being trusted at every layer.

Section 06

The clinician side: one HTML file, no install

The clinician opens reader.html. It is one file. They drag a .capsule into it. They drag the clinic's .pem private key into it. Decryption, signature verification, chain-hash verification, and lane rendering all happen in-page. Photo events render with an inline <img>; audio events render with <audio controls>. The bundled skills/clinic-<short>/skill.json is displayed as the trust anchor: "this capsule was encrypted to clinic name, fingerprint hash, installed by patient at date." If the clinician's own installed clinic skill matches, the reader badges the file as a verified trust anchor; if it doesn't, the reader warns.

The reader bundles a static prep card — candidate triggers ranked by ±24h co-occurrence with high-severity symptoms, a medication-effect panel computing mean severity in the 48h before versus 72h after each medication start, suggested questions drawn from a deterministic template based on practice_focus. All of this is in-page JS. None of it is LLM-generated. That is the point: the universal reader stays zero-install. A Gemma-mediated, chain-grounded pre-visit briefing exists as an optional clinician-skill (Gemma 4 on the clinician's iPad) and is parking-lot for v0.7+.

What this buys architecturally is the floor. The format is not a chat protocol. The clinician does not need a model on their side to get value from the record. The model adds depth — chain-grounded Q&A, confidence-tagged differential, specialty-specific framing — but verification and timeline rendering and severity analytics are the baseline. The baseline runs on any device with a browser, offline, today.

Section 07

What this trades against

HL7 FHIR is interoperability for institutions. Capsule is interoperability for patients. Different center of gravity. FHIR has an industry mandate and works for the data plane between large health systems; Capsule has no industry mandate and doesn't require one to start working in any clinic that opens a .capsule file. They are not competing for the same slot. A FHIR-fluent EHR could ingest a Capsule on the back end if it wanted to; a Capsule does not require an EHR to be useful.

Cloud-mediated AI records — Doximity, Suki, Abridge, the growing layer of clinical-AI vendors — give clinicians LLM analysis with infrastructure scale. Trained models, sustained compute, a vendor-side surface for safety and oversight. Capsule trades infrastructure for portability and audit. The patient holds the record. The LLM ran on the patient's device. Nothing transited a vendor's cloud before the file reached the clinic. For a 2026 multimodal model running on a Pixel 8, the capability floor for "a journalist that turns lived experience into structured entries" is already inside the budget. The tradeoff is no vendor-side oversight on the patient's side of the handoff — which is exactly the point if the patient is the one being journaled.

Paper, phone notes, and vendor-locked patient apps are the third comparison. Capsule is the upgrade path that doesn't sacrifice the patient-ownership property the paper has. A handwritten symptom log is the most portable record in existence and the worst at survival, structure, and verification. A patient-app log is structured but locked. A .capsule is structured, portable, verifiable, and patient-held. That's the slot.

Section 08

What this doesn't claim

Three callouts, in the same posture as v0.6's "what doesn't get fixed" section in the cryptographic redesign article:

It is not a clinical decision support tool

The reader's analytics — ±24h candidate triggers, medication-effect severity deltas — are heuristic surfaces for a clinician to weigh. They are not diagnoses. They do not recommend treatment. The UI says so explicitly at the panel level. The same applies to clinical-probe's structured findings: a Gemma-authored note that an edge "looks raised on the upper-right" is patient-context that helps the clinician's eye land on the right area of the photo, not a dermatological assessment.

It does not establish identity

A capsule signed by pk_ABC is verifiable as having been produced by whoever holds the matching private key. Whether that's actually the patient the clinician thinks it is depends entirely on the out-of-band trust step: the short code came from the right clinic, the clinic gave it to the right patient, the patient's .pem hasn't been shared. The format proves the math. It does not prove the relationship between math and reality.

It is not the only piece

Capsule is a transport format. Real adoption needs at minimum: a clinic willing to issue prescribe.chat codes and to open .capsule files in their workflow, a clinic-side ingestion path that fits the practice's existing intake (a clinician at a laptop with reader.html bookmarked is the floor; an integrated clinic skill is the ceiling), and a clinician-skill where Gemma-mediated chain reading earns its keep. The medical-journal example is one well-instrumented vertical that demonstrates the format works end-to-end. Other verticals — pediatric growth, post-surgical follow-up, allergy/immunology, mental-health journaling — would each adapt the lane set and the analytics, not the format.

Section 09

Status and trajectory

v0.1 of the medical-journal example ships in May 2026, alongside the Kaggle Gemma 4 Good submission. Six lanes — symptom, food, environment, medicine, supplement, nutrient. Multimodal photo and audio capture with Gemma 4 vision and audio understanding on-device. clinical-probe exposing intake_probe and multimodal_probe wired to real on-device Gemma 4 calls. Multi-recipient encryption (patient + clinic) using the v0.6 envelope. prescribe.chat/<short> install via Edge Gallery's URL flow plus an /about/ preview page. End-to-end demo on real devices in airplane mode. Video probe is parking-lot.

v0.7+ parking lot: a clinician-skill (Gemma on the iPad for chain-grounded Q&A and confidence-tagged differential), specialty clinician skills for derm, allergy, GI and general practice, key rotation / revocation / custodial recovery, an encrypted-to-recipient cycle as a stable spec, lazy-seal mode to amortize the per-log encrypt cost, external time anchoring (RFC 3161 / a Rekor-style transparency log).

The format itself locks at v0.6 when a second independent implementation round-trips the signed test vectors bit-identically — the same kill criterion described in the cryptographic redesign article. Adoption is the leading indicator; protocol stewardship is the trailing indicator. The medical-journal example is the first real-world vertical the format is held accountable to; it is not the last.

Conclusion

The seam is the contribution

The architecture is not novel in any single piece. Portable files exist. Signed records exist. On-device multimodal models exist. The contribution is the seam between them: a portable file shape small enough to fit a clinical handoff, signed in a way an outside engineer can verify, multimodal in a way an on-device model can produce and an offline reader can render. Three skills sized to fit Edge Gallery, a trust step that is explicit and three-times-reinforced, and a clinician reader that is one HTML file. The format is the substrate; the medical-journal example is the proof the substrate carries weight.

Specification: Protocol v0.6. Example: medical journal. Reference SDK: capsules.run/sdk. Submission: Gemma 4 Good.