WALDUS: a curated, decentralized, tamper-evident, hybrid-chain LLM

I’m ending the year with a project that had to be shelved for now: WALDUS.ai, a curated, decentralized, tamper-evident, hybrid-chain LLM. Not because the idea is necessarily bad, but because the investment & economics are brutal. The GPU VRAM, RAM, storage, infrastructure, and time required to train a useful model from scratch (without starting from an existing model) quickly become very expensive and time-consuming.

On top of that, just before writing this article, I searched around and realized my idea wasn’t as original as I thought: its design overlaps with existing projects. But was a fully usable, original LLM ever the goal for me?

WALDUS started a couple of months ago as a simple question I couldn’t shake:

What if LLM training data worked more like human learning?

That question formed while I, like most enthusiasts, was learning how this new LLM hype actually works behind the scenes. And, as with all my other projects, I had to give it a shot.

Idea & Design

First I had to understand how most LLMs learn today (and what that means in practice). After a quick deep dive, my understanding is that most modern LLMs are trained in two big phases:

1) Pretraining (the “learn language from lots of text” phase)

A common approach is: show the model a lot of text and train it to predict the next token (where a token is a small piece of text). This is a very standard setup for GPT-style models.
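
To make that concrete, here’s a toy sketch in plain Python of how raw text turns into next-token training examples (the whitespace split below stands in for a real tokenizer; actual pipelines work on token IDs):

    # Toy sketch: turning raw text into next-token prediction examples.
    # The whitespace split stands in for a real tokenizer.
    text = "the model learns to predict the next token"
    tokens = text.split()

    # Each training example is (context so far -> next token).
    examples = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

    for context, target in examples[:3]:
        print(context, "->", target)
    # ['the'] -> model
    # ['the', 'model'] -> learns
    # ['the', 'model', 'learns'] -> to

The model’s only job during pretraining is to get better at that prediction, over a very large corpus.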

Where does the text/data come from? Often it includes publicly available internet data, and many training runs also include licensed or otherwise specially sourced datasets. Even OpenAI’s GPT report describes using both public internet-sourced data and licensed data.

2) Post-training (the “make it follow instructions” phase)

After pretraining, many LLMs are tuned to be more helpful and safe. One well-known method is RLHF (reinforcement learning from human feedback), which is part of the broader family of post-training methods used to shape behavior after the big pretraining stage.
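
At the data level, RLHF-style methods typically start from human preference comparisons. A toy illustration of what one such record could look like (illustrative fields, not a real dataset):

    # Toy illustration: humans compare two candidate answers, and a
    # reward model is trained to score the preferred one higher.
    preference_example = {
        "prompt": "Explain what a token is.",
        "chosen": "A token is a small piece of text, such as a word fragment.",
        "rejected": "idk, google it",
    }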

Tokenization (quiet but important):
Before training, text is usually converted into tokens using a tokenizer. A common choice (used by many transformer models) is Byte Pair Encoding (BPE) or a similar subword tokenizer.
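
As a rough illustration of the mechanic (a toy sketch over a single word; real tokenizers learn their merges across a whole corpus), BPE repeatedly merges the most frequent adjacent pair of symbols:

    from collections import Counter

    def bpe_merges(word, num_merges):
        # Toy BPE: repeatedly merge the most frequent adjacent symbol pair.
        symbols = list(word)
        for _ in range(num_merges):
            pairs = Counter(zip(symbols, symbols[1:]))
            if not pairs:
                break
            (a, b), _ = pairs.most_common(1)[0]
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                    merged.append(a + b)  # fuse the pair into one symbol
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            symbols = merged
        return symbols

    print(bpe_merges("banana", 2))  # ['ban', 'an', 'a']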

So what’s the downside of the usual approach?
It’s not that it’s “wrong”; it’s clearly very successful, as we can see. It’s that once you mix huge amounts of data, shuffle it, and bake it into weights, it becomes hard to answer questions like:

  • Where did this piece of knowledge come from?
  • Who added it?
  • Was it approved or curated?
  • Was it later found to be wrong or harmful?

Once everything is mixed and shuffled, provenance is effectively gone.

Dataset “changes” aren’t naturally recorded as a clear, permanent timeline.

And if you want a curriculum (a controlled sequence of knowledge), you have to build it outside the core training substrate.

The design for WALDUS was to make those missing pieces first-class & transparent.

WALDUS

Should learning have a supply chain?

WALDUS treats training data like a "software artifact":

  • it has an identity,
  • it has an author/source,
  • it can be reviewed,
  • it can be included in a release (including sub- or fork releases),
  • and you can reproduce exactly what went into a training run.

The goal of the v0-core-spec was a system where:

  • contributors upload training materials intentionally,
  • curators approve or reject inclusion,
  • dataset releases are versioned,
  • training runs can reference a specific release,
  • and integrity is verifiable even when the data is hosted by many independent community nodes.

Hybrid-chain: blobs for bytes, chain for history

The core design decision is splitting “data” and “truth/verification” into different layers.

1) P2P blobs: the data plane (the “library”)

Training materials live as blobs in a content-addressed store. That means:

  • The blob ID is derived from its content (hash).
  • If anyone changes the content, the ID changes.
  • Clients can verify they received the correct bytes.

In WALDUS v0, blobs can be chunked and described by a descriptor (chunk size, chunk hashes, and a root hash). This makes it easy for relay nodes to distribute content and for clients to verify integrity chunk-by-chunk.
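
A minimal sketch of that descriptor idea (my own illustrative scheme: SHA-256 per chunk, with the root hash taken over the concatenated chunk hashes; a real design might use a proper Merkle tree):

    import hashlib

    CHUNK_SIZE = 4 * 1024  # illustrative; a real descriptor records its own

    def make_descriptor(blob: bytes, chunk_size: int = CHUNK_SIZE) -> dict:
        # Split the blob into chunks and hash each one; the root hash
        # pins the whole blob to a single content-derived ID.
        chunks = [blob[i:i + chunk_size] for i in range(0, len(blob), chunk_size)]
        chunk_hashes = [hashlib.sha256(c).hexdigest() for c in chunks]
        root = hashlib.sha256("".join(chunk_hashes).encode()).hexdigest()
        return {"chunk_size": chunk_size, "chunk_hashes": chunk_hashes, "root": root}

    def verify_chunk(descriptor: dict, index: int, data: bytes) -> bool:
        # A client can verify a single chunk fetched from any untrusted host.
        return hashlib.sha256(data).hexdigest() == descriptor["chunk_hashes"][index]

    blob = b"x" * 10_000
    desc = make_descriptor(blob)          # 3 chunks
    assert verify_chunk(desc, 0, blob[:CHUNK_SIZE])

Change one byte anywhere in the blob and the root hash (its ID) changes, which is exactly the property the untrusted-host model relies on.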

This is how the system becomes decentralized:

  • anyone can host blobs,
  • anyone can mirror blobs,
  • and clients don’t have to “trust” the host, only the hash.

2) The chain: the integrity and decision plane (the “journal”)

The chain is an append-only log of events:

  • who submitted what,
  • who approved what,
  • what dataset release includes what,
  • what training run used what,
  • what was revoked later.

This is the important nuance:
A blockchain-style ledger is best thought of as tamper-evident and append-only, not as magic “nothing can ever change.” Under the chain’s security assumptions, history is hard to rewrite without detection and/or major cost. That’s exactly what we want.

WALDUS uses the chain to record commitments, not the data.
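
A minimal sketch of that “commitments, not data” idea (illustrative only: signatures are elided and the real chain format differs). Each event commits to the previous event’s hash and to a blob’s content hash, so rewriting any past entry breaks every hash after it:

    import hashlib, json

    def append_event(log: list, event: dict) -> None:
        # Append-only log: each entry commits to the previous entry's hash.
        # (Per-entry signatures are omitted in this sketch.)
        prev = log[-1]["hash"] if log else "0" * 64
        body = {"prev": prev, "event": event}
        body["hash"] = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()
        ).hexdigest()
        log.append(body)

    def verify(log: list) -> bool:
        # Tamper-evidence: recompute every link and compare.
        prev = "0" * 64
        for entry in log:
            body = {"prev": entry["prev"], "event": entry["event"]}
            digest = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()
            ).hexdigest()
            if entry["prev"] != prev or digest != entry["hash"]:
                return False
            prev = entry["hash"]
        return True

    log = []
    append_event(log, {"type": "waldus:artifact.create", "blob_root": "ab12"})
    append_event(log, {"type": "waldus:artifact.curate", "blob_root": "ab12"})
    print(verify(log))  # True; edit any field in any entry and this fails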

Why this feels more like human learning

1) It’s guided, not just massive

Humans don’t learn by trying to read everything that's ever written. They learn from:

  • curated sources
  • teachers
  • syllabi
  • institutions
  • peers

WALDUS was designed around manifests and curation, so “what the model learns” is a deliberate, reviewable choice.

That lets you say:

  • “This model was trained on dataset release X.”
  • “Release Y added these curated sources and excluded these artifacts.”

2) It respects “who taught this?”

People care where knowledge comes from:

  • citations
  • authors
  • reputations
  • licenses

WALDUS records authorship and identity at the protocol level. That enables:

  • attribution
  • license tracking
  • contributor reputation
  • accountability

3) It keeps a memory of change

Humans update beliefs over time, but history remains:

  • old editions exist
  • retractions exist
  • corrections exist

WALDUS uses revocation and new releases instead of silent edits. That makes knowledge evolution visible and auditable.

People don’t erase old textbook editions; they release a new edition and explain changes.

  • the bad artifact still exists in history,
  • but training clients are expected to refuse it if it’s revoked with a “training_disallow” scope.

That matches the real world: knowledge evolves, and the trail matters.
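
A sketch of how a training client might honor that (illustrative event shapes, not the actual v0 schema): replay the event log, collect revocations scoped to training, and filter the release accordingly:

    def trainable_artifacts(release_artifacts: set, events: list) -> set:
        # Drop artifacts revoked with a "training_disallow" scope.
        # The revocation stays in history; it just keeps the artifact
        # out of new training runs.
        revoked = {
            e["artifact"]
            for e in events
            if e["type"] == "waldus:artifact.revoke"
            and "training_disallow" in e.get("scopes", [])
        }
        return release_artifacts - revoked

    events = [
        {"type": "waldus:artifact.create", "artifact": "A"},
        {"type": "waldus:artifact.create", "artifact": "B"},
        {"type": "waldus:artifact.revoke", "artifact": "B",
         "scopes": ["training_disallow"]},
    ]
    print(trainable_artifacts({"A", "B"}, events))  # {'A'}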

Why the hybrid-chain is a good fit (and what it isn’t)

WALDUS does not store training data on-chain. That would be expensive and slow.

Instead it uses a chain for what chains are good at:

  • append-only event history
  • signed actions tied to identities
  • distributed verification
  • tamper-evident provenance

And it uses P2P content-addressed blobs for what P2P systems are good at:

  • distributing large content
  • replication by communities
  • integrity checks through hashes

That hybrid approach is the point: the chain anchors truth; the P2P blob network carries the data.

What’s in the chain today

Type examples:

  • “waldus:artifact.create” -> submitted blob with metadata
  • “waldus:artifact.curate” -> a curator approved/processed it
  • “waldus:artifact.revoke” -> marked as not allowed
  • “waldus:manifest.create” -> a named dataset release (“v0.1.0”)
  • “waldus:run.create / run.attest” -> training runs and checkpoint proofs
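
For a feel of the shape (illustrative fields only; the actual v0 schema may differ), a manifest.create event might carry something like:

    # Illustrative only: a possible shape for a manifest.create event.
    manifest_create = {
        "type": "waldus:manifest.create",
        "release": "v0.1.0",
        "artifacts": ["<blob root hash>", "<blob root hash>"],
        "author": "<contributor public key>",
        "signature": "<signature over the event body>",
    }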

I’m happy I built the first part of WALDUS, and I’m learning a lot from the process. Maybe one day I’ll release a public demo of https://waldus.ai ;)

Cheers,