01 — The Number I Was Proud Of
9.7 Million Entities. And a Quiet Problem Underneath.
When you build a platform alone, the metric in the corner of the dashboard is what tells you whether the work was worth it. For months, mine read 9.7 million entities. Companies. Organizations. Persons. Government bodies. Cross-referenced against fourteen million federal records from sixty-plus sources.
It was the number I put in pitches. The number I told friends. The number that made me feel like the platform was real.
It was also the number that had a quiet problem underneath it.
1.65M: Substantive (actual)
Two-thirds of my "entities" had fewer than two attached records. Google had already noticed and was quietly noindexing them. I had been telling myself it was a crawl budget issue, a sitemap submission timing problem, anything except the obvious thing.
02 — The Strategic Session in Germany
What You See When You Step Away From the Code
I flew to Germany in late May. Family. Old friends. Cities that built me. I wasn't planning to do strategy work. But on a quiet afternoon — laptop open, tea getting cold — I forced myself to look at the platform the way a buyer would.
A plaintiff lawyer types in "Equifax." What do they get? They get the real Equifax page with a hundred federal records. Good. They also get four other entries with similar names — shells, fragments, partial extractions — that exist because something in the harvest pipeline created them automatically from press release titles.
They don't say "this platform is impressive." They say "this platform looks half-finished."
Most data companies chase the big number. 9.7 million sounds impressive. Investors love big numbers. Marketing loves big numbers. But buyers don't pay for big numbers. Buyers pay for trustworthy data. The number had to come down before the product could come up.
By the end of that session the strategic question reframed itself: the real asset was the 1.65 million genuinely substantive entities — the ones with multiple verified federal records, the ones a buyer would actually search for. Everything else was either junk to remove or thin data waiting to be deepened.
The job changed. Quality over quantity. Now I had to come home and prove I meant it.
03 — The Discovery
The Harvester Had Been Hallucinating Entities for Months
Back in Sacramento. First audit pass on the database. The findings landed in waves, each one worse than the last.
The bulk harvester — the daily orchestration script that pulled from sixty-plus government sources — had a bug. A foundational bug. It was treating document titles as entity names.
| Source | What the Harvester Saw | What It Created |
|---|---|---|
| FEMA | "Disaster Declaration for Madison County Storms" | Entity name: "Disaster Declaration for Madison County Storms" |
| Federal Register | "Notice of Proposed Rulemaking 87 FR 12345" | Entity name: "Notice of Proposed Rulemaking 87 FR 12345" |
| DOJ | "Five Defendants Indicted in Drug Trafficking Conspiracy" | Entity name: "Five Defendants Indicted in Drug Trafficking Conspiracy" |
| DOJ | "Georgia Thief Pleads Guilty" | Entity name: "Georgia Thief" |
None of these are entities. They are headlines. Section titles. Document fragments. But the harvester scraped them, created an entity record for each, and then attached the related article as a record to its own fragment.
The pattern had been running for months. The pollution was structural.
04 — The Kill
You Don't Patch a Broken Foundation. You Tear It Out.
The bulk harvester was the original architecture. It was clever. It tried to do everything in one script — sixty sources, daily orchestration, auto-create any entity it didn't recognize, run a chronic twelve-hour cycle. It was a monument to "ship fast." It was also the source of every contaminated entity in the database.
On June 4, the daily cron got commented out. No farewell ceremony. Just a hash mark in front of the line:
# Daily harvest: DISABLED June 4, 2026
# 0 2 * * * /opt/cfai-engine/bulk_harvester.py --daily
The active code path was dead. The pollution couldn't recur. But the historical damage was still in the database — hundreds of thousands of contaminated entities that had to be sorted, audited, and removed without touching the real data sitting next to them.
The cleanup ran in two passes, each with its own discipline.
05 — The Cleanup
Two Hundred and Fifty Thousand Bad Records — Surgically Removed
The first pass — Wave 18c — targeted entities with zero attached records. The cleanest case. If something has no records and a name like "Federal Register Volume 88 Notice 12345," it's not an entity, it's noise.
92,959: Phantom entities (W18c)
55,226: Garbage entities (W18f)
184,737: Cascade records removed
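The first-pass selection can be sketched in a few lines. This is only an illustration against an in-memory SQLite database: the table and column names (`entities`, `records`, `entity_id`) and the placeholder record titles are assumptions, not the platform's actual schema. The shape of the query is the point: an entity counts as a phantom when no record points at it.

```python
import sqlite3


def demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory database with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE entities (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE records (
            id INTEGER PRIMARY KEY,
            entity_id INTEGER NOT NULL,
            title TEXT NOT NULL
        );
        INSERT INTO entities VALUES
            (1, 'Equifax'),
            (2, 'Federal Register Volume 88 Notice 12345');
        -- Placeholder records: only entity 1 has anything attached.
        INSERT INTO records VALUES
            (1, 1, 'placeholder record A'),
            (2, 1, 'placeholder record B');
        """
    )
    return conn


def find_phantoms(conn: sqlite3.Connection):
    """Return (id, name) for every entity with zero attached records."""
    return conn.execute(
        """
        SELECT e.id, e.name
        FROM entities AS e
        LEFT JOIN records AS r ON r.entity_id = e.id
        WHERE r.id IS NULL
        ORDER BY e.id
        """
    ).fetchall()
```

Running `find_phantoms(demo_db())` on this toy data returns only the "Federal Register Volume 88 Notice 12345" row; the entity with attached records is left alone.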
The second pass — Wave 18f — was harder. These were entities with one attached record, which made them look substantive on paper. But the record itself was usually just a DOJ press release headline like "Kansas City Man Sentenced" attached to an entity named "Kansas City Man." Both the entity and the record were garbage. Removing one without the other would leave orphan data.
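Removing the entity and its single record together is essentially a cascade delete inside one transaction. The sketch below shows one way to do that, again with a hypothetical SQLite schema (`entities`, `records`, `entity_id`) rather than the real one. The helper names `delete_with_cascade` and `count_orphans` are illustrative only.

```python
import sqlite3


def demo_db() -> sqlite3.Connection:
    """Hypothetical schema with one real entity and one headline fragment."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE entities (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE records (
            id INTEGER PRIMARY KEY,
            entity_id INTEGER NOT NULL,
            title TEXT NOT NULL
        );
        INSERT INTO entities VALUES (1, 'Equifax'), (2, 'Kansas City Man');
        INSERT INTO records VALUES
            (1, 1, 'placeholder record A'),
            (2, 1, 'placeholder record B'),
            (3, 2, 'Kansas City Man Sentenced');
        """
    )
    return conn


def delete_with_cascade(conn, entity_ids):
    """Delete entities and their attached records in a single transaction.

    Returns (entities_deleted, records_deleted). If anything raises,
    the `with conn` block rolls back and nothing is removed.
    """
    entities_deleted = records_deleted = 0
    with conn:  # commits on success, rolls back on exception
        for eid in entity_ids:
            records_deleted += conn.execute(
                "DELETE FROM records WHERE entity_id = ?", (eid,)
            ).rowcount
            entities_deleted += conn.execute(
                "DELETE FROM entities WHERE id = ?", (eid,)
            ).rowcount
    return entities_deleted, records_deleted


def count_orphans(conn) -> int:
    """Count records whose entity no longer exists."""
    return conn.execute(
        """
        SELECT COUNT(*)
        FROM records AS r
        LEFT JOIN entities AS e ON e.id = r.entity_id
        WHERE e.id IS NULL
        """
    ).fetchone()[0]
```

Deleting records before their entity, inside the same transaction, means there is no moment at which an orphan row is committed. A `count_orphans` check afterwards is a cheap way to confirm that.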
The first predicate I drafted was wrong. The audit caught it. Real companies were sitting in the deletion set — Marubeni Corporation, Goldman Sachs, several SEC-registered public companies — because the bulk harvester had also mis-typed them as "junk." The discipline at the gate saved them.
A second, narrower predicate ran instead. Ninety-nine point seven percent garbage purity, verified against three hundred random samples before a single row was deleted.
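A sample-based gate of this kind can be sketched as follows. Everything here is an assumption for illustration: the function name, the `is_garbage` callback (in practice a human verdict on each sampled row), and the 99% approval threshold are not taken from the actual audit. As plain arithmetic, 299 garbage verdicts out of 300 samples works out to 299/300 ≈ 99.7%.

```python
import random


def audit_deletion_set(candidates, is_garbage, sample_size=300,
                       min_purity=0.99, seed=0):
    """Estimate the purity of a deletion set from a random sample.

    candidates : list of entity names selected by the predicate
    is_garbage : callable returning True when a sampled name is junk
                 (stands in for a manual review verdict)
    Returns (purity, approved). Nothing is deleted here; this is only
    the gate that decides whether deletion may proceed.
    """
    if not candidates:
        return 0.0, False
    rng = random.Random(seed)  # fixed seed keeps the audit reproducible
    sample = rng.sample(candidates, min(sample_size, len(candidates)))
    garbage = sum(1 for name in sample if is_garbage(name))
    purity = garbage / len(sample)
    return purity, purity >= min_purity
```

A predicate that sweeps in real companies fails this gate as soon as enough of them land in the sample, which is the reason to run it before any row is deleted rather than after.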
The platform did not get smaller because the work was wrong. The platform got smaller because the foundation was being cleaned. Quality is not what's left after you stop adding. Quality is what's left after you have the discipline to remove what shouldn't be there.
06 — The Standing Rule
The Lesson Written Into the Code
The bulk harvester is dead. The pollution it created is mostly gone. But the actual lesson — the one that has to outlive any single cleanup wave — is now a permanent rule in every new harvester that gets built going forward.
Standing Rule — All Future Harvesters
Match first. Never auto-create entities from scraped text. Match targets must already exist with at least one verified record. Phantom-shape names — headline fragments, document titles, sentence shards — are rejected before they can become entities.
It sounds simple. It is the difference between a real product and a pile of data.
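One way the rule could look in code is sketched below. The regular expressions, the `known_entities` lookup shape, and the function names are all hypothetical. They illustrate the match-first idea and are not the harvesters' real implementation.

```python
import re

# Hypothetical "phantom-shape" heuristics: headline verbs and document-title openers.
_HEADLINE_VERBS = re.compile(
    r"\b(pleads|sentenced|indicted|charged|convicted|arrested)\b", re.IGNORECASE
)
_DOC_TITLE = re.compile(
    r"^(notice of|disaster declaration|federal register)\b", re.IGNORECASE
)


def looks_like_phantom(name: str) -> bool:
    """True when the text reads like a headline or document title."""
    return bool(_HEADLINE_VERBS.search(name) or _DOC_TITLE.search(name))


def resolve_entity(scraped_name: str, known_entities: dict):
    """Match-first resolution: return an existing entity id or None.

    known_entities maps a normalized name to (entity_id, verified_record_count).
    This function never creates an entity; unmatched text is simply dropped.
    """
    if looks_like_phantom(scraped_name):
        return None
    match = known_entities.get(scraped_name.strip().casefold())
    if match is None:
        return None
    entity_id, verified_records = match
    # A match target must already carry at least one verified record.
    return entity_id if verified_records >= 1 else None
```

With `known_entities = {"equifax": (1, 100)}`, the call `resolve_entity("Equifax", known_entities)` returns `1`, while "Five Defendants Indicted in Drug Trafficking Conspiracy" is rejected by shape. A fragment like "Georgia Thief" passes the shape check but still resolves to `None` because nothing matches it, so the match-first step is the real safeguard and the shape filter is a second line of defense.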
Every new harvest wave built since the rule landed — California Attorney General enforcement, OFAC sanctions delta, SEC Litigation Releases — has shipped clean. Zero new contamination. Each one adds substantive entities, not headline ghosts.
07 — The Number That Matters
What the Platform Actually Is — Today
After the cleanup. After the standing rule. After the new waves built on the discipline that came out of all of it.
1.66M: Sitemap-eligible (rc≥2)
The total dropped. The substantive went up. The product got real.
Tomorrow's sitemap regeneration will include the six hundred sanctions and compliance organizations that surfaced as a side effect of the cleanup: entities that were always in the data but invisible behind a broken counter. They are visible now.
This isn't the end of cleanup. There is more to do. There always is. But the foundation under everything is honest now, and the rule that keeps it honest is permanent.
The first instinct of most builders is to add. Add features. Add data. Add the bigger number.
The harder instinct — the one that took me a trip to Germany and a long audit and one quiet decision to comment out a line of cron — is to remove what should never have been there. To choose what the platform is by what it refuses to contain.
Nine point seven million entities was a story I was telling. One point six six million substantive entities is a product I can sell.
The platform is smaller. The platform is more real. That is the only trade I will take.
Christian Fuhrmann
Founder & CEO, CFAISolutions LLC
Sacramento, California