Nine days of deep harvesting. Twelve clean integrity reports. Cross-jurisdictional coverage stops being a plan and becomes a thing you can query. This is what gets built when you stop adding sources for breadth and commit to depth.
Nine days ago, the compliance intelligence map I was trying to build was mostly blank.
The public number was 9.4 million entities across 51 sources. Impressive on a slide. But when I queried the enforcement depth — the part that actually matters to any compliance buyer — it was thin in the places that matter most.
DOJ press releases: 2,618 records when the real archive goes back decades. FINRA disciplinary: 1,276 records when the actual universe is closer to half a million. SEC enforcement: 9,077 when the combined litigation-release and administrative-proceeding archive is tens of thousands. FinCEN enforcement: 5 records. Five. Not a database. An embarrassment.
And I had zero coverage on anything outside the United States. No UK OFSI. No EU consolidated sanctions. No UN Security Council designations. No Canada SEMA. No AI incident data. No international or algorithmic accountability layer at all. "Five Eyes plus EU plus UN" is a basic sanctions-screening baseline for any serious compliance product, and I had the US portion only.
I also knew, from the witness problem I documented in Signal 015, that I couldn't trust harvester success reports at face value. The SAM.gov harvester had been reporting success for weeks while storing almost nothing. Claimed 970,193 records stored; actually had 16,750 in the database. A 953,443-record lie, buried in the logs. That experience changed how I look at my own numbers.
That's where I started.
The sprint had one rule, and it is the direct inheritance from the SAM lesson: every harvest gets a phantom check. Pre-run row count. Harvest. Post-run row count. The delta has to match the harvester's own claim — independently, in the ground-truth database, not in the agent's self-report. If the numbers don't agree, the run is quarantined and flagged, not silently accepted.
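The phantom check can be sketched in a few lines. This is a minimal illustration, assuming a SQLite database and a hypothetical `records` table; the `PhantomCheck` class, the function names, and the two demo harvesters are invented for the example and are not the platform's actual code.

```python
import sqlite3
from dataclasses import dataclass
from typing import Callable


@dataclass
class PhantomCheck:
    source: str
    before: int   # row count before the harvest
    after: int    # row count after the harvest
    claimed: int  # what the harvester says it stored

    @property
    def delta(self) -> int:
        return self.after - self.before

    @property
    def status(self) -> str:
        # MATCH only when the database agrees with the harvester's claim.
        return "MATCH" if self.delta == self.claimed else "QUARANTINE"


def count_rows(conn: sqlite3.Connection, source: str) -> int:
    # Ground truth comes from the database, never from the harvester's log.
    row = conn.execute(
        "SELECT COUNT(*) FROM records WHERE source = ?", (source,)
    ).fetchone()
    return row[0]


def run_with_phantom_check(
    conn: sqlite3.Connection,
    source: str,
    harvest: Callable[[sqlite3.Connection], int],
) -> PhantomCheck:
    before = count_rows(conn, source)
    claimed = harvest(conn)  # self-reported count, not trusted on its own
    after = count_rows(conn, source)
    return PhantomCheck(source, before, after, claimed)


def honest_harvester(conn: sqlite3.Connection) -> int:
    rows = [("demo", f"https://example.org/{i}") for i in range(3)]
    conn.executemany("INSERT INTO records (source, url) VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


def phantom_harvester(conn: sqlite3.Connection) -> int:
    # Reports success but stores nothing.
    return 970_193


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (source TEXT, url TEXT UNIQUE)")
honest = run_with_phantom_check(conn, "demo", honest_harvester)
phantom = run_with_phantom_check(conn, "demo", phantom_harvester)
print(honest.status, phantom.status)
```

The key design choice is that `claimed` and `delta` come from two different places: one from the harvester, one from the database.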
That single operating principle turned out to be what made the rest of the work possible. Not the choice of sources. Not the harvester code. The integrity protocol around each run.
Every one of the twelve harvests that landed this sprint followed the same structure: analyze the source and report; dry-run against the live endpoint and report samples; limited live run with 100–1,000 records, with the delta verified against the claim; then the full run inside a detached tmux session with a written completion report dropped to disk whether I was there to read it or not. Four gates per source. No exceptions.
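One way to picture the four gates is as an ordered list where a failure stops everything after it. The sketch below is an assumption about shape only: `run_gates`, `make_gates`, and the boolean results are hypothetical stand-ins for the real analysis, dry-run, limited-run, and full-run steps.

```python
from typing import Callable, Dict, List, Tuple

Gate = Tuple[str, Callable[[], bool]]

GATE_ORDER = ["analyze", "dry_run", "limited_run", "full_run"]


def run_gates(gates: List[Gate]) -> List[str]:
    """Run gates in order; stop at the first failure, return the names passed."""
    passed: List[str] = []
    for name, check in gates:
        if not check():
            break
        passed.append(name)
    return passed


def make_gates(results: Dict[str, bool]) -> List[Gate]:
    # Each gate here just looks up a canned result; a real gate would do work.
    return [(name, (lambda n=name: results[n])) for name in GATE_ORDER]


clean = run_gates(make_gates(
    {"analyze": True, "dry_run": True, "limited_run": True, "full_run": True}
))
blocked = run_gates(make_gates(
    {"analyze": True, "dry_run": True, "limited_run": False, "full_run": True}
))
print(clean)
print(blocked)
```

In the second run the full run never executes, because the limited run (where the delta is checked against the claim) did not pass.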
Gate fatigue is real. After the sixth clean harvest, the temptation to skip the dry-run to save twenty minutes is real. Every skipped gate is a potential phantom. Stability over speed. The platform kept serving its usual nine hundred daily visitors while the work happened in the background. Services never went down. No data loss events.
Twelve sources. Twelve clean integrity reports. Zero phantoms. Here is what landed:
| Source | Before | After | Gain | Check |
|---|---|---|---|---|
| SAM.gov Exclusions | 16,750 | 176,152 | 10.5× | MATCH |
| DOJ Press Releases | 2,618 | 265,340 | 101× | MATCH |
| FinCEN Enforcement | 5 | 121 | full archive | MATCH |
| OFAC Consolidated | 18,775 | 19,129 | +sectoral | MATCH |
| UK OFSI Sanctions | 0 | 5,135 | new | MATCH |
| EU FSF Consolidated | 0 | 4,368 | new | MATCH |
| UN Security Council | 0 | 1,005 | new | MATCH |
| Canada SEMA | 0 | 5,487 | new | MATCH |
| DEA Registrant Actions | 0 | 874 | new | MATCH |
| CFTC Enforcement | 0 | 2,745 | new | MATCH |
| SEC Enforcement (LR+AP) | 9,077 | 14,999 | full | MATCH |
| AI Incident Database | 0 | 1,449 | new | MATCH |
The biggest single win was the Department of Justice press release archive. The existing harvester had pulled 2,618 records. The actual Federal Register-indexed archive turned out to be 265,011 records. Not 50,000 as I had estimated. Five times bigger.
When that harvester launched, I went to sleep at roughly 9:30 PM Sacramento time with a tmux session running on the server, a hard-coded deadline to finish before the next daily cron at 04:00 UTC, and a written commitment to drop the completion report to disk whether I was there to read it or not. The harvester started at 04:30 UTC, ran for six hours and fifteen minutes, and by the time I checked in the next morning it had ingested 261,692 new DOJ records. The integrity report was clean. 75,497 new entity IDs created. The phantom check matched.
But when I looked at the top entities by record count, the top ten were garbage. "MAN, LLC" with 3,057 records. "Mexican National." "Florida Man." "Federal Jury." The DOJ harvester, like many natural-language extractors before it, was pulling descriptive phrases out of press release titles — "California Man Sentenced to..." — and treating them as entities. The records themselves were real. The extracted entity names were not.
Phantom check passing is necessary but not sufficient. The record count reconciled perfectly. The content still had a quality problem the count couldn't see. This is why you need human review of samples, not just reconciliation of numbers.
I spent the next session doing entity cleanup. Not with a heavy NER pipeline — there wasn't time. With progressively refined regex blocklists. First pass caught 957 junk entities. Second pass caught 3,441. Third pass with permissive patterns caught 20,169 junk entities covering 55% of DOJ records, with zero surname false positives verified against a 54-case unit test that included names like Feldman, Goldman, Tillman.
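A regex blocklist of this kind might look like the sketch below. The patterns are illustrative guesses, not the actual blocklist, and the test names are limited to the examples mentioned in this post plus one hypothetical bank name. The important detail is the word boundary: `\bman\b` matches "Florida Man" but not the "man" inside "Feldman".

```python
import re

# Hypothetical blocklist patterns; the real lists were refined over three passes.
JUNK_PATTERNS = [
    # Standalone descriptive nouns from headlines ("California Man Sentenced...").
    re.compile(r"\b(man|woman|men|women)\b", re.IGNORECASE),
    # Demonym followed by "National" and nothing else ("Mexican National").
    re.compile(r"^[a-z]+ national$", re.IGNORECASE),
    # Procedural actors that are not defendants.
    re.compile(r"\b(federal|grand) jury\b", re.IGNORECASE),
]


def is_junk(name: str) -> bool:
    return any(p.search(name) for p in JUNK_PATTERNS)


junk_cases = ["MAN, LLC", "Mexican National", "Florida Man", "Federal Jury"]
keep_cases = ["Feldman", "Goldman", "Tillman", "First National Bank"]

for name in junk_cases + keep_cases:
    print(f"{name!r}: {'junk' if is_junk(name) else 'keep'}")
```

A pattern this blunt will still catch a legitimate entity whose name contains a standalone token such as "Man", which is one more reason to flag matches rather than delete them.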
The records stay in the database because the press releases are real. The entity links just get tagged entity_type='junk' so any query that wants to rank defendants by prosecution frequency can filter them out. A proper pass with Claude-based NER on the raw body text of each press release would push extraction precision to 90–95% and pull out real defendant names. That will happen in a dedicated session.
The principle that stayed consistent: flag, don't delete. The DOJ junk entities stay in the database, queryable and auditable and reversible. If we later determine some of them are real entities, we unflag them. If we had deleted them, they would be gone. Intelligence-grade databases soft-delete and audit. Hard-delete is a last resort.
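A minimal sketch of flag-don't-delete, assuming SQLite and two hypothetical tables (`entities` and `entity_audit`). Only the `entity_type='junk'` convention comes from the work described here; the audit table and helper names are assumptions added to show how a flag stays reversible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entities (id INTEGER PRIMARY KEY, name TEXT, entity_type TEXT);
CREATE TABLE entity_audit (
    entity_id INTEGER, old_type TEXT, new_type TEXT,
    changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);
INSERT INTO entities (name, entity_type) VALUES
    ('Florida Man', 'organization'),
    ('Goldman', 'person');
""")


def set_type(conn: sqlite3.Connection, entity_id: int, new_type: str) -> None:
    # Record the previous value before changing it, so the flag can be undone.
    old = conn.execute(
        "SELECT entity_type FROM entities WHERE id = ?", (entity_id,)
    ).fetchone()[0]
    conn.execute(
        "UPDATE entities SET entity_type = ? WHERE id = ?", (new_type, entity_id)
    )
    conn.execute(
        "INSERT INTO entity_audit (entity_id, old_type, new_type) VALUES (?, ?, ?)",
        (entity_id, old, new_type),
    )
    conn.commit()


def visible_entities(conn: sqlite3.Connection) -> list:
    # Ranking queries filter junk out; the rows themselves remain.
    rows = conn.execute(
        "SELECT name FROM entities WHERE entity_type != 'junk' ORDER BY id"
    )
    return [r[0] for r in rows]


set_type(conn, 1, "junk")
visible_after_flag = visible_entities(conn)
total_rows = conn.execute("SELECT COUNT(*) FROM entities").fetchone()[0]

# Reversal: restore the type recorded in the most recent audit row.
previous = conn.execute(
    "SELECT old_type FROM entity_audit WHERE entity_id = 1 "
    "ORDER BY rowid DESC LIMIT 1"
).fetchone()[0]
set_type(conn, 1, previous)
visible_after_unflag = visible_entities(conn)
print(visible_after_flag, total_rows, visible_after_unflag)
```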
Before this sprint, CFVA was a US-only platform. That is a significant limitation for any serious compliance use case. A bank compliance officer cannot tell a Russian-connected transaction from a clean one using OFAC alone. They need UK OFSI, EU consolidated, UN Security Council, and whatever allied sanctions regime is relevant to the jurisdiction they operate in.
In two sequential harvest sessions, all four landed.
The UK OFSI published list is a single 54MB XML fetch that parses cleanly into 5,135 designated persons and organizations. It is heavy on the Russia regime (2,413 entries), Syria, ISIL/Al-Qaida, the Iran nuclear program, and the Democratic People's Republic of Korea.
The UN Security Council consolidated list is smaller at 1,005 entries, heavily overlapping with OFAC and UK designations. When I checked, 941 of the 1,005 UN entities were already in my database from OFAC or OFSI coverage. The 64 net-new entities are the ones unique to UN jurisdiction: Al-Qaida, ISIL, Taliban-linked organizations, HOUTHIS, HAQQANI NETWORK, the specific designations that sit only on the UN register.
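The overlap figure is a set difference after name normalization: 1,005 incoming minus 941 already present leaves 64. The sketch below shows the idea with a deliberately crude normalizer. The `normalize` and `net_new` helpers are hypothetical, the existing-name entries are placeholders, and real sanctions matching would also need alias and transliteration handling.

```python
import re


def normalize(name: str) -> str:
    # Crude normalization: uppercase, punctuation to spaces, collapse whitespace.
    cleaned = re.sub(r"[^A-Z0-9 ]", " ", name.upper())
    return " ".join(cleaned.split())


def net_new(incoming: list, existing: list) -> list:
    known = {normalize(n) for n in existing}
    return [n for n in incoming if normalize(n) not in known]


# Placeholder data, not actual list contents.
existing = ["Example Trading Co., Ltd.", "Sample Shipping LLC"]
incoming = ["EXAMPLE TRADING CO LTD", "HAQQANI NETWORK", "HOUTHIS"]

fresh = net_new(incoming, existing)
print(fresh)
print(1005 - 941)  # net-new UN entities reported above
```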
The Canada SEMA consolidated list added 5,487 records, mostly covering the 2020 Belarus designations and Russia regime persons. The EU Consolidated Financial Sanctions List added 4,368 records after deduplication, with approximately 30,000 embedded alias names preserved in raw_data for downstream name-matching.
Combined across six lists in five jurisdictions now (US via OFAC SDN plus Consolidated, UK via OFSI, EU via FSF, UN via SCC, and Canada via SEMA), the total sanctions dataset is roughly 37,000 unique designated entities.
That number might look small next to 265K DOJ press releases or 176K SAM debarments. It is not small. That is approximately the total count of Western sanctions designations as published. LexisNexis's global sanctions data, which they sell for hundreds of thousands of dollars per year to enterprise clients, aggregates 200+ lists including obscure national sanctions and commercial PEP databases; their total count is in the same order of magnitude. I have not replicated LexisNexis. I have covered the Western core.
The most strategic thing I harvested this week was not volume. It was the AI Incident Database maintained by the Responsible AI Collaborative. 1,449 AI incidents, each with a deployer, a developer, an affected parties list, and cross-referenceable company names. License: Creative Commons Attribution Share-Alike 4.0, commercial use permitted, attribution required. I verified the license carefully before ingesting, and excluded the reports collection, which carries original publisher copyright.
Why does this matter? Because it is the first piece of a category that nobody else is building as a queryable intelligence layer.
Search for AI compliance companies right now and you find ACA Group, Ascent, CUBE, Compliance.ai, Verafin, and roughly thirty other firms. They all do workflow automation for financial services compliance officers: expert call transcription, eComms surveillance, marketing review, KYC document parsing. That's a crowded category. It is not what I am building.
What I am building is the accountability intelligence layer beneath it. When the Uber self-driving car killed Elaine Herzberg in 2018, that incident lives in AIID. When Uber has federal enforcement history in DOJ or SEC databases, that also lives in CFVA. When the same Uber entities overlap with Google, Microsoft, Navya, and other AI deployers whose incidents are also in AIID — that cross-reference is pre-computed in the database.
Nobody else does this. The AI workflow automation vendors don't have it. LexisNexis doesn't have AI incident data at all. The AI safety research community has individual case studies but no queryable cross-reference infrastructure.
1,449 incidents is the seed. Next steps in the category: Hugging Face model cards for provenance metadata on the million-plus AI models people have deployed, state-level AI laws as the regulatory backdrop (California AB-2013, SB-942, Colorado AI Act, NYC Local Law 144), EU AI Act enforcement tracking as it becomes available, and the AIACP protocol as the reference framework that ties the data to a citable governance standard.
The category I think CFVA ends up owning is AI accountability intelligence. Not workflow automation. Not sanctions screening. The data layer that makes it possible to answer: what AI systems exist, who built them, who deployed them, what harms they have caused, what regulatory responses have followed, and how any of that connects to the broader corporate and enforcement record of the organizations involved. That category does not have a single dominant vendor today. It does not really have a vendor at all.
The value of this sprint isn't any single source. It's what the combined database makes queryable.
Walmart now appears in 16 distinct federal and state sources — SEC, FTC, CFPB, DOJ press releases, OSHA, FDA, EPA, state attorneys general, business registries, and more. 516 records across those 16 sources. Ford appears in 15 sources with 7,681 records. General Motors in 13 sources with 8,945 records. Pfizer in 15 with hundreds of records. Boeing, Lockheed, Raytheon — all in 12 to 15 sources each.
The Fortune 500 now has queryable regulatory footprints in CFVA, pre-computed, not constructed per-query.
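"Pre-computed, not constructed per-query" can be illustrated with a summary table built once from the raw records. This is a sketch under assumptions: the `records` and `entity_footprint` tables, their columns, and the sample companies are all hypothetical, and SQLite stands in for whatever database is actually used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE records (entity TEXT, source TEXT);
INSERT INTO records VALUES
    ('Acme Corp', 'SEC'), ('Acme Corp', 'SEC'), ('Acme Corp', 'OSHA'),
    ('Beta Inc', 'EPA');

-- Built once after each harvest, then read directly at query time.
CREATE TABLE entity_footprint AS
    SELECT entity,
           COUNT(DISTINCT source) AS source_count,
           COUNT(*)               AS record_count
    FROM records
    GROUP BY entity;
""")


def footprint(conn: sqlite3.Connection, entity: str):
    # A single-row lookup instead of a scan across every source table.
    return conn.execute(
        "SELECT source_count, record_count FROM entity_footprint WHERE entity = ?",
        (entity,),
    ).fetchone()


acme = footprint(conn, "Acme Corp")
beta = footprint(conn, "Beta Inc")
print(acme, beta)
```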
The incumbent vendors will tell you individual hits per query ("Boeing has three OFAC designations"), but they do not pre-compute the cross-agency regulatory profile. The fact that a data platform built by someone with a nursing background, with AI as the engineering assistant, can produce a view that three multi-billion-dollar incumbent vendors don't is the whole proof-of-concept.
Other cross-references that now work against this data:
967 NPPES-registered active healthcare providers appear on federal exclusion lists (HHS OIG LEIE). Every one of them is a fraud investigation lead. 214 Medicaid-billing providers appear on both state Medicaid enrollment rolls and HHS OIG exclusions, meaning they may still be billing federal programs while formally excluded. This is a qui tam False Claims Act lead without peer.
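A cross-reference like the provider-exclusion match reduces to a join between two tables. The sketch below assumes the join key is the NPI, which is an assumption for illustration; the matching method actually used is not described here. Table names, columns, and sample rows are hypothetical. The all-zero NPI is filtered out because it is treated as a placeholder for a missing identifier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE nppes (npi TEXT PRIMARY KEY, name TEXT, status TEXT);
CREATE TABLE leie  (npi TEXT, name TEXT, excl_date TEXT);

INSERT INTO nppes VALUES
    ('1000000001', 'Provider A', 'active'),
    ('1000000002', 'Provider B', 'active'),
    ('1000000003', 'Provider C', 'deactivated');

INSERT INTO leie VALUES
    ('1000000002', 'Provider B', '2021-05-20'),
    ('1000000003', 'Provider C', '2019-01-10'),
    ('0000000000', 'Unknown',    '2018-03-01');
""")

# Active registry entries that also appear on the exclusion list.
leads = conn.execute("""
    SELECT n.npi, n.name, l.excl_date
    FROM nppes AS n
    JOIN leie  AS l ON l.npi = n.npi
    WHERE n.status = 'active'
      AND l.npi != '0000000000'
    ORDER BY n.npi
""").fetchall()
print(leads)
```

Each row returned is a lead to review, not a finding; an identifier match still needs a human look at dates and reinstatement status.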
The AI Incident Database entity names — Uber, Google, Microsoft, Navya, Meta, OpenAI, Tesla — now cross-reference against the federal enforcement history of those same companies. AI harm now has a connected regulatory context.
The sprint ran nine days and not everything worked. A brief audit of the friction points:
Backup proliferation. Every harvest was preceded by a full database backup. By midweek there were four 13GB backups on the 80GB server. Disk dropped to 4GB free, 95% utilization. Moved three superseded backups off the main server, kept the newest as rollback point, returned to 41GB free. The standing rule going forward: full backup only before schema changes, destructive operations, or harvests above 50,000 records. For small harvests, targeted rollback via timestamp-scoped DELETE against the URL unique index is sufficient, because the harvesters are idempotent by design.
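The targeted rollback depends on two properties: a unique index on the URL so re-runs cannot duplicate rows, and a harvest timestamp so one run's rows can be removed on their own. A minimal sketch, assuming SQLite and a hypothetical `records` table with `url` and `harvested_at` columns; the timestamps and URLs are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE records (
    url          TEXT NOT NULL UNIQUE,  -- unique index makes inserts idempotent
    title        TEXT,
    harvested_at TEXT NOT NULL          -- ISO 8601 UTC, sorts lexicographically
)
""")


def count(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]


def harvest(conn: sqlite3.Connection, rows: list, harvested_at: str) -> int:
    """Insert rows, skipping URLs already stored; return rows actually added."""
    before = count(conn)
    conn.executemany(
        "INSERT OR IGNORE INTO records (url, title, harvested_at) VALUES (?, ?, ?)",
        [(url, title, harvested_at) for url, title in rows],
    )
    conn.commit()
    return count(conn) - before


def rollback(conn: sqlite3.Connection, run_started_at: str) -> None:
    # Remove only what landed at or after the start of the run being undone.
    conn.execute("DELETE FROM records WHERE harvested_at >= ?", (run_started_at,))
    conn.commit()


baseline = [("https://example.org/a", "A")]
batch = [("https://example.org/a", "A"), ("https://example.org/b", "B")]

first = harvest(conn, baseline, "2026-04-20T04:30:00Z")
second = harvest(conn, batch, "2026-04-21T04:30:00Z")   # 'a' already stored
rerun = harvest(conn, batch, "2026-04-21T05:00:00Z")    # nothing new to add
rollback(conn, "2026-04-21T04:30:00Z")
remaining = count(conn)
print(first, second, rerun, remaining)
```

After the rollback only the baseline row is left, and re-running the same batch would restore the second row without touching the first.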
HUD OIG blocked by Cloudflare. The HUD Office of Inspector General publishes audit reports and enforcement actions, but their site is behind Cloudflare with aggressive bot detection. Skipped during the automated harvest passes. Asian Development Bank, Inter-American Development Bank, and African Development Bank debarment lists failed the same way. Interpol similarly. These will need a dedicated Playwright-based harvester session where a real browser with a real user agent handles the challenge pages.
OpenSanctions license incompatibility. OpenSanctions aggregates 279,000 sanctioned entities across 200+ lists globally — roughly 7× my sanctions coverage. Exactly what a serious compliance product needs. License is Creative Commons Attribution Non-Commercial. Commercial use blocked. Skipped. This is a real constraint on the product — serious enterprise compliance screening requires 200+ lists, and I don't yet have a commercial-compatible path to most of them.
Entity extraction noise. DOJ, CFTC, and SEC harvested cleanly as records, but their titles don't consistently name defendants. For DOJ, the 20,169 junk entities got flagged. For CFTC and SEC, entity_id was left NULL with the raw title and URL preserved in raw_data for future NER passes. A proper Claude-API NER run across these three sources is the highest-leverage infrastructure task remaining on the roadmap.
The honest competitive position, as of the evening of April 23, 2026:
CFVA is not LexisNexis and will not be LexisNexis in any reasonable timeframe. LexisNexis has a thirty-year head start on PEP data and adverse media, which are the two categories where they cannot be caught quickly. For enterprise bank AML and global due diligence work, they remain the reference platform.
Where CFVA is already useful — today, right now, with what's in the database — is in specific use cases that LexisNexis doesn't optimize for:
Healthcare fraud investigation combining NPPES and Medicaid rolls with HHS OIG and DEA data. Investigative journalism on cross-jurisdictional corporate footprints. AI governance research that needs incident data tied to regulatory history. Mid-tier fintech compliance teams that can't afford $250K/year subscriptions and don't need 200 sanctions lists for their risk profile.
The 18-month window where AI compliance intelligence remains an unpriced category is the real strategic opportunity. The federal enforcement and international sanctions work this week was table stakes for being credible in any of those conversations. The AI side — Hugging Face, AIACP, state and EU AI regulatory data, model provenance — is the moat I'm building next.
Nine days, twelve harvests, twelve clean integrity reports. That ratio isn't an accident. It's the result of an operating discipline worth documenting before the memory of what it felt like fades:
Every harvester is hostile until proven honest. A harvester reporting success does not mean data landed. A harvester writing to its own log is not sufficient proof. The external reconciliation — pre-count, post-count, delta equals claimed — is the integrity primitive. Without it, all downstream claims about the database are unfalsifiable.
Four gates per harvest, always. Analyze, dry-run, limited-run, full-run. No shortcut. Gate fatigue is real — after the sixth harvest in a week, the temptation to skip the dry-run is real — but every skipped gate is a potential phantom.
Small harvests are completeness; large harvests are depth. 121 FinCEN records is not disappointing. It's the full archive. 442 OFAC Consolidated records is not thin. It's the complete sectoral sanctions list. Enforcement data is narrow by nature because governments only take formal action against a small slice of the regulated universe. The number that matters is coverage percentage, not absolute count.
Flag, don't delete. The DOJ junk entities stay in the database with entity_type='junk'. They are queryable, auditable, reversible. Hard-delete is a last resort. Intelligence-grade databases soft-delete and audit.
Time is authentication. Timestamps are not decoration. They are proof. Every harvest_runs row is a truth claim about what happened at a specific moment. The integrity of the audit trail is the credibility of the platform.
The harvest sprint banks a real foundation. What it doesn't produce is revenue or category ownership. Those are the next fights.
Near-term roadmap, in priority order:
Hugging Face model cards. Metadata harvest across the top 15,000 AI models by downloads and trending. Pairs directly with AIID for the AI compliance intelligence layer.
State AI laws database. California AB-2013 and SB-942, Colorado AI Act, NYC Local Law 144, Texas HB 2060, Illinois AI Video Interview Act, plus the tracker of pending state bills. Small dataset, strategic positioning.
Banking enforcement triple. FDIC enforcement decisions and orders, OCC enforcement actions, Federal Reserve enforcement actions. Closes the biggest current gap in financial services compliance coverage.
AIACP infrastructure. robots.txt, sitemap, Zenodo DOI for AIACP v0.3, public GitHub repository, citable framework with comparison matrix against NIST AI RMF, EU AI Act, and ISO 42001.
NER re-pass on DOJ, CFTC, and SEC. Fix the entity_id NULL issue by running Claude-based extraction across the raw press release body text.
FINRA BrokerCheck. The big one. Roughly 500,000 disciplined financial professionals. Nine-day harvest, planned for a European trip where the harvester can run solo without supervision.
Longer term: CourtListener opinion metadata for full federal and state court opinion coverage. State medical boards across all fifty states, aggregated. NMLS mortgage licensing. NASAA state securities regulator actions. EU AI Act enforcement tracking as cases emerge. UK ICO GDPR enforcement.
The list is long. The runway is not infinite. The method from the last nine days — one harvest, four gates, integrity first, flag rather than delete — scales. The category I'm betting on — AI accountability intelligence — is still unoccupied.
I grew up in the former German Democratic Republic. Watched institutions lie. Watched people learn to route around those lies. That background is why I reject the word "manifesto" and use "declaration." It's why I care about accountability infrastructure in a way that treats transparency as non-negotiable rather than as a marketing posture. It's why this platform does not require logins to view entity data and why the Signal journal preserves timestamps without amendment, even when old entries would read better with edits.
CFVA is not a compliance company. Compliance is the commercial use case. The platform itself is an instrument for making corruption and institutional drift harder to hide, built on the conviction that if the data is aggregated honestly and cross-referenced rigorously, the people who need to use it — journalists, investigators, researchers, regulators, whistleblowers, eventually citizens — will find it.
The sprint that just ended pushed the map closer to the shape it has to be to serve that goal. Nine days, twelve clean harvests, the international layer built, the AI layer seeded, the integrity protocol held throughout. The dataset can now answer questions it could not answer last week. That's the progress that matters. The revenue, the customers, the acquisition offers that will come or won't come — those are downstream of whether the underlying work is honest.
The work continues.