IAM Arc · Deep Research Report

Time-Zero Architecture

Data sources, traversal methods, and the technical approach for tracing claims to their original point of entry — before any laundering occurs. Research for Andrew, Vanderbilt Owen MBA.

Student: Andrew
Problem: Information Laundering
Focus: Semantic Claim Traversal
Sources: 50+
The Problem
Why Every Existing System Fails
The Information Laundering gap — what no current tool addresses

Every existing provenance system — C2PA, blockchain watermarks, domain trust scores — verifies the container, not the claim. They prove the President said it. They confirm Reuters published it. They give CNN a 99/100 trust score. They all give the laundered claim a green light.

The real problem: a Russian bot farm injects a false claim. 72 hours later, a credible journalist repeats it. 24 hours after that, the President repeats it. LLMs scrape the President's statement. Every downstream model treats this as high-confidence fact — because every verification layer evaluated the launderer, not the origin.

The fix requires something none of these systems do: isolate the semantic claim itself, crawl backward through time to find Time Zero, and evaluate the entity that originally injected it.

What Existing Systems Check

✓ Did CNN publish this? Yes.

✓ Is the byline cryptographically signed? Yes.

✓ Is this domain high-authority? Yes.

✓ Has the file been tampered with? No.

Result: VERIFIED. Claim accepted as credible fact.

What Your System Must Check

→ What is the semantic claim inside this content?

→ When did this exact claim first appear in the digital record?

→ Who or what injected it at Time Zero?

→ What is the reliability rating of that original source?

Result: ORIGIN DEGRADED. Confidence score: 12%.

Social Graph
WHO — The Data Source Landscape
Every major temporal corpus available for traversal and when to use each
Source | What It Contains | Time Depth | Best For | Fit
GDELT Project | Real-time global news event monitoring. Rescans every article at 24 hours and 7 days to track narrative changes. Connects every person, org, and location into a global knowledge graph. | 2013–present | Detecting when a claim entered mainstream news media. Tracking narrative evolution and version changes. Near-duplicate syndication detection. | HIGH
Common Crawl | Open archive of the public web. Petabytes of raw crawl data stored as immutable WARC files. Captures the long tail of the internet, not just mainstream news. | 2007–present | Finding early injections of claims on fringe sites, forums, and obscure domains. Pre-mainstream appearance detection. | HIGH
Internet Archive / Wayback Machine | 1+ trillion archived web pages with temporal snapshots. Stores pages as they appeared on specific dates. PISA architecture tracks temporal validity intervals. | 1996–present | Point-in-time verification: "What did this URL say on this exact date?" Catching retroactive edits and content laundering. | HIGH
LLM Ensemble Query | Multiple LLMs (GPT-4, Gemini, Claude, Llama, Mistral) trained on different crawl snapshots with different cutoff dates. Query all and look for a consensus earliest-known appearance. | Varies by cutoff | Fast heuristic for earliest known appearance before deep traversal. Low-cost first pass. Triangulating via consensus across different training corpora. | MEDIUM (hallucination risk)
Pushshift Reddit Archive | Full historical Reddit corpus including posts, comments, timestamps, and user histories. Critical for tracking early fringe-community injection points. | 2005–2023 | Detecting claims that originated in fringe Reddit communities (r/conspiracy, etc.) before going mainstream. Early-warning social injection point detection. | HIGH
Twitter/X Historical API | Tweet archives with timestamps. Firehose access for researchers. Critical for tracking viral claim injection via social media. | 2006–present | Tracing the exact viral propagation path. Identifying coordinated inauthentic behavior (bot-farm amplification patterns). | HIGH
Bellingcat Information Laundromat | Open-source tool for detecting content laundering via near-duplicate text and shared infrastructure (IPs, SSL certs, analytics IDs across websites). | Variable | Detecting coordinated content-laundering networks. Identifying websites with shared ownership masquerading as independent sources. | HIGH
ClaimReview / Duke Reporters' Lab | Structured fact-check database. 24,000+ fact-checks tagged with claim, speaker, and verdict. Created by Google and the Washington Post in 2015. | 2015–present | Cross-referencing known false claims. If Time Zero is already fact-checked, retrieve that verdict and attach it to the confidence score. | MEDIUM (coverage gaps)
GDELT Global Knowledge Graph (GKG) | Connects every person, org, location, theme, and event in global news into a network. Daily Counts + Graph files. Enables network-level analysis of information propagation. | 2013–present | Mapping the full propagation network of a claim. Who amplified it? When? Through what chain of entities? | HIGH
LexisNexis / Factiva | Paywalled archives of news, legal, and business documents. Pre-internet press archive depth. | 1970s–present | Tracing claims that predate the public internet. Pre-web origin injection detection. | MEDIUM (paywalled, costly)
Knowledge Graph
WHAT — The Technical Methods
Semantic drift, temporal traversal, and claim extraction — what the research actually shows
The Semantic Drift Problem

A claim doesn't travel as identical text. "Russian bots influenced the election" becomes "foreign interference shaped the vote" becomes "outside actors manipulated democracy." Same claim, zero string match.

Research finding: diachronic word embeddings reveal that even within a single crisis event (COVID-19), terminology shifted measurably within weeks as non-expert users adopted clinical language in new ways.

The fix: You cannot match on strings. You must match on semantic vectors — the meaning of the claim encoded as a high-dimensional embedding — and set a similarity threshold for equivalence.

Knowledge Graph Triples

Before you can search for a claim, you must decompose it from natural language into a machine-searchable structure:

[Subject] → [Predicate] → [Object]

e.g. [Russia] → [interfered with] → [2016 US Election]

This triple is then embedded as a vector. All traversal searches look for vectors with high cosine similarity to this triple — not exact string matches. Semantic equivalence, not textual equivalence.

Tool: Temporal Knowledge Graph agents (OpenAI Cookbook) create time-stamped triplets and flag invalidations when newer claims contradict earlier ones.
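A minimal sketch of the triple-to-vector step. The `ClaimTriple` class, `embed`, and `cosine` are illustrative names, not an existing API, and the hashed bag-of-words embedding is a stand-in for the Sentence-BERT encoder a real pipeline would use:

```python
from dataclasses import dataclass
import hashlib
import math

@dataclass(frozen=True)
class ClaimTriple:
    subject: str
    predicate: str
    object: str

    def as_text(self) -> str:
        return f"{self.subject} {self.predicate} {self.object}"

def embed(text: str, dim: int = 64) -> list[float]:
    # Toy hashed bag-of-words vector; a production pipeline would call a
    # Sentence-BERT model here to capture paraphrase-level meaning.
    vec = [0.0] * dim
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

claim = ClaimTriple("Russia", "interfered with", "2016 US Election")
vector = embed(claim.as_text())  # the search key for all traversal
```

Note the hashing trick only rewards shared tokens; the entire reason to use a transformer encoder instead is that paraphrases like "foreign interference shaped the vote" still land near the original vector.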

Near-Duplicate Detection

SpotSigs algorithm achieves F1 = 0.87 on real-world news corpora for near-duplicate detection. Uses weighted sampling of rare-term phrases rather than full-text comparison. Scalable to millions of documents.

This is how you catch: same claim, slightly different wording, syndicated across 40 sites.
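A much-simplified sketch of the SpotSigs idea, under the assumption that signatures are short chains of content words anchored at frequent stopwords ("antecedents") and compared by Jaccard overlap; the stopword lists here are placeholders, not the paper's tuned sets:

```python
STOPWORDS = {"the", "a", "an", "is", "was", "are", "were",
             "to", "of", "in", "and", "that"}
ANTECEDENTS = {"the", "a", "is", "was"}  # frequent anchor words

def spot_signatures(text: str, chain_len: int = 2) -> set[tuple[str, ...]]:
    # At each antecedent stopword, chain the next `chain_len`
    # non-stopword tokens into one signature.
    tokens = text.lower().split()
    sigs = set()
    for i, tok in enumerate(tokens):
        if tok in ANTECEDENTS:
            chain = [t for t in tokens[i + 1:] if t not in STOPWORDS][:chain_len]
            if len(chain) == chain_len:
                sigs.add(tuple(chain))
    return sigs

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0
```

Two syndicated variants of the same story share most of their signatures even when headlines and framing differ, which is what makes this cheaper than full-text comparison at scale.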

Content Laundromat Signals

Bellingcat's Information Laundromat detects laundering via infrastructure overlap — shared IP, SSL certificates, analytics IDs — across seemingly independent sites. Tier 1: shared Adobe Analytics ID. Tier 2: embedded domains. Tier 3: user agent patterns.

This catches coordinated fake-independent networks.
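A sketch of the infrastructure-overlap check, assuming each site is described by a small fingerprint dict (analytics ID, IP, cert hash); the domains and fingerprint keys below are hypothetical:

```python
from collections import defaultdict

def shared_infrastructure(sites: dict[str, dict[str, str]]) -> dict:
    # Group domains by each (fingerprint-type, value) pair; any pair
    # shared by 2+ "independent" domains is a laundering signal.
    clusters = defaultdict(list)
    for domain, fingerprint in sites.items():
        for key, value in fingerprint.items():
            clusters[(key, value)].append(domain)
    return {k: v for k, v in clusters.items() if len(v) > 1}

sites = {
    "newsA.example": {"analytics_id": "UA-111", "ip": "1.2.3.4"},
    "newsB.example": {"analytics_id": "UA-111", "ip": "5.6.7.8"},
    "indep.example": {"analytics_id": "UA-999", "ip": "9.9.9.9"},
}
hits = shared_infrastructure(sites)
# newsA and newsB share an analytics ID despite distinct IPs
```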

LLM Ensemble Heuristic

Query GPT-4, Gemini, Claude, Llama, and Mistral simultaneously with: "What is the earliest known origin of the claim that [X]?" Find consensus across models trained on different corpora with different cutoff dates.

Not ground truth — but a fast, low-cost first pass before deep traversal. Flag hallucination divergence as a signal of uncertainty.
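One way to turn the ensemble's answers into a candidate range, assuming each model's reply has already been parsed into a date (the parsing and the model names are outside this sketch): take the median as the candidate Time Zero and treat a wide spread as the hallucination-divergence signal.

```python
from datetime import date
from statistics import median

def consensus_time_zero(estimates: dict[str, date],
                        max_spread_days: int = 90) -> tuple[date, bool]:
    # Median of the models' earliest-appearance dates is the candidate;
    # a spread wider than max_spread_days flags the answer as uncertain.
    ordinals = sorted(d.toordinal() for d in estimates.values())
    candidate = date.fromordinal(round(median(ordinals)))
    spread_days = ordinals[-1] - ordinals[0]
    return candidate, spread_days > max_spread_days

estimates = {"gpt4": date(2020, 1, 1),
             "gemini": date(2020, 1, 10),
             "claude": date(2020, 3, 1)}
candidate, uncertain = consensus_time_zero(estimates)
```

The 90-day cutoff is a placeholder; the right divergence threshold is itself an empirical tuning question.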

Temporal Knowledge Graph — Time Validity Architecture

The OpenAI Temporal Agents cookbook establishes three types of statements a temporal KG must handle:

Atemporal — Never changes. (The speed of light ≈ 3×10⁸ m/s.) No timestamp needed.

Static — True from a specific point forward, never changes. "Person X was appointed CEO on [date]." Requires creation timestamp only.

Dynamic — Evolves over time. "Person X is currently CEO." Requires both creation timestamp AND expiration timestamp. When a contradicting claim appears, the old triple is marked INVALID and linked to the new triple via an explicit invalidation relationship.

This invalidation chain is your audit trail — it shows exactly when a claim was superseded and by what.
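A minimal sketch of the dynamic-statement pattern, with illustrative names (`TemporalTriple`, `invalidate`) rather than the cookbook's actual classes: each triple carries a creation timestamp, and invalidation sets the expiration and links old to new.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class TemporalTriple:
    subject: str
    predicate: str
    object: str
    created_at: datetime
    expired_at: Optional[datetime] = None
    invalidated_by: Optional["TemporalTriple"] = None

    @property
    def valid(self) -> bool:
        return self.expired_at is None

def invalidate(old: TemporalTriple, new: TemporalTriple) -> None:
    # Mark the old triple INVALID as of the new claim's creation time
    # and link it to its successor -- this link is the audit trail.
    old.expired_at = new.created_at
    old.invalidated_by = new

old = TemporalTriple("Person X", "is CEO of", "Acme", datetime(2020, 1, 1))
new = TemporalTriple("Person Y", "is CEO of", "Acme", datetime(2023, 6, 1))
invalidate(old, new)
```

Walking the `invalidated_by` chain backward from any current triple reproduces exactly when each version of the claim was superseded and by what.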

Generative Graph
WHAT IF — The Recommended Architecture
A concrete 4-stage pipeline for Time-Zero detection and origin rating

The Time-Zero Traversal Pipeline

Stage 1

Claim Extraction — Decompose the natural language

Strip the claim from its container (the article, the tweet, the speech). Convert it to a Knowledge Graph Triple: [Subject] → [Predicate] → [Object]. Embed it as a semantic vector using a transformer model (Sentence-BERT, or multilingual XLM-RoBERTa for cross-language claims). This vector is your search key for all downstream traversal.

Sentence-BERT · spaCy NER · KG Triple Extraction
Stage 2

Fast Heuristic Pass — LLM Ensemble Query

Before expensive deep crawls, query multiple LLMs simultaneously: GPT-4, Gemini, Claude, Llama, Mistral. Ask each: "What is the earliest known instance of the claim that [X]?" Find consensus. Flag disagreements as uncertainty signals. This gives you a candidate Time Zero range to target in the deep traversal — dramatically reducing compute cost.

GPT-4 · Gemini · Claude · Llama · Ensemble Consensus
Stage 3

Deep Temporal Traversal — Crawl backward through indexed corpora

Using the candidate Time Zero range from Stage 2 as a boundary, run semantic similarity search (cosine similarity on embeddings, threshold ~0.85) backward through: GDELT (mainstream news, 2013+), Common Crawl (fringe web, 2007+), Wayback Machine (point-in-time page states, 1996+), Pushshift Reddit (social injection, 2005+), Twitter historical API (viral propagation). Find the earliest document whose embedded semantic vector crosses the similarity threshold. That document's timestamp and origin entity are your Time Zero candidate.

GDELT · Common Crawl · Wayback Machine · Pushshift · Twitter API · Vector Cosine Similarity
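The Stage 3 scan reduces to one loop once each corpus is abstracted as (timestamp, vector, origin) records; this sketch assumes the corpora have already been fetched and embedded, and redefines a small `cosine` so it is self-contained:

```python
import math
from datetime import date

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def earliest_match(claim_vec, corpora, threshold: float = 0.85):
    # Return the (timestamp, origin) of the earliest document whose
    # semantic vector crosses the similarity threshold, or None.
    best = None
    for corpus in corpora:
        for ts, vec, origin in corpus:
            if cosine(claim_vec, vec) >= threshold:
                if best is None or ts < best[0]:
                    best = (ts, origin)
    return best

corpora = [
    [(date(2020, 1, 1), [1.0, 0.0], "bignews.example")],
    [(date(2019, 5, 1), [1.0, 0.05], "fringeblog.example")],
]
hit = earliest_match([1.0, 0.0], corpora)
# the fringe blog predates the mainstream outlet, so it wins
```

In practice each inner list would be a streamed query against GDELT, Common Crawl, Wayback, Pushshift, or the Twitter archive, bounded by the Stage 2 date range rather than held in memory.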
Stage 4

Origin Rating — Evaluate the Time Zero source, not the launderer

Once Time Zero is confirmed, evaluate the originating entity: How old is this account/domain? What is its historical reliability score (cross-reference ClaimReview / fact-check databases)? Does it show coordinated inauthentic behavior (Bellingcat Laundromat infrastructure overlap)? Does it match known bot farm signatures? Apply the user's Alignment Matrix to THIS entity — not to CNN, not to the President. Produce a Truth Confidence Score and propagate it to every downstream repetition in the chain.

ClaimReview DB · Bellingcat Laundromat · Domain Age API · Bot Detection · Alignment Matrix
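An illustrative shape for the Stage 4 score, combining the four signals above multiplicatively; every weight and cutoff here is a placeholder to be tuned against the user's Alignment Matrix, not an empirical value:

```python
def truth_confidence(domain_age_years: float,
                     fact_check_hit_rate: float,
                     infra_overlap: bool,
                     bot_signature: bool) -> float:
    # Start from full confidence and discount for each origin red flag.
    score = 1.0
    score *= min(domain_age_years / 10, 1.0)  # young domains score low
    score *= (1.0 - fact_check_hit_rate)      # prior debunked claims
    if infra_overlap:
        score *= 0.3                          # Laundromat network hit
    if bot_signature:
        score *= 0.2                          # bot-farm pattern match
    return round(score * 100, 1)              # percentage

established = truth_confidence(20, 0.0, False, False)  # clean origin
degraded = truth_confidence(1, 0.6, True, True)        # fresh, flagged
```

Whatever the final formula, the key property is that it is computed for the Time Zero entity and then propagated unchanged down the repetition chain, so a credible launderer cannot raise it.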

The critical insight from the research: The Bellingcat Information Laundromat already proves the infrastructure-overlap approach works for detecting coordinated laundering networks. GDELT already rescans every article at 24 hours and 7 days to track how narratives change. The Temporal Knowledge Graph pattern already handles time-stamped triple invalidation chains. These are not hypothetical — they exist. Your architecture assembles them into a unified pipeline that no one has connected end-to-end yet.

The POC Build Order

Step 1: Build the Claim Extractor. Input: raw text. Output: KG triple + semantic vector.

Step 2: Build the LLM Ensemble Heuristic. Query 3–5 models. Return consensus Time Zero estimate.

Step 3: Connect to GDELT + Wayback Machine APIs. Run backward semantic search from the LLM-estimated date.

Step 4: Build the Origin Rater. Evaluate the Time Zero entity. Return a confidence score.

Step 5: Visualize the full chain: Time Zero → Propagation Path → Launderer → Current Claim.

The Hardest Engineering Problems

Semantic drift threshold: What cosine similarity score counts as "the same claim"? Too high = miss paraphrases. Too low = false positives. This needs empirical tuning on known misinformation cases.

Coverage gaps: Not all of the early fringe web is indexed. Some Time Zero events happened in encrypted channels (Telegram, WhatsApp) with zero archival coverage.

Latency vs. depth: Deep traversal is expensive. The LLM ensemble heuristic is the shortcut — but it can hallucinate. You need a confidence threshold for when to trigger the expensive deep crawl.

Cited Sources