Data sources, traversal methods, and the technical approach for tracing claims to their original point of entry — before any laundering occurs. Research for Andrew, Vanderbilt Owen MBA.
Every existing provenance system — C2PA, blockchain watermarks, domain trust scores — verifies the container, not the claim. They prove the President said it. They confirm Reuters published it. They give CNN a 99/100 trust score. They all give the laundered claim a green light.
The real problem: a Russian bot farm injects a false claim. 72 hours later, a credible journalist repeats it. 24 hours after that, the President repeats it. LLMs scrape the President's statement. Every downstream model treats this as high-confidence fact — because every verification layer evaluated the launderer, not the origin.
The fix requires something none of these systems do: isolate the semantic claim itself, crawl backward through time to find Time Zero, and evaluate the entity that originally injected it.
What existing container verification checks:
✓ Did CNN publish this? Yes.
✓ Is the byline cryptographically signed? Yes.
✓ Is this domain high-authority? Yes.
✓ Has the file been tampered with? No.
Result: VERIFIED. Claim accepted as credible fact.
What origin tracing must ask instead:
→ What is the semantic claim inside this content?
→ When did this exact claim first appear in the digital record?
→ Who or what injected it at Time Zero?
→ What is the reliability rating of that original source?
Result: ORIGIN DEGRADED. Confidence score: 12%.
| Source | What It Contains | Time Depth | Best For | Fit |
|---|---|---|---|---|
| GDELT Project | Real-time global news event monitoring. Rescans every article at 24h and 7 days to track narrative changes. Connects every person, org, location into a global knowledge graph. | 2013–present | Detecting when a claim entered mainstream news media. Tracking narrative evolution and version changes. Near-duplicate syndication detection. | HIGH |
| Common Crawl | Open archive of the public web. Petabytes of raw crawl data stored as immutable WARC files. Captures the long tail of the internet, not just mainstream news. | 2007–present | Finding early injections of claims on fringe sites, forums, obscure domains. Pre-mainstream appearance detection. | HIGH |
| Internet Archive / Wayback Machine | 1+ trillion archived web pages with temporal snapshots. Stores pages as they appeared at specific dates. PISA architecture tracks temporal validity intervals. | 1996–present | Point-in-time verification. "What did this URL say on this exact date?" Catching retroactive edits and content laundering. | HIGH |
| LLM Ensemble Query | Multiple LLMs (GPT-4, Gemini, Claude, Llama, Mistral) trained on different crawl snapshots with different cutoff dates. Query all and look for a consensus earliest-known appearance. | Varies by cutoff | Fast heuristic for earliest known appearance before deep traversal. Low-cost first pass. Triangulating via consensus across different training corpora. | MEDIUM (hallucination risk) |
| Pushshift Reddit Archive | Full historical Reddit corpus including posts, comments, timestamps, user histories. Critical for tracking early fringe community injection points. | 2005–2023 | Detecting claims that originated in fringe Reddit communities (r/conspiracy, etc.) before going mainstream. Early-warning social injection point detection. | HIGH |
| Twitter/X Historical API | Tweet archives with timestamps. Firehose access for researchers. Critical for tracking viral claim injection via social media. | 2006–present | Tracing exact viral propagation path. Identifying coordinated inauthentic behavior (bot farm amplification patterns). | HIGH |
| Bellingcat Information Laundromat | Open-source tool for detecting content laundering via near-duplicate text and shared infrastructure (IP, SSL cert, analytics IDs across websites). | Variable | Detecting coordinated content laundering networks. Identifying websites with shared ownership masquerading as independent sources. | HIGH |
| ClaimReview / Duke Reporters' Lab | Structured fact-check database. 24,000+ fact-checks tagged with claim, speaker, and verdict. Created by Google + Washington Post in 2015. | 2015–present | Cross-referencing known false claims. If Time Zero is already fact-checked, retrieve that verdict and attach it to the confidence score. | MEDIUM (coverage gaps) |
| GDELT Global Knowledge Graph (GKG) | Connects every person, org, location, theme and event in global news into a network. Daily Counts + Graph files. Enables network-level analysis of information propagation. | 2013–present | Mapping the full propagation network of a claim. Who amplified it? When? Through what chain of entities? | HIGH |
| LexisNexis / Factiva | Paywalled archives of news, legal, and business documents. Pre-internet press archive depth. | 1970s–present | Tracing claims that predate the public internet. Pre-web origin injection detection. | MEDIUM (paywalled, costly) |
A claim doesn't travel as identical text. "Russian bots influenced the election" becomes "foreign interference shaped the vote" becomes "outside actors manipulated democracy." Same claim, zero string match.
Research finding: diachronic word embeddings reveal that even within a single crisis event (COVID-19), terminology shifted measurably within weeks as non-expert users adopted clinical language in new ways.
The fix: You cannot match on strings. You must match on semantic vectors — the meaning of the claim encoded as a high-dimensional embedding — and set a similarity threshold for equivalence.
Before you can search for a claim, you must decompose it from natural language into a machine-searchable structure:
[Subject] → [Predicate] → [Object]
e.g. [Russia] → [interfered with] → [2016 US Election]
This triple is then embedded as a vector. All traversal searches look for vectors with high cosine similarity to this triple — not exact string matches. Semantic equivalence, not textual equivalence.
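A minimal sketch of this matching, using token-count bags as a stand-in for real transformer embeddings (a production system would use Sentence-BERT vectors, but the cosine test is the same):

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (a sparse vector of token counts).
    A real pipeline would use a transformer model; the interface is the same idea."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

claim = embed("Russia interfered with the 2016 US election")
paraphrase = embed("foreign interference by Russia shaped the US vote")
unrelated = embed("the weather in Paris was sunny today")

# The paraphrase scores closer to the claim than the unrelated sentence,
# despite sharing almost no exact wording.
```

With real sentence embeddings, even zero-token-overlap paraphrases ("outside actors manipulated democracy") land near the claim vector; that is the entire point of matching on meaning rather than strings.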
Tool: Temporal Knowledge Graph agents (OpenAI Cookbook) create time-stamped triplets and flag invalidations when newer claims contradict earlier ones.
SpotSigs algorithm achieves F1 = 0.87 on real-world news corpora for near-duplicate detection. Uses weighted sampling of rare-term phrases rather than full-text comparison. Scalable to millions of documents.
This is how you catch: same claim, slightly different wording, syndicated across 40 sites.
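A simplified sketch of the SpotSigs idea: signatures anchored at frequent stopwords, compared by Jaccard overlap. The stopword list and chain length here are illustrative, not the published parameters:

```python
import re

STOPWORDS = {"the", "a", "an", "is", "was", "of", "to", "in", "that", "said"}

def spot_signatures(text: str, chain_len: int = 2) -> set[str]:
    """For each stopword anchor, chain the next `chain_len` non-stopword
    tokens. Samples distinctive phrases instead of comparing full text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    sigs = set()
    for i, tok in enumerate(tokens):
        if tok in STOPWORDS:
            chain = [t for t in tokens[i + 1:] if t not in STOPWORDS][:chain_len]
            if len(chain) == chain_len:
                sigs.add(tok + ":" + "-".join(chain))
    return sigs

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

# Two syndicated variants with one word changed vs. an unrelated article:
s1 = spot_signatures("the senator said the new bill was a clear victory in the budget fight")
s2 = spot_signatures("the senator said the new bill was a big victory in the budget fight")
s3 = spot_signatures("the recipe calls for a cup of flour in the mixing bowl")
```

Most signatures survive the one-word edit, so the syndicated pair scores high while the unrelated text scores zero; this is what makes the approach robust to light rewording across 40 sites.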
Bellingcat's Information Laundromat detects laundering via infrastructure overlap — shared IP, SSL certificates, analytics IDs — across seemingly independent sites. Tier 1: shared Adobe Analytics ID. Tier 2: embedded domains. Tier 3: user agent patterns.
This catches coordinated fake-independent networks.
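The infrastructure-overlap check reduces to grouping domains by shared markers. Every domain, analytics ID, and IP below is invented for illustration:

```python
from collections import defaultdict

# Hypothetical scraped site metadata -- all values invented.
sites = {
    "dailytruthreport.example": {"analytics": "UA-1122", "ip": "185.10.0.5"},
    "freedomnewswire.example":  {"analytics": "UA-1122", "ip": "185.10.0.5"},
    "patriotdispatch.example":  {"analytics": "UA-1122", "ip": "91.44.2.7"},
    "localgardening.example":   {"analytics": "UA-9983", "ip": "10.0.0.9"},
}

def shared_infrastructure(sites: dict) -> dict:
    """Group 'independent' domains by any shared marker (the Tier 1 idea:
    a common analytics ID is strong evidence of common ownership)."""
    clusters = defaultdict(set)
    for domain, meta in sites.items():
        for kind, value in meta.items():
            clusters[(kind, value)].add(domain)
    # Keep only markers shared by more than one domain.
    return {key: doms for key, doms in clusters.items() if len(doms) > 1}
```

Here three "independent" outlets collapse into one cluster via a shared analytics ID, two of them also via a shared IP, while the genuinely unrelated blog clusters with nothing.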
Query GPT-4, Gemini, Claude, Llama, and Mistral simultaneously with: "What is the earliest known origin of the claim that [X]?" Find consensus across models trained on different corpora with different cutoff dates.
Not ground truth — but a fast, low-cost first pass before deep traversal. Flag hallucination divergence as a signal of uncertainty.
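A sketch of the ensemble heuristic. `query_model` is a stub with canned answers standing in for real vendor API calls; the consensus logic is the part that matters:

```python
from collections import Counter

def query_model(model: str, claim: str) -> str:
    """Stub for an LLM API call. Canned answers simulate models with
    different training cutoffs disagreeing slightly on the earliest date."""
    canned = {"gpt-4": "2016-06", "gemini": "2016-06", "claude": "2016-06",
              "llama": "2015-11", "mistral": "2016-06"}
    return canned[model]

def ensemble_time_zero(claim: str,
                       models=("gpt-4", "gemini", "claude", "llama", "mistral")):
    """Return the majority answer plus an agreement ratio; low agreement
    signals hallucination divergence and uncertainty."""
    answers = [query_model(m, claim) for m in models]
    best, votes = Counter(answers).most_common(1)[0]
    return best, votes / len(answers)
```

The agreement ratio is the uncertainty signal: 4-of-5 consensus gives a candidate Time Zero range worth targeting, while a five-way split means the heuristic learned nothing and the deep traversal must start unconstrained.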
The OpenAI Temporal Agents cookbook establishes three types of statements a temporal KG must handle:
Atemporal — Never changes. (Speed of light ≈ 3×10⁸ m/s). No timestamp needed.
Static — True from a specific point forward, never changes. "Person X was appointed CEO on [date]." Requires creation timestamp only.
Dynamic — Evolves over time. "Person X is currently CEO." Requires both creation timestamp AND expiration timestamp. When a contradicting claim appears, the old triple is marked INVALID and linked to the new triple via an explicit invalidation relationship.
This invalidation chain is your audit trail — it shows exactly when a claim was superseded and by what.
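A minimal sketch of dynamic triples with an invalidation link, assuming ISO-date timestamps and hypothetical entity names:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Triple:
    subject: str
    predicate: str
    obj: str
    valid_from: str                       # creation timestamp (ISO date)
    valid_to: Optional[str] = None        # expiration; None = still valid
    invalidated_by: Optional["Triple"] = None

def invalidate(old: Triple, new: Triple) -> None:
    """Mark the superseded triple invalid and link it to its successor --
    each link is one hop of the audit trail."""
    old.valid_to = new.valid_from
    old.invalidated_by = new

# A dynamic claim superseded by a contradicting one:
ceo_2019 = Triple("Person X", "is CEO of", "Acme Corp", valid_from="2019-03-01")
ceo_2024 = Triple("Person Y", "is CEO of", "Acme Corp", valid_from="2024-06-15")
invalidate(ceo_2019, ceo_2024)
```

Walking `invalidated_by` links backward recovers exactly when each version of the claim was superseded and by what.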
Strip the claim from its container (the article, the tweet, the speech). Convert it to a Knowledge Graph Triple: [Subject] → [Predicate] → [Object]. Embed it as a semantic vector using a transformer model (Sentence-BERT or multilingual XLM-RoBERTa for cross-language claims). This vector is your search key for all downstream traversal.
Tools: Sentence-BERT, spaCy NER, KG triple extraction.

Before expensive deep crawls, query multiple LLMs simultaneously: GPT-4, Gemini, Claude, Llama, Mistral. Ask each: "What is the earliest known instance of the claim that [X]?" Find consensus. Flag disagreements as uncertainty signals. This gives you a candidate Time Zero range to target in the deep traversal — dramatically reducing compute cost.
Tools: GPT-4, Gemini, Claude, Llama, ensemble consensus.

Using the candidate Time Zero range from Stage 2 as a boundary, run semantic similarity search (cosine similarity on embeddings, threshold ~0.85) backward through: GDELT (mainstream news, 2013+), Common Crawl (fringe web, 2007+), Wayback Machine (point-in-time page states, 1996+), Pushshift Reddit (social injection, 2005+), Twitter historical API (viral propagation). Find the earliest document whose embedded semantic vector crosses the similarity threshold. That document's timestamp and origin entity are your Time Zero candidate.
Tools: GDELT, Common Crawl, Wayback Machine, Pushshift, Twitter API, vector cosine similarity.

Once Time Zero is confirmed, evaluate the originating entity: How old is this account/domain? What is its historical reliability score (cross-reference ClaimReview / fact-check databases)? Does it show coordinated inauthentic behavior (Bellingcat Laundromat infrastructure overlap)? Does it match known bot farm signatures? Apply the user's Alignment Matrix to THIS entity — not to CNN, not to the President. Produce a Truth Confidence Score and propagate it to every downstream repetition in the chain.
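A sketch of the backward traversal over archive hits; timestamps, domains, and similarity scores are all invented for illustration:

```python
# Toy archive hits: (ISO timestamp, source, cosine similarity of the hit's
# embedded claim to the query vector).
hits = [
    ("2020-03-14", "cnn.example",           0.91),
    ("2020-03-11", "reuters.example",       0.93),
    ("2020-03-08", "obscure-forum.example", 0.88),
    ("2020-03-07", "botfarm.example",       0.90),
    ("2019-11-02", "unrelated.example",     0.41),  # below threshold: not this claim
]

def find_time_zero(hits: list, threshold: float = 0.85):
    """Earliest document whose semantic vector crosses the threshold."""
    matches = [(ts, src) for ts, src, sim in hits if sim >= threshold]
    return min(matches) if matches else None  # ISO dates sort lexicographically
```

Note that the earliest hit above the threshold wins even when later, more authoritative sources score higher similarity: the launderers match better, but the bot farm got there first.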
Tools: ClaimReview DB, Bellingcat Laundromat, domain age API, bot detection, Alignment Matrix.

The critical insight from the research: the Bellingcat Information Laundromat already proves the infrastructure-overlap approach works for detecting coordinated laundering networks. GDELT already rescans every article at 24h and 7 days to track how narratives change. The Temporal Knowledge Graph pattern already handles time-stamped triple invalidation chains. These are not hypothetical — they exist. Your architecture assembles them into a unified pipeline that no one has connected end-to-end yet.
Step 1: Build the Claim Extractor. Input: raw text. Output: KG triple + semantic vector.
Step 2: Build the LLM Ensemble Heuristic. Query 3–5 models. Return consensus Time Zero estimate.
Step 3: Connect to GDELT + Wayback Machine APIs. Run backward semantic search from the LLM-estimated date.
Step 4: Build the Origin Rater. Evaluate the Time Zero entity. Return a confidence score.
Step 5: Visualize the full chain: Time Zero → Propagation Path → Launderer → Current Claim.
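The five steps can be strung together as a pipeline skeleton. Every function below is a stub with a hypothetical return value standing in for the real component described above:

```python
def extract_claim(text: str):              # Step 1 (stub): KG triple + vector
    return ("Russia", "interfered with", "2016 US Election"), [0.1, 0.9]

def llm_ensemble_estimate(triple):         # Step 2 (stub): consensus date range
    return "2016-06"

def backward_search(vector, near):         # Step 3 (stub): earliest archive match
    return {"timestamp": "2016-06-14", "entity": "botfarm.example"}

def rate_origin(time_zero):                # Step 4 (stub): origin reliability
    return 0.12

def trace_claim(text: str) -> dict:        # Step 5: assemble the chain to visualize
    triple, vector = extract_claim(text)
    estimate = llm_ensemble_estimate(triple)
    time_zero = backward_search(vector, near=estimate)
    return {"triple": triple, "time_zero": time_zero,
            "confidence": rate_origin(time_zero)}
```

The value of the skeleton is the interface contracts between stages: each stub can be replaced independently (swap the ensemble heuristic, swap the archive backend) without touching the rest of the pipeline.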
Semantic drift threshold: What cosine similarity score counts as "the same claim"? Too high = miss paraphrases. Too low = false positives. This needs empirical tuning on known misinformation cases.
Coverage gaps: Not all of the early fringe web is indexed. Some Time Zero events happened in encrypted channels (Telegram, WhatsApp) with zero archival coverage.
Latency vs. depth: Deep traversal is expensive. The LLM ensemble heuristic is the shortcut — but it can hallucinate. You need a confidence threshold for when to trigger the expensive deep crawl.
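One way to encode that trigger, with an illustrative agreement floor that would itself need empirical tuning:

```python
def needs_deep_crawl(ensemble_agreement: float, floor: float = 0.6) -> bool:
    """Run the expensive archive traversal only when the cheap LLM ensemble
    disagrees with itself. The 0.6 floor is a placeholder, not a tuned value."""
    return ensemble_agreement < floor
```

High ensemble agreement lets the candidate Time Zero range bound the traversal cheaply; low agreement means the heuristic is unreliable and the full backward crawl must run.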