Wikimedia Enterprise for Structured Wikipedia Evidence

Motivation

The evidence pipeline (Phase 4b) currently uses the free Wikipedia REST API (en.wikipedia.org/api/rest_v1), which returns only a 2-3 sentence text extract per entity. The AI reasoner (Phase 4c) must parse prose to find dates, numbers, and relationships. That process is lossy and misses details buried deeper in the article.

Wikimedia Enterprise provides the same Wikipedia content as structured JSON — parsed infoboxes, tables, and section-level content. Infoboxes are the most data-dense part of any Wikipedia article (birth dates, awards, career stats, founding years) and are exactly what the evidence pipeline needs to verify fact claims.

Current account: 5K on-demand requests/month (~165/day), 1,500 chunk requests, 15 snapshots.

What It Replaces vs Adds

Capability         | Free Wikipedia API   | Wikimedia Enterprise
-------------------|----------------------|------------------------------
Article text       | 2-3 sentence extract | Full article, section-level
Infoboxes          | Not available        | Parsed as structured JSON
Tables             | Not available        | Parsed as JSON (beta)
Structured dates   | Parse from prose     | Direct field access
Structured numbers | Parse from prose     | Direct field access
Auth               | None                 | JWT bearer token (24h expiry)
Rate limits        | Polite use           | 5K on-demand/month

Authentication Flow

POST https://auth.enterprise.wikimedia.com/v1/login
Body: { "username": "...", "password": "..." }
→ { "access_token": "...", "refresh_token": "...", "id_token": "..." }

Access tokens expire after 24 hours.
Refresh tokens expire after 90 days.

POST https://auth.enterprise.wikimedia.com/v1/token-refresh
Body: { "username": "...", "refresh_token": "..." }
→ New access_token

Every API request then sends the access token in a header:
Authorization: Bearer {access_token}

Implementation

Challenge 1: Config & Environment

Files: packages/config/src/index.ts, .env.example

  • Add WIKIMEDIA_ENTERPRISE_USERNAME (optional string)
  • Add WIKIMEDIA_ENTERPRISE_PASSWORD (optional string)
  • Add getters: getWikimediaEnterpriseUsername(), getWikimediaEnterprisePassword()
  • Credentials already in .env.local
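A minimal sketch of the optional getters, assuming the config module can read from a plain key-value environment object. The `Env` type, the `readOptional` helper, and the explicit `env` parameter are hypothetical and exist only to keep the example self-contained; the real schema in packages/config/src/index.ts may look different.

```typescript
// Hypothetical shape of the environment; the real config package may use a schema library.
type Env = Record<string, string | undefined>

// Treat unset and blank values the same way: as "not configured".
function readOptional(env: Env, key: string): string | undefined {
  const value = env[key]?.trim()
  return value ? value : undefined
}

function getWikimediaEnterpriseUsername(env: Env): string | undefined {
  return readOptional(env, 'WIKIMEDIA_ENTERPRISE_USERNAME')
}

function getWikimediaEnterprisePassword(env: Env): string | undefined {
  return readOptional(env, 'WIKIMEDIA_ENTERPRISE_PASSWORD')
}

// Enterprise is only usable when both credentials are present.
function wikimediaEnterpriseConfigured(env: Env): boolean {
  return Boolean(getWikimediaEnterpriseUsername(env) && getWikimediaEnterprisePassword(env))
}
```

Treating a blank value as missing means a placeholder line copied from .env.example does not accidentally enable the Enterprise path.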

Acceptance: Config getters return values, env:check-example passes.

Challenge 2: Auth Token Manager

File: packages/ai/src/wikimedia-enterprise-auth.ts (new)

JWT token lifecycle management:

interface WikimediaTokens {
  accessToken: string
  refreshToken: string
  accessExpiresAt: number // Date.now() + 23h (1h buffer before 24h expiry)
  refreshExpiresAt: number // Date.now() + 89d
}

// Singleton token state
let tokens: WikimediaTokens | null = null

async function login(): Promise<WikimediaTokens>
async function refreshAccessToken(): Promise<string>
// Returns a valid token, refreshing if needed. Returns null if no credentials are configured.
async function getAccessToken(): Promise<string | null>

Behavior:

  • Auto-refresh: getAccessToken() checks expiry, refreshes transparently
  • 1h buffer before expiry to avoid edge-case 401s
  • If refresh token expired, re-login with username/password
  • Graceful degradation: if credentials missing, return null (caller falls back to free API)
  • Metrics: wikimedia_enterprise.auth_login, wikimedia_enterprise.auth_refresh
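The refresh rules above reduce to a small decision over the stored expiry times. The sketch below shows one way to express it as a pure function so it can be tested with a simulated clock. `stampTokens`, `nextAuthAction`, and `AuthAction` are hypothetical names, and the network calls themselves are deliberately left out.

```typescript
interface WikimediaTokens {
  accessToken: string
  refreshToken: string
  accessExpiresAt: number // epoch ms, already shortened by the 1h buffer
  refreshExpiresAt: number // epoch ms, already shortened by a 1d buffer
}

const HOUR_MS = 60 * 60 * 1000
const DAY_MS = 24 * HOUR_MS

type AuthAction = 'use' | 'refresh' | 'login'

// Stamp freshly issued tokens with buffered expiry times (23h and 89d).
function stampTokens(accessToken: string, refreshToken: string, now: number): WikimediaTokens {
  return {
    accessToken,
    refreshToken,
    accessExpiresAt: now + 23 * HOUR_MS,
    refreshExpiresAt: now + 89 * DAY_MS,
  }
}

// Decide what getAccessToken() has to do before it can hand out a token.
function nextAuthAction(tokens: WikimediaTokens | null, now: number): AuthAction {
  if (tokens === null) return 'login' // nothing cached yet
  if (now < tokens.accessExpiresAt) return 'use' // access token still valid
  if (now < tokens.refreshExpiresAt) return 'refresh' // access expired, refresh token still valid
  return 'login' // both expired: fall back to username/password
}
```

With this split, getAccessToken() would check for credentials first (returning null if they are missing), then call login() or refreshAccessToken() according to the returned action.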

Acceptance: Can login, get token, and auto-refresh after simulated expiry.

Challenge 3: On-Demand Article Client

File: packages/ai/src/wikimedia-enterprise-client.ts (new)

  • Endpoint: POST https://api.enterprise.wikimedia.com/v2/articles/{article_name}
  • Auth: Authorization: Bearer {token} via token manager
  • Body: { "filters": [{ "field": "is_part_of.identifier", "value": "enwiki" }] }
  • In-memory cache: 1h TTL, 5K max entries
  • 10s timeout, abort controller
  • Metrics: wikimedia_enterprise.api_calls, wikimedia_enterprise.cache_hit, wikimedia_enterprise.article_found
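The in-memory cache (1h TTL, 5K max entries) could look roughly like the sketch below. `TtlCache` is a hypothetical name, and evicting the oldest inserted entry when the cap is reached is an assumption; the plan does not specify an eviction policy.

```typescript
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>()
  private readonly ttlMs: number
  private readonly maxEntries: number
  private readonly now: () => number

  // `now` is injectable so tests can simulate the passage of time.
  constructor(ttlMs: number, maxEntries: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs
    this.maxEntries = maxEntries
    this.now = now
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key)
    if (!entry) return undefined
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key) // expired: drop it so it stops counting toward the cap
      return undefined
    }
    return entry.value
  }

  set(key: string, value: V): void {
    this.entries.delete(key) // re-inserting moves the key to the newest position
    if (this.entries.size >= this.maxEntries) {
      // Map iterates in insertion order, so the first key is the oldest entry.
      const oldest = this.entries.keys().next().value
      if (oldest !== undefined) this.entries.delete(oldest)
    }
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs })
  }

  get size(): number {
    return this.entries.size
  }
}
```

For the client described here the cache would be constructed as `new TtlCache<WikimediaArticle>(60 * 60 * 1000, 5000)`, keyed by article name.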

Key methods:

getArticle(articleName: string): Promise<WikimediaArticle | null>
// Returns full article with structured content

getArticleInfobox(articleName: string): Promise<Record<string, string> | null>
// Convenience: extracts just the infobox key-value pairs

getArticleSections(articleName: string): Promise<WikimediaSection[]>
// Returns section-level content for targeted evidence extraction

Response parsing: The structured content (beta) includes infobox fields as key-value pairs. Parse these into a flat Record<string, string> for easy comparison against fact values.
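As an illustration of the flattening step, the sketch below walks a nested infobox tree and collects labelled values into a flat record. The `InfoboxPart` shape (with `name`, `value`, and nested `has_parts`) is an assumption about the beta payload and should be checked against a real response before relying on it; `flattenInfobox` is a hypothetical helper name.

```typescript
// Assumed node shape; the real structured-content payload may differ.
interface InfoboxPart {
  type?: string
  name?: string
  value?: string
  has_parts?: InfoboxPart[]
}

function flattenInfobox(
  parts: InfoboxPart[],
  out: Record<string, string> = {},
): Record<string, string> {
  for (const part of parts) {
    const name = part.name?.trim()
    const value = part.value?.trim()
    // Keep the first value seen for a label so repeated labels do not overwrite it.
    if (name && value && !Object.prototype.hasOwnProperty.call(out, name)) {
      out[name] = value
    }
    // Sections nest their fields, so recurse into children.
    if (part.has_parts) flattenInfobox(part.has_parts, out)
  }
  return out
}

// Usage with a hand-written sample (not a real API response):
const sample: InfoboxPart[] = [
  {
    type: 'section',
    name: 'Albert Einstein',
    has_parts: [
      { type: 'field', name: 'Born', value: '14 March 1879' },
      { type: 'field', name: 'Awards', value: 'Nobel Prize in Physics (1921)' },
    ],
  },
]
const flat = flattenInfobox(sample)
```

Nodes that carry a name but no value, such as the section wrapper in the sample, are skipped, so only labelled fields end up in the record.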

Fallback: If Enterprise returns null or errors, fall back to existing free Wikipedia client. The evidence pipeline should never be blocked by Enterprise unavailability.

Acceptance: Can fetch "Albert_Einstein" → structured infobox with Born: 14 March 1879, Birthplace: Ulm, Awards: Nobel Prize in Physics (1921).

Challenge 4: Evidence Pipeline Integration

File: packages/ai/src/validation/evidence.ts

Replace the free Wikipedia lookup with Enterprise when available, falling back to free API:

// Upgraded Wikipedia evidence — Enterprise first, free API fallback
let wikiEvidence: string | null = null
let wikiInfobox: Record<string, string> | null = null

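if (wikimediaEnterpriseConfigured()) {
  // Wikipedia article names use underscores in place of spaces
  const articleName = entityName.replace(/ /g, '_')
  const article = await getArticle(articleName)
  if (article) {
    // Expected to be served from the client's 1h cache rather than a second request
    wikiInfobox = await getArticleInfobox(articleName)
    wikiEvidence = formatEnterpriseContext(article, wikiInfobox)
  }
}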

// Fallback to free API if Enterprise unavailable
if (!wikiEvidence) {
  const summary = await lookupWikipedia(entityName)
  if (summary) wikiEvidence = summary.extract
}

if (wikiEvidence) {
  findings.push(`Wikipedia: ${wikiEvidence}`)
}

// Direct infobox field comparison against fact values
if (wikiInfobox) {
  const contradictions = compareInfoboxToFactValues(wikiInfobox, factValues)
  for (const c of contradictions) {
    flags.push(`wiki_infobox_contradiction: ${c.field}: fact says "${c.factValue}" but infobox says "${c.infoboxValue}"`)
  }
}

Infobox comparison logic:

  • Extract numeric values from both fact values and infobox
  • Date comparison with year-level tolerance
  • String comparison for names, places (fuzzy match)
  • Flag contradictions as critical evidence
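A rough sketch of the comparison, under several assumptions: fact values arrive as a record keyed by the same labels as the infobox, dates are compared by four-digit year only, and the fuzzy string match is replaced here by a simpler normalized containment check. The helper names other than `compareInfoboxToFactValues` are hypothetical.

```typescript
interface Contradiction {
  field: string
  factValue: string
  infoboxValue: string
}

// Four-digit years from 1000 to 2999.
function extractYears(text: string): string[] {
  return text.match(/\b[12]\d{3}\b/g) ?? []
}

// Lowercase, join thousands groups ("8,336,817" -> "8336817"), collapse punctuation.
function normalize(text: string): string {
  return text
    .toLowerCase()
    .replace(/(\d),(?=\d{3}\b)/g, '$1')
    .replace(/[^a-z0-9]+/g, ' ')
    .trim()
}

function valuesAgree(factValue: string, infoboxValue: string): boolean {
  const factYears = extractYears(factValue)
  const infoboxYears = extractYears(infoboxValue)
  if (factYears.length > 0 && infoboxYears.length > 0) {
    // Year-level tolerance: agreement if any year in the fact appears in the infobox.
    return factYears.some((year) => infoboxYears.includes(year))
  }
  // Stand-in for fuzzy matching: one normalized value must contain the other.
  const a = normalize(factValue)
  const b = normalize(infoboxValue)
  return a.includes(b) || b.includes(a)
}

function compareInfoboxToFactValues(
  infobox: Record<string, string>,
  factValues: Record<string, string>,
): Contradiction[] {
  const contradictions: Contradiction[] = []
  for (const [field, factValue] of Object.entries(factValues)) {
    const infoboxValue = infobox[field]
    if (infoboxValue === undefined) continue // no evidence either way
    if (!valuesAgree(factValue, infoboxValue)) {
      contradictions.push({ field, factValue, infoboxValue })
    }
  }
  return contradictions
}
```

A field that is missing from the infobox is treated as no evidence rather than as a contradiction, which keeps sparse infoboxes from generating false critical flags.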

Confidence impact:

  • Infobox field matches fact value → apiConfidence = 0.9 (structured, authoritative)
  • Infobox field contradicts fact value → flag as critical with specific field cited
  • Enterprise article found but no infobox → use full text (still better than free extract)

Acceptance: "Einstein was born in 1879" → Enterprise infobox Born: 14 March 1879 → confirmed. "Einstein was born in 1878" → infobox contradiction flagged.

Challenge 5: Rate Budget Management

File: packages/ai/src/wikimedia-enterprise-client.ts

The 5K on-demand requests/month allowance works out to roughly 165/day, so the client enforces a daily budget:

// Track daily usage (in-memory, resets at midnight UTC)
let dailyRequestCount = 0
let dailyRequestDate: string | null = null
const DAILY_BUDGET = 150 // Leave 15/day buffer

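function canMakeRequest(): boolean {
  const today = new Date().toISOString().split('T')[0] // UTC date, e.g. "2025-01-31"
  if (dailyRequestDate !== today) {
    dailyRequestCount = 0
    dailyRequestDate = today
  }
  return dailyRequestCount < DAILY_BUDGET
}

// Call once for every request actually sent to Enterprise
function recordRequest(): void {
  dailyRequestCount += 1
}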

When the budget is exhausted, the client returns null without raising an error, so the caller falls back to the free Wikipedia API. Log a warning once usage reaches 80% of the daily budget.
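The budget check, the counter increment, and the 80% warning can be kept together in one place. The sketch below is one possible shape, not the planned implementation: `DailyBudget` and `tryConsume` are hypothetical names, and the logger and clock are injected so the behavior can be tested without waiting for midnight UTC.

```typescript
class DailyBudget {
  private count = 0
  private day = ''
  private warned = false
  private readonly limit: number
  private readonly warn: (message: string) => void
  private readonly now: () => Date

  constructor(
    limit: number,
    warn: (message: string) => void,
    now: () => Date = () => new Date(),
  ) {
    this.limit = limit
    this.warn = warn
    this.now = now
  }

  // Reset the counter when the UTC date changes.
  private rollover(): void {
    const today = this.now().toISOString().slice(0, 10)
    if (today !== this.day) {
      this.day = today
      this.count = 0
      this.warned = false
    }
  }

  // Counts the request and returns true if budget remains; returns false otherwise.
  tryConsume(): boolean {
    this.rollover()
    if (this.count >= this.limit) return false
    this.count += 1
    if (!this.warned && this.count >= this.limit * 0.8) {
      this.warned = true // warn only once per day
      this.warn(`Wikimedia Enterprise daily budget at ${this.count}/${this.limit}`)
    }
    return true
  }
}
```

With `new DailyBudget(150, logger.warn)` the warning would fire on the 120th request of the day and the 151st attempt would return false, at which point the caller uses the free API.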

Acceptance: After 150 requests in a day, client returns null and logs warning. Free API fallback kicks in.

Challenge 6: Tests

Files:

  • packages/ai/src/__tests__/wikimedia-enterprise-auth.test.ts (new)
  • packages/ai/src/__tests__/wikimedia-enterprise-client.test.ts (new)

Auth tests:

  • Login flow → token storage
  • Auto-refresh when access token expired
  • Re-login when refresh token expired
  • Graceful degradation when credentials missing

Client tests:

  • Article fetch and response parsing
  • Infobox extraction from structured content
  • Cache behavior
  • Rate budget tracking and fallback
  • Infobox-to-fact comparison logic

Acceptance: bun run test passes.

Migration Path

This doesn't replace the free Wikipedia client — it upgrades it. The free client stays as fallback:

Evidence Phase 4b:
  1. Try Wikimedia Enterprise (structured infobox data)
  2. If unavailable/budget-exhausted → fall back to free Wikipedia REST API
  3. If both unavailable → fall back to Wikidata
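The three-step order can be expressed as a generic "first source that answers" loop. This is a sketch under the assumption that each source can be wrapped as an async lookup returning text or null; `EvidenceSource` and `firstAvailableEvidence` are hypothetical names, not existing pipeline functions.

```typescript
interface EvidenceSource {
  name: string
  lookup: (entity: string) => Promise<string | null>
}

// Try each source in order and return the first non-empty result.
async function firstAvailableEvidence(
  sources: EvidenceSource[],
  entity: string,
): Promise<{ source: string; evidence: string } | null> {
  for (const source of sources) {
    try {
      const evidence = await source.lookup(entity)
      if (evidence) return { source: source.name, evidence }
    } catch {
      // A failing source must not block the pipeline; move on to the next one.
    }
  }
  return null
}
```

Passing the sources as `[enterprise, freeWikipedia, wikidata]` reproduces the order listed above, and a thrown error or a null from Enterprise (for example when the daily budget is exhausted) simply advances to the next entry.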

Existing packages/ai/src/wikipedia-client.ts remains untouched.

Cost

Free tier account: 5K on-demand requests/month, 15 snapshots, 1,500 chunk requests. Quotas reset on the 1st of each month.

Dependencies

  • WIKIMEDIA_ENTERPRISE_USERNAME in .env.local (already added)
  • WIKIMEDIA_ENTERPRISE_PASSWORD in .env.local (already added)
  • Add both to packages/config/src/index.ts env schema
  • Add placeholders to .env.example

Relationship to Other Evidence Plans

Plan                              | Domain                   | Data
----------------------------------|--------------------------|------------------------------------------------------------------
API-Sports                        | Sports                   | Match results, player stats, game data
OpenAlex + Nobel Prize + NASA     | Science, academia, space | Authors, papers, institutions, prize attribution
Alpha Vantage                     | Finance (primary)        | Company fundamentals, stock prices
FRED + Finnhub + FMP + World Bank | Finance (expansion)      | Economic data, ESG, global development
Wikimedia Enterprise              | All domains (upgrade)    | Structured Wikipedia infoboxes, tables, sections
DBpedia                           | General-purpose fallback | Structured Wikipedia infobox properties (overlaps with Enterprise)

Note: Wikimedia Enterprise largely supersedes the DBpedia plan — both extract structured data from Wikipedia, but Enterprise provides it directly from the source with fresher data and official support. Consider deprioritizing DBpedia if Enterprise integration is successful.