Wikimedia Enterprise for Structured Wikipedia Evidence
Motivation
The evidence pipeline (Phase 4b) currently uses the free Wikipedia REST API (en.wikipedia.org/api/rest_v1), which returns only a 2-3 sentence text extract per entity. The AI reasoner (Phase 4c) must parse prose to find dates, numbers, and relationships, a lossy process that misses details buried deeper in the article.
Wikimedia Enterprise provides the same Wikipedia content as structured JSON: parsed infoboxes, tables, and section-level content. Infoboxes are the most data-dense part of a Wikipedia article (birth dates, awards, career stats, founding years) and are exactly what the evidence pipeline needs to verify fact claims.
Current account: 5K on-demand requests/month (~165/day), 1,500 chunk requests, 15 snapshots.
What It Replaces vs Adds
| Capability | Free Wikipedia API | Wikimedia Enterprise |
|---|---|---|
| Article text | 2-3 sentence extract | Full article, section-level |
| Infoboxes | Not available | Parsed as structured JSON |
| Tables | Not available | Parsed as JSON (beta) |
| Structured dates | Parse from prose | Direct field access |
| Structured numbers | Parse from prose | Direct field access |
| Auth | None | JWT bearer token (24h expiry) |
| Rate limits | Polite use | 5K on-demand/month |
Authentication Flow
Log in with username and password:

    POST https://auth.enterprise.wikimedia.com/v1/login
    Body: { "username": "...", "password": "..." }
    → { "access_token": "...", "refresh_token": "...", "id_token": "..." }

Access tokens expire after 24 hours. Refresh tokens expire after 90 days.

Refresh the access token before it expires:

    POST https://auth.enterprise.wikimedia.com/v1/token-refresh
    Body: { "refresh_token": "..." }
    → New access_token

Send the access token on every API request:

    Authorization: Bearer {access_token}
Implementation
Challenge 1: Config & Environment
Files: packages/config/src/index.ts, .env.example
- Add WIKIMEDIA_ENTERPRISE_USERNAME (optional string)
- Add WIKIMEDIA_ENTERPRISE_PASSWORD (optional string)
- Add getters: getWikimediaEnterpriseUsername(), getWikimediaEnterprisePassword()
- Credentials already in .env.local
Acceptance: Config getters return values, env:check-example passes.
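A minimal sketch of the optional getters, assuming the config package can be reduced to reading an environment record. The real packages/config/src/index.ts may validate through a schema instead; the `Env` type, the `readOptional` helper, and the explicit `env` parameter are hypothetical and exist only so the sketch runs standalone.

```typescript
// Sketch only: the real config package may validate env vars through a schema.
// `Env` and `readOptional` are illustrative names, not existing project code.
type Env = Record<string, string | undefined>

// Treat unset or blank variables as "not configured".
function readOptional(env: Env, key: string): string | undefined {
  const value = env[key]?.trim()
  return value ? value : undefined
}

function getWikimediaEnterpriseUsername(env: Env): string | undefined {
  return readOptional(env, 'WIKIMEDIA_ENTERPRISE_USERNAME')
}

function getWikimediaEnterprisePassword(env: Env): string | undefined {
  return readOptional(env, 'WIKIMEDIA_ENTERPRISE_PASSWORD')
}

// Enterprise is usable only when both credentials are present.
function wikimediaEnterpriseConfigured(env: Env): boolean {
  return Boolean(getWikimediaEnterpriseUsername(env) && getWikimediaEnterprisePassword(env))
}
```

Returning `undefined` for blank values keeps the "credentials missing" check in one place, which is what the graceful fallback to the free API relies on.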
Challenge 2: Auth Token Manager
File: packages/ai/src/wikimedia-enterprise-auth.ts (new)
JWT token lifecycle management:
    interface WikimediaTokens {
      accessToken: string
      refreshToken: string
      accessExpiresAt: number  // Date.now() + 23h (1h buffer before 24h expiry)
      refreshExpiresAt: number // Date.now() + 89d
    }

    // Singleton token state
    let tokens: WikimediaTokens | null = null

    // Signatures only; bodies omitted
    async function login(): Promise<WikimediaTokens>
    async function refreshAccessToken(): Promise<string>

    // Returns valid token, refreshing if needed. Returns null if no credentials configured.
    async function getAccessToken(): Promise<string | null>
- Auto-refresh: getAccessToken() checks expiry, refreshes transparently
- 1h buffer before expiry to avoid edge-case 401s
- If refresh token expired, re-login with username/password
- Graceful degradation: if credentials missing, return null (caller falls back to free API)
- Metrics: wikimedia_enterprise.auth_login, wikimedia_enterprise.auth_refresh
Acceptance: Can login, get token, and auto-refresh after simulated expiry.
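The lifecycle above could be sketched roughly as follows. This is not the final module: `createTokenManager`, the `PostJson` transport, and the injectable `now` clock are hypothetical names introduced so that expiry can be simulated in tests, and the request and response field names simply follow the Authentication Flow section. Metrics are left out.

```typescript
// Sketch under assumptions: `PostJson` is a hypothetical wrapper around fetch,
// and the field names mirror the Authentication Flow section of this plan.
interface WikimediaTokens {
  accessToken: string
  refreshToken: string
  accessExpiresAt: number
  refreshExpiresAt: number
}
type AuthResponse = { access_token: string; refresh_token?: string }
type PostJson = (url: string, body: unknown) => Promise<AuthResponse>

const AUTH_BASE = 'https://auth.enterprise.wikimedia.com/v1'
const HOUR_MS = 60 * 60 * 1000
const DAY_MS = 24 * HOUR_MS

function createTokenManager(
  creds: { username?: string; password?: string },
  postJson: PostJson,
  now: () => number = Date.now,
) {
  let tokens: WikimediaTokens | null = null

  async function login(username: string, password: string): Promise<WikimediaTokens> {
    const res = await postJson(`${AUTH_BASE}/login`, { username, password })
    if (!res.refresh_token) throw new Error('login response missing refresh_token')
    const issuedAt = now()
    return {
      accessToken: res.access_token,
      refreshToken: res.refresh_token,
      accessExpiresAt: issuedAt + 23 * HOUR_MS, // 1h buffer before the 24h expiry
      refreshExpiresAt: issuedAt + 89 * DAY_MS, // 1d buffer before the 90d expiry
    }
  }

  async function getAccessToken(): Promise<string | null> {
    const { username, password } = creds
    if (!username || !password) return null // graceful degradation: caller uses free API

    const t = now()
    if (tokens && t < tokens.accessExpiresAt) return tokens.accessToken

    if (tokens && t < tokens.refreshExpiresAt) {
      // Access token stale but refresh token still valid: refresh transparently.
      const res = await postJson(`${AUTH_BASE}/token-refresh`, { refresh_token: tokens.refreshToken })
      tokens = { ...tokens, accessToken: res.access_token, accessExpiresAt: now() + 23 * HOUR_MS }
      return tokens.accessToken
    }

    // No tokens yet, or the refresh token has expired: log in again.
    tokens = await login(username, password)
    return tokens.accessToken
  }

  return { getAccessToken }
}
```

Injecting the clock makes the "auto-refresh after simulated expiry" acceptance check a matter of advancing a number. The sketch does not deduplicate concurrent calls, so two simultaneous callers could both log in; a shared in-flight promise would be one way to address that.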
Challenge 3: On-Demand Article Client
File: packages/ai/src/wikimedia-enterprise-client.ts (new)
- Endpoint: POST https://api.enterprise.wikimedia.com/v2/articles/{article_name}
- Auth: Authorization: Bearer {token} via token manager
- Body: { "filters": [{ "field": "project", "value": "enwiki" }] }
- In-memory cache: 1h TTL, 5K max entries
- 10s timeout, abort controller
- Metrics: wikimedia_enterprise.api_calls, wikimedia_enterprise.cache_hit, wikimedia_enterprise.article_found
Key methods:
    // Returns full article with structured content
    getArticle(articleName: string): Promise<WikimediaArticle | null>

    // Convenience: extracts just the infobox key-value pairs
    getArticleInfobox(articleName: string): Promise<Record<string, string> | null>

    // Returns section-level content for targeted evidence extraction
    getArticleSections(articleName: string): Promise<WikimediaSection[]>
Response parsing: The structured content (beta) includes infobox fields as key-value pairs. Parse these into a flat Record<string, string> for easy comparison against fact values.
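The flattening step might look like the sketch below. The node shape (`name`, `value`, nested `has_parts`) is an assumption about the beta structured content and should be checked against a real response before use; `InfoboxNode` and `flattenInfobox` are hypothetical names.

```typescript
// Assumed node shape for the beta structured content; verify the field names
// against a real Enterprise response before relying on them.
interface InfoboxNode {
  name?: string
  type?: string
  value?: string
  has_parts?: InfoboxNode[]
}

// Walk the tree depth-first and collect every named node that carries a value.
function flattenInfobox(
  nodes: InfoboxNode[],
  out: Record<string, string> = {},
): Record<string, string> {
  for (const node of nodes) {
    const key = node.name?.trim()
    const value = node.value?.trim()
    // Keep the first occurrence so a repeated label does not overwrite earlier data.
    if (key && value && !Object.prototype.hasOwnProperty.call(out, key)) {
      out[key] = value
    }
    if (node.has_parts) flattenInfobox(node.has_parts, out)
  }
  return out
}
```

Container nodes such as the infobox itself or its sections usually carry a name but no value, so they are skipped and only leaf fields end up in the flat record.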
Fallback: If Enterprise returns null or errors, fall back to existing free Wikipedia client. The evidence pipeline should never be blocked by Enterprise unavailability.
Acceptance: Can fetch "Albert_Einstein" → structured infobox with Born: 14 March 1879, Birthplace: Ulm, Awards: Nobel Prize in Physics (1921).
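For the in-memory cache described in this challenge (1h TTL, 5K max entries), a bounded cache could be sketched as below. `createTtlCache` is a hypothetical helper, and evicting the oldest inserted entry when full is an assumed policy; the plan does not specify one.

```typescript
// Sketch of a bounded TTL cache. Eviction of the oldest inserted entry is an
// assumed policy; the plan only states the TTL and the maximum size.
function createTtlCache<V>(
  ttlMs = 60 * 60 * 1000, // 1h TTL
  maxEntries = 5000,
  now: () => number = Date.now,
) {
  const entries = new Map<string, { value: V; expiresAt: number }>()

  return {
    get(key: string): V | undefined {
      const hit = entries.get(key)
      if (!hit) return undefined
      if (now() >= hit.expiresAt) {
        entries.delete(key) // expired: drop it and report a miss
        return undefined
      }
      return hit.value
    },
    set(key: string, value: V): void {
      entries.delete(key) // re-insert so the key moves to the newest position
      if (entries.size >= maxEntries) {
        // Map iterates in insertion order, so the first key is the oldest entry.
        const oldest = entries.keys().next().value
        if (oldest !== undefined) entries.delete(oldest)
      }
      entries.set(key, { value, expiresAt: now() + ttlMs })
    },
    size(): number {
      return entries.size
    },
  }
}
```

Since every Enterprise request counts against the monthly quota, checking this cache before the rate budget means repeated lookups of the same entity cost nothing.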
Challenge 4: Evidence Pipeline Integration
File: packages/ai/src/validation/evidence.ts
Replace the free Wikipedia lookup with Enterprise when available, falling back to free API:
    // Upgraded Wikipedia evidence: Enterprise first, free API fallback
    let wikiEvidence: string | null = null
    let wikiInfobox: Record<string, string> | null = null

    if (wikimediaEnterpriseConfigured()) {
      // Article names use underscores in place of spaces; compute once and reuse
      const articleName = entityName.replace(/ /g, '_')
      const article = await getArticle(articleName)
      if (article) {
        wikiInfobox = await getArticleInfobox(articleName)
        wikiEvidence = formatEnterpriseContext(article, wikiInfobox)
      }
    }

    // Fallback to free API if Enterprise unavailable
    if (!wikiEvidence) {
      const summary = await lookupWikipedia(entityName)
      if (summary) wikiEvidence = summary.extract
    }

    if (wikiEvidence) {
      findings.push(`Wikipedia: ${wikiEvidence}`)
    }

    // Direct infobox field comparison against fact values
    if (wikiInfobox) {
      const contradictions = compareInfoboxToFactValues(wikiInfobox, factValues)
      for (const c of contradictions) {
        flags.push(`wiki_infobox_contradiction: ${c.field}: fact says "${c.factValue}" but infobox says "${c.infoboxValue}"`)
      }
    }
Infobox comparison logic:
- Extract numeric values from both fact values and infobox
- Date comparison with year-level tolerance
- String comparison for names, places (fuzzy match)
- Flag contradictions as critical evidence
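The comparison rules listed above could be sketched as follows. This assumes `factValues` is a flat `Record<string, string>` whose keys already correspond to infobox labels, which the plan does not state, and it uses substring containment as a crude stand-in for real fuzzy matching. The `normalize`, `years`, and `numbers` helpers are illustrative.

```typescript
interface Contradiction { field: string; factValue: string; infoboxValue: string }

// Lowercase and collapse punctuation so "Birth-place" and "birth place" match.
const normalize = (s: string): string => s.toLowerCase().replace(/[^a-z0-9]+/g, ' ').trim()
// Four-digit years between 1000 and 2099.
const years = (s: string): string[] => s.match(/\b(1\d{3}|20\d{2})\b/g) ?? []
// All numeric values, ignoring thousands separators.
const numbers = (s: string): number[] =>
  (s.replace(/,/g, '').match(/-?\d+(\.\d+)?/g) ?? []).map(Number)

function compareInfoboxToFactValues(
  infobox: Record<string, string>,
  factValues: Record<string, string>, // assumed shape: keys aligned with infobox labels
): Contradiction[] {
  const byKey = new Map(Object.entries(infobox).map(([k, v]) => [normalize(k), v] as const))
  const out: Contradiction[] = []

  for (const [field, factValue] of Object.entries(factValues)) {
    const infoboxValue = byKey.get(normalize(field))
    if (infoboxValue === undefined) continue // nothing to compare against

    const factYears = years(factValue)
    const boxYears = years(infoboxValue)
    const factNums = numbers(factValue)
    const boxNums = numbers(infoboxValue)

    let agrees: boolean
    if (factYears.length > 0 && boxYears.length > 0) {
      // Date comparison with year-level tolerance: only the years must match.
      agrees = factYears.every((y) => boxYears.includes(y))
    } else if (factNums.length > 0 && boxNums.length > 0) {
      agrees = factNums.every((n) => boxNums.includes(n))
    } else {
      const a = normalize(factValue)
      const b = normalize(infoboxValue)
      agrees = a.includes(b) || b.includes(a) // crude stand-in for fuzzy matching
    }
    if (!agrees) out.push({ field, factValue, infoboxValue })
  }
  return out
}
```

Fields missing from the infobox are skipped instead of flagged, so an absent field never counts as a contradiction.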
Confidence impact:
- Infobox field matches fact value → apiConfidence = 0.9 (structured, authoritative)
- Infobox field contradicts fact value → flag as critical with specific field cited
- Enterprise article found but no infobox → use full text (still better than free extract)
Acceptance: "Einstein was born in 1879" → Enterprise infobox Born: 14 March 1879 → confirmed. "Einstein was born in 1878" → infobox contradiction flagged.
Challenge 5: Rate Budget Management
File: packages/ai/src/wikimedia-enterprise-client.ts
5K on-demand requests/month works out to roughly 165/day, so Enterprise calls must be budgeted deliberately:

    // Track daily usage (in-memory, resets at midnight UTC)
    let dailyRequestCount = 0
    let dailyRequestDate: string | null = null
    const DAILY_BUDGET = 150 // Leave 15/day buffer

    function canMakeRequest(): boolean {
      const today = new Date().toISOString().slice(0, 10) // UTC date, YYYY-MM-DD
      if (dailyRequestDate !== today) {
        dailyRequestCount = 0
        dailyRequestDate = today
      }
      return dailyRequestCount < DAILY_BUDGET
    }

    // Call after each Enterprise request that was actually sent (cache hits do not count)
    function recordRequest(): void {
      dailyRequestCount += 1
    }
When the budget is exhausted, the client returns null and the pipeline falls back to the free Wikipedia API without raising an error. Log a warning at 80% usage (120 of 150 requests).
Acceptance: After 150 requests in a day, client returns null and logs warning. Free API fallback kicks in.
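One way to combine the budget check, the usage count, and the 80% warning in a single helper is sketched below. `createDailyBudget`, the `warn` callback, and the injectable `now` clock are hypothetical names chosen so the midnight-UTC reset can be tested without waiting for a real day boundary.

```typescript
// Sketch: daily request budget with a one-time warning at 80% usage.
function createDailyBudget(
  limit = 150,
  warn: (message: string) => void = () => {},
  now: () => Date = () => new Date(),
) {
  let day = ''
  let used = 0
  let warned = false
  const warnAt = Math.floor(limit * 0.8)

  // Reset the counters when the UTC date changes.
  function rollOver(): void {
    const today = now().toISOString().slice(0, 10) // YYYY-MM-DD in UTC
    if (today !== day) {
      day = today
      used = 0
      warned = false
    }
  }

  return {
    // Reserve one request; returns false once the daily budget is spent.
    tryConsume(): boolean {
      rollOver()
      if (used >= limit) return false
      used += 1
      if (!warned && used >= warnAt) {
        warned = true
        warn(`Wikimedia Enterprise daily budget at ${used}/${limit}`)
      }
      return true
    },
    remaining(): number {
      rollOver()
      return limit - used
    },
  }
}
```

Checking and counting in one `tryConsume()` call avoids the case where a caller checks the budget but forgets to record the request. Because the counter lives in memory, it resets on process restart, so the monthly quota can still be exceeded if the service restarts often.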
Challenge 6: Tests
Files:
- packages/ai/src/__tests__/wikimedia-enterprise-auth.test.ts (new)
- packages/ai/src/__tests__/wikimedia-enterprise-client.test.ts (new)
Auth tests:
- Login flow → token storage
- Auto-refresh when access token expired
- Re-login when refresh token expired
- Graceful degradation when credentials missing
Client tests:
- Article fetch and response parsing
- Infobox extraction from structured content
- Cache behavior
- Rate budget tracking and fallback
- Infobox-to-fact comparison logic
Acceptance: bun run test passes.
Migration Path
This does not replace the free Wikipedia client. It adds a preferred source ahead of it, and the free client stays as the fallback:
Evidence Phase 4b:
1. Try Wikimedia Enterprise (structured infobox data)
2. If unavailable/budget-exhausted → fall back to free Wikipedia REST API
3. If both unavailable → fall back to Wikidata
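The fallback order above can be expressed as a small ordered chain. In this sketch `EvidenceSource` and `firstAvailableEvidence` are hypothetical names, and the lookup functions passed in stand for the Enterprise, free Wikipedia, and Wikidata clients.

```typescript
// Sketch of the fallback order; the sources passed in are placeholders for
// the real Enterprise, free Wikipedia, and Wikidata clients.
type EvidenceSource = {
  name: string
  lookup: (entity: string) => Promise<string | null>
}

async function firstAvailableEvidence(
  entity: string,
  sources: EvidenceSource[],
): Promise<{ source: string; evidence: string } | null> {
  for (const { name, lookup } of sources) {
    try {
      const evidence = await lookup(entity)
      if (evidence) return { source: name, evidence }
    } catch {
      // A failing source must never block the pipeline; try the next one.
    }
  }
  return null
}
```

Both a null result (unavailable or budget exhausted) and a thrown error move on to the next source, which matches the requirement that the evidence pipeline is never blocked by Enterprise unavailability.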
Existing packages/ai/src/wikipedia-client.ts remains untouched.
Cost
Free tier account. 5K on-demand/month, 15 snapshots, 1,500 chunks. Resets on 1st of each month.
Dependencies
- WIKIMEDIA_ENTERPRISE_USERNAME in .env.local (already added)
- WIKIMEDIA_ENTERPRISE_PASSWORD in .env.local (already added)
- Add both to packages/config/src/index.ts env schema
- Add placeholders to .env.example
Relationship to Other Evidence Plans
| Plan | Domain | Data |
|---|---|---|
| API-Sports | Sports | Match results, player stats, game data |
| OpenAlex + Nobel Prize + NASA | Science, academia, space | Authors, papers, institutions, prize attribution |
| Alpha Vantage | Finance (primary) | Company fundamentals, stock prices |
| FRED + Finnhub + FMP + World Bank | Finance (expansion) | Economic data, ESG, global development |
| Wikimedia Enterprise | All domains (upgrade) | Structured Wikipedia infoboxes, tables, sections |
| DBpedia | General-purpose fallback | Structured Wikipedia infobox properties (overlaps with Enterprise) |
Note: Wikimedia Enterprise largely supersedes the DBpedia plan — both extract structured data from Wikipedia, but Enterprise provides it directly from the source with fresher data and official support. Consider deprioritizing DBpedia if Enterprise integration is successful.