Pulse SportHub — data topology

The topology

The four stages, walked through

Plane 1 — live data from ESPN · Supabase edge functions on pg_cron

The "what's happening now" data (fixtures, scores, league tables) comes from ESPN's free public API, which needs no key, via three edge functions scheduled by pg_cron. Nothing is hand-maintained.

Edge fn	Cron	Writes
`pulse-fixtures-sync`	`0 * * * *` hourly	`hub.matches` (fixtures, scores, dates, status)
`pulse-match-detail-sync`	`* * * * *` every min	`hub.match_events / match_stats / lineups` — in-progress matches only
standings ingest	(operational now)	`hub.standings` — the real ESPN table, ingested, not computed

ESPN adapter capabilities = ['fixtures','liveState','standings']. The provider layer is abstracted: ESPN is the free primary, and the dispatcher can fail over to other adapters.
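A minimal sketch of capability-based failover, assuming a dispatcher of roughly this shape. Only the three capability names come from the text above; `ProviderAdapter`, `adaptersFor`, `dispatch`, and the `fetch` signature are hypothetical names for illustration, not the project's actual API.

```typescript
// Capability names are taken from the ESPN adapter above; all other names are hypothetical.
type Capability = "fixtures" | "liveState" | "standings";

interface ProviderAdapter {
  name: string;
  capabilities: Capability[];
  // Hypothetical method: fetch raw data for one capability and one league.
  fetch(capability: Capability, leagueId: string): Promise<unknown>;
}

// Adapters that declare a capability, kept in registration order (primary first).
function adaptersFor(adapters: ProviderAdapter[], capability: Capability): ProviderAdapter[] {
  return adapters.filter((a) => a.capabilities.includes(capability));
}

// Try each capable adapter in turn and fail over to the next one when a call throws.
async function dispatch(
  adapters: ProviderAdapter[],
  capability: Capability,
  leagueId: string,
): Promise<{ provider: string; data: unknown }> {
  const errors: string[] = [];
  for (const adapter of adaptersFor(adapters, capability)) {
    try {
      return { provider: adapter.name, data: await adapter.fetch(capability, leagueId) };
    } catch (err) {
      errors.push(`${adapter.name}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`No adapter could serve "${capability}": ${errors.join("; ") || "none registered"}`);
}
```

Registering ESPN first keeps it the primary; an adapter that does not declare a capability is never asked for it, so a fixtures-only fallback cannot be picked for standings.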

Plane 2 — the scraper · standalone, external, NOT edge functions

Where the scraping happens: a self-contained Python scraper (a pulse-scraper Docker container) on Stevie's local machine / homelab VPS — explicitly not inside Supabase and not a pg_cron job. It is scheduled by local cron / apscheduler.

Is it a one-time scrape? No. It runs a one-time historical backfill (everything each source offers, as far back as available), then daily off-peak delta scrapes (01:00–05:00 UTC). It writes raw data into the isolated scrape.* schema (55 tables), one shape per source:

Source	scrape.* tables	What it holds
StatsBomb	`statsbomb_events/_360_frames/_lineups/_matches`	event-level data, freeze frames, lineups
Understat	`understat_shots/_matches/_players/_teams`	xG + shot coordinates
Transfermarkt	`tm_clubs/_players/_transfers/_market_values/_news…`	transfers, valuations, news
TheSportsDB	`tsdb_teams/_players`	badges + metadata
ESPN / fd.co.uk / OpenLigaDB	`espn_matches` · `fdcouk_matches` · `openligadb_matches`	match results across feeds
World Cup history	`wc_*` (~30 tables)	full WC history (squads, goals, refs…)
Wikidata / Wikipedia	`wikidata_*` · `wikipedia_articles`	cross-reference identifiers + prose

A daily edge function, pulse-scrape-etl (0 6 * * *), is the bridge: it reads scrape.*, resolves canonical entities into pulse_curated.*, then projects clean rows into hub.teams / hub.players / hub.match_shots. It currently fails with WORKER_RESOURCE_LIMIT and needs batching, which is why hub.players and hub.match_shots are still empty.
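One way the batching could look is a cursor-driven slice that processes a bounded number of pages per invocation and returns where it stopped. This is a sketch under assumptions, not the actual fix: `runEtlSlice`, `fetchPage`, `write`, and the numeric `id` cursor are all hypothetical, and the real scrape.* columns are not shown in this document.

```typescript
// Hypothetical row shape: assumes each scrape.* row has a monotonically increasing numeric id.
interface ScrapeRow {
  id: number;
}

interface SliceOptions {
  startAfterId: number; // resume point saved by the previous invocation
  pageSize: number;     // rows held in memory at once
  maxPages: number;     // upper bound on work per invocation
}

// Page through a source by id cursor and hand each page to `write`.
// Stops after `maxPages` so a single invocation stays within a fixed budget.
async function runEtlSlice<T extends ScrapeRow>(
  fetchPage: (afterId: number, limit: number) => Promise<T[]>,
  write: (batch: T[]) => Promise<void>,
  opts: SliceOptions,
): Promise<{ lastId: number; done: boolean }> {
  let lastId = opts.startAfterId;
  for (let page = 0; page < opts.maxPages; page++) {
    const rows = await fetchPage(lastId, opts.pageSize);
    if (rows.length === 0) return { lastId, done: true };
    await write(rows);
    lastId = rows[rows.length - 1].id;
    // A short page means the source is exhausted.
    if (rows.length < opts.pageSize) return { lastId, done: true };
  }
  // Budget used up: the caller persists lastId and continues on the next run.
  return { lastId, done: false };
}
```

Returning `lastId` rather than looping to completion lets the caller store the cursor and resume later, so memory and run time per invocation depend on `pageSize * maxPages` and not on the size of the table.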

Plane 3 — entity resolution · pulse_curated.*

The same club or player appears under different names and ids in every source. pulse_curated.canonical_clubs / canonical_players hold one canonical id per real entity, carrying every source's id (tm_club_id, statsbomb_team_id, understat_team_id, tsdb_team_id, wikidata_qid…). Ambiguous matches wait in resolution_queue. This is how the messy scrape.* rows get reconciled into one hub.teams row before serving.
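The resolution step can be sketched as a two-stage lookup: match on the source's own id first, then fall back to a normalized name, and queue anything that is not a unique hit. The source-id column names come from the text above; `canonical_id`, `name`, `ScrapedClub`, `resolveClub`, and the name-matching rule itself are assumptions for illustration, and the real resolver may use different or additional signals.

```typescript
// Source-id columns named in the text above.
type SourceIdKey =
  | "tm_club_id"
  | "statsbomb_team_id"
  | "understat_team_id"
  | "tsdb_team_id"
  | "wikidata_qid";

// `canonical_id` and `name` are hypothetical column names.
type CanonicalClub = { canonical_id: number; name: string } & Partial<Record<SourceIdKey, string>>;

// Hypothetical shape of one scraped club row.
interface ScrapedClub {
  source: SourceIdKey;
  sourceId: string;
  name: string;
}

type Resolution =
  | { kind: "matched"; canonical_id: number }
  | { kind: "queued"; reason: "no_match" | "ambiguous"; candidates: number[] };

// Strip accents, fold case, and collapse punctuation so spelling variants compare equal.
function normalizeName(name: string): string {
  return name
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "")
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, " ")
    .trim();
}

function resolveClub(row: ScrapedClub, canon: CanonicalClub[]): Resolution {
  // 1. Exact hit on the id this source already contributed.
  const byId = canon.find((c) => c[row.source] === row.sourceId);
  if (byId) return { kind: "matched", canonical_id: byId.canonical_id };

  // 2. Fall back to a normalized-name comparison.
  const key = normalizeName(row.name);
  const byName = canon.filter((c) => normalizeName(c.name) === key);
  if (byName.length === 1) return { kind: "matched", canonical_id: byName[0].canonical_id };

  // 3. Zero or several candidates: leave the row for the resolution queue.
  return {
    kind: "queued",
    reason: byName.length === 0 ? "no_match" : "ambiguous",
    candidates: byName.map((c) => c.canonical_id),
  };
}
```

Only a unique match is accepted automatically; both "nothing found" and "more than one candidate" are queued, which mirrors the role resolution_queue plays for ambiguous matches.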

Serving — hub.* → the frontend · the only schema the app reads

hub.* is the clean, canonical, public-read layer, exposed via PostgREST. The React frontend (pulse/sporthub) queries it directly with supabase.schema('hub') in football/data/hooks.ts. Current staging contents:

hub table	Holds	Source	Rows
`leagues`	7 competitions + ESPN ids + logos	ESPN / seed	7
`teams`	clubs + real ESPN badges	ESPN + scrape-etl	135
`matches`	full-season fixtures + scores	ESPN (fixtures-sync)	2130
`standings`	real league tables (ingested)	ESPN standings	168
`players`	squads + nationality	scrape-etl	0 ⚠
`match_events / match_stats / lineups`	live enrichment	match-detail-sync (live-only)	0 ⚠
`match_shots`	shot maps + xG	scrape-etl (Understat)	0 ⚠
`follows / notification_subscriptions`	user follow graph	app	0

hub.matches is the hub's own match header — no foreign key into content.*. That isolation is the whole reason the schema was split out.

Separate — legacy main-pulse · content.* (NOT the SportHub)

Don't confuse these. The original Pulse feed lives in content.*: content.sports_events (fed by pulse-sports-sync-fixtures at */5 and pulse-sports-sync-live at 30s) and content.pulse_articles (fed by the many pulse-espn-news-* crons). The SportHub was deliberately built as a separate hub.* schema so that it owns its data and does not entangle with this legacy news/scores feed.