The topology
The four stages, walked through
The "what's happening now" data (fixtures, scores, league tables) comes from ESPN's free public API, which needs no key, via three edge functions scheduled by pg_cron. Nothing is hand-maintained.
| Edge fn | Cron | Writes |
|---|---|---|
| pulse-fixtures-sync | 0 * * * * (hourly) | hub.matches (fixtures, scores, dates, status) |
| pulse-match-detail-sync | * * * * * (every minute) | hub.match_events / match_stats / lineups (in-progress matches only) |
| standings ingest | (operational now) | hub.standings — the real ESPN table, ingested, not computed |
ESPN adapter capabilities = ['fixtures','liveState','standings']. The provider layer is abstracted (ESPN is the free primary; the dispatcher can fail over to other adapters).
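The failover behaviour described above can be sketched as follows. This is a minimal illustration, not the project's dispatcher: the `Adapter` class, the `dispatch` function, the `backup` adapter, and the simulated outage are all hypothetical names introduced here. Only the ESPN capability list comes from the text. It is written in Python for brevity, although the real provider layer may well live in a different language.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Adapter:
    """One data provider. `fetch` stands in for a real HTTP call (hypothetical)."""
    name: str
    capabilities: frozenset[str]
    fetch: Callable[[str], dict]


def dispatch(adapters: list[Adapter], capability: str) -> dict:
    """Try adapters in priority order and fall through to the next on failure."""
    errors: list[str] = []
    for adapter in adapters:
        if capability not in adapter.capabilities:
            continue  # this provider cannot serve the request at all
        try:
            return {"provider": adapter.name, "data": adapter.fetch(capability)}
        except Exception as exc:  # a real dispatcher would catch narrower errors
            errors.append(f"{adapter.name}: {exc}")
    raise RuntimeError(f"no adapter could serve {capability!r}: {errors}")


def _espn_down(capability: str) -> dict:
    """Simulate an ESPN outage so the failover path is exercised."""
    raise ConnectionError("simulated ESPN outage")


# ESPN is listed first because it is the primary; `backup` is an invented adapter.
espn = Adapter("espn", frozenset({"fixtures", "liveState", "standings"}), _espn_down)
backup = Adapter("backup", frozenset({"fixtures"}), lambda cap: {"rows": []})

result = dispatch([espn, backup], "fixtures")
print(result["provider"])
```

Ordering the list by priority keeps the policy in data rather than in code: adding or demoting a provider means changing the list, not the dispatcher.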
Where the scraping happens: a self-contained Python scraper (a pulse-scraper Docker container) on Stevie's local machine / homelab VPS — explicitly not inside Supabase and not a pg_cron job. It is scheduled by local cron / apscheduler.
Is it a one-time scrape? No — it does a one-time historical backfill (everything each source offers, going back as far as available) and then runs daily off-peak delta scrapes (01:00–05:00 UTC). It writes raw data into the isolated scrape.* schema (55 tables), one shape per source:
| Source | scrape.* tables | What it holds |
|---|---|---|
| StatsBomb | statsbomb_events/_360_frames/_lineups/_matches | event-level data, freeze frames, lineups |
| Understat | understat_shots/_matches/_players/_teams | xG + shot coordinates |
| Transfermarkt | tm_clubs/_players/_transfers/_market_values/_news… | transfers, valuations, news |
| TheSportsDB | tsdb_teams/_players | badges + metadata |
| ESPN / fd.co.uk / OpenLigaDB | espn_matches · fdcouk_matches · openligadb_matches | match results across feeds |
| World Cup history | wc_* (~30 tables) | full WC history (squads, goals, refs…) |
| Wikidata / Wikipedia | wikidata_* · wikipedia_articles | cross-reference identifiers + prose |
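The backfill-then-delta schedule described before the table could be decided per source roughly like this. The sketch assumes the scraper records a last-scraped timestamp per source and that a backfill may start at any hour; neither detail is stated in the text. `choose_mode` and `in_off_peak_window` are hypothetical names. Only the 01:00–05:00 UTC window is taken from the description.

```python
from datetime import datetime, time, timezone
from typing import Optional

OFF_PEAK_START = time(1, 0)  # 01:00 UTC, from the text
OFF_PEAK_END = time(5, 0)    # 05:00 UTC, from the text


def in_off_peak_window(now: datetime) -> bool:
    """True when `now` (timezone-aware) falls inside the off-peak UTC window."""
    utc_time = now.astimezone(timezone.utc).time()
    return OFF_PEAK_START <= utc_time < OFF_PEAK_END


def choose_mode(last_scraped: Optional[datetime], now: datetime) -> str:
    """Pick what a scheduler tick should do for one source."""
    if last_scraped is None:
        return "backfill"  # never scraped: fetch the full history once
    if in_off_peak_window(now):
        return "delta"     # already backfilled: fetch only what changed
    return "skip"          # outside the window: do nothing on this tick


tick = datetime(2024, 1, 2, 2, 30, tzinfo=timezone.utc)
print(choose_mode(None, tick))
print(choose_mode(datetime(2024, 1, 1, tzinfo=timezone.utc), tick))
```

Pass timezone-aware datetimes: `astimezone` treats a naive value as local time, which would shift the window on a machine that is not set to UTC.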
A daily edge function — pulse-scrape-etl (0 6 * * *) — is the bridge: it reads scrape.*, resolves canonical entities into pulse_curated.*, then projects clean rows into hub.teams / hub.players / hub.match_shots. It currently fails with WORKER_RESOURCE_LIMIT (needs batching) — which is why hub.players and hub.match_shots are still empty.
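One way to approach the batching that the ETL needs is to stream rows in fixed-size chunks so that only one chunk is held in memory at a time. The sketch below shows that idea only. `batched`, `project_in_batches`, the `write_batch` callback, and the batch size of 500 are assumptions for illustration. Whether chunking inside a single invocation is enough to stay under the worker limit, or whether the job also has to be split across invocations, is not something the text settles.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator


def batched(rows: Iterable[dict], size: int) -> Iterator[list[dict]]:
    """Yield lists of at most `size` rows without materialising the whole input."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def project_in_batches(
    rows: Iterable[dict],
    write_batch: Callable[[list[dict]], None],
    batch_size: int = 500,  # hypothetical default; tune against the real limit
) -> int:
    """Write rows chunk by chunk and return how many rows were written."""
    written = 0
    for batch in batched(rows, batch_size):
        write_batch(batch)  # stand-in for an upsert into hub.players or hub.match_shots
        written += len(batch)
    return written


sizes: list[int] = []
total = project_in_batches(({"id": i} for i in range(1050)), lambda b: sizes.append(len(b)))
print(total, sizes)
```

Because the input is consumed lazily, peak memory depends on the batch size rather than on the size of scrape.*.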
The same club or player appears under a different name and id in every source. pulse_curated.canonical_clubs and pulse_curated.canonical_players hold one canonical id per real entity, and each row carries every source's id (tm_club_id, statsbomb_team_id, understat_team_id, tsdb_team_id, wikidata_qid…). Ambiguous matches wait in resolution_queue. This is how the messy scrape.* rows are reconciled into one hub.teams row before serving.
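The reconciliation step can be pictured with a small in-memory sketch. The matching rule used here (exact source id first, then a unique normalised-name match, otherwise queue) is an assumption made for illustration; the text does not describe the real matching logic. `CanonicalClub`, `resolve`, `normalize`, and the ids in the example are invented. Only the source-id column names and the idea of a resolution queue come from the description.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CanonicalClub:
    """One real-world club with every source's id attached (hypothetical shape)."""
    canonical_id: int
    name: str
    source_ids: dict[str, str] = field(default_factory=dict)


def normalize(name: str) -> str:
    """Lower-case and drop non-alphanumerics so trivial spelling variants compare equal."""
    return re.sub(r"[^a-z0-9]", "", name.casefold())


def resolve(
    clubs: list[CanonicalClub],
    source_key: str,
    source_id: str,
    name: str,
    queue: list[dict],
) -> Optional[int]:
    """Return the canonical id for a scraped row, or None if it was queued."""
    # 1. The source id is already attached to a canonical club.
    for club in clubs:
        if club.source_ids.get(source_key) == source_id:
            return club.canonical_id
    # 2. Exactly one club matches by normalised name: attach the new source id.
    candidates = [c for c in clubs if normalize(c.name) == normalize(name)]
    if len(candidates) == 1:
        candidates[0].source_ids[source_key] = source_id
        return candidates[0].canonical_id
    # 3. No match or several matches: leave the decision to the resolution queue.
    queue.append({"source_key": source_key, "source_id": source_id, "name": name})
    return None


clubs = [CanonicalClub(1, "Example United", {"tm_club_id": "tm-1"})]  # made-up ids
queue: list[dict] = []
print(resolve(clubs, "understat_team_id", "us-9", "example-united", queue))
print(resolve(clubs, "tsdb_team_id", "ts-4", "Unknown FC", queue), len(queue))
```

Attaching the new source id on a successful name match means the next run resolves that row through the cheaper id lookup in step 1.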
hub.* is the clean, canonical, public-read layer, exposed via PostgREST. The React frontend (pulse/sporthub) queries it directly with supabase.schema('hub') in football/data/hooks.ts. Current staging contents:
| hub table | Holds | Source | Rows |
|---|---|---|---|
| leagues | 7 competitions + ESPN ids + logos | ESPN / seed | 7 |
| teams | clubs + real ESPN badges | ESPN + scrape-etl | 135 |
| matches | full-season fixtures + scores | ESPN (fixtures-sync) | 2130 |
| standings | real league tables (ingested) | ESPN standings | 168 |
| players | squads + nationality | scrape-etl | 0 ⚠ |
| match_events / stats / lineups | live enrichment | match-detail-sync (live-only) | 0 ⚠ |
| match_shots | shot maps + xG | scrape-etl (Understat) | 0 ⚠ |
| follows / notification_subscriptions | user follow graph | app | 0 |
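For readers who want to inspect the hub.* tables outside the React app, the request that a schema-scoped read translates to can be sketched by hand. PostgREST selects a non-default schema for reads through the `Accept-Profile` header, and Supabase serves it under `/rest/v1/`. Everything else below is an assumption: `build_hub_request` is a hypothetical helper, the project URL and key are placeholders, and the filter value `finished` is invented. The sketch only builds the URL and headers and sends nothing.

```python
from urllib.parse import urlencode


def build_hub_request(
    project_url: str,
    anon_key: str,
    table: str,
    filters: dict[str, str],
) -> tuple[str, dict[str, str]]:
    """Build the URL and headers for a read against the hub schema."""
    # PostgREST filters use the form column=operator.value, e.g. status=eq.finished.
    query = urlencode({"select": "*", **filters}, safe="*")
    url = f"{project_url}/rest/v1/{table}?{query}"
    headers = {
        "apikey": anon_key,       # Supabase project key (placeholder here)
        "Accept-Profile": "hub",  # read from the hub schema instead of the default
    }
    return url, headers


url, headers = build_hub_request(
    "https://example.supabase.co",  # placeholder project URL
    "ANON_KEY_PLACEHOLDER",
    "matches",
    {"status": "eq.finished"},      # hypothetical status value
)
print(url)
```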
hub.matches is the hub's own match header — no foreign key into content.*. That isolation is the whole reason the schema was split out.
Don't confuse these. The original Pulse feed lives in content.*: content.sports_events (fed by pulse-sports-sync-fixtures */5 + pulse-sports-sync-live 30s) and content.pulse_articles (fed by the many pulse-espn-news-* crons). The SportHub was deliberately built as a separate hub.* schema so it owns its data and doesn't entangle with this legacy news/scores feed.