Glowbal University Data Ingestion Pipeline
An evidence-first Python ingestion and QA pipeline that standardizes international university data from official sources into product-ready review datasets.
Problem
University data on tuition, deadlines, English requirements, scholarships, rankings, and program details is scattered across official pages, which makes manual entry slow, inconsistent, and difficult to audit.
Pipeline
- Scale from a 50-university pilot to a 150-university batch across 22 countries
- Crawl only approved official source pages and store evidence with source URL, content hash, and parser version
- Classify evidence quality before extraction so low-quality pages do not produce product facts
- Extract auditable university facts linked back to `evidence_id` and `source_url`
- Generate product profiles, program rows, matching tags, writer context, QA reports, and import-shaped CSVs
- Normalize and match QS ranking rows for downstream profile enrichment
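The evidence-first flow in the steps above (approved sources only, evidence stored with a hash and parser version, a quality gate before extraction) can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `APPROVED_SOURCES`, `PARSER_VERSION`, the `Evidence` and `Fact` classes, the `"strong"`/`"weak"` labels, and the word-count heuristic are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass

PARSER_VERSION = "0.1.0"  # hypothetical version string
APPROVED_SOURCES = {"https://www.example.edu/admissions"}  # hypothetical allow-list


@dataclass(frozen=True)
class Evidence:
    evidence_id: str
    source_url: str
    content_hash: str
    parser_version: str
    quality: str  # "strong" or "weak" (assumed labels)


@dataclass(frozen=True)
class Fact:
    field: str
    value: str
    evidence_id: str
    source_url: str


def classify_quality(text: str) -> str:
    """Toy heuristic (assumption): very short pages count as weak evidence."""
    return "strong" if len(text.split()) >= 5 else "weak"


def store_evidence(source_url: str, text: str) -> Evidence:
    """Reject unapproved URLs, then record the hash and parser version."""
    if source_url not in APPROVED_SOURCES:
        raise ValueError(f"unapproved source: {source_url}")
    content_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return Evidence(
        evidence_id=content_hash[:12],  # assumed ID scheme: hash prefix
        source_url=source_url,
        content_hash=content_hash,
        parser_version=PARSER_VERSION,
        quality=classify_quality(text),
    )


def extract_facts(evidence: Evidence, candidates: dict[str, str]) -> list[Fact]:
    """Quality gate: weak evidence yields no facts at all."""
    if evidence.quality != "strong":
        return []
    return [
        Fact(field, value, evidence.evidence_id, evidence.source_url)
        for field, value in candidates.items()
    ]
```

Because every `Fact` copies `evidence_id` and `source_url` from the evidence it came from, a fact cannot exist in this sketch without a traceable source.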
Data Quality
- 100% of extracted facts are linked to evidence and a source URL for auditability
- Quality gates prevent facts from being generated from weak or unapproved evidence
- Field gap reports identify missing deadlines, application systems, and tuition ranges
- Retry source maps and source repair workflows isolate batch blockers
- QS ranking normalization and matching improved ranking coverage for 147 of 150 universities
- 38 unit tests cover validation, crawling, extraction, ranking matching, profile generation, repair, and export logic
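As an illustration of the field gap reports and the evidence-link check listed above, here is a small sketch. The field names in `REQUIRED_FIELDS`, the dictionary shapes, and both function names are hypothetical; the real pipeline's report format is not shown in this document.

```python
# Assumed names for the three fields the gap report tracks.
REQUIRED_FIELDS = ("deadline", "application_system", "tuition_range")


def field_gap_report(profiles: dict[str, dict[str, str]]) -> dict[str, list[str]]:
    """Map each university to the required fields that are missing or empty."""
    report = {}
    for university, fields in profiles.items():
        missing = [name for name in REQUIRED_FIELDS if not fields.get(name)]
        if missing:
            report[university] = missing
    return report


def unlinked_facts(facts: list[dict]) -> list[dict]:
    """Return facts that lack an evidence_id or source_url (should be empty)."""
    return [
        fact for fact in facts
        if not fact.get("evidence_id") or not fact.get("source_url")
    ]
```

A usage example under the same assumptions:

```python
profiles = {
    "Example University": {"deadline": "2025-01-15", "tuition_range": ""},
    "Sample Institute": {
        "deadline": "2025-03-01",
        "application_system": "direct",
        "tuition_range": "10000-12000 EUR",
    },
}
print(field_gap_report(profiles))
# {'Example University': ['application_system', 'tuition_range']}
```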
Data Sources
Approved university source URLs, official admissions pages, official tuition and scholarship pages, program catalog pages, QS ranking data, Supabase staging tables
Tech Stack
Python, Supabase, PostgreSQL, Playwright, Serper, OpenAI, Gemini, CSV, pytest
Data Model
ingestion_sources, ingestion_evidence, ingestion_facts, university_profiles, program_rows, matching_tags, qa_reports, import_ready_csvs
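One way the first three tables could relate is sketched below, using Python's built-in `sqlite3` as a stand-in for the PostgreSQL/Supabase staging tables. Only the table names and the `evidence_id` and `source_url` links come from this document; every other column, the sample rows, and the helper names `build_demo_db` and `audit_trail` are assumptions for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE ingestion_sources (
    source_url TEXT PRIMARY KEY,
    approved   INTEGER NOT NULL            -- assumed flag column
);
CREATE TABLE ingestion_evidence (
    evidence_id    TEXT PRIMARY KEY,
    source_url     TEXT NOT NULL REFERENCES ingestion_sources(source_url),
    content_hash   TEXT NOT NULL,
    parser_version TEXT NOT NULL
);
CREATE TABLE ingestion_facts (
    fact_id     INTEGER PRIMARY KEY,
    evidence_id TEXT NOT NULL REFERENCES ingestion_evidence(evidence_id),
    field       TEXT NOT NULL,             -- assumed column
    value       TEXT NOT NULL              -- assumed column
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with one source, one evidence row, one fact."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce FKs
    conn.executescript(SCHEMA)
    url = "https://www.example.edu/tuition"  # hypothetical source
    conn.execute("INSERT INTO ingestion_sources VALUES (?, 1)", (url,))
    conn.execute(
        "INSERT INTO ingestion_evidence VALUES (?, ?, ?, ?)",
        ("ev-001", url, "abc123", "0.1.0"),
    )
    conn.execute(
        "INSERT INTO ingestion_facts (evidence_id, field, value) VALUES (?, ?, ?)",
        ("ev-001", "tuition_range", "10000-12000 EUR"),
    )
    conn.commit()
    return conn


def audit_trail(conn: sqlite3.Connection) -> list[tuple]:
    """Join each fact back to its evidence row and source URL."""
    query = """
        SELECT f.field, e.source_url, e.content_hash
        FROM ingestion_facts AS f
        JOIN ingestion_evidence AS e ON e.evidence_id = f.evidence_id
        ORDER BY f.fact_id
    """
    return conn.execute(query).fetchall()
```

With foreign keys enforced, inserting a fact whose `evidence_id` does not exist raises `sqlite3.IntegrityError`, which is one possible way to back the "every fact links to evidence" guarantee at the storage layer.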
Deployment link: Not available yet