In progress

Glowbal University Data Ingestion Pipeline

In-progress case study for an evidence-first university data ingestion and QA pipeline that turns official university source pages into auditable product review datasets.

Python, Supabase, PostgreSQL, Playwright, Serper, Gemini, CSV, pytest

Problem

Glowbal needs reliable international university data for product profiles, matching, and review workflows. The source information is fragmented across official admissions pages, tuition pages, scholarship pages, program catalogs, ranking sources, and application requirement pages.

Manual entry creates two risks: data can be copied without enough context, and reviewers cannot easily trace a profile field back to the page that supports it. This project treats the problem as a data ingestion and QA pipeline, not just a crawler. The goal is to collect approved evidence, extract only supported facts, and produce review-ready datasets without writing directly to the production-facing university table.

Data Sources

Approved official university source URLs across 6 source types.
Official admissions, tuition, scholarship, program, English requirement, and application pages.
QS ranking rows used for ranking normalization and profile matching.
Supabase staging tables for ingestion outputs.
Generated QA reports, field gap reports, retry maps, and import-shaped CSVs.

Architecture

The pipeline follows an evidence-first ingestion pattern:

Approved source URLs
  -> Python crawler
  -> evidence rows with source URL, content hash, and parser version
  -> content quality classification
  -> fact extraction from usable evidence
  -> product profile and program generation
  -> QA reports, matching tags, writer context, and import CSVs
  -> Supabase staging schema

The design separates staging from product-facing data. A 5-table Supabase staging schema holds ingestion sources, evidence, extracted facts, generated profiles, and QA/import outputs while keeping the production-facing public.universities table untouched.
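As a rough illustration of the evidence step, the sketch below builds an evidence row that carries the source URL, a content hash, and a parser version. The EvidenceRow class, the make_evidence helper, the "v1" version label, the whitespace normalization, and the example URL are all assumptions made for illustration, not the project's actual code.

```python
import hashlib
from dataclasses import dataclass

PARSER_VERSION = "v1"  # hypothetical version label


@dataclass(frozen=True)
class EvidenceRow:
    source_url: str
    content_hash: str
    parser_version: str
    text: str


def make_evidence(source_url: str, text: str,
                  parser_version: str = PARSER_VERSION) -> EvidenceRow:
    # Assumption: collapse whitespace before hashing so that trivial
    # formatting changes on the page do not produce a new hash.
    normalized = " ".join(text.split())
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return EvidenceRow(source_url, digest, parser_version, normalized)


# Two fetches of the same page that differ only in whitespace.
a = make_evidence("https://example.edu/admissions", "Apply by  1 March.\n")
b = make_evidence("https://example.edu/admissions", "Apply by 1 March.")
print(a.content_hash == b.content_hash)
```

Storing the hash next to the parser version lets a later crawl tell "the page changed" apart from "the parser changed" when a fact no longer matches its evidence.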

Pipeline Design

The pipeline started with a 50-university pilot and scaled to a 150-university batch across 22 countries.

The ingestion flow is:

Use approved source maps so the crawler only visits reviewed official URLs.
Crawl source pages and store source text with URL, parser version, and content hash.
Classify content quality before extraction.
Extract structured university facts only from evidence that passes the quality gate.
Link every fact back to evidence_id and source_url for auditability.
Generate university profiles, program rows, matching tags, writer context, QA reports, and import-shaped CSVs.
Normalize QS ranking rows and match rankings back to university profiles.
Produce field gap and repair reports so blockers are visible before product import.
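The quality gate and the fact-to-evidence link described in the steps above can be sketched as follows. This is a minimal sketch under stated assumptions: the quality labels, the length threshold, the block markers, and the tuition regex are hypothetical stand-ins, and treating a single tuition figure as tuition_usd_min is done only for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Evidence:
    evidence_id: int
    source_url: str
    text: str


@dataclass
class Fact:
    field: str
    value: str
    evidence_id: int
    source_url: str


MIN_CHARS = 40  # hypothetical threshold for "thin" pages
BLOCK_MARKERS = ("enable javascript", "access denied", "page not found")
TUITION_RE = re.compile(r"tuition[^.]*?\bUSD\s?([\d,]+)", re.IGNORECASE)


def classify_quality(ev: Evidence) -> str:
    text = ev.text.strip().lower()
    if any(marker in text for marker in BLOCK_MARKERS):
        return "blocked"
    if len(text) < MIN_CHARS:
        return "thin"
    return "usable"


def extract_facts(ev: Evidence) -> list[Fact]:
    # Gate first: evidence that is not usable yields no facts at all.
    if classify_quality(ev) != "usable":
        return []
    facts = []
    match = TUITION_RE.search(ev.text)
    if match:
        value = match.group(1).replace(",", "")
        # Every fact carries the id and URL of the evidence it came from.
        facts.append(Fact("tuition_usd_min", value, ev.evidence_id, ev.source_url))
    return facts


good = Evidence(1, "https://example.edu/tuition",
                "Annual tuition for international students is USD 24,500 per year.")
blocked = Evidence(2, "https://example.edu/fees", "Access denied. Tuition is USD 1.")
print(extract_facts(good))
print(extract_facts(blocked))
```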

The CLI is dependency-light: its 19 commands are built mostly on the Python standard library. Optional workflows use Playwright, Serper, OpenAI, and Gemini for crawling, discovery, and assisted extraction.
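A standard-library CLI of this kind is commonly built with argparse subcommands. The sketch below shows the shape only; the program name, the three command names, and their flags are hypothetical and do not reflect the project's real 19 commands.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="ingest")  # hypothetical program name
    sub = parser.add_subparsers(dest="command", required=True)

    crawl = sub.add_parser("crawl", help="fetch approved source URLs")
    crawl.add_argument("--source-map", required=True)

    extract = sub.add_parser("extract", help="extract facts from usable evidence")
    extract.add_argument("--batch", required=True)

    report = sub.add_parser("qa-report", help="write field gap and readiness reports")
    report.add_argument("--batch", required=True)
    return parser


# Parse an explicit argument list so the example runs without a shell.
args = build_parser().parse_args(["crawl", "--source-map", "approved_sources.csv"])
print(args.command, args.source_map)
```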

Data Model

The pipeline outputs staging and product-review datasets instead of directly mutating application tables:

ingestion_sources: approved source records with university identity, source type, source URL, and crawl metadata.
ingestion_evidence: fetched page content with parser version, content hash, quality classification, and source trace.
ingestion_facts: extracted facts linked to evidence_id and source_url.
university_profiles: generated product-ready profile records for review.
program_rows: normalized program-level rows for downstream matching and product import.
matching_tags: derived tags used to support university matching and discovery.
qa_reports: batch readiness, quality score, missing field, and repair reports.
import_ready_csvs: output files shaped for downstream product review/import.
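To make "import-shaped" concrete, the sketch below writes profile records to CSV with a fixed column order, leaving unknown values as empty cells. The university_name and country columns, the tuition_usd_max name, and the profiles_to_csv helper are assumptions; only deadline_summary, application_system, and the tuition fields are named in this case study.

```python
import csv
import io

# Hypothetical column order for the import file.
PROFILE_COLUMNS = [
    "university_name",
    "country",
    "deadline_summary",
    "application_system",
    "tuition_usd_min",
    "tuition_usd_max",
]


def profiles_to_csv(profiles: list[dict]) -> str:
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=PROFILE_COLUMNS,
        extrasaction="ignore",  # drop keys that are not import columns
        lineterminator="\n",
    )
    writer.writeheader()
    for profile in profiles:
        # Missing or None values become empty cells, never invented defaults.
        row = {col: "" if profile.get(col) is None else profile[col]
               for col in PROFILE_COLUMNS}
        writer.writerow(row)
    return buffer.getvalue()


csv_text = profiles_to_csv([
    {"university_name": "Example University", "country": "NL",
     "tuition_usd_min": 12000, "internal_note": "not exported"},
])
print(csv_text)
```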

Data Quality Checks

Source approval gate so extraction starts from known official URLs.
Evidence quality classification before fact extraction.
Required traceability: every extracted fact must link back to evidence_id and source_url.
Field gap reports for missing deadline_summary, application_system, and tuition_usd_min/max.
Batch readiness scoring to separate usable profiles from profiles that need repair.
Retry source maps and source repair workflows for weak or failed source coverage.
QS ranking normalization and matching checks.
Unit tests across validation, crawling, extraction, ranking matching, profile generation, source repair, and export logic.
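The traceability requirement in the list above lends itself to a mechanical check. The following is a minimal sketch, assuming facts and evidence are available as plain dictionaries; the find_untraceable_facts helper and the record shapes are hypothetical.

```python
def find_untraceable_facts(facts: list[dict], evidence_by_id: dict) -> list[tuple]:
    """Return (field, reason) pairs for facts that fail the trace check."""
    problems = []
    for fact in facts:
        evidence = evidence_by_id.get(fact.get("evidence_id"))
        if evidence is None:
            problems.append((fact["field"], "missing or unknown evidence_id"))
        elif fact.get("source_url") != evidence["source_url"]:
            problems.append((fact["field"], "source_url does not match evidence"))
    return problems


evidence_by_id = {1: {"source_url": "https://example.edu/tuition"}}
facts = [
    {"field": "tuition_usd_min", "evidence_id": 1,
     "source_url": "https://example.edu/tuition"},
    {"field": "deadline_summary", "evidence_id": 9,
     "source_url": "https://example.edu/apply"},
    {"field": "application_system", "evidence_id": 1,
     "source_url": "https://example.edu/other"},
]
problems = find_untraceable_facts(facts, evidence_by_id)
print(problems)
```

A batch passes the traceability gate only when this list is empty.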

Current Results

The 150-university batch produced:

2,400 approved source pages across 6 source types.
1,935 usable evidence rows after content quality classification.
23,590 extracted university facts.
100% of facts linked back to evidence_id and source_url.
150 product-ready university profiles.
6,263 program rows.
Matching tags, writer context, QA reports, and import-shaped CSVs.
1,504 processed QS ranking rows.
QS ranking updates or matches for 147 of 150 universities.
0.81 average data quality score.
0.95 matching readiness rate.
134 of 150 profiles at internal-preview QA status.
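A few ratios follow directly from the counts above. The short calculation below derives them; it assumes each usable evidence row corresponds to one approved source page, which the figures suggest but do not state, and the ratio names are illustrative.

```python
# Counts taken from the batch results listed above.
approved_pages = 2400
usable_evidence = 1935
extracted_facts = 23590
profiles = 150
program_rows = 6263
ranking_matched = 147
internal_preview = 134

ratios = {
    # Assumption: one evidence row per approved page.
    "usable_evidence_rate": usable_evidence / approved_pages,
    "facts_per_usable_evidence_row": extracted_facts / usable_evidence,
    "program_rows_per_profile": program_rows / profiles,
    "ranking_match_rate": ranking_matched / profiles,
    "internal_preview_rate": internal_preview / profiles,
}

for name, value in ratios.items():
    print(f"{name}: {value:.3f}")
```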

Challenges

The main challenge is avoiding fabricated or unsupported product data. Tuition, deadline, application system, scholarship, and English requirement fields often appear on different pages, and some official pages are incomplete or hard to parse. The pipeline therefore treats missing data as a QA signal instead of filling gaps with weak defaults.
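Treating missing data as a signal can be as simple as reporting absent fields rather than defaulting them. The sketch below is one way to do that; the REQUIRED_FIELDS tuple reuses the field names mentioned in this case study (with tuition_usd_max assumed from "tuition_usd_min/max"), and the profile shape and helper names are hypothetical.

```python
REQUIRED_FIELDS = (
    "deadline_summary",
    "application_system",
    "tuition_usd_min",
    "tuition_usd_max",
)


def field_gaps(profile: dict) -> list[str]:
    # A field counts as missing only when absent, None, or empty.
    # A real value such as 0 is kept.
    return [f for f in REQUIRED_FIELDS if profile.get(f) in (None, "")]


def gap_report(profiles: list[dict]) -> dict[str, list[str]]:
    report = {}
    for profile in profiles:
        gaps = field_gaps(profile)
        if gaps:
            report[profile["university_name"]] = gaps
    return report


profiles = [
    {"university_name": "Example University A", "deadline_summary": "Rolling",
     "application_system": "Direct", "tuition_usd_min": 12000,
     "tuition_usd_max": 18000},
    {"university_name": "Example University B", "deadline_summary": "",
     "application_system": "Direct", "tuition_usd_min": None},
]
report = gap_report(profiles)
print(report)
```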

Another challenge is batch readiness. Some profiles can be internally reviewed, while others need source repair before import. The project handles this by producing QA reports, field gap reports, and retry maps so reviewers can see exactly which universities and fields are blocked.
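A retry map can be derived from a field gap report by mapping each missing field to the source type most likely to contain it. The mapping below is a guess made for illustration, as are the build_retry_map helper and the example university names.

```python
# Hypothetical mapping from profile field to the source type to re-crawl.
FIELD_TO_SOURCE_TYPE = {
    "deadline_summary": "admissions",
    "application_system": "application",
    "tuition_usd_min": "tuition",
    "tuition_usd_max": "tuition",
}


def build_retry_map(gap_report: dict[str, list[str]]) -> dict[str, list[str]]:
    retry = {}
    for university, missing_fields in gap_report.items():
        # Deduplicate: two missing tuition fields need one tuition retry.
        source_types = sorted(
            {FIELD_TO_SOURCE_TYPE[f] for f in missing_fields
             if f in FIELD_TO_SOURCE_TYPE}
        )
        if source_types:
            retry[university] = source_types
    return retry


gaps = {
    "Example University B": ["deadline_summary", "tuition_usd_min", "tuition_usd_max"],
    "Example University C": ["application_system"],
}
retry_map = build_retry_map(gaps)
print(retry_map)
```

Universities absent from the retry map have no blocking gaps and can proceed to internal review.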

Current Progress

The project is in progress as a production-oriented staging pipeline with QA gates and a repair workflow. It is not positioned as full production automation. Current blockers include required-source coverage of around 0.75 and missing fields such as deadline_summary, application_system, and tuition_usd_min/max.

The current test suite has 38 passing unit tests covering validation, crawling, extraction, ranking matching, profile generation, source repair, and export logic.

Outcome

This project demonstrates data engineering work around ingestion reliability, evidence traceability, quality gates, staging schema design, product-ready exports, and reviewer workflow support. The strongest outcome is not that pages were crawled, but that extracted university data can be audited back to official evidence before it reaches downstream product review.