Projects

Each project is structured around the problem, data sources, pipeline design, data model, quality checks, and outcome.

In progress

Glowbal University Data Ingestion Pipeline

An evidence-first Python ingestion and QA pipeline that standardizes international university data from official sources into product-ready review datasets.

Problem

University data for tuition, deadlines, English requirements, scholarships, rankings, and program details is scattered across official pages, making manual entry slow, inconsistent, and difficult to audit.

Pipeline

  • Scale from a 50-university pilot to a 150-university batch across 22 countries
  • Crawl only approved official source pages and store evidence with source URL, content hash, and parser version
  • Classify evidence quality before extraction so low-quality pages do not produce product facts
  • Extract auditable university facts linked back to `evidence_id` and `source_url`
  • Generate product profiles, program rows, matching tags, writer context, QA reports, and import-shaped CSVs
  • Normalize and match QS ranking rows for downstream profile enrichment

Data Quality

  • 100% of extracted facts linked to evidence and source URL for auditability
  • Quality gates prevent facts from being generated from weak or unapproved evidence
  • Field gap reports identify missing deadlines, application systems, and tuition ranges
  • Retry source maps and source repair workflows isolate batch blockers
  • QS ranking normalization and matching improved ranking coverage for 147 of 150 universities
  • 38 unit tests cover validation, crawling, extraction, ranking matching, profile generation, repair, and export logic

Data Sources

Approved university source URLs, Official admissions pages, Official tuition and scholarship pages, Program catalog pages, QS ranking data, Supabase staging tables

Tech Stack

Python, Supabase, PostgreSQL, Playwright, Serper, OpenAI, Gemini, CSV, pytest

Data Model

ingestion_sources, ingestion_evidence, ingestion_facts, university_profiles, program_rows, matching_tags, qa_reports, import_ready_csvs

Deployment link: Not available yet

Ready

Ecommerce Market Batch ETL Pipeline

A daily Airflow pipeline that extracts e-commerce product data, validates and loads market snapshots into PostgreSQL, and supports Metabase dashboards.

Problem

Market analysts need a repeatable way to track competitor prices, ratings, product categories, and pipeline quality over time instead of collecting product data manually.

Pipeline

  • Run the `ecommerce_market_etl` Airflow DAG daily at 00:00 UTC
  • Extract product data with retry, timeout handling, pagination support, and raw JSON staging
  • Validate schema drift and transform product rows with Pandas and Pandera
  • UPSERT clean daily snapshots into PostgreSQL and store rejected rows with reason codes
  • Expose checked-in SQL queries for Metabase product and pipeline dashboards

Data Quality

  • Schema drift validation before transformation
  • Critical-field rejection for missing product IDs and invalid prices
  • Rating clipping and category normalization
  • Idempotent UPSERT on product, source, and extraction date
  • Pipeline audit rows with row counts, status, and data quality score

Data Sources

FakeStore API, Tiki scraper, Raw JSON staging files, Airflow execution metadata

Tech Stack

Python, Apache Airflow, Pandas, Pandera, PostgreSQL, SQLAlchemy, Metabase, Docker, pytest

Data Model

products_market, pipeline_runs, rejected_records, staging/{date}/raw_products.json, archive/YYYY/MM/DD/raw_products.json, sql/queries Metabase cards

Deployment link: Not available yet

In progress

Formula-1 Lakehouse Analytics

An ELT lakehouse analytics platform for Formula 1 historical results, race sessions, telemetry, strategy, and weather insights.

Problem

Formula 1 data is fragmented across historical APIs, session endpoints, and telemetry sources, making reliable race performance and strategy analysis difficult to reproduce.

Pipeline

  • Use Airflow to ingest historical, session, lap, stint, weather, position, and selected telemetry data
  • Load raw API responses into a Bronze layer on MinIO/S3 with source and race/session partitions
  • Transform Bronze data into typed, deduplicated Silver Parquet datasets with PySpark
  • Build Gold analytical marts with dbt for race overview, driver performance, pit stops, tyre strategy, and weather impact
  • Serve Gold datasets through DuckDB and Metabase dashboards

Data Quality

  • Raw Bronze preservation for auditability and reprocessing
  • Canonical identifier mapping across OpenF1 and Jolpica-F1
  • Silver schema checks for race, driver, lap, weather, telemetry, and result datasets
  • dbt tests for not-null keys, uniqueness, accepted values, and metric ranges
  • Airflow logs for source, parameters, output paths, record counts, and errors

Data Sources

OpenF1 API, Jolpica-F1 API, FastF1 Python library, Weather and session endpoints, Raw Bronze lakehouse files

Tech Stack

Python, Apache Airflow, MinIO, PySpark, dbt, DuckDB, Metabase, Parquet, Docker

Data Model

bronze.openf1, bronze.jolpica, silver.races, silver.drivers, silver.sessions, silver.laps, silver.weather, silver.stints, fact_lap, fact_car_telemetry, gold_driver_race_performance, gold_pit_stop_efficiency, gold_tyre_strategy_analysis, gold_weather_impact_analysis, gold_race_overview

Deployment link: Not available yet

Planning

Fintech Sentinel

A real-time transaction monitoring and fraud detection lakehouse for simulated digital banking data.

Problem

Risk teams need trusted, low-latency transaction data to detect fraud patterns, audit historical activity, and serve analytics without moving data between systems.

Pipeline

  • Generate synthetic users, accounts, and transaction inserts in PostgreSQL
  • Capture inserts and updates with Debezium CDC and publish them to Kafka topics
  • Evaluate fraud rules in PySpark Structured Streaming and write enriched events to Bronze Iceberg tables
  • Use Airflow and dbt to build Silver and Gold models for analyst-ready fraud and account metrics

Data Quality

  • CDC completeness checks against source transaction counts
  • Great Expectations validation on Bronze and Silver promotion gates
  • dbt tests for required keys, accepted values, and model relationships
  • SLO checks for 5-second fraud latency and 15-minute Gold freshness

Data Sources

PostgreSQL CDC events, Synthetic PaySim transactions, Users table, Accounts table

Tech Stack

PostgreSQL, Debezium, Kafka, PySpark, Apache Iceberg, MinIO, Nessie, dbt, Airflow, Great Expectations, Trino, Grafana, Docker

Data Model

users, accounts, transactions, bronze.transactions_raw, silver_transactions, silver_accounts_scd2, gold_fraud_by_region, gold_fraud_alerts_15min, gold_ml_features

Deployment link: Not available yet