Data Platform · 30+ sources · Self-hosted

Enterprise Data Platform

A self-hosted, open-source lakehouse that ingests the whole business, refines it through a medallion architecture, and serves BI, ML, and applications.

Role: Owned the data strategy & platform architecture

Behind every dashboard and AI feature sits a harder, less visible problem: getting trustworthy data from dozens of operational systems into one governed place, fast and cheaply enough to build on. This enterprise data platform (a self-hosted, open-source lakehouse) ingests data from across the business, refines it through a medallion architecture, and serves it to BI, machine learning, web portals, and downstream applications. It was built deliberately on open-source components, running on the organization's own infrastructure, to control cost, avoid lock-in, and keep sensitive data inside the security perimeter.

Architecture

  • Secure connectivity — every source connects through a site-to-site VPN, one controlled entry point instead of ad-hoc integrations.
  • Two-lane ingestion — a hot path (stream engine + API gateway) for near-real-time events; a cold path (batch / CDC) for historical loads.
  • Medallion lakehouse — open Delta Lake over object storage: Bronze (raw) → Silver (cleaned) → Gold (business-ready marts).
  • Refinery on Kubernetes — Spark / Polars, Airflow, and dbt; metadata indexed in an OpenMetadata catalog.
  • Serving layer — Gold published to fit-for-purpose stores: Microsoft Fabric for BI, Postgres & MS SQL for portals, Redis for low-latency reads.
  • Security & ops — Azure Key Vault for secrets, Terraform for infrastructure-as-code, SigNoz for observability.
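
To make the Bronze → Silver → Gold flow concrete, here is a minimal sketch in plain Python. It is an illustration only: the platform runs this kind of logic on Spark / Polars and dbt over Delta Lake, and the record fields (`order_id`, `customer`, `amount`, `order_date`) and function names (`to_silver`, `to_gold`) are hypothetical, not taken from the actual pipelines.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw records as they might land in Bronze:
# untyped strings, a duplicate, and one unparseable amount.
BRONZE = [
    {"order_id": "1001", "customer": " acme ", "amount": "120.50", "order_date": "2024-03-01"},
    {"order_id": "1001", "customer": " acme ", "amount": "120.50", "order_date": "2024-03-01"},
    {"order_id": "1002", "customer": "Globex", "amount": "80", "order_date": "2024-03-01"},
    {"order_id": "1003", "customer": "ACME", "amount": "not-a-number", "order_date": "2024-03-02"},
    {"order_id": "1004", "customer": "acme", "amount": "40.00", "order_date": "2024-03-02"},
]


def to_silver(bronze_rows):
    """Clean Bronze: cast types, normalise names, drop duplicates and bad rows."""
    seen, silver = set(), []
    for row in bronze_rows:
        try:
            cleaned = {
                "order_id": int(row["order_id"]),
                "customer": row["customer"].strip().lower(),
                "amount": float(row["amount"]),
                "order_date": date.fromisoformat(row["order_date"]),
            }
        except (KeyError, ValueError):
            # Assumption: a real pipeline would quarantine the row, not drop it.
            continue
        if cleaned["order_id"] in seen:
            continue  # keep the first occurrence of each order
        seen.add(cleaned["order_id"])
        silver.append(cleaned)
    return silver


def to_gold(silver_rows):
    """Aggregate Silver into a business-ready mart: revenue per customer."""
    revenue = defaultdict(float)
    for row in silver_rows:
        revenue[row["customer"]] += row["amount"]
    return dict(sorted(revenue.items()))


silver = to_silver(BRONZE)
gold = to_gold(silver)
print(gold)
```

Bronze keeps everything exactly as received, so the cleaning rules in `to_silver` can change and be replayed without going back to the source systems.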

Key strategic decisions

  • Open-source, self-hosted by design — for portability, cost control, and keeping regulated data inside the perimeter.
  • Medallion as a contract — Bronze/Silver/Gold are quality gates with tests, so downstream teams trust a Gold mart without re-verifying it.
  • One refinery, many serving stores — refine once, publish to fit-for-purpose stores.
  • Governance & observability as first-class citizens — catalog, secrets, IaC, and monitoring designed in from the start.
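
The "medallion as a contract" and "one refinery, many serving stores" decisions can be sketched together: a Gold mart is published only if every check in its contract passes, and the same validated rows then fan out to each serving store. This is a hedged illustration in plain Python; the check names, the `InMemoryStore` class, and the store labels are hypothetical stand-ins, and in the platform itself this role would be played by pipeline tests and real connectors to the serving stores.

```python
# Hypothetical contract for one Gold mart: every named check must pass
# before the mart is allowed to leave the refinery.
GOLD_CONTRACT = {
    "not_empty": lambda rows: len(rows) > 0,
    "customer_unique": lambda rows: len({r["customer"] for r in rows}) == len(rows),
    "revenue_non_negative": lambda rows: all(r["revenue"] >= 0 for r in rows),
}


def failed_checks(rows, contract):
    """Return the names of the contract checks that the rows violate."""
    return [name for name, check in contract.items() if not check(rows)]


class InMemoryStore:
    """Stand-in for a serving store (BI model, portal database, cache)."""

    def __init__(self, name):
        self.name = name
        self.rows = []

    def publish(self, rows):
        self.rows = list(rows)  # full refresh, for simplicity


def publish_gold(rows, contract, stores):
    """Gate first, then fan out: refine once, publish to every store."""
    failures = failed_checks(rows, contract)
    if failures:
        raise ValueError(f"Gold gate failed: {failures}")
    for store in stores:
        store.publish(rows)
    return [store.name for store in stores]


good_mart = [
    {"customer": "acme", "revenue": 160.5},
    {"customer": "globex", "revenue": 80.0},
]
bad_mart = good_mart + [{"customer": "acme", "revenue": -5.0}]

stores = [InMemoryStore("bi"), InMemoryStore("portal_db"), InMemoryStore("cache")]
published_to = publish_gold(good_mart, GOLD_CONTRACT, stores)
bad_failures = failed_checks(bad_mart, GOLD_CONTRACT)
print(published_to, bad_failures)
```

Because the gate runs once, upstream of every store, a consumer reading from any of them sees rows that passed the same checks.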

Why it matters

The instinct in most organizations is to buy the managed warehouse and move on. Building an open, self-hosted lakehouse instead kept cost and data control in-house and gave every downstream product a portable, governed foundation.

Impact

A single governed source of truth replaced a patchwork of point-to-point integrations and manual extracts; cost and lock-in stayed under control; sensitive data stayed inside the perimeter; and it became the foundation the Customer-360, revenue-intelligence, and AI products all sit on.

Built with

Delta Lake · Spark · Polars · Airflow · dbt · Kubernetes · OpenMetadata · Microsoft Fabric · PostgreSQL · MS SQL · Redis · Terraform · SigNoz