Behind every dashboard and AI feature sits a harder, less visible problem: getting trustworthy data from dozens of operational systems into one governed place, quickly and cheaply enough to build on. This enterprise data platform, a self-hosted, open-source lakehouse, ingests data from across the business, refines it through a medallion architecture, and serves it to BI, machine learning, web portals, and downstream applications. It is built deliberately on open-source components running on the organization's own infrastructure, in order to control cost, avoid lock-in, and keep sensitive data inside the security perimeter.
Architecture
- Secure connectivity — every source connects through a site-to-site VPN, one controlled entry point instead of ad-hoc integrations.
- Two-lane ingestion — a hot path (stream engine + API gateway) for near-real-time events; a cold path (batch / CDC) for historical loads.
- Medallion lakehouse — open Delta Lake over object storage: Bronze (raw) → Silver (cleaned) → Gold (business-ready marts).
- Refinery on Kubernetes — Spark / Polars, Airflow, and dbt; metadata indexed in an OpenMetadata catalog.
- Serving layer — Gold published to fit-for-purpose stores: Microsoft Fabric for BI, Postgres & MS SQL for portals, Redis for low-latency reads.
- Security & ops — Azure Key Vault for secrets, Terraform for infrastructure-as-code, SigNoz for observability.
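The Bronze → Silver → Gold flow listed above can be illustrated with a minimal sketch. It uses plain Python as a stand-in for the Spark / Polars jobs and Delta tables of the real platform; the record fields (`order_id`, `region`, `amount`) and the function names `to_silver` and `to_gold` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical raw events as they might land in Bronze (field names are illustrative).
bronze = [
    {"order_id": "A1", "region": "north", "amount": "120.50"},
    {"order_id": "A2", "region": "north", "amount": "n/a"},     # unparseable amount
    {"order_id": "A1", "region": "north", "amount": "120.50"},  # duplicate of A1
    {"order_id": "A3", "region": "south", "amount": "80"},
]


def to_silver(rows):
    """Clean Bronze rows: drop duplicates and rows whose amount cannot be parsed."""
    seen, silver = set(), []
    for row in rows:
        if row["order_id"] in seen:
            continue  # duplicate order, already kept once
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # reject rows with a malformed amount
        seen.add(row["order_id"])
        silver.append(
            {"order_id": row["order_id"], "region": row["region"], "amount": amount}
        )
    return silver


def to_gold(rows):
    """Aggregate Silver rows into a business-ready mart: revenue per region."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'north': 120.5, 'south': 80.0}
```

Each layer only reads from the one before it, so a fix to the cleaning rules in `to_silver` flows into every Gold mart without touching the raw Bronze data.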
Key strategic decisions
- Open-source, self-hosted by design — for portability, cost control, and keeping regulated data inside the perimeter.
- Medallion as a contract — Bronze/Silver/Gold are quality gates with tests, so downstream teams trust a Gold mart without re-verifying it.
- One refinery, many serving stores — refine once, publish to fit-for-purpose stores.
- Governance & observability as first-class citizens — catalog, secrets, IaC, and monitoring designed in from the start.
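Two of the decisions above, "medallion as a contract" and "one refinery, many serving stores", can be sketched together: a Gold mart is checked once against a gate, and only then fanned out to every serving store. This is a simplified illustration under stated assumptions, not the platform's actual implementation: the `Gate` class, the `publish` function, the check names, and the in-memory lists standing in for the BI, portal, and cache stores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Gate:
    """A named quality gate: a set of checks a layer must pass (hypothetical helper)."""
    name: str
    checks: dict  # check name -> predicate taking the list of rows

    def failures(self, rows):
        """Return the names of all checks that the rows do not satisfy."""
        return [name for name, check in self.checks.items() if not check(rows)]


def publish(rows, gate, sinks):
    """Validate once, then fan out to every sink; nothing is written if the gate fails."""
    failed = gate.failures(rows)
    if failed:
        raise ValueError(f"{gate.name} gate failed: {', '.join(failed)}")
    for sink in sinks.values():
        sink(rows)
    return list(sinks)


# Illustrative contract for a Gold revenue mart.
gold_gate = Gate(
    name="gold",
    checks={
        "not_empty": lambda rows: len(rows) > 0,
        "unique_region": lambda rows: len({r["region"] for r in rows}) == len(rows),
        "non_negative_revenue": lambda rows: all(r["revenue"] >= 0 for r in rows),
    },
)

# In-memory lists stand in for the real serving stores (BI, portal database, cache).
stores = {"bi": [], "portal_db": [], "cache": []}
sinks = {name: store.extend for name, store in stores.items()}

mart = [{"region": "north", "revenue": 120.5}, {"region": "south", "revenue": 80.0}]
published = publish(mart, gold_gate, sinks)

# A mart that violates the contract is rejected before any store is touched.
bad_mart = mart + [{"region": "north", "revenue": -5.0}]
try:
    publish(bad_mart, gold_gate, sinks)
    error = None
except ValueError as exc:
    error = str(exc)

print(published)
print(error)
```

Because the gate runs before any sink is called, a failing mart leaves all serving stores unchanged, which is what lets downstream teams rely on Gold without re-verifying it.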