Kafka → Delta Lake Streaming Pipeline with MLflow

A real-time data pipeline built with Kafka, PySpark, Delta Lake, and MLflow for anomaly detection and streaming observability. Designed for professional certification prep and to simulate a real-world production workload.

Purpose of the Project

This project simulates a production-grade streaming architecture that ingests data from Kafka, processes it using PySpark Structured Streaming, and stores it in Delta Lake. The pipeline enables anomaly detection via MLflow-logged models and demonstrates Spark tuning, observability, and Gold-layer partitioning.
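
A minimal sketch of the ingestion entry point, assuming a Confluent Cloud cluster secured over SASL_SSL; the bootstrap server, topic name (clickstream_events), and credentials are placeholders rather than values from the repository:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-delta-pipeline").getOrCreate()

# Read the Kafka topic as an unbounded streaming DataFrame.
raw_stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "pkc-xxxxx.us-east-1.aws.confluent.cloud:9092")
    .option("subscribe", "clickstream_events")   # hypothetical topic name
    .option("startingOffsets", "earliest")
    # Confluent Cloud authenticates over SASL_SSL; key/secret are placeholders.
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option(
        "kafka.sasl.jaas.config",
        'org.apache.kafka.common.security.plain.PlainLoginModule required '
        'username="<API_KEY>" password="<API_SECRET>";',
    )
    .load()
)
```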

Technologies Used

- Apache Kafka (Confluent Cloud) for stream ingestion
- PySpark for ETL and structured streaming
- Delta Lake for scalable storage and auditability
- MLflow for model logging, tracking, and inference
- Databricks Workflows for orchestration
- Spark UI for performance tuning and benchmarking

Pipeline Overview

The pipeline follows the medallion pattern with three layers: Bronze for raw ingestion, Silver for transformation and deduplication, and Gold for scored outputs and monitoring. Inference is performed in batch with an Isolation Forest model logged to and loaded from MLflow.
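
As a rough sketch of how such a model might be trained and logged (the experiment name, hyperparameters, and feature columns below are assumptions, not values from the repo):

```python
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical training frame; in the pipeline this would be read from the Silver table.
train_df = pd.DataFrame({
    "value": [0.9, 1.1, 1.0, 1.2, 15.0],
    "event_ts": [1_700_000_000, 1_700_000_060, 1_700_000_120, 1_700_000_180, 1_700_000_240],
})

mlflow.set_experiment("kafka-delta-anomaly")  # hypothetical experiment name
with mlflow.start_run(run_name="isolation_forest"):
    model = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)
    model.fit(train_df[["value", "event_ts"]])
    mlflow.log_params({"n_estimators": 100, "contamination": 0.01})
    mlflow.sklearn.log_model(model, artifact_path="model")
```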

The repository includes the full pipeline DAG, a Delta lineage audit via DESCRIBE HISTORY, and sample outputs such as anomaly-score histograms, SQL queries, and Spark task metrics.
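
The lineage audit itself is a single Spark SQL call; for example, against the Gold table described below:

```python
# Inspect the Delta transaction log for the scored Gold table.
history = spark.sql("DESCRIBE HISTORY gold_events_scored")
history.select("version", "timestamp", "operation", "operationParameters").show(truncate=False)
```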

Project Goals

This project was developed to simulate a realistic, cloud-ready streaming architecture suitable for real-time anomaly detection and AI model integration using industry best practices.

It supports Databricks certification preparation and portfolio development, and reflects real-world production standards such as:

  • Cloud-based ingestion (Kafka on Confluent)
  • Modular ETL stages with Delta Lake versioning
  • MLflow-tracked model scoring and inference
  • Structured logging for observability and debugging (a minimal sketch follows this list)
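
For the last point, a structured-logging sketch might look like the following; the JSON field names and stage label are illustrative, not the project's actual log schema:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Emit each record as a single JSON line, easy to ship to a log aggregator."""
    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "stage": getattr(record, "stage", "unknown"),
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("pipeline")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("bronze batch committed", extra={"stage": "bronze"})
```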

Data Flow: Bronze → Silver → Gold

Bronze: Captures raw clickstream-like events directly from Kafka, including noise and duplication. Stored as an append-only Delta table.
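
A sketch of the Bronze write, reusing the raw_stream DataFrame from the ingestion snippet above; the table path and checkpoint location are placeholders:

```python
# Kafka delivers key/value as binary, so cast to strings before landing the raw events.
bronze_query = (
    raw_stream.selectExpr(
        "CAST(key AS STRING) AS key",
        "CAST(value AS STRING) AS value",
        "timestamp",
    )
    .writeStream.format("delta")
    .outputMode("append")                                     # append-only, never updated
    .option("checkpointLocation", "/mnt/checkpoints/bronze")  # placeholder path
    .start("/mnt/delta/bronze_events")                        # placeholder path
)
```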

Silver: Applies parsing, deduplication, and type casting. It creates a clean, queryable view of time-ordered user activity with UNIX timestamps and numerical values.
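
A sketch of those Silver transformations; the event schema and column names are assumptions:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, LongType, StringType, StructField, StructType

# Assumed payload schema for the clickstream-like events.
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("user_id", StringType()),
    StructField("value", DoubleType()),
    StructField("event_ts", LongType()),  # UNIX seconds
])

silver_df = (
    spark.readStream.format("delta").load("/mnt/delta/bronze_events")
    .select(F.from_json(F.col("value"), event_schema).alias("e"))
    .select("e.*")
    .withColumn("event_time", F.col("event_ts").cast("timestamp"))  # epoch seconds -> timestamp
    .withWatermark("event_time", "10 minutes")   # bounds the dedup state Spark retains
    .dropDuplicates(["event_id", "event_time"])
)
```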

Gold: After batch inference, scored events are written to two Delta tables: gold_anomaly_predictions (binary flags) and gold_events_scored (raw anomaly scores + metadata). Both are partitioned and audit-tracked.
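
A sketch of that batch-scoring step, assuming a model logged as in the MLflow snippet above. The run ID, table names, and the event_date partition column are placeholders; note that the pyfunc wrapper of a scikit-learn IsolationForest returns -1/1 labels from predict, so the raw decision scores in gold_events_scored would come from the sklearn flavor's score_samples instead:

```python
import mlflow.pyfunc
from pyspark.sql import functions as F

# -1 = anomaly, 1 = normal from IsolationForest.predict via the pyfunc wrapper.
pred_udf = mlflow.pyfunc.spark_udf(
    spark, model_uri="runs:/<RUN_ID>/model", result_type="double"
)

scored = (
    spark.read.table("silver_events")                       # hypothetical Silver table name
    .withColumn("prediction", pred_udf("value", "event_ts"))
    .withColumn("is_anomaly", (F.col("prediction") < 0).cast("int"))
    .withColumn("event_date", F.to_date("event_time"))      # assumed partition column
)

(scored.select("event_id", "event_date", "is_anomaly")
    .write.format("delta").mode("append").partitionBy("event_date")
    .saveAsTable("gold_anomaly_predictions"))

(scored.write.format("delta").mode("append").partitionBy("event_date")
    .saveAsTable("gold_events_scored"))
```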

Real-World Application

This pipeline architecture mirrors what you'd find in many enterprise environments for use cases such as:

  • Real-time fraud or anomaly detection in financial transactions
  • Monitoring user behavior and performance on high-traffic web apps
  • Automated alerting based on machine learning thresholds
  • Data lake-powered experimentation with versioned ML artifacts

The project showcases the full lifecycle of data and models — from ingestion to monitoring — in a scalable, reproducible, and production-aligned workflow.