Kafka → Delta Lake Streaming Pipeline with MLflow

A real-time data pipeline built with Kafka, PySpark, Delta Lake, and MLflow for anomaly detection and streaming observability. Designed for professional certification prep and to simulate a real-world production workload.

Purpose of the Project

This project simulates a production-grade streaming architecture that ingests data from Kafka, processes it using PySpark Structured Streaming, and stores it in Delta Lake. The pipeline enables anomaly detection via MLflow-logged models and demonstrates Spark tuning, observability, and Gold-layer partitioning.
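
A minimal sketch of the ingestion entry point, assuming a Confluent Cloud cluster secured over SASL_SSL; the bootstrap server, topic name (clickstream_events), and credentials are placeholders rather than values from the repository:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-delta-pipeline").getOrCreate()

# Read the Kafka topic as an unbounded streaming DataFrame.
raw_stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "pkc-xxxxx.us-east-1.aws.confluent.cloud:9092")
    .option("subscribe", "clickstream_events")   # hypothetical topic name
    .option("startingOffsets", "earliest")
    # Confluent Cloud authenticates over SASL_SSL; key/secret are placeholders.
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option(
        "kafka.sasl.jaas.config",
        'org.apache.kafka.common.security.plain.PlainLoginModule required '
        'username="<API_KEY>" password="<API_SECRET>";',
    )
    .load()
)
```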

Technologies Used

- Apache Kafka (Confluent Cloud) for stream ingestion
- PySpark for ETL and structured streaming
- Delta Lake for scalable storage and auditability
- MLflow for model logging, tracking, and inference
- Databricks Workflows for orchestration
- Spark UI for performance tuning and benchmarking

Pipeline Overview

The pipeline follows the medallion pattern with three layers: Bronze for raw ingestion, Silver for transformation and deduplication, and Gold for scored outputs and monitoring. Inference is performed in batch with an Isolation Forest model logged to and loaded from MLflow.
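
As a rough sketch of how such a model might be trained and logged (the experiment name, hyperparameters, and feature columns below are assumptions, not values from the repo):

```python
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical training frame; in the pipeline this would be read from the Silver table.
train_df = pd.DataFrame({
    "value": [0.9, 1.1, 1.0, 1.2, 15.0],
    "event_ts": [1_700_000_000, 1_700_000_060, 1_700_000_120, 1_700_000_180, 1_700_000_240],
})

mlflow.set_experiment("kafka-delta-anomaly")  # hypothetical experiment name
with mlflow.start_run(run_name="isolation_forest"):
    model = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)
    model.fit(train_df[["value", "event_ts"]])
    mlflow.log_params({"n_estimators": 100, "contamination": 0.01})
    mlflow.sklearn.log_model(model, artifact_path="model")
```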

The repository includes the full pipeline DAG, a Delta lineage audit via DESCRIBE HISTORY, and sample outputs such as anomaly-score histograms, SQL queries, and Spark task metrics.
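
The lineage audit itself is a single Spark SQL call; for example, against the Gold table described below:

```python
# Inspect the Delta transaction log for the scored Gold table.
history = spark.sql("DESCRIBE HISTORY gold_events_scored")
history.select("version", "timestamp", "operation", "operationParameters").show(truncate=False)
```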

Project Goals

This project was developed to simulate a realistic, cloud-ready streaming architecture suitable for real-time anomaly detection and AI model integration using industry best practices.

It supports Databricks certification preparation and portfolio development, and reflects real-world production standards such as:

  • Cloud-based ingestion (Kafka on Confluent)
  • Modular ETL stages with Delta Lake versioning
  • MLflow-tracked model scoring and inference
  • Structured logging for observability and debugging (a minimal sketch follows this list)
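
For the last point, a structured-logging sketch might look like the following; the JSON field names and stage label are illustrative, not the project's actual log schema:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Emit each record as a single JSON line, easy to ship to a log aggregator."""
    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "stage": getattr(record, "stage", "unknown"),
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("pipeline")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("bronze batch committed", extra={"stage": "bronze"})
```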

Data Flow: Bronze → Silver → Gold

Bronze: Captures raw clickstream-like events directly from Kafka, including noise and duplication. Stored as an append-only Delta table.
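
A sketch of the Bronze write, reusing the raw_stream DataFrame from the ingestion snippet above; the table path and checkpoint location are placeholders:

```python
# Kafka delivers key/value as binary, so cast to strings before landing the raw events.
bronze_query = (
    raw_stream.selectExpr(
        "CAST(key AS STRING) AS key",
        "CAST(value AS STRING) AS value",
        "timestamp",
    )
    .writeStream.format("delta")
    .outputMode("append")                                     # append-only, never updated
    .option("checkpointLocation", "/mnt/checkpoints/bronze")  # placeholder path
    .start("/mnt/delta/bronze_events")                        # placeholder path
)
```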

Silver: Applies parsing, deduplication, and type casting. It creates a clean, queryable view of time-ordered user activity with UNIX timestamps and numerical values.
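
A sketch of those Silver transformations; the event schema and column names are assumptions:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, LongType, StringType, StructField, StructType

# Assumed payload schema for the clickstream-like events.
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("user_id", StringType()),
    StructField("value", DoubleType()),
    StructField("event_ts", LongType()),  # UNIX seconds
])

silver_df = (
    spark.readStream.format("delta").load("/mnt/delta/bronze_events")
    .select(F.from_json(F.col("value"), event_schema).alias("e"))
    .select("e.*")
    .withColumn("event_time", F.col("event_ts").cast("timestamp"))  # epoch seconds -> timestamp
    .withWatermark("event_time", "10 minutes")   # bounds the dedup state Spark retains
    .dropDuplicates(["event_id", "event_time"])
)
```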

Gold: After batch inference, scored events are written to two Delta tables: gold_anomaly_predictions (binary flags) and gold_events_scored (raw anomaly scores + metadata). Both are partitioned and audit-tracked.
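
A sketch of that batch-scoring step, assuming a model logged as in the MLflow snippet above. The run ID, table names, and the event_date partition column are placeholders; note that the pyfunc wrapper of a scikit-learn IsolationForest returns -1/1 labels from predict, so the raw decision scores in gold_events_scored would come from the sklearn flavor's score_samples instead:

```python
import mlflow.pyfunc
from pyspark.sql import functions as F

# -1 = anomaly, 1 = normal from IsolationForest.predict via the pyfunc wrapper.
pred_udf = mlflow.pyfunc.spark_udf(
    spark, model_uri="runs:/<RUN_ID>/model", result_type="double"
)

scored = (
    spark.read.table("silver_events")                       # hypothetical Silver table name
    .withColumn("prediction", pred_udf("value", "event_ts"))
    .withColumn("is_anomaly", (F.col("prediction") < 0).cast("int"))
    .withColumn("event_date", F.to_date("event_time"))      # assumed partition column
)

(scored.select("event_id", "event_date", "is_anomaly")
    .write.format("delta").mode("append").partitionBy("event_date")
    .saveAsTable("gold_anomaly_predictions"))

(scored.write.format("delta").mode("append").partitionBy("event_date")
    .saveAsTable("gold_events_scored"))
```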

Real-World Application

This pipeline architecture mirrors what you'd find in many enterprise environments for use cases such as:

  • Real-time fraud or anomaly detection in financial transactions
  • Monitoring user behavior and performance on high-traffic web apps
  • Automated alerting based on machine learning thresholds
  • Data lake-powered experimentation with versioned ML artifacts

The project showcases the full lifecycle of data and models — from ingestion to monitoring — in a scalable, reproducible, and production-aligned workflow.