A real-time data pipeline built with Kafka, PySpark, Delta Lake, and MLflow for anomaly detection and streaming observability. Designed for professional certification prep and real-world simulation.
This project simulates a production-grade streaming architecture that ingests data from Kafka, processes it with PySpark Structured Streaming, and stores it in Delta Lake. The pipeline enables anomaly detection via MLflow-logged models and demonstrates Spark tuning, observability, and Gold-layer partitioning. Core components:
- Apache Kafka (Confluent Cloud) for stream ingestion
- PySpark for ETL and Structured Streaming
- Delta Lake for scalable storage and auditability
- MLflow for model logging, tracking, and inference
- Databricks Workflows for orchestration
- Spark UI for performance tuning and benchmarking
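Below is a minimal sketch of the ingestion step, assuming a Confluent Cloud bootstrap server, API key/secret, and topic name (all placeholders); the project's actual configuration may differ.

```python
from pyspark.sql import SparkSession

# Sketch: read the raw event stream from Kafka (Confluent Cloud).
# Bootstrap server, topic name, and credentials are placeholders.
spark = SparkSession.builder.appName("streaming-anomaly-pipeline").getOrCreate()

raw_stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "<confluent-bootstrap>:9092")
    .option("subscribe", "events")  # hypothetical topic name
    .option("startingOffsets", "latest")
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option(
        "kafka.sasl.jaas.config",
        'org.apache.kafka.common.security.plain.PlainLoginModule required '
        'username="<API_KEY>" password="<API_SECRET>";',
    )
    .load()
)

# Kafka delivers key/value as binary; keep the decoded payload plus ingestion metadata.
events = raw_stream.selectExpr(
    "CAST(key AS STRING) AS key",
    "CAST(value AS STRING) AS value",
    "topic", "partition", "offset", "timestamp",
)
```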
The project is broken down into multiple layers: Bronze for raw ingestion, Silver for transformation and deduplication, and Gold for scored outputs and monitoring. Inference is performed using a batch-scored Isolation Forest model managed by MLflow.
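As a rough sketch of how such a model could be trained and logged, the snippet below fits an Isolation Forest on Silver-layer features and registers it with MLflow; the table name, feature columns, hyperparameters, and registered model name are illustrative assumptions, not the project's exact configuration.

```python
import mlflow
import mlflow.sklearn
from sklearn.ensemble import IsolationForest

# Sketch: fit an Isolation Forest on Silver-layer features and log it to MLflow
# so the batch-scoring job can load it later. Column names, parameters, and the
# registered model name are assumptions.
features_pdf = (
    spark.table("silver_events")                  # assumed Silver table name
         .select("event_value", "event_ts_unix")  # assumed numeric feature columns
         .toPandas()
)

with mlflow.start_run(run_name="isolation-forest-anomaly"):
    model = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)
    model.fit(features_pdf)
    mlflow.log_params({"n_estimators": 100, "contamination": 0.01})
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="anomaly_isolation_forest",  # assumed registry name
    )
```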
Includes the full pipeline DAG, a Delta lineage audit via DESCRIBE HISTORY, and sample outputs such as anomaly score histograms, SQL queries, and Spark task metrics.
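The lineage audit can be reproduced from a notebook with a single Delta command (the table name here matches the Gold table described below):

```python
# Audit Delta lineage: every write to the table appears as a versioned operation.
history = spark.sql("DESCRIBE HISTORY gold_events_scored")
history.select("version", "timestamp", "operation", "operationParameters") \
       .show(truncate=False)
```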
This project was developed to simulate a realistic, cloud-ready streaming architecture suitable for real-time anomaly detection and AI model integration using industry best practices.
It supports Databricks certification preparation and portfolio development, and reflects real-world production standards across the Bronze, Silver, and Gold layers described below:
Bronze: Captures raw clickstream-like events directly from Kafka, including noise and duplication. Stored as an append-only Delta table.
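Continuing the ingestion sketch above (where events holds the decoded Kafka payload), the Bronze write might look like the following; the table name and checkpoint path are placeholders.

```python
# Sketch: persist the raw stream as an append-only Bronze Delta table.
# `events` comes from the ingestion sketch; checkpoint path and table name are placeholders.
bronze_query = (
    events.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/checkpoints/bronze_events")
    .toTable("bronze_events")
)
```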
Silver: Applies parsing, deduplication, and type casting. It creates a clean, queryable view of time-ordered user activity with UNIX timestamps and numerical values.
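A sketch of the Silver transformation, assuming a simple JSON event schema; the column names, watermark window, and deduplication keys are illustrative.

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, LongType

# Sketch: parse the Bronze payload, cast types, and drop replayed duplicates.
# The event schema and dedup keys are assumptions for illustration.
event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("event_value", DoubleType()),
    StructField("event_ts_unix", LongType()),
])

silver = (
    spark.readStream.table("bronze_events")
    .withColumn("parsed", F.from_json(F.col("value"), event_schema))
    .select("parsed.*", "timestamp")
    .withWatermark("timestamp", "10 minutes")      # bound state kept for late duplicates
    .dropDuplicates(["user_id", "event_ts_unix"])
)

silver_query = (
    silver.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/checkpoints/silver_events")
    .toTable("silver_events")
)
```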
Gold: After batch inference, scored events are written to two Delta tables: gold_anomaly_predictions (binary flags) and gold_events_scored (raw anomaly scores + metadata). Both are partitioned and audit-tracked.
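A sketch of the batch-scoring step that produces these two tables; the model URI, feature columns, and partition column (event_date) are assumptions.

```python
import mlflow.sklearn
from pyspark.sql import functions as F

# Sketch: load the registered Isolation Forest, score Silver events in batch,
# and write the two Gold tables. URIs, columns, and partitioning are assumptions.
model = mlflow.sklearn.load_model("models:/anomaly_isolation_forest/Production")

silver_pdf = (
    spark.table("silver_events")
         .select("user_id", "event_ts_unix", "event_value")
         .toPandas()
)
feature_cols = ["event_value", "event_ts_unix"]

# Isolation Forest returns -1 for anomalies; convert to a 0/1 flag and keep the raw score.
silver_pdf["is_anomaly"] = (model.predict(silver_pdf[feature_cols]) == -1).astype(int)
silver_pdf["anomaly_score"] = model.score_samples(silver_pdf[feature_cols])

scored = (
    spark.createDataFrame(silver_pdf)
         .withColumn("event_date", F.to_date(F.from_unixtime("event_ts_unix")))
)

# gold_anomaly_predictions: binary flags only.
(scored.select("user_id", "event_ts_unix", "is_anomaly", "event_date")
       .write.format("delta").mode("append").partitionBy("event_date")
       .saveAsTable("gold_anomaly_predictions"))

# gold_events_scored: raw anomaly scores plus event metadata.
(scored.write.format("delta").mode("append").partitionBy("event_date")
       .saveAsTable("gold_events_scored"))
```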
This pipeline architecture mirrors what you'd find in many enterprise environments for use cases such as real-time anomaly detection and streaming observability.
The project showcases the full lifecycle of data and models — from ingestion to monitoring — in a scalable, reproducible, and production-aligned workflow.