Apache Spark

Apache Spark is the log processing engine in the LogWise system. It consumes logs from Kafka, transforms them as needed, and writes them to Amazon S3 as partitioned Parquet files.

Overview

Spark jobs run continuously: they read log records from Kafka, transform and enrich them, and write the results to S3 in a partition layout designed for efficient querying with Athena.

Architecture in LogWise

Kafka Topics → Spark Jobs → S3 (Parquet, Partitioned)

Spark handles:

Ingestion: Reads logs from Kafka topics in near real time
Partitioned Storage: Writes logs to S3 in a hierarchical, time-based partition format
Schema Management: Ensures consistent schema across logs using predefined formats

Key Features

Real-time processing - Consumes logs from Kafka topics continuously in micro-batch and streaming modes
Partitioned storage - Writes logs to S3 in hierarchical partition format
Fault tolerance - Checkpointing lets a restarted job resume from its last committed Kafka offsets, which prevents data loss and supports exactly-once processing
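
The ingestion, partitioned write, and checkpointing described above can be sketched with PySpark Structured Streaming. This is a minimal illustration, not the actual LogWise job: the function name, the log schema (`env`, `service_name`, `timestamp`, `message`), the assumption that records are JSON, and all arguments are hypothetical. It also assumes the Spark Kafka connector package is available on the cluster.

```python
# Partition columns, in the order of the S3 layout described in this document.
PARTITION_COLUMNS = ["env", "service_name", "year", "month", "day", "hour", "minute"]

# Spark date_format patterns used to derive the time-based partition columns.
TIME_PARTITION_FORMATS = {
    "year": "yyyy", "month": "MM", "day": "dd", "hour": "HH", "minute": "mm",
}


def start_log_stream(spark, kafka_servers, topic_pattern, output_path, checkpoint_path):
    """Start a streaming query that copies JSON logs from Kafka to partitioned Parquet."""
    # Imported here so the module can be loaded without PySpark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType, StructField, StructType, TimestampType

    # Hypothetical log schema; the real LogWise schema is not shown in this document.
    log_schema = StructType([
        StructField("env", StringType()),
        StructField("service_name", StringType()),
        StructField("timestamp", TimestampType()),
        StructField("message", StringType()),
    ])

    # Read from every Kafka topic whose name matches the pattern.
    raw = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", kafka_servers)
        .option("subscribePattern", topic_pattern)
        .load()
    )

    # Kafka values arrive as bytes; cast to string and parse with the fixed schema.
    logs = raw.select(
        F.from_json(F.col("value").cast("string"), log_schema).alias("log")
    ).select("log.*")

    # Derive year/month/day/hour/minute from the event timestamp.
    for column, pattern in TIME_PARTITION_FORMATS.items():
        logs = logs.withColumn(column, F.date_format(F.col("timestamp"), pattern))

    # The checkpoint location stores consumed offsets so a restart can resume safely.
    return (
        logs.writeStream.format("parquet")
        .option("path", output_path)
        .option("checkpointLocation", checkpoint_path)
        .partitionBy(*PARTITION_COLUMNS)
        .start()
    )
```

Keeping the checkpoint directory stable across restarts is what allows the query to pick up from the offsets it last committed; pointing a restarted job at a fresh checkpoint path would make it start over from the configured starting offsets.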

Partitioned Storage in S3

Spark writes logs to partitioned directories for efficient querying and retrieval. Partition format:

/env=<env>/service_name=<service_name>/year=<YYYY>/month=<MM>/day=<DD>/hour=<HH>/minute=<mm>/

Because the environment, service, and time components are encoded in the path, Athena queries that filter on them only need to scan the matching partitions.
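
As an illustration of the layout, the following sketch builds the partition prefix for a single log event. The function name and the choice of UTC for the time components are assumptions made for this example; the document does not state which time zone LogWise uses.

```python
from datetime import datetime, timezone


def partition_path(env: str, service_name: str, ts: datetime) -> str:
    """Return the partition prefix for one log event, following the layout above."""
    ts = ts.astimezone(timezone.utc)  # assumption: partitions are keyed by UTC time
    return (
        f"/env={env}/service_name={service_name}"
        f"/year={ts:%Y}/month={ts:%m}/day={ts:%d}"
        f"/hour={ts:%H}/minute={ts:%M}/"
    )


event_time = datetime(2024, 3, 7, 9, 5, tzinfo=timezone.utc)
print(partition_path("prod", "checkout", event_time))
# /env=prod/service_name=checkout/year=2024/month=03/day=07/hour=09/minute=05/
```

Zero-padding the month, day, hour, and minute keeps the prefixes sortable as plain strings, which matches the `<MM>`, `<DD>`, `<HH>`, and `<mm>` placeholders in the format.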

Kafka Integration

Consumes logs from Kafka topics created by Vector
Supports automatic topic discovery using regular expressions
Tracks Kafka offsets for reliable exactly-once processing
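
Regex-based topic discovery maps onto the `subscribePattern` option of Spark's Kafka source. The sketch below builds such an option set and previews which topics a pattern would select. The helper names, the `logs.` topic prefix, and the broker address are hypothetical; the preview uses Python's `re` module, which agrees with the Java regex engine used by Kafka for simple patterns like this one but is not identical in general.

```python
import re


def kafka_source_options(bootstrap_servers: str, topic_pattern: str) -> dict:
    """Options for spark.readStream.format("kafka") using pattern-based subscription."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribePattern": topic_pattern,  # regex matched against whole topic names
        "startingOffsets": "earliest",      # assumption: read from the beginning on first start
    }


def matching_topics(topic_pattern: str, topics: list) -> list:
    """Preview which topic names a pattern selects (the whole name must match)."""
    regex = re.compile(topic_pattern)
    return [t for t in topics if regex.fullmatch(t)]


options = kafka_source_options("localhost:9092", r"logs\..*")
print(matching_topics(options["subscribePattern"],
                      ["logs.checkout", "logs.payments", "metrics.cpu"]))
# ['logs.checkout', 'logs.payments']
```

The resulting dictionary could be passed to the reader as `spark.readStream.format("kafka").options(**options).load()`. With a pattern subscription, topics created later that match the pattern can be picked up without changing the job configuration.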

Integration with Other Components

Kafka - Consumes logs from topics
S3 - Writes processed logs in Parquet format with partition structure
Orchestrator Service - Monitors job health and manages Spark drivers

Requirements and Setup

See the Spark Setup Guide for installation and configuration.
