LogWise - Local Development Setup

A complete end-to-end logging system that streams logs through Vector → Kafka → Spark → S3/Athena, with a Spring Boot Orchestrator, Grafana dashboards, and automated cron jobs.

πŸ—οΈ Architecture ​

┌─────────┐      ┌─────────┐      ┌─────────┐      ┌─────────┐
│ Vector  │─────▶│  Kafka  │─────▶│  Spark  │─────▶│   S3    │
│ (Logs)  │      │(Stream) │      │(Process)│      │(Storage)│
└─────────┘      └─────────┘      └─────────┘      └─────────┘
                                                        │
                                                        ▼
                                                 ┌─────────────┐
                                                 │   Athena    │
                                                 │  (Query)    │
                                                 └─────────────┘
                                                        │
                                                        ▼
                                                 ┌─────────────┐
                                                 │   Grafana   │
                                                 │ (Dashboard) │
                                                 └─────────────┘

Components:

  • Vector: Log collection and forwarding
  • Kafka: Message streaming (KRaft mode)
  • Spark 3.1.2: Stream processing and Parquet writing
  • S3: Object storage for processed logs
  • Athena: Query engine for S3 data
  • Grafana: Visualization and dashboards
  • Orchestrator: Spring Boot service for job management
  • MySQL: Database for orchestrator configuration
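
Once the stack is running, a quick way to verify the first leg of this pipeline (Vector → Kafka) is to tail the Kafka topic from inside the broker container. This is a minimal smoke test, assuming the Kafka image ships the standard CLI scripts on its PATH (adjust the script path for your image if not); the topic name logs matches KAFKA_TOPIC below:

bash
# Read the first few log events that Vector has shipped into Kafka
docker compose exec kafka kafka-console-consumer.sh \
  --bootstrap-server kafka:9092 \
  --topic logs \
  --from-beginning --max-messages 5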

📋 Prerequisites

Required

  • Docker (v20.10+) and Docker Compose (v2.0+)
  • Make (for convenience commands)
  • AWS Credentials with access to:
    • S3 bucket (read/write)
    • Athena workgroup (query execution)

Note: The setup.sh script will automatically install Docker, Make, and other prerequisites if they're missing (on macOS and Debian/Ubuntu Linux). For other systems, install these manually before running setup.

Optional

  • Maven 3.2+ (if building Spark JAR locally)
  • Java 11+ (if building Spark JAR locally)

⚠️ Mandatory: S3 & Athena Setup (Must Complete First)

Before proceeding with the Docker setup, you MUST complete the S3 & Athena configuration. The LogWise stack depends on AWS S3 for log storage and Athena for querying, so this step cannot be skipped.

Steps to Complete:

  1. Follow the S3 & Athena Setup Guide to:

    • Create an S3 bucket with logs and athena-output folders
    • Create an AWS Glue database
    • Create an Athena workgroup
    • Create the application-logs table
  2. Note down the following information (you'll need it for the .env file):

    • S3 bucket name
    • S3 URI for logs (e.g., s3://your-bucket-name/logs/)
    • S3 URI for Athena output (e.g., s3://your-bucket-name/athena-output/)
    • Athena workgroup name
    • Athena database name (e.g., logs or logwise; it must match ATHENA_DATABASE in .env)
  3. Return to this page after completing the S3 & Athena setup to continue with the Docker deployment.

Critical

Do not proceed with the Docker setup until you have completed the S3 & Athena configuration. The setup will fail without proper AWS resources configured.
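
Before continuing, you can sanity-check the AWS side with a few read-only AWS CLI calls. The names below are placeholders from the examples in this guide; substitute the values you noted down:

bash
# Confirm the bucket and logs/ prefix exist and are readable
aws s3 ls s3://your-bucket-name/logs/

# Confirm the Athena workgroup and Glue database exist
aws athena get-work-group --work-group primary
aws glue get-database --name logwise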

🚀 Quick Start

One-Command Setup

The easiest way to get started is with our one-click setup script:

bash
cd deploy
./setup.sh

This single command will:

  • ✅ Install prerequisites (Docker, Make, AWS CLI, etc.) if needed
  • ✅ Create .env file from template (.env.example)
  • ✅ Prompt you to fill in AWS credentials
  • ✅ Start all services (Vector, Kafka, Spark, Grafana, Orchestrator, MySQL)
  • ✅ Wait for services to become healthy
  • ✅ Create Kafka topics automatically

That's it! Your LogWise stack will be up and running.
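
If you prefer to run the equivalent steps by hand (or setup.sh is not an option on your system), the flow is roughly:

bash
cd deploy
cp .env.example .env   # create the config file from the template
# Edit .env: fill in AWS credentials, S3 bucket, and Athena settings
make up                # start all services
# Note: setup.sh additionally waits for health checks and creates Kafka topics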

📊 Accessing Services

Service               URL                                 Credentials
Grafana               http://localhost:3000               admin / admin (default)
Spark Master UI       http://localhost:18080              -
Spark Worker UI       http://localhost:8081               -
Orchestrator          http://localhost:8080               -
Orchestrator Health   http://localhost:8080/healthcheck   -
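
A quick way to confirm the stack is responding from the command line. The orchestrator healthcheck endpoint is listed above; /api/health is Grafana's standard health endpoint:

bash
curl -s http://localhost:8080/healthcheck
curl -s http://localhost:3000/api/health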

βš™οΈ Configuration Details ​

The .env file contains all configuration for the LogWise stack. When you run setup.sh, it automatically creates this file from .env.example. Here are the key configuration sections:

AWS Configuration (Required)

bash
AWS_REGION=us-east-1                    # AWS region for S3 and Athena
AWS_ACCESS_KEY_ID=your-access-key       # AWS access key ID
AWS_SECRET_ACCESS_KEY=your-secret-key   # AWS secret access key
AWS_SESSION_TOKEN=                      # Optional: for temporary credentials
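
To confirm these credentials are valid before starting the stack, you can run a harmless identity check with the AWS CLI, using the same values you put in .env:

bash
aws sts get-caller-identity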

S3 Configuration (Required)

bash
S3_BUCKET_NAME=your-bucket-name         # S3 bucket for storing processed logs
S3_PREFIX=logs/                         # Prefix/path within the bucket

Athena Configuration (Required)

bash
S3_ATHENA_OUTPUT=s3://bucket/athena-output/  # S3 path for Athena query results
ATHENA_WORKGROUP=primary                     # Athena workgroup name
ATHENA_CATALOG=AwsDataCatalog                # Athena data catalog
ATHENA_DATABASE=logwise                      # Athena database name

Kafka Configuration

bash
KAFKA_BROKERS=kafka:9092                 # Kafka broker address (default for Docker)
KAFKA_TOPIC=logs                         # Kafka topic name for logs
KAFKA_CLUSTER_ID=9ZkYwXlQ2Tq8rBn5JcH0xA  # Kafka cluster ID (KRaft mode)
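
setup.sh creates the topic for you, but you can inspect it manually from inside the broker container. As above, this assumes the standard Kafka CLI scripts are on the image's PATH:

bash
docker compose exec kafka kafka-topics.sh \
  --bootstrap-server kafka:9092 \
  --describe --topic logs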

Spark Configuration

bash
SPARK_MASTER_URL=spark://spark-master:7077  # Spark master URL
SPARK_STREAMING=true                        # Enable Spark streaming
SPARK_MASTER_UI_PORT=18080                  # Spark Master UI port
SPARK_VERSION_MATCH=3.1.2                   # Spark version
HADOOP_AWS_VERSION=3.2.0                    # Hadoop AWS library version
AWS_SDK_VERSION=1.11.375                    # AWS SDK version
MAIN_CLASS=com.logwise.spark.MainApplication  # Spark application main class
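
For orientation, here is roughly how these values fit together in a spark-submit invocation. This is an illustrative sketch, not the exact command the stack runs: the compose setup submits via the spark-client container, and the custom image already bundles the AWS/Kafka JARs (making --packages redundant there); the JAR path is hypothetical:

bash
spark-submit \
  --master spark://spark-master:7077 \
  --class com.logwise.spark.MainApplication \
  --packages org.apache.hadoop:hadoop-aws:3.2.0,com.amazonaws:aws-java-sdk-bundle:1.11.375,org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 \
  /path/to/logwise-spark.jar   # hypothetical JAR location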

Database Configuration

bash
MYSQL_DATABASE=myapp                       # MySQL database name
MYSQL_USER=myapp                           # MySQL user
MYSQL_PASSWORD=myapp_pass                  # MySQL password
MYSQL_ROOT_PASSWORD=root_pass              # MySQL root password
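
To verify the orchestrator database is reachable with these credentials (assuming the compose service is named mysql; adjust if your service name differs):

bash
docker compose exec mysql mysql -umyapp -pmyapp_pass myapp -e "SELECT 1"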

Other Configuration

bash
ORCH_PORT=8080                             # Orchestrator service port
TENANT_VALUE=ABC                           # Tenant identifier

For a complete list of all environment variables, see .env.example in the deploy directory.

πŸ› οΈ Common Commands ​

bash
# Start all services
make up

# Stop all services
make down

# View logs
make logs

# Check service status
make ps

# Stop and remove volumes
make teardown

# Reset Kafka (fix cluster ID issues)
make reset-kafka

⚠️ Troubleshooting

Spark Worker Not Accepting Resources

Symptom: WARN Master: App requires more resource than any of Workers could have

Solution:

  1. Check worker memory: docker compose logs spark-worker | grep "Starting Spark worker"
  2. Ensure worker has enough memory. The worker needs:
    • Memory for driver + executor + overhead
    • Default: 512m driver + 512m executor = ~1GB minimum
  3. Adjust in .env:
    bash
    SPARK_DRIVER_MEMORY=400m
    SPARK_EXECUTOR_MEMORY=400m
  4. Or increase worker memory limit in docker-compose.yml:
    yaml
    spark-worker:
      mem_limit: 3g
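
To see how much memory and how many cores the master thinks each worker is offering, you can query the standalone master UI, which also exposes a JSON view of cluster state at /json (port 18080 per SPARK_MASTER_UI_PORT):

bash
curl -s http://localhost:18080/json/ | python3 -m json.tool | grep -iE '"memory|"cores'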

ClassNotFoundException for S3 or Kafka

Symptom: java.lang.ClassNotFoundException: org.apache.hadoop.fs.s3a.S3AFileSystem

Solution:

  • The custom Spark Dockerfile includes required JARs:
    • hadoop-aws-3.2.0.jar
    • aws-java-sdk-bundle-1.11.375.jar
    • spark-sql-kafka-0-10_2.12-3.1.2.jar
    • kafka-clients-2.6.0.jar
  • Rebuild the Spark image: docker compose build spark-worker spark-master spark-client
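
After rebuilding, you can confirm the JARs actually made it into the image. The jars directory is typically $SPARK_HOME/jars; /opt/spark/jars below is an assumption, adjust for your base image:

bash
docker compose exec spark-master ls /opt/spark/jars | grep -Ei 'hadoop-aws|aws-java-sdk|kafka'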

AWS Access Denied (403 Forbidden)

Symptom: AccessDeniedException: 403 Forbidden

Solution:

  1. Verify AWS credentials in .env:
    bash
    AWS_ACCESS_KEY_ID=your-key
    AWS_SECRET_ACCESS_KEY=your-secret
    AWS_SESSION_TOKEN=your-token  # If using temporary credentials
    AWS_REGION=us-east-1
  2. Ensure IAM permissions include:
    • s3:GetObject, s3:PutObject, s3:ListBucket on target bucket
    • athena:StartQueryExecution, athena:GetQueryResults (if using Athena)
  3. Restart Spark client: docker compose restart spark-client
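
A quick round-trip test from the host isolates whether the problem is the credentials/permissions or the Spark configuration (the bucket name is a placeholder):

bash
# Write, then delete, a tiny object under the logs/ prefix
echo "logwise connectivity test" | aws s3 cp - s3://your-bucket-name/logs/connectivity-test.txt
aws s3 rm s3://your-bucket-name/logs/connectivity-test.txt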

Port Conflicts

Symptom: Error: bind: address already in use

Solution:

  • Change ports in .env:
    bash
    GRAFANA_PORT=3001
    ORCH_PORT=8081
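
To find out which process is holding a port before changing it:

bash
# macOS or Linux
lsof -i :3000
lsof -i :8080

# Linux alternative
ss -ltnp | grep -E ':3000|:8080'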

Kafka Cluster ID Mismatch

Symptom: Cluster ID mismatch errors

Solution:

bash
make reset-kafka
make up

Disk Space Issues

Symptom: no space left on device

Solution:

bash
# Clean up Docker
docker system prune -a --volumes

# Remove unused images
docker image prune -a

Spark Worker Not Registering

Symptom: Worker fails to connect to master

Solution:

  1. Check network connectivity:
    bash
    docker compose exec spark-worker curl http://spark-master:8080
  2. Verify master is running:
    bash
    docker compose logs spark-master | grep "Successfully started service"
  3. Check worker logs:
    bash
    docker compose logs spark-worker | grep -i "error\|exception"

πŸ“ Project Structure ​

logwise/
├── deploy/
│   ├── docker-compose.yml       # Main orchestration file
│   ├── Makefile                 # Convenience commands
│   ├── setup.sh                 # One-click setup script
│   ├── grafana/provisioning/    # Grafana dashboards & datasources
│   └── healthcheck-dummy/
│       └── Dockerfile           # Healthcheck test service
├── vector/
│   ├── vector.yaml              # Vector configuration
│   └── logwise-vector.desc      # Protobuf descriptor
├── spark/
│   └── docker/Dockerfile        # Spark container image
└── orchestrator/
    ├── docker/Dockerfile        # Orchestrator container image
    └── db/init/                 # Database initialization scripts

πŸ” Security Notes ​

  • Never commit the .env file - it contains sensitive AWS credentials
  • Use IAM roles in production instead of access keys
  • Enable TLS/SSL for production deployments
  • Restrict network access to services in production
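
A small guard against accidentally committing credentials (run from the repository root; assumes deploy/.env is where your file lives):

bash
# Ignore the env file and verify git treats it as ignored
grep -qxF 'deploy/.env' .gitignore || echo 'deploy/.env' >> .gitignore
git check-ignore -v deploy/.env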

Happy Logging! 🚀
