LangFuse Database Architecture

Note: 📚 Explanation - This page explains LangFuse’s database architecture, including why it uses multiple database systems and how they work together to provide high-performance observability.

Understanding LangFuse Database Architecture

LangFuse uses a sophisticated dual-database architecture designed to handle both operational data and high-volume analytics workloads efficiently. Understanding this architecture is crucial for effective deployment, troubleshooting, and performance optimization.

Dual Database Architecture Overview

LangFuse separates concerns by using two distinct database systems, each optimized for different types of workloads:

graph TB
    subgraph "LangFuse Application Layer"
        WebUI[🌐 Web UI]
        API[🔌 API Server]
        SDK[📦 Python SDK]
    end
    
    subgraph "Database Layer"
        subgraph "PostgreSQL Cluster"
            PG[(🗄️ PostgreSQL<br/>User Management)]
            PGData[👥 Users<br/>🏢 Organizations<br/>📋 Projects<br/>🔑 API Keys<br/>⚙️ Settings]
        end
        
        subgraph "ClickHouse Cluster"
            CH1[(📊 ClickHouse<br/>Replica 1)]
            CH2[(📊 ClickHouse<br/>Replica 2)]
            CH3[(📊 ClickHouse<br/>Replica 3)]
            
            ZK[🐘 ZooKeeper<br/>Coordination Plane]
            
            CHData[📈 Traces<br/>🔍 Observations<br/>⭐ Scores<br/>📊 Analytics]
        end
    end
    
    WebUI --> API
    SDK --> API
    API --> PG
    API --> CH1
    API --> CH2
    API --> CH3
    
    CH1 -.->|Coordinates| ZK
    CH2 -.->|Coordinates| ZK
    CH3 -.->|Coordinates| ZK
    
    PG --> PGData
    CH1 --> CHData
    CH2 --> CHData
    CH3 --> CHData
    
    classDef postgres fill:#336791,stroke:#fff,stroke-width:2px,color:#fff
    classDef clickhouse fill:#FFCC02,stroke:#000,stroke-width:2px,color:#000
    classDef zookeeper fill:#4CAF50,stroke:#fff,stroke-width:2px,color:#fff
    classDef app fill:#2196F3,stroke:#fff,stroke-width:2px,color:#fff
    classDef data fill:#f9f9f9,stroke:#666,stroke-width:1px
    
    class PG postgres
    class CH1,CH2,CH3 clickhouse
    class ZK zookeeper
    class WebUI,API,SDK app
    class PGData,CHData data

Database System Roles

PostgreSQL: Operational Data Store

Purpose: Handles all operational and user management data that requires ACID transactions and relational integrity.

Data Types Stored:

  • User Accounts - Authentication, profiles, permissions
  • Organizations - Multi-tenant organization structures
  • Projects - Project configurations and settings
  • API Keys - Authentication tokens and access control
  • System Settings - Application configuration and preferences

Why PostgreSQL:

  • ACID Compliance - Ensures data consistency for critical user data
  • Relational Integrity - Maintains complex relationships between users, projects, and permissions
  • Mature Ecosystem - Well-established tooling and operational practices
  • Azure Integration - Native Azure Database for PostgreSQL support

ClickHouse: Analytics Data Store

Purpose: Handles high-volume time-series data and complex analytical queries with sub-second response times.

Data Types Stored:

  • Traces - Complete execution records of AI workflows
  • Observations - Individual steps and operations within traces
  • Scores - Quality metrics and evaluation results
  • Analytics Aggregations - Pre-computed statistics and metrics

Why ClickHouse:

  • Columnar Storage - Optimized for analytical queries across large datasets
  • High Compression - Reduces storage costs for time-series data
  • Parallel Processing - Scales horizontally for complex aggregations
  • Real-time Ingestion - Handles high-volume trace data efficiently

ClickHouse Replication Architecture

Replication Strategy

LangFuse deploys ClickHouse in a 3-replica configuration to ensure high availability and data durability:

  • 3 ClickHouse Pods: shard0-0, shard0-1, shard0-2
  • Replication Engine: ReplicatedReplacingMergeTree
  • Coordination: ZooKeeper ensemble for distributed consensus
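
To make this concrete, the sketch below shows roughly how a table of this kind is declared, using the clickhouse-connect Python package. The table name, columns, and the {shard}/{replica} macros are illustrative assumptions only; the real LangFuse tables are created by the application’s own migrations.

import clickhouse_connect

client = clickhouse_connect.get_client(
    host="localhost", port=8123, username="default", password=""  # placeholder connection
)

# Illustrative table only -- not the actual LangFuse schema.
client.command("""
    CREATE TABLE IF NOT EXISTS traces_example
    (
        id         String,
        project_id String,
        timestamp  DateTime64(3),
        event_ts   DateTime64(3)
    )
    ENGINE = ReplicatedReplacingMergeTree(
        '/clickhouse/tables/{shard}/traces_example',  -- ZooKeeper path shared by all replicas
        '{replica}',                                  -- unique name for each replica pod
        event_ts                                      -- version column: newest row wins when parts merge
    )
    ORDER BY (project_id, id)
""")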

ZooKeeper’s Critical Role

What ZooKeeper Does:

  • Metadata Coordination - Manages table schemas and replication metadata
  • Consensus Management - Ensures all replicas agree on data operations
  • Failure Detection - Monitors replica health and handles failovers
  • UUID Management - Maintains unique identifiers for replication groups

Why ZooKeeper is Essential:

  • Split-Brain Prevention - Prevents data inconsistencies during network partitions
  • Automatic Recovery - Enables replicas to resynchronize after failures
  • Schema Evolution - Coordinates schema changes across all replicas
  • Merge Coordination - Elects a leader replica to assign background merges so replicas do not duplicate work
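
One way to observe this coordination on a running cluster is to ask ClickHouse which ZooKeeper path each replicated table is registered under and how many replicas in that group are active. A minimal sketch with placeholder connection details:

import clickhouse_connect

client = clickhouse_connect.get_client(
    host="localhost", port=8123, username="default", password=""  # placeholder connection
)

# system.replicas reports, per replicated table, the ZooKeeper path the replica
# registered under and the health of its replication group.
result = client.query("""
    SELECT table, zookeeper_path, replica_name, is_leader, active_replicas, total_replicas
    FROM system.replicas
    ORDER BY table
""")
for row in result.result_rows:
    print(row)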

Replication Process

This diagram shows how data flows through the ClickHouse replication system when new trace data arrives. ZooKeeper coordinates the replication to ensure all replicas stay synchronized:

sequenceDiagram
    participant App as LangFuse App
    participant CH1 as ClickHouse Replica 1
    participant CH2 as ClickHouse Replica 2
    participant CH3 as ClickHouse Replica 3
    participant ZK as ZooKeeper Ensemble
    
    App->>CH1: Insert trace data
    CH1->>ZK: Register operation
    ZK->>CH2: Notify replication needed
    ZK->>CH3: Notify replication needed
    CH2->>CH1: Fetch new data
    CH3->>CH1: Fetch new data
    CH2->>ZK: Confirm replication
    CH3->>ZK: Confirm replication
    ZK->>CH1: All replicas synchronized

Database Schema Structure

ClickHouse Schema Components

The ClickHouse database contains 13 critical tables and views:

Core Replicated Tables (6 tables)

Table                    Engine                           Purpose
traces                   ReplicatedReplacingMergeTree     Main trace records for LLM requests
observations             ReplicatedReplacingMergeTree     Nested steps/observations within traces
scores                   ReplicatedReplacingMergeTree     Quality scores and evaluations
blob_storage_file_log    ReplicatedReplacingMergeTree     File storage references
project_environments     ReplicatedAggregatingMergeTree   Project environment aggregations
schema_migrations        ReplicatedMergeTree              Migration version tracking

Analytics Views (3 views)

  • analytics_observations - Hourly observation statistics
  • analytics_scores - Hourly score statistics
  • analytics_traces - Hourly trace statistics

Materialized Views (3 views)

  • project_environments_observations_mv
  • project_environments_scores_mv
  • project_environments_traces_mv

Event Log (1 table)

  • event_log - Non-replicated local table for system events
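
To check that a deployment actually contains these tables and views with the expected engines, you can list them from ClickHouse’s system.tables. A small sketch, assuming clickhouse-connect and placeholder credentials:

import clickhouse_connect

client = clickhouse_connect.get_client(
    host="localhost", port=8123, username="default", password=""  # placeholder connection
)

# List every table and view in the active database with its engine,
# which should line up with the inventory above.
result = client.query("""
    SELECT name, engine
    FROM system.tables
    WHERE database = currentDatabase()
    ORDER BY name
""")
for name, engine in result.result_rows:
    print(f"{name:40} {engine}")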

Performance Characteristics

Query Performance

PostgreSQL Workloads:

  • User Authentication - Sub-millisecond response times
  • Project Management - Simple CRUD operations
  • Configuration Queries - Cached and optimized

ClickHouse Workloads:

  • Trace Queries - Complex aggregations across millions of records
  • Analytics Dashboards - Real-time metrics and visualizations
  • Time-Series Analysis - Historical trend analysis
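
For example, an hourly trace count over the last week is the kind of aggregation ClickHouse answers quickly from columnar storage. The sketch below assumes the traces table exposes timestamp and project_id columns; treat the column names and the project id as assumptions, not the verified schema.

import clickhouse_connect

client = clickhouse_connect.get_client(
    host="localhost", port=8123, username="default", password=""  # placeholder connection
)

# Hourly trace volume for one project over the last 7 days -- a typical
# dashboard-style aggregation.
result = client.query("""
    SELECT
        toStartOfHour(timestamp) AS hour,
        count() AS trace_count
    FROM traces
    WHERE project_id = 'example-project'          -- placeholder project id
      AND timestamp >= now() - INTERVAL 7 DAY
    GROUP BY hour
    ORDER BY hour
""")
for hour, trace_count in result.result_rows:
    print(hour, trace_count)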

Scaling Behavior

PostgreSQL Scaling:

  • Vertical Scaling - Increase CPU/memory for higher concurrency
  • Read Replicas - Distribute read queries across replicas
  • Connection Pooling - Optimize connection management

ClickHouse Scaling:

  • Horizontal Scaling - Add more replicas for increased query throughput
  • Sharding - Distribute data across multiple shards (future capability)
  • Compression - Automatic data compression reduces storage requirements
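
The compression claim can be checked directly on a running cluster: system.parts records both compressed (on-disk) and uncompressed sizes for every active data part. A sketch with placeholder connection details:

import clickhouse_connect

client = clickhouse_connect.get_client(
    host="localhost", port=8123, username="default", password=""  # placeholder connection
)

# Compare on-disk (compressed) size with uncompressed size per table.
result = client.query("""
    SELECT
        table,
        formatReadableSize(sum(data_compressed_bytes))   AS on_disk,
        formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
        round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 1) AS ratio
    FROM system.parts
    WHERE active
    GROUP BY table
    ORDER BY sum(data_compressed_bytes) DESC
""")
for row in result.result_rows:
    print(row)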

Operational Considerations

Backup and Recovery

PostgreSQL:

  • Point-in-Time Recovery - Azure Database for PostgreSQL automated backups
  • Cross-Region Replication - Geographic disaster recovery
  • Schema Migrations - Managed through application deployment

ClickHouse:

  • Replica Redundancy - Data automatically replicated across 3 nodes
  • ZooKeeper Metadata - Critical for replica coordination and recovery
  • Incremental Backups - Efficient backup of time-series data
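
ClickHouse ships a native BACKUP statement that supports incremental backups on top of a base backup. The sketch below is illustrative only: it assumes a backup destination disk named 'backups' has been configured in the ClickHouse server configuration, and the file names are placeholders.

import clickhouse_connect

client = clickhouse_connect.get_client(
    host="localhost", port=8123, username="default", password=""  # placeholder connection
)

# Full backup of the traces table to a preconfigured 'backups' disk...
client.command("BACKUP TABLE traces TO Disk('backups', 'traces_base.zip')")

# ...followed later by an incremental backup that only stores parts added
# or changed since the base backup.
client.command("""
    BACKUP TABLE traces TO Disk('backups', 'traces_incremental.zip')
    SETTINGS base_backup = Disk('backups', 'traces_base.zip')
""")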

Monitoring and Alerting

Key Metrics to Monitor:

  • PostgreSQL: Connection count, query performance, disk usage
  • ClickHouse: Replication lag, query latency, replica health
  • ZooKeeper: Ensemble health, metadata consistency, leader elections
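
A minimal polling sketch for two of these signals, ClickHouse replication lag and queue depth plus PostgreSQL connection count; the connection strings are placeholders:

import clickhouse_connect
import psycopg2

# ClickHouse: replication lag and queue depth per replicated table.
ch = clickhouse_connect.get_client(
    host="localhost", port=8123, username="default", password=""  # placeholder connection
)
for row in ch.query("""
    SELECT table, absolute_delay, queue_size, active_replicas, total_replicas
    FROM system.replicas
""").result_rows:
    print("clickhouse:", row)

# PostgreSQL: current connection count (compare against max_connections).
pg = psycopg2.connect("postgresql://langfuse:password@localhost:5432/langfuse")  # placeholder DSN
with pg.cursor() as cur:
    cur.execute("SELECT count(*) FROM pg_stat_activity")
    print("postgres connections:", cur.fetchone()[0])
pg.close()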

Common Issues and Prevention

Replication Synchronization Problems

Symptoms: Inconsistent data across UI refreshes, “empty traces” appearing intermittently

Root Cause: UUID mismatches between replicas creating isolated replication groups

Prevention:

  • Never delete PVCs without proper backup procedures
  • Monitor replication status regularly
  • Verify UUID consistency after scaling operations
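
One way to perform that verification: query each replica for the ZooKeeper path it registered for the traces table (on recent ClickHouse versions this path embeds the table UUID) and confirm all three paths match. The pod hostnames and credentials below are placeholders.

import clickhouse_connect

# Placeholder hostnames for the three ClickHouse replica pods.
replica_hosts = [
    "langfuse-clickhouse-shard0-0",
    "langfuse-clickhouse-shard0-1",
    "langfuse-clickhouse-shard0-2",
]

paths = {}
for host in replica_hosts:
    client = clickhouse_connect.get_client(
        host=host, port=8123, username="default", password=""  # placeholder credentials
    )
    rows = client.query(
        "SELECT zookeeper_path FROM system.replicas WHERE table = 'traces'"
    ).result_rows
    paths[host] = rows[0][0] if rows else None

print(paths)
if len(set(paths.values())) > 1:
    print("WARNING: replicas registered under different ZooKeeper paths (isolated replication groups)")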

Memory Exhaustion

Symptoms: ZooKeeper OOM errors, cascading failures across components

Root Cause: Insufficient memory allocation for ZooKeeper coordination tasks

Prevention:

  • Set appropriate memory limits for ZooKeeper pods
  • Monitor memory usage trends
  • Scale resources proactively

Next Steps

Now that you understand the database architecture, continue with the related deployment, operations, and concepts guides.

This dual-database architecture provides the foundation for LangFuse’s ability to handle both operational workloads and high-performance analytics at scale.