Deployment Troubleshooting
Problem: You’re encountering errors during LangFuse deployment or post-deployment configuration.
Solution: This comprehensive troubleshooting guide covers the most common deployment issues with specific solutions based on real deployment experience.
Quick Fix: Most Azure deployment issues are timing-related and resolve with a simple retry after waiting 30-60 seconds.
Common Deployment Issues
Permission Errors
Error Pattern: StatusCode=403 or permission-related failures
Key Vault Certificate Permissions
Error: Status=403 Code="Forbidden" Message="The user, group or application ... does not have certificates get/delete permission on key vault"
Root Cause: This occurs when:
- Terraform needs to create/recreate/update certificates
- The service principal lacks certificate management permissions
- Key Vault RBAC model requires explicit permission assignment
- New Key Vault created without proper access policies
Universal Solution:
# Get the Key Vault name from current deployment
KEYVAULT_NAME=$(terraform show -json | jq -r '.values.root_module.child_modules[] | select(.address=="module.langfuse") | .resources[] | select(.type=="azurerm_key_vault") | .values.name')
# Add all certificate permissions to current user
az keyvault set-policy --name $KEYVAULT_NAME --upn $(az account show --query user.name -o tsv) \
--certificate-permissions create delete get list update import backup restore recover purge
# Verify permissions were added
# (Azure CLI 2.37+ returns the user's object ID as "id"; older versions returned "objectId")
az keyvault show --name "$KEYVAULT_NAME" --query "properties.accessPolicies[?objectId=='$(az ad user show --id $(az account show --query user.name -o tsv) --query id -o tsv)'].permissions.certificates"
Alternative: Manual Permission Assignment
# If the above doesn't work, use your specific email
az keyvault set-policy --name your-keyvault-name --upn your-email@justice.gov.uk \
--certificate-permissions create delete get list update import backup restore recover purge
After Adding Permissions:
sleep 30
terraform apply -auto-approve
General Permission Propagation
Error: Various StatusCode=403 errors
Root Cause: Azure permissions take time to propagate across services
Solution: Wait for 30-60 seconds and retry.
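The wait-and-retry advice can be wrapped in a small shell function so that it is applied consistently. The following is a minimal sketch: `retry` is a hypothetical helper name, not part of Terraform or the Azure CLI, and the attempt count and delay in the usage comment are illustrative values chosen to match the 30-60 second guidance.

```shell
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds or MAX_ATTEMPTS is reached,
# sleeping DELAY_SECONDS between failed attempts.
retry() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "retry: giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "retry: attempt $attempt failed; waiting ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
  done
}

# Example usage (illustrative values):
# retry 3 45 terraform apply -auto-approve
```

A fixed delay is used here because the source recommends a fixed 30-60 second wait. Note that this retries on any failure, so it is best reserved for errors already identified as propagation-related.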
Key Vault Issues
Soft-Delete Conflicts
Error: Vault name 'your-vault-name' is already in use or similar naming conflicts
Root Cause: Azure Key Vault has soft-delete enabled by default. When Terraform recreates a Key Vault (e.g., due to domain changes), the old vault enters a “soft-deleted” state for 7 days.
Prevention: Before changing domains that affect Key Vault names:
# 1. Get current Key Vault name from Terraform state
CURRENT_VAULT=$(terraform show -json | jq -r '.values.root_module.child_modules[] | select(.address=="module.langfuse") | .resources[] | select(.type=="azurerm_key_vault") | .values.name')
# 2. Delete the Key Vault before applying changes
az keyvault delete --name $CURRENT_VAULT --resource-group <YOUR_RESOURCE_GROUP>
# 3. Check for any soft-deleted conflicts
az keyvault list-deleted --subscription $(az account show --query id -o tsv)
# 4. Apply Terraform changes
terraform apply -auto-approve
Post-Conflict Resolution: If you encounter the error after deployment:
# Check for soft-deleted vaults
az keyvault list-deleted --subscription $(az account show --query id -o tsv)
# If there's a naming conflict, either:
# Option 1: Wait 7 days for automatic purge
# Option 2: Change your domain/name to avoid conflict
# Option 3: Use different random suffix (remove from state)
terraform state rm module.langfuse.random_string.key_vault_postfix
terraform apply -auto-approve
Key Vault Recovery Issues
Error: Vault name 'your-vault-name' is already in use
Solution 1: Force new Key Vault name
terraform state rm module.langfuse.random_string.key_vault_postfix
terraform apply
Solution 2: Purge existing vault
az keyvault purge --name your-vault-name --location westeurope
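Before choosing between purging and renaming, it can help to confirm that the name is actually held by a soft-deleted vault. The sketch below wraps the `az keyvault list-deleted` call used above in a function; `is_vault_soft_deleted` is a hypothetical helper name, and it assumes the current subscription is the one that owned the vault.

```shell
# is_vault_soft_deleted NAME
# Hypothetical helper: succeeds (exit 0) if NAME appears in the list of
# soft-deleted vaults for the current subscription, fails (exit 1) otherwise.
# Returns 2 if the az call itself fails.
is_vault_soft_deleted() {
  local name="$1" match
  match=$(az keyvault list-deleted --query "[?name=='${name}'].name" -o tsv) || return 2
  [ -n "$match" ]
}

# Example usage:
# if is_vault_soft_deleted "your-vault-name"; then
#   echo "Name is held by a soft-deleted vault: purge it or use a new name."
# fi
```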
Naming and Resource Conflicts
Resource Already Exists
Error: name.*already exists or similar conflicts
Root Cause: Some Azure resource names must be globally unique, and a new name can also collide with a soft-deleted resource that still holds it
Solution: Use different names or clean up existing resources
# Check for existing resources
az resource list --resource-group <YOUR_RESOURCE_GROUP> --output table
# Delete conflicting resources if safe
az resource delete --resource-group <YOUR_RESOURCE_GROUP> --name conflicting-resource-name --resource-type Microsoft.KeyVault/vaults
Application Gateway Issues
Redirect Configuration Conflicts
Error: ConflictError.*redirectConfigurations or AGIC-related errors
Root Cause: Application Gateway Ingress Controller (AGIC) configuration conflicts
Solution: Refresh and reapply
terraform refresh
terraform apply -auto-approve
Backend Health Issues
Error: 502 Bad Gateway errors
Diagnostic Steps:
# Check AKS pods
kubectl get pods -n langfuse
# Check ingress configuration
kubectl get ingress -n langfuse -o yaml
# Check Application Gateway backend health
az network application-gateway show-backend-health --resource-group <YOUR_RESOURCE_GROUP> --name lng-appgw
DNS and SSL Issues
DNS Propagation Delays
Problem: Domain not resolving after deployment
Solution: Wait and verify DNS propagation
# Test NS delegation
nslookup -type=NS <YOUR_DOMAIN>
# Test A record resolution
nslookup <YOUR_DOMAIN>
# Wait 5-30 minutes for global propagation
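The waiting step can be automated with a polling loop. This is a minimal sketch: `wait_for_dns` is a hypothetical helper, the 1800-second default mirrors the upper end of the 5-30 minute window above, and it assumes the local `nslookup` exits non-zero when the name does not resolve (common, but worth confirming on your platform).

```shell
# wait_for_dns DOMAIN [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
# Hypothetical helper: polls nslookup until DOMAIN resolves or the timeout
# elapses. Assumes nslookup returns a non-zero exit status on failure.
wait_for_dns() {
  local domain="$1" timeout="${2:-1800}" interval="${3:-30}" waited=0
  until nslookup "$domain" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "wait_for_dns: $domain did not resolve within ${timeout}s" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}

# Example usage:
# wait_for_dns "<YOUR_DOMAIN>" && echo "DNS is resolving"
```

A successful lookup only reflects the resolver your machine uses, so other networks may still see stale records for a while.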
SSL Certificate Issues
Problem: Browser shows “Not Secure” or certificate warnings
Diagnostic Steps:
# Check certificate configuration
kubectl get certificate -n langfuse
# Check ingress TLS configuration
kubectl get ingress langfuse -n langfuse -o yaml | grep -A 5 tls
# Test SSL certificate
openssl s_client -connect <YOUR_DOMAIN>:443 -servername <YOUR_DOMAIN>
AKS and Kubernetes Issues
Pod Startup Failures
Problem: LangFuse pods not starting
Diagnostic Steps:
# Check pod status
kubectl get pods -n langfuse
# Check pod logs
kubectl logs -n langfuse deployment/langfuse-web --tail=50
# Check events
kubectl get events -n langfuse --sort-by='.lastTimestamp'
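When the namespace holds many pods, filtering the `kubectl get pods` output down to the unhealthy ones makes the next step clearer. The sketch below is one way to do that; `list_unhealthy_pods` is a hypothetical helper, and it relies on the default column order (NAME, READY, STATUS) of `kubectl get pods --no-headers`.

```shell
# list_unhealthy_pods [NAMESPACE]
# Hypothetical helper: prints "name ready status" for every pod that is
# neither Completed nor fully ready in the Running state.
list_unhealthy_pods() {
  local ns="${1:-langfuse}"
  kubectl get pods -n "$ns" --no-headers |
    awk '{
      split($2, ready, "/")   # READY column looks like "1/1"
      if ($3 != "Completed" && ($3 != "Running" || ready[1] != ready[2]))
        print $1, $2, $3
    }'
}

# Example usage:
# list_unhealthy_pods langfuse
```

Pods reported here are good candidates for `kubectl logs` and `kubectl describe pod`.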
Resource Constraints
Problem: Pods pending or resource-related errors
Solution: Check and adjust resources
# Check node resources
kubectl describe nodes
# Check resource quotas
kubectl describe resourcequota -n langfuse
# Scale nodes if needed (via Terraform)
# Update node_pool_max_count in main.tf
Systematic Troubleshooting Approach
1. Check Current State
# Verify Azure login and subscription
az account show
# Check resource group
az group show --name <YOUR_RESOURCE_GROUP>
# List all resources
az resource list --resource-group <YOUR_RESOURCE_GROUP> --output table
2. Validate Terraform State
# Check Terraform state
terraform show
# Validate configuration
terraform validate
# Plan to see differences
terraform plan
3. Monitor Resource Creation
# Watch resource creation in real-time
watch -n 5 'az resource list --resource-group <YOUR_RESOURCE_GROUP> --query "[].{Name:name,Type:type,Status:provisioningState}" --output table'
# Monitor specific resources
watch -n 5 'az keyvault show --name your-vault-name --query "properties.provisioningState"'
4. Emergency Recovery
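# Reset problematic resources
# (state rm only removes the resource from Terraform state; the Azure resource itself is not deleted)
terraform state rm module.langfuse.azurerm_key_vault.this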
terraform apply -auto-approve
# Force resource recreation
terraform taint module.langfuse.azurerm_application_gateway.this
terraform apply -auto-approve
Error Pattern Reference
Error Pattern | Common Cause | Solution |
---|---|---|
StatusCode=403 | Permission not propagated | Wait 30-60 seconds, retry |
certificates delete permission | Missing Key Vault permissions | Add certificate permissions via az keyvault set-policy |
Vault name.*already in use | Key Vault soft-delete conflict | Delete old Key Vault before domain changes |
name.*already exists | Resource name collision | Use different name or clean up existing |
Cannot import non-existent | Import block referencing deleted resource | Remove import block |
exceeds.*character limit | Name too long | Use shorter name parameter |
MethodNotAllowed.*Purge | Purge protection enabled | Use different name or disable protection |
ConflictError.*already in use | Soft-deleted resource | Purge or use different name |
InvalidResourceReference.*redirectConfigurations | AGIC configuration conflict | Run terraform refresh, then apply |
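Because the patterns in this table are written as regular expressions, they can be matched against captured output mechanically. The following sketch reads log text on standard input and prints the suggested solution for each pattern that matches. `suggest_fix` is a hypothetical helper; the patterns and solutions are copied from the table, and treating them as extended regular expressions for `grep -E` is an assumption.

```shell
# suggest_fix: reads Terraform/Azure output on stdin and prints
# "pattern -> solution" for every table pattern found in it.
# Returns 0 if at least one pattern matched, 1 otherwise.
suggest_fix() {
  local log pattern fix found=1
  log=$(cat)
  while IFS='|' read -r pattern fix; do
    if printf '%s\n' "$log" | grep -Eq -- "$pattern"; then
      printf '%s -> %s\n' "$pattern" "$fix"
      found=0
    fi
  done <<'EOF'
StatusCode=403|Wait 30-60 seconds, retry
certificates delete permission|Add certificate permissions via az keyvault set-policy
Vault name.*already in use|Delete old Key Vault before domain changes
name.*already exists|Use different name or clean up existing
Cannot import non-existent|Remove import block
exceeds.*character limit|Use shorter name parameter
MethodNotAllowed.*Purge|Use different name or disable protection
ConflictError.*already in use|Purge or use different name
InvalidResourceReference.*redirectConfigurations|Run terraform refresh, then apply
EOF
  return "$found"
}

# Example usage, assuming a log file such as one captured with tee:
# suggest_fix < deployment.log
```

More than one line may be printed for a single error, since a message can match a general pattern (for example StatusCode=403) and a more specific one at the same time.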
Best Practices for Troubleshooting
1. Incremental Deployment
For complex deployments, use targeted applies:
# Deploy infrastructure first
terraform apply -target="module.langfuse.azurerm_resource_group.this"
terraform apply -target="module.langfuse.azurerm_virtual_network.this"
# Then security resources
terraform apply -target="module.langfuse.azurerm_key_vault.this"
sleep 60
# Finally application resources
terraform apply
2. Resource Monitoring
# Monitor deployment progress
az resource list --resource-group <YOUR_RESOURCE_GROUP> \
--query "[?provisioningState!='Succeeded'].{Name:name,State:provisioningState}" \
--output table
3. Log Collection
# Collect deployment logs
terraform apply -auto-approve 2>&1 | tee deployment.log
# Collect Azure CLI debug info
az resource list --resource-group <YOUR_RESOURCE_GROUP> --debug 2>&1 | tee azure-debug.log
Complex Multi-Component Failures
502 Bad Gateway with Cascading Infrastructure Issues
Error Pattern: Persistent 502 Bad Gateway errors with multiple component failures
Root Cause: ZooKeeper memory exhaustion triggering cascading failures across ClickHouse, StatefulSets, and Application Gateway
Diagnostic Steps
# Check overall pod health
kubectl get pods -n langfuse
# Check for OOMKilled containers
kubectl describe pods -n langfuse | grep -A 5 -B 5 "OOMKilled"
# Monitor memory usage
kubectl top pods -n langfuse
# Check ZooKeeper logs specifically
kubectl logs -n langfuse statefulset/langfuse-zookeeper --tail=100
Resolution Sequence
Step 1: Address ZooKeeper Memory Issues
# Check current ZooKeeper memory allocation
kubectl get statefulset langfuse-zookeeper -n langfuse -o yaml | grep -A 5 resources
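# Increase memory via Helm upgrade (recommended approach)
# --reuse-values keeps the release's existing overrides; without it, values not listed here revert to chart defaults
helm upgrade langfuse bitnami/langfuse -n langfuse \
--reuse-values \
--set zookeeper.resources.limits.memory=1Gi \
--set zookeeper.resources.requests.memory=512Mi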
Step 2: Restore StatefulSet Configuration
# Avoid direct kubectl patches on StatefulSets - use Helm instead
# Check for missing configuration
kubectl get statefulset langfuse-zookeeper -n langfuse -o yaml | grep -E "(image|env|volumeMounts)"
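# If configuration is corrupted, restore via Helm
# --reuse-values keeps the release's existing overrides; without it, values not listed here revert to chart defaults
helm upgrade langfuse bitnami/langfuse -n langfuse \
--reuse-values \
--set zookeeper.auth.enabled=false \
--set zookeeper.allowAnonymousLogin=true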
Step 3: Fix Application Gateway Backend Configuration
# Check current backend pool configuration
az network application-gateway address-pool show \
--gateway-name lng-appgw \
--resource-group <YOUR_RESOURCE_GROUP> \
--name defaultaddresspool
# Get current pod IP
kubectl get pods -n langfuse -o wide | grep langfuse-web
# Update backend pool with correct IP format (IP only, no port)
az network application-gateway address-pool update \
--gateway-name lng-appgw \
--resource-group <YOUR_RESOURCE_GROUP> \
--name defaultaddresspool \
--servers <POD_IP_ADDRESS>
Verification Steps:
# Verify all pods are running
kubectl get pods -n langfuse
# Check memory usage is healthy
kubectl top pods -n langfuse
# Test application response
curl -I https://<YOUR_DOMAIN>
# Monitor backend health
az network application-gateway show-backend-health \
--resource-group <YOUR_RESOURCE_GROUP> \
--name lng-appgw
Prevention Measures
Resource Sizing:
- Set ZooKeeper memory to 1Gi minimum for production workloads
- Monitor memory usage patterns and set alerts at 80% utilization
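The 80% threshold can be checked with simple integer arithmetic on the values that `kubectl top` reports. This is a minimal sketch with hypothetical helper names (`mem_pct`, `over_threshold`); it assumes both usage and limit are expressed in Mi, so a 1Gi limit is passed as 1024Mi.

```shell
# mem_pct USED LIMIT
# Prints the integer percentage of LIMIT in use. Both arguments are in Mi and
# may carry an "Mi" suffix (e.g. "820Mi") or be plain numbers.
mem_pct() {
  local used="${1%Mi}" limit="${2%Mi}"
  echo $(( used * 100 / limit ))
}

# over_threshold USED LIMIT [THRESHOLD]
# Succeeds if usage is at or above THRESHOLD percent (default 80).
over_threshold() {
  [ "$(mem_pct "$1" "$2")" -ge "${3:-80}" ]
}

# Example usage (reads the MEMORY column for the first ZooKeeper pod):
# used=$(kubectl top pods -n langfuse --no-headers | awk '/zookeeper/ {print $3; exit}')
# over_threshold "$used" 1024Mi && echo "ZooKeeper memory above 80% of its limit"
```

This is only a spot check from a terminal. Persistent alerting would normally live in a monitoring system rather than a shell loop.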
Configuration Management:
- Use Helm upgrades instead of direct kubectl patch commands for StatefulSets
- Maintain clean values files for environment-specific configurations
Monitoring Setup:
# Set up resource monitoring
kubectl top pods -n langfuse --watch
# Monitor specific memory usage
watch -n 10 'kubectl top pods -n langfuse | grep zookeeper'
🔧 Most deployment issues are temporary and resolve with patience and systematic troubleshooting.