Scaling Our Logging System

At Character.AI, our infrastructure spans thousands of GPUs, serving billions of active user seconds and millions of users every month. At this scale we produce a staggering amount of log data, and that data is essential for monitoring the performance and reliability of our service.

From Fragmentation to Centralization

Initially, our logs were fragmented across multiple providers. This made debugging difficult, slowed down our queries, and made costs unpredictable. With a small team managing a rapidly growing system, we needed a logging solution that was simple, scalable, and fast.

Our first step was to unify our logs in one place, and to be deliberate about what we keep. We capture every error and warning log in full, but we sample our high-volume informational logs. This keeps log volume manageable, at billions of entries a month, without sacrificing the data we need for troubleshooting. The result is a centralized logging system that gives our developers and engineers a single source of truth.
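To make the retention policy concrete, here is a minimal sketch of severity-based sampling: errors and warnings are always kept, and everything else is kept with some probability. The level names, the `should_keep` function, and the 1% sample rate are assumptions chosen for illustration; they are not our production values or code.

```python
import random

# Hypothetical settings for illustration only.
ALWAYS_KEEP = {"ERROR", "WARNING"}  # levels captured in full
INFO_SAMPLE_RATE = 0.01             # assumed fraction of other logs to keep


def should_keep(level: str, rng: random.Random,
                sample_rate: float = INFO_SAMPLE_RATE) -> bool:
    """Return True if a log entry with this level should be stored."""
    if level.upper() in ALWAYS_KEEP:
        return True
    # rng.random() is uniform on [0.0, 1.0), so this keeps roughly
    # `sample_rate` of the remaining, high-volume entries.
    return rng.random() < sample_rate


if __name__ == "__main__":
    rng = random.Random(42)
    levels = ["INFO"] * 10_000 + ["WARNING"] * 5 + ["ERROR"] * 3
    kept = [lvl for lvl in levels if should_keep(lvl, rng)]
    print(f"kept {len(kept)} of {len(levels)} entries")
```

Passing the random generator in explicitly keeps the decision easy to test. A real pipeline might instead sample deterministically, for example by hashing a request ID, so that all lines from one request are kept or dropped together.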

Key Features and Lessons Learned

The impact of this shift was immediate. Queries that once took minutes now return in seconds, giving our teams the real-time visibility they need to quickly identify and resolve issues. This new system empowers our developers to investigate incidents with confidence. We also gained access to key features that streamline our workflow, such as:

  • Live tailing: Real-time visibility across our thousands of servers.
  • Denoise: The ability to automatically collapse common log lines and surface outliers, helping us spot unusual behavior during deploys.
  • Freeform keyword search: Our engineers can paste a snippet from an error or stack trace and instantly start investigating, without needing predefined filters.
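As a rough illustration of the idea behind collapsing common log lines, the sketch below masks variable tokens such as numbers and hex values, groups lines by the resulting template, and surfaces lines whose template is rare. This is a toy approach, not a description of how the Denoise feature works internally; the regular expression, the `<*>` placeholder, and the `rare_threshold` parameter are all assumptions.

```python
import re
from collections import Counter

# Tokens treated as "variable": hex values and (possibly dotted) numbers.
_VARIABLE = re.compile(r"\b(?:0x[0-9a-fA-F]+|\d+(?:\.\d+)*)\b")


def template(line: str) -> str:
    """Replace variable tokens with a placeholder so similar lines match."""
    return _VARIABLE.sub("<*>", line)


def denoise(lines, rare_threshold=1):
    """Split lines into collapsed common templates and rare outliers.

    Returns (common, outliers): `common` maps each frequent template to
    its count, and `outliers` lists the original lines whose template
    occurs at most `rare_threshold` times.
    """
    counts = Counter(template(line) for line in lines)
    common = {t: n for t, n in counts.items() if n > rare_threshold}
    outliers = [ln for ln in lines if counts[template(ln)] <= rare_threshold]
    return common, outliers


if __name__ == "__main__":
    sample = [
        "request 1 took 20 ms",
        "request 2 took 35 ms",
        "request 3 took 18 ms",
        "disk full on node 7",
    ]
    common, outliers = denoise(sample)
    print(common)
    print(outliers)
```

In this sample, the three request lines collapse into one template with a count of 3, while the disk-full line stands out as the only instance of its template.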

This new system gives us a lean and powerful observability stack that allows us to manage our vast infrastructure with ease. It's a key part of how we continue to innovate and maintain a reliable service for our users.

Unifying Observability

Ultimately, our goal is unified observability: bringing all of our logs, metrics, and traces into a single platform. This will give us one view for correlation and alerting, so our teams can perform comprehensive root cause analysis and resolve issues even faster. Our journey toward full observability continues, with a focus on building a more integrated and powerful system to support our future growth.