Leveraging ClickHouse for Efficient OpenTelemetry Tracing: A Resmo Case Study

Technology Category
  • Application Infrastructure & Middleware - Database Management & Storage
  • Infrastructure as a Service (IaaS) - Cloud Storage Services
Applicable Industries
  • Equipment & Machinery
  • Retail
Use Cases
  • Intrusion Detection Systems
  • Time Sensitive Networking
Services
  • System Integration
About The Customer
Resmo is a tool that collects configuration data from cloud and SaaS tools through their APIs and lets users explore that data with SQL, asking any question they want. It ships with thousands of pre-built SQL-based rules and questions, and offers visual exploration of the collected data through filters, free-text search, or a graph view. Customers can create their own rules or use automation to receive notifications through various channels when the data or a rule's status changes. Resmo's data collection generates more than 300 million spans per day, and that number is rising rapidly as its customers grow.
The Challenge
Collecting data from thousands of APIs produces a very large volume of network calls, and Resmo needed a way to observe them. Logs were too verbose and difficult to query, while aggregated metrics lacked the context needed to detect and diagnose specific issues. Tracing gave a better view of the flow of requests and their associated responses, but the volume of spans generated by Resmo's data collection was very high. The usual remedy, sampling, can create blind spots: issues on rarely executed, non-happy paths may never be captured. Without sampling, however, cost becomes a problem, because many vendors charge by the number of ingested events and by the volume of data in GB. In addition, only a few vendors allow custom SQL queries on the data.
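The blind-spot concern can be made concrete with a little arithmetic. The sketch below assumes uniform random sampling in which each occurrence is kept independently; the 1% sampling rate and the rare path that occurs five times a day are hypothetical values chosen for illustration, not figures from Resmo. Only the 300 million spans per day comes from the case study.

```python
# Illustrative arithmetic only. The sampling rate and the frequency of the
# rare path are hypothetical; only SPANS_PER_DAY comes from the case study.

SPANS_PER_DAY = 300_000_000


def expected_sampled(events_per_day: float, sampling_rate: float) -> float:
    """Expected number of events kept per day under uniform random sampling."""
    return events_per_day * sampling_rate


def probability_all_missed(events_per_day: int, sampling_rate: float) -> float:
    """Chance that every occurrence of an event is dropped by the sampler,
    assuming each occurrence is sampled independently."""
    return (1.0 - sampling_rate) ** events_per_day


if __name__ == "__main__":
    rare_events = 5  # hypothetical: a failure path hit 5 times a day
    rate = 0.01      # hypothetical: keep 1% of traces
    print(f"spans kept per day: {expected_sampled(SPANS_PER_DAY, rate):,.0f}")
    print(f"P(rare path never sampled): {probability_all_missed(rare_events, rate):.3f}")
```

Under these assumed numbers the sampler still keeps millions of spans a day, yet the rare path goes entirely unrecorded on about 95% of days, which is the kind of blind spot that full tracing avoids.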
The Solution
Resmo decided to use full tracing (no sampling) with OpenTelemetry, and ClickHouse for cost-effective and efficient storage and querying of traces. They initially considered S3 and Athena, but Athena's fixed startup delay of 2-3 seconds was a drawback. They host their own ClickHouse instance, which stores more than 4 billion spans at a compression rate of 92%. To speed up common queries, they added materialized columns for the fields used most often in queries, monitors, and dashboards. For distributed tracing they used the out-of-the-box configuration of the OpenTelemetry Collector with ClickHouse together with the Java agent, and added manual instrumentation in the form of context-specific tags on their spans. They connected ClickHouse to Postgres for their observability queries, joining the user and tenant IDs in their spans to the actual account names and account status in the Postgres database. They used Grafana to visualize the data, and IntelliJ IDEA and DataGrip to write queries.
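As a rough illustration of the materialized-column step, the sketch below builds a ClickHouse `ALTER TABLE ... ADD COLUMN ... MATERIALIZED` statement that extracts one span attribute into its own column. The table name `otel_traces`, the map column `SpanAttributes`, the column name `tenant_id`, and the attribute key `tenant.id` are assumptions for illustration; the case study does not show Resmo's actual schema, and the helper `materialized_column_ddl` is hypothetical.

```python
# Sketch only: table, column and attribute names are assumptions,
# not Resmo's actual schema.


def materialized_column_ddl(table: str, column: str, attribute_key: str,
                            map_column: str = "SpanAttributes") -> str:
    """Build an ALTER TABLE statement that adds a MATERIALIZED String column
    computed from one key of a Map-typed attributes column."""
    # Identifiers are interpolated directly, so accept only plain names.
    for name in (table, column, map_column):
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    # Escape backslashes and single quotes inside the string literal.
    key = attribute_key.replace("\\", "\\\\").replace("'", "\\'")
    return (
        f"ALTER TABLE {table} "
        f"ADD COLUMN IF NOT EXISTS {column} String "
        f"MATERIALIZED {map_column}['{key}']"
    )


if __name__ == "__main__":
    print(materialized_column_ddl("otel_traces", "tenant_id", "tenant.id"))
```

Queries, monitors, and dashboards can then filter on the dedicated column instead of looking the key up in the attributes map on every row.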
Operational Impact
  • Adopting ClickHouse and OpenTelemetry for full tracing has significantly improved Resmo's observability. Resmo can now store and query traces efficiently, with a better view of the flow of requests and their associated responses. Materialized columns for frequently used fields have significantly improved query performance without affecting storage or the compression rate. Connecting ClickHouse to Postgres lets Resmo join the user and tenant IDs in their spans to the actual account names and account status in the Postgres database from within their observability queries. This has provided a great deal of flexibility, which Resmo has also exposed to its customers so that they can easily ask arbitrary questions. Grafana for visualization, and IntelliJ IDEA and DataGrip for writing queries, round out the workflow.
Quantitative Benefit
  • Resmo's ClickHouse instance can store more than 4 billion spans.
  • The data stored in ClickHouse occupies 275 GiB on disk, compared with 3.40 TiB uncompressed: a 92% reduction in size.
  • Queries that scan all of the data complete quickly and are limited mostly by disk bandwidth.
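The compression figure can be checked from the two sizes quoted above. The short calculation below uses only numbers from the case study; the per-span estimate treats "more than 4 billion spans" as exactly 4 billion, so it is an upper bound rather than a measured value.

```python
# Check the quoted compression figure from the sizes in the case study.

GIB = 2 ** 30
TIB = 2 ** 40

COMPRESSED_BYTES = 275 * GIB
UNCOMPRESSED_BYTES = 3.40 * TIB
SPAN_COUNT = 4_000_000_000  # "more than 4 billion", so a lower bound


def space_saving(compressed: float, uncompressed: float) -> float:
    """Fraction of the uncompressed size eliminated by compression."""
    return 1.0 - compressed / uncompressed


def compression_ratio(compressed: float, uncompressed: float) -> float:
    """How many times smaller the data is on disk."""
    return uncompressed / compressed


if __name__ == "__main__":
    saving = space_saving(COMPRESSED_BYTES, UNCOMPRESSED_BYTES)
    ratio = compression_ratio(COMPRESSED_BYTES, UNCOMPRESSED_BYTES)
    print(f"space saving: {saving:.1%}")
    print(f"compression ratio: {ratio:.1f}x")
    print(f"bytes per span on disk (upper bound): {COMPRESSED_BYTES / SPAN_COUNT:.0f}")
```

The saving works out to roughly 92%, matching the stated figure, which corresponds to data about 12.7 times smaller on disk and at most around 74 bytes per stored span.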