Gardin takes outages and their effect on our customers extremely seriously, and we would like to express our sincere apologies for the inconvenience caused by this incident.
We believe it is critical to be transparent and honest when things go wrong, so we can share what we have learnt and build better products as a result. We wanted to provide you with some additional information about the service disruption that occurred on 13th January 2025.
--
Post-incident summary: Gardin Pulse data warehouse outage
What happened?
In the early hours of 11th Jan 2025, our cloud hosting provider automatically performed routine maintenance on our central data warehouse cluster. The maintenance completed without issue, and queued data processing workloads resumed shortly afterwards.
On the morning of 13th Jan, automated alarms alerted the Platform team to delays in data processing. A rapid triage investigation revealed that workloads were no longer being distributed evenly across the data warehouse cluster's computing resources, leading to severe bottlenecking of compute capacity and hindering the platform's ability to process incoming data from customer sensor fleets within the standard 15-minute latency window.
A Priority 1 incident was declared by the Gardin Platform team at 07:00 UTC, and immediate response actions were taken. The cloud provider's incident support team was engaged to investigate, and the Gardin Platform team attempted to roll back the cluster to its previous version to restore normal behaviour. Unfortunately, the rollback failed repeatedly and required further escalation to the cloud provider's support team.
While working on a resolution, the Platform team paused all processing jobs to stabilise the system and implemented a temporary mechanism to limit incoming workloads. This allowed live data processing to resume approximately 10 hours after the incident was declared. Throughout the incident, no data was lost, as all incoming data was securely stored.
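For illustration, the temporary limiting mechanism can be thought of as an admission gate placed in front of the processing queue: incoming batches are accepted and stored, but only a small number of processing jobs run against the warehouse at any one time. The Python sketch below shows the general idea only; the concurrency limit and the submit_batch/worker/process_batch names are assumptions for the example, not our actual implementation.

```python
import threading
import queue

# Illustrative admission gate: cap the number of processing jobs running
# against the warehouse at once so the remaining compute capacity is not
# overwhelmed while the cluster is degraded. The limit and the
# process_batch hook are assumed values, not production configuration.
MAX_CONCURRENT_JOBS = 4
_slots = threading.BoundedSemaphore(MAX_CONCURRENT_JOBS)
_pending = queue.Queue()  # incoming sensor batches wait here; nothing is dropped


def submit_batch(batch):
    """Queue an incoming sensor-data batch instead of processing it immediately."""
    _pending.put(batch)


def worker(process_batch):
    """Process queued batches, but only while a concurrency slot is free."""
    while True:
        batch = _pending.get()
        with _slots:  # blocks when MAX_CONCURRENT_JOBS jobs are already running
            process_batch(batch)
        _pending.task_done()
```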
Once live data processing was restored, the team focused on back-processing the historical data that had been delayed. This was carried out carefully, ensuring the system remained stable as workloads were gradually reintroduced. By 08:30 UTC on 17th Jan, all customer data had been fully processed and the system had returned to normal operation. The incident was officially closed at 09:30 UTC.
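The back-processing followed the same cautious pattern: work through the backlog in small batches and pause whenever the cluster is under load, so live processing stays within its latency window. The sketch below illustrates that approach in general terms; the batch size, load threshold and helper functions (process_batch, cluster_load) are assumptions for the example rather than our actual tooling.

```python
import time

# Illustrative backfill loop: reintroduce delayed workloads in small batches
# and back off whenever cluster utilisation is high. Thresholds and the two
# callables are assumed for the purpose of this sketch.
BATCH_SIZE = 500
MAX_CLUSTER_LOAD = 0.70  # pause back-processing if utilisation exceeds 70%


def backfill(delayed_batches, process_batch, cluster_load):
    backlog = list(delayed_batches)
    while backlog:
        if cluster_load() > MAX_CLUSTER_LOAD:
            time.sleep(60)  # let live traffic drain before continuing
            continue
        chunk, backlog = backlog[:BATCH_SIZE], backlog[BATCH_SIZE:]
        for batch in chunk:
            process_batch(batch)  # same processing path as live data
```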
Next steps/lessons learnt
We continue to work closely with our cloud hosting provider to understand the root cause of the performance issues that began after the managed maintenance work on 11th Jan.
The Platform team conducted an immediate review of the efficacy of our automated system alerts. This highlighted that alerts could have been raised earlier in this situation had more appropriate monitoring of underlying cluster resource limit breaches been in place; such monitoring will now be implemented as a high priority.
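As a rough illustration of the kind of monitoring being added, the check below raises an alert when cluster resource utilisation approaches a configured limit, rather than waiting for downstream processing delays to appear. The metric names, thresholds and helper functions are assumptions for the example, not the final implementation.

```python
# Illustrative resource-limit check: poll the cluster's own utilisation and
# raise an alert as soon as a limit is close to being breached, instead of
# alerting only once data processing is already delayed. Metric names,
# thresholds and both callables are assumed for this sketch.
WARN_THRESHOLDS = {
    "cpu_utilisation": 0.85,
    "workload_slot_usage": 0.90,
}


def check_resource_limits(get_cluster_metrics, raise_alert):
    metrics = get_cluster_metrics()  # e.g. values from the provider's monitoring API
    for name, limit in WARN_THRESHOLDS.items():
        value = metrics.get(name)
        if value is not None and value >= limit:
            raise_alert(f"{name} at {value:.0%}, above the {limit:.0%} warning threshold")
```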
The Platform team has also initiated a detailed review of the events leading up to the incident and of the subsequent response and recovery process, looking at both the technological and business-process aspects to identify improvements that can be implemented.
In closing
We apologise for the impact this incident had on our customers. While we are proud of our track record of availability, any outage is one too many, and we know this event affected many customers in significant ways. We will do everything we can to learn from this event and use it to improve our availability even further.