Gardin - Processing delays - Incident details

Processing delays

Resolved
Major outage
Started 5 days ago · Lasted 3 days

Affected

Reporting App (app.gardin.ag)

Partial outage from 9:08 AM to 8:36 AM, Operational from 8:36 AM to 5:21 PM

APIs

Partial outage from 9:08 AM to 8:36 AM, Operational from 8:36 AM to 5:21 PM

Query API

Partial outage from 9:08 AM to 8:36 AM, Operational from 8:36 AM to 5:21 PM

Updates
  • Update

    Gardin takes outages and their effect on our customers extremely seriously, and we would like to express our sincere apologies for the inconvenience caused by this incident.

    We believe it is critical to be transparent and honest when things go wrong, so that we can share what we have learnt and build better products as a result. We would like to provide you with some additional information about the service disruption that occurred on 13th January 2025.

    --


    Post-incident summary: Gardin Pulse data warehouse outage 

    What happened?

    In the early hours of 11th Jan 2025, routine maintenance of our central data warehouse cluster was performed automatically by our cloud hosting provider. The maintenance completed without issue, with queued data processing workloads resuming shortly after.

    On the morning of 13th Jan, automated alarms alerted the Platform team to delays arising in data processing. A rapid triage investigation revealed that workloads were no longer being distributed evenly across the data warehouse cluster's computing resources, leading to severe bottlenecking of compute capacity and preventing the platform from processing incoming data from customer sensor fleets within the standard 15-minute latency window.

    A Priority 1 incident was declared by the Gardin Platform team at 07:00 UTC, and immediate response actions were taken. The cloud provider's incident support team was engaged to investigate, and the Gardin Platform team attempted to roll back the cluster to its previous version to restore normal behaviour. Unfortunately, the rollback failed repeatedly and required further escalation to the cloud provider's support team.

    While working on a resolution, the Platform team paused all processing jobs to stabilise the system and implemented a temporary mechanism to limit incoming workloads. This approach allowed live data processing to resume approximately 10 hours after the incident was declared. Throughout the incident, no data was lost as all incoming information was securely stored. 
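
    For illustration only, the simplified sketch below shows the general shape of this kind of back-pressure mechanism: incoming workloads are queued rather than dropped, and a fixed-size worker pool drains the queue so that no more than a set number of jobs run at once. The names and limits used (submit, process, MAX_CONCURRENT_JOBS) are placeholders, not our production implementation.

    # Illustrative sketch only: a bounded worker pool draining a backlog of
    # incoming processing jobs. All names and limits are hypothetical.
    import queue
    import threading
    import time

    MAX_CONCURRENT_JOBS = 4                        # hypothetical cap on in-flight workloads
    backlog: "queue.Queue[str]" = queue.Queue()    # incoming data is queued, never dropped

    def submit(job_id: str) -> None:
        """Accept an incoming workload by queueing it rather than running it immediately."""
        backlog.put(job_id)

    def process(job_id: str) -> None:
        """Stand-in for the real data-processing step."""
        time.sleep(0.1)
        print(f"processed {job_id}")

    def worker() -> None:
        """Drain the backlog; total concurrency is capped by the number of workers."""
        while True:
            job_id = backlog.get()
            try:
                process(job_id)
            finally:
                backlog.task_done()

    if __name__ == "__main__":
        for _ in range(MAX_CONCURRENT_JOBS):
            threading.Thread(target=worker, daemon=True).start()
        for i in range(20):
            submit(f"sensor-batch-{i}")
        backlog.join()                             # wait for the queued work to complete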

    Once the live data was restored, the team focused on back-processing the historical data that had been delayed. This process was carried out carefully, ensuring the system remained stable as workloads were gradually reintroduced. By 08:30 UTC on 17th Jan, all customer data was fully processed, and the system returned to normal operation. The incident was officially closed at 09:30 UTC. 
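
    Again for illustration only, the sketch below outlines the general pattern behind this kind of gradual backfill: delayed items are reprocessed in small batches, pausing whenever a health check reports that the cluster is under pressure. The batch size, cool-off period and helper functions are placeholders, not our production code.

    # Illustrative sketch only: batched back-processing of delayed historical
    # data with a hypothetical health check between batches.
    import time
    from typing import Iterable, List

    BATCH_SIZE = 50          # hypothetical number of delayed items reprocessed per step
    COOL_OFF_SECONDS = 30    # hypothetical pause while the cluster recovers headroom

    def cluster_is_healthy() -> bool:
        """Placeholder for a real check of queue depth, CPU and memory headroom."""
        return True

    def reprocess(batch: List[str]) -> None:
        """Placeholder for re-running the normal processing pipeline on old data."""
        print(f"reprocessed {len(batch)} delayed items")

    def backfill(delayed_items: Iterable[str]) -> None:
        batch: List[str] = []
        for item in delayed_items:
            batch.append(item)
            if len(batch) == BATCH_SIZE:
                while not cluster_is_healthy():    # hold off rather than add more load
                    time.sleep(COOL_OFF_SECONDS)
                reprocess(batch)
                batch = []
        if batch:
            reprocess(batch)                       # flush the final partial batch

    backfill(f"reading-{i}" for i in range(230))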

    Next steps/lessons learnt

    1. We continue to work extremely closely with our cloud hosting provider to understand the root cause of the performance issues that started following the managed maintenance work on 11th Jan.

    2. The Platform team conducted an immediate review of the efficacy of our automated system alerts. This highlighted that alerts could have been raised earlier had more appropriate monitoring of underlying cluster resource limit breaches been in place. This monitoring will now be implemented as a high priority (a simplified sketch of this kind of check follows this list).

    3. The Platform team has initiated a detailed review of the events leading up to the incident and the subsequent response and recovery process, looking at both the technological and business process aspects to identify improvements that can be implemented.
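
    As a simplified illustration of the kind of check described in point 2 (the thresholds, metric names and helper functions below are placeholders rather than our production monitoring), resource limits can be compared against live cluster metrics and an alert raised as soon as any limit is breached:

    # Illustrative sketch only: raise an alert when any cluster resource metric
    # exceeds its configured limit. All metrics and thresholds are hypothetical.
    from typing import Dict

    LIMITS = {
        "cpu_utilisation_pct": 85.0,
        "queue_depth": 10_000,
        "slot_usage_pct": 90.0,
    }

    def get_cluster_metrics() -> Dict[str, float]:
        """Placeholder for reading current metrics from the data warehouse cluster."""
        return {"cpu_utilisation_pct": 92.0, "queue_depth": 14_500.0, "slot_usage_pct": 97.0}

    def raise_alert(message: str) -> None:
        """Placeholder for paging the on-call Platform team."""
        print(f"ALERT: {message}")

    def check_resource_limits() -> None:
        metrics = get_cluster_metrics()
        for name, limit in LIMITS.items():
            value = metrics.get(name)
            if value is not None and value > limit:
                raise_alert(f"{name}={value} exceeds configured limit {limit}")

    check_resource_limits()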

    In closing

    We apologise for the impact this incident had on our customers. While we are proud of our track record of availability, any outage is one too many, and we know this event impacted many customers in significant ways. We will do everything we can to learn from this event and use it to improve our availability even further.

  • Resolved

    We’re pleased to inform you that full service was successfully restored overnight, and we are now reprocessing the affected historical data to ensure everything is back to normal.

    Our focus is now shifting to thoroughly investigating how this incident occurred and identifying the lessons we can learn to prevent future disruptions. We are committed to transparency and will share a detailed post-incident review report with you via this status page once our analysis is complete.

    We sincerely apologise for the inconvenience caused and thank you for your patience and understanding as we work to make things right.

  • Update

    We are writing to provide an update on the ongoing system performance issues.

    In collaboration with our cloud computing provider, we have now identified the underlying root cause, and our engineering teams are working diligently on a permanent fix. We are targeting end of play today to restore current incoming index data for all customers. We will then proceed with the restoration of the missing historical data.

    We understand the impact this outage is having on our customers, and sincerely apologise.

  • Update

    We are continuing to make good progress in restoring customer indices data. To address the ongoing performance issues we are implementing an alternative architecture, and our engineering teams are working with our cloud-hosting provider to deliver it as soon as possible.

  • Update

    We are continuing to monitor the fix implemented by our engineering team and remain in close contact with our cloud computing provider whilst we complete the re-enablement of all tenant indices data streams.

    We are now targeting resumption of incoming indices data for all customers by lunchtime on Tuesday 14th Jan. Once this has been achieved, we will begin reprocessing the historical data from 11th Jan onwards.

  • Monitoring

    We are writing to provide an update on the ongoing system performance issues. We have been in contact with our cloud computing provider, who are assisting us in diagnosing the root cause with the highest priority.

    We have started re-enabling the processing of indices data for customers and are actively working to resolve any remaining issues as quickly as possible. We sincerely apologise for the continued inconvenience and thank you for your patience as we address this matter.

  • Identified

    We are continuing to work on a fix.

  • Investigating

    We are currently experiencing delays in processing index data, and the engineering team is fully engaged in providing a resolution as quickly as possible. Please be assured that no data has been lost, and all historical data will be fully processed and made available via both the Gardin reporting app and the Query API as soon as possible.

    We sincerely apologise for any inconvenience this may cause and appreciate your patience as we work to resolve this issue swiftly.