Pages Experiencing Slow Loading Times

Incident Report for Uscreen

Postmortem

Date of Incident: January 6, 2025

Summary:

On January 6, 2025, we experienced degraded performance and limited service interruptions caused by unprecedented traffic growth and an issue within our database infrastructure. A CPU resource leak in our database provider’s system drove up resource utilization, making it harder to meet demand. This incident tested our system’s capacity, and while it caused temporary disruptions, it also highlighted opportunities for immediate and long-term improvements.

Root Cause

Increased Demand:
- Platform traffic and user activity tripled compared to typical levels, leading to an unexpected surge in database load.
- This spike caused memory saturation and an excessive number of open database connections.
CPU Resource Leak:
- A resource leak in the database provider’s infrastructure led to persistent CPU spikes, limiting system efficiency.
- This issue prevented the system from scaling effectively to handle the increased workload.
Inefficient Resource Allocation:
- Idle database connections and unoptimized queries further stressed the infrastructure, reducing its ability to respond to peak demand.

Resolution

Immediate Actions:

Increased Database Capacity:
- Doubled CPU and memory resources in our database instance to handle the increased load.
Instance Restarts:
- Refreshed instances to eliminate stale connections and stabilize performance.
Cleared Stale Connections:
- Optimized active connection counts, reducing strain on the system.
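
For readers who want a concrete picture of what clearing stale connections can involve, the following is a minimal sketch only and not the procedure our team ran. It assumes a hypothetical `ConnectionInfo` record (connection id, state, seconds idle) and an illustrative 300-second idle threshold; neither comes from our production setup.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ConnectionInfo:
    """Hypothetical snapshot of one database connection (illustrative only)."""
    conn_id: int
    state: str            # "active" or "idle"
    idle_seconds: float   # time since the connection last did any work


def find_stale_connections(
    conns: List[ConnectionInfo],
    max_idle_seconds: float = 300.0,  # assumed threshold, not a real setting
) -> List[ConnectionInfo]:
    """Return idle connections that have exceeded the idle threshold."""
    return [
        c for c in conns
        if c.state == "idle" and c.idle_seconds > max_idle_seconds
    ]


# Example snapshot: only connection 2 is both idle and past the threshold.
sample = [
    ConnectionInfo(conn_id=1, state="active", idle_seconds=0.0),
    ConnectionInfo(conn_id=2, state="idle", idle_seconds=900.0),
    ConnectionInfo(conn_id=3, state="idle", idle_seconds=12.0),
]
stale_ids = [c.conn_id for c in find_stale_connections(sample)]
```

In a real system, the snapshot would come from the database's own session view, and the connections selected here would then be closed by the database or the connection pooler.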

Short-Term Adjustments:

- Adjusted query handling to improve load distribution.
- Enhanced monitoring tools to identify potential resource leaks sooner.
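
As an illustration of the kind of check that can surface a potential resource leak earlier, here is a minimal sketch. It is a simple heuristic, not a description of our monitoring stack: the function name, the 85% CPU threshold, and the five-sample window are all assumptions chosen for the example.

```python
from typing import Sequence


def has_sustained_high_cpu(
    samples: Sequence[float],
    threshold: float = 85.0,   # assumed CPU percentage threshold
    min_consecutive: int = 5,  # assumed number of back-to-back samples
) -> bool:
    """Return True if CPU stayed at or above the threshold for
    at least `min_consecutive` consecutive samples."""
    run = 0
    for value in samples:
        # Extend the current run on a high sample, reset it otherwise.
        run = run + 1 if value >= threshold else 0
        if run >= min_consecutive:
            return True
    return False


# A brief spike should not alert, but a persistent plateau should.
brief_spike = [40.0, 92.0, 95.0, 45.0, 41.0, 39.0]
persistent = [60.0, 88.0, 90.0, 91.0, 93.0, 95.0, 97.0]
alert_on_spike = has_sustained_high_cpu(brief_spike)
alert_on_plateau = has_sustained_high_cpu(persistent)
```

Requiring several consecutive high samples helps separate a persistent condition, such as a leak, from a short burst of legitimate traffic.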


This incident underscores the challenges of balancing rapid growth with infrastructure resilience. By immediately increasing database capacity and addressing inefficiencies, we have stabilized the platform in the near term. Moving forward, we are committed to strengthening our systems through proactive scaling, deeper collaboration with our providers, and better resource management to support your continued success on our platform.

Posted Jan 06, 2025 - 20:31 UTC

Resolved

The issue has been successfully resolved. Thank you for your patience during this process.

Posted Jan 06, 2025 - 19:45 UTC

Monitoring

A fix has been implemented, and we are seeing improvements in the Admin Area and Storefront pages.

Our team is actively monitoring the situation and tracking the improvements.

Posted Jan 06, 2025 - 19:29 UTC

Update

Our team continues to work towards a solution for this issue. Thank you for your continued patience.

Posted Jan 06, 2025 - 19:09 UTC

Update

Our team continues to work towards a fix for the issue causing slow loading times for the catalog and admin pages.

Posted Jan 06, 2025 - 18:35 UTC

Identified

Our team has identified the root cause of the slowdown affecting the Admin and Storefront pages. We are currently working on a fix.

Posted Jan 06, 2025 - 17:58 UTC

Investigating

We are currently looking into an issue causing slow loading times for both the Admin and Storefront pages.

Posted Jan 06, 2025 - 17:29 UTC

This incident affected: Admin Portal and Storefront.