Postmortem Report for Website Downtime Incident

Issue Summary

Incident ID: INCNUM-007-20240607

Incident Date: June 7, 2024

Incident Time: 12:00 PM - 03:40 PM UTC

Website: https://vic-webstore.netlify.app/

Total Downtime: 3 hours 40 minutes

Timeline

12:00 PM UTC: Initial reports of the website being down were received from users.

12:10 PM UTC: The outage was confirmed by monitoring systems and the incident response team was alerted.

12:30 PM UTC: Initial diagnosis pointed to an error in the server configuration file.

12:50 PM UTC: The error was corrected, but efforts to restart the server were unsuccessful.

01:00 PM UTC: Subsequent diagnosis pointed to server overload due to increased traffic.

01:30 PM UTC: Efforts to restart the server and clear temporary files were unsuccessful.

01:30 PM UTC: Database connection issues were identified as another problem.

02:00 PM UTC: A problematic caching plugin was identified.

02:30 PM UTC: Plugin was deactivated, and database connections were reset.

03:00 PM UTC: Server resources were reallocated, and the website began to stabilize.

03:40 PM UTC: Website was fully operational and accessible to users.

Root Cause and Resolution

  • Corrected a typographical error in the web server configuration file.
  • Deactivated the problematic caching plugin.
  • Reset and optimized database connections.
  • Increased server resources (CPU and RAM) to handle the increased load.
  • Implemented traffic rate limiting to manage sudden spikes in traffic more effectively.
  • Conducted a thorough review of all installed plugins and their configurations.
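
The report does not say which rate-limiting mechanism was put in place. As one possible illustration, the sketch below shows a token-bucket limiter, a common way to absorb short bursts while capping the sustained request rate. The class name, the parameters, and the injectable clock are all assumptions made for this example, not details of the actual fix.

```python
import time


class TokenBucket:
    """Allow roughly `rate` requests per second, with bursts of up to `capacity`."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum tokens the bucket can hold
        self.clock = clock            # injectable so the limiter can be tested
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self) -> bool:
        """Return True if the request may proceed, False if it should be rejected."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

In a real deployment a limiter like this would typically be kept per client (for example, per IP address), and rejected requests would receive an HTTP 429 response.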

Corrective and Preventative Measures

Load Testing: Implement regular load testing to understand the website’s capacity and prepare for traffic spikes.
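
A regular load test can start very small. The sketch below, which uses only the Python standard library, sends a fixed number of concurrent requests and reports how many succeeded. The function names, the request counts, and the success criterion (status codes 200 to 399) are assumptions for illustration; a dedicated load-testing tool would normally replace this in practice.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status of a GET request, or 0 on any failure.

    urllib raises an exception for 4xx/5xx responses, so those also count as 0.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except Exception:
        return 0


def run_load_test(url: str, total_requests: int, concurrency: int, fetch=fetch_status) -> dict:
    """Send `total_requests` GETs using `concurrency` worker threads and summarize."""

    def timed(_):
        start = time.perf_counter()
        status = fetch(url)
        return status, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed, range(total_requests)))

    succeeded = sum(1 for status, _ in results if 200 <= status < 400)
    latencies = [elapsed for _, elapsed in results]
    return {
        "requests": total_requests,
        "succeeded": succeeded,
        "failed": total_requests - succeeded,
        "max_latency_s": max(latencies) if latencies else 0.0,
    }
```

A call such as `run_load_test(staging_url, total_requests=200, concurrency=20)` would be pointed at a staging copy of the site rather than production, so that the test itself cannot cause an outage.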

Monitoring and Alerts: Enhance monitoring tools to provide real-time alerts on resource usage, database performance, and plugin conflicts.
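
The core of such an alert is a comparison between a metric sample and a threshold. The sketch below shows that check in isolation. The metric names and limits are hypothetical values chosen for illustration, and a real setup would feed the function from the monitoring system and route the resulting messages to a pager or chat channel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str    # key expected in the metrics sample
    limit: float   # alert when the sampled value exceeds this
    message: str   # human-readable description of the problem


# Hypothetical limits, chosen only for illustration.
THRESHOLDS = [
    Threshold("cpu_percent", 85.0, "CPU usage is high"),
    Threshold("memory_percent", 90.0, "Memory usage is high"),
    Threshold("db_connections_percent", 80.0, "Database connection pool is nearly exhausted"),
]


def check_metrics(sample: dict, thresholds=THRESHOLDS) -> list:
    """Return one alert string for every metric in `sample` that exceeds its limit."""
    alerts = []
    for t in thresholds:
        value = sample.get(t.metric)
        # Metrics missing from the sample are skipped rather than treated as zero.
        if value is not None and value > t.limit:
            alerts.append(f"{t.message}: {t.metric}={value} (limit {t.limit})")
    return alerts
```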

Plugin Management: Establish a protocol for testing and validating plugins in a staging environment before deploying them to the live site.

Scalability: Implement auto-scaling solutions to automatically adjust server resources based on traffic patterns.
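
Most auto-scaling services are configured rather than coded, but the decision they make can be sketched in a few lines. The example below uses a proportional, target-tracking rule: the instance count is scaled by the ratio of observed CPU usage to a target value, then clamped to a safe range. The function name, the 60% target, and the instance limits are assumptions for illustration only.

```python
import math


def desired_instances(current: int, cpu_percent: float, target_cpu: float = 60.0,
                      min_instances: int = 1, max_instances: int = 10) -> int:
    """Return the instance count that would bring average CPU near `target_cpu`."""
    if current < 1 or target_cpu <= 0:
        raise ValueError("current must be >= 1 and target_cpu must be positive")
    # Scale proportionally to load, rounding up so capacity is never undershot.
    wanted = math.ceil(current * cpu_percent / target_cpu)
    # Clamp to the configured bounds to avoid runaway scaling in either direction.
    return max(min_instances, min(max_instances, wanted))
```

With these defaults, two instances running at 90% CPU would be scaled to three, while eight instances at 120% would be capped at the maximum of ten.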

Traffic Management: Use Content Delivery Networks (CDNs) and caching solutions to distribute load and reduce server strain.
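
To show why caching reduces server strain, the sketch below keeps a computed response for a fixed time-to-live (TTL), so repeated requests within that window do not trigger the expensive work again. The class and method names are hypothetical, and this in-memory version is only an illustration of the idea; a CDN or a dedicated cache would do the same job at the edge or across servers.

```python
import time


class TTLCache:
    """Cache values for `ttl_seconds`; recompute only after they expire."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock    # injectable so expiry can be tested
        self._store = {}      # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        """Return the cached value for `key`, calling `compute()` on a miss or expiry."""
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = compute()
        self._store[key] = (now + self.ttl, value)
        return value
```

For example, wrapping a page-rendering function with a 30-second TTL means a traffic spike on one page costs at most one render every 30 seconds instead of one per request.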

Conclusion

This incident highlighted the need for better traffic management, resource allocation, and proactive monitoring. By implementing the corrective and preventative measures outlined above, we aim to improve the website’s resilience and ensure high availability for our users.

Reported By:

Victor Anokwuru

Software Engineer

7th June, 2024.

