[dpdk-ci] UNH-IOL Community Lab Downtime Post Mortem

Thomas Monjalon thomas at monjalon.net
Mon Oct 11 10:36:53 CEST 2021


I have two main concerns:
1/ We were not notified of the issue.
2/ Restoring the system took 7 days.
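
To put a number on the second point, here is a minimal Python sketch that
computes the elapsed time from the failed upgrade to each recovery milestone
in the timeline quoted below. The year (2021) and a single local time zone
are assumptions, since the timeline states neither; the names START,
MILESTONES and downtime are illustrative only.

```python
from datetime import datetime

# Timestamps copied from the quoted timeline. The year 2021 and a single
# local time zone are assumptions; the timeline states neither.
START = datetime(2021, 9, 27, 8, 30)  # failed Jenkins upgrade

MILESTONES = {
    "server functionality restored": datetime(2021, 10, 3, 16, 0),
    "compile and unit testing re-enabled": datetime(2021, 10, 4, 11, 30),
    "bare-metal testing re-enabled": datetime(2021, 10, 5, 11, 0),
}


def downtime(end, start=START):
    """Return the elapsed time between the failure and a milestone."""
    return end - start


if __name__ == "__main__":
    for label, when in MILESTONES.items():
        delta = downtime(when)
        # timedelta stores whole days separately from the remaining seconds
        hours = delta.seconds / 3600
        print(f"{label}: {delta.days} days, {hours:.1f} hours")
```

Under these assumptions, the server came back after a little over 6 days,
compile and unit testing after a little over 7 days, and bare-metal testing
after a little over 8 days.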


08/10/2021 22:29, Lincoln Lavoie:
> Hello All,
> 
> During the CI meeting, there was a request to provide the post mortem
> review of the recent unplanned downtime.
> 
> Timeline:
> * September 27, 8:30am - What should have been a routine upgrade to the
> Jenkins server failed, triggering the downtime.
> * September 27, 8:40am - Failed upgrade detected through a combination of
> automated notifications and job failures in Jenkins.
> * September 27 - October 3 - UNH Team worked to restore the system to the
> original configuration.
> * October 3, 4pm - Server functionality restored
> * October 4, 11:30am - Jenkins pipelines re-enabled for compile and unit
> testing
> * October 5, 11am - Jenkins pipeline for bare-metal performance and
> functional testing re-enabled, after nominal debug / trial run.
> 
> Root Cause:
> The ansible script / playbook used to maintain the lab (including the
> Jenkins server) caused a trust failure of kerberos (between the server and
> the IPA domain controller) used to secure the NFS mounts hosting the
> Jenkins databases, configuration, log output, etc.  This prevented Jenkins
> from starting properly and complicated the restoration of the Jenkins
> service.
> 
> Changes:
> 1. Per the community request, UNH will provide notice to the CI email list
> prior to upgrades, even for routine maintenance upgrades.
> 2. The UNH-IOL notification / monitoring server will be configured to also
> send notifications to the CI email list.  Note, you will see all
> notifications, including routine maintenance, e.g. host reboots.  This
> was indicated as acceptable during the CI meeting.
> 3. This email summary.
> 
> As of Friday afternoon, Jenkins has "caught up" and has a queue of about
> 20 jobs, which is about one patch's worth of testing.  Please let me know
> if there are any questions or if anything else looks incorrect in the test
> results.  We apologize for the inconvenience this caused, while waiting for
> the automated testing to be restored.
> 
> Cheers,
> Lincoln



