Objectives of this Document

  • Describe the procedures for outages and planned maintenance
  • Identify the proper way to post an alert message for outages
  • Identify the three stages of scheduling planned maintenance
  • Post a schedule of maintenance windows
  • Identify when a Root Cause Analysis (RCA) is needed

Table of Contents

  • Outages
  • Planned Maintenance
  • Scheduling Maintenance
  • Maintenance Windows
  • Root Cause Analysis (RCA)

Public information about outages can be found on the Maintenance Windows page of the Libraries' main website.

Outages

Procedures for handling reports of, and updating users about, UNPLANNED interruptions of service. Depending on the impact of an outage, ALL of the following communication methods may be needed.


Outages Slack Channel

  • Employees should report platform and network outages in the Outages channel. Problems affecting only an individual's own equipment or account should instead be reported through the appropriate ticketing system.

  • All Digidev units should have a process for monitoring this channel for outage reports.

  • When a report is made, the team responsible for the platform or service should:
    • acknowledge the message
    • provide updates on the status of the issue.

  • Unit managers should post any unplanned outages to this channel as soon as possible and provide periodic updates to users.



Intranet Alert

  • Significant outages should also be posted on the Intranet as an alert.

  • Employees will receive an email when an alert is created.

  • Remember to resolve the alert once the outage has been resolved.


*** It is up to the unit manager to determine what counts as a significant outage, but please err on the side of caution, especially during busy periods such as finals week.


Main Website Alert

  • For significant outages that impact user services, an alert must be posted on the Libraries' main website.

  • To add an alert, send the message to Web Services.

  • It is important to let Web Services know when to remove the message.

*** It is up to the unit manager to determine what counts as a significant outage, but please err on the side of caution, especially during busy periods such as finals week.


Planned Maintenance

We have set aside Monday maintenance windows throughout the year during which planned maintenance, upgrades, and code pushes can be scheduled to minimize the impact on services. The maintenance window runs from 6:00 AM to 12:00 PM every Monday, except on the holidays and blackout dates listed below. Please follow the process below for scheduling maintenance and communicating it to the public.

Scheduling Maintenance


Notifying Stakeholders

  • Make sure all stakeholders know about any changes to the system or potential downtime. The required notification lead time depends on how major the change is.

  • For major interruptions, a WEEKLY UPDATE must be submitted before the outage, with appropriate lead time.

  • Web Services requires longer lead times so that LibGuides, tutorials, and instruction materials can be updated. SHAREOK partners must also be notified in advance.


*** It is up to the unit manager to determine who the internal stakeholders are. Changes to any public service must also be announced to the general public.





Notifying Infrastructure Working Group (IWG)

  • On the Wednesday before a maintenance window, unit managers should post information about the planned maintenance in the IWG Slack channel.

  • IWG committee members will review the change and determine if there are any conflicts.

  • If more than one change is scheduled for the same day, teams should create a schedule so that multiple services are not interrupted at the same time.

  • All significant changes, maintenance, and upgrades need IWG approval, but it is up to the unit manager to determine what changes are considered significant. 


Posting an Alert

  • A few days before the maintenance, an alert should be posted on the Libraries' main website to notify the public of any potential outages.

  • On the day of the maintenance, teams need to post in the Slack Outages channel when work has begun.
    Example: "Maintenance on X system has started."

  • Teams should post messages about any unplanned delays.
    Example: "Maintenance on X system is taking longer than expected."

  • Teams should post a message when work is completed. 
    Example: "Maintenance on X system has been completed."

Maintenance Windows

2019

  • June 17th
  • July 15th
  • August 12th
  • August 26th
  • September 9th
  • September 23rd
  • October 7th
  • October 21st
  • November 4th
  • November 18th
  • December 16th

2020

  • January 13th
  • January 27th
  • February 10th
  • February 24th
  • March 9th
  • March 23rd
  • April 6th
  • April 20th
  • May 18th
  • June 1st
  • June 15th
  • June 29th

Weekly Maintenance Windows

  • We will have a weekly maintenance window every Monday.
  • Because weekend ILLiad usage prevents downtime on Mondays, ad hoc windows will be scheduled for changes that require an outage.
  • The maintenance window will continue to run from 6:00 AM to 12:00 PM.
  • Major maintenance work will be scheduled with service impacts and peak-use dates in mind.
  • We will continue to follow the normal process for notifying stakeholders and customers of planned work and/or outages.

The following Mondays are excluded from the weekly maintenance schedule, based on the 2021/2022 Academic Calendar:

Holidays

  • July 5, 2021
  • September 6, 2021
  • January 17, 2022
  • May 30, 2022

SHAREOK Blackouts (final week to submit work)

  • August 2, 2021
  • December 13, 2021
  • May 9, 2022
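The weekly rule and the exclusion lists above can be expressed as a small scheduling check, which may help when planning work against the calendar. The sketch below is a hypothetical Python illustration, not an existing tool: the names `maintenance_mondays` and `in_window` are invented for this example, it assumes the excluded dates fall in 2021 (July through December) and 2022 (January through May), it treats 12:00 PM as the exclusive end of the window, and it does not model ad hoc ILLiad windows or IWG approval.

```python
from datetime import date, datetime, time, timedelta

# Assumed years, inferred from the 2021/2022 Academic Calendar lists.
HOLIDAYS = {date(2021, 7, 5), date(2021, 9, 6), date(2022, 1, 17), date(2022, 5, 30)}
SHAREOK_BLACKOUTS = {date(2021, 8, 2), date(2021, 12, 13), date(2022, 5, 9)}
EXCLUDED = HOLIDAYS | SHAREOK_BLACKOUTS

WINDOW_START = time(6, 0)   # 6:00 AM
WINDOW_END = time(12, 0)    # 12:00 PM, treated here as the exclusive end


def maintenance_mondays(start: date, end: date) -> list[date]:
    """Return every Monday in [start, end] that is not an excluded date."""
    # date.weekday() is 0 for Monday; step forward to the first Monday >= start.
    current = start + timedelta(days=(-start.weekday()) % 7)
    windows = []
    while current <= end:
        if current not in EXCLUDED:
            windows.append(current)
        current += timedelta(weeks=1)
    return windows


def in_window(moment: datetime) -> bool:
    """Return True if `moment` falls inside a weekly maintenance window."""
    return (
        moment.weekday() == 0
        and moment.date() not in EXCLUDED
        and WINDOW_START <= moment.time() < WINDOW_END
    )
```

For example, `maintenance_mondays(date(2021, 7, 1), date(2022, 6, 30))` lists the Monday windows for an assumed July-to-June year, and `in_window(datetime(2021, 12, 13, 8, 0))` returns `False` because December 13 is a SHAREOK blackout.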

Unscheduled Deployments

Most changes should be announced in advance and applied during the prearranged maintenance windows. This policy still allows low-risk and emergency changes to be made outside this schedule.

Root Cause Analysis (RCA)

Unplanned outages require a Root Cause Analysis (RCA) to be completed by the team responsible for the affected service. See the RCA documentation for more information.