1. Outages most often happen because of config changes in dependencies used by your application. For example, let's say you are running Hadoop jobs on a cluster and relying on HADOOP_USER_NAME to set the user whose privileges your job runs under. This has been working for quite some time. Then the cluster admin enables Kerberos on the cluster, your job suddenly fails, and you have an outage.
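When Kerberos is disabled, Hadoop picks up the user's identity first from the environment variable HADOOP_USER_NAME, then from the OS-level username (e.g. the system property user.name). Once Kerberos is enabled, HADOOP_USER_NAME is ignored and the identity comes from the Kerberos credentials instead, which is why the job breaks.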
2. The second most common cause of an outage is an outage in a dependency. Say you are relying on AWS S3 to fetch some static content, and then this happened.
3. A third common reason is expired credentials. Your service has been working fine for months. Then, all of a sudden, some credentials expire and you get paged.
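
Two of the causes above (a changed auth mode and expiring credentials) can be caught before they turn into a page. Here is a minimal sketch of such preflight checks, not a definitive implementation. It assumes you can read the cluster's core-site.xml as text and that you know each credential's expiry time; the function names `hadoop_auth_mode` and `expires_soon` and the 14-day warning window are made up for illustration. `hadoop.security.authentication` is the real Hadoop property, and it defaults to `simple`.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone

AUTH_PROPERTY = "hadoop.security.authentication"


def hadoop_auth_mode(core_site_xml: str) -> str:
    """Return the auth mode configured in core-site.xml text.

    Falls back to "simple" (Hadoop's default) when the property is unset.
    """
    root = ET.fromstring(core_site_xml)
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name == AUTH_PROPERTY:
            value = (prop.findtext("value") or "simple").strip()
            return value.lower()
    return "simple"


def expires_soon(expires_at: datetime, now: datetime, warn_days: int = 14) -> bool:
    """True if the credential expires within warn_days (hypothetical threshold)."""
    return expires_at - now <= timedelta(days=warn_days)


if __name__ == "__main__":
    sample = """
    <configuration>
      <property>
        <name>hadoop.security.authentication</name>
        <value>kerberos</value>
      </property>
    </configuration>
    """
    if hadoop_auth_mode(sample) != "simple":
        print("HADOOP_USER_NAME will be ignored; the job needs Kerberos credentials")

    now = datetime.now(timezone.utc)
    if expires_soon(now + timedelta(days=3), now):
        print("credential expires soon; rotate it before it pages you")
```

Run from a scheduled job, checks like these can raise a low-urgency alert days ahead of the failure instead of a pager alarm at the moment it happens.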