Many of my blog posts focus on how to become more agile. One key strategy for increasing agility is to focus on core competencies and leverage the cloud for all non-core functions. Much of the discussion about cloud services centers on IaaS, specifically AWS versus the various private cloud solutions, but companies are also achieving agility by leveraging SaaS solutions for key operations functions.
PagerDuty is a SaaS solution that provides a one-stop shop for alert management. If you look under the covers of today’s cloud architectures, you will likely see a collection of monitoring tools, each used for a very specific function. A system may leverage a tool like Nagios or Zabbix to monitor cloud infrastructure, New Relic or AppDynamics to monitor application performance, Pingdom or Keynote to monitor websites, and Splunk or Loggly to collect and monitor log files. The problem this creates for the operations team is that each of these monitoring solutions has its own way of generating alerts and warnings. This causes two issues. First, there is no consistent delivery mechanism for alerting people when issues arise or when important information needs to be proactively communicated. Second, because alerts arrive from multiple sources, it is challenging to consolidate the data, learn from the alerts, and put action plans in place to reduce or correct issues.
PagerDuty addresses these issues. It integrates with numerous monitoring solutions and exposes APIs that developers can call directly from their applications, so that all alerts are routed through a single point and delivered in a consistent manner. In addition, all alert data now lives in a centralized database where it can be mined, so the operations and development teams can start identifying patterns and proactively address issues within the system to reduce the number and severity of future alerts. PagerDuty also has a built-in escalation process based on user-defined rules. How many times have we seen a critical alert sent to someone in the early morning hours, only to remain unanswered for a long period of time? With the auto-escalation rules, the system automatically alerts the next person in line and continues escalating until there is a response, thus improving overall response time to issues.
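To make the auto-escalation idea concrete, here is a minimal Python sketch of the general pattern: notify each person in line, wait for an acknowledgement, and move on if none arrives. This is not PagerDuty's implementation or API. The names `EscalationLevel`, `escalate`, `notify`, and `acknowledged_by` are hypothetical, as are the responder names and timeouts, and the sketch stops after the last level where a real policy might loop back to the top.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class EscalationLevel:
    responder: str        # hypothetical on-call contact
    timeout_minutes: int  # how long to wait for an acknowledgement


def escalate(
    levels: List[EscalationLevel],
    acknowledged_by: Callable[[str, int], bool],
    notify: Callable[[str], None],
) -> Optional[str]:
    """Notify each level in order until someone acknowledges.

    `notify` delivers the alert to a responder. `acknowledged_by` is an
    assumed callback that waits up to the given timeout and reports
    whether that responder acknowledged the alert.
    Returns the responder who acknowledged, or None if nobody did.
    """
    for level in levels:
        notify(level.responder)
        if acknowledged_by(level.responder, level.timeout_minutes):
            return level.responder
    return None


if __name__ == "__main__":
    policy = [
        EscalationLevel("primary-on-call", 15),
        EscalationLevel("secondary-on-call", 15),
        EscalationLevel("engineering-manager", 30),
    ]
    notified: List[str] = []
    # Simulate the early-morning case: the primary sleeps through the alert.
    handled_by = escalate(
        policy,
        acknowledged_by=lambda who, _timeout: who == "secondary-on-call",
        notify=notified.append,
    )
    print(handled_by, notified)
```

Keeping the delivery and acknowledgement steps as callbacks means the same policy can sit in front of any channel (phone, SMS, email) without changing the escalation logic itself.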
The fact that PagerDuty is a SaaS solution provides additional value to an organization. Not only is it one less application that the IT team has to manage (install, patch, monitor, etc.), but it is also not tied to the SLAs of the cloud infrastructure. Why is this important? If the cloud service provider has a service interruption and the alerting platform is sitting on that same infrastructure, the alerting service could be down as well. It is a smart strategy to run your monitoring and alerting services independently of the cloud infrastructure so that they can be relied upon when the infrastructure cannot be. In addition, it is becoming very common in today’s architectures to deploy multi-cloud solutions. A centralized alerting solution that can aggregate messages from many different applications across different clouds is a very effective way to manage systems across the enterprise.
PagerDuty has an impressive customer list that includes Heroku, Pinterest, Opscode, and Airbnb. In early 2013 they raised a $10.7 million round from Andreessen Horowitz, and they are a profitable company. With all of the data they are collecting, it will be interesting to see if they head down a big data path and start exploring machine learning features that provide proactive analytics to help companies discover alert patterns early. Companies with complex, heterogeneous cloud implementations would be wise to take a look at SaaS tools like PagerDuty to simplify and optimize operations.