Adobe: Building a Metrics and Monitoring System

What were the underlying reasons or business implications for the need to automate business processes?

As the business grew, business teams wanted to monitor processes effectively, be alerted to failures, and stay ahead of bottlenecks. Services governed by SLAs must meet them to avoid revenue impact or other financial losses. Effective monitoring needed to be in place to reduce the risk of failure and to support business continuity.

Describe your automation use case(s) and business process(es).

The metrics and monitoring implementation helped the teams understand the state of the infrastructure and services. A standard solution was implemented to collate metrics from the platform. At the service level, we are able to collect data at one-minute granularity to understand how trends change over a period.

Describe how you implemented your automations and processes.

A Service Exporter was implemented as a pull-based metric exporter: it reports metrics by responding to scrape requests. The exporter service is written in Python; it triggers the onboarded services and exposes the resulting data for Prometheus to scrape. The data in Prometheus is then used to create alerts and dashboards.
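
The pull model described above can be sketched with the Python standard library alone. This is a minimal illustration, not the exporter Adobe built: the metric names, the `collect_metrics` function, and port 8000 are all assumptions made for the example, and a production exporter would more likely use a Prometheus client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def collect_metrics():
    """Hypothetical collector; a real exporter would query the onboarded services here."""
    return {"service_failed_executions": 3, "service_delayed_executions": 1}


def render_metrics(metrics):
    """Render name/value pairs as gauges in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(metrics.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Answers scrape requests on /metrics; nothing is pushed to Prometheus."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(collect_metrics()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Port 8000 is an assumed value; Prometheus would be configured to scrape it.
    HTTPServer(("0.0.0.0", 8000), MetricsHandler).serve_forever()
```

Because the exporter only answers requests, the scrape interval configured in Prometheus determines how often the metrics are sampled.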

Technologies that helped us build the metrics and monitoring (M&M) system:

  1. SnapLogic as middleware
  2. Python to build the custom app
  3. Kubernetes to deploy the app
  4. Prometheus and Grafana dashboards to view the data
  5. Alertmanager
  6. Self-service configuration of alert thresholds
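
As a rough illustration of self-service thresholds, the sketch below compares current metric values against limits that a business team could edit. The JSON layout, the metric names, and the `breached` function are hypothetical; in the deployed system the thresholds would more plausibly feed Prometheus alerting rules than a Python check.

```python
import json

# Hypothetical team-editable configuration: metric name -> maximum allowed value.
THRESHOLDS_JSON = '{"delayed_executions": 5, "missed_schedules": 0}'


def breached(metrics, thresholds):
    """Return the sorted names of metrics whose value exceeds the configured limit."""
    return sorted(
        name for name, limit in thresholds.items()
        if metrics.get(name, 0) > limit
    )


if __name__ == "__main__":
    current = {"delayed_executions": 7, "missed_schedules": 0}
    print(breached(current, json.loads(THRESHOLDS_JSON)))
```

Keeping the limits in data rather than in code is what lets a team adjust an alert without redeploying the exporter.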

More than 50 metrics are collected to monitor the platform and the onboarded services. A few of the metrics that help the business monitor service progress and alert on it:

  1. Delayed Executions
  2. Missed Schedules
  3. Network issues
  4. Failures traced to a change in an asset
  5. Spikes in resource usage over a period
  6. Continuous failure of a service
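
Two of the metrics above, missed schedules and continuous failure, can be derived from an execution history. The sketch below is one possible way to compute them; the function names, the status strings, and the 60-second tolerance are assumptions for illustration, not details taken from the implementation.

```python
def missed_schedules(expected_times, actual_times, tolerance_s=60):
    """Return expected start times (epoch seconds) with no run within the tolerance."""
    return [
        expected for expected in expected_times
        if not any(abs(actual - expected) <= tolerance_s for actual in actual_times)
    ]


def consecutive_failures(statuses):
    """Count how many of the most recent runs failed in a row (oldest status first)."""
    count = 0
    for status in reversed(statuses):
        if status != "failed":
            break
        count += 1
    return count


if __name__ == "__main__":
    # Hourly schedule; the run expected at 3600 never started.
    print(missed_schedules([0, 3600, 7200], [5, 7230]))
    print(consecutive_failures(["ok", "failed", "failed"]))
```

Exposed as gauges, values like these let an alert fire on, for example, several failures in a row instead of on every individual failure.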

What were the business results after executing the strategy?

  1. The framework enables us to identify infrastructure-related issues during deployments.
  2. It made releases smoother by letting us check for failures before and after SnapLogic cloud and data releases.
  3. It lets us isolate the services that contribute to resource spikes.
  4. Business teams are alerted when a service fails and receive additional insight, including whether asset-level changes contributed to the issue or whether downstream systems were unavailable.
  5. Identifying issues and alerting on them enables the business to debug a problem before it grows.

Who was involved in building out the solution, and how?

Adobe Team

Anything else you would like to add?

Anomaly detection across the platform and the onboarded services would be a better solution if it were provided in-house.