LogRhythm and DevOps: Bringing It All Together

Interest in Development Operations (DevOps) within corporate IT has grown recently. Many tools, books, and experts sing the praises of the DevOps methodology. It can be difficult to pin down exactly what DevOps is and how it works, but if you invest the time and energy to understand DevOps methodologies and incorporate them into your organization, you’ll be better able to detect issues found in software and automate the response to them.

What is DevOps?

Understanding DevOps is the first step to successfully integrating it into your business plan. DevOps is a methodology built on collaboration and communication between software developers and information technology (IT) professionals. It focuses on automating software delivery and infrastructure changes. DevOps aims to ensure that software is developed rapidly and continues to run smoothly once deployed.

Supporting DevOps Functions

To help uncover some of the fundamental connections between development, automation, operations, and security, let’s take a deeper look into the types of problems that DevOps sets out to solve.

You likely have a tool within your security infrastructure that can be utilized to help support these DevOps functions. One such tool is LogRhythm. While LogRhythm can be used to detect standard operational issues such as account lockouts, firewall misconfigurations, SAN performance issues, and permissions issues, you can also apply these same techniques to the area of DevOps monitoring and alerting.

The following is an example of a company starting down the path of utilizing a logging and analytics tool—in this case LogRhythm—to support DevOps. To protect the privacy of those involved, names and identifying details have been changed.

DevOps and LogRhythm in Action

Acme Company is a software as a service (SaaS) provider of online financial services specific to item processing. The company has recently moved to a DevOps strategy to support the automation of core SaaS product development lifecycles. The development and IT operations teams have been asked to work together to automate both the on-demand provisioning of infrastructure and the development and deployment of the custom code base used across the various SaaS environments.

This change in IT strategy brings specific needs and demands. The newly-created DevOps team needs to be able to limit access to various servers to ensure operational and security hygiene while also allowing appropriate staff the ability to quickly query those environments’ log messages. The new code deployment automation process introduces specific requirements around assisting in the monitoring of and response to operational or code-related issues.

With these requirements and needs in mind, the following are three use cases in which Acme Company utilized the LogRhythm NextGen SIEM Platform in a DevOps application:

1. Detecting Bad Code Rolls

An issue Acme Company occasionally experiences is a “bad code roll” in their production environments. This could be due to a failure of their automation tool correctly deploying new code to their environment, or it could be that the code that was deployed had unexpected errors. Acme Company can use LogRhythm to trend general application errors through .NET logs, IIS logs, and MySQL logs between environments in order to more quickly identify servers or environments experiencing a higher-than-normal quantity of errors.

If errors are detected, Acme Company needs to be quickly alerted so that they can then pull the affected systems out of production and re-roll code as needed. The company can implement trending and alarming rules around this use case to significantly reduce the time it takes to detect problems that are a result of code or automation issues. This prevents the dreaded call from the help desk of “things are broken/slow!”

To help detect bad code rolls, LogRhythm has two pre-built features that make it quick to build out this use case.

The first feature is LogRhythm’s pre-built Machine Data Intelligence (MDI) fabric, which can bring in disparate log sources (in this case MySQL logs, IIS logs, and application logs from .NET, although any mix of technologies could be used) and apply processing and classification to those logs. In this example, once LogRhythm is configured to collect these log types, a classification such as “Critical,” “Error,” or “Warning” is automatically applied to each log received.

The second feature is the LogRhythm AI Engine’s ability to learn behavior and compare that learned behavior’s baseline to current activity. This use case aims to compare various Critical or Error messages from MySQL, IIS, or application data to a baseline, generating an alarm if there is a 2x increase in those errors.

Combining these two features, it’s relatively straightforward to set up the analytics rule for this use case:

Figure 1: Bad Code Rolls Use Case Analytics Rule

To set up this analytics rule, first baseline all log messages coming from the Application, IIS, or MySQL logs that have been classified with Critical or Error for a period of one hour. Once this baseline is learned, look for a 2x increase in the total number of Critical or Error messages when compared to the previous hour.
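The comparison described above can be sketched in a few lines of Python. This is only an illustration of the logic, not how the AI Engine is implemented: the function names, the `(timestamp, classification)` event shape, and the choice to stay silent when the previous hour has no errors are all assumptions made for this example.

```python
from collections import Counter
from datetime import timedelta

# Classifications that count toward the rule, as named in the text.
ALARM_CLASSIFICATIONS = {"Critical", "Error"}


def hourly_error_counts(events):
    """Count Critical/Error events per clock hour.

    events: iterable of (timestamp, classification) tuples (assumed shape).
    """
    counts = Counter()
    for timestamp, classification in events:
        if classification in ALARM_CLASSIFICATIONS:
            bucket = timestamp.replace(minute=0, second=0, microsecond=0)
            counts[bucket] += 1
    return counts


def should_alarm(counts, current_hour, factor=2.0):
    """Return True if current_hour has at least `factor` times the errors
    of the hour before it."""
    previous = counts.get(current_hour - timedelta(hours=1), 0)
    current = counts.get(current_hour, 0)
    if previous == 0:
        # Assumption: with no baseline to compare against, do not alarm.
        return False
    return current >= factor * previous
```

For instance, five errors between 09:00 and 10:00 followed by ten errors between 10:00 and 11:00 would make `should_alarm` return True for the 10:00 hour.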

2. Detecting Slow Loading Webpages

Another issue Acme Company often runs into is very slow page loads on their IIS servers due to application pools that consume all of their resources. There is a known issue with one of the modules Acme Company is using in that app pool, and the remediation is fairly simple: recycle the app pool. However, knowing when the app pool resources are running low or when web pages are rendering slowly is not something Acme Company could easily accomplish manually. They would like to bring the affected systems’ logs (in this case the IIS logs) into LogRhythm to quickly identify the set of pages that are loading slowly. The DevOps team would be notified when this issue is identified and could create a SmartResponse™ to recycle the app pool and return to service in seconds.

The first part of the solution is to have IIS log page load times, as by default this won’t be logged. To do so, simply enable “time taken” in IIS logging:

Figure 2: Enable “Time Taken” Function in IIS Logging

Next, set up log collection of these IIS logs into LogRhythm.

Figure 3: Sample of the Processed Logs After Being Collected

Notice that the “duration” metadata field automatically parses out the field we are looking for—in this case the page load time.
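To see where that value comes from in the raw data, here is a minimal sketch that reads the time-taken field from IIS W3C-format log lines. It is not LogRhythm’s parser. The sample lines are made up for illustration, and the sketch assumes field values contain no spaces. IIS records time-taken in milliseconds, so the sketch converts it to seconds.

```python
def parse_w3c_log(lines):
    """Yield one dict per request, keyed by the names in the #Fields: directive."""
    fields = []
    for line in lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            # The directive lists the column names for the entries that follow.
            fields = line[len("#Fields:"):].split()
        elif line and not line.startswith("#") and fields:
            values = line.split()
            if len(values) == len(fields):
                yield dict(zip(fields, values))


def page_load_seconds(record):
    """Convert the time-taken value (milliseconds) to seconds."""
    return int(record["time-taken"]) / 1000.0


# Illustrative sample only; real logs usually carry many more fields.
SAMPLE = [
    "#Software: Microsoft Internet Information Services 10.0",
    "#Fields: date time cs-method cs-uri-stem sc-status time-taken",
    "2024-01-01 09:00:01 GET /reports/summary.aspx 200 3420",
    "2024-01-01 09:00:02 GET /index.aspx 200 85",
]
```

Calling `page_load_seconds` on the first parsed record of `SAMPLE` gives 3.42 seconds, which would count as a slow page under the rule described next.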

Once the necessary data is being collected, you can set up an analytics rule to search for instances of slow page load times. In this specific example, look for an indication that more than five pages have taken more than three seconds to load, all occurring within three minutes. This allows you to filter out the noise of a few slow pages over time, and instead focus on high probability issues:
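The thresholds in this rule can be expressed as a simple sliding-window check. The sketch below is one possible reading of the rule, not the AI Engine’s implementation. The function name, the `(timestamp, duration_seconds)` input shape, and the decision to reset the window after an alarm are assumptions for this example.

```python
from collections import deque
from datetime import timedelta


def find_slow_page_alarms(requests, slow_seconds=3.0, min_slow_pages=5,
                          window=timedelta(minutes=3)):
    """Return the timestamps at which an alarm would fire.

    requests: iterable of (timestamp, duration_seconds), sorted by timestamp.
    An alarm fires when more than `min_slow_pages` requests slower than
    `slow_seconds` fall within `window`.
    """
    recent = deque()  # timestamps of slow requests still inside the window
    alarms = []
    for timestamp, duration in requests:
        if duration <= slow_seconds:
            continue  # fast pages are ignored entirely
        recent.append(timestamp)
        # Drop slow requests that have aged out of the window.
        while timestamp - recent[0] > window:
            recent.popleft()
        if len(recent) > min_slow_pages:
            alarms.append(timestamp)
            recent.clear()  # assumption: one alarm per burst
    return alarms
```

Six slow requests spaced ten seconds apart trigger one alarm, while six slow requests spaced two minutes apart trigger none. That is the noise filtering the rule is meant to provide.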

Figure 4: Designing an AI Engine Rule to Identify Slow Loading Pages

You can easily tie a SmartResponse to this alarm to help remediate the issue quickly. Creating a custom SmartResponse is straightforward. In this case, you simply need to run the command “appcmd recycle apppool /apppool.name:APPPOOLNAME” to recycle the app pool and return performance to normal levels.
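As a rough sketch, a script that a response action could invoke might wrap that command as follows. This does not show how SmartResponse plugins are packaged. The function names and the app pool name validation pattern are assumptions for this example, and the script would have to run on the IIS server with administrative rights.

```python
import os
import re
import subprocess


def build_recycle_command(app_pool_name):
    """Build the appcmd argument list for recycling one app pool."""
    # Assumption: restrict names to a conservative character set so that
    # unexpected input never reaches the command line.
    if not re.fullmatch(r"[A-Za-z0-9 ._-]+", app_pool_name):
        raise ValueError(f"unexpected app pool name: {app_pool_name!r}")
    windir = os.environ.get("WINDIR", r"C:\Windows")
    # appcmd.exe lives under %windir%\System32\inetsrv on an IIS server.
    appcmd = os.path.join(windir, "System32", "inetsrv", "appcmd.exe")
    return [appcmd, "recycle", "apppool", f"/apppool.name:{app_pool_name}"]


def recycle_app_pool(app_pool_name):
    """Run the recycle command; return True if appcmd exits with code 0."""
    result = subprocess.run(
        build_recycle_command(app_pool_name),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.returncode == 0
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when an app pool name contains spaces.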

3. Delegating Production Log Access to Internal Teams

One of the features that the Acme custom software provides is an e-mail notification to any SaaS customer who completes a process in their custom workflow. This feature is required by law, and if asked by auditors or the internal support team, Acme Company has to be able to quickly prove that they did or did not send said notifications to a particular e-mail address. The logs that provide this level of visibility are in a custom format, because Acme Company develops the software that creates them internally. Acme would like to give their development team access to search this log data quickly and easily without providing direct access to the production servers.

Again, there are two pre-built LogRhythm features that can be utilized to satisfy this use case.

LogRhythm’s role-based access control functionality allows Acme Company to delegate access to the LogRhythm solution in such a way that the development staff responsible for this information can run searches safely while limiting their access to only those logs required for this search function.

As the logs in question are in a custom format, there is no pre-built parser that can normalize the log data for quick structured searches, such as matching an e-mail address field. However, LogRhythm can perform a quick search across non-structured data. Acme Company can easily set LogRhythm to collect these custom logs, then quickly search those logs and return the appropriate results indicating success or failure of notification.
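Conceptually, an unstructured search is a text match over raw log lines rather than a lookup on a parsed field. The sketch below illustrates the idea only. The log format, the `status=SENT` and `status=FAILED` tokens, and the function names are hypothetical, since the article does not describe Acme’s actual format or LogRhythm’s search internals.

```python
def search_logs(lines, *terms):
    """Return the lines that contain every term, ignoring case."""
    needles = [term.lower() for term in terms]
    return [line for line in lines
            if all(needle in line.lower() for needle in needles)]


def notification_results(lines, email):
    """Split the lines mentioning `email` into (sent, failed) lists."""
    hits = search_logs(lines, email)
    sent = [hit for hit in hits if "status=sent" in hit.lower()]
    failed = [hit for hit in hits if "status=failed" in hit.lower()]
    return sent, failed


# Hypothetical custom log lines, invented for this example.
SAMPLE_LINES = [
    "2024-01-01 09:00:01 NOTIFY status=SENT to=jane@example.com workflow=1234",
    "2024-01-01 09:05:17 NOTIFY status=FAILED to=bob@example.com workflow=1235",
    "2024-01-01 09:06:02 WORKFLOW id=1236 step=review",
]
```

Because this is a plain substring match, a search for one address also matches any longer address that contains it. That is the usual trade-off of searching unparsed text.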

The Benefits of DevOps

Maintaining the segregation of duties while giving people access to the tools they need is a main component of Acme Company’s DevOps strategy, and with little effort these logs can be onboarded and made available for a quick search by the DevOps team.

As you can see in the examples above, by utilizing existing log collection, analytics, and remediation techniques in LogRhythm, specific use cases can be quickly and easily crafted to support the DevOps function. And just as the world of DevOps is focused on continuous improvement and iterative feedback, these use cases can be easily tweaked and additional use cases added to fit your security needs.