This is a guest blog post by Trevor Parsons. Trevor is Co-founder & Chief Scientist at Logentries, a log management and analytics service built for the cloud, making business insights from machine-generated log data easily accessible to development, IT and business operations teams of all sizes. Logentries is designed for managing huge amounts of data and visualizing insights that matter.
Trevor takes it from here: A huge part of the role of the Ops team is to make sure that their systems are in good shape and that nothing problematic is going on.
The traditional Ops toolkit often involves a range of tools that help monitor trends in KPIs as well as tools that perform different periodic health checks on key components of your systems.
However, with the rise of everything cloud-based, the demand for log management and analytics technologies has grown significantly. Logs enable Ops teams to manage the machine data generated by cloud-based infrastructures, and to perform critically important analysis as well as many of the typical checks your Ops team needs to carry out.
Here are 5 reasons why we think logs are a critical, complementary tool in the modern Ops toolkit:
1 Is my System Clean: Exception Monitoring, Error Tracking, Error History
While developers largely focus on using error tracking tools to eradicate (as best they can) any errors or exceptions in their code, as an Ops person you also want to be able to track any exceptions that creep into your system once the code goes live. Sometimes these only occur in certain edge cases or under high load, so they may not always be caught by developers or during testing.
Logs can play a vital role here, as any high severity issues can (and should) be recorded in them. Logs give you, as an Ops team, a great way to be able to assess 'is my system clean?' and free from errors or exceptions.
You can periodically review your logs to have confidence that your system is indeed clean, or if you want to be more proactive you can use your logging service to set up real-time alerts on such issues so that you get notified immediately when they occur.
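As a rough illustration of the "is my system clean?" check, here is a minimal Python sketch. The log format (timestamp, level, message) and the list of severity keywords are assumptions made for the example, not something your logging service prescribes.

```python
import re

# Assumed severity markers; adjust to whatever your applications actually log.
SEVERITY = re.compile(r"\b(ERROR|FATAL|CRITICAL)\b")


def find_errors(lines):
    """Return the log lines that contain a high-severity marker."""
    return [line for line in lines if SEVERITY.search(line)]


def is_clean(lines):
    """A system is 'clean' here if no high-severity lines were logged."""
    return not find_errors(lines)


# Hypothetical sample data in an assumed "<timestamp> <LEVEL> <message>" format.
sample = [
    "2014-03-01T10:00:00Z INFO request served in 12ms",
    "2014-03-01T10:00:01Z ERROR NullPointerException in OrderService",
    "2014-03-01T10:00:02Z WARN cache miss ratio high",
]

for line in find_errors(sample):
    print("ALERT:", line)
```

Running the same check on a schedule gives you the periodic review; running it on each line as it arrives is the basis of a real-time alert.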
2 Get Notified on Threshold Breaches
Exceptions in the logs are often one of the most obvious ways to spot when something undesirable is occurring in your system. However, sometimes more subtle issues occur that can denote a serious problem or be a symptom of something more drastic. These types of issues may not manifest themselves in the form of an error, so they can sometimes be harder to spot.
Setting thresholds allows you to get notified when your system is breaching important system bounds. Some typical examples of thresholds worth tracking and setting alerts on include:
Server request response time: this relates to what you deem as a reasonable amount of time for a request to be returned from the server. The 3 important response time limits are often a good rule of thumb in terms of how long a person will reasonably wait for a page to load for example. Server response time will only make up a percentage of this, so you should bear that in mind.
Client side response time: if you log events from the client (e.g. mobile app or web browser) as well as server request times, you can quickly work out how much of the total time it takes for a page to load is due to the page rendering, the network or your back end processing.
Job running time: if jobs suddenly start taking 10X longer to run you would generally want to know about this, as it may indicate some other load on the system is consuming resources or the nature of your jobs may have changed significantly. Get to know how long a job should take and set thresholds so that you are notified when the total job running time exceeds these significantly.
Server resource usage: if your disk capacity is approaching 95% full and heading towards 100%, it may signify something bad is about to occur… i.e. you are out of disk. The same can be said for resources like CPU and memory. Get to know the reasonable resource requirements of your system and set threshold notifications so that you are alerted when these are breached.
Key application performance metrics: internal application metrics can give you a more fine grained understanding of what is going on in your system vs. server resource usage data. Usually you will need to define these with your development team so that you can understand some of the key data points that relate to the health and performance of your system. For example within Logentries, we make sure to log and track a number of specific queue sizes that our engineers deem as very important measures for our system, as we may need to take action if these grow above a particular threshold.
Note that all of the above can be tracked using your logs, since today’s logging technologies allow you to log from both the server and the client side, and provide the ability to pick out important fields in your log data and to track and analyze these over time (e.g. server monitoring info, application performance metrics, job metrics etc.).
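To make the idea of picking out fields and checking them against thresholds concrete, here is a small Python sketch. It assumes log lines carry numeric key=value pairs; the field names (response_ms, disk_pct, queue_size) and the limits are hypothetical and only for illustration.

```python
import re

# Hypothetical thresholds; choose values that match your own system's bounds.
THRESHOLDS = {"response_ms": 3000.0, "disk_pct": 95.0, "queue_size": 10000.0}

# Matches numeric key=value pairs such as "response_ms=4200" or "disk_pct=71.5".
FIELD = re.compile(r"(\w+)=([0-9]+(?:\.[0-9]+)?)")


def parse_fields(line):
    """Extract numeric key=value pairs from a single log line."""
    return {key: float(value) for key, value in FIELD.findall(line)}


def breaches(line, thresholds=THRESHOLDS):
    """Return the fields in this line whose value exceeds its threshold."""
    fields = parse_fields(line)
    return {
        key: value
        for key, value in fields.items()
        if key in thresholds and value > thresholds[key]
    }


line = "host=web1 response_ms=4200 disk_pct=71.5"
print(breaches(line))
```

In this example only the response time is over its limit, so it is the only field reported. A hosted logging service would typically do the parsing and alerting for you; the sketch just shows the logic involved.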
3 Spot Shifts in System Behavior
A less obvious issue is when there have been neither exceptions nor threshold breaches in your system BUT there has been a sudden shift in system behavior.
These can be more subtle and difficult to spot and require the ability to specify a baseline of ‘normal’ behavior, as well as a deviation from the norm.
Log management solutions that provide log-based anomaly detection allow you to be proactively notified when there has been a sudden change like this. Think of a response time example where you might deem it unacceptable if response time were to breach 3 seconds. Anomaly detection allows you to identify a situation where no threshold breach has triggered, but response time may have gone from an average of 10ms per request to 1 second per request over the course of an hour. Again, this sudden shift in behavior is likely something you and the rest of the Ops team would like to know about.
Getting notified when these types of trends start to appear allows you to react proactively before thresholds are actually breached, so you can start to rectify the situation before your customers start to complain.
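One simple way to think about this kind of check is to compare recent values against a baseline and flag large deviations. The Python sketch below uses a basic mean and standard deviation test; this is an illustrative technique chosen for the example, not a description of how any particular log management product implements anomaly detection, and the sample numbers are made up.

```python
from statistics import mean, stdev


def is_anomalous(baseline, recent, max_sigma=3.0):
    """Flag `recent` if its mean is more than `max_sigma` baseline
    standard deviations away from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any change at all counts as a shift.
        return mean(recent) != mu
    return abs(mean(recent) - mu) / sigma > max_sigma


# Hypothetical response times in milliseconds.
baseline_ms = [9, 10, 11, 10, 9, 11, 10, 10]  # 'normal' behavior, around 10ms
recent_ms = [950, 1010, 1040]                 # the last few requests, around 1s
HARD_LIMIT_MS = 3000                          # the 3 second threshold

print("threshold breached:", max(recent_ms) > HARD_LIMIT_MS)
print("behavior shifted:", is_anomalous(baseline_ms, recent_ms))
```

Here the 3 second threshold is never breached, yet the shift from roughly 10ms to roughly 1 second is flagged, which is exactly the situation described above.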
4 Heartbeat Checks
Sometimes you really want to know when something DOESN’T happen. For example, maybe your servers have stopped sending logs… or an important job hasn’t run for over 24 hours… or nobody has signed up in the past 6 hours (where you normally have 100s of signups per day)…
In the above situations, there may not necessarily be any exceptions reported, or thresholds breached… but you certainly would like to know about these occurrences… or lack thereof.
Heartbeat checks or inactivity alerts allow you to check for heartbeats all across your system. You can configure them for coarse grained notifications such as if your server is up or down and if it has sent any log events in the past 2 minutes, or for more fine-grained checks to understand if specific jobs are running as expected.
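An inactivity check boils down to remembering when each source was last heard from and comparing that against how long it is allowed to stay silent. Here is a minimal Python sketch; the source names and silence windows are hypothetical and simply mirror the examples above.

```python
from datetime import datetime, timedelta


def missing_heartbeats(last_seen, now, max_silence):
    """Return the sources (sorted by name) whose most recent event is
    older than the silence window allowed for that source."""
    return sorted(
        source
        for source, seen in last_seen.items()
        if now - seen > max_silence[source]
    )


now = datetime(2014, 3, 1, 12, 0, 0)

# When each (hypothetical) source last produced a log event.
last_seen = {
    "web-1": now - timedelta(seconds=30),
    "nightly-backup": now - timedelta(hours=30),
    "signups": now - timedelta(hours=2),
}

# How long each source may stay quiet before you want to be told.
max_silence = {
    "web-1": timedelta(minutes=2),
    "nightly-backup": timedelta(hours=24),
    "signups": timedelta(hours=6),
}

print(missing_heartbeats(last_seen, now, max_silence))
```

With this data only the backup job is reported, since it has been silent for 30 hours against a 24 hour allowance.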
Finally, in my opinion, the most powerful aspect of using logs as a key component of your Ops toolkit is the ability they give you to correlate data from a range of sources:
from different components from across your system all the way from the client side through your middleware components all the way to your DBs
from every layer of your software stack from your application layer all the way down to the OS level
from traditional logs, from APIs and from agents that collect server monitoring and application metrics
Having a single location to view and correlate all this data provides for very efficient analysis when debugging any of the issues identified from 1 to 4 above.
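To give a feel for what correlating sources means in practice, the Python sketch below merges events from three hypothetical sources (client, app and db) into one timeline ordered by timestamp. It assumes each stream is already sorted and uses the same ISO 8601 timestamp format, so that plain string comparison orders events correctly; the events themselves are invented for the example.

```python
import heapq


def correlate(*streams):
    """Merge already time-sorted streams of (timestamp, source, message)
    tuples into a single timeline ordered by timestamp."""
    return list(heapq.merge(*streams, key=lambda event: event[0]))


# Hypothetical events from three layers of the same request.
client = [
    ("2014-03-01T10:00:00.100Z", "client", "page load started"),
    ("2014-03-01T10:00:03.900Z", "client", "page load finished"),
]
app = [
    ("2014-03-01T10:00:00.250Z", "app", "GET /orders received"),
    ("2014-03-01T10:00:03.700Z", "app", "GET /orders completed"),
]
db = [
    ("2014-03-01T10:00:00.300Z", "db", "slow query: 3350ms"),
]

for timestamp, source, message in correlate(client, app, db):
    print(timestamp, source, message)
```

Read top to bottom, the merged timeline makes it easy to see that most of the slow page load was spent in the database query rather than on the client or in the application layer.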
We want to thank Trevor for explaining the power of logs to our Codeship readers. They definitely give great insight into your system. If you are interested in giving Logentries a try you can sign up here for a free trial! Of course Trevor is available for any questions you might have in our comments or on Twitter.