The Importance of Data
In my previous Journey to the AI-Enabled SOC blog, I mentioned the three key ingredients required to unlock the potential of artificial intelligence (AI) to improve threat detection and transform how enterprises realize threat lifecycle management (TLM): data, domain, and data science. While we certainly believe nobody understands the security analytics domain better than us and our world-class data science team, there is one key ingredient that truly sets LogRhythm apart: our data!
We have spent the past decade developing a highly curated view of machine data (e.g., logs) through our patented data processing technology. We call this view our Machine Data Intelligence (MDI) Fabric and it uniquely empowers our platform’s end-to-end TLM feature set. Over a decade ago, we developed much of our underlying MDI processing technology in support of our long-term analytics vision. I recall an early analytics architecture conversation in which my LogRhythm co-founder, Phil Villella, gave a concise summary: “garbage in, garbage out.”
Since that time, we have invested considerable time and energy evolving our MDI processing technology while also developing the industry’s leading knowledge base on the structure and context of machine data. As a company, LogRhythm has deep MDI comprehension for over 800 different technologies. Our MDI has been and will continue to be a data architecture advantage over our competition. It will accelerate our AI innovation and ultimately improve our AI-driven outcomes, just as it has for our recently announced CloudAI-enabled user and entity behavior analytics (UEBA) offering.
In the remainder of this blog, I’m going to share some of the unique characteristics of our MDI. These characteristics provide a uniquely enriched view of machine data that increases the power and accuracy of our search features, while also enabling LogRhythm to describe and detect highly complex threat scenarios with accuracy. These data characteristics are also critically important to developing meaningful behavioral baselines of activity and determining the true threat relevancy of observed behavioral anomalies.
Uniform Data Schema
Our MDI starts with a standard schema applied to all processed data. This schema provides a uniform view of machine data across the systems, applications, and devices that comprise an enterprise’s security, IT, and OT infrastructure. This view allows for consistent analytics across our global customer base by creating an abstraction layer between the specific technologies (e.g., Check Point vs. Palo Alto Networks, Windows vs. Linux, Exchange vs. sendmail) of the underlying security, IT, and OT infrastructure and our analytics technologies. When analyzing data, a dropped packet is a dropped packet, a failed login is a failed login, and an email received is an email received, no matter the technology type.
Our uniform MDI view is ideal for AI/ML data science and allows our LogRhythm Labs team to develop scenario-based threat models that can be quickly and reliably deployed within any customer environment. This data schema is populated by parsing data into contextually aware fields and by extensive post-parsing data enrichment. While consistent parsing is certainly important, our data enrichment capabilities truly set our MDI apart. These capabilities are covered in depth in the remainder of this blog.
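To make the abstraction layer idea concrete, here is a minimal Python sketch of two vendor-specific records being mapped onto one common schema. It is not our actual schema or parser: the field names, the `NormalizedEvent` type, and the two "firewall" formats are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class NormalizedEvent:
    """Hypothetical uniform schema shared by every log source."""
    event_type: str   # e.g. "dropped_packet" or "allowed_traffic"
    source_ip: str
    dest_ip: str
    vendor: str


def from_firewall_a(raw: dict) -> NormalizedEvent:
    # Invented format A: uses "action", "src", and "dst" keys.
    event_type = "dropped_packet" if raw["action"] == "drop" else "allowed_traffic"
    return NormalizedEvent(event_type, raw["src"], raw["dst"], "firewall_a")


def from_firewall_b(raw: dict) -> NormalizedEvent:
    # Invented format B: same meaning, different keys and values.
    event_type = "dropped_packet" if raw["disposition"] == "DENY" else "allowed_traffic"
    return NormalizedEvent(
        event_type, raw["source_address"], raw["destination_address"], "firewall_b"
    )


def count_dropped(events: Iterable[NormalizedEvent]) -> int:
    # Analytics are written once against the schema, not once per vendor.
    return sum(1 for e in events if e.event_type == "dropped_packet")
```

The point of the sketch is the last function: once both sources land in the same shape, the analytic never needs to know which technology produced the log.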
Common Classification
All processed machine data is uniformly classified. Our classification model is divided into three specific domains: Security, Audit, and Operations. Within these domains, data is organized within a normalized classification structure. For instance, within Security, we might classify logs as pertaining to reconnaissance activity or a suspected compromise. Within Audit, examples include successful/failed authentications and access grants. Within Operations, examples include errors, warnings, and allowed network traffic. Our common classification structure provides a consistent, high-level view of all processed machine data that increases analytics accuracy and opportunity.
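As a rough illustration, the three domains and the example classifications named above could be modeled as follows. This is only a sketch: the identifiers are made up for this post and the real classification structure is considerably larger.

```python
from enum import Enum


class Domain(Enum):
    SECURITY = "Security"
    AUDIT = "Audit"
    OPERATIONS = "Operations"


# Hypothetical classification names, drawn from the examples in the text.
CLASSIFICATIONS = {
    "reconnaissance": Domain.SECURITY,
    "suspected_compromise": Domain.SECURITY,
    "authentication_success": Domain.AUDIT,
    "authentication_failure": Domain.AUDIT,
    "access_granted": Domain.AUDIT,
    "error": Domain.OPERATIONS,
    "warning": Domain.OPERATIONS,
    "network_allow": Domain.OPERATIONS,
}


def domain_of(classification: str) -> Domain:
    """Return the domain a classification belongs to (KeyError if unknown)."""
    return CLASSIFICATIONS[classification]
```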
User Context
When parsing user credentials from logs, we assign one of two primary data contexts: Was the user impacted by activity, or was the user originating some activity? For example, take a log message reporting access being granted. In that log, there are two different user credentials present. We parse both values and assign context to each value, so our analytics can differentiate the user responsible for assigning the new permissions from the user to whom the new permissions were granted.
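Here is a small sketch of what assigning origin versus impacted context might look like for an access-granted log. The log line format, the regular expression, and the field names are all hypothetical; real log formats vary widely by technology.

```python
import re
from typing import Dict, Optional

# Invented log format: "access granted: 'alice' added 'bob' to group 'Finance'"
ACCESS_GRANT = re.compile(
    r"'(?P<origin>[^']+)' added '(?P<impacted>[^']+)' to group '(?P<group>[^']+)'"
)


def parse_access_grant(line: str) -> Optional[Dict[str, str]]:
    """Return both users with their context, or None if the line does not match."""
    match = ACCESS_GRANT.search(line)
    if match is None:
        return None
    return {
        "origin_user": match.group("origin"),      # user who performed the action
        "impacted_user": match.group("impacted"),  # user the action was applied to
        "group": match.group("group"),
    }
```

With the two users held in separate fields, a rule such as "alert when an impacted user is added to a privileged group" can no longer fire on the administrator who made the change.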
TrueIdentity
In addition to determining the context for parsed user credentials, we further enrich processed logs with an identity. We call this data enrichment feature TrueIdentity. A TrueIdentity represents the higher-level construct of the actual individual. Take me for instance. My identity is Chris Petersen and related to my identity are many possible identifiers: my corporate AD account, my personal Dropbox account, my work email, my personal email, phone numbers, and so forth. These identifiers are found throughout log data, but unfortunately, my true identity is not.
Our MDI processing addresses this issue by intelligently resolving identifiers to a TrueIdentity. This enables us to analyze data at the identity level rather than the identifier level. Resolved TrueIdentities are also assigned the same context as parsed user accounts. If we take the above example log post-MDI processing, we now know the TrueIdentity of the person who assigned the permissions and the TrueIdentity of the person to whom the permissions were assigned. TrueIdentity is critical for enabling accurate cross-device scenario analytics and deep behavioral profiling of users in support of UEBA.
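In its simplest form, resolving identifiers to an identity is a many-to-one lookup. The sketch below assumes a hand-built table and a made-up person with made-up identifiers; it does not describe how TrueIdentity actually performs resolution.

```python
from typing import Dict, Optional

# Hypothetical identifier table: many identifiers map to one identity.
IDENTIFIER_TO_IDENTITY: Dict[str, str] = {
    "corp\\jdoe": "Jane Doe",
    "jane.doe@example.com": "Jane Doe",
    "jd.personal@example.net": "Jane Doe",
}


def resolve_true_identity(identifier: str) -> Optional[str]:
    """Map an account name or email address to an identity, ignoring case."""
    return IDENTIFIER_TO_IDENTITY.get(identifier.strip().lower())
```

Activity logged under the AD account and activity logged under the personal email now roll up to the same person, which is what makes identity-level baselining possible.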
Host Context
When parsing host identifiers such as IP addresses, hostnames, MAC addresses, and so on, it is critically important to understand the data context. Is a parsed IP indicative of an attacker or a target? Is a parsed hostname a client or a server? This context is assigned to all parsed host identifiers, enabling analytics in the context of threat relevancy, host role, traffic direction, etc.
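A toy version of this context assignment is sketched below. It assumes the parsed source host initiated the activity and that an upstream step has already flagged whether the log describes an attack; both are simplifying assumptions, and the `HostRef` type is invented for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class HostRef:
    identifier: str  # IP address, hostname, or MAC address as parsed
    role: str        # "client", "server", "attacker", or "target"


def assign_host_context(src: str, dst: str, is_attack: bool) -> Tuple[HostRef, HostRef]:
    """Tag the initiating and receiving host identifiers with a role."""
    if is_attack:
        return HostRef(src, "attacker"), HostRef(dst, "target")
    return HostRef(src, "client"), HostRef(dst, "server")
```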
TrueHost
In addition to determining the context for parsed host identifiers, we further enrich processed logs with a TrueHost. A TrueHost represents the higher-level construct of the actual server, endpoint, device, and so forth. Consider my laptop. While my laptop’s hostname and MAC address might be constant, it is assigned multiple different IP addresses every day. These host identifiers are found throughout its log data. Logs might contain my laptop’s hostname, MAC address, or one of the various IP addresses it has been assigned. While these logs have a host identifier, what they lack is a reference to the true host.
LogRhythm’s MDI reconciles multiple host identifiers by intelligently resolving identifiers to a TrueHost. This enables the LogRhythm platform to analyze data at the level of the actual server, endpoint, or device rather than the identifier. Resolved TrueHosts are also assigned the same context as parsed IPs, hostnames, etc. TrueHost is critically important when trying to accurately model threat scenarios across disparate sources of log data and enables deeper and more accurate behavioral profiling in support of network traffic and behavioral analytics (NTBA).
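The laptop example hints at why host resolution has to be time-aware: the same IP address can belong to different machines at different times. The sketch below assumes a table of DHCP-style lease intervals is available. The `Lease` type, the host names, and the lease data are hypothetical, and this is one possible approach rather than a description of how TrueHost works.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class Lease:
    ip: str
    host: str        # hypothetical stable host name
    start: datetime  # inclusive
    end: datetime    # exclusive


def resolve_true_host(ip: str, seen_at: datetime, leases: List[Lease]) -> Optional[str]:
    """Return the host that held `ip` at `seen_at`, or None if no lease covers it."""
    for lease in leases:
        if lease.ip == ip and lease.start <= seen_at < lease.end:
            return lease.host
    return None


# The same IP is handed to two different laptops on the same day.
LEASES = [
    Lease("10.0.0.5", "laptop-a", datetime(2018, 3, 1, 8, 0), datetime(2018, 3, 1, 12, 0)),
    Lease("10.0.0.5", "laptop-b", datetime(2018, 3, 1, 12, 0), datetime(2018, 3, 1, 18, 0)),
]
```

Without the time dimension, activity from both laptops would be merged into a single behavioral profile for 10.0.0.5.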
TrueGeo
For all IP addresses and resolved TrueHosts, we try to determine the physical location of the endpoint, server, device, etc., down to city-level resolution. Similar to parsed host identifiers (e.g., IP addresses) and resolved TrueHosts, each TrueGeo is also assigned client/server or attacker/target context.
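Conceptually, city-level geolocation is a lookup from address ranges to places. The sketch below uses Python's standard `ipaddress` module with a tiny hand-written table. The ranges are reserved documentation networks and the cities attached to them are made up; a real deployment would rely on a maintained geolocation data source.

```python
import ipaddress
from typing import Optional

# Hypothetical range-to-city table built on documentation-only networks.
GEO_TABLE = [
    (ipaddress.ip_network("192.0.2.0/24"), "Boulder, US"),
    (ipaddress.ip_network("198.51.100.0/24"), "London, GB"),
]


def resolve_geo(ip: str) -> Optional[str]:
    """Return the city for an IP address, or None if no range matches."""
    addr = ipaddress.ip_address(ip)
    for network, city in GEO_TABLE:
        if addr in network:
            return city
    return None
```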
TrueApp
For certain types of log messages, it is incredibly helpful to have a consistent view of the application involved in whatever activity or issue is being reported. However, there is no universally consistent way by which applications are expressed in log data. Fortunately, TrueApp takes care of this. TrueApp leverages parsed data, such as ports, protocols, and intelligence embedded in our knowledge base, to automatically assign a consistent TrueApp (e.g., SSH, FTP, Dropbox, etc.) to relevant log messages. Our NetMon product also reports layer 7 network sessions with TrueApp context for over 3,200 applications. TrueApp allows us to correlate and analyze activity across network data, system, application, and audit logs for application-aware threat scenario modeling and behavioral profiling.
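A stripped-down version of this idea is sketched below: well-known ports identify some applications, and a destination domain (standing in here for richer knowledge-base intelligence) identifies others. The tables, the function name, and the precedence rule are assumptions made for illustration only.

```python
from typing import Optional

# Well-known port assignments for the protocols named in the text.
PORT_APPS = {("tcp", 22): "SSH", ("tcp", 21): "FTP"}

# Hypothetical knowledge-base entry: a domain suffix that implies an application.
DOMAIN_APPS = {"dropbox.com": "Dropbox"}


def resolve_app(protocol: str, port: int, domain: Optional[str] = None) -> Optional[str]:
    """Assign a consistent application name; a domain match wins over the port."""
    if domain:
        name = domain.lower()
        for suffix, app in DOMAIN_APPS.items():
            if name == suffix or name.endswith("." + suffix):
                return app
    return PORT_APPS.get((protocol.lower(), port))
```

The domain check comes first because many applications share port 443, so the port alone says little about which application is in use.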
TrueTime
Last, but far from least, LogRhythm goes to great measures to assign a TrueTime to every log message. A log message’s TrueTime is our best possible determination of the actual time it was originally written. TrueTime is recorded in Coordinated Universal Time (UTC) down to millisecond resolution. The various means by which we accomplish this are worthy of a blog post of their own. Hopefully you’ll trust me when I tell you that, at LogRhythm, we take accurate time representation very seriously. After all, inaccurate time representation can result in a threat scenario being missed (e.g., time/sequence sensitive correlated activity) and can effectively corrupt behavioral profiles. TrueTime is critical to our mission of helping customers accurately and reliably detect threats before risk is realized.
I hope this blog post has helped you understand our MDI capabilities. We have invested heavily in our data processing technology and our MDI knowledge base because we firmly believe analytics opportunity and accuracy are unleashed via a cleaner and richer data set. Our investments in MDI have benefited LogRhythm customers for years and we are excited to see their benefits further unfold as our data scientists leverage our MDI advantage to accelerate our journey to the AI-enabled SOC.
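Two of the simpler ingredients of time normalization can be shown in a few lines: converting a timestamp with a local UTC offset into UTC, and subtracting a known clock skew. The sketch below assumes ISO 8601 input and a skew value measured elsewhere; the function name and parameters are hypothetical, and it covers only a small part of what accurate time determination involves.

```python
from datetime import datetime, timedelta, timezone


def to_true_time(raw_timestamp: str, clock_skew: timedelta = timedelta(0)) -> str:
    """Normalize an ISO 8601 timestamp with a UTC offset to UTC milliseconds.

    `clock_skew` is how far ahead the logging device's clock is assumed to run.
    """
    local = datetime.fromisoformat(raw_timestamp)
    if local.tzinfo is None:
        raise ValueError("timestamp must carry a UTC offset")
    corrected = local.astimezone(timezone.utc) - clock_skew
    return corrected.isoformat(timespec="milliseconds")


# Two devices in different time zones report the same instant.
print(to_true_time("2018-03-01T09:15:30.123-07:00"))
print(to_true_time("2018-03-01T17:15:30.123+01:00"))
```

Both calls yield the same UTC value, so events from the two devices sort correctly against each other when sequence matters.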