Engineering a SIEM part 3: Creating cost-effective, scalable detections

Published

Jul 31, 2024

Previously, we explored the foundational elements of building a Security Information and Event Management (SIEM) system, from defining essential requirements for an effective SIEM to adopting a data lakehouse approach for log ingestion. In this blog post, the third part of this series, we’ll delve into creating detections and streamlining the alert flow. 

Essential components for a robust detection framework

Before we dive into the technical aspects of detection creation and its architecture atop the lakehouse, let’s outline the requirements for the detection engine we aimed to develop. 

1. Consistent format and standardization

While developing security detections, we needed them to meet a set of standardized criteria and undergo automatic validation before production deployment. Essential attributes for each detection definition had to be identified, ensuring unique identification and effective categorization of generated alerts for efficient management and reporting. It was crucial to avoid discrepancies between detection logic and documentation. Additionally, detection definitions required comprehensive details on managing the alerts they generated.

2. Scalability

Our detection engine needed to be scalable from the outset, capable of handling hundreds of detections without significantly increasing costs. Adding a new detection should only lead to a nominal expense increase. In other words, expanding our detection capabilities should not necessitate a trade-off between cost and security.

3. Efficient alert handling and reporting

To streamline alert handling and engage the entire company, we needed a system that integrates seamlessly with our company-wide ticketing and communication tools, like Jira and Slack. This system needed to automate user feedback and accelerate alert resolution. It also needed to support dynamic deduplication of alerts using information within the alert body and enable contextual linking to reduce alert frequency.

Furthermore, we needed to be able to create clear, comprehensive dashboards displaying data from our lakehouse, integrated with our existing company-wide business intelligence platform. Integrating with the company's BI tools ensures that all stakeholders have a unified view of the data, promoting consistency in reporting and decision-making. It also leverages existing infrastructure and user familiarity, reducing the learning curve and improving the adoption of new insights derived from our SIEM data. This integration allows for more efficient data analysis and better resource allocation across the organization.

4. Efficient alert routing

It is crucial that alerts be directed to the appropriate team based on their specific content and the context needed for resolution. This necessitates a system that supports flexible routing to prevent the detection and response team from becoming overwhelmed with coordination tasks. Alerts that require specialized knowledge should automatically route to the right teams, reducing delays and enhancing efficiency. Additionally, the system must allow for dynamic changes in alert destinations, potentially based on attributes like AWS tags.

It’s important to differentiate this requirement from engaging individuals for input on alert resolution. Here, we are discussing the delegation of the entire alert handling process to other teams, rather than automating individual user feedback.

5. Detection as code

One of our guiding principles in defining requirements for the detection engine was adherence to best practices in software development. This led us to embrace "detection as code" from the start. All our detections are managed as code and maintained under version control in Git. Before we implement any changes, we follow a rigorous review process comparable to that used in software development. Additionally, our approach emphasizes thorough testing to make sure all detections are reliable and generate accurate alerts.

6. Built-in automation capabilities

While creating detections, it’s crucial to consider the alerts they generate and how to manage them without being overwhelmed. One of our key requirements when developing our detection and response program was therefore to reduce the manual effort involved in handling alerts. Our detection engine must not only support automation of alert handling but have these capabilities built in, allowing it to interact with external tools via APIs. Adding new automation for alert handling should not require designing a new solution; it should plug directly into the platform itself.

7. Support for LLMs

In a time when AI is a hot topic, building a detection engine without support for large language models (LLMs) in its detection logic would be incomplete. We recognize this and have ensured our system can seamlessly integrate LLM capabilities. LLMs enhance detection effectiveness by processing unstructured data, providing contextual understanding, reducing false positives, and dynamically adapting to new data patterns, making our security monitoring more accurate and efficient.

Our approach to the detection engine

Before we started implementing our detection engine, we evaluated our ambitions against our capabilities, recognizing our limited resources. We set an ambitious goal to have a fully functional detection system within 2-3 months of starting development. We aimed to use technologies familiar to our Detection and Response Team (DART). After analyzing options like the AWS SDK for Python and the Serverless Framework, we chose Terraform for cloud resource provisioning. This decision was based on the team's ability to effectively debug potential issues in the detection pipeline and to add new features to the platform.

To meet the requirement for built-in automation capabilities, we chose Python for handling the output from Snowflake queries. This choice was made not only to automate alert handling but also to implement a set of reusable libraries for all detections. These libraries include essential features such as:

  • Deduplicating similar alerts
  • Abstracting interaction with alert destinations
  • Sending automatic inquiries to users
  • Contextually linking alerts in the ticketing system

Core architecture

Our detection engine is built on top of data ingested into our lakehouse. Each detection consists of two parts: a scheduled task to periodically run SQL queries and a Lambda function to handle the output from the queries and route it to the appropriate place for further handling.

Scheduled tasks

We use Snowflake tasks for scheduling, created in a dedicated database and schema for better grouping and access control. These tasks contain the main detection logic, with the SQL query defined in YAML files. When these tasks produce any output, it is passed as an argument to external functions that invoke dedicated Lambda functions in AWS. Importantly, the integration between the lakehouse and AWS is performed through authenticated calls using IAM roles, which eliminates the need to maintain credentials.
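
To make this concrete, here is a minimal sketch of how a scheduled task could be rendered from a detection's YAML definition. The database, schema, and external function names (DETECTIONS.PROD, INVOKE_DETECTION_LAMBDA) are illustrative rather than our actual object names, and in practice the DDL is generated by our Terraform modules, not a script like this.

# Illustrative only: rendering a serverless Snowflake task from a detection's
# YAML metadata. Object names are hypothetical.
import yaml

def render_task_ddl(detection_yaml_path):
    with open(detection_yaml_path) as f:
        detection = yaml.safe_load(f)

    task_name = detection['Identifier'].replace('-', '_')
    return f"""
CREATE OR REPLACE TASK DETECTIONS.PROD.{task_name}
  SCHEDULE = 'USING CRON {detection['ScheduleTime']} UTC'
  USER_TASK_MANAGED_INITIAL_WAREHOUSE_SIZE = 'XSMALL'  -- serverless task
AS
  -- Pass every matching row to the external function, which invokes the
  -- detection's Lambda through API Gateway using IAM-based authentication.
  SELECT DETECTIONS.PROD.INVOKE_DETECTION_LAMBDA(OBJECT_CONSTRUCT(*))
  FROM ({detection['SQL']});
"""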

Lambda functions

In our setup, Lambda functions handle the output from SQL queries. We developed a set of libraries mapped to Lambda layers for better code reuse, easily imported into the Lambda function environment. These libraries process the output from SQL queries and ensure alert deduplication, select appropriate alert destinations, and provide additional features like contextual linking and sending inquiries to users on Slack for their input in automatic alert handling. 

Deployment to cloud

We use Terraform for deployment. However, we enhanced our Terraform usage with separate modules and decided to incorporate Terragrunt. This decision stemmed from Terraform's lack of support for parallel module deployments. Given our plan to implement hundreds of detections, sequential deployment would significantly increase build time. With Terragrunt, independent modules can be deployed in parallel, optimizing the deployment process and reducing build times.

Detection as code

One of our core requirements was to adopt the “detection as code” principle. To easily define new detections and ensure compliance with our standards, we decided to map all detection metadata in YAML files, which can be easily validated in the CI/CD pipeline. By adopting the “detection as code” principle and storing our detections in a GitHub repository, we can easily track changes and follow a git-based workflow to ensure that all changes are reviewed before being pushed to the production environment.

Detection: “AWS console login using the Root account” 

The following code snippet is a real example of a detection definition. Note that all of the essential attributes and logic are included in the YAML file to ensure compliance with our standards and streamline alert handling.

CreationDate: 2022-02-15
Author: piotrszwajkowski
Identifier: DET-1
Severity: Sev0
Enabled: True
DedupPeriod: 120
DetectionPythonFile: root_user_console_login_rule.py
Environment:
  - prod
Name: AWS Root user console login
Description: >
  This detection is intended to trigger if the Root user signs in to the AWS console
RequiredLogSources:
  - LOGS_PROD.RAW.CLOUDTRAIL_VIEW
Tags:
  - Intrusion
  - PolicyViolation
ScheduleTime: '*/5 * * * *'
MitreTactics:
  - Defense Evasion
  - Persistence
  - Privilege Escalation
  - Initial Access
MitreTechniques:
  - T1078.004
AlertCategory: Unauthorized Access
EventSource:
  - AWS-CloudTrail
SQL: >
  SELECT
    additionaleventdata:MFAUsed as mfaused,
    sourceipaddress,
    eventid,
    eventname,
    eventtype,
    eventsource,
    recipientaccountid,
    useragent,
    useridentity:arn useridentityarn,
    useridentity:type useridentitytype,
    event_time,
    parse_time
  FROM LOGS_PROD.RAW.CLOUDTRAIL_VIEW
  WHERE eventname = 'ConsoleLogin'
    AND useridentity:type = 'Root'
    AND responseelements:ConsoleLogin = 'Success'
    AND parse_time > dateadd(minute, -7, sysdate())
Runbook: |
  1. Go to the lakehouse and gather more context around the activity by correlating other log sources based on the source IP address, including:
     - Sign-in attempts to Rippling IDP
     - EDR activity
     - AWS CloudTrail
  2. If you manage to correlate the user who performed the activity, reach out to them on Slack for confirmation and more context.
  3. If necessary, follow the Incident Response process for escalation.
References:
  - https://docs.aws.amazon.com/signin/latest/userguide/introduction-to-root-user-sign-in-tutorial.html
Destinations:
  - Name: Slack
    SlackChannelIdSSMParam: DartSlackChannelId
  - Name: Jira
    ProjectKey: DART
    IssueType: Alert
  - Name: OpsGenie
    OpsGeniePriority: P1
    OpsGenieTeam: DART
Tests:
  - Name: Root user console login
    ExpectedResult: True
    SnowflakeOutput: {
      "RECIPIENTACCOUNTID": "111111111111",
      "EVENTID": "b4170c9b-561e-49f9-8245-96f51f2497ed",
      "EVENTNAME": "ConsoleLogin",
      "EVENTSOURCE": "signin.amazonaws.com",
      "EVENT_TIME": "2023-02-14 22:17:30.000",
      "EVENTTYPE": "AwsConsoleSignIn",
      "MFAUSED": "Yes",
      "SOURCEIPADDRESS": "127.0.0.1",
      "USERAGENT": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
      "USERIDENTITYARN": "arn:aws:iam::111111111111:root",
      "USERIDENTITYTYPE": "Root"
    }

The above YAML file references a file named root_user_console_login_rule.py, the content of which is shown below:

from dataclasses import dataclass

from base_helpers.base_helpers import base_lambda_handler
from base_helpers.detection_helpers import deep_get


@dataclass
class DetectionModel:
    event_time: str
    event_name: str
    event_type: str
    event_source: str
    event_id: str
    mfa_used: str
    recipient_account_id: int
    user_identity_arn: str
    user_identity_type: str
    source_ip_address: str
    user_agent: str


def parse_event(event):
    model = DetectionModel(
        event_time=deep_get(event, 'EVENT_TIME'),
        event_name=deep_get(event, 'EVENTNAME'),
        event_type=deep_get(event, 'EVENTTYPE'),
        event_id=deep_get(event, 'EVENTID'),
        event_source=deep_get(event, 'EVENTSOURCE'),
        mfa_used=deep_get(event, 'MFAUSED'),
        recipient_account_id=deep_get(event, 'RECIPIENTACCOUNTID'),
        user_identity_arn=deep_get(event, 'USERIDENTITYARN'),
        user_identity_type=deep_get(event, 'USERIDENTITYTYPE'),
        source_ip_address=deep_get(event, 'SOURCEIPADDRESS'),
        user_agent=deep_get(event, 'USERAGENT')
    )
    return model


def dedup(event):
    # Deduplicate on source IP, AWS account, and event name.
    model = parse_event(event)
    return f"{model.source_ip_address}-{model.recipient_account_id}-{model.event_name}"


def rule(event):
    # Always alert; the detection logic lives entirely in the SQL query.
    return True


def title(event):
    model = parse_event(event)
    return f"AWS Console Root Login from {model.source_ip_address} in {model.recipient_account_id} AWS account"


def alert_context(event):
    return event


def lambda_handler(event, context):
    return base_lambda_handler(event, context)

Let's take a closer look at the Lambda code:

  • The entry point is the lambda_handler function, which directs the flow to our main helper, the base_lambda_handler function. This function manages the entire flow of the Python code, including interactions with alert destinations and alert deduplication. It also helps parse the Snowflake query results into separate events, with each row representing an event. A simplified sketch of this flow follows the list below.
  • For detection logic, we instantiate a DetectionModel object for each event. This approach parses the query result columns and structures the code, allowing for clear references to different attributes.
  • The rule function determines whether an alert is triggered, returning True to send an alert and False otherwise. While most detection logic is in the SQL query, having the ability to stop alerting in Python allows external API calls to influence the detection flow. A clear example of this process is provided further in the document.
  • The title function dynamically sets titles for alerts based on the alert body content. 
  • The alert_context function is a placeholder for any external API calls that could enrich alerts with additional information helpful in handling them.
  • The dedup function is described in the next section.
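
For illustration, here is a highly simplified sketch of what a handler like base_lambda_handler could look like. The payload format, the stubbed helpers, and the way the detection module's hooks are passed in are assumptions for readability; the real library also handles destination routing, Slack inquiries, and contextual linking.

# Highly simplified sketch of the base handler; not the actual implementation.
import json

def is_duplicate(dedup_key):
    """Placeholder for the DynamoDB-backed check described later in this post."""
    return False

def send_to_destinations(alert):
    """Placeholder for the Jira/Slack/Opsgenie routing layer."""
    print(alert)

def base_lambda_handler(event, context, detection_module):
    # External functions send rows as {"data": [[row_number, row], ...]}.
    rows = json.loads(event['body'])['data']
    results = []
    for row_number, row in rows:
        if not detection_module.rule(row):
            results.append([row_number, 'suppressed'])
            continue
        if is_duplicate(detection_module.dedup(row)):
            results.append([row_number, 'deduplicated'])
            continue
        send_to_destinations({
            'title': detection_module.title(row),
            'context': detection_module.alert_context(row),
        })
        results.append([row_number, 'alerted'])
    # Reply in the shape Snowflake expects from an external function.
    return {'statusCode': 200, 'body': json.dumps({'data': results})}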

The need for alert deduplication

Alert fatigue is a significant and chronic problem for detection and response teams. Poor management of alerts and the lack of deduplication can lead to overlooked critical alerts and team burnout. A mature approach to detection engineering considers the impact of alerts on the responding team, ensuring they are not overwhelmed and can maintain high performance and accuracy. For instance, if a user performs multiple manual actions in a production environment, it’s more practical to alert the on-call team once rather than each time the user acts. This highlights the importance of the deduplication function (dedup).

The dedup function constructs a string for alert deduplication to achieve two primary objectives:

  1. Preventing duplicate events from overlapping query executions. Since SQL queries can be scheduled with minute-level granularity, we extend the time condition applied to the parse_time column (which records when a log entry arrives in the Snowflake table) to cover a slightly larger time frame. This prevents important events from being missed but may result in duplicate events being returned by the SQL query in different executions.
  2. Reducing noise from repetitive actions. Some detections can be quite noisy. As mentioned earlier, if a user performs repetitive manual actions in the production environment, it is more practical to alert the on-call team once and allow them to confirm the activity and provide context, rather than triggering an alert every time the user acts. This reduces alert volume but involves a trade-off, as some important events could be missed.

Handling data duplication from Snowflake

According to the Snowflake documentation, Snowflake does not guarantee that each row is processed exactly once by remote services. We aimed to be resilient to duplicated data coming from Snowflake queries to avoid flooding our alerting system and the response team.

Under the hood: How does dedup work?

When an SQL query returns results, Snowflake triggers Lambda functions. Because Lambda functions are stateless, we use a DynamoDB table for persistent storage to manage deduplication.

String construction: For each alert, the Lambda function calls the detection’s dedup function to build a deduplication string from attributes in the alert body (for example, the source IP address, account ID, and event name).

Persistent storage: If the string is not present in the DynamoDB table within the detection context, it is added with a timestamp.

Checking for duplicates: When a new alert from the same detection is triggered, Lambda reconstructs the deduplication string and queries DynamoDB. It checks for the string's existence and compares timestamps.

Triggering or suppressing alerts: If the deduplication string exists and the timestamp difference is less than or equal to the DedupPeriod (in minutes) defined in the YAML file, the alert is not sent. Otherwise, a new alert is triggered.

This mechanism ensures that the alert system remains efficient and prevents the team from being overwhelmed by redundant alerts.
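
To make the mechanism concrete, the check could be implemented roughly as follows with boto3 and DynamoDB. The table name, key schema, and attribute names here are hypothetical; only the overall logic mirrors what was described above.

# Illustrative dedup check backed by DynamoDB; names are hypothetical.
import time
import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('detection-dedup-store')

def should_alert(detection_id, dedup_string, dedup_period_minutes):
    key = {'detection_id': detection_id, 'dedup_string': dedup_string}
    now = int(time.time())
    item = table.get_item(Key=key).get('Item')
    if item and now - int(item['last_seen']) <= dedup_period_minutes * 60:
        return False  # same alert seen within the dedup window: suppress it
    # First occurrence (or the window has elapsed): record it and alert.
    table.put_item(Item={**key, 'last_seen': now})
    return True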

Customizable alert routing

To be able to route alerts to specific destinations, we added a dedicated key to the YAML structure and implemented the corresponding logic in a Lambda layer. This allows us to specify the project in Jira where alerts should be created, the team in Opsgenie to be paged, or the Slack channel for notifications. Alert destinations can also be dynamically selected and overwritten in the Python code, which gives us flexibility to direct alerts to the appropriate teams.
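
As a hypothetical example, a detection could override its YAML-defined destinations in Python roughly like this. The hook name, account list, and destination values below are illustrative, not our exact interface.

# Hypothetical destination override hook inside a detection module.
INFRA_TEAM_ACCOUNTS = {'222222222222'}  # example AWS accounts owned by another team

def destinations(event):
    model = parse_event(event)  # the same per-detection parser shown earlier
    if model.recipient_account_id in INFRA_TEAM_ACCOUNTS:
        # Route alerts for these accounts straight to the owning team.
        return [
            {'Name': 'Jira', 'ProjectKey': 'INFRA', 'IssueType': 'Alert'},
            {'Name': 'Slack', 'SlackChannelIdSSMParam': 'InfraSlackChannelId'},
        ]
    return None  # fall back to the destinations defined in the YAML file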

CI/CD pipeline

To automate the testing and deployment of detections into both the lakehouse and AWS, we developed a dedicated Continuous Integration/Continuous Deployment (CI/CD) pipeline. This pipeline includes various checks to ensure detections comply with our standards, contain all mandatory fields in the YAML files, and are free from logical errors. We established two separate environments: one for development and one for production. When an engineer creates a new PR, the CI/CD pipeline automatically deploys the changes to the development environment. Once all checks pass and the PR is reviewed by another engineer, the change can be deployed to production using a different workflow.

Unit tests and integration tests

When a new PR is created or updated, the CI/CD pipeline validates the YAML files and requires that each one includes the Tests field. These test cases validate the entire pipeline through both unit and integration tests: they are first run against the functions defined in the detection’s Python file, and then used to ensure the integration with the lakehouse is error-free.
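
Conceptually, the unit-test half looks something like the sketch below, which replays the Tests entries from the YAML file against the detection’s Python functions. The file path and module import are illustrative.

# Sketch of a pytest-style unit test driven by the YAML Tests field.
import yaml

import root_user_console_login_rule as detection

def test_root_user_console_login():
    with open('detections/root_user_console_login.yaml') as f:
        definition = yaml.safe_load(f)

    for case in definition['Tests']:
        event = case['SnowflakeOutput']
        # The rule outcome must match the expectation declared in the YAML file.
        assert detection.rule(event) == case['ExpectedResult']
        # The remaining hooks should also run cleanly on the sample row.
        assert detection.dedup(event)
        assert 'Root Login' in detection.title(event)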

Git pre-commit hooks

To expedite the development process and catch errors early, we execute all unit tests directly on the developer’s laptop. This approach prevents long-running builds and avoids overloading the pipeline. By running tests locally, engineers can quickly identify and resolve issues before committing code, ensuring a smoother and more efficient development workflow.

Scalability and cost-effectiveness

Our scalability focus is not just on creating numerous detections; it's also on ensuring that the system we build is cost-efficient. We leverage serverless tasks in Snowflake, allowing us to bypass warehouse size limitations. According to the Snowflake documentation, a single account can support up to 30,000 tasks, which meets our needs. On the AWS side, tens of thousands of Lambda functions can run concurrently within a single account.

Another way we keep costs down is by prioritizing efficiency on the Snowflake side: the SQL queries only generate output when specific activities are detected, minimizing Lambda costs. Our estimated cost for running a detection on AWS CloudTrail logs is approximately 1.8 Snowflake credits per month, or $4.50 USD. We can achieve this low cost thanks to our query execution and data clustering strategies.

First, we cluster logs in Snowflake based on the event time truncated to an hour, grouping logs generated within the same hourly time frame. This clustering ensures that our queries target only a small subset of logs. For instance, the AWS Root Login detection scans an average of 50-70 MB every five minutes.
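
For reference, defining such a clustering key looks roughly like the snippet below. The table name, connection parameters, and exact key expression are placeholders; our actual setup may differ.

# Illustrative clustering setup executed once against the raw table.
import snowflake.connector

conn = snowflake.connector.connect(
    account='example_account',
    user='example_user',
    authenticator='externalbrowser',
)
# Group rows that arrived in the same hour so a detection that runs every few
# minutes prunes all but the most recent micro-partitions.
conn.cursor().execute(
    "ALTER TABLE LOGS_PROD.RAW.CLOUDTRAIL "
    "CLUSTER BY (DATE_TRUNC('HOUR', EVENT_TIME))"
)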

Second, we execute detections as serverless tasks, eliminating the need for pre-provisioned and over-provisioned warehouses. This approach dynamically allocates resources based on actual workload, optimizing cost-efficiency and maintaining performance without unnecessary expenditure. This strategic use of serverless tasks allows us to scale efficiently while keeping costs under control.

Built-in automation capabilities

When considering a modern Detection and Response team, automation isn't just an addition—it's a core feature. Without automating repetitive tasks in alert handling, team members can easily become overwhelmed and burn out. Traditionally, security orchestration, automation and response (SOAR) platforms were layered on top of SIEM-produced alerts. In our approach, SOAR is integrated within our SIEM, removing the boundary between the two. 

We achieved this integration using Lambda functions and Lambda layers. For new automation features, we create global helpers (mapped to Lambda layers), which the CI/CD pipeline deploys. When these global helpers need to call external APIs, they often require credentials for authentication. To securely store these credentials, our Terraform modules facilitate the creation of new secrets using AWS Secrets Manager. This ensures that the credentials are available and secure when calling external APIs.
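
A sketch of what such a helper might look like is shown below. The class name, secret name, and API details are illustrative assumptions rather than our actual SentinelOne connector.

# Hypothetical connector helper backed by AWS Secrets Manager.
import json

import boto3
import requests

class SentinelOneConnector:
    def __init__(self, secret_name='dart/sentinelone-api-token'):
        # Credentials are provisioned by Terraform into AWS Secrets Manager.
        secret = boto3.client('secretsmanager').get_secret_value(SecretId=secret_name)
        creds = json.loads(secret['SecretString'])
        self.base_url = creds['base_url']
        self.session = requests.Session()
        self.session.headers['Authorization'] = f"ApiToken {creds['api_token']}"

    def get_endpoints_for_ip(self, ip_address):
        # Ask the EDR API which agents have reported the given IP address.
        response = self.session.get(
            f"{self.base_url}/web/api/v2.1/agents",
            params={'networkInterfaceInet__contains': ip_address},
        )
        response.raise_for_status()
        return response.json().get('data', [])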

The following code snippet demonstrates the built-in automation capabilities. This simple automation checks if a given IP address is assigned to any endpoint reporting to SentinelOne. If the rule function returns true, an alert is triggered; otherwise, no alert is generated. This approach can prevent unnecessary alerts and save analysts' time across various detections.

from dataclasses import dataclass

from base_helpers.base_helpers import base_lambda_handler
from base_helpers.sentinelone import SentinelOne

(...)

def rule(event):
    model = parse_event(event)
    sentinelone_connector = SentinelOne()
    results = sentinelone_connector.get_endpoints_for_ip(model.source_ip)
    if len(results) == 0:
        return True
    else:
        return False

Resiliency to API errors

API calls can fail for various reasons, such as network issues, server overloading, or API rate limits. To address potential issues, we designed our system with resiliency in mind and analyzed different components of the architecture.

Snowflake external functions

Snowflake external functions are called synchronously, meaning the query waits for a response before it can complete. To avoid timeouts (a Lambda function can run for up to 15 minutes, far longer than a synchronous call should take), we re-invoke the Lambda function asynchronously after receiving the data from Snowflake. The synchronous execution invoked by Snowflake is used only to trigger that asynchronous run and to reply to Snowflake with an HTTP 200 status and the same results that were passed to the external function.
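
A minimal sketch of this "acknowledge synchronously, process asynchronously" pattern is shown below; the flag name and processing stub are illustrative.

# Sketch of re-invoking the same Lambda asynchronously and acknowledging Snowflake.
import json
import os

import boto3

lambda_client = boto3.client('lambda')

def process_detection_output(event):
    """Placeholder for the real alert handling (dedup, routing, inquiries)."""

def lambda_handler(event, context):
    if event.get('async_processing'):
        # Second, asynchronous invocation: do the actual (possibly slow) work here.
        return process_detection_output(event)

    # First invocation, called synchronously through the external function:
    # hand the payload off to an async invocation of this same function...
    lambda_client.invoke(
        FunctionName=os.environ['AWS_LAMBDA_FUNCTION_NAME'],
        InvocationType='Event',  # fire-and-forget, does not block Snowflake
        Payload=json.dumps({**event, 'async_processing': True}),
    )
    # ...and immediately reply to Snowflake with HTTP 200 and the same rows.
    return {'statusCode': 200, 'body': event['body']}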

If there's an error from AWS API Gateway or any other network component, Snowflake retries requests until a total retry timeout is reached, providing some level of resiliency. However, if the external function fails, the entire task in Snowflake is marked as failed, and our on-call team is notified through scheduled task monitoring (described in Engineering a SIEM part 2: Rippling's security data lakehouse and modular design).

Lambda functions

API calls made from within Lambda functions may also fail. We address these errors by using automatic retries in the Python requests library. Single Lambda invocations are limited to a maximum of 15 minutes, so prolonged retries can result in AWS terminating the Lambda execution. In cases where a Lambda function is canceled, CloudWatch alerts notify our on-call team about failed detection runs, allowing them to handle these cases manually. Fortunately, such instances are rare (occurring less than once every 2-3 months).
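
For example, retries can be configured once on a shared requests session, roughly as follows; the retry budget shown is illustrative.

# Configuring automatic retries for outbound API calls made from Lambda.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def build_session():
    retry = Retry(
        total=5,                                    # up to five attempts
        backoff_factor=2,                           # exponential backoff between attempts
        status_forcelist=[429, 500, 502, 503, 504],
        allowed_methods=['GET', 'POST'],
    )
    session = requests.Session()
    session.mount('https://', HTTPAdapter(max_retries=retry))
    return session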

Support for LLMs

Using large language models (LLMs) for detections is a hot topic. LLMs offer vast possibilities for querying unstructured text data, but we must ensure pipeline stability to prevent LLM hallucinations from disrupting data processing and alerting. Therefore, we use LLMs cautiously, as tools to enhance context rather than make decisions. This approach is beneficial at Rippling, where frequent codebase changes can lead to static detections causing false positives. We ask LLMs whether an alert correlates with recent changes and attach relevant PR links to alert outputs, reducing manual alert handling time. How we use LLMs will be the focus of a future blog post in this series.
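
As a purely illustrative sketch, LLM output can be attached as alert context without ever gating whether an alert fires. The client, model, prompt, and GitHub helper below are placeholders, not necessarily what we use; the details will be covered in the future post.

# Illustrative context enrichment only; the LLM never decides if an alert fires.
from openai import OpenAI

client = OpenAI()

def fetch_recent_pull_requests():
    """Placeholder: would return titles and links of recently merged PRs."""
    return []

def alert_context(event):
    recent_prs = fetch_recent_pull_requests()
    completion = client.chat.completions.create(
        model='gpt-4o-mini',
        messages=[{
            'role': 'user',
            'content': (
                'Given this security alert and these recently merged PRs, list any '
                f'PRs that could explain the activity.\nAlert: {event}\nPRs: {recent_prs}'
            ),
        }],
    )
    return {**event, 'possibly_related_changes': completion.choices[0].message.content}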

Conclusion

In this post, we explored the crucial components and strategies for building a robust detection engine within our SIEM system. We emphasized the importance of consistency, scalability, efficient alert handling, and automation. By leveraging technologies like Snowflake, AWS Lambda, and Terraform, we have created a system that not only meets our standards but also remains cost-effective and highly functional. Our approach ensures that detections are reliable, alerts are managed efficiently, and the entire pipeline is scalable and adaptable to our evolving needs. 

Comprehensive integration and automation within our SIEM platform enable us to maintain robust security measures, ensuring that threats are detected promptly, responses are swift and accurate, and sensitive data is protected from breaches, all while minimizing manual efforts and optimizing resource usage.


last edited: August 14, 2024

The Author

Piotr Szwajkowski

Staff Security Engineer

Piotr serves as the Staff Security Engineer on Rippling's Security Operations Team, where he specializes in developing detection strategies and responding to security incidents.