Inside Look: How SenseOn’s Security Engineering Team Perfects Their Detection Analytics
Want to know how we do it, what tools we use and the problems we’ve solved along the way? Read on!
In the Security Engineering team at SenseOn, we are constantly seeking methods to achieve big goals with minimal input. This is where our focus on automation becomes essential. If you peek inside our team, you’ll find a trove of unique and effective automation tools that aid us on our journey. From small Python scripts to GitHub Actions to Slack bots, we have it all. These tools are generally small enough in their functionality to avoid the pitfalls illustrated in that XKCD comic we all think about. Except for one project: our automated detection testing framework. While mostly robust, it often requires significant ongoing development because we keep developing new ideas!
Testing is an essential component of any development process. Our engineering teams spend a significant amount of time testing their code to ensure it’s ready for production. We believe we should treat detections in the same way.
Our primary goal for detection testing is to ensure the quality and consistency of the detections. A secondary goal is to ensure the detections don’t negatively impact our product’s performance. Additionally, we found that the data generated from the test cases can aid in developing new detections, though we don’t want to rely solely on test cases for this; they are simply additional data inputs.
When evaluating the performance of a detection, we look for instances where a detection has a high false positive rate. For example, if we’ve designed a new detection to look for changes to the registry but we see a significant increase in unrelated observations, we will remove the detection from circulation and revise it. We also aim to avoid detections underperforming, which could result in missing potential true positives.
Overview
Our testing architecture integrates various open-source tools to create a repeatable and robust testing methodology.
Packer - Builds out base virtual machines
Terraform - Manages the virtual machine state
Ansible - Executes the testing framework, performs some OS management
GitHub Actions - Used as a queue to handle test executions
PostgreSQL & PostgREST - The data storage and API solution
Grafana - Displays all our data
Atomic Red Team - Open source security test cases
Custom Python tooling
The Server
All Security Engineering testing and research is conducted on a single dedicated server using a generic hypervisor instead of cloud-based, on-demand infrastructure. We found this to be the most cost-effective method given the high number of test cases, the variety of operating systems we test on, and the frequency of testing. All these factors require substantial compute power, so the cost of on-demand instances adds up quickly.
During a full test run, we hit the upper limit of CPU capacity on the server. When executing a Windows 10 test run, we limit execution to 4 concurrent virtual machines and still see CPU utilisation peak at over 100%. This is a bottleneck we want to fix, but we are limited by the fact that even a simple Windows 10 machine uses significantly more CPU than an equivalent Linux machine. Couple the high base CPU usage with the CPU required to execute atomics and the resource is quickly consumed.
For reference, the server specification we use is:
Intel® Core™ i9-13900
128GB DDR5 ECC RAM
3 x 2 TB NVMe SSD Datacenter Edition
Tooling in Depth
Packer
We use Packer to create all our base images required for testing. Our goal is to have a base image that includes everything needed for the test run, minimising the time spent installing packages on the virtual machine during the execution of a test.
We execute Packer builds weekly, destroying the old build and replacing it with a fully updated version. This guarantees that we’re always using the latest packages and system updates, preventing us from wasting valuable execution time updating a virtual machine.
You may be wondering why we didn’t go down the container route. Containers were in fact used in one of the first iterations of our testing suite. They worked for the most part, but we soon found that, due to the stripped-down nature of a container, we’d end up spending a lot of time installing test dependencies and then storing large containers. Additionally, as our customers are using full-fat virtual machines, containers would make our test environments less representative of production.
Terraform
Out of all the tools we use, this one is probably the most underutilised. Terraform is incredibly powerful at infrastructure management. We use it extensively as an organisation, but for Security Engineering, it has one small but important role: clone the virtual machines, switch them on, pass Ansible the IP addresses, and store a state so virtual machines can be destroyed.
We’re looking at methods to remove our reliance on Terraform, but at the moment it works, so we’re leaving it alone.
Atomic Red Team
The Atomic Red Team project is an incredible initiative, and we’re proud to say we’ve contributed to it and will continue to do so.
At the time of writing, the project includes 1057 tests for Windows, 349 tests for Linux, and 208 tests for macOS covering 59%, 52% and 52% of the MITRE ATT&CK framework respectively. This is an impressive number of tests, and the community should be proud. However, meeting our project goals of running all tests made things a bit tricky and required some time-consuming analysis to ensure a full execution without issues.
Some of the pitfalls we had to navigate included:
Atomics that shut down, restart, suspend, or perform destructive actions to the virtual machine were removed from the test pool. In the future, we would like to bring them back into the test pool either by executing them on their own machine or by Ansible catching and managing the action.
Atomics that attempt to download dependencies that no longer exist. We tried to fix the dependencies and opened a pull request if successful. If not, we disabled the test.
Atomics that only work on specific OS configurations, such as a Domain Controller. These tests are currently disabled, but we’re working towards enabling them soon.
Atomics that didn’t execute correctly and caused Ansible to crash. These were also disabled.
Antimalware Scan Interface (AMSI) interfering with execution. Atomic Red Team often uses malicious PowerShell scripts, which AMSI blocks. We disable AMSI at the registry level but must continuously check and ensure it remains disabled because Windows re-enables it.
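The exclusion process described above could be sketched as follows. This is a minimal illustration rather than our actual tooling: the `Atomic` class, the `EXCLUDED` mapping, the placeholder GUIDs and the `build_test_pool` function are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atomic:
    guid: str
    name: str
    platform: str


# Hypothetical exclusion list: atomic GUID -> reason it was removed from the pool.
EXCLUDED = {
    "guid-reboot": "restarts the virtual machine",
    "guid-dc-only": "requires a Domain Controller",
}


def build_test_pool(atomics, platform, excluded=EXCLUDED):
    """Split one platform's atomics into runnable tests and (atomic, reason) skips."""
    runnable, skipped = [], []
    for atomic in atomics:
        if atomic.platform != platform:
            continue
        reason = excluded.get(atomic.guid)
        if reason is None:
            runnable.append(atomic)
        else:
            # Keep the reason so a disabled test can be revisited later.
            skipped.append((atomic, reason))
    return runnable, skipped
```

Recording a reason next to each excluded atomic makes it easier to bring tests back into the pool once the underlying issue has been addressed.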
Ansible
Within our workflow, Ansible is tasked with a significant amount of work. Beyond its traditional role in configuration management, we use it to orchestrate the execution of Atomic Red Team.
During the execution flow, Ansible makes a few configuration changes to the cloned virtual machine, assigning it a unique hostname and ensuring everything has been correctly installed by Packer. After completing these tasks, it proceeds to execute Atomic Red Team.
The team at Red Canary have provided the Atomic Red Team execution Ansible role, which simplifies setting up Ansible to execute Atomic Red Team. As our requirements grew, so did the Ansible role. Most of our modifications are geared towards managing the SenseOn Universal Sensor, enabling us to correlate test executions with test results, i.e. observations raised on the SenseOn platform. Without these changes, identifying when a test was executed and subsequently observed would be nearly impossible.
After performing all the ART test cases on the target virtual machine, Ansible retrieves the execution log from the virtual machine and runs our management application to correlate the executed tests with the observed results.
GitHub Actions
We love GitHub Actions. From an automation standpoint, it’s magical.
Want to run some code on a schedule without deploying a server? GitHub Actions.
Want to ingest vulnerabilities from your scanning tool into Jira? GitHub Actions.
Want to run an analytic testing platform without deploying complex middleware? GitHub Actions.
Deploying local runners expands the possibilities for GitHub Actions. We can execute tasks we wouldn’t trust on a shared runner, mainly because granting a shared runner access to sensitive systems could expose those systems to compromise.
Our management application handles calls to the GitHub Actions API to execute workflows; depending on the state of a test execution, the relevant workflow is triggered. We’ve tried to keep the workflows as simple as possible; however, some complexity and variable interpolation are unavoidable.
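As a rough sketch of what triggering a workflow from Python can look like, the example below builds a `workflow_dispatch` request for the GitHub REST API without sending it. The repository name, workflow file names, the state-to-workflow mapping and the `plan_id` input are assumptions made up for illustration, not our real configuration.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"

# Hypothetical mapping from test-plan state to the workflow that runs next.
WORKFLOW_FOR_STATE = {
    "planned": "provision.yml",
    "provisioned": "execute.yml",
    "completed": "teardown.yml",
}


def build_dispatch_request(owner, repo, workflow_file, ref, inputs, token):
    """Build (but do not send) a workflow_dispatch request."""
    url = (
        f"{GITHUB_API}/repos/{owner}/{repo}"
        f"/actions/workflows/{workflow_file}/dispatches"
    )
    body = json.dumps({"ref": ref, "inputs": inputs}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def request_for_plan(plan, token):
    """Pick the workflow for the plan's current state and build the request."""
    workflow = WORKFLOW_FOR_STATE[plan["state"]]
    return build_dispatch_request(
        "example-org",            # hypothetical organisation
        "detection-testing",      # hypothetical repository
        workflow,
        "main",
        {"plan_id": str(plan["id"])},  # inputs are passed as strings
        token,
    )
```

Passing the resulting request to `urllib.request.urlopen` would send it. The target workflow needs a `workflow_dispatch` trigger that declares the matching inputs.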
PostgreSQL and PostgREST
PostgreSQL was chosen as the data store due to existing experience within Security Engineering, the ease of setup and configuration, and the additional functionality PostgreSQL offers for handling JSON data types. The use of JSON data types was an early design decision that didn’t work out. We found it too limiting in how we wanted to manipulate our data, so we returned to a standard relational database structure.
The magic happens with PostgREST. A colleague introduced PostgREST to us after a conversation about a challenge we were facing: inserting data into the database without having to develop a server element. They quickly responded with, “Check out PostgREST; it will define a REST API based on the DB schema automatically!” It’s great for our use case, reducing the complexity of developing a server-side element for our management application, as all we want to do is process the data and push it into the DB from anywhere without needing to open a DB connection.
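To illustrate why this removes the need for a server element, here is a minimal sketch of a bulk insert through PostgREST using only the Python standard library. PostgREST exposes each table as a route, so an insert is a `POST` to `/<table>` with a JSON body. The base URL, the `test_results` table and its columns are hypothetical, and the request is built but not sent.

```python
import json
import urllib.request


def build_insert_request(base_url, table, rows, jwt):
    """Build (but do not send) a PostgREST bulk insert for the given table."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/{table}",
        data=json.dumps(rows).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {jwt}",
            # Ask PostgREST not to echo the inserted rows back.
            "Prefer": "return=minimal",
        },
    )


# Hypothetical table and column names, for illustration only.
example_rows = [
    {"atomic_guid": "guid-1", "hostname": "win10-a", "passed": True},
    {"atomic_guid": "guid-2", "hostname": "win10-a", "passed": False},
]
example_request = build_insert_request(
    "http://localhost:3000", "test_results", example_rows, "dummy-token"
)
```

Because the client only speaks HTTP, the same code works from a runner, a laptop or an Ansible host without any database driver installed.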
Grafana
All this data is valuable, but viewing it in an actionable way is even more crucial, and that’s where Grafana comes in. With support from our DevOps team, we developed a dashboard that displays:
Test status over time - Pass/Fail and a breakdown of each SenseOn component
Test execution stats - Did all the tests execute, or did any fail during a specific run?
Test pass rate - Overall test pass rate, broken down into MITRE tactics
Status history graphs - These show the execution history status of an atomic
The image to the left displays the status history visualisation at the atomic level. None of these atomics are functioning as expected, with observations being raised inconsistently. This graph is configured to show only changes, so if it’s all green or all red, it isn’t displayed. To assist with investigations, we’ve configured Grafana to provide a pop-up when a status box is selected. This pop-up allows the engineer to view the test execution status and the corresponding observations, or lack thereof.
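The two calculations behind these panels, pass rate per MITRE tactic and the "show only changes" filter, can be approximated in a few lines of Python. In practice this kind of aggregation would more likely live in the dashboard queries; the function names and data shapes below are assumptions made for the example.

```python
from collections import defaultdict


def pass_rate_by_tactic(results):
    """results: iterable of (tactic, passed) pairs -> {tactic: fraction passed}."""
    totals = defaultdict(lambda: [0, 0])  # tactic -> [passed, executed]
    for tactic, passed in results:
        totals[tactic][0] += int(passed)
        totals[tactic][1] += 1
    return {tactic: p / n for tactic, (p, n) in totals.items()}


def changed_only(histories):
    """Keep atomics whose run history contains both passes and fails.

    histories: {atomic_guid: [True, False, ...]} ordered by run.
    All-pass and all-fail histories are dropped, mirroring the graph.
    """
    return {
        guid: runs for guid, runs in histories.items() if len(set(runs)) > 1
    }
```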
It’s great to see the increase over time as our analytics improve and we introduce new methods to detect threats on the platform. Grafana acts as an early warning system for potential issues in our detections or new code because our testing platform is classified as a staging environment. Occasionally, we’ll notice the graph showing the number of detections dropping to 0. While this is a concern, it’s not too drastic since it doesn’t affect customers. This provides us with some breathing room and time to investigate the issue before the code is pushed out, ensuring the detection count returns to normal first.
Management Application
Each of these tools is brilliant on its own but cannot work together, so we need some glue - something that can sit between them to execute, manage, and collate data. This is where our management tool comes in.
Written in Python, the tool handles a variety of tasks in our lab and is designed with a CLI to be run when required. It acts as the interface between all the tools described as well as the database.
The diagram above shows where the management tool fits in:
A user requests a test to be executed.
The management application creates a test plan, containing what Atomic tests will be executed and how many virtual machines are required, then stores it in the database.
The management application interacts with the GitHub REST API to invoke a GitHub Actions workflow.
The GitHub Action local runner instructs Terraform to create and store a state for the required number of virtual machines.
The GitHub Actions workflow executes the management application with the system details provided by Terraform.
The management application updates the test plan within the database, enriching the system information with the data provided by Terraform.
GitHub Actions executes the Atomic Red Team test run with Ansible.
Once the test has been completed, Ansible will call the management application and submit the test results.
The management application queries the SenseOn API using all the details it has about the tests being run and the system information, and gathers the observations that have been raised. It then formats them to be consistent.
The management application updates the database with the test results, the observation results and any other metadata we deem necessary.
The user can manipulate and view dashboards displaying data about test performance.
The management application uses three pieces of data to programmatically determine which observations raised on the SenseOn platform apply to each atomic GUID: the execution times provided by the atomic testing log, the virtual machine hostname provided by Ansible, and the GUID of the atomic test, which is injected into the telemetry before it leaves the test device.
The injection of the test GUID into the telemetry does not impact or affect how detections work and is only used so we can match executions to observations.
For a test case to be marked as successful, it must return an observation from the SenseOn platform. Telemetry alone is not sufficient for a test to be marked as a successful detection, as this would mean our customers have to manually search their telemetry, which raises the barrier to determining whether a threat has been executed in their environment.
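The matching logic described above can be sketched roughly as follows. The data classes, field names and the five-minute grace window are assumptions introduced for illustration; the real observation format and timing tolerance are not described in this post.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Execution:
    atomic_guid: str
    hostname: str
    started: datetime
    finished: datetime


@dataclass(frozen=True)
class Observation:
    hostname: str
    raised_at: datetime
    injected_guid: str  # GUID carried in the telemetry behind the observation


def correlate(executions, observations, grace=timedelta(minutes=5)):
    """Map (hostname, atomic GUID) to the observations attributable to it.

    An observation matches when hostname and injected GUID agree and it was
    raised between the test start and `grace` after the test finished.
    The grace window is an assumed value, not a documented one.
    """
    matched = {(e.hostname, e.atomic_guid): [] for e in executions}
    for e in executions:
        for o in observations:
            if (
                o.hostname == e.hostname
                and o.injected_guid == e.atomic_guid
                and e.started <= o.raised_at <= e.finished + grace
            ):
                matched[(e.hostname, e.atomic_guid)].append(o)
    return matched


def verdicts(matched):
    """A test passes only if at least one observation was raised for it."""
    return {key: bool(obs) for key, obs in matched.items()}
```

Keying on hostname and GUID together keeps results separate when the same atomic runs on several virtual machines in one test plan.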
Our Experience
Time to Execute
Ansible is designed to execute its tasks as fast as possible. Its primary use case is provisioning and configuring infrastructure, and the typical Ansible user wants that done in as short a time as possible.
However, for each test case that is executed on a virtual machine, the following happens:
The SenseOn Universal Sensor is stopped
The atomic GUID is placed where it will be injected into the telemetry
The SenseOn Universal Sensor is started
The atomic test is executed
This sequence creates a race condition between the SenseOn Universal Sensor starting and the atomic test executing. It isn’t a common occurrence and is only an issue because of the speed at which Ansible executes its commands. As a result, we slowed execution down by sleeping Ansible for a second or two between starting the SenseOn Universal Sensor and executing the test, so we know the start command has finished before the test is executed. As a consequence, a full test run on Windows 10 takes around 6 hours.
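In our framework this sequence is driven by Ansible tasks, but the ordering and the settling pause can be illustrated in Python. The `sensor` object and its `stop`, `set_guid` and `start` methods are hypothetical stand-ins, and the two-second default is simply one reading of "a second or two".

```python
import time

SETTLE_SECONDS = 2  # assumed pause between sensor start and test execution


def run_atomic(guid, sensor, execute, sleep=time.sleep, settle=SETTLE_SECONDS):
    """Run one atomic using the stop, tag, start, wait, execute sequence."""
    sensor.stop()
    sensor.set_guid(guid)  # place the GUID where the sensor will pick it up
    sensor.start()
    sleep(settle)          # give the sensor time to finish starting
    return execute(guid)
```

Accepting `sleep` as a parameter keeps the pause explicit and lets the sequence be exercised in tests without actually waiting.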
To complete test runs faster, we can use our management tool to create a test plan targeting specific atomics, technique numbers or a full tactic. This is then used during the detection development process, with each detection mapped to the MITRE ATT&CK framework. From that data the management tool can be used to create a test plan that covers what has been developed. This provides the team with quick and actionable feedback on the performance of our detections during the development lifecycle.
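A targeted test plan of this kind could be built along the following lines. This is a sketch under assumptions: the `Atomic` fields, the selection rules and the round-robin split across virtual machines are illustrative, and the default of 4 virtual machines simply reuses the Windows 10 concurrency limit mentioned earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atomic:
    guid: str
    technique: str  # e.g. "T1059.001"
    tactic: str     # e.g. "execution"


def select_atomics(pool, guids=(), techniques=(), tactics=()):
    """Return atomics matching any given GUID, technique ID or tactic.

    A technique such as "T1059" also selects its sub-techniques ("T1059.001").
    """
    return [
        a for a in pool
        if a.guid in guids
        or any(
            a.technique == t or a.technique.startswith(t + ".")
            for t in techniques
        )
        or a.tactic in tactics
    ]


def make_test_plan(selected, max_vms=4):
    """Spread the selected atomics round-robin over no more VMs than needed."""
    vm_count = max(1, min(max_vms, len(selected)))
    batches = [selected[i::vm_count] for i in range(vm_count)]
    return {"vm_count": vm_count, "batches": batches}
```

For example, `select_atomics(pool, techniques=["T1059"])` followed by `make_test_plan(...)` would yield a plan covering only the atomics relevant to a detection mapped to that technique.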
What we can't see from the targeted approach is the wider performance impact. As a result, analytical detection releases wait for a full run to be completed overnight. This doesn’t apply to all our types of detections: our Protection and Threat Intelligence detections are deployed immediately, as they don’t require the same level of validation. We talk more below about how we want to address this to speed up our release cycle.
Maintenance
As we explained above, this toolset is the one within Security Engineering that requires a bit more TLC. The infrastructure element ticks along nicely, as we follow the methodology of treating our systems as cattle and not pets, so at any point we can destroy and rebuild. Our hypervisor, however, is not currently defined in code and requires a small amount of hands-on maintenance for patching and whenever we need to implement modifications.
Atomic Red Team are continuously releasing new atomics, and these new atomics could fall foul of the pitfalls we outlined in the Atomic Red Team section above. To ensure all atomics work correctly within our test suite and don’t have an adverse impact, we conduct a weekly review session of all new atomics. Any that would cause an impact are removed from the test pool, and a comment is applied noting why.
Future plans
Our focus now is to reduce the amount of time it takes for a full execution. As this project has grown organically, the original methods didn’t account for what we currently expect of it. We are looking at how we can make use of libvirt automation and capabilities to improve the speed of virtual machine management. This will have a big impact on how quickly we can develop detections, as well as allowing us to pack more operating system executions onto the server. We prefer this approach to “kicking the can down the road” through vertical scaling of our test server, although we accept this will be necessary at some point.
Some of the categorisations in Atomic Red Team mean that it can be difficult to easily select atomics that won’t harm the virtual machine. We’re considering contributing a change to the project that will help with this, and expect it will be of use to a wider range of ART users.
Finally, we want to add support for more operating systems and operating system configurations.
But wait, there’s more!
As well as the daily automated testing, we work with specialist third-party testing organisations to test our product.
Last year we worked with SE Labs to take part in their Enterprise Advanced Security program, achieving a AAA rating. The testing that SE Labs performed was in-depth and thorough, and gave us valuable feedback that we used to hone our detection capability even more.
Throughout 2024 we are participating in the AV-Comparatives main test series. AV-Comparatives is an independent organisation offering systematic testing that checks whether security software lives up to its promises. Using one of the largest collections of threat samples worldwide, it creates a real-world environment for truly accurate testing. We are pleased to say that SenseOn has been awarded the AV-Comparatives’ Approved Business Product Award for July 2024. More information can be found on the AV-Comparatives website.
Internally, we also perform simulated adversary exercises. These involve using our management application to build and deploy an enterprise-like network with SenseOn. We then attack that network using either our own attack plans or attack plans published by the likes of the Center for Threat-Informed Defense.
Conclusion
This type of detection testing would not be possible without Atomic Red Team. The project's mission is to provide atomic security tests to the masses, and it has a huge impact on our ability to quickly iterate in our detection development. We have no doubt that it has helped other product organisations, and we strongly recommend others contribute to the project. The core team is brilliant and, as a bonus, the first contribution will get you a t-shirt!
Throughout this post, we have spoken about the tools we’ve used to build a platform to help ensure we’re consistently developing detections to a high quality. While the testing improvement process is never-ending, it’s important to recognise that improvement happens incrementally over time and that acknowledging how far we’ve come is as important as thinking about all the work we’ve still got left to do. 18 months ago this framework didn’t exist, and 18 months from now its impact on our detection development will be significantly greater than it is now!
If you'd like to find out more about how SenseOn can protect your employees and data across endpoints, network and cloud using our AI-powered, human-backed security platform, book a demo with the team today.