How Huntress Transformed Its Detection Engine


Scaling data analysis to meet the demands of a rapidly growing company is a daunting challenge. This blog post details Huntress’ journey while facing this very dilemma. 

Read on to learn how we transformed Huntress’ data and log analysis by transitioning to a custom detection engine, leveraging OpenSearch hosted by AWS.

The Problem

At Huntress, we collect, analyze, and store logs from every monitored endpoint using a single, parallelized pipeline within a monolithic application. This setup lets us provide an “at least once” guarantee and easily access additional context to enrich the collected data.

However, as we’ve grown as a company, the system has encountered growing pains in the form of scaling issues, and some components have become cost-prohibitive. For argument’s sake, let’s assume this data amounts to approximately 900TB per month, or roughly 30TB per day, across our more than two million endpoints. For context, 30TB is about 1,400 hours of high-quality 4K content.

[Image: detection-engine-data]

That’s a TON of data. So, how does Huntress analyze it all?

Initial Solutions

As with most endpoint detection and response (EDR) companies, we use an installable agent for data collection. Our agent is very slim; it collects only the data we need and then uploads it to S3. These uploaded files, referred to as surveys, contain all new data collected since the last check-in. A background task then retrieves each survey file and starts a processing job. The processing job converts the single survey file into a stream of documents, enriches the documents, and stores key fields in our relational database before sending them to Elasticsearch. The data within a survey file loosely follows the Elastic Common Schema format, eliminating the need to transform documents before Elasticsearch indexing.
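
To make that flow concrete, here is a minimal Python sketch of what a processing job along these lines could look like. The bucket, endpoint, index name, and the enrich/store_key_fields helpers are illustrative placeholders rather than Huntress’ actual implementation, and it assumes surveys are newline-delimited JSON.

```python
import json

import boto3
from elasticsearch import Elasticsearch, helpers

s3 = boto3.client("s3")
es = Elasticsearch("https://elasticsearch.example.internal:9200")  # hypothetical endpoint


def enrich(doc: dict) -> None:
    """Placeholder: add organization/agent context looked up elsewhere."""


def store_key_fields(doc: dict) -> None:
    """Placeholder: persist key fields to the relational database."""


def process_survey(bucket: str, key: str) -> None:
    """Turn one uploaded survey file into enriched, indexed documents."""
    # A survey holds everything the agent collected since its last check-in;
    # newline-delimited JSON is assumed here.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    documents = [json.loads(line) for line in body.splitlines() if line.strip()]

    for doc in documents:
        enrich(doc)
        store_key_fields(doc)

    # Documents already loosely follow the Elastic Common Schema, so no
    # transformation step is needed before bulk indexing.
    helpers.bulk(
        es,
        ({"_index": "endpoint-telemetry", "_source": doc} for doc in documents),
    )
```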

[Image: Detection-Engine-01]

Once the documents are in Elasticsearch, they are available to the other tools in the Elastic Stack and to our Security Operations Center (SOC) analysts. This interoperability lets us use the Security app within Kibana to execute our detection rules. These rules are written in the Sigma format and then converted to a format the Security app can understand. The Security app executes each rule on a schedule based on our confidence that the rule will only find malicious behavior: the higher our confidence in the rule, the more often it runs. When a document matches a rule, a new document is created. We call this new document a signal. The signal contains the original document, enriched with information about the rule it matched. The signal is stored in Elasticsearch and forwarded to our huntress.io portal.
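
The Security app’s internals aren’t reproduced here, but the behavior described above can be approximated by a scheduled query like the sketch below. The index name, field layout, and rule dictionary are assumptions for illustration; the size=100 cap mirrors the per-run limit discussed later.

```python
from datetime import datetime, timezone

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("https://elasticsearch.example.internal:9200")  # hypothetical endpoint


def run_rule(rule: dict) -> None:
    """Approximate one scheduled run of a detection rule."""
    # The run only looks backward over documents that are already indexed,
    # and each run is capped at 100 matches.
    hits = es.search(
        index="endpoint-telemetry",
        query=rule["query"],  # the Sigma rule, pre-converted to an Elasticsearch query
        size=100,
    )["hits"]["hits"]

    signals = (
        {
            "_index": "signals",
            "_source": {
                **hit["_source"],  # the original matching document
                "signal": {        # enrichment describing the rule that matched
                    "rule_id": rule["id"],
                    "rule_name": rule["name"],
                    "detected_at": datetime.now(timezone.utc).isoformat(),
                },
            },
        }
        for hit in hits
    )
    # Signals are stored in Elasticsearch; forwarding to the portal happens elsewhere.
    helpers.bulk(es, signals)
```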

Once the signal is stored in the portal’s database, our SOC analysts can begin their investigations. During their investigation, analysts may pivot back to Elasticsearch to find additional context or to hunt for similar activity from other endpoints.

Why did we solve our analysis problem this way? Primarily, we aligned with our virtuous cycle by implementing a solution that would scale with our near-term goals and provide disproportionate value. At the time, we had fewer than a million agents and operated with a small engineering team. But times have changed…

Growing Pains

[Image: Detection-Engine-02]

We are a much larger company and our install base has more than doubled, so naturally, we experienced some growing pains. First, the Security app is inherently backward-looking: documents had to be indexed before they could be compared against our detection rules. In the best-case scenario, we could generate a signal one minute after a document was sent to Elasticsearch; in the worst case, it could take several hours. It all depended on how often each detection rule was executed. Worse still, the Security app limits rule results to the first 100 matching documents, capping the output of a single run at 100 signals. Some of our lower-fidelity rules always hit this limit, which prevented us from combining those low-fidelity rules to produce a high-fidelity signal.

The next issue we discovered was one of shared responsibility. When Elastic needed to perform maintenance on the Elasticsearch component of its offering, it would remove the affected node from our cluster and provide us with a new one. This wouldn’t be a huge problem for customers with redundancy in their data: the Elasticsearch service would see that a node was removed and use the second copy of any missing data to recreate it. However, we did not keep replica copies of our data, so when we lost a node, we lost the data it contained. This also caused intermittent indexing issues. Correcting the problem meant doubling our node count.

The final and most crucial issue with this setup was the overall cost. Several factors contributed to the service cost. When we doubled the node count to provide redundancy, we also doubled the cost. In addition, Elastic does not allow customers to tailor nodes to their use case. It offers high-level customization based on workload type, which modifies the storage-to-RAM ratio, but it is impossible to, for example, add additional storage to existing nodes.


To make matters worse, customers who deploy in multiple availability zones, as Huntress does, must add nodes to each availability zone proportionally. Finally, Elastic’s hosting provider, AWS, charges them for inter-availability zone communication. Elastic adds a markup to that cost and passes it to the customer. At our data volume, one-third of our monthly cost went to data transfer.

Seeking Other Solutions

Toward the end of 2022, we explored alternative architectures to solve our analysis problem. We explored self-hosting all components of the Elastic Stack. We explored implementing a message queue between the portal and Elasticsearch, allowing us to trickle events during the day. Finally, we explored alternative logging solutions; however, none of these fit well enough to replace what we had.

Ultimately, we landed on writing a custom detection engine to provide signal generation and utilizing OpenSearch hosted by AWS to provide the search component.

[Image: Detection-Engine-03]

This new design modifies the document flow after we store key fields in our relational database. In addition to sending the documents to the new OpenSearch cluster, we send them to an event stream where multiple instances of the detection engine read them. By splitting the document flow and duplicating the documents, we retain the ability for our SOC and threat analysts to search our data in a familiar interface while also improving our present and future detection capabilities. 
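
Conceptually, the split looks something like the sketch below: after key fields are stored, each document is both written to OpenSearch and published to an event stream. Kinesis is used here purely as an illustrative stream; the endpoints, index, and stream names are hypothetical, not Huntress’ actual configuration.

```python
import json

import boto3
from opensearchpy import OpenSearch

# Hypothetical endpoints and names, for illustration only.
opensearch = OpenSearch(hosts=["https://opensearch.example.internal:443"])
kinesis = boto3.client("kinesis")


def fan_out(doc: dict) -> None:
    """Duplicate a document down both paths after key fields are stored."""
    # Path 1: the OpenSearch cluster, so SOC and threat analysts keep a
    # familiar place to search and hunt.
    opensearch.index(index="endpoint-telemetry", body=doc)

    # Path 2: an event stream read by multiple detection engine instances.
    kinesis.put_record(
        StreamName="detection-engine-documents",
        Data=json.dumps(doc).encode("utf-8"),
        PartitionKey=str(doc.get("agent", {}).get("id", "unknown")),
    )
```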

Closing Thoughts

Because we designed our own detection engine, we eliminated all the issues we experienced with the Kibana Security app. We maintained the “at least once” guarantee by implementing a message queuing system between the portal and the detection engine. This also means we process the data as a stream, eliminating both the delay introduced by query scheduling and the 100-result limit.
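
As a rough illustration of that streaming model, the sketch below consumes documents from a queue and evaluates every rule against each document as it arrives, deleting the message only after processing so at-least-once semantics hold. SQS is assumed here for concreteness (the actual queueing technology isn’t specified above), and rule.matches and emit_signal are hypothetical placeholders.

```python
import json

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/detection-engine"  # hypothetical


def emit_signal(rule, doc) -> None:
    """Placeholder: store the signal and forward it to the portal."""


def consume(rules) -> None:
    """Evaluate every rule against each document as it arrives."""
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            doc = json.loads(msg["Body"])

            # Every rule sees every document the moment it arrives: there is
            # no query schedule to wait on and no per-run result cap.
            for rule in rules:
                if rule.matches(doc):  # hypothetical rule interface
                    emit_signal(rule, doc)

            # Deleting the message only after processing preserves the
            # "at least once" guarantee; a crash before this point means the
            # document is redelivered and processed again.
            sqs.delete_message(
                QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"]
            )
```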

The hosted OpenSearch service lets us cut costs through lower data transfer charges and customizable node configurations. Unlike Elastic, AWS does not charge for inter-availability zone transfer when OpenSearch is deployed using its hosted service. It also allows customers to customize the size of the block storage attached to each node, which let us reduce the number of nodes while providing the same retention period.

In all likelihood, this is not the final iteration of our analysis pipeline. As Huntress continues to grow, so will our capabilities to serve our customers and protect the small and medium-sized businesses who need it most.

Want to see the Huntress detection engine in action? Schedule a personalized demo to take a look at Huntress under the hood. 
