Quickly Analyzing Security Datasets with DuckDB
DuckDB has been making waves in data analysis and data engineering circles, circles that security professionals in detection and response should strive to be a part of, but that's a rant for another time. There's no sense in repeating DuckDB's benefits here since they're outlined nicely on the Why DuckDB page. On the main DuckDB page, there is a large quote:
All the benefits of a database, none of the hassle.
There are some use cases where data has not been ingested into a data platform and is only available as a JSON or CSV file. In these cases, I tend to spin up a Jupyter notebook, load the files into a Pandas dataframe, and then perform my analysis. It works great. After reading a bit about DuckDB and its various use cases, I knew it could also fulfill my ad hoc data analysis needs. I was able to get DuckDB installed by running brew install duckdb. Running duckdb initiates the DuckDB shell.
☁ ~ duckdb
v0.9.0 0d84ccf478
Enter ".help" for usage hints.
Connected to a transient in-memory database.
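Note that last line: the shell connects to a transient in-memory database, so any tables you create disappear when you exit. If you want your work to persist, you can point duckdb at a database file instead (the filename here is just an example):
☁ ~ duckdb ec2_proxy.duckdb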
I needed some security-centric logs, which are available from the Security Datasets project. I noticed there were some AWS datasets, which would be perfect to use. I suspected they were CloudTrail logs, probably in one big JSON file. I downloaded the ec2_proxy_s3_exfiltration dataset, and my hunch was right.
Once I unzipped the file, I ran duckdb in the same folder as the extracted JSON file and decided to quickly query it.
D SELECT * FROM 'ec2_proxy_s3_exfiltration_2020-09-14011940.json' LIMIT 10;
┌──────────────────────┬──────────────────────┬───────────┬────────────┬───┬──────────────────────┬───────────────┬──────────────────────┬─────────────────┐
│ requestParameters │ userAgent │ awsRegion │ eventType │ … │ resources │ eventCategory │ additionalEventData │ managementEvent │
│ struct(describeins… │ varchar │ varchar │ varchar │ │ struct(arn varchar… │ varchar │ struct(bytestransf… │ boolean │
├──────────────────────┼──────────────────────┼───────────┼────────────┼───┼──────────────────────┼───────────────┼──────────────────────┼─────────────────┤
│ {'DescribeInstance… │ console.ec2.amazon… │ us-east-1 │ AwsApiCall │ … │ │ │ │ │
│ {'DescribeInstance… │ console.ec2.amazon… │ us-east-1 │ AwsApiCall │ … │ │ │ │ │
│ {'DescribeInstance… │ console.ec2.amazon… │ us-east-1 │ AwsApiCall │ … │ │ │ │ │
│ {'DescribeInstance… │ console.ec2.amazon… │ us-east-1 │ AwsApiCall │ … │ │ │ │ │
│ {'DescribeInstance… │ console.ec2.amazon… │ us-east-1 │ AwsApiCall │ … │ │ │ │ │
│ {'DescribeInstance… │ console.ec2.amazon… │ us-east-1 │ AwsApiCall │ … │ │ │ │ │
│ {'DescribeInstance… │ console.ec2.amazon… │ us-east-1 │ AwsApiCall │ … │ │ │ │ │
│ {'DescribeInstance… │ console.ec2.amazon… │ us-east-1 │ AwsApiCall │ … │ │ │ │ │
│ {'DescribeInstance… │ console.ec2.amazon… │ us-east-1 │ AwsApiCall │ … │ │ │ │ │
│ {'DescribeInstance… │ console.ec2.amazon… │ us-east-1 │ AwsApiCall │ … │ │ │ │ │
├──────────────────────┴──────────────────────┴───────────┴────────────┴───┴──────────────────────┴───────────────┴──────────────────────┴─────────────────┤
│ 10 rows 22 columns (8 shown) │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
This is significant because you can interact with a large JSON file using SQL instead of performing witchcraft with jq. I decided to actually turn the JSON into a DuckDB table, which was very straightforward to do without having to define a schema.
D CREATE TABLE ec2_proxy_tbl AS SELECT * FROM read_json_auto('ec2_proxy_s3_exfiltration_2020-09-14011940.json');
D show tables;
┌───────────────┐
│ name │
│ varchar │
├───────────────┤
│ ec2_proxy_tbl │
└───────────────┘
D describe ec2_proxy_tbl;
┌─────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─────────┬─────────┬─────────┬───────┐
│ column_name │ column_type │ null │ key │ default │ extra │
│ varchar │ varchar │ varchar │ varchar │ varchar │ int32 │
├─────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────┼─────────┼─────────┼───────┤
│ requestParameters │ STRUCT(DescribeInstanceTypesRequest STRUCT(NextToken VARCHAR, MaxResults BIGINT), filterSet STRUCT(items STRUCT(valueSet STRUCT(items STRUCT("value"… │ YES │ │ │ │
│ userAgent │ VARCHAR │ YES │ │ │ │
│ awsRegion │ VARCHAR │ YES │ │ │ │
│ eventType │ VARCHAR │ YES │ │ │ │
│ @version │ BIGINT │ YES │ │ │ │
│ userIdentity │ STRUCT(arn VARCHAR, "type" VARCHAR, userName VARCHAR, sessionContext STRUCT(webIdFederationData JSON, sessionIssuer STRUCT(arn VARCHAR, userName VAR… │ YES │ │ │ │
│ recipientAccountId │ BIGINT │ YES │ │ │ │
│ responseElements │ STRUCT(credentials STRUCT(sessionToken VARCHAR, expiration VARCHAR, accessKeyId VARCHAR)) │ YES │ │ │ │
│ eventName │ VARCHAR │ YES │ │ │ │
│ sourceIPAddress │ VARCHAR │ YES │ │ │ │
│ eventSource │ VARCHAR │ YES │ │ │ │
│ requestID │ VARCHAR │ YES │ │ │ │
│ @timestamp │ VARCHAR │ YES │ │ │ │
│ eventID │ UUID │ YES │ │ │ │
│ eventVersion │ VARCHAR │ YES │ │ │ │
│ apiVersion │ DATE │ YES │ │ │ │
│ readOnly │ BOOLEAN │ YES │ │ │ │
│ sharedEventID │ UUID │ YES │ │ │ │
│ resources │ STRUCT(ARN VARCHAR, accountId BIGINT, "type" VARCHAR, ARNPrefix VARCHAR)[] │ YES │ │ │ │
│ eventCategory │ VARCHAR │ YES │ │ │ │
│ additionalEventData │ STRUCT(bytesTransferredOut DOUBLE, AuthenticationMethod VARCHAR, CipherSuite VARCHAR, SignatureVersion VARCHAR, bytesTransferredIn DOUBLE, "x-amz-id… │ YES │ │ │ │
│ managementEvent │ BOOLEAN │ YES │ │ │ │
├─────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────┴─────────┴─────────┴───────┤
│ 22 rows 6 columns │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
It seems to have done a decent job of inferring the correct data types, and it even dealt with the nested JSON of CloudTrail fields like requestParameters and additionalEventData by mapping them to STRUCT columns.
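One inference is debatable: recipientAccountId came back as a BIGINT, and I would rather treat AWS account IDs as strings. DuckDB's SELECT * REPLACE makes it easy to patch a single column without spelling out all 22; a minimal sketch:
D SELECT * REPLACE (CAST(recipientAccountId AS VARCHAR) AS recipientAccountId)
>   FROM ec2_proxy_tbl
>   LIMIT 5;
The STRUCT columns also mean nested fields are reachable with plain dot notation. As a quick sketch of what that enables (the field names come from the describe output above, and I'm assuming the exfiltration events actually populate bytesTransferredOut):
D SELECT sourceIPAddress,
>        SUM(additionalEventData.bytesTransferredOut) AS bytes_out
>   FROM ec2_proxy_tbl
>   WHERE additionalEventData.bytesTransferredOut IS NOT NULL
>   GROUP BY sourceIPAddress
>   ORDER BY bytes_out DESC;
With the table in place, I performed a quick analysis, just to test it out. First, I wanted a count of all the user agents.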
D SELECT userAgent, COUNT(*)
> FROM ec2_proxy_tbl
> GROUP BY userAgent;
┌──────────────────────────────────────────────────────────────────────────────────────────────┬──────────────┐
│ userAgent │ count_star() │
│ varchar │ int64 │
├──────────────────────────────────────────────────────────────────────────────────────────────┼──────────────┤
│ console.ec2.amazonaws.com │ 85 │
│ EC2ConsoleFrontend, aws-internal/3 aws-sdk-java/1.11.848 Linux/4.9.217-0.1.ac.205.84.332.m… │ 2 │
│ [aws-cli/1.18.136 Python/3.8.5 Darwin/19.5.0 botocore/1.17.59] │ 11 │
│ ec2.amazonaws.com │ 5 │
└──────────────────────────────────────────────────────────────────────────────────────────────┴──────────────┘
Next, I wanted to see the actions performed via the AWS CLI.
D SELECT eventName, sourceIPAddress FROM ec2_proxy_tbl WHERE userAgent='[aws-cli/1.18.136 Python/3.8.5 Darwin/19.5.0 botocore/1.17.59]' ;
┌─────────────┬─────────────────┐
│ eventName │ sourceIPAddress │
│ varchar │ varchar │
├─────────────┼─────────────────┤
│ ListObjects │ 1.2.3.4 │
│ ListObjects │ 1.2.3.4 │
│ ListObjects │ 1.2.3.4 │
│ GetObject │ 1.2.3.4 │
│ ListObjects │ 1.2.3.4 │
│ ListBuckets │ 1.2.3.4 │
│ ListBuckets │ 1.2.3.4 │
│ ListObjects │ 1.2.3.4 │
│ ListObjects │ 1.2.3.4 │
│ ListObjects │ 1.2.3.4 │
│ GetObject │ 1.2.3.4 │
├─────────────┴─────────────────┤
│ 11 rows 2 columns │
└───────────────────────────────┘
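If I needed to hand those results to someone else, DuckDB's COPY can write a query's output straight to a file. A quick sketch (the output filename is my own choice):
D COPY (
>     SELECT eventName, sourceIPAddress
>     FROM ec2_proxy_tbl
>     WHERE userAgent = '[aws-cli/1.18.136 Python/3.8.5 Darwin/19.5.0 botocore/1.17.59]'
> ) TO 'aws_cli_actions.csv' (HEADER);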
The point wasn't to investigate the dataset; I just wanted to play with DuckDB using data I was familiar with. I will definitely be using DuckDB more for these edge cases where I need to investigate datasets that aren't in a data platform. I would encourage those trying to break into detection and response to take advantage of DuckDB and really learn how to analyze data with SQL.