DeepCASE Dataset

This research uses two datasets for its evaluation:

Lastline dataset.
HDFS dataset.

Lastline dataset

The real-world Lastline dataset was collected from 20 international organizations that use 395 detectors to monitor 388K devices*. Over a 5-month period, this produced 10.5M security events of 291 unique types. Events include policy violations (e.g., use of deprecated Samba versions, remote desktop protocols, and the Tor browser), signature hits (e.g., Mirai, Ursnif, and Zeus), and heuristics on suspicious and malicious activity (e.g., beaconing activity, SQL injection, Shellshock exploit attempts, and various CVEs).

Of the 10.5M security events, a triaging system selected 2.7M events that were likely to be part of an attack. Of these 2.7M likely malicious events, 45.1K were confirmed by security operators to be part of an attack and labeled as ATTACKS. These attacks include known malware, such as the XMRig crypto miner, and remote access Trojans, such as NanoCore. Another 46.4K events were classified as a HIGH security risk (e.g., successful web attacks and exploitation of known vulnerabilities such as CVE-2019-19781), 184.9K events as a MEDIUM risk (e.g., attempted binary downloads or less exploited vulnerabilities such as CVE-2020-0601), and 2.4M events as a LOW risk (e.g., the use of BitTorrent or gaming clients). The remaining 7.8M events were not related to security risks but were collected to give security operators additional information about device activity, and are therefore labeled as INFO.

*These include devices in a bring-your-own-device setting, which were monitored for only a small part of the 5 months. The average of 10.5M/388K ≈ 27.06 events per device over the whole collection period is therefore significantly lower than the earlier reported 170 events per device per day.
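The figures above can be cross-checked with a few lines of arithmetic. The sketch below is only a sanity check on the rounded numbers quoted in this README. The variable names are illustrative and the counts are the rounded values from the text, not exact dataset statistics.

```python
# Rounded counts as quoted in the description above (not exact dataset values).
total_events = 10_500_000
devices = 388_000

by_label = {
    "ATTACKS": 45_100,
    "HIGH": 46_400,
    "MEDIUM": 184_900,
    "LOW": 2_400_000,
    "INFO": 7_800_000,
}

# Events selected by the triaging system: every label except INFO.
triaged = sum(count for label, count in by_label.items() if label != "INFO")

# Sum over all labels, to compare against the reported total.
labeled_total = sum(by_label.values())

# Average over the whole collection period, not per day.
events_per_device = total_events / devices

print(f"triaged events:    {triaged / 1e6:.2f}M")
print(f"all labels:        {labeled_total / 1e6:.2f}M")
print(f"events per device: {events_per_device:.2f}")
```

The per-label counts add up to roughly 2.68M triaged events and 10.48M events overall, which is consistent with the reported 2.7M and 10.5M once rounding is taken into account.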

Download

NOTE

The Lastline dataset was obtained under an NDA and therefore, unfortunately, we cannot share the dataset.

HDFS dataset

We also evaluate DeepCASE on the HDFS dataset [1], which was used in the evaluation of the related security log analysis tool DeepLog [2]. This dataset consists of 11.2M system log entries generated by over 200 Amazon EC2 nodes. Experts labeled the events as normal or anomalous, with 2.9% of events labeled as anomalous. Unfortunately, this dataset lacks metadata about the risk level of security events, so on this dataset DeepCASE is evaluated in terms of workload reduction but not in terms of accuracy. Although it contains less information, we use the HDFS dataset to provide a reproducible comparison with state-of-the-art systems.

Download

The HDFS dataset, as used in our research, can be downloaded from https://github.com/wuyifan18/DeepLog/tree/master/data. Local copies are also available in the HDFS directory of this repo.
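As a starting point for working with the downloaded files, the following is a minimal loading sketch. It assumes that each file in the HDFS directory (hdfs_train, hdfs_test_normal, hdfs_test_abnormal) stores one session per line as whitespace-separated integer event IDs. Verify this against the files before relying on it. The helper names `parse_sessions` and `load_sessions` are hypothetical and are not part of DeepCASE or DeepLog.

```python
from pathlib import Path
from typing import Iterable, List


def parse_sessions(lines: Iterable[str]) -> List[List[int]]:
    """Parse sessions, assuming one session of integer event IDs per line."""
    sessions = []
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            sessions.append([int(field) for field in fields])
    return sessions


def load_sessions(path: Path) -> List[List[int]]:
    """Read a session file from a caller-supplied path."""
    with open(path, encoding="utf-8") as handle:
        return parse_sessions(handle)


# Small in-memory demonstration with made-up event IDs.
sample = ["5 5 22 11 9", "", "5 22 5 11 9 26"]
print(parse_sessions(sample))
```

With a local checkout, a call such as `load_sessions(Path("HDFS/hdfs_train"))` would return the training sessions, assuming the file layout described above.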

References

[1] Xu, W., Huang, L., Fox, A., Patterson, D., & Jordan, M. I. (2009). Detecting large-scale system problems by mining console logs. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (SOSP) (pp. 117–132).

[2] Du, M., Li, F., Zheng, G., & Srikumar, V. (2017). DeepLog: Anomaly detection and diagnosis from system logs through deep learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security (CCS) (pp. 1285–1298).
