Sophos, ReversingLabs Release 20 Million Sample Dataset for Malware Research

Sophos and ReversingLabs on Monday announced SoReL-20M, a database of 20 million Windows Portable Executable files, including 10 million malware samples.


Aimed at driving security improvements across the industry, the database provides metadata, labels, and features for the files within, and enables interested parties to download the available malware samples for further research.


Containing a curated and labeled set of samples and relevant metadata, the publicly-accessible dataset is expected to help accelerate machine learning research for malware detection.


Although machine learning models are built on data, the field of security lacks a standard, large-scale dataset that all types of users (ranging from independent researchers to laboratories and corporations) can easily access, which has so far slowed down advancement, Sophos argues.


“Obtaining a large number of curated, labeled samples is both expensive and challenging, and sharing data sets is often difficult due to issues around intellectual property and the risk of providing malicious software to unknown third parties. As a consequence, most published papers on malware detection work on private, internal datasets, with results that cannot be directly compared to each other,” the company says.


A production-scale dataset covering 20 million samples, including 10 million disarmed pieces of malware, the SoReL-20M dataset aims to address the problem.


For each sample, the dataset includes features that have been extracted based on the EMBER 2.0 dataset, labels, detection metadata, and complete binaries for the included malware samples.


Additionally, PyTorch and LightGBM models that have already been trained on this data as baselines are provided, along with scripts needed to load and iterate over the data, as well as to load, train, and test the models.


Given that the malw ..

Support the originator by clicking the read the rest link below.