This directory contains the original Malheur dataset. The dataset was lost around 2012 and was re-assembled with the help of Juan Tapiador in 2016. It contains the recorded behavior of malicious software (malware) and has been used for developing methods for classifying and clustering malware behavior (see the JCS article from 2011).
The dataset consists of a reference dataset and multiple application datasets. Each of these sets contains behavior reports of malware programs in three different formats. These formats are the original CWSandbox XML format, a sequential version of the CWSandbox format and the so-called MIST format -- a weird representation of behavior that was considered cool in the late 2000s.
Detailed descriptions of the dataset and the MIST format are available in the following two papers:
Automatic Analysis of Malware Behavior using Machine Learning. Konrad Rieck, Philipp Trinius, Carsten Willems, and Thorsten Holz. Journal of Computer Security (JCS), 19(4):639-668, 2011.
A Malware Instruction Set for Behavior-Based Analysis. Philipp Trinius, Carsten Willems, Thorsten Holz, and Konrad Rieck. TR-2009-07, University of Mannheim, 2009.
Additionally, each dataset contains a list of MD5 hashes that can be used to retrieve the original malware programs from a malware archive, such as VirusTotal. Moreover, for each application dataset there also is a file containing clustering results computed using the tool Malheur (http://www.mlsec.org/malheur).
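As a rough illustration of working with the hash lists, the following Python sketch parses such a list and separates well-formed MD5 hashes from everything else. The file layout is an assumption (one hash per line, optionally followed by further whitespace-separated fields), and the function name `parse_md5_list` is hypothetical; adjust both to the actual files in this directory.

```python
import re

# An MD5 digest is written as exactly 32 hexadecimal characters.
MD5_RE = re.compile(r"^[0-9a-f]{32}$")


def parse_md5_list(text):
    """Split the text of a hash list into valid hashes and rejected lines.

    Assumption: the hash is the first whitespace-separated field of each
    non-empty line. Duplicates are dropped, order is preserved.
    """
    hashes = []
    rejected = []
    seen = set()
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        token = fields[0].lower()
        if not MD5_RE.match(token):
            rejected.append(line.strip())
            continue
        if token not in seen:
            seen.add(token)
            hashes.append(token)
    return hashes, rejected
```

The returned hashes can then be passed to whatever lookup interface your malware archive provides; that step depends on the archive and is not sketched here.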
The reference dataset contains behavior reports of malware samples that were collected between 2006 and 2009 from a variety of sources. All samples have been assigned to a known malware family using majority voting of six anti-virus products. Although anti-virus labels suffer from inconsistency, we expect the selection using different scanners to be reasonably consistent. To compensate for the skewed distribution of samples, families with fewer than 20 samples have been discarded and the maximum contribution of each family has been restricted to 300 samples. Each of the samples has been executed and monitored using the analysis environment CWSandbox, resulting in a total of 3,131 behavior reports.
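The selection procedure described above can be sketched in Python as follows. This is an illustration and not the code used to build the dataset. It assumes that "majority" means more than half of the scanners agree, and that families above the cap are reduced by random sampling; neither detail is stated here. The names `majority_label` and `select_reference_set` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def majority_label(labels):
    """Return the label that more than half of the scanners agree on.

    `labels` holds one family name per scanner; None means no detection.
    Returns None if no label reaches a strict majority (an assumption).
    """
    counts = Counter(label for label in labels if label)
    if not counts:
        return None
    label, votes = counts.most_common(1)[0]
    return label if votes > len(labels) / 2 else None


def select_reference_set(scans, min_size=20, max_size=300, seed=0):
    """Group samples by majority label and balance the family sizes.

    `scans` maps a sample id to its list of scanner labels. Families with
    fewer than `min_size` samples are dropped; larger ones are capped at
    `max_size` by random sampling (assumed, see above).
    """
    families = defaultdict(list)
    for sample_id, labels in scans.items():
        family = majority_label(labels)
        if family is not None:
            families[family].append(sample_id)

    rng = random.Random(seed)
    selected = {}
    for family, members in families.items():
        if len(members) < min_size:
            continue  # too small to be representative
        members = sorted(members)
        if len(members) > max_size:
            members = rng.sample(members, max_size)
        selected[family] = members
    return selected
```

Samples without an agreed label are left out entirely in this sketch, which is one possible way to handle inconsistent anti-virus labels.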
The application data consists of seven chunks of behavior reports from malware binaries obtained from the anti-malware vendor Sunbelt Software. The binaries correspond to malware collected during seven consecutive days in August 2009 and originate from a variety of sources. Sunbelt Software used these very samples to create and update signatures for their VIPRE anti-malware product as well as for their security data feed ThreatTrack. The complete application dataset consists of 33,698 behavior reports. The samples have been loosely assigned to families using the anti-virus scanner from Kaspersky eight weeks after their submission to Sunbelt Software.
- Application Dataset 20090801
- Application Dataset 20090802
- Application Dataset 20090803
- Application Dataset 20090804
- Application Dataset 20090805
- Application Dataset 20090806
- Application Dataset 20090807
Where is the rest of the dataset? The original dataset was published a long time ago, in 2011, along with the JCS article. The files were hosted at the University of Mannheim, Germany. With the move of the local security research group to the University of Erlangen, however, the server storing the data went offline and now likely serves a different purpose. This directory contains all of the data that has been saved or restored.
Where are the benign behavior reports? The Malheur dataset was never designed for detecting malware. The main purpose of the dataset is to develop methods for clustering malware according to its behavior. As a consequence, benign behavior was never recorded in our experiments and should not be mixed with this data. Note that benign programs usually require interaction with a user and thus are not suitable for monitoring in an automated sandbox environment.
Can I use the dataset in my research? Yes. However, keep in mind that this is a very old dataset. It is stored in a weird format and contains the program behavior of malware families that died several years ago. You can do some research with this data, but it is highly recommended that you look for something better.
Konrad Rieck (firstname.lastname@example.org)