
4.1.2 Available datasets

In this section we briefly describe some of the datasets available for testing IDSs.


4.1.2.1 KDD cup 1999 dataset

The KDD Cup 1999 dataset is derived from the data of the 1998 DARPA Intrusion Detection Evaluation Program, which was prepared and managed by MIT Lincoln Labs (Nadiammai et al. [34]). The objective of the program was to survey and evaluate research in intrusion detection. It provided a standard set of data that includes a wide variety of intrusions simulated in a military network environment.

The DARPA 1998 dataset includes training data with seven weeks of network traffic and two weeks of testing data providing two million connection records.

A connection is a sequence of TCP packets starting and ending at well-defined times, exchanged between a source IP address and a target IP address under a well-defined protocol. Each connection record is labeled either as normal or as an attack, with exactly one specific attack type. The training dataset is divided into five subsets: Denial of service attacks, Remote to Local attacks, User to Root attacks, Probe attacks, and normal data. The attack classes are as follows:

• DOS (Denial of service attack) Denial of service (DOS) is a class of attacks where an attacker makes a computing or memory resource too busy or too full to handle legitimate requests, thus denying legitimate users access to a machine.

• R2L (Remote to local (user) attack) A remote to local (R2L) attack is a class of attacks where an attacker sends packets to a machine over the network, then exploits a vulnerability of that machine to illegally gain local access to it.

• U2R (User to root attack) User to root (U2R) is a class of attacks where an attacker starts with access to a normal user account on the system and exploits a vulnerability to gain root access to the system.

• Probing (Surveillance and other probing) Probing is a class of attacks where an attacker scans a network to gather information or find known vulnerabilities. An attacker with a map of the machines and services available on a network can use this information to look for exploits.
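
The grouping of individual attack labels into these four classes can be sketched in a few lines of Python. This is a minimal illustration, not part of the dataset itself: the label names follow the mapping commonly published for the KDD Cup 1999 training data, the names `ATTACK_CLASS`, `LABEL_TO_CLASS` and `classify` are hypothetical, and labels that only occur in the test data are deliberately left as "unknown".

```python
# Assumption: mapping as commonly published for the KDD Cup 1999 training labels.
ATTACK_CLASS = {
    "DOS": ["back", "land", "neptune", "pod", "smurf", "teardrop"],
    "R2L": ["ftp_write", "guess_passwd", "imap", "multihop", "phf", "spy",
            "warezclient", "warezmaster"],
    "U2R": ["buffer_overflow", "loadmodule", "perl", "rootkit"],
    "Probing": ["ipsweep", "nmap", "portsweep", "satan"],
}

# Invert the table so that a single label can be looked up directly.
LABEL_TO_CLASS = {
    label: cls for cls, labels in ATTACK_CLASS.items() for label in labels
}


def classify(label: str) -> str:
    """Return 'normal', one of the four attack classes, or 'unknown'."""
    # Labels in the raw KDD files carry a trailing dot, e.g. "smurf."
    name = label.strip().rstrip(".").lower()
    if name == "normal":
        return "normal"
    return LABEL_TO_CLASS.get(name, "unknown")
```

For example, `classify("smurf.")` yields the DOS class, while a label that is not in the table is reported as "unknown" rather than being silently assigned to a class.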

4.1.2.2 NSL-KDD dataset

NSL-KDD is a data set suggested to solve some of the inherent problems of the KDD'99 dataset ⁵. Furthermore, the number of records in the NSL-KDD train and test sets is reasonable. This advantage makes it affordable to run the experiments on the complete set without the need to randomly select a small portion. Consequently, evaluation results of different research works will be consistent and comparable.

5 http://nsl.cs.unb.ca/NSL-KDD/

The NSL-KDD dataset has the following advantages over the original KDD data set:

• It does not include redundant records in the train set, so the classifiers will not be biased towards more frequent records.

• There are no duplicate records in the proposed test sets; therefore, the measured performance of the learners is not biased in favor of methods that have better detection rates on frequent records.

• The number of selected records from each difficulty level group is inversely proportional to the percentage of records in the original KDD data set. As a result, the classification rates of distinct machine learning methods vary in a wider range, which makes it more efficient to have an accurate evaluation of different learning techniques.
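
The two ideas behind these advantages, removing duplicate records and selecting records in inverse proportion to the share of their group, can be sketched as follows. This is only an illustration under stated assumptions and not the exact procedure used to build NSL-KDD: the function names `deduplicate` and `inverse_share_quotas`, the group names and the group sizes are all hypothetical.

```python
def deduplicate(records):
    """Drop exact duplicate records, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(rec)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def inverse_share_quotas(group_sizes, total_to_select):
    """Split total_to_select across groups in inverse proportion to group share.

    group_sizes maps a (hypothetical) difficulty group to its record count.
    A group holding a small share of the records receives a large quota,
    capped at the number of records the group actually contains.
    """
    total = sum(group_sizes.values())
    inverse = {g: total / n for g, n in group_sizes.items() if n > 0}
    norm = sum(inverse.values())
    return {
        g: min(group_sizes[g], round(total_to_select * w / norm))
        for g, w in inverse.items()
    }


# Made-up example: 900 records that most learners classify correctly ("easy")
# and 100 records that few learners classify correctly ("hard").
quotas = inverse_share_quotas({"easy": 900, "hard": 100}, total_to_select=100)
```

In this made-up example the rare "hard" group receives the larger quota, so a learner can no longer reach a high score merely by handling the most frequent records well.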

4.1.2.3 UNB ISCX Intrusion Detection evaluation dataset

In network intrusion detection systems (IDS), anomaly-based approaches in particular are difficult to evaluate, compare, and deploy accurately, which originates from the scarcity of adequate datasets. Many such datasets are internal and cannot be shared due to privacy issues, others are heavily anonymized and do not reflect current trends, or they lack certain statistical characteristics. These deficiencies are the primary reasons why a perfect dataset does not yet exist. Thus, researchers must resort to the datasets they can obtain, which are often suboptimal. As network behaviors and patterns change and intrusions evolve, it has become necessary to move away from static, one-time datasets toward more dynamically generated datasets which not only reflect the current traffic compositions and intrusions, but are also modifiable, extensible, and reproducible.

At ISCX ⁶, a systematic approach to generating the required datasets is introduced to address this need. The underlying notion is based on the concept of profiles, which contain detailed descriptions of intrusions and abstract distribution models for applications, protocols, or lower-level network entities. Real traces are analyzed to create profiles for agents that generate real traffic for HTTP, SMTP, SSH, IMAP, POP3, and FTP. In this regard, a set of guidelines is established to outline valid datasets, which set the basis for generating profiles. These guidelines are vital for the effectiveness of the dataset in terms of realism, evaluation capabilities, total capture, completeness, and malicious activity. The profiles are then employed in an experiment to generate the desired dataset in a testbed environment. Various multi-stage attack scenarios were subsequently carried out to supply the anomalous portion of the dataset.

6 Information Security Centre of Excellence: http://www.iscx.ca/

The intent of this dataset is to assist researchers in acquiring datasets of this kind for testing, evaluation, and comparison purposes, by sharing the generated datasets and profiles.

To simulate user behavior, the behaviors of the Center's users were abstracted into profiles. Agents were then programmed to execute them, effectively mimicking user activity. Attack scenarios were then designed and executed to express real-world cases of malicious behavior. They were applied in real time from physical devices via human assistance, thereby avoiding any unintended characteristics of post-merging network attacks with real-time background traffic. The resulting arrangement has the obvious benefit of allowing the network traces to be labeled. This is believed to simplify the evaluation of intrusion detection systems and provide more realistic and comprehensive benchmarks⁷.
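
To illustrate why labeled traces simplify evaluation, the following minimal sketch compares the alerts of an IDS against ground-truth labels and computes a detection rate and a false alarm rate. It assumes one boolean label and one boolean alert per flow; the function name `evaluate_alerts` and the sample data are hypothetical and do not come from the ISCX dataset.

```python
def evaluate_alerts(labels, alerts):
    """Compare IDS alerts against ground-truth labels, flow by flow.

    labels: sequence of booleans, True if the flow is labeled as an attack.
    alerts: sequence of booleans, True if the IDS raised an alert for the flow.
    """
    if len(labels) != len(alerts):
        raise ValueError("labels and alerts must have the same length")

    pairs = list(zip(labels, alerts))
    tp = sum(1 for y, a in pairs if y and a)          # attacks detected
    fn = sum(1 for y, a in pairs if y and not a)      # attacks missed
    fp = sum(1 for y, a in pairs if not y and a)      # false alarms
    tn = sum(1 for y, a in pairs if not y and not a)  # normal flows, no alert

    return {
        "detection_rate": tp / (tp + fn) if (tp + fn) else 0.0,
        "false_alarm_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }


# Made-up example: two attack flows and three normal flows.
result = evaluate_alerts(
    labels=[True, True, False, False, False],
    alerts=[True, False, True, False, False],
)
```

Without labels, neither rate can be computed directly, which is why unlabeled captures are much harder to use as benchmarks.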
