We plan to collect our own data on a local network. One server will act as the scanned system while also providing regular services and generating normal traffic by communicating with a number of clients. A separate system on a different local network will scan the primary system using multiple port-scanning tools. We will store the regular traffic and the scan traffic in pcap files. The regular traffic will be captured at either moderate or heavy volume, and each scanning tool will produce multiple scan-traffic pcap files. Every scan-traffic file will be merged into every regular-traffic file so that the scan appears to have occurred at some point during the regular traffic (a sketch of this merging step follows below). This means that each scanning tool produces some number of scans (S), each of which is matched with each regular-traffic capture (R). Assuming each tool supports both vertical and horizontal scans and both passive and aggressive scans, there are four scan modes per tool (4S). Matching each scan against no traffic, moderate traffic, and heavy traffic (3R) yields 12SR total scans to train on. We will set R relatively high (R > 1000) to avoid overfitting. Our training and testing sets for the neural net will contain an equal percentage of each scan type from each scanning tool matched with each type of background traffic.
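As a rough illustration of the merging step, the following sketch shifts a scan capture to a random point inside a background capture's time window and writes the interleaved result. It assumes the scapy library is available; the file names and the choice of a uniformly random insertion point are placeholders, not a fixed part of the design.

```python
# Minimal sketch of merging one scan pcap into one background pcap,
# assuming scapy is installed. File names below are hypothetical.
import random
from scapy.all import rdpcap, wrpcap

def merge_scan_into_background(scan_pcap, background_pcap, out_pcap, seed=None):
    rng = random.Random(seed)
    scan = rdpcap(scan_pcap)
    background = rdpcap(background_pcap)

    bg_start, bg_end = float(background[0].time), float(background[-1].time)
    scan_start, scan_end = float(scan[0].time), float(scan[-1].time)
    scan_len = scan_end - scan_start

    # Pick an insertion point so the whole scan falls inside the
    # background capture's time window, then shift the scan timestamps.
    insert_at = rng.uniform(bg_start, max(bg_start, bg_end - scan_len))
    shift = insert_at - scan_start
    for pkt in scan:
        pkt.time = float(pkt.time) + shift

    # Interleave the two captures by timestamp and write the result.
    merged = sorted(list(background) + list(scan), key=lambda p: float(p.time))
    wrpcap(out_pcap, merged)

# Example (hypothetical file names):
# merge_scan_into_background("nmap_vertical_aggressive.pcap",
#                            "background_moderate.pcap",
#                            "merged_0001.pcap", seed=42)
```

Repeating this for every (scan file, background file) pairing produces the 12SR merged captures described above; the no-traffic case simply uses the scan pcap on its own.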
We will measure effectiveness by examining the false-positive and false-negative rates and comparing them against existing scan-detection tools. Additionally, we will compare how quickly the scan is detected and at what traffic volume our software begins to slow down.
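To make these measurements concrete, the sketch below computes the error rates and detection latency from labeled output. It assumes the detector emits a binary prediction per evaluation window (1 = scan present, 0 = benign) and a timestamp for each alert; the function names and data layout are illustrative only.

```python
# Sketch of the evaluation metrics, assuming per-window binary labels and
# timestamped alerts. Names and data layout are hypothetical.
from typing import Sequence, Tuple

def error_rates(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (false_positive_rate, false_negative_rate) for binary labels
    where 1 = scan present and 0 = benign traffic."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    negatives = sum(1 for t in y_true if t == 0)
    positives = sum(1 for t in y_true if t == 1)
    fpr = fp / negatives if negatives else 0.0
    fnr = fn / positives if positives else 0.0
    return fpr, fnr

def detection_latency(scan_start_time: float, alert_times: Sequence[float]) -> float:
    """Seconds from the first scan packet to the first alert raised at or
    after it; float('inf') if the scan was never detected."""
    later = [t for t in alert_times if t >= scan_start_time]
    return min(later) - scan_start_time if later else float("inf")
```

The throughput question (at what traffic volume the detector begins to slow down) can be answered by replaying the same merged captures at increasing rates and recording when detection latency starts to grow.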