How to train a classification algorithm with multiple samples that represent the class?

Question

I hope I am explaining this the right way; apologies if any of it is unclear.

I am working with network data and want to use a supervised approach to identify whether a sample (packet) is malicious or not, so a binary classification.

As I picture it, I have a number of rows (samples), each representing a packet, with features that could be used to identify whether that sample is malicious. A simple case would be a ping of death attack, where the packet/payload size is above what would be normal for a ping. That case I can see: you would mark the pings of death with a 1 and normal pings with a 0.

My issue comes when multiple samples need to be combined in order to identify an attack. For instance, with some attacks a solitary sample will not signal an attack, but a particular pattern or frequency of samples would. Does anyone know a way of preparing the data, or a supervised model, that can handle this?

Please add a few sample records. — 10xAI, Aug 04 '21 at 13:38

score 0 · Answer 1 · answered Aug 04 '21 at 11:35

In traditional supervised binary classification, one way to handle this is with aggregate features. For example, a feature could be "number of packets from IP address x in the past n milliseconds/seconds/minutes/hours" (or whatever your relevant time frames are). Others could be "number of ping/connect/? packets from IP address x in the past time frame(s)" or "total size in bytes from IP address x in the past time frame". You could also aggregate by subnet instead of IP address, or not even by subnet if there is a wider/bot attack.
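As a rough illustration of the aggregate features above, here is a minimal offline sketch in plain Python. The field names (`ts`, `src_ip`, `size`), the function name `add_window_features`, and the one-second window are assumptions made for the example, not anything fixed by the problem; the right windows and scopes depend on the attack you are modelling.

```python
from collections import defaultdict, deque

def add_window_features(packets, window_seconds=1.0):
    """Add per-source-IP sliding-window aggregates to each packet.

    `packets` must be sorted by 'ts' (seconds). The window covers the
    current packet plus earlier packets from the same source IP whose
    timestamp is newer than ts - window_seconds.
    """
    recent = defaultdict(deque)         # src_ip -> (ts, size) pairs in the window
    bytes_in_window = defaultdict(int)  # src_ip -> running byte total
    rows = []
    for pkt in packets:
        src, ts, size = pkt["src_ip"], pkt["ts"], pkt["size"]
        window = recent[src]
        # Drop packets that have fallen out of the time window.
        while window and window[0][0] <= ts - window_seconds:
            _, old_size = window.popleft()
            bytes_in_window[src] -= old_size
        window.append((ts, size))
        bytes_in_window[src] += size
        rows.append({
            **pkt,
            "pkt_count_window": len(window),
            "bytes_window": bytes_in_window[src],
        })
    return rows

# Hypothetical sample records, for illustration only.
packets = [
    {"ts": 0.0, "src_ip": "10.0.0.1", "size": 64},
    {"ts": 0.2, "src_ip": "10.0.0.1", "size": 64},
    {"ts": 0.3, "src_ip": "10.0.0.2", "size": 128},
    {"ts": 0.9, "src_ip": "10.0.0.1", "size": 64},
    {"ts": 2.5, "src_ip": "10.0.0.1", "size": 64},
]
features = add_window_features(packets, window_seconds=1.0)
for row in features:
    print(row)
```

Each row keeps its original fields and gains two columns a classifier can train on alongside the existing label. The same loop could be keyed by subnet instead of `src_ip`.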

This is easy enough to do in training. During scoring, load the initial features into a cache, then keep that cache updated and pull the features from it.

I have built many models like this: offline feature generation, refreshed every night/week/*, is loaded into a run-time cache, and a few hot features are constantly updated in the cache by the scoring process.

The subject matter experts need to work out which aggregate features are best: the timeframes, the relevant information for each timeframe, and the relevant scope of each feature.

Even with very fast cache technology, you will need to write highly tuned code and use a very fast model to keep up with packets in real time. Figure out how much time the model has between seeing a packet and acting on it, then prototype to determine whether the software and model approach is feasible.
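To make the cache idea concrete, here is a toy in-process sketch. The class name `FeatureCache`, the offline feature `avg_daily_packets`, and the values used are all hypothetical; a production system would more likely use a dedicated cache service and heavily tuned code.

```python
from collections import deque

class FeatureCache:
    """Toy run-time cache: static offline features plus one hot feature."""

    def __init__(self, offline_features, window_seconds=1.0):
        # src_ip -> features precomputed offline (e.g. refreshed nightly).
        self.offline = dict(offline_features)
        self.window_seconds = window_seconds
        # src_ip -> timestamps of recent packets (the "hot" state).
        self.hot = {}

    def update_and_get(self, src_ip, ts):
        """Record a packet seen at time `ts` and return its feature dict."""
        window = self.hot.setdefault(src_ip, deque())
        # Evict timestamps that are older than the window.
        while window and window[0] <= ts - self.window_seconds:
            window.popleft()
        window.append(ts)
        # Unknown sources fall back to a default offline value (an assumption).
        base = self.offline.get(src_ip, {"avg_daily_packets": 0.0})
        return {**base, "pkt_count_window": len(window)}

# Hypothetical offline features loaded at start-up.
cache = FeatureCache({"10.0.0.1": {"avg_daily_packets": 1200.0}})
f1 = cache.update_and_get("10.0.0.1", ts=0.0)
f2 = cache.update_and_get("10.0.0.1", ts=0.5)
f3 = cache.update_and_get("10.0.0.9", ts=0.6)
print(f1, f2, f3, sep="\n")
```

The returned dict is what would be handed to the model at scoring time, so the hot feature must be computed the same way here as it was during training.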
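A feasibility prototype can start as small as timing the scoring path against a per-packet budget. In the sketch below, `dummy_score` is a stand-in for the real feature lookup plus model inference, and `BUDGET_SECONDS` is a placeholder value, not a recommendation; both are assumptions for illustration.

```python
import time

def mean_latency_seconds(score_fn, packets, repeats=5):
    """Average wall-clock time per call of score_fn over the packets."""
    start = time.perf_counter()
    for _ in range(repeats):
        for pkt in packets:
            score_fn(pkt)
    elapsed = time.perf_counter() - start
    return elapsed / (repeats * len(packets))

def dummy_score(pkt):
    # Stand-in for feature lookup plus model inference (hypothetical rule).
    return 1 if pkt["size"] > 1000 else 0

# Synthetic packets, for timing only.
sample = [{"size": 64 + (i % 50) * 40} for i in range(10_000)]
latency = mean_latency_seconds(dummy_score, sample)

BUDGET_SECONDS = 1e-4  # placeholder budget; derive the real one from your traffic
print(f"mean latency: {latency * 1e6:.2f} us per packet")
print(f"within budget: {latency < BUDGET_SECONDS}")
```

Swapping `dummy_score` for the real pipeline gives a first estimate of whether the approach can keep up, before investing in tuning.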

Very interesting. I had avoided thinking along this route, as it felt like I would be partially coding the behaviour of an IDS, if that makes sense. Could I ask you another question, please?
I guess you are loading the data into something like pandas, performing a calculation over a certain time period, adding the result as a new column, and then marking that row as malicious or not depending on whether it crosses a particular threshold? — Colin Crook, Aug 05 '21 at 15:28
This question sounds like you are asking how to label a packet as malicious. The label is the ground truth. First define what a malicious packet is, then label the data. There may be industry definitions or rules already; it is not my field. How you do that labeling depends on the business problem. In some areas, labeling is hard, manual, and time-consuming. What I was describing was generating features for training a model on data that is already labeled. — Craig, Aug 06 '21 at 13:03
