2

I am looking for a dataset of log files with labeled cybersecurity issues. I am building a cybersecurity log analysis model, so I have no preference for the type of log, but the data must contain known cybersecurity issues.

So far, I have only found log datasets (HDFS, BGL) whose anomalies are execution flow errors rather than cybersecurity issues. I have also found plenty of network data, such as at https://vizsec.org/data/, but it contains network traffic rather than logs. Finally, I have found log datasets that do contain cybersecurity issues, but too few of them to train a model on.

It would also be helpful to know how such a dataset could be generated in large quantities.

jsbc

2 Answers

0

With the small amount of data you have found, either augment it or apply cross-validation to it.

Otherwise, look for the data you need at https://datasetsearch.research.google.com/
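The cross-validation suggestion above could look roughly like the following. This is only a minimal sketch using the standard library: the function name `stratified_k_fold` and the toy labels are hypothetical, and it assumes each log window already carries a binary label. Stratifying keeps the rare security-incident class present in every fold, which matters when there are only a handful of positive examples.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs with each class spread evenly over the folds."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so every fold gets a share of each class.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train_idx, test_idx


# Hypothetical toy labels: 1 = security incident, 0 = benign log window.
labels = [0] * 20 + [1] * 5
splits = list(stratified_k_fold(labels, k=5))
```

Each of the five splits here holds out four benign windows and one incident window, so every evaluation round sees at least one positive example.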

Durga K
0

See if this helps: Publicly Available Datasets

You can also use the SMOTE technique if you have insufficient data.
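To make the SMOTE idea concrete, here is a minimal standard-library sketch of its core step: pick a minority sample, pick one of its nearest minority neighbours, and interpolate between the two. The function name `smote`, its parameters, and the feature vectors are all hypothetical, and it assumes the logs have already been turned into numeric features (for example, counts per time window). For real use, the `SMOTE` class in the imbalanced-learn package is a more complete implementation.

```python
import math
import random


def smote(minority, n_synthetic, k=3, seed=0):
    """Create n_synthetic points by interpolating between minority samples and their neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)

    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical feature vectors, e.g. [failed logins, distinct ports] per time window.
attack_windows = [[12.0, 3.0], [15.0, 4.0], [11.0, 2.0], [14.0, 5.0]]
new_points = smote(attack_windows, n_synthetic=6)
```

Note that the interpolation only makes sense in a numeric feature space. Applied to raw log text it would not produce valid log lines, so whether the synthetic points still represent real attacks depends on how the features were built.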

Madhur Yadav
  • Thank you for the answer. The problem I see with synthetic generation techniques in this case is that log data is not robust to noise: changing even a single character, or slightly changing the order of the log entries, could itself constitute a security issue. – jsbc Sep 15 '20 at 20:53