5

When I look at the available datasets on https://www.openml.org, I often see a BNG dataset with no further information about it.

Can someone explain what BNG means in this context?

I am especially interested in this dataset: https://www.openml.org/d/1389

Does anyone have more information about where this dataset comes from?

y4nnick

1 Answer

6

The Bayesian Network Generated (BNG) datasets are a collection of artificially generated datasets openly available on OpenML. They were generated to fill the need for a large, heterogeneous set of large datasets. The BNG generator is best described in the paper Algorithm Selection on Data Streams.

Small quote from the paper about the BNG data generator:

The generator takes a dataset as input, and outputs a data stream containing a similar concept, with a predefined number of instances. The input dataset is preprocessed with the following operations: all missing values are first replaced by the majority value of that attribute, and numeric attributes are discretized using Weka’s binning algorithm.
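To make the two preprocessing steps in the quote concrete, here is a minimal Python sketch. It is not the generator's actual code: the helper names `impute_majority` and `equal_width_bins` and the toy columns are made up for illustration, and equal-width binning with 3 bins is only an assumption standing in for "Weka's binning algorithm", whose exact settings the quote does not state.

```python
from collections import Counter


def impute_majority(column):
    """Replace missing entries (None) with the most frequent observed value."""
    observed = [v for v in column if v is not None]
    majority = Counter(observed).most_common(1)[0][0]
    return [majority if v is None else v for v in column]


def equal_width_bins(column, n_bins=3):
    """Map numeric values to bin indices 0..n_bins-1 using equal-width bins.

    Assumption: equal-width binning is used here as a simple stand-in
    for the binning step mentioned in the paper.
    """
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column falls entirely into a single bin.
        return [0] * len(column)
    width = (hi - lo) / n_bins
    # Clamp so that the maximum value lands in the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in column]


# Toy columns (hypothetical data): one nominal, one numeric, both with gaps.
colour = ["red", None, "blue", "red"]
age = [20.0, 35.0, None, 50.0, 35.0]

colour_clean = impute_majority(colour)
age_binned = equal_width_bins(impute_majority(age), n_bins=3)

print(colour_clean)
print(age_binned)
```

After these steps every attribute is nominal and complete, which is the form of input the quote describes the generator as working from.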

A personal note: for general machine learning studies, I would refrain from using BNG (or any other kind of artificially generated) datasets, as the concept they capture is generally simpler than that of the original dataset. Instead, I recommend using a predefined benchmark suite, such as the OpenML-100.

  • Thanks for your answer! It would be very useful if this information were on the openml.org page, because that's where many people discover the BNG datasets. – y4nnick Jan 24 '18 at 00:14