The Bayesian Network Generated (BNG) datasets are a collection of artificially generated datasets openly available on OpenML. They were created to meet the need for a large, heterogeneous collection of large datasets. The paper that best describes the BNG generator is:
Algorithm Selection on Data Streams.
A short quote from the paper about the BNG data generator:
"The generator takes a dataset as input, and outputs a data stream containing a similar concept, with a predefined number of instances. The input dataset is preprocessed with the following operations: all missing values are first replaced by the majority value of that attribute, and numeric attributes are discretized using Weka’s binning algorithm."
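To make the two preprocessing steps in the quote concrete, here is a minimal sketch in plain Python. It is not the generator's actual code: the function names (`impute_majority`, `equal_width_bins`, `preprocess`) are made up for illustration, and equal-width binning with a fixed number of bins is only an assumed stand-in for "Weka's binning algorithm", since the quote does not say which settings were used.

```python
from collections import Counter


def impute_majority(column):
    """Replace None entries with the most frequent observed value.

    Assumes the column has at least one non-missing value.
    """
    observed = [v for v in column if v is not None]
    majority = Counter(observed).most_common(1)[0][0]
    return [majority if v is None else v for v in column]


def equal_width_bins(column, n_bins=10):
    """Map each numeric value to a bin index in [0, n_bins - 1].

    Equal-width binning is an assumption here, not a confirmed detail
    of the BNG generator.
    """
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column collapses into a single bin.
        return [0] * len(column)
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in column]


def preprocess(columns, numeric, n_bins=10):
    """Impute every attribute, then discretize the numeric ones."""
    out = {}
    for name, values in columns.items():
        filled = impute_majority(values)
        out[name] = equal_width_bins(filled, n_bins) if name in numeric else filled
    return out


# Toy dataset: one nominal and one numeric attribute, each with a gap.
toy = {
    "outlook": ["sunny", None, "rainy", "sunny"],
    "temperature": [10.0, 20.0, None, 20.0],
}
result = preprocess(toy, numeric={"temperature"}, n_bins=2)
print(result)
```

After this step every attribute is nominal with no missing values, which is the form of input the quote describes the generator working from.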
A personal note: for general machine learning studies, I would refrain from using BNG (or any other artificially generated) datasets, as the concept is generally simpler than that of the original dataset. Instead, it is advisable to use a predefined benchmark suite, such as the OpenML-100.