Synthetic data generation

We generated synthetic data sets to benchmark the binary classification performance of NoiseCut against other machine learning classifiers. The data sets were created such that the flow of information from binary-represented input data \(\mathbf{x} \in \{0,1\}^n\) to binary outputs or labels \(y \in \{0,1\}\) conforms to a tree-structured network, as illustrated in the figure below:


Figure 1: A schematic representation of the information flow from binary-represented input data to binary labels. This procedure was used to generate the synthetic data.

Figure 1 illustrates an example of the labeling procedure in the synthetic datasets. We assumed a tree-structured network \(\mathcal{F}: \{0,1\}^7 \longmapsto \{0,1\}\) mapping binary variables \(\mathbf{x}\) to binary labels \(y\):

\[y = \mathcal{F}(\mathbf{x}) \;\;,\;\; \mathbf{x} \in \{0,1\}^7 \;\;,\;\; y \in \{0,1\}.\]

In the network of Figure 1, there are three first-layer boxes \(\mathrm{F_1}: \{0,1\}^3 \longmapsto \{0,1\}\), \(\mathrm{F_2}: \{0,1\}^2 \longmapsto \{0,1\}\), and \(\mathrm{F_3}: \{0,1\}^2 \longmapsto \{0,1\}\) that separately perform computations on subsets of the input features. The I/O functions of the first-layer boxes in Figure 1 are:

\[\begin{split}\begin{aligned} F_1: \begin{pmatrix} 0 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 1 \\ \end{pmatrix} &\longmapsto \begin{pmatrix} 0 \\ 0 \\ 1 \\ 1 \\ 1 \\ 0 \\ 1 \\ 0 \\ \end{pmatrix} , & F_2: \begin{pmatrix} 0 & 0 \\ 1 & 0 \\ 0 & 1 \\ 1 & 1 \\ \end{pmatrix} &\longmapsto \begin{pmatrix} 1 \\ 0 \\ 1 \\ 1 \\ \end{pmatrix} , & F_3: \begin{pmatrix} 0 & 0 \\ 1 & 0 \\ 0 & 1 \\ 1 & 1 \\ \end{pmatrix} &\longmapsto \begin{pmatrix} 1 \\ 1 \\ 0 \\ 0 \\ \end{pmatrix}. \end{aligned}\end{split}\]

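For instance, when we feed \(\mathbf{x}^\prime = [0, 1, 0, 0, 1, 1, 0]\) into the network, the three first-layer boxes return \([1, 1, 1]\), which is then forwarded to the output box \(\mathrm{F_O}: \{0,1\}^3 \longmapsto \{0,1\}\) with the following I/O function:

\[\begin{split}\begin{aligned} F_O: \begin{pmatrix} 0 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 1 \\ \end{pmatrix} &\longmapsto \begin{pmatrix} 0 \\ 0 \\ 1 \\ 1 \\ 1 \\ 0 \\ 1 \\ 0 \\ \end{pmatrix}. \end{aligned}\end{split}\]

Finally, the output box returns the generated label, here \(y^\prime=0\), for the input \(\mathbf{x}^\prime\).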
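The labeling procedure above can be traced by hand with plain lookup tables. The following is a minimal sketch and not NoiseCut code: it assumes each box is stored as a flat list whose position k holds the output for the k-th row of the I/O tables above (first feature varying fastest), and it takes the output-box table from the manually defined example later on this page (`[0, 0, 1, 1, 1, 0, 1, 0]`). The helper names `row_index` and `evaluate_tree` are hypothetical.

```python
# Lookup tables for the boxes in Figure 1. Position k is the output for the
# k-th row of the corresponding I/O table (first feature varies fastest).
F1 = [0, 0, 1, 1, 1, 0, 1, 0]
F2 = [1, 0, 1, 1]
F3 = [1, 1, 0, 0]
# Assumption: output-box table taken from the manual example on this page.
FO = [0, 0, 1, 1, 1, 0, 1, 0]


def row_index(bits):
    """Return the table row for a sequence of bits, first bit least significant."""
    return sum(bit << position for position, bit in enumerate(bits))


def evaluate_tree(x, box_sizes, box_tables, output_table):
    """Evaluate a two-layer tree network on one binary input vector."""
    hidden = []
    start = 0
    for size, table in zip(box_sizes, box_tables):
        # Each first-layer box only sees its own consecutive slice of x.
        hidden.append(table[row_index(x[start:start + size])])
        start += size
    # The output box combines the first-layer outputs into the label.
    return hidden, output_table[row_index(hidden)]


x_prime = [0, 1, 0, 0, 1, 1, 0]
hidden, y_prime = evaluate_tree(x_prime, [3, 2, 2], [F1, F2, F3], FO)
print(hidden, y_prime)  # [1, 1, 1] 0
```

The first-layer outputs and the label match the values stated in the text for \(\mathbf{x}^\prime\).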

Generating tree-structured data through randomly defined functions

Using NoiseCut, one can generate tree-structured synthetic data with an arbitrary number of first-layer boxes and one output box. The function of each black box can be assigned randomly or set manually.

To generate a tree-structured synthetic dataset whose black boxes have randomly assigned functions, use the SampleGenerator class.

To instantiate an object of this class, pass an array giving the number of input features of each first-layer black box: the first element is the number of input features of the first black box, the second element that of the second black box, and so on. The length of the array therefore equals the number of first-layer black boxes, which is 3 in the example below. If you set allowance_rand=True, all functions are set randomly when the object is instantiated.

[1]:
from noisecut.tree_structured.sample_generator import SampleGenerator

gen_dataset = SampleGenerator([3, 2, 2], allowance_rand=True)
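To make the meaning of the array `[3, 2, 2]` concrete, the sketch below derives which feature positions each first-layer box would read. It is an illustration only, not part of NoiseCut: the helper `feature_slices` is hypothetical, and it assumes features are assigned to boxes in consecutive order, which agrees with the `feature_1` to `feature_7` grouping printed by `print_binary_function_model()` further down.

```python
from itertools import accumulate


def feature_slices(n_features_per_box):
    """Return (start, end) index pairs, one per first-layer box."""
    ends = list(accumulate(n_features_per_box))
    starts = [0] + ends[:-1]
    return list(zip(starts, ends))


box_sizes = [3, 2, 2]
slices = feature_slices(box_sizes)

# Number of first-layer boxes and total input dimension.
print(len(box_sizes), sum(box_sizes))  # 3 7
for box_id, (start, end) in enumerate(slices):
    print(f"box {box_id}: features {start + 1}..{end}")
```

With `[3, 2, 2]` this gives three boxes reading features 1 to 3, 4 to 5, and 6 to 7 of a 7-dimensional input.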

To construct the dataset for the randomly generated model, call the get_complete_data_set function of the SampleGenerator class.

[2]:
x_gen_dataset, y_gen_dataset = gen_dataset.get_complete_data_set()

If you pass a file name (including its path) to get_complete_data_set through the file_name argument, the dataset is also stored in a file with that name at the given location.

[3]:
x_gen_dataset, y_gen_dataset = gen_dataset.get_complete_data_set(
    file_name="../data/7D_synthetic_data_random"
)
print("Generated binary labels:", "\n", y_gen_dataset.astype(int))
Generated binary labels:
 [0 0 0 0 0 0 0 0 0 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0
 1 1 0 0 1 0 1 0 1 1 0 0 1 0 1 0 1 1 0 0 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 1
 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 1
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
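The printed output contains \(2^7 = 128\) labels, one for every possible 7-bit input. The sketch below enumerates such a complete input set in plain Python. The row order (first feature varying fastest, as in the I/O tables of Figure 1) is an assumption, and `all_binary_inputs` is a hypothetical helper, so compare against `x_gen_dataset` before relying on it.

```python
def all_binary_inputs(n_features):
    """List every binary vector of length n_features.

    Assumed order: row i holds the bits of i, first feature least significant.
    """
    return [
        [(i >> k) & 1 for k in range(n_features)]
        for i in range(2 ** n_features)
    ]


rows = all_binary_inputs(7)
print(len(rows))  # 128
print(rows[9])    # [1, 0, 0, 1, 0, 0, 0]
```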

The randomly set binary function of a first-layer black box can be retrieved by calling the get_binary_function_of_box function of the SampleGenerator class. It takes the ID of the first-layer black box as input, which is an integer in the range [0, n_box-1]. Likewise, the randomly set binary function of the output box can be retrieved by calling get_binary_function_black_box of the SampleGenerator class. It takes no input, as there is only one output box in the network.

[4]:
func_0 = gen_dataset.get_binary_function_of_box(0)
func_1 = gen_dataset.get_binary_function_of_box(1)
func_2 = gen_dataset.get_binary_function_of_box(2)
func_bb = gen_dataset.get_binary_function_black_box()
print("The function of the output-box:", "\n", func_bb)
The function of the output-box:
 [False  True False False False  True False  True]

You can also print the functions of all first-layer black boxes, together with the function of the output box, at once by calling gen_dataset.print_binary_function_model().

[5]:
gen_dataset.print_binary_function_model()
Function Box1
([feature_1, feature_2, feature_3]: Binary Output) ->
([0 0 0]: 0), ([1 0 0]: 1), ([0 1 0]: 0), ([1 1 0]: 1), ([0 0 1]: 0), ([1 0 1]: 1), ([0 1 1]: 1), ([1 1 1]: 0)
Function Box2
([feature_4, feature_5]: Binary Output) ->
([0 0]: 1), ([1 0]: 0), ([0 1]: 1), ([1 1]: 1)
Function Box3
([feature_6, feature_7]: Binary Output) ->
([0 0]: 0), ([1 0]: 1), ([0 1]: 0), ([1 1]: 0)
Function Black Box
([Output_box_1, Output_box_2, Output_box_3]: Binary Output) ->
([0 0 0]: 0), ([1 0 0]: 1), ([0 1 0]: 0), ([1 1 0]: 0), ([0 0 1]: 0), ([1 0 1]: 1), ([0 1 1]: 0), ([1 1 1]: 1)
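As a consistency check, the labels printed earlier can be recomputed from the four functions listed above. This is a sketch in plain Python rather than NoiseCut's own code: the tables are copied from the printout, and the sample order (first feature varying fastest) is an assumption that happens to reproduce the printed label array.

```python
BOX_SIZES = [3, 2, 2]
BOX_TABLES = [
    [0, 1, 0, 1, 0, 1, 1, 0],  # Function Box1
    [1, 0, 1, 1],              # Function Box2
    [0, 1, 0, 0],              # Function Box3
]
OUTPUT_TABLE = [0, 1, 0, 0, 0, 1, 0, 1]  # Function Black Box


def row_index(bits):
    """Table row for a sequence of bits, first bit least significant."""
    return sum(bit << position for position, bit in enumerate(bits))


def label(sample_index):
    """Label of the sample whose feature bits encode sample_index."""
    bits = [(sample_index >> k) & 1 for k in range(sum(BOX_SIZES))]
    hidden, start = [], 0
    for size, table in zip(BOX_SIZES, BOX_TABLES):
        hidden.append(table[row_index(bits[start:start + size])])
        start += size
    return OUTPUT_TABLE[row_index(hidden)]


labels = [label(i) for i in range(2 ** sum(BOX_SIZES))]
print(labels[:16])  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0]
print(sum(labels))  # 28
```

The first 16 values and the total of 28 positive labels agree with the array printed by cell [3].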

Generating tree-structured data by setting functions manually

As in the randomly defined case, import the SampleGenerator class and instantiate an object of it, this time with allowance_rand=False.

[6]:
from noisecut.tree_structured.sample_generator import (  # noqa: E402
    SampleGenerator,
)

gen_dataset = SampleGenerator([3, 2, 2], allowance_rand=False)

To set the functions manually, use the set_binary_function_of_box function of the SampleGenerator class. Its inputs are the ID of the first-layer black box and the desired binary function of that box. The function of the output box is set with set_binary_function_black_box. In the example below, we set the binary functions depicted in Figure 1.

[7]:
gen_dataset.set_binary_function_of_box(0, [0, 0, 1, 1, 1, 0, 1, 0])
gen_dataset.set_binary_function_of_box(1, [1, 0, 1, 1])
gen_dataset.set_binary_function_of_box(2, [1, 1, 0, 0])
gen_dataset.set_binary_function_black_box([0, 0, 1, 1, 1, 0, 1, 0])
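The lists passed above follow the row order of the I/O tables in Figure 1, where the first input of a box varies fastest. If a function is easier to write down as an explicit truth table, a small helper can flatten it into that order. The sketch below is illustrative only: `table_to_list` is a hypothetical helper and not part of NoiseCut.

```python
from itertools import product


def table_to_list(truth_table, n_inputs):
    """Flatten {input_tuple: output} into a list, first input varying fastest."""
    ordered = []
    for reversed_bits in product([0, 1], repeat=n_inputs):
        # product() varies the last element fastest, so reverse each tuple.
        bits = reversed_bits[::-1]
        ordered.append(truth_table[bits])
    return ordered


# F_2 and F_3 from Figure 1, written as explicit truth tables.
F2_TABLE = {(0, 0): 1, (1, 0): 0, (0, 1): 1, (1, 1): 1}
F3_TABLE = {(0, 0): 1, (1, 0): 1, (0, 1): 0, (1, 1): 0}

print(table_to_list(F2_TABLE, 2))  # [1, 0, 1, 1]
print(table_to_list(F3_TABLE, 2))  # [1, 1, 0, 0]
```

The results match the lists passed to `set_binary_function_of_box` for boxes 1 and 2.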

After setting all black-box functions, you can check that the network does not contain a redundant black box by calling the has_synthetic_example_functionality function of the SampleGenerator class. If it returns False, you may need to change the functions of the black boxes and check again. This test helps you create a non-reducible tree-structured dataset in which every black box contributes to the output.

[8]:
gen_dataset.has_synthetic_example_functionality()
[8]:
True
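To give an intuition for what a redundant black box could look like, the sketch below applies two simple criteria: a first-layer box is unproductive if its function is constant, and the output box makes a first-layer box irrelevant if its output never changes when that input is flipped. These criteria and the helper names are assumptions for illustration; the actual checks performed by `has_synthetic_example_functionality` may differ.

```python
def is_constant(table):
    """True if the function returns the same value for every input row."""
    return len(set(table)) == 1


def depends_on_input(table, position):
    """True if flipping input `position` changes the output for some row."""
    step = 1 << position
    return any(table[k] != table[k ^ step] for k in range(len(table)))


def looks_non_reducible(box_tables, output_table):
    """Heuristic check that no box is trivially redundant (assumed criteria)."""
    boxes_ok = all(not is_constant(table) for table in box_tables)
    output_ok = all(
        depends_on_input(output_table, position)
        for position in range(len(box_tables))
    )
    return boxes_ok and output_ok


figure_1_boxes = [[0, 0, 1, 1, 1, 0, 1, 0], [1, 0, 1, 1], [1, 1, 0, 0]]
figure_1_output = [0, 0, 1, 1, 1, 0, 1, 0]
print(looks_non_reducible(figure_1_boxes, figure_1_output))  # True

# An output box that only copies its first input ignores boxes 2 and 3.
print(looks_non_reducible(figure_1_boxes, [0, 1, 0, 1, 0, 1, 0, 1]))  # False
```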

You can also get and store the complete dataset in the same way as explained in the previous part.

[9]:
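# Build the complete dataset in memory only.
x_gen_dataset, y_gen_dataset = gen_dataset.get_complete_data_set()
# Build it again and also store it under the given file name.
x_gen_dataset, y_gen_dataset = gen_dataset.get_complete_data_set(
    file_name="../data/7D_synthetic_data_manual"
)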