Usage example of NoiseCut
This example applies NoiseCut to a binary classification task. It uses a synthetic dataset generated as described in the Generation_of_synthetic_data.ipynb notebook.
[1]:
import pandas as pd
from noisecut.model.noisecut_coder import Metric
from noisecut.model.noisecut_model import NoiseCut
from noisecut.tree_structured.data_manipulator import DataManipulator
1. Set training and test sets
Assign X as the features and Y as the labels.
[2]:
input_file = "../data/7D_synthetic_data_manual"
data = pd.read_csv(
    input_file,
    delimiter=" ",
    header=None,
    skiprows=1,
    engine="python",
)
X = data.iloc[:, :-1]
Y = data.iloc[:, -1]
To randomly sample the training and test sets, you can use the built-in split_data function of the DataManipulator class. If you work with a synthetic dataset, as in this example, you can also add noise to the labels by using the get_noisy_data function of the same class.
[3]:
training_set_size = 50  # Percentage of the data used for training
noise_intensity = 5  # Percentage of labels to flip (0 to 1, or 1 to 0)

manipulator = DataManipulator()
x_noisy, y_noisy = manipulator.get_noisy_data(
    X,
    Y,
    percentage_noise=noise_intensity,
)
x_train, y_train, x_test, y_test = manipulator.split_data(
    x_noisy,
    y_noisy,
    percentage_training_data=training_set_size,
)
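To make the two percentages concrete, the following is a minimal pure-Python sketch of flipping a given percentage of binary labels and then splitting the data at random. The helper names flip_labels and split are hypothetical and written only for illustration; they are not DataManipulator's implementation, which may differ in rounding and sampling details.

```python
import random


def flip_labels(labels, percentage_noise, seed=0):
    """Return a copy of `labels` with `percentage_noise` percent of entries toggled.

    Hypothetical helper for illustration only (not part of noisecut).
    """
    rng = random.Random(seed)
    n_flip = round(len(labels) * percentage_noise / 100)
    noisy = list(labels)
    for i in rng.sample(range(len(labels)), n_flip):
        noisy[i] = 1 - noisy[i]  # toggle 0 -> 1 or 1 -> 0
    return noisy


def split(rows, labels, percentage_training_data, seed=0):
    """Randomly split rows and labels into training and test parts.

    Hypothetical helper for illustration only (not part of noisecut).
    """
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    n_train = round(len(rows) * percentage_training_data / 100)
    train, test = idx[:n_train], idx[n_train:]
    return (
        [rows[i] for i in train],
        [labels[i] for i in train],
        [rows[i] for i in test],
        [labels[i] for i in test],
    )


rows = list(range(100))  # stand-in for 100 feature rows
labels = [0, 1] * 50
noisy = flip_labels(labels, percentage_noise=5)
n_changed = sum(a != b for a, b in zip(labels, noisy))
x_train, y_train, x_test, y_test = split(rows, noisy, percentage_training_data=50)
print(n_changed, len(x_train), len(x_test))
```

With 100 samples, a noise percentage of 5 toggles five labels, and a training percentage of 50 puts half of the samples in each set.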
2. Fitting the model
To fit the hybrid model to the training set, use the NoiseCut class. To instantiate an object of this class, you have to provide an array called n_input_each_box, which describes the tree structure of the hybrid model. The first element of n_input_each_box is the number of input features of the first black box in the first layer, which is 3 for the synthetic data generated in the Generation_of_synthetic_data.ipynb notebook. The second element is the number of input features of the second first-layer black box, which is 2 here, and the pattern continues for the remaining elements.
The model is then fitted by calling the fit function of the NoiseCut class.
[4]:
mdl = NoiseCut(n_input_each_box=[3, 2, 2])
mdl.fit(x_train, y_train)
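As a rough illustration of how n_input_each_box=[3, 2, 2] relates to the seven feature columns, the sketch below groups column indices by box. It assumes that features are assigned to the first-layer boxes in consecutive column order, which the text above does not state explicitly; box_column_ranges is a hypothetical helper, not a NoiseCut function.

```python
from itertools import accumulate


def box_column_ranges(n_input_each_box):
    """Group column indices by first-layer box.

    Assumption: columns are assigned to boxes consecutively, in order.
    Hypothetical helper for illustration only (not part of noisecut).
    """
    ends = list(accumulate(n_input_each_box))
    starts = [0] + ends[:-1]
    return [list(range(start, end)) for start, end in zip(starts, ends)]


n_input_each_box = [3, 2, 2]
groups = box_column_ranges(n_input_each_box)
n_features = sum(n_input_each_box)
print(groups, n_features)
```

The elements sum to 7, matching the seven binary inputs passed to the model later in this example.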
3. Evaluation
To evaluate the fitted model, pass the test set to the predict function of the NoiseCut class.
The built-in set_confusion_matrix function of the Metric class then builds the confusion matrix from the true and predicted labels and returns the accuracy, recall, precision, and F1 score on the test set.
[5]:
y_predicted = mdl.predict(x_test)
accuracy, recall, precision, F1 = Metric.set_confusion_matrix(
    y_test, y_predicted
)
print(
    "accuracy = {a:3.3f}, recall = {r:3.3f}, precision = {p:3.3f}, "
    "F1 = {f:3.3f}".format(a=accuracy, r=recall, p=precision, f=F1)
)
accuracy = 0.812, recall = 0.829, precision = 0.829, F1 = 0.829
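For reference, the four scores follow from the confusion-matrix counts by their standard definitions. The sketch below computes them in pure Python on a small made-up label set; binary_metrics is a hypothetical helper written for illustration, not the Metric class's implementation.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, recall, precision and F1 from binary labels.

    Hypothetical helper using the standard definitions (not part of noisecut).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return accuracy, recall, precision, f1


# Made-up labels: 3 true positives, 3 true negatives, 1 false positive, 1 false negative.
y_true_demo = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_demo = [1, 1, 0, 0, 0, 1, 1, 0]
acc, rec, prec, f1 = binary_metrics(y_true_demo, y_pred_demo)
print(acc, rec, prec, f1)
```

When recall and precision are equal, as in the output above, F1 takes the same value because it is their harmonic mean.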
4. Predictions as probability
Instead of a hard label, the hybrid model can return the probability that the label is 1 for any binary input. Use the predict_probability_of_being_1 function of the NoiseCut class for this. You can pass a single binary input or several inputs as an array of shape (n_samples, n_features). For several inputs, the function returns an array of shape (n_samples,) whose entries correspond one-to-one to the inputs.
[6]:
y_pred_proba = mdl.predict_probability_of_being_1([0, 0, 0, 0, 0, 0, 0])
print(f"Prediction probability for a binary input: {y_pred_proba}")
y_pred_proba = mdl.predict_probability_of_being_1(
    [[0, 0, 0, 0, 0, 0, 0], [1, 0, 1, 0, 1, 0, 1]]
)
print(f"Prediction probability for two binary inputs: {y_pred_proba}")
Prediction probability for a binary input: 1.0
Prediction probability for two binary inputs: [1. 1.]
The predict_probability_of_being_1 function can be applied to the complete test set in order to obtain the predicted probabilities. With these probabilities at hand, it becomes possible to calculate the area under the ROC curve.
[7]:
from sklearn import metrics # noqa: E402
y_pred_proba = mdl.predict_probability_of_being_1(x_test)
fpr, tpr, thresholds = metrics.roc_curve(y_test.astype(int), y_pred_proba)
print("AUC-ROC=", metrics.auc(fpr, tpr))
AUC-ROC= 0.9192118226600985
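One way to read the AUC-ROC value: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes this directly in pure Python on made-up scores; auc_by_pairs is a hypothetical helper for illustration and is not how scikit-learn computes the value internally.

```python
def auc_by_pairs(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Hypothetical helper for illustration.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up labels and scores: 3 of the 4 positive/negative pairs are ordered correctly.
auc_demo = auc_by_pairs([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc_demo)
```

The quadratic pairwise loop is fine for small test sets; for larger ones, the roc_curve and auc calls used in the cell above are the practical choice.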
5. Retrieved functions of the black boxes
After the model is fitted, the predicted binary function of a first-layer black box can be retrieved by calling get_binary_function_of_box of the NoiseCut class. It takes the ID of the first-layer black box as input, an integer in the range [0, n_box-1]. The predicted binary function of the second-layer black box can be retrieved by calling get_binary_function_black_box of the NoiseCut class. It takes no input because there is only one second-layer black box.
[8]:
func_0 = mdl.get_binary_function_of_box(0)
func_1 = mdl.get_binary_function_of_box(1)
func_2 = mdl.get_binary_function_of_box(2)
func_bb = mdl.get_binary_function_black_box()
func_0
[8]:
array([ True, True, False, False, False, True, False, True])
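The array for box 0 has 8 entries, one for each of the 2^3 combinations of its three binary inputs. The sketch below pairs each entry with an input combination to form a truth table. The ordering is an assumption made only for illustration: the text above does not say how array positions map to input combinations, so the sketch treats the first input as the most significant bit. Check the NoiseCut documentation before relying on this mapping.

```python
from itertools import product

# Values copied from the output above for first-layer box 0.
func_0 = [True, True, False, False, False, True, False, True]
n_inputs = 3  # box 0 has three input features

# A binary function of n inputs has 2**n output entries.
assert len(func_0) == 2 ** n_inputs

# ASSUMPTION: entry k corresponds to the k-th combination in lexicographic
# order, (0, 0, 0), (0, 0, 1), ..., (1, 1, 1). NoiseCut may order them differently.
truth_table = dict(zip(product([0, 1], repeat=n_inputs), func_0))

for bits, output in truth_table.items():
    print(bits, int(output))
```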