# Black Tree AutoML

Help

Below is a detailed summary of each algorithm provided in Black Tree Massive, which includes all of the algorithms in Black Tree Pro. This will allow you to analyze output data and produce charts and other output similar to what is automatically generated by Black Tree.

**Pro Algorithms**

*Nearest Neighbor*

Given an input vector x (i.e., one row of a dataset of Euclidean vectors), the Nearest Neighbor algorithm returns the vector y, for which ||x – y|| is minimum over the dataset. In plain terms, given an input vector x, the vector y is closest to x (i.e., the “nearest neighbor” of x). The predicted classifier of x is the classifier of y. The Nearest Neighbor algorithm itself can under reasonable circumstances produce perfect accuracy. See Lemma 1.1 of Analyzing Dataset Consistency [1].

Output Files

Nearest Neighbor Vector: Entry i is the the row number of the nearest neighbor of row i.

Error Vector: Entry i is 1 if the predicted class for row i is incorrect, and zero otherwise.

Cluster Rows Matrix: Entry (i,j) is 1 if row j is the nearest neighbor of row i, and zero otherwise.

Predicted Class Vector: Entry i is the predicted classifier of row i.

Figure 1: The vertical axis is the value in each dimension (i.e., the columns in the dataset). The horizontal axis is the column number. The values displayed are (i) the minimum value in each dimension, (ii) the maximum value in each dimension, (iii) a given input vector, and (iv) its nearest neighbor. The purpose is to create a visual intuition for the distances between vectors in the dataset.

*Delta Classification Generally*

The dataset is clustered using spherical clusters of radius delta. For unsupervised algorithms (Delta Classification, Delta Modal Classification), the value of delta is calculated by finding the maximum of the derivative of cluster sizes as a function of delta. For the supervised algorithm (Sup. Delta Classification), delta is the largest radius that does not produce clustering errors. Prediction is in all cases accomplished using the Nearest Neighbor algorithm, and if the Nearest Neighbor of a given testing row is beyond delta (i.e., the norm of the difference between the two vectors exceeds delta), then that prediction is “rejected” as a likely error. Delta Classification can under reasonable circumstances produce perfect accuracy. See Lemma 2.3 of [1].

*Delta Classification*

This algorithm is completely unsupervised, and does not require training classifiers. When run on its own, it can discover the “natural” clusters within a dataset, that are at times more precise than human classifier labels (e.g., people can draw the same digits differently, and this algorithm can at times distinguish cases like these on its own, without knowing classifiers). All Delta Classification algorithms use spherical clustering, and in this case, the predicted classifier is the class of the origin of the spherical cluster to which a given input vector is assigned. An input vector is assigned to a cluster using the Nearest Neighbor algorithm (i.e., row i is assigned to the cluster for row j, if row j is the nearest neighbor of row i).

*Delta Modal Classification*

This algorithm constructs clusters using the completely unsupervised Delta Classification algorithm. However, the predicted classifier is the modal class (i.e., most frequent class) of the cluster to which a given input vector is assigned. As a consequence, this algorithm requires training classifiers. Moreover, this algorithm is therefore beyond Lemma 2.3 from [1], and instead assumes that the most frequent class within a region of Euclidean space is the correct classifier.

*Supervised Delta Classification*

This algorithm constructs clusters using the Supervised Delta Classification algorithm, which extends the value of delta until it encounters its first error, thereby creating homogeneous clusters (i.e., all elements of a cluster have the same classifier). The prediction step relies upon the Nearest Neighbor algorithm, and is in that regard identical to the unsupervised classification algorithm. This algorithm is within Lemma 2.3, and will therefore under qualifying circumstances produce perfect accuracy. For Black Tree Massive, the Supervised Delta Classification algorithm should make better use of memory and processing on consumer devices, since it does not store the clusters produced during the training step. However, this requires the user to retrieve clusters, which can be accomplished using the Delta Vector output file, which stores the radius for each training row (i.e., entry i is the radius for row i). Use the following function:

[cluster_vector diff] = find_delta_cluster_BlackTree(input_vector, dataset, delta, N),

where input_vector is a row of the dataset, dataset is the full dataset, delta is the corresponding value in the Delta Vector, and N is the number of columns (excluding the classifier column). The cluster_vector is a binary vector where entry i is 1 if row i is included in the cluster of the input vector. Entry i of the second output, diff, is the Euclidean distance between row i and input_vector.

*Image Classification*

A sample image is selected from the dataset and sized for compression, into a super-pixel image. All other images are compressed using the same sizing. The super-pixel images are then stored in a matrix, and treated as a dataset. Supervised Delta Classification is then applied to the resultant dataset. Note that for Black Tree Massive, the algorithm applied is the same Supervised Delta Algorithm applied in Pro, and so Black Tree Massive does not include superior Image Classification.

Output Files

Nearest Neighbor Vector: Unchanged.

Error Vector: Unchanged.

Cluster Rows Matrix: Entry (i,j) is 1 if row j is contained in the cluster for row i, and otherwise 0.

Predicted Class Vector: Unchanged.

Rejected Rows Vector: Entry i is 1 if the nearest neighbor of input vector i is beyond delta, and zero otherwise.

Accepted Rows Vector: It is the logical compliment of the Rejected Rows Vector.

Figure 1: The vertical axis is the value in each dimension (i.e., the columns in the dataset). The horizontal axis is the column number. The values displayed are (i) the minimum value in each dimension, (ii) the maximum value in each dimension, and (iii) the largest cluster in the dataset. The purpose is to create a visual intuition for the distances between vectors in the dataset.

Figure 2 (Image Classification): A sample image from the dataset.

Figure 2 (Delta Classification and Delta Modal Classification): The horizontal axis is accuracy, the vertical axis is the number of clusters with at least a given accuracy, but less than the next accuracy range. The purpose is to show the accuracy distribution of the clusters, which is often indicative of prediction accuracy. The class of the cluster is assumed to be the class of the origin, and each row in the cluster with a classifier that is not equal to the class of the origin constitutes an error.

Figure 3 (Image Classification): The sample image compressed into a super-pixel image.

**Massive Algorithms**

The output of the Massive Algorithms is deliberately minimal, under the assumption that the datasets are unusually large. These algorithms rely upon a theorem connecting the Nearest Neighbor algorithm and sorting. See Theorem 2.1 of Sorting, Information, and Recursion [2]. The confidence metrics rely upon assumptions set out in Information, Knowledge, and Uncertainty [3].

*Size-Based Classification*

The dataset is first sorted, and clusters are generated by iterating in both directions in the sorted list from a given testing row. The cluster size is determined by the number of consecutive consistent classifiers (i.e., it terminates at first error / change in classifier). The predicted class is simply the class of the cluster (i.e., all clusters have a single consistent classifier). If the cluster for a given testing row is empty, then the prediction is rejected. Confidence then further filters predictions, with confidence given by the size of the cluster for a given testing row. See [3] generally. All testing rows below a given confidence are rejected, producing a curve that shows accuracy as a function of confidence. This process is applied a number of times equal to the Number of Iterations, to random Training / Testing subsets, with a fixed Training Percentage of 85%. All reported training statistics reflect cumulative data over all iterations.

*Entropy-Based Classification*

The dataset is first sorted, and clusters are generated by iterating in both directions in the sorted list from a given testing row. The cluster size is determined by a fixed radius, terminating once that radius is reached. That radius is a function of the standard deviation of the columns of the dataset. The predicted class is the modal class of the cluster for a given testing row. If the cluster for a given testing row is empty, then the prediction is rejected. Confidence then further filters predictions, with confidence given by a function of the entropy of the classifier distribution of the cluster for a given testing row. See [3] generally, in particular Section 3. All testing rows below a given confidence are rejected, producing a curve that shows accuracy as a function of confidence. This process is applied a number of times equal to the Number of Iterations, to random Training / Testing subsets, with a fixed Training Percentage of 85%. All reported training statistics reflect cumulative data over all iterations.

Output Files

Predicted [Cluster Size / Confidence] Vector: Entry i is the confidence metric of row i, for the prediction step.

Predicted Class Vector: Unchanged.

Actual Class Matrix: Entry (i,j) is the actual classifier for row i, during training iteration j.

Predicted Class Matrix: Entry (i,j) is the predicted classifier for row i, during training iteration j.

Accuracy Vector: This vector contains the accuracy data for Figures 1 and 2.

Confidence Matrix: Entry (i,j) is the confidence metric of row i, during training iteration j.

Testing Rows Matrix: Entry (i,j) is the actual row number (i.e., in the full dataset) for random testing row i, during training iteration j.

Figure 1: The vertical axis is accuracy, and the horizontal axis is the percentage of predictions rejected. As a result, this shows accuracy as a function of rejected predictions, as generated during the training step. As the confidence threshold is increased, more predictions become rejected (i.e., they fail to meet the minimum confidence threshold), which typically increases accuracy. During the prediction step, a confidence metric is calculated for every classification prediction. Though of course not guaranteed, the purpose of the training step is to produce implied accuracies at each minimum confidence threshold, thereby producing an implied accuracy for predictions.

Figure 2: The vertical axis is accuracy, and the horizontal axis is the minimum confidence threshold. As a result, this shows accuracy as a function minimum confidence.

Figure 3: The vertical axis is the value in each dimension (i.e., the columns in the dataset). The horizontal axis is the column number. The values displayed are (i) the minimum value in each dimension, (ii) the maximum value in each dimension, and (iii) the largest cluster in the dataset. The purpose is to create a visual intuition for the distances between vectors in the dataset.

Figure 4: The horizontal axis is accuracy, the vertical axis is the number of iterations with at least a given accuracy, but less than the next accuracy range. This figure is limited to the 90% confidence interval, and the maximum confidence interval, for Size-Based and Entropy-Based Classification, respectively. These are the levels that are typically most useful.