Using Classification Tree

The following options appear on the Classification Tree dialogs.

Classification Tree dialog, Data tab

Variables In Input Data

The variables included in the data set appear here.

Selected Variables

Variables selected to be included in the output appear here.

Output Variable

The dependent variable or the variable to be classified appears here.

# Classes

Displays the number of classes in the Output Variable.

Specify "Success" class (for Lift Chart)

This option is selected by default. Click the drop-down arrow to select the value to specify a success. This option is enabled when the number of classes for the Output Variable is equal to 2.

Specify initial cutoff probability for success

Enter a value between 0 and 1 here to denote the cutoff probability for success. If the calculated probability for success for an observation is greater than or equal to this value, a success (or a 1) is predicted for that observation. If the calculated probability for success for an observation is less than this value, a non-success (or a 0) is predicted for that observation. The default value is 0.5. This option is enabled when the number of classes for the output variable is equal to 2.

Classification Tree dialog, Parameters tab

Partitioning Options

Analytic Solver Data Science includes the ability to partition a dataset from within a classification or prediction method by clicking Partition Data on the Parameters dialog. Analytic Solver Data Science will partition your dataset (according to the partition options you set) immediately before running the classification method. If partitioning has already occurred on the dataset, this option will be disabled. For more information on partitioning, please see the Data Science Partitioning chapter.

Rescale Data

Use Rescaling to normalize one or more features in your data during the data preprocessing stage. Analytic Solver Data Science provides the following methods for feature scaling: Standardization, Normalization, Adjusted Normalization and Unit Norm. For more information on this new feature, see the Rescale Continuous Data section within the Transform Continuous Data chapter that occurs earlier in this guide.

Analytic Solver Data Mining: Notes on Rescaling and Simulation functionality

Tree Growth

In the Tree Growth section, select Levels, Nodes, Splits, and Records in Terminal Nodes. Values entered for these options limit tree growth, i.e. if 10 is entered for Levels, the tree will be limited to 10 levels.

Prior Probability

Three options appear in the Prior Probability Dialog: Empirical, Uniform and Manual.

If the first option is selected, Empirical, Analytic Solver Data Science will assume that the probability of encountering a particular class in the dataset is the same as the frequency with which it occurs in the training data.
If the second option is selected, Uniform, Analytic Solver Data Science will assume that all classes occur with equal probability.
Select the third option, Manual, to manually enter the desired class and probability value.

Prune (Using Validation Set)

If a validation partition exists, this option is enabled. When this option is selected, Analytic Solver Data Science will prune the tree using the validation set. Pruning the tree using the validation set reduces the error from over-fitting the tree to the training data.

Show Feature Importance

Select Feature Importance to include the Features Importance table in the output. This table displays the variables that are included in the model along with their Importance value.

Maximum Number of Levels

This option specifies the maximum number of levels in the tree to be displayed in the output.
Note: If a tree is limited to X levels in the output (intentionally or due to Analytic Solver Basic's limit of 7 levels), Analytic Solver will draw the first X levels of the diagram.

Trees to Display

Select Trees to Display to select the types of trees to display: Fully Grown, Best Pruned, Minimum Error or User Specified.
• Select Fully Grown to “grow” a complete tree using the training data.
• Select Best Pruned to create a tree with the fewest number of nodes, subject to the constraint that the error be kept below a specified level (minimum error rate plus the standard error of that error rate).
• Select Minimum error to produce a tree that yields the minimum classification error rate when tested on the validation data.
• To create a tree with a specified number of decision nodes select User Specified and enter the desired number of nodes.

Classification Tree dialog, Scoring tab

The following options appear on the Classification Tree - Step 3 of 3 dialog.

Score Training Data

Select these options to show an assessment of the performance of the Classification Tree algorithm in classifying the training data. The report is displayed according to your specifications - Detailed, Summary, and Lift charts. Lift charts are only available when the Output Variable contains 2 categories.

New in V2023: When Frequency Chart is selected, a frequency chart will be displayed when the CT_TrainingScore worksheet is selected. This chart will display an interactive application similar to the Analyze Data feature, explained in detail in the Analyze Data chapter that appears earlier in this guide. This chart will include frequency distributions of the actual and predicted responses individually, or side-by-side, depending on the user’s preference, as well as basic and advanced statistics for variables, percentiles, six sigma indices.

Score Validation Data

These options are enabled when a validation data set is present. Select these options to show an assessment of the performance of the Classification Tree algorithm in classifying the validation data. The report is displayed according to your specifications - Detailed, Summary, and Lift charts. Lift charts are only available when the Output Variable contains 2 categories. When Frequency Chart is selected, a frequency chart (described above) will be displayed when the CT_ValidationScore worksheet is selected.

Score Test Data

These options are enabled when a test set is present. Select these options to show an assessment of the performance of the Classification Tree algorithm in classifying the test data. The report is displayed according to your specifications - Detailed, Summary, and Lift charts. Lift charts are only available when the Output Variable contains 2 categories. When Frequency Chart is selected, a frequency chart (described above) will be displayed when the CT_TestScore worksheet is selected.

Score New Data

See the Scoring chapter within the Analytic Solver Data Science User Guide for more information on the options located in the Score Test Data and Score New Data groups.

Classification Tree dialog, Simulation tab

All supervised algorithms in V2023 include a new Simulation tab. This tab uses the functionality from the Generate Data feature (described earlier in this guide) to generate synthetic data based on the training partition, and uses the fitted model to produce predictions for the synthetic data. The resulting report, CT_Simulation, will contain the synthetic data, the predicted values and the Excel-calculated Expression column, if present. In addition, frequency charts containing the Predicted, Training, and Expression (if present) sources or a combination of any pair may be viewed, if the charts are of the same type.

Evaluation: Select Calculate Expression to amend an Expression column onto the frequency chart displayed on the CT_Simulation output tab. Expression can be any valid Excel formula that references a variable and the response as [@COLUMN_NAME]. Click the Expression Hints button for more information on entering an expression.