Random Trees

This example illustrates how to use the 3rd ensemble method, random trees, to create a regression model.  We’ll again re-use the same partition of the Boston_Housing.xlsx dataset.

Input

Click Predict – Ensemble – Random Trees on the Data Science ribbon. 

Select MEDV as the Output variable and all remaining variables except CAT.MEDV, CHAS and Record ID as Selected Variables.  See the previous Boosting help topic for a screenshot of the Data tab. 

 Click Next to advance to the Random Trees- Data tab.

Recall that Random Trees only supports Decision Trees as a Weak Learner.  Click Decision Tree to change any options associated with this algorithm.  This example uses the default settings for the Decision Tree algorithm.  For more information on these options, see Regression Trees

Select Show Weak Learner Models to include this information in the output.  To see the output for Show Feature Importance see Regresstion Trees

Random Trees Regression dialog, Parameters tab

Click Next to advance to the Random Trees Scoring tab. 

Select all four options for Score Training/Validation data. 

When Detailed report is selected, Analytic Solver Data Science will create a detailed report of the Regression Trees output. 

When Summary report is selected, Analytic Solver Data Science will create a report summarizing the Regression Trees output.

When Lift Charts is selected, Analytic Solver Data Science will include Lift Chart and RROC Curve  plots in the output.

When Frequency Chart is selected, a frequency chart will be displayed when the RRandTrees_TrainingScore and RRandTrees_ValidationScore worksheets are selected.  This chart will display an interactive application similar to the Analyze Data feature, explained in detail in the Analyze Data chapter that appears earlier in this guide.  This chart will include frequency distributions of the actual and predicted responses individually, or side-by-side, depending on the user’s preference, as well as basic and advanced statistics for variables, percentiles, six sigma indices. 

Since we did not create a test partition, the options for Score test data are disabled.  See Data Science Partitioning for information on how to create a test partition. 

See Scoring New Data for more information on Score New Data in options. 

Random Trees Regression dialog, Scoring tab

Click Next to advance to the Simulation tab. 

Select Simulation Response Prediction to enable all options on the Simulation tab of the Regression Tree dialog. 

Simulation tab: All supervised algorithms include a new Simulation tab.  This tab uses the functionality from the Generate Data feature (described earlier in this guide) to generate synthetic data based on the training partition, and uses the fitted model to produce predictions for the synthetic data.  The resulting report, RRandTrees_Simulation, will contain the synthetic data, the predicted values and the Excel-calculated Expression column, if present.  In addition, frequency charts containing the Predicted, Training, and Expression (if present) sources or a combination of any pair may be viewed, if the charts are of the same type. 

Random Trees Regression dialog, Simulation tab

Evaluation:  Select Calculate Expression to amend an Expression column onto the frequency chart displayed on the RRandTrees_Simulation output tab.  Expression can be any valid Excel formula that references a variable and the response as [@COLUMN_NAME].  Click the Expression Hints button for more information on entering an expression.  Note that variable names are case sensitive.  See any of the prediction methods to see the Expression field in use. 

For more information on the remaining options shown on this dialog in the Distribution Fitting, Correlation Fitting and Sampling sections, see Generate Data.

Click Finish to run the Random Trees Ensemble Method on the example dataset.

Output

The output of the Ensemble Methods algorithm are inserted at the end of the workbook. 

RRandTrees_Output

This worksheet contains three sections:  the Output Navigator, Inputs and Boosting Method.

Output Navigator:  Double click RRandTrees_Output to view the Output Navigator, which is inserted at the top of each output worksheet.  Click any link in this table to navigate to various sections of the output. 

Output Navigator

Inputs:  Scroll down to the Inputs section to find all inputs entered or selected on all tabs of the Random Trees dialog.

RRandomTrees_Output, Inputs Report

Boosting Model:  Click the Boosting Model link on the Output Naviagator to view the Boosting model for each weak learner.  Recall that the default is "10" on the Parameters tab. 

RRandTrees_TrainingScore & RRandTrees_ValidationScore

Click the RRandTrees_TrainingScore and RRandTrees_ValidationScore tabs to view the newly added Output Variable frequency chart, the Training\Validation:  Prediction Summary and the Training\Validation:  Prediction Details report.  Note:  To view charts in the Cloud app, click the Charts icon on the  Ribbon, select a worksheet under Worksheet and a chart under Chart. 

Frequency Charts:  The output variable frequency chart opens automatically once the RRandTrees_TrainingScore or RRandTrees_ValidationScore worksheet is selected.  For more information on this dialog, see the Boosting example appearing in the previous help topics.   

RRandTrees_TrainingScore Frequency chart                                                                       RRandTrees_ValidationScore Frequency chart 

Training:  Prediction Summary:  Click the Training:  Prediction Summary link on the Output Navigator to open the Training Summary and the Validation: Prediction Summary link to open the Validation Summary This data table displays various statistics to measure the performance of the trained network:  Sum of Squared Error (SSE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), the Median Absolute Deviation (MAD) and the Coefficient of Determination (R2).

 RRandTrees_TrainingScore Frequency chart                                    RRandTrees_ValidationScore Frequency chart

Training:  Prediction Details:  Scroll down to view the Prediction Details data table.  This table displays the Actual versus Predicted values, along with the Residuals, for the training dataset. 

RRandTrees_TrainingScore Frequency chart                                         RRandTrees_ValidationScore Frequency chart

RRandTrees_TrainingLiftChart & RRandTrees_ValidationLiftChart

Click the RRandTrees_TrainLiftChart and RRandTrees_ValidLiftChart tabs to navigate to the Lift Charts and Regression RROC curves for both the training and validation datasets.  For more information on how to interpret these charts, see Regression Trees

Note:  To view these charts in the Cloud app, click the Charts icon on the  Ribbon, select RRandTrees_TrainingLiftChart or RRandTrees_ValidationLiftChart for Worksheet and Decile Chart, ROC Chart or Gain Chart for Chart.

RRandTrees_Simulation

As discussed above, Analytic Solver Data Science generates a new output worksheet, RRandTrees_Simulation, when Simulate Response Prediction is selected on the Simulation tab of the Randon Trees Regression dialog. 

This report contains the synthetic data, the predicted values for the training data (using the fitted model) and the Excel – calculated Expression column, if populated in the dialog.  Users can switch between the Predicted, Training, and Expression sources or a combination of two, as long as they are of the same type. 

Synthetic Data

The data contained in the Synthetic Data report is syntethic data, generated using the Generate Data feature described in the chapter with the same name, that appears earlier in this guide.   

The chart that is displayed once this tab is selected, contains frequency information pertaining to the output variable in the training data, the synthetic data and the expression, if it exists.  (Recall that no expression was entered in this example.)

Frequency Chart for Prediction (Simulation) data

Click Prediction (Simulation) to add the training data to the chart.

Click Prediction(Simulation) and Prediction (Training) to change the Data view.

Data Dialog

In the chart below, the dark blue bars display the frequencies for the synthetic data and the light blue bars display the frequencies for the predicted values in the Training partition. 

Prediction (Simulation) and Prediction (Training) Frequency chart for MEDV variable

The Relative Bin Differences curve charts the absolute differences between the data in each bin.  Click the down arrow next to Statistics to view the Bin Details pane to display the calculations.

Click the down arrow next to Frequency to change the chart view to Relative Frequency or to change the look by clicking Chart Options.  Statistics on the right of the chart dialog are discussed earlier in this section.  For more information on the generated synthetic data, see Generate Data.  

See Scoring New Data for information on the Stored Model Sheet, RRandTrees_Stored.