You are viewing the RapidMiner Studio documentation for version 8.2.

Explain Predictions (Model Simulator)

Synopsis

This operator identifies the attributes that play the largest role when making a prediction.

Description

Given a model and an input, you can generate a prediction, but which attributes play the largest role in forming that prediction? This operator takes a model and an ExampleSet as input and generates a table highlighting the attributes that most strongly support (green) or contradict (red) each prediction. Alternatively, the table can be displayed with two extra columns (support and contradict) containing numeric details.

For each Example in an ExampleSet, this operator generates a neighboring set of data points and uses correlation to identify the local attribute weights in that neighborhood. Although the relationship between attributes and predictions may be highly non-linear globally, the local linear relationship is powerful enough to explain the individual predictions.
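
The local-neighborhood idea can be illustrated with a small Python sketch. It is not the operator's actual implementation: the Gaussian perturbation, the `scales` argument, and the names `local_weights` and `toy_model` are assumptions made for illustration. The sketch draws samples around one data point, scores them with the model, and uses the Pearson correlation between each attribute and the score as that attribute's local weight.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either side is constant."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def local_weights(predict, point, scales, n_samples=500, seed=0):
    """Estimate local attribute weights around `point`.

    predict: function mapping a list of numeric attribute values to a score
    scales:  per-attribute standard deviations for the perturbation
             (an assumption of this sketch, not an operator parameter)
    """
    rng = random.Random(seed)
    # Generate the local neighborhood by perturbing every attribute.
    samples = [
        [rng.gauss(value, scale) for value, scale in zip(point, scales)]
        for _ in range(n_samples)
    ]
    scores = [predict(sample) for sample in samples]
    # One correlation per attribute: sign gives direction, magnitude gives weight.
    return [
        pearson([sample[i] for sample in samples], scores)
        for i in range(len(point))
    ]


# Hypothetical non-linear model: only the first two attributes matter.
def toy_model(x):
    return x[0] ** 2 - 3.0 * x[1]


weights = local_weights(toy_model, point=[2.0, 1.0, 5.0], scales=[0.1, 0.1, 0.1])
```

Although `toy_model` is quadratic in its first attribute, it is close to linear near the chosen point, so the first weight comes out clearly positive, the second clearly negative, and the third close to zero.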

This operator works with all data types and data sizes. It supports both classification and regression problems. The only model type that is not recommended is k-Nearest Neighbors, since this model typically suffers from long runtimes when it has to score the many generated data points.

Input

  • model (Model)

    This input port expects a model.

  • training data (Data Table)

    This input port expects the same ExampleSet that was used to train the model.

  • test data (Data Table)

    This input port expects an ExampleSet with test data.

Output

  • visualization output

    This output port displays the test data with predictions and color highlighting of attributes: green when the value of the attribute supports the prediction, and red when the value of the attribute contradicts the prediction.

  • example set output (Data Table)

    This output port displays the test data with predictions and two extra columns: one that details the attributes that support the prediction and one that details the attributes that contradict the prediction.

  • importances output (Data Table)

    This output port displays the test data in a long table format including the importance of all attributes for each row. This can be useful if the data should be visualized later on.
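
    As a rough illustration of what a long table format means here, the following Python sketch reshapes a wide table (one importance column per attribute) into one record per row and attribute. The column names `row`, `attribute`, and `importance`, the helper `to_long_format`, and the numbers are all hypothetical; the actual columns produced by this port may differ.

```python
def to_long_format(rows, id_key="row"):
    """Turn one record per row into one record per (row, attribute) pair."""
    long_rows = []
    for row in rows:
        for attribute, importance in row.items():
            if attribute == id_key:
                continue  # the identifier is repeated on every output record
            long_rows.append(
                {id_key: row[id_key], "attribute": attribute, "importance": importance}
            )
    return long_rows


# Made-up importances for two rows and two attributes.
wide = [
    {"row": 1, "age": 0.42, "gender": -0.37},
    {"row": 2, "age": -0.10, "gender": 0.55},
]
long_table = to_long_format(wide)
```

    With one record per row and attribute, charting tools can group or color by attribute without knowing the attribute names in advance.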

Parameters

  • maximal explaining attributes

    The maximum number of attributes used to support a prediction, and also the maximum number of attributes used to contradict it. The point of explanations is that they let you focus on the factors that matter in each particular case. We recommend a value of 3 to achieve this, but you can increase this number if you need more factors to explain the predictions. Note that you may end up with fewer factors if fewer attribute values than this maximum support or contradict a given prediction. Range: integer

  • local sample size

    The number of samples generated locally around each data point to be explained; these samples are used to identify the attributes with the biggest impact on the prediction. You might want to increase this number for high-dimensional data sets if the quality of the explanations degrades. Note that the runtime increases with higher numbers. In general, a value of around 500 delivers high-quality explanations in a reasonable amount of time. Range: integer
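
The effect of the maximal explaining attributes parameter can be sketched in Python as follows. This is an illustration under assumptions, not the operator's code: the function `select_explanations`, the convention that positive scores support and negative scores contradict a prediction, and the example scores are all hypothetical.

```python
def select_explanations(scores, max_attributes=3):
    """Pick the strongest supporting and contradicting attributes.

    scores: dict of attribute name -> signed local weight
            (positive = supports, negative = contradicts; an assumption)
    """
    supporting = sorted(
        (item for item in scores.items() if item[1] > 0),
        key=lambda item: -item[1],  # strongest support first
    )
    contradicting = sorted(
        (item for item in scores.items() if item[1] < 0),
        key=lambda item: item[1],  # strongest contradiction first
    )
    return (
        [name for name, _ in supporting[:max_attributes]],
        [name for name, _ in contradicting[:max_attributes]],
    )


# Made-up scores for a single prediction.
example_scores = {
    "age": 0.42,
    "fare": 0.31,
    "parents_children": 0.18,
    "passenger_class": 0.05,
    "gender": -0.37,
}
support, contradict = select_explanations(example_scores, max_attributes=3)
```

In this example four attributes have positive scores, but only the three strongest are reported as supporting. Only one attribute has a negative score, so the contradicting list holds a single entry even though up to three are allowed.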

Tutorial Processes

Explaining Predictions for Titanic

This process trains a Naive Bayes model on the Titanic data. It then uses the Explain Predictions operator to create the predictions and all local explanations for the second data set.

The process delivers two results. The first is the data with additional columns for the predictions, the confidences, and the new explanations. The second directly visualizes the explanations with colors: green marks a value that strongly supports the prediction, and red marks a value that contradicts it. Have a look at the 3rd row, for example. The model predicts "Yes" for survival even though the gender is male. Most men died in the accident, so the model made this prediction based on the other values: in this case the age of 71, the amount of money paid, and the fact that this person traveled without parents or children.