You are viewing the RapidMiner Studio documentation for version 10.0.

Model Simulator (Model Simulator)

Synopsis

This Operator provides an easy, real-time method to change the inputs to a model and view the output. It shows predictions, confidences, and explanations for those inputs.

Description

The outputs are designed to achieve three goals. First, users get a better understanding of how the model comes to its conclusions, even for black box models like deep learning neural networks. Second, users can simulate cases where they know the outcome and check whether the model behaves as expected. Third, users can use the built-in optimization method to find optimal input settings in order to achieve a desired outcome. This third use turns predictive models into prescriptive models.

The result is displayed in two panels. In the left panel, users can change the input settings for all attributes, while in the right panel, the outputs are calculated and displayed in real time. Each input attribute (independent variable) of the model is displayed in a row, together with a user interface element corresponding to the value type of the attribute. At the end of each row is a small information symbol; hovering over it displays additional information about the attribute, including statistics and the distribution of values. The length of the gray bar below each attribute name depicts the global importance of this attribute for the model, based on its correlation with the predictions. This is in contrast to the local importance for each specific prediction, which is discussed below.
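To make the idea of a correlation-based global importance concrete, here is a minimal sketch in Python. It is not the operator's implementation, which this page does not describe in detail. The names `pearson`, `global_importance`, `toy_model`, and `rows` are hypothetical, and the sketch assumes numeric attributes and a numeric prediction.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def global_importance(rows, predict):
    """Assumed definition: absolute correlation of each attribute with the predictions.

    rows: list of dicts mapping attribute name -> numeric value.
    predict: hypothetical callable mapping one row to a numeric prediction.
    """
    predictions = [predict(row) for row in rows]
    return {
        name: abs(pearson([row[name] for row in rows], predictions))
        for name in rows[0]
    }


def toy_model(row):
    # Stand-in for a trained model: depends on "age" only.
    return 3.0 * row["age"]


rows = [
    {"age": 10, "fare": 1},
    {"age": 20, "fare": 5},
    {"age": 30, "fare": 5},
    {"age": 40, "fare": 1},
]

importance = global_importance(rows, toy_model)
print(importance)
```

In this toy data set, `age` drives the prediction and `fare` is uncorrelated with it, so `age` would get a long bar and `fare` a very short one.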

Users can select categorical values from a drop-down element, turn binary values on or off, and move numerical sliders to arbitrary values within the range defined by the minimum and maximum. Please note that attributes with value type date are not supported.

The "Optimize" button at the bottom of the input panel opens a dialog that lets the user determine the optimal input values needed to obtain a desired output. Constrained optimizations are also supported. When the optimization is completed, the optimal input values are displayed in the input panel.
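The following Python sketch illustrates the general idea of a constrained search for optimal inputs. This page does not say which optimization algorithm the simulator uses, so plain random search is used here as a stand-in. The function `optimize_inputs`, its parameters, and `toy_score` are hypothetical; `toy_score` plays the role of "how close is the model output to the desired output".

```python
import random


def optimize_inputs(predict, bounds, fixed=None, n_trials=2000, seed=0):
    """Random search for the input row that maximizes predict(row).

    bounds: attribute name -> (minimum, maximum), the allowed slider range.
    fixed: attribute name -> value; a simple form of constraint that pins
           an attribute to a given value (an assumption for this sketch).
    """
    rng = random.Random(seed)
    fixed = fixed or {}
    best_row, best_score = None, float("-inf")
    for _ in range(n_trials):
        # Draw a candidate inside the allowed range of every attribute.
        row = {name: rng.uniform(lo, hi) for name, (lo, hi) in bounds.items()}
        # Apply the constraints by overwriting the pinned attributes.
        row.update(fixed)
        score = predict(row)
        if score > best_score:
            best_row, best_score = row, score
    return best_row, best_score


def toy_score(row):
    # Stand-in for a model output to maximize; unconstrained optimum at x=2, y=1.
    return -((row["x"] - 2.0) ** 2) - (row["y"] - 1.0) ** 2


best_row, best_score = optimize_inputs(
    toy_score,
    bounds={"x": (0.0, 5.0), "y": (0.0, 5.0)},
    fixed={"y": 3.0},  # constraint: y cannot be changed
)
print(best_row, best_score)
```

With `y` pinned to 3.0, the search can only adjust `x`, so the best row it finds has `x` close to 2 rather than the unconstrained optimum.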

All outputs are shown on the right side and are calculated in real time. There are five parts, which differ slightly depending on whether you have a classification or a regression problem.

  • Most Likely / Prediction (top left): You can easily see what the current prediction would be. It shows the most likely class in case of classification and the predicted number in case of regression tasks. A bar chart showing the confidences for other likely classes is also shown in case of classification.
  • Confidence Distribution / Distributions of Prediction (top right): In case of classification, you will see the distribution of all confidence values for this class on a test data set if it was provided. The current confidence is highlighted. In case of regression, you will see how the current prediction relates to the distribution of predictions on a test set. Again, the distribution is only shown if a test set was provided.
  • Important Factors (bottom left): You can see how much the most important attributes contribute to the current prediction. An attribute value can either support a prediction (green bar) or contradict it (red bar). In contrast with the global importance of an attribute described previously (the gray bar in the input panel), the local importance of an attribute is based on its correlation with the predictions in the neighborhood of the selected input. See also the documentation for the Operator Explain Predictions.
  • Accuracy (bottom right): If a test data set was provided, and if it contains a label attribute, you will see how accurately the model performs overall and, in case of classification, for the currently predicted class.
  • Interpretation (bottom): A short summary of some major and outstanding points of all of the results above.
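The local importance described under Important Factors can be sketched as a correlation computed only over points near the selected input. The Python code below is an illustration under stated assumptions, not the operator's algorithm: the function `local_importance`, its `radius` and `n_samples` parameters, the uniform sampling scheme, and `square_model` are all hypothetical. How the simulator maps the result to green and red bars is not specified here; the sketch simply returns a signed correlation.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def local_importance(predict, point, ranges, radius=0.1, n_samples=500, seed=0):
    """Signed correlation of each attribute with the prediction near `point`.

    ranges: attribute name -> (minimum, maximum).
    radius: neighborhood size as a fraction of each attribute's range.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        row = {}
        for name, (lo, hi) in ranges.items():
            width = radius * (hi - lo)
            value = point[name] + rng.uniform(-width, width)
            row[name] = min(hi, max(lo, value))  # stay inside the valid range
        samples.append(row)
    predictions = [predict(row) for row in samples]
    return {
        name: pearson([row[name] for row in samples], predictions)
        for name in ranges
    }


def square_model(row):
    # Stand-in model: the effect of x changes sign depending on where you are.
    return row["x"] ** 2


ranges = {"x": (-5.0, 5.0), "z": (-1.0, 1.0)}
print(local_importance(square_model, {"x": 3.0, "z": 0.0}, ranges))
print(local_importance(square_model, {"x": -3.0, "z": 0.0}, ranges))
```

The toy model shows why local and global importance can disagree: over the whole range, `x` has almost no linear correlation with `x ** 2`, but near `x = 3` the correlation is strongly positive and near `x = -3` it is strongly negative.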

The simulator works well regardless of the training data size. It has been used successfully with more than 10 million data rows. The number of attributes does have an impact, though. For fewer than 1,000 columns, the simulator performs all calculations in real time. For more than 1,000 columns, the real-time updates of the local feature importance are disabled. The automatic optimization of input features is disabled for more than 10,000 input features.
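The column limits above can be summarized as a small lookup. The Python sketch below is only a restatement of those limits; the function name `simulator_capabilities` and the dictionary keys are hypothetical, and the behavior at exactly 1,000 or 10,000 columns is an assumption because the text only specifies "more than".

```python
def simulator_capabilities(n_columns):
    """Return which simulator features are active for a given number of columns.

    Assumption: a feature is disabled only when the count is strictly
    greater than the stated limit.
    """
    return {
        # Local feature importance is updated in real time up to 1,000 columns.
        "realtime_local_importance": n_columns <= 1_000,
        # Automatic optimization of inputs is available up to 10,000 features.
        "automatic_optimization": n_columns <= 10_000,
    }


for n in (500, 5_000, 20_000):
    print(n, simulator_capabilities(n))
```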

The model simulator supports all model types. The one exception is k-Nearest Neighbors models trained on massive amounts of data, since applying this model type is too slow to support interactive, real-time exploration. Hence, we do not recommend using the simulator or the optimization for k-Nearest Neighbors models.

Input

  • model (Model)

    This input port expects a model.

  • input (Data Table)

    This input port expects an ExampleSet identical to the one used to train the model.

  • input (Data Table)

    This input port expects an ExampleSet with test data. This data is optional.

Output

  • simulator output (Data Table)

    This port delivers the model simulator, used to simulate inputs and observe the model's behavior. It also provides an optimization algorithm which finds the optimal input needed to provide a desired output.

  • model (Model)

    The input model is passed through this port to the output unchanged.

Tutorial Processes

Model Simulator for the Titanic data

This process trains a Naive Bayes model on the Titanic data. It then uses the Model Simulator operator to create a new user interface for simulating model input and observing the model's output in real-time. Can you find out how likely it is that you personally would survive when buying a third class ticket? Also, what is the best situation you could be in given your age and gender?