Important Terms

The following lists the first terms you need to know when using RapidMiner Studio. The terms are followed by descriptions of the RapidMiner data types and the operator ports.

Attribute

The information elements describing a scenario. Attributes are the table columns of a data set.

The example set included in this Getting Started guide has the attributes gender, age, payment method, last interaction, and churn.

Classification

The process of predicting which category (or class) an example belongs to, based on existing data for which category membership is known. A category is defined as the possible values for a label. (Similarly, regression is the process for predicting numerical results.) That is, with classification you construct a model that, when trained, uses the learned rules to predict the category of new data.

Each example in the data set falls into the category of either churning or not churning. The prediction of which category each example falls into, for those examples missing the label data, is derived from the rules learned during training.

Data set

The training set is the data used to discover predictive relationships and train models. The test set is the data used to test the accuracy and meaningfulness of a model's representation of the predictive relationship (typically discovered using the training set). The new data set is the data with missing labels; the rules derived from the training set are applied to predict the outcome for the new data set.

In this tutorial, you train and test your model using the customer-churn-data data set. Originally an Excel file, customer-churn-data became an available data set when you imported it into RapidMiner.
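The three kinds of data set can be illustrated outside RapidMiner with a minimal Python sketch. It is not RapidMiner code: the function name `split_data`, the 30% test fraction, the random seed, and the generated rows are all assumptions made for illustration. Rows whose label is missing become the new data; the labeled rows are divided into a training set and a test set.

```python
import random

def split_data(rows, label="churn", test_fraction=0.3, seed=42):
    """Split rows into (training_set, test_set, new_data).

    Hypothetical helper: rows with a missing label (None) are the new data;
    the labeled rows are shuffled and divided into training and test sets.
    """
    labeled = [row for row in rows if row.get(label) is not None]
    new_data = [row for row in rows if row.get(label) is None]

    shuffled = labeled[:]                    # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)    # seeded so the split is repeatable

    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test], new_data

# Made-up rows: ten labeled examples and two with a missing label.
rows = [{"age": 20 + i, "churn": "yes" if i % 2 else "no"} for i in range(10)]
rows += [{"age": 50, "churn": None}, {"age": 51, "churn": None}]

training_set, test_set, new_data = split_data(rows)
print(len(training_set), len(test_set), len(new_data))
```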

Example

Characterized by its attributes, an example has concrete values that can be compared with other examples. Examples are the table rows of a data set.

The example set customer-churn-data includes 993 examples (also known as rows). They are identified by a row number that RapidMiner prepends.

Example Set

The table created from the attributes (columns) and examples (rows). Also known as data or data set.

The example set used here is customer-churn-data, which originated from the file customer-churn-data.xlsx.
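To make the table vocabulary concrete, here is a minimal Python sketch (not RapidMiner code) that models an example set as a list of rows. Only the attribute names come from this guide; the three rows and all of their values are invented for illustration.

```python
# Hypothetical rows; every value below is made up for illustration only.
example_set = [
    {"gender": "male", "age": 61, "payment method": "credit card",
     "last interaction": "2014-11-02", "churn": "yes"},
    {"gender": "female", "age": 34, "payment method": "cash",
     "last interaction": "2014-12-20", "churn": "no"},
    {"gender": "female", "age": 45, "payment method": "credit card",
     "last interaction": "2014-10-15", "churn": "no"},
]

# Attributes are the columns: the keys shared by every row.
attributes = list(example_set[0].keys())

# Examples are the rows: one set of concrete values per customer.
n_examples = len(example_set)

# The values of a single attribute across all examples form one column.
ages = [example["age"] for example in example_set]

print(attributes)
print(n_examples, ages)
```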

Label

The identifying attribute in relation to the current question. The goal is to know or learn this attribute's (the label's) value, or learn rules for deriving it from the regular attributes, for each row in the example set. Sometimes referred to as the target attribute or variable, it is the thing to predict for new examples that are not yet characterized. There can be only one label per data set.

Churn is the attribute of interest in this tutorial’s data set. Setting the role of the Churn attribute to label allows you to predict, for each example, whether the customer will cancel.

Model

The data mining method or prediction instruction. A model explains the discovered rules and/or predicts unknown situations for current and future examples.

In this tutorial you created a model that predicts whether a customer will cancel. Your evaluation (validation) of the model returns accuracy percentages.
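An accuracy percentage is the share of examples for which the predicted label matches the known label. The following Python sketch shows the arithmetic only; the function name `accuracy` and the two short label lists are assumptions for illustration, not output from the tutorial.

```python
def accuracy(actual, predicted):
    """Return the percentage of predictions that match the actual labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one example is required")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * correct / len(actual)

# Made-up labels: four of the five predictions are correct.
actual = ["yes", "no", "no", "yes", "no"]
predicted = ["yes", "no", "yes", "yes", "no"]
print(accuracy(actual, predicted))  # 80.0
```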

Operator

The building blocks, grouped by function, used to create RapidMiner processes. An operator has input and output ports; the action performed on the input ultimately leads to what is supplied to the output. Operator parameters control those actions. There are more than 1500 operators available in RapidMiner. Operators, in the Operators panel of the Design view, are both browsable and searchable.

In this tutorial you connect the Retrieve operator (which “retrieves” the data set) to the Filter Examples operator. The resulting labeled data set is connected to the Decision Tree operator to determine the set of rules RapidMiner will use to generate its predictions.

Panel

Each view has its own set of panels, or tools, related to the view. They can be moved, sized, and hidden to suit. You can access additional panels from the View > Show Panel pull-down menu.

See the graphic with callouts to identify panels. The following lists the default panels for each view:

  • Design: Operators, Repository, Process, Parameters, Help
  • Results: Repository, Result History
  • Hadoop Data (if the extension is installed): Hadoop Data, Hadoop Metadata, Hadoop Data Log

Parameter

The setting(s) whose value(s) determine the characteristics or behavior of an operator. RapidMiner presents parameters in the Parameters panel of the Design view. There are regular parameters and expert parameters. The expert parameters are indicated by italic names and are displayed or hidden by clicking the Show/Hide advanced parameters link at the bottom of the panel.

As part of the Wisdom of Crowds capabilities, RapidMiner Studio provides parameter recommendations based on the knowledge and best practices of other RapidMiner users. The recommender helps configure operators by providing recommendations on which parameters to change and by suggesting appropriate parameter values.

This tutorial uses the filtering parameters of the Filter Examples operator to create a training data set.

Ports

The points through which data moves, represented by labeled semicircle icons on the sides of operators and of the Design view. See the list of port abbreviations below.

To see your filtered example set, connect the Output (out) port of the Retrieve operator to the ExampleSet (exa) port of Filter Examples. Then, connect the ExampleSet (exa) port on Filter Examples to the Results (res) port at the right of the Process view and click Run.

Prediction

The most probable value for a target attribute; predictions are derived by data mining. If you have rules and data, you can predict an outcome.

The process in this tutorial may predict, for example: If the customer is male, over 54 years of age, and paid by credit card, then the probability of this customer canceling is high.
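The example rule above can be written as a small predicate. This is a Python sketch of that one rule only, not the model RapidMiner produces; the function name `churn_rule_applies` and the two sample customers are hypothetical.

```python
def churn_rule_applies(example):
    """Return True when the example matches the rule:
    male, over 54 years of age, and paid by credit card."""
    return (
        example["gender"] == "male"
        and example["age"] > 54
        and example["payment method"] == "credit card"
    )

# Made-up customers for illustration.
customer_a = {"gender": "male", "age": 61, "payment method": "credit card"}
customer_b = {"gender": "male", "age": 54, "payment method": "credit card"}

print(churn_rule_applies(customer_a))  # True: all three conditions hold
print(churn_rule_applies(customer_b))  # False: 54 is not over 54
```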

Process

A set of interconnected operators represented by a flow design, where each operator manipulates your data. A process might, for example, load a data set, transform the data, compute a model, and apply the model to another data set.

This tutorial creates a process that retrieves a data set from the repository, filters the data to create a training set, applies a decision tree operator to derive rules for predictions, applies the model to unlabeled data, and runs validation to evaluate the model.
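As a rough analogy, a process can be pictured as a chain of functions in which each operator receives data on its input and hands the result to the next operator. The Python sketch below assumes a purely linear chain, which is a simplification (a real process is a graph of ports and connections). The function names, the helper `run_process`, and the three rows are hypothetical.

```python
def retrieve(_):
    """Stand-in for a retrieve step: ignores its input and returns stored rows."""
    # Made-up rows; the last one has a missing label.
    return [
        {"age": 61, "churn": "yes"},
        {"age": 34, "churn": "no"},
        {"age": 45, "churn": None},
    ]

def filter_examples(rows):
    """Stand-in for a filter step: keep only rows whose label is present."""
    return [row for row in rows if row["churn"] is not None]

def run_process(operators, data=None):
    """Run the operators in order, feeding each output to the next input."""
    for operator in operators:
        data = operator(data)
    return data

result = run_process([retrieve, filter_examples])
print(result)
```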

Process view

The working area for building processes. This is the canvas in the Design view where you drag operators or where, when you double-click a process, the operators of that process appear.

When building your process, you first dragged your data set, customer-churn-data, onto the Process panel. Next you added a Filter Examples operator and connected them.

Repository

The storage mechanism for data, RapidMiner processes and, starting with version 9.7, everything else. Best practice recommends you use the repository for data storage instead of reading directly from a file or database. If you use a Read operator, metadata will not be available to RapidMiner, limiting the available functions.

By default, RapidMiner Studio comes configured with a variety of sample data sets and processes in the Samples directory of your repository. When this tutorial is complete, your Local Repository will include data, processes and Connections folders. If you have access to a RapidMiner AI Hub, the Repository panel gives access to the RapidMiner AI Hub Repository and, from version 9.7 on, lets you connect to versioned Projects stored on RapidMiner AI Hub.

Role

The identifying tag for or function of an attribute. Roles tell RapidMiner of special meaning or treatment for an attribute. RapidMiner has several pre-defined roles and supports the ability to create your own roles. The label role is of utmost importance in defining the target for a prediction. Any attribute without a role assigned is known as a regular attribute.

Apply the label role to the churn attribute. If the data set included row numbers, assign that attribute the id role. All other attributes are not assigned a role and are therefore regular attributes.
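The role assignment described above can be pictured as a lookup in which any attribute without an entry is regular. This is a Python sketch, not RapidMiner code; the attribute name `row number` and the helper `role_of` are assumptions for illustration.

```python
# Special roles only; attributes that are not listed here are regular.
roles = {"churn": "label", "row number": "id"}

def role_of(attribute):
    """Return the role of an attribute, defaulting to 'regular'."""
    return roles.get(attribute, "regular")

attributes = ["row number", "gender", "age", "payment method",
              "last interaction", "churn"]

regular_attributes = [a for a in attributes if role_of(a) == "regular"]
label_attributes = [a for a in attributes if role_of(a) == "label"]

# There can be only one label per data set.
assert len(label_attributes) <= 1

print(regular_attributes)
print(label_attributes)
```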

Training

The process of finding predictive relationships. The outcome of this learning process is the model.

After you assign the label role to the Churn attribute, the Decision Tree operator considers the age, gender, payment method, and last purchase to create rules for the new data.
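To show what "the outcome of learning is a model" can mean in code, the following Python sketch trains a deliberately simple one-rule learner: it picks the single attribute whose values best predict the label on the training set. This is a swapped-in technique chosen for brevity, not the algorithm behind the Decision Tree operator. The function names and the four training rows are hypothetical.

```python
from collections import Counter, defaultdict

def train_one_rule(training_set, label):
    """Learn one rule: the attribute whose value-to-label mapping
    makes the fewest errors on the training set."""
    best = None
    for attribute in (a for a in training_set[0] if a != label):
        # Count how often each label occurs for each attribute value.
        counts = defaultdict(Counter)
        for example in training_set:
            counts[example[attribute]][example[label]] += 1
        # Predict the most frequent label for each value.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(ex[label] != rule[ex[attribute]] for ex in training_set)
        if best is None or errors < best[2]:
            best = (attribute, rule, errors)
    attribute, rule, _ = best
    # Fallback for attribute values never seen during training.
    default = Counter(ex[label] for ex in training_set).most_common(1)[0][0]
    return {"attribute": attribute, "rule": rule, "default": default}

def apply_model(model, example):
    """Predict the label of one example with the learned rule."""
    return model["rule"].get(example[model["attribute"]], model["default"])

# Made-up training rows for illustration.
training_set = [
    {"gender": "male", "payment method": "credit card", "churn": "yes"},
    {"gender": "male", "payment method": "cash", "churn": "no"},
    {"gender": "female", "payment method": "credit card", "churn": "yes"},
    {"gender": "female", "payment method": "cash", "churn": "no"},
]

model = train_one_rule(training_set, label="churn")
print(model["attribute"], model["rule"])
print(apply_model(model, {"gender": "male", "payment method": "cash"}))
```

In these made-up rows the payment method separates the labels perfectly while gender does not, so the learner selects payment method.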

View

A "work area" in which you access a specific functionality. There are two pre-defined views. Some extensions can add their own views (for example, the Radoop Extension). You can also create your own view by clicking New view... in the View menu.

See the graphic with callouts to locate each view:

  • Design: Canvas and tools for building and managing processes.
  • Results: Visualization, in many varied formats, of design process results.
  • Hadoop Data: Access to Radoop-related work.

RapidMiner data types

The following terms describe the data types RapidMiner assigns to attributes. Defining a data type specifies the kind of values allowed for an attribute. RapidMiner supports the natural division of numbers, texts, and dates. Numeric is the label for numbers, nominal for texts or strings, and date_time for dates.

attribute

Parent of all possible types ("any type").

binominal

Exactly two values (for example true/false or yes/no).

date

Date without time (for example 23.12.2014).

date_time

Both date and time (for example 23.12.2014 17:59).

file_path

Nominal data type (rarely used) that allows for more granular distinction. Can be used to mark a column as "only containing file paths."

integer

A whole number (for example, 23, -5, or 11,024,768).

nominal

All kinds of text values; includes polynominal and binominal.

numeric

All kinds of number values; includes date, time, integer, and real numbers.

polynominal

Many different string values (for example red, green, blue, yellow).

real

A fractional number (for example 11.23 or -0.0001).

text

Nominal data type that allows for more granular distinction (to differentiate from polynominal).

time

Time without date (for example 17:59).
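The distinctions between several of these types can be expressed as simple checks on a column's values. The Python sketch below is an illustration of the definitions above, not a description of how RapidMiner assigns types. The function name `guess_type` is hypothetical, date and time types are left out, and treating any nominal column that does not have exactly two distinct values as polynominal is an assumption of this sketch.

```python
def guess_type(values):
    """Guess a type name for a column of Python values (illustrative only)."""
    if not values:
        raise ValueError("at least one value is required")
    # Whole numbers only (bool is excluded on purpose).
    if all(type(v) is int for v in values):
        return "integer"
    # Any mix of whole and fractional numbers.
    if all(type(v) in (int, float) for v in values):
        return "real"
    # Text values: exactly two distinct values means binominal.
    if all(isinstance(v, str) for v in values):
        return "binominal" if len(set(values)) == 2 else "polynominal"
    # Anything else falls back to the parent of all types.
    return "attribute"

print(guess_type([23, -5, 11024768]))
print(guess_type([11.23, -0.0001]))
print(guess_type(["yes", "no", "yes"]))
print(guess_type(["red", "green", "blue", "yellow"]))
```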

Operator port information

The following table lists each port abbreviation and provides a brief description.

| Port abbreviation | Meaning | Description |
| --- | --- | --- |
| ano | Anova | ANOVA matrix for ANOVA significance test |
| ann | Annotation | Annotations extracted from the input object |
| arc | Archive | Archive file generated during execution of the operator |
| ass | Association | Association rules that have been discovered in a frequent item set |
| att | Attribute | Attribute weights (in and out) |
| ave | Average | Performance measures; estimate of performance using the model built on the complete delivered data set |
| clu | Cluster model | Cluster model created when clustering an example set |
| clu | Clustered set | Example set given to the clustering operator; may contain an attribute with a cluster role (describes the cluster of each example) |
| col | Collection | Collection of objects |
| con | Condition | Any object can be supplied; the condition specified in parameters is tested on this object |
| cov | Covariance | Covariance matrix |
| dic | Dictionary | Example set used for replacing 'from' values with 'to' values in a given example set |
| dis | Distance measure | SimilarityMeasure object |
| doc | Document | Document or document set |
| err | Error | Standard error output |
| est | Estimated performance | Performance vector of the SVM model, which gives an estimation of the statistical performance of this model |
| exa | Example set | Example set |
| fil | File | File object |
| fla | Flat | Flat collection or flat clustering model |
| for | Formula | Formula result |
| fre | Frequent | Frequent item or item sets for association rule learning |
| gro | Grouped | Grouped models, attributes, items |
| hie | Hierarchical | Hierarchical clustering model |
| inp | Input | Input source; can take various objects |
| ite | Item sets | Frequent item sets (groups of items that often appear together in the data) |
| joi | Join | Join of the left and right example sets |
| lab | Labeled data | The model given as input is applied to the example set, and the updated example set is delivered from this port |
| lef | Left | Left input port expecting an example set, which is used as the left example set for a join |
| lif | Lift chart | Lift Pareto chart for the given model and example set |
| mat | Matrix | Correlations matrix of all attributes of the input example set |
| mer | Merged | Merged example set |
| mod | Model | Default model from this output port |
| obj | Object | IO object |
| ori | Original | Input example set is passed without changing to this port |
| out | Output | Output port |
| par | Parameter set | Set of parameters that can be applied on an operator |
| pat | Patterns | GSP algorithm is applied on the given example set; the resulting set of sequential patterns is delivered through this port |
| per | Performance | Performance Vector for selected attributes |
| pre | Preprocessing | Preprocessing model with information regarding the operator's parameters in the current process |
| ran | Random forest | Model of a random forest |
| ref | Reference | Provided reference data or reference set |
| req | Request set | Provided example set |
| res | Result set | Distance or similarity between examples of the request set and reference set |
| rig | Right | Right input port expecting an example set, which is used as the right example set for a join |
| roc | ROC curve | Calculated ROC curves for included models |
| rul | Rules | Association rules that have been discovered in a frequent item set |
| sec | Second | Input takes an example set derived from the output of the Generate ID operator in an attached example process |
| seg | Segment | Segment of an image |
| sel | Selected | Object specified by the index parameter is returned through this port |
| ses | Session | Session example set |
| sig | Significance | Significance test results of the performance vector comparison are delivered through this port |
| sim | Similarity | Calculated similarity between each example of the given example set and every other example of the same set |
| sin | Single | Single object of the given collection, which is processed in the inner part of the operator |
| sta | Stacking | Stacking examples or model |
| sto | Stored | Through this port, the input object is passed without changing to the output |
| sub | Subtrahend | Expects an example set; example set must have ID attribute |
| sup | Superset | Superset of input example sets |
| thr | Through | Objects are passed through without changing |
| thr | Threshold | Threshold output of the Select Recall operator |
| tra | Training | Training data to train a model (example set) |
| uni | Union | Union of the input example sets |
| unl | Unlabeled | Examples that are not labeled and therefore not used when training a model |
| unm | Unmatched | Examples that did not match a specified pattern in the original example set |
| unr | Unrelated | Examples that were unrelated to a specified pattern in the original example set |
| vis | Visualization | Self-organizing map (SOM) visualization |
| wei | Weights | Attribute weights |
| wor | Word | Expects or outputs a word list |
| xsl | XSLT | Extensible Stylesheet Language Transformations (XSLT) document |