RapidMiner Studio documentation, version 8.0

Important Terms

The following list introduces the first terms you need to know when using RapidMiner Studio. After the terms come descriptions of the RapidMiner data types and the operator ports.

Attribute

The information elements describing a scenario. Attributes are the table columns of a data set.

The example set included in this Getting Started guide has the attributes gender, age, payment method, last interaction, and churn.

Classification

The process of predicting which category (or class) an example belongs to, based on existing data for which category membership is known. The categories are the possible values of the label. (Similarly, regression is the process of predicting numerical results.) That is, with classification you construct a model that, once trained, uses the learned rules to predict the category of new data.

Each example in the data set falls into the category of either churning or not churning. The prediction of which category each example falls into, for those examples missing the label data, is derived from the rules learned during training.

Data set

The training set is the data used to discover predictive relationships and train models. The test set is the data used to test the accuracy and meaningfulness of a model's representation of the predictive relationship (typically discovered using the training set). The new data set is the data with missing labels; the rules derived from the training set are applied to predict the outcome for the new data set.

In this tutorial, you train and test your model using the customer-churn-data data set. Originally an Excel file, customer-churn-data became an available data set when you imported it into RapidMiner.
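
The three kinds of data set can be pictured with a short Python sketch. It is only an illustration of the idea, not how RapidMiner stores or splits data: the rows are invented, and the function names and the 50/50 split are assumptions made for this example.

```python
def split_by_label(examples, label="churn"):
    """Separate rows with a known label from rows whose label is missing."""
    labeled = [e for e in examples if e[label] is not None]
    new_data = [e for e in examples if e[label] is None]
    return labeled, new_data


def split_train_test(labeled, train_fraction=0.5):
    """Cut the labeled rows, in order, into a training part and a test part."""
    cut = int(len(labeled) * train_fraction)
    return labeled[:cut], labeled[cut:]


# Invented rows for illustration only; not taken from customer-churn-data.
rows = [
    {"age": 61, "churn": "yes"},
    {"age": 34, "churn": "no"},
    {"age": 52, "churn": "no"},
    {"age": 29, "churn": "yes"},
    {"age": 47, "churn": None},  # label missing: belongs to the new data set
]

labeled, new_data = split_by_label(rows)
training_set, test_set = split_train_test(labeled)
```

The training and test sets both come from rows with a known label, while the new data set holds the rows whose label is still to be predicted.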

Example

Characterized by its attributes, an example has concrete values that can be compared with other examples. Examples are the table rows of a data set.

The example set customer-churn-data includes 993 examples (also known as rows). They are identified by a row number that RapidMiner prepends.

Example Set

The table created from the attributes (columns) and examples (rows). Also known as data or data set.

The example set used here is customer-churn-data, which originated from the file customer-churn-data.xlsx.
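
As a rough mental model, an example set can be sketched in Python as a list of rows, where each row maps attribute names to values. The rows below are invented and use only a subset of the tutorial's attributes; the variable names are assumptions for this sketch, not RapidMiner identifiers.

```python
# Invented rows for illustration only; not taken from customer-churn-data.
example_set = [
    {"gender": "male", "age": 61, "payment method": "credit card", "churn": "yes"},
    {"gender": "female", "age": 34, "payment method": "cash", "churn": "no"},
    {"gender": "male", "age": 47, "payment method": "cheque", "churn": None},
]

# Attributes are the columns: the keys shared by every row.
attributes = list(example_set[0].keys())

# Examples are the rows; here each one is paired with a row number starting at 1.
numbered_examples = list(enumerate(example_set, start=1))
```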

Label

The identifying attribute in relation to the current question. The goal is to know or learn this attribute's (the label's) value, or learn rules for deriving it from the regular attributes, for each row in the example set. Sometimes referred to as the target attribute or variable, it is the thing to predict for new examples that are not yet characterized. There can be only one label per data set.

Churn is the attribute of interest in this tutorial’s data set. Setting the role of the Churn attribute to label allows you to predict, for each example, whether the customer will cancel.

Model

The data mining method or prediction instruction. A model explains the discovered rules and/or predicts unknown situations for current and future examples.

In this tutorial you created a model that predicts whether a customer will cancel. Your evaluation (validation) of the model returns accuracy percentages.
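
An accuracy percentage of this kind is the share of examples whose predicted label matches the known label. The following is a minimal sketch of that arithmetic, assuming two equally long lists of labels; the function name is hypothetical and this is not RapidMiner's validation code.

```python
def accuracy_percent(predicted, actual):
    """Return the percentage of predictions that match the known labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one example is required")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return 100.0 * correct / len(actual)


# Three of the four invented predictions match the invented known labels.
result = accuracy_percent(["yes", "no", "no", "yes"], ["yes", "no", "yes", "yes"])
```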

Operator

The building blocks, grouped by function, used to create RapidMiner processes. An operator has input and output ports; the action performed on the input ultimately leads to what is supplied to the output. Operator parameters control those actions. There are more than 1500 operators available in RapidMiner. Operators, in the Operators panel of the Design view, are both browsable and searchable.

In this tutorial you connect the Retrieve operator (which “retrieves” the data set) to the Filter Examples operator. The resulting labeled data set is connected to the Decision Tree operator to determine the set of rules RapidMiner will use to generate its predictions.

Panel

Each view has its own set of panels, or tools, related to the view. They can be moved, sized, and hidden to suit. You can access additional panels from the View > Show Panel pull-down menu.

See the graphic with callouts to identify panels. The following lists the default panels for each view:

Design: Operators, Repository, Process, Parameters, Help
Results: Repository, Result History
Hadoop Data (if the extension is installed): Hadoop Data, Hadoop Metadata, Hadoop Data Log

Parameter

The setting(s) whose value(s) determine the characteristics or behavior of an operator. RapidMiner presents parameters in the Parameters panel of the Design view. There are regular parameters and expert parameters. The expert parameters are indicated by italic names and are displayed or hidden by clicking the Show/Hide advanced parameters link at the bottom of the panel.

As part of the Wisdom of Crowds capabilities, RapidMiner Studio provides parameter recommendations based on the knowledge and best practices of other RapidMiner users. The recommender helps configure operators by providing recommendations on which parameters to change and by suggesting appropriate parameter values.

This tutorial uses the filtering parameters of the Filter Examples operator to create a training data set.

Ports

The point through which data moves, represented by a labeled semicircle icon on the sides of operators and of the Design view. See the list of port abbreviations below.

To see your filtered example set, connect the Output (out) port of the Retrieve operator to the ExampleSet (exa) port of Filter Examples. Then, connect the ExampleSet (exa) port on Filter Examples to the Results (res) port at the right of the Process view and click Run.

Prediction

The most probable value for a target attribute; predictions are derived by data mining. If you have rules and data, you can predict an outcome.

The process in this tutorial may predict, for example: If the customer is male, over 54 years of age, and paid by credit card, then the probability of this customer canceling is high.
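
A rule like this one can be written out as a plain condition. The sketch below hand-codes that single example rule in Python; it is a stand-in for illustration, not the decision tree the tutorial actually produces, and the function name and dictionary keys are assumptions.

```python
def likely_to_cancel(example):
    """Apply one hand-written rule: male, over 54, paid by credit card."""
    return (
        example["gender"] == "male"
        and example["age"] > 54
        and example["payment method"] == "credit card"
    )


# Invented customer for illustration only.
customer = {"gender": "male", "age": 60, "payment method": "credit card"}
high_risk = likely_to_cancel(customer)
```

A trained model typically combines many such rules, and it attaches a confidence to each prediction rather than a bare yes or no.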

Process

A set of interconnected operators represented by a flow design, where each operator manipulates your data. A process might, for example, load a data set, transform the data, compute a model, and apply the model to another data set.

This tutorial creates a process that retrieves a data set from the repository, filters the data to create a training set, applies a decision tree operator to derive rules for predictions, applies the model to unlabeled data, and runs validation to evaluate the model.
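
The flow idea, where each operator's output becomes the next operator's input, can be sketched as a chain of functions. The Python below is only an analogy under stated assumptions: the two functions are simplified stand-ins for the Retrieve and Filter Examples operators, the rows are invented, and run_process is a hypothetical helper, not a RapidMiner API.

```python
def retrieve(_unused=None):
    """Stand-in for Retrieve: return a tiny, invented example set."""
    return [
        {"age": 61, "churn": "yes"},
        {"age": 34, "churn": "no"},
        {"age": 47, "churn": None},
    ]


def filter_labeled(examples):
    """Stand-in for Filter Examples: keep rows whose label is present."""
    return [e for e in examples if e["churn"] is not None]


def run_process(operators, data=None):
    """Pass the output of each operator to the input of the next one."""
    for operator in operators:
        data = operator(data)
    return data


training_set = run_process([retrieve, filter_labeled])
```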

Process view

The working area for building processes. This is the canvas in the Design view where you drag operators or where, when you double-click a process, the operators of that process appear.

When building your process, you first dragged your data set, customer-churn-data, onto the Process panel. Next you added a Filter Examples operator and connected them.

Repository

The storage mechanism for data and RapidMiner processes. Best practice recommends you use the repository for data storage instead of reading directly from a file or database. If you use a Read operator, meta data will not be available to RapidMiner, limiting the available functions.

By default, RapidMiner Studio comes configured with a variety of sample data sets and processes in the Samples directory of your repository. When this tutorial is complete, your Local Repository will include a new data set and new processes. From the Repository panel you can also access the Cloud Repository.

Role

The identifying tag for or function of an attribute. Roles tell RapidMiner of special meaning or treatment for an attribute. RapidMiner has several pre-defined roles and supports the ability to create your own roles. The label role is of utmost importance in defining the target for a prediction. Any attribute without a role assigned is known as a regular attribute.

Apply the label role to the churn attribute. If the data set included row numbers, assign that attribute the id role. All other attributes are not assigned a role and are therefore regular attributes.
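
Role assignment can be pictured as a mapping from attribute name to role, with every unmapped attribute counting as regular. The following Python sketch assumes a hypothetical "row number" attribute and hypothetical helper names; it illustrates the bookkeeping only and is not how RapidMiner implements roles.

```python
# Hypothetical role assignments; "row number" is an assumed attribute name.
roles = {"churn": "label", "row number": "id"}
attributes = ["row number", "gender", "age", "payment method", "last interaction", "churn"]


def regular_attributes(attributes, roles):
    """Return the attributes that have no role assigned."""
    return [name for name in attributes if name not in roles]


def label_attribute(roles):
    """Return the label attribute; raise unless exactly one label is assigned."""
    labels = [name for name, role in roles.items() if role == "label"]
    if len(labels) != 1:
        raise ValueError("expected exactly one label, found %d" % len(labels))
    return labels[0]


regular = regular_attributes(attributes, roles)
label = label_attribute(roles)
```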

Training

The process of finding predictive relationships. The outcome of this learning process is the model.

With the label role assigned to the Churn attribute, the Decision Tree operator creates a tree that considers the age, gender, payment method, and last purchase to derive rules for the new data.

View

A "work area" in which you access a specific functionality. There are two pre-defined views. Some extensions can add their own views (for example, the Radoop Extension). You can also create your own view by clicking New view... in the View menu.

See the graphic with callouts to locate each view:

Design: Canvas and tools for building and managing processes.
Results: Visualization, in many varied formats, of design process results.
Hadoop Data: Access to Radoop-related work.

RapidMiner data types

The following terms describe the data types RapidMiner assigns to attributes. Defining a data type specifies the kind of values allowed for an attribute. RapidMiner supports the natural division of numbers, texts, and dates. Numeric is the label for numbers, nominal for texts or strings, and date_time for dates.

attribute

Parent of all possible types ("any type").

binominal

Exactly two values (for example true/false or yes/no).

date

Date without time (for example 23.12.2014).

date_time

Both date and time (for example 23.12.2014 17:59).

file_path

Nominal data type (rarely used) that allows for more granular distinction. Can be used to mark a column as "only containing file paths."

integer

A whole number (for example, 23, -5, or 11,024,768).

nominal

All kinds of text values; includes polynominal and binominal.

numeric

All kinds of number values; includes date, time, integer, and real numbers.

polynominal

Many different string values (for example red, green, blue, yellow).

real

A fractional number (for example 11.23 or -0.0001).

text

Nominal data type that allows for more granular distinction (to differentiate from polynominal).

time

Time without date (for example 17:59).
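
Several of the distinctions above depend only on the values in a column: two distinct values versus many, whole numbers versus fractional ones. The Python sketch below guesses a type name from those descriptions. It is a simplified illustration with a hypothetical function name, not RapidMiner's own type detection, and it ignores the date, time, text, and file_path types.

```python
def guess_value_type(values):
    """Guess a type name from a column of Python values (missing values are None)."""
    present = [v for v in values if v is not None]
    if not present:
        return "attribute"  # no evidence, so fall back to the parent of all types
    # bool is a subclass of int in Python, so exclude it from the numeric checks.
    if all(isinstance(v, int) and not isinstance(v, bool) for v in present):
        return "integer"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in present):
        return "real"
    if len(set(present)) == 2:
        return "binominal"  # exactly two distinct values
    return "polynominal"
```

For instance, a column holding only "yes" and "no" is guessed as binominal, while a column of color names is guessed as polynominal.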

Operator port information

The following table lists each port abbreviation and provides a brief description.

Port Abbreviation	Meaning	Description
ano	Anova	ANOVA matrix for ANOVA significance test
ann	Annotation	Annotations extracted from the input object
arc	Archive	Archive file generated during execution of the operator
ass	Association	Association rules that have been discovered in a frequent item set
att	Attribute	Attribute weights (in and out)
ave	Average	Performance measures; estimate of performance using the model built on the complete delivered data set
clu	Cluster model	Cluster model created when clustering an example set
clu	Clustered set	Example set given to the clustering operator; may contain an attribute with a cluster role (describes the cluster of each example)
col	Collection	Collection of objects
con	Condition	Any object can be supplied; the condition specified in parameters is tested on this object
cov	Covariance	Covariance matrix
dic	Dictionary	Example set used for replacing 'from' values with 'to' values in a given example set
dis	Distance measure	SimilarityMeasure object
doc	Document	Document or document set
err	Error	Standard error output
est	Estimated performance	Performance vector of the SVM model which gives an estimation of statistical performance of this model
exa	Example set	Example set
fil	File	File object
fla	Flat	Flat collection or flat clustering model
for	Formula	Formula result
fre	Frequent	Frequent item or item sets for association rule learning
gro	Grouped	Grouped models, attributes, items
hie	Hierarchical	Hierarchical clustering model
inp	Input	Input source, can take various objects
ite	Item sets	Frequent item sets (groups of items that often appear together in the data)
joi	Join	Join of the left and right example sets
lab	Labeled data	Model that was given in input is applied on the example set and the updated example set is delivered from this port
lef	Left	Left input port expecting an example set, which is used as the left example set for a join
lif	Lift chart	Lift Pareto chart for the given model and example set
mat	Matrix	Correlations matrix of all attributes of the input example set
mer	Merged	Merged example set
mod	Model	Default model from this output port
obj	Object	IO object
ori	Original	Input example set is passed without changing to this port
out	Output	Output port
par	Parameter set	Set of parameters that can be applied on an operator
pat	Patterns	GSP algorithm is applied on the given example set; resultant sequential patterns set is delivered through this port
per	Performance	Performance Vector for selected attributes
pre	Preprocessing	Preprocessing model with information regarding the operator's parameters in the current process
ran	Random forest	Model of a random forest
ref	Reference	Provided reference data or reference set
req	Request set	Provided example set
res	Result set	Distance or similarity between examples of the request set and reference set
rig	Right	Right input port expecting an example set, which is used as the right example set for a join
roc	ROC curve	Calculated ROC curves for included models
rul	Rules	Association rules that have been discovered in a frequent item set
sec	Second	Input takes an example set derived from the output of the Generate ID operator in an attached example process
seg	Segment	Segment of an image
sel	Selected	Object specified by the index parameter is returned through this port
ses	Session	Session example set
sig	Significance	Significance test results of the performance vector comparison are delivered through this port
sim	Similarity	Calculated similarity between each example of the given example set with every other example of the same set
sin	Single	Single object of the given collection, which is processed in the inner part of the operator
sta	Stacking	Stacking examples or model
sto	Stored	Through this port, the input object is passed without changing to the output
sub	Subtrahend	Expects an example set; example set must have ID attribute
sup	Superset	Superset of input example sets
thr	Through	Objects are passed through without changing
thr	Threshold	Threshold output of the Select Recall operator
tra	Training	Training data to train a model (example set)
uni	Union	Union of the input example sets
unl	Unlabeled	Examples that are not labelled and therefore not used when training a model
unm	Unmatched	Examples that did not match a specified pattern in the original example set
unr	Unrelated	Examples that were unrelated to a specified pattern in the original example set
vis	Visualization	Self-organizing map (SOM) visualization
wei	Weights	Attribute weights
wor	Word	Expects or outputs a word list
xsl	XSLT	Extensible Stylesheet Language Transformations (XSLT) document
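
Because two abbreviations in the table (clu and thr) each stand for two different ports, a lookup should return a list of meanings rather than a single one. The Python sketch below copies a handful of rows from the table into a dictionary; the names PORT_MEANINGS and port_meanings are hypothetical, and the dictionary is deliberately incomplete.

```python
# A small subset of the table above, keyed by port abbreviation.
PORT_MEANINGS = {
    "exa": ["Example set"],
    "out": ["Output"],
    "res": ["Result set"],
    "mod": ["Model"],
    "tra": ["Training"],
    "unl": ["Unlabeled"],
    "lab": ["Labeled data"],
    "clu": ["Cluster model", "Clustered set"],  # ambiguous abbreviation
    "thr": ["Through", "Threshold"],            # ambiguous abbreviation
}


def port_meanings(abbreviation):
    """Return every meaning listed for an abbreviation, or an empty list."""
    return PORT_MEANINGS.get(abbreviation.lower(), [])
```

When an abbreviation is ambiguous, the operator the port belongs to tells you which meaning applies.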
