The information elements describing a scenario. Attributes are the table columns of a data set.
The example set included in this Getting Started guide has the attributes gender, age, payment method, last interaction, and churn.
The process of predicting which category (or class) an example belongs to, based on existing data for which category membership is known. A category is defined as the possible values for a label. (Similarly, regression is the process for predicting numerical results.) That is, with classification you construct a model that, when trained, uses the learned rules to predict the category of new data.
Each example in the data set falls into the category of either churning or not churning. The prediction of which category each example falls into, for those examples missing the label data, is derived from the rules learned during training.
The training set is the data used to discover predictive relationships and train models. The test set is the data used to test the accuracy and meaningfulness of a model's representation of the predictive relationship (typically discovered using the training set). The new data set is the data with missing labels; the rules derived from the training set are applied to predict outcomes for the new data set.
In this tutorial, you train and test your model using the customer-churn-data data set. Originally an Excel file, customer-churn-data became an available data set when you imported it into RapidMiner.
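The three kinds of data set described above can be sketched in plain Python. This is only an illustration, not how RapidMiner works internally: the `split_data` helper, the dictionary rows, the sample values, and the 75/25 split ratio are all assumptions made for the example.

```python
import random

def split_data(rows, label="churn", train_fraction=0.75, seed=0):
    """Split rows into training, test, and new (unlabeled) sets.

    Hypothetical helper: each row is a dict, and a missing label is None.
    """
    labeled = [r for r in rows if r.get(label) is not None]
    new_data = [r for r in rows if r.get(label) is None]

    # Shuffle a copy so the split does not depend on row order.
    shuffled = labeled[:]
    random.Random(seed).shuffle(shuffled)

    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:], new_data

# Made-up rows for illustration only.
rows = [
    {"gender": "male", "age": 61, "churn": "yes"},
    {"gender": "female", "age": 34, "churn": "no"},
    {"gender": "female", "age": 47, "churn": "no"},
    {"gender": "male", "age": 29, "churn": "yes"},
    {"gender": "male", "age": 58, "churn": None},
    {"gender": "female", "age": 40, "churn": None},
]

training, test, new_data = split_data(rows)
print(len(training), len(test), len(new_data))
```

Rows with a known label are divided between training and testing; rows without a label form the new data set to which the learned rules would later be applied.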
Characterized by its attributes, an example has concrete values that can be compared with other examples. Examples are the table rows of a data set.
The example set customer-churn-data includes 993 examples (also known as rows). They are identified by a row number that RapidMiner prepends.
The table created from the attributes (columns) and examples (rows). Also known as data or data set.
The example set used here is customer-churn-data, which originated from the file customer-churn-data.xlsx.
The identifying attribute in relation to the current question. The goal is to know or learn this attribute's (the label's) value, or learn rules for deriving it from the regular attributes, for each row in the example set. Sometimes referred to as the target attribute or variable, it is the thing to predict for new examples that are not yet characterized. There can be only one label per data set.
Churn is the attribute of interest in this tutorial’s data set. Setting the role of the Churn attribute to label allows you to predict, for each example, whether the customer will cancel.
The data mining method or prediction instruction. A model explains the discovered rules and/or predicts unknown situations for current and future examples.
In this tutorial you created a model that predicts whether a customer will cancel. Your evaluation (validation) of the model returns accuracy percentages.
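As a rough illustration of what an accuracy percentage means, the sketch below compares known labels with predicted labels and reports the share that match. The `accuracy` function and the sample values are assumptions for this example; RapidMiner's validation operators report accuracy along with other measures that are not shown here.

```python
def accuracy(actual, predicted):
    """Return the fraction of predictions that match the known labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy of an empty set")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

# Made-up labels for illustration only.
actual = ["yes", "no", "no", "yes", "no"]
predicted = ["yes", "no", "yes", "yes", "no"]

# 4 of the 5 predictions match the known labels.
print(f"{accuracy(actual, predicted):.0%}")
```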
The building blocks, grouped by function, used to create RapidMiner processes. An operator has input and output ports; the action performed on the input ultimately leads to what is supplied to the output. Operator parameters control those actions. There are more than 1500 operators available in RapidMiner. Operators, in the Operators panel of the Design view, are both browsable and searchable.
In this tutorial you connect the Retrieve operator (which “retrieves” the data set) to the Filter Examples operator. The resulting labeled data set is connected to the Decision Tree operator to determine the set of rules RapidMiner will use to generate its predictions.
Each view has its own set of panels, or tools, related to the view. Panels can be moved, resized, and hidden to suit your needs. You can access additional panels from the View > Show Panel pull-down menu.
See the graphic with callouts to identify panels. The following lists the default panels for each view:
- Design: Operators, Repository, Process, Parameters, Help
- Results: Repository, Result History
- Hadoop Data (if the extension is installed): Hadoop Data, Hadoop Metadata, Hadoop Data Log
The setting(s) whose value(s) determine the characteristics or behavior of an operator. RapidMiner presents parameters in the Parameters panel of the Design view. There are regular parameters and expert parameters. The expert parameters are indicated by italic names and are displayed or hidden by clicking the Show/Hide advanced parameters link at the bottom of the panel.
As part of the Wisdom of Crowds capabilities, RapidMiner Studio provides parameter recommendations based on the knowledge and best practices of other RapidMiner users. The recommender helps configure operators by providing recommendations on which parameters to change and by suggesting appropriate parameter values.
This tutorial uses the filtering parameters of the Filter Examples operator to create a training data set.
The point through which data moves, represented by a labeled semicircle icon on the sides of operators and of the Design view. See the list of port abbreviations below.
To see your filtered example set, connect the Output (out) port of the Retrieve operator to the ExampleSet (exa) port of Filter Examples. Then, connect the ExampleSet (exa) port on Filter Examples to the Results (res) port at the right of the Process view and click Run.
The most probable value for a target attribute; predictions are derived by data mining. If you have rules and data, you can predict an outcome.
The process in this tutorial may predict, for example: If the customer is male, over 54 years of age, and paid by credit card, then the probability of this customer canceling is high.
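The example rule above can be written as a small function. This is a hand-written sketch for illustration, not output from RapidMiner: the `churn_risk` name, the dictionary keys, and the "unknown" fallback are assumptions, since the text states only this one rule.

```python
def churn_risk(customer):
    """Hypothetical rule mirroring the example in the text."""
    if (
        customer["gender"] == "male"
        and customer["age"] > 54
        and customer["payment_method"] == "credit card"
    ):
        return "high"
    # The example gives no other rules, so nothing more can be said here.
    return "unknown"

print(churn_risk({"gender": "male", "age": 60, "payment_method": "credit card"}))
```

A learned model typically consists of many such rules, each derived from the training data rather than written by hand.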
A set of interconnected operators represented by a flow design, where each operator manipulates your data. A process might, for example, load a data set, transform the data, compute a model, and apply the model to another data set.
This tutorial creates a process that retrieves a data set from the repository, filters the data to create a training set, applies a decision tree operator to derive rules for predictions, applies the model to unlabeled data, and runs validation to evaluate the model.
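The idea of a process as a chain of operators, each receiving the output of the previous one, can be sketched as function composition. Everything here is hypothetical: the operator functions, the `run_process` helper, and the data are stand-ins, and the "model" is simply the most common label, which is far simpler than the decision tree used in the tutorial.

```python
from collections import Counter
from functools import reduce

def retrieve(_):
    """Stand-in for loading a data set (made-up rows)."""
    return [
        {"age": 61, "churn": "yes"},
        {"age": 34, "churn": "no"},
        {"age": 47, "churn": "no"},
        {"age": 58, "churn": None},
    ]

def filter_labeled(rows):
    """Keep only the rows whose label is known."""
    return [r for r in rows if r["churn"] is not None]

def majority_label(rows):
    """Trivial stand-in for learning: return the most common label."""
    return Counter(r["churn"] for r in rows).most_common(1)[0][0]

def run_process(operators, data=None):
    """Feed the output of each operator into the next one."""
    return reduce(lambda current, op: op(current), operators, data)

result = run_process([retrieve, filter_labeled, majority_label])
print(result)
```

Reordering or swapping the functions in the list changes the process without changing any individual operator, which is the property the flow design relies on.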
The working area for building processes. This is the canvas in the Design view where you drag operators or where, when you double-click a process, the operators of that process appear.
When building your process, you first dragged your data set, customer-churn-data, onto the Process panel. Next you added a Filter Examples operator and connected them.
The storage mechanism for data and RapidMiner processes. Best practice recommends you use the repository for data storage instead of reading directly from a file or database. If you use a Read operator, meta data will not be available to RapidMiner, limiting the available functions.
By default, RapidMiner Studio comes configured with a variety of sample data sets and processes in the Samples directory of your repository. When this tutorial is complete, your Local Repository will include a new data set and new processes. From the Repository panel you can also access the Cloud Repository.
The identifying tag for or function of an attribute. Roles tell RapidMiner of special meaning or treatment for an attribute. RapidMiner has several pre-defined roles and supports the ability to create your own roles. The label role is of utmost importance in defining the target for a prediction. Any attribute without a role assigned is known as a regular attribute.
Apply the label role to the churn attribute. If the data set included row numbers, assign that attribute the id role. All other attributes are not assigned a role and are therefore regular attributes.
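One way to picture roles is as a mapping from attribute names to role names, where any attribute missing from the mapping is regular. The sketch below is an assumption-laden illustration rather than RapidMiner's data model: the attribute names (including `row_number`), the `roles` dictionary, and both helper functions are hypothetical.

```python
# Hypothetical attribute names and role assignments.
attributes = ["row_number", "gender", "age", "payment_method",
              "last_interaction", "churn"]
roles = {"churn": "label", "row_number": "id"}

def regular_attributes(attributes, roles):
    """Attributes with no assigned role are regular attributes."""
    return [name for name in attributes if name not in roles]

def label_attribute(roles):
    """Return the single label; a data set can have only one."""
    labels = [name for name, role in roles.items() if role == "label"]
    if len(labels) != 1:
        raise ValueError(f"expected exactly one label, found {len(labels)}")
    return labels[0]

print(label_attribute(roles))
print(regular_attributes(attributes, roles))
```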
The process of finding predictive relationships. The outcome of this learning process is the model.
With the label role assigned to the Churn attribute, the Decision Tree operator builds a decision tree that considers the age, gender, payment method, and last purchase to create rules for the new data.
A "work area" in which you access a specific functionality. There are two pre-defined views. Some extensions can add their own views (for example, the Radoop Extension). You can also create your own view by clicking New view... in the View menu.
See the graphic with callouts to locate each view:
- Design: Canvas and tools for building and managing processes.
- Results: Visualization, in many varied formats, of design process results.
- Hadoop Data: Access to Radoop-related work.
RapidMiner data types
The following terms describe the data types RapidMiner assigns to attributes. Defining a data type specifies the kind of values allowed for an attribute. RapidMiner supports the natural division of numbers, texts, and dates. The type for numbers is numeric, for texts or strings it is nominal, and for dates it is date_time.
- attribute_value: Parent of all possible types ("any type").
- binominal: Exactly two values (for example true/false or yes/no).
- date: Date without time (for example 23.12.2014).
- date_time: Both date and time (for example 23.12.2014 17:59).
- file_path: Nominal data type (rarely used) that allows for more granular distinction. Can be used to mark a column as "only containing file paths."
- integer: A whole number (for example, 23, -5, or 11,024,768).
- nominal: All kinds of text values; includes polynominal and binominal.
- numeric: All kinds of number values; includes date, time, integer, and real numbers.
- polynominal: Many different string values (for example red, green, blue, yellow).
- real: A fractional number (for example 11.23 or -0.0001).
- text: Nominal data type that allows for more granular distinction (to differentiate from polynominal).
- time: Time without date (for example 17:59).
Operator port information
The following table lists each port abbreviation and provides a brief description.
| Abbreviation | Name | Description |
|---|---|---|
| ano | Anova | ANOVA matrix for ANOVA significance test |
| ann | Annotation | Annotations extracted from the input object |
| arc | Archive | Archive file generated during execution of the operator |
| ass | Association | Association rules that have been discovered in a frequent item set |
| att | Attribute | Attribute weights (in and out) |
| ave | Average | Performance measures; estimate of performance using the model built on the complete delivered data set |
| clu | Cluster model | Cluster model created when clustering an example set |
| clu | Clustered set | Example set given to the clustering operator; may contain an attribute with a cluster role (describes the cluster of each example) |
| col | Collection | Collection of objects |
| con | Condition | Any object can be supplied; the condition specified in parameters is tested on this object |
| dic | Dictionary | Example set used for replacing 'from' values with 'to' values in a given example set |
| dis | Distance measure | SimilarityMeasure object |
| doc | Document | Document or document set |
| err | Error | Standard error output |
| est | Estimated performance | Performance vector of the SVM model, which gives an estimate of the statistical performance of this model |
| exa | Example set | Example set |
| fla | Flat | Flat collection or flat clustering model |
| fre | Frequent | Frequent item or item sets for association rule learning |
| gro | Grouped | Grouped models, attributes, items |
| hie | Hierarchical | Hierarchical clustering model |
| inp | Input | Input source, can take various objects |
| ite | Item sets | Frequent item sets (groups of items that often appear together in the data) |
| joi | Join | Join of the left and right example sets |
| lab | Labeled data | The model given as input is applied to the example set, and the updated example set is delivered from this port |
| lef | Left | Left input port expecting an example set, which is used as the left example set for a join |
| lif | Lift chart | Lift Pareto chart for the given model and example set |
| mat | Matrix | Correlations matrix of all attributes of the input example set |
| mer | Merged | Merged example set |
| mod | Model | Default model from this output port |
| ori | Original | Input example set is passed to this port without changes |
| par | Parameter set | Set of parameters that can be applied on an operator |
| pat | Patterns | The GSP algorithm is applied to the given example set; the resulting set of sequential patterns is delivered through this port |
| per | Performance | Performance Vector for selected attributes |
| pre | Preprocessing | Preprocessing model with information regarding the operator's parameters in the current process |
| ran | Random forest | Model of a random forest |
| ref | Reference | Provided reference data or reference set |
| req | Request set | Provided example set |
| res | Result set | Distance or similarity between examples of the request set and reference set |
| rig | Right | Right input port expecting an example set, which is used as the right example set for a join |
| roc | ROC curve | Calculated ROC curves for included models |
| rul | Rules | Association rules that have been discovered in a frequent item set |
| sec | Second | Input takes an example set derived from the output of the Generate ID operator in an attached example process |
| seg | Segment | Segment of an image |
| sel | Selected | Object specified by the index parameter is returned through this port |
| ses | Session | Session example set |
| sig | Significance | Significance test results of the performance vector comparison are delivered through this port |
| sim | Similarity | Calculated similarity between each example of the given example set and every other example of the same set |
| sin | Single | Single object of the given collection, which is processed in the inner part of the operator |
| sta | Stacking | Stacking examples or model |
| sto | Stored | The input object is passed through this port to the output without changes |
| sub | Subtrahend | Expects an example set; example set must have ID attribute |
| sup | Superset | Superset of input example sets |
| thr | Through | Objects are passed through without changes |
| thr | Threshold | Threshold output of the Select Recall operator |
| tra | Training | Training data to train a model (example set) |
| uni | Union | Union of the input example sets |
| unl | Unlabeled | Examples that are not labeled and therefore not used when training a model |
| unm | Unmatched | Examples that did not match a specified pattern in the original example set |
| unr | Unrelated | Examples that were unrelated to a specified pattern in the original example set |
| vis | Visualization | Self-organizing map (SOM) visualization |
| wor | Word | Expects or outputs a word list |
| xsl | XSLT | Extensible Stylesheet Language Transformations (XSLT) document |