Subgroup Discovery (RapidMiner Studio Core)
Synopsis
This operator performs an exhaustive subgroup discovery. The goal of subgroup discovery is to find rules describing subsets of the population that are sufficiently large and statistically unusual.
Description
This operator discovers subgroups (that is, induces a rule set) by generating hypotheses exhaustively. It starts from the empty hypothesis, which contains no literals, and refines it step by step. The search loop therefore iterates over the depth of the search space, i.e. the number of literals in the generated hypotheses. The maximum depth of the search is set by the max depth parameter. The search space can also be pruned by requiring a minimum coverage of each hypothesis (the min coverage parameter) or by keeping only a given number of hypotheses with the highest coverage.
Rules are derived from the hypotheses according to the user's preference, which is controlled by the rule generation parameter. The operator can derive positive rules only, negative rules only, both rules, or only the rule that is more probable given the examples covered by the hypothesis (i.e. the actual prediction for that subset). All generated rules are evaluated on the ExampleSet by a user-specified utility function (the utility function parameter) and stored in the final rule set if:
- They exceed a minimum utility threshold (which is specified by the min utility parameter) or
- They are among the k best rules (where k is specified by the k best rules parameter).
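The level-wise search described above can be sketched in Python. This is a minimal illustration and not the operator's actual implementation: the function names, the representation of a hypothesis as a tuple of (attribute, value) literals, and the choice of one literal per attribute are all assumptions made for the example.

```python
def coverage(rows, hypothesis):
    """Fraction of rows that satisfy every (attribute, value) literal."""
    if not rows:
        return 0.0
    hits = sum(all(row[a] == v for a, v in hypothesis) for row in rows)
    return hits / len(rows)


def enumerate_hypotheses(rows, attributes, max_depth, min_coverage):
    """Refine the empty hypothesis level by level (hypothetical helper).

    rows is a list of dicts mapping attribute names to nominal values.
    """
    # All attribute-value literals that occur in the data, in a fixed order.
    literals = sorted({(a, row[a]) for row in rows for a in attributes})
    level = [()]  # depth 0: the empty hypothesis, no literals
    kept = []
    for _depth in range(max_depth):
        next_level = []
        for hyp in level:
            used = {a for a, _ in hyp}
            for lit in literals:
                # Assumption: one literal per attribute. Extending only with
                # literals that sort after the last one generates each
                # conjunction exactly once.
                if lit[0] in used or (hyp and lit <= hyp[-1]):
                    continue
                candidate = hyp + (lit,)
                # Coverage can only shrink when literals are added, so a
                # candidate below the threshold is never refined further.
                if coverage(rows, candidate) > min_coverage:
                    next_level.append(candidate)
        kept.extend(next_level)
        level = next_level
    return kept
```

Because coverage never increases under refinement, discarding a low-coverage hypothesis also discards all of its refinements, which is what makes the minimum coverage an effective pruning criterion.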
The problem of subgroup discovery has been defined as follows: given a population of individuals and a property of those individuals that we are interested in, find population subgroups that are statistically most interesting, e.g. are as large as possible and have the most unusual statistical (distributional) characteristics with respect to the property of interest. In subgroup discovery, rules have the form Class <- Cond, where the property of interest is the class value Class, which appears in the rule consequent, and the rule antecedent Cond is a conjunction of features (attribute-value pairs) selected from the features describing the training instances. Because rules are induced from labeled training instances (labeled positive if the property of interest holds, and negative otherwise), the process of subgroup discovery is targeted at uncovering properties of a selected target population of individuals with the given property of interest. In this sense, subgroup discovery is a form of supervised learning. In many respects, however, it is a form of descriptive induction, as the task is to uncover individual interesting patterns in data.
Rule learning is most frequently used in the context of classification rule learning and association rule learning. While classification rule learning is an approach to predictive induction (or supervised learning), aimed at constructing a set of rules to be used for classification and/or prediction, association rule learning is a form of descriptive induction (non-classification induction or unsupervised learning), aimed at the discovery of individual rules which define interesting patterns in data.
Let us emphasize the difference between subgroup discovery (a task at the intersection of predictive and descriptive induction) and classification rule learning (a form of predictive induction). The goal of standard rule learning is to generate models, one for each class, consisting of rule sets that describe class characteristics in terms of properties occurring in the descriptions of training examples. In contrast, subgroup discovery aims at discovering individual rules or 'patterns' of interest, which must be represented in explicit symbolic form and must be relatively simple in order to be recognized as actionable by potential users. Moreover, standard classification rule learning algorithms cannot appropriately address the task of subgroup discovery, because they construct rule sets with the covering algorithm, which hinders their applicability to subgroup discovery. Subgroup discovery is usually seen as different from classification because it addresses a different goal: discovering interesting population subgroups instead of maximizing the classification accuracy of the induced rule set.
Input
- training set (Data Table)
This input port expects an ExampleSet. It is the output of the Generate Nominal Data operator in the attached Example Process. The output of other operators can also be used as input.
Output
- model (Rule Set)
The Rule Set is delivered from this output port.
- example set (Data Table)
The ExampleSet that was given as input is passed through this port unchanged. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.
Parameters
- mode: This parameter specifies the discovery mode.
- minimum_utility: If this option is selected the rules are stored in the final rule set if they exceed the minimum utility threshold specified by the min utility parameter.
- k_best_rules: If this option is selected the rules are stored in the final rule set if they are among the k best rules (where k is specified by the k best rules parameter).
- utility_function: This parameter specifies the desired utility function. Range: selection
- min_utility: This parameter specifies the minimum utility. This parameter is useful when the mode parameter is set to 'minimum utility'. The rules are stored in the final rule set if they exceed the minimum utility threshold specified by this parameter. Range: real
- k_best_rules: This parameter specifies the number of required best rules. This parameter is useful when the mode parameter is set to 'k best rules'. The rules are stored in the final rule set if they are among the k best rules where k is specified by this parameter. Range: integer
- rule_generation: This parameter determines which rules should be generated. This operator allows the derivation of positive rules and negative rules separately or the combination by deriving both rules or only the one which is the most probable due to the examples covered by the hypothesis (hence: the actual prediction for that subset). Range: selection
- max_depth: This parameter specifies the maximum depth of breadth-first search. The loop for this task iterates over the depth of the search space, i.e. the number of literals of the generated hypotheses. The maximum depth of the search can be specified by this parameter. Range: integer
- min_coverage: This parameter specifies the minimum coverage. Only the rules which exceed this coverage threshold are considered. Range: real
- max_cache: This parameter bounds the number of rules which are evaluated (only the most supported rules are used). Range: integer
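How the mode and rule generation parameters interact with the generated hypotheses can be illustrated with a short Python sketch. The helper names are hypothetical, and the option labels 'positive', 'negative', 'both' and 'prediction' are assumed for illustration; only the two mode names are taken from the parameter list above. Tie handling in the 'prediction' case is likewise an assumption.

```python
import heapq


def derive_rules(hypothesis, n_pos, n_neg, rule_generation):
    """Turn one hypothesis into (hypothesis, predicted_class) rules.

    n_pos and n_neg are the numbers of positive and negative examples
    covered by the hypothesis. Option labels are assumed, not confirmed.
    """
    if rule_generation == "positive":
        return [(hypothesis, "positive")]
    if rule_generation == "negative":
        return [(hypothesis, "negative")]
    if rule_generation == "both":
        return [(hypothesis, "positive"), (hypothesis, "negative")]
    if rule_generation == "prediction":
        # Keep only the more probable class among the covered examples.
        # Assumption: ties go to the positive class.
        majority = "positive" if n_pos >= n_neg else "negative"
        return [(hypothesis, majority)]
    raise ValueError(f"unknown rule_generation: {rule_generation!r}")


def select_rules(scored_rules, mode, min_utility=0.0, k_best_rules=10):
    """Filter (utility, rule) pairs according to the discovery mode."""
    if mode == "minimum_utility":
        # Keep every rule whose utility exceeds the threshold.
        kept = [(u, r) for u, r in scored_rules if u > min_utility]
    elif mode == "k_best_rules":
        # Keep the k rules with the highest utility.
        kept = heapq.nlargest(k_best_rules, scored_rules, key=lambda ur: ur[0])
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    # Return the surviving rules ordered from best to worst.
    return sorted(kept, key=lambda ur: ur[0], reverse=True)
```

In this sketch only one of min_utility and k_best_rules is consulted for a given mode, which mirrors the parameter descriptions above.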
Tutorial Processes
Introduction to the Subgroup Discovery operator
The Generate Nominal Data operator is used for generating an ExampleSet. The ExampleSet has two binominal attributes and 100 examples. The Subgroup Discovery operator is applied on this ExampleSet with default values of all parameters. The mode parameter is set to 'k best rules' and the k best rules parameter is set to 10. Moreover, the utility function parameter is set to 'WRAcc'. Thus the Rule Set will be composed of the 10 best rules, where rules are evaluated by the WRAcc function. The resulting Rule Set can be seen in the Results Workspace. You can see that there are 10 rules and that they are sorted in order of their WRAcc values.
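WRAcc (weighted relative accuracy) is commonly defined for a rule Class <- Cond as p(Cond) * (p(Class | Cond) - p(Class)), i.e. the coverage of the rule multiplied by the gain in class probability over the whole population. The following Python sketch assumes this common definition; the operator's exact computation may differ in details, and the function and argument names are hypothetical.

```python
def wracc(n_total, n_class, n_cond, n_cond_and_class):
    """Weighted relative accuracy of a rule Class <- Cond (common definition).

    n_total           number of examples in the ExampleSet
    n_class           examples whose label equals Class
    n_cond            examples covered by Cond
    n_cond_and_class  covered examples whose label equals Class
    """
    if n_total == 0 or n_cond == 0:
        return 0.0  # an empty subgroup is treated as uninteresting
    coverage = n_cond / n_total            # p(Cond)
    precision = n_cond_and_class / n_cond  # p(Class | Cond)
    prior = n_class / n_total              # p(Class)
    return coverage * (precision - prior)
```

For example, with 100 examples of which 50 are positive, a rule covering 40 examples, 30 of them positive, has coverage 0.4 and a probability gain of 0.75 - 0.5 = 0.25, giving a WRAcc of 0.1. A rule that covers the whole population scores 0, which reflects the requirement that a subgroup be both large and unusual.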