FP-Growth (RapidMiner Studio Core)
SynopsisThis operator efficiently calculates all frequent itemsets from the given ExampleSet using the FP-tree data structure. It is compulsory that all attributes of the input ExampleSet should be binominal.
In simple words, frequent itemsets are groups of items that often appear together in the data. It is important to know the basics of market-basket analysis for understanding frequent itemsets.
The market-basket model of data is used to describe a common form of a many-to-many relationship between two kinds of objects. On the one hand, we have items, and on the other we have baskets, also called 'transactions'. The set of items is usually represented as set of attributes. Mostly these attributes are binominal. The transactions are usually each represented as examples of the ExampleSet. When an attribute value is 'true' in an example; it implies that the corresponding item is present in that transaction. Each transaction consists of a set of items (an itemset). Usually it is assumed that the number of items in a transaction is small, much smaller than the total number of items i.e. in most of the examples most of the attribute values are 'false'. The number of transactions is usually assumed to be very large i.e. the number of examples in the ExampleSet is assumed to be large. The frequent-itemsets problem is that of finding sets of items that appear together in at least a threshold ratio of transactions. This threshold is defined by the 'minimum support' criteria. The support of an itemset is the number of times that itemset appears in the ExampleSet divided by the total number of examples. The 'Purchases' data set at "Samples/data/Purchases" in the repository of RapidMiner is an example of how transactions data usually look like.
The discovery of frequent itemsets is often viewed as the discovery of 'association rules', although the latter is a more complex characterization of data, whose discovery depends fundamentally on the discovery of frequent itemsets. Association rules are derived from the frequent itemsets. The FP-Growth operator finds the frequent itemsets and operators like the Create Association Rules operator uses these frequent itemsets for calculating the association rules.
This operator calculates all frequent itemsets from an ExampleSet by building a FP-tree data structure on the transaction data base. This is a very compressed copy of the data which in many cases fits into main memory even for large data bases. All frequent itemsets are derived from this FP-tree. Many other frequent itemset mining algorithms also exist e.g. the Apriori algorithm. A major advantage of FP-Growth compared to Apriori is that it uses only 2 data scans and is therefore often applicable even on large data sets.
Please note that the given ExampleSet should contain only binominal attributes, i.e. nominal attributes with only two different values. If your ExampleSet does not satisfy this condition, you may use appropriate preprocessing operators to transform it into the required form. The discretization operators can be used for changing the value of numerical attributes to nominal attributes. Then the Nominal to Binominal operator can be used for transforming nominal attributes into binominal attributes.
Please note that the frequent itemsets are mined for the positive entries in your ExampleSet, i.e. for those nominal values which are defined as positive in your ExampleSet. If your data does not specify the positive entries correctly, you may set them using the positive value parameter. This only works if all your attributes contain this value.
This operator has two basic working modes:
- finding at least the specified number of itemsets with highest support without taking the 'min support' into account. This mode is available when the find min number of itemsets parameter is set to true. Then this operator finds the number of itemsets specified in the min number of itemsets parameter. The min support parameter is ignored in this case.
- finding all itemsets with a support larger than the specified minimum support. The minimum support is specified through the min support parameter. This mode is available when the find min number of itemsets parameter is set to false.
- example set (Data Table)
This input port expects an ExampleSet. Please make sure that all attributes of the ExampleSet are binominal.
- example set (Data Table)
The ExampleSet that was given as input is passed without changing to the output through this port. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.
- frequent sets (Frequent Item Sets)
The frequent itemsets are delivered through this port. Operators like the Create Association Rules operator can use these frequent itemsets to generate association rules.
- find_min_number_of_itemsets If this parameter is set to true, this operator finds at least the specified number of itemsets with highest support without taking the min support parameter into account. This operator finds (at least) the number of itemsets specified in the min number of itemsets parameter. The min support parameter is ignored to some extent in this case. The minimal support is decreased automatically until the specified minimum number of frequent itemsets is found. The defined minimal support is lowered by 20 percent each time. Range: boolean
- min_number_of_itemsetsThis parameter is only available when the find min number of itemsets parameter is set true. This parameter specifies the minimum number of itemsets which should be mined. Range: integer
- max_number_of_retriesThis parameter is only available when the find min number of itemsets parameter is set true. This parameter determines how many times the operator should lower the minimal support to find the minimal number of item sets. Each time the minimal support is lowered by 20 percent. Range: integer
- positive_valueThis parameter determines which value of the binominal attributes should be treated as positive. The attributes with that value are considered as part of a transaction. If left blank, the ExampleSet determines which value is used. Range: string
- min_supportThe minimum support criteria is specified by this parameter. Please study the description of this operator for more information about minimum support. Range: real
- max_itemsThis parameter specifies the upper bound for the length of the itemsets i.e. the maximum number of items in an itemset. If set to -1, this parameter imposes no upper bound. Range: integer
- must_containThis parameter specifies the items that should be part of frequent itemsets. It is specified through a regular expression. If there is no specific item that you want to have in the frequent itemset, you can leave this blank. Range:
Introduction to the FP-Growth operator
The 'Iris' data set is loaded using the Retrieve operator. A breakpoint is inserted here so that you can view the ExampleSet. As you can see, the ExampleSet has real attributes. Thus the FP-Growth operator cannot be applied on it directly because the FP-Growth operator requires all attributes to be binominal. We have to do some preprocessing to mold the ExampleSet into the desired form. The Discretize by Frequency operator is applied to change the real attributes to nominal attributes. Then the Nominal to Binominal operator is applied to change these nominal attributes to binominal attributes. Finally, the FP-Growth operator is applied to generate frequent itemsets. The frequent itemsets can be viewed in the Results Workspace. Run this process with different values for different parameters to get a better understanding of this operator.