FP-Growth (Concurrency)

Synopsis

This Operator efficiently calculates all frequently-occurring itemsets in an ExampleSet, using the FP-tree data structure.

Description

When online shopping, you will sometimes get a suggestion of the following form: "Customers who bought item X also bought item Y." This suggestion is an example of an association rule. To derive it, you first have to know which items on the market most frequently co-occur in customers' shopping baskets, and here the FP-Growth algorithm has a role to play.

The FP-Growth algorithm is an efficient algorithm for calculating frequently co-occurring items in a transaction database. To understand how it works, let's start with some terminology, using a customer transaction as an example:

  • item - any object that is sold on the market
  • basket - a container for one or more items selected by the customer
  • itemset - any subset of items that are sold together, in the same shopping basket
  • transaction - the complete set of items in an individual shopping basket, at the moment of purchase
  • transaction database - the complete set of shopping baskets / transactions recorded by the merchant

Here, the words "basket" and "transaction" are used interchangably, because we identify the customer's shopping basket with the items that were purchased. To make these definitions concrete, consider the following transaction database:

  • transaction1 = (product1, product2, product7)
  • transaction2 = (product2, product5, product7)
  • transaction3 = (product6, product7, product8, product9)
  • transaction4 = (product1, product3, product4, product6, product7)

Nine distinct items are for sale, and there are four baskets / transactions, with a varying number of items. The item appearing most frequently, product7, appears four times in the transactions database. Each of the following itemsets occurs twice: (product1, product7), (product2, product7), (product6, product7).

An FP-tree data structure can be efficiently created, compressing the data so much that, in many cases, even large databases will fit into main memory. In the example above, the FP-tree would have product7, the most frequently occurring product, next to the root, with branches from product7 to product1, product2, and product6. If we insist that a product must appear more than once in the transaction database, then the remaining products are excluded from the FP-tree. The transaction database might have started out as a 4 x 9 (transactions x products) data table, with many zero entries, but now it is reduced to a minimalistic tree that captures only the relevant frequency data.

Even with an efficient tree structure, the number of itemsets considered by the algorithm can grow very large. With the help of the parameter max number of itemsets, you can if necessary reduce runtime and memory.

Remember that online shopping is merely an example; the FP-Growth algorithm can be applied to any problem that can be formulated in terms of items, itemsets, and baskets / transactions. The typical setting for the algorithm is a large transaction database (many baskets), with only a small number of items in each basket -- small compared to the set of all items.

  • support = (Number of times an item or itemset appears in the database) / (Number of baskets in the database)

In general, the concept of "minimum support" creates a cutoff, defining what is meant by frequent or not-so-frequent occurrences of an itemset. If an item or an itemset appears in only a few baskets, it is excluded, via the parameters min support or min frequency. The exclusion of infrequently-occurring items and itemsets helps to compress the data and improves the statistical significance of the results. On the other hand, if the value for min support or min frequency is set too high, the algorithm may find zero itemsets. Hence, this Operator provides two major modes, via the checkbox find min number of itemsets:

1. if unchecked, with a fixed minimum support value, and

2. if checked, with a dynamic minimum support value, to ensure that the result includes a minimum number of itemsets.

FP-Growth supports several different formats for the input data. Please note the following requirements:

  • in the ExampleSet, one transaction = one row. For a discussion of the columns, see below.
  • all the item values must be nominal
  • a transaction ID is optional and, if present, it should have the "id" role, so that it is not identified as an item.

For the columns, the three available input formats are illustrated in the second tutorial, together with necessary pre-processing. Here's the summary:

  • item list in a column: All the items belonging to a transaction appear in a single column, separated by item separators, in a CSV-like format. As with CSV files, the items can be quoted, and escape characters are available. You can trim item names.
  • items in separate columns: All the items belonging to a transaction appear in separate columns. For each transaction, the first item name appears in the first column, the second item name in the second column, etc. The number of columns corresponds to the basket with the maximum number of items. Missing values indicate no item. You can trim item names.
  • items in dummy coded columns: Every item in the set of all items has its own column, and the item name is the column name. For each transaction, the binominal values (true/false) indicate whether the item can be found in the basket. If your data is binominal but does not identify the values as true/false, you may have to set the positive value parameter.

Input

  • example set (IOObject)

    This input port expects an ExampleSet. As discussed in detail in the description, this Operator supports several different formats for the input data.

Output

  • example set (IOObject)

    The ExampleSet that was given as input is passed through without changes.

  • frequent sets (Frequent Item Sets)

    The frequently-occurring itemsets are delivered through this port. Operators such as Create Association Rules can use these frequently-occuring itemsets to generate association rules.

Parameters

  • input_format

    See the second tutorial for examples. As discussed in detail in the description, this Operator supports several different formats for the input data.

    • item list in a column: All the items belonging to a transaction appear in a single column, separated by item separators, in a CSV-like format.
    • items in separate columns: All the items belonging to a transaction appear in separate columns, with the first item name appearing in the first column, the second item name in the second column, etc.
    • items in dummy coded columns: Every item in the set of all items has its own column, and the item name is the column name. For each transaction, the binominal values (true/false) indicate whether the item can be found in the basket.
    Range:
  • item_separators

    This parameter defines the item separator. It can also be provided as a regular expression.

    Range:
  • use_quotes

    Check this parameter to define a quotes character. As in CSV files, if item separators are likely to appear in the item name, quotes can be used to prevent confusion. For example if (,) is the item separator and (") is the quotes character, then the row (a,b,c,d) will be interpreted as 4 items. On the other hand, ("a,b,c,d") will be interpreted as a single item, with value a,b,c,d.

    Range:
  • quotes_character

    This parameter defines the quotes character and is only available if use quotes is checked.

    Range:
  • escape_character

    This parameter defines the escape character, used to escape the quotes character or the item separator. For example, if (") is the quotes character and ('\') is the escape character, then ("yes") is interpreted as (yes) and (\"yes\") is interpreted as ("yes"). If ('|') is the item separator and ('\') is the escape character, then a row (a\|b|c) is interpreted as two items, (a|b) and (c).

    Range:
  • trim_item_names

    If this parameter is checked, whitespace at the beginning and the end of item names is deleted.

    Range:
  • positive_value

    In the case of items in dummy coded columns, with binominal Attributes, this parameter determines which value should be treated as positive, and hence which items belong to a transaction. If this parameter is left blank, the positive value is inferred from the ExampleSet.

    Range:
  • min_requirement

    This parameter makes available two different methods for defining a cutoff, eliminating infrequently-occurring itemsets.

    • support: The minimum support value (ratio of occurrences to ExampleSet size)
    • frequency: The minimum frequency (number of occurrences)
    Range:
  • min_support

    Minimum support = (number of occurrences of an itemset) / (size of the ExampleSet)

    Decrease this value to increase the number of itemsets in the result.

    Range:
  • min_frequency

    Minimum frequency = number of occurrences of an itemset

    Decrease this value to increase the number of itemsets in the result.

    Range:
  • min_items_per_itemset

    The lower bound for the size of an itemset.

    Range:
  • max_items_per_itemset

    The upper bound for the size of an itemset (0: no upper bound).

    Range:
  • max_number_of_itemsets

    The upper bound for the number of itemsets (0: no upper bound). If you run out of memory, either decrease this value or increase the value for min support or min frequency.

    Range:
  • find_min_number_of_itemsets

    If this parameter is checked, the results will contain at least a minimum number of itemsets, those with highest support. The minimum support value is automatically decreased until the minimum number of itemsets is found.

    Range:
  • min_number_of_itemsets

    This parameter is only available when find min number of itemsets is checked. This parameter specifies the minimum number of itemsets that should be included in the results.

    Range:
  • max_number_of_retries

    This parameter is only available when find min number of itemsets is checked. When automatically decreasing the value for minimum support / minimum frequency, this parameter determines how many times the Operator may decrease the value before giving up. Increase this number to get more results.

    Range:
  • requirement_decrease_factor

    This parameter is only available when find min number of itemsets is checked. When automatically decreasing the value for minimum support / minimum frequency, this multiplicative factor determines the new cutoff value. A lower value results in fewer steps to find the desired number of itemsets.

    Range:
  • must_contain_list

    This parameter specifies items that must be included in the frequently-occurring itemsets, if any, via a list of exact item names.

    Range:
  • must_contain_regexp

    This parameter specifies items that must be included in the frequently-occurring itemsets, if any, via a regular expression.

    Range:

Tutorial Processes

Introduction to the FP-Growth Operator

The process shows a market basket analysis. A data set containing transactions is loaded using the Retrieve Operator. A breakpoint is inserted here so that you can view the ExampleSet. We have to do some preprocessing using the Aggregate Operator to mold the ExampleSet into an acceptable input format. A breakpoint is inserted before the FP-Growth Operator so that you can view the input data. The FP-Growth Operator is applied to generate frequent itemsets. Finally, the Create Association Rules Operator is used to create rules from the frequent item sets. The frequent itemsets and the association rules can be viewed in the Results View. Run this process with different values of the parameters to get a better understanding of this Operator.

The input formats of the FP-Growth Operator

Data is loaded and transformed to three different input formats. A breakpoint is inserted before the FP-Growth Operators so that you can see the input data in each of these formats. The FP-Growth Operator is used and the resulting itemsets can be viewed in the Results View. The results are all the same because the input data is the same, despite the difference in formats.