Pivot (Blending)
Synopsis
The Pivot Operator creates a pivot table, summarizing the data in a larger table by reorganizing it into groups and calculating sums, averages, or other statistics for each group.Description
The most common example of a pivot table is the sales summary report. Imagine that you sell goods in numerous stores, and you want to know the total sales in each store during every month of the year. You have a data table listing all the transactions that occur in all of the stores. For each transaction, a row in your table lists the store and the date of purchase, with separate columns for month and day, plus the cost of the transaction. The pivot table is a new table, created from your original table in the following way:
- Create a unique row for each store.
- Create a unique column for each month of the year.
- Assign every transaction to the appropriate row and column, by adding the cost of the transaction.
Although in this case the cells of the pivot table display a sum (total sales for that category), you can create other kinds of statistics. A count will tell you how many transactions occurred in each store during each month. An average will tell you the average cost of a transaction. See the parameter aggregation attributes for the complete list of statistics.
Typically, a pivot table is defined by three Attributes from your original ExampleSet:
- group by attributes - Usually categorical (nominal) values, they define the rows
- column grouping attribute - Usually categorical (nominal) values, they define the columns
- aggregation attributes - Usually numeric values, they are summed or averaged
However, the actual number of Attributes may be more or less than three, depending on your choices. You can, for example, choose more than one Attribute in the set of aggregation attributes.
Notice that in our original example, the group by attributes (store) and the column grouping attribute (month) have categorical values, while the aggregation attributes (cost) have numeric values. If an Attribute you have chosen for the aggregation attributes has categorical values, the range of statistics is more limited -- e.g., count is still available, but not average.
If an Attribute you have chosen for the group by attributes or the column grouping attribute has numeric values, you may find it convenient to first convert them to categorical values, by placing the numeric values into bins.
Differentiation
Aggregate
Just like the Pivot Operator, the Aggregate Operator has the parameters group by attributes and aggregation attributes, but it does not support the column grouping attribute. You can think of the Pivot Operator as a generalization of the Aggregate Operator, with a more advanced concept of groups.
Transpose
The Transpose operator exchanges the rows and columns of an ExampleSet, like the transpose of a matrix.
Input
- input (Data table)
This input port expects an ExampleSet.
Output
- output (Data table)
This output port delivers the pivot table.
- original (Data table)
The ExampleSet that was given as input is passed through without changes.
- preprocessing model (Preprocessing Model)
This port delivers the preprocessing model. It can be used by the Apply Model Operator to perform the specified pivot on another ExampleSet. The result of the Apply Model then has exactly the same attributes as at the output port of the Pivot. This is helpful for example if the Pivot Operator is used during training and the same transformation has to be applied on test or actual data. The preprocessing model can also be grouped together with other preprocessing models and prediction models by the Group Models Operator.
Parameters
- group_by_attributes
This parameter specifies the rows in the pivot table. Click on "Select Attributes" and choose one or more Attributes. If you choose a single Attribute, the rows will correspond to the unique values of that Attribute. If you choose two or more Attributes, the rows will correspond to unique combinations of the Attribute values. All data points with the same values of the Attributes belong in the same row.
Range: - column_grouping_attribute
This parameter specifies the columns in the pivot table. Select an Attribute from the drop-down. The columns will correspond to the unique values of this Attribute. All data points with the same value of the Attribute belong in the same column.
Range: - aggregation_attributes
Within each cell of the pivot table, defined by a row and a column, one or more statistics are calculated, using the aggregation functions. Click on "Edit List", and choose an Attribute and a function from the drop-downs. The complete list of aggregation functions is given below.
- average
- concatenation
- count
- count (fractional)
- count (including missings)
- count (percentage)
- first
- least
- log product
- maximum
- median
- minimum
- mode
- product
- standard deviation
- sum
- sum (fractional)
- variance
- use_default_aggregation
If checked, this parameter allows you to define a default aggregation function for the remaining Attributes -- all those that are not used by the first three parameters. Any remaining Attributes whose data type is not compatible with the default aggregation function will be ignored; e.g., notice that average is not available for categorical (nominal) Attributes.
Range: - default_aggregation_function
This parameter is available only when the parameter use default aggregation is set to true. It specifies the default aggregation function for the remaining Attributes -- all those that are not used by the first three parameters. The Attributes in question must be compatible with the default aggregation function. For example, if an ExampleSet has two nominal Attributes and all the rest are numeric, you can aggregate all of them by first including the two nominal Attributes in the aggregation attributes, with count as the aggregation function, then choosing average as the default aggregation function, to take care of the rest.
Range:
Tutorial Processes
Survival on the Titanic by Class and Sex
This Tutorial Process retrieves the Titanic data and uses the Pivot Operator to display the probability of survival for groups categorized according to "Passenger Class" and "Sex". Because "Survived" is a categorical variable ("Yes"/"No"), we first convert it to a numerical value via the Nominal to Numerical Operator; this Operator converts "Survived" into two new numerical Attributes: "Survived = No" and "Survived = Yes", with 0 and 1 as values.
For the parameters of the Pivot Operator, we choose: group by attributes - "Passenger Class" (3 values) column grouping attribute - "Sex" (2 values) aggregation attributes - "Survived = Yes", with average as the aggregation function.
The process includes a breakpoint, so you have to press the "Run" button twice:
1) Click "Run" to see the ExampleSet with the new Attributes.
2) Click "Run" again to see the probability of survival in each of six categories.
Family size on the Titanic
This Tutorial Process retrieves the Titanic data and uses the Pivot Operator to display the family size of groups categorized according to "Sex", "Passenger Class", and the value of "Survived". This pivot table uses more than three Attributes.
"Family Size" is not an Attribute of the Titanic data set, but you can create this Attribute with the help of the Generate Attributes Operator, by adding two Attributes that already exist: [Family Size] = [No of Parents or Children on Board] + [No of Siblings or Spouses on Board] + 1
For the parameters of the Pivot Operator, we choose: group by attributes - "Sex" and "Passenger Class" (2 x 3 = 6 values) column grouping attribute - "Survived" (2 values) aggregation attributes - "Family Size", with average as the aggregation function. "Family Size", with count as the aggregation function.
The process includes a breakpoint, so you have to press the "Run" button twice:
1) Click "Run" to see the ExampleSet with the new "Family Size" Attribute.
2) Click "Run" to see the pivot table, where the first two numerical columns show average family sizes for groups of individuals who did or did not survive, while the last two columns show the number of individuals (count) in each category.
The number of females in first and second class who did not survive is not significant (see the count values), but examining the row for third-class females, we see that the average family size for females who did not survive is significantly larger than for females who did survive (3 versus 2).