Join (RapidMiner Studio Core)

Synopsis

This operator joins two ExampleSets using specified key attribute(s) of the two ExampleSets.

Description

The Join operator joins two ExampleSets using one or more attributes of the input ExampleSets as key attributes. Identical values of the key attributes indicate matching examples. An attribute with id role is selected as key by default but an arbitrary set of one or more attributes can be chosen as key. Four types of joins are possible: inner, left, right and outer join. All these types of joins are explained in the parameters section.

Input

  • left (Data Table)

    The left input port expects an ExampleSet. This ExampleSet will be used as the left ExampleSet for the join.

  • right (Data Table)

    The right input port expects an ExampleSet. This ExampleSet will be used as the right ExampleSet for the join.

Output

  • join (Data Table)

    The join of the left and right ExampleSets is delivered through this port.

Parameters

  • remove_double_attributesThis parameter indicates if double attributes should be removed or renamed. Double attributes are those attributes that are present in both ExampleSets. If this parameter is checked, from attributes which are present in both ExampleSets only the one from the left ExampleSet will be taken and the one from the right ExampleSet would be discarded. The key attributes will always be taken from the left ExampleSet. Please note that this check for double attributes will only be applied for regular attributes. Special attributes of the right ExampleSet which do not exist in the left ExampleSet will simply be added. If they already exist they are simply skipped. Range: boolean
  • join_typeThis parameter specifies which Join should be performed. You can easily understand these joins by studying the Example Process. Four types of joins are supported:
    • inner: The resultant ExampleSet will contain only those examples where the key attributes of both input ExampleSets match i.e. have the same value.
    • left: This is also called left outer join. The resultant ExampleSet will contain all records from the left ExampleSet. If no matching records were found in the right ExampleSet, then its fields will contain the null value i.e. '?' will be stored there. The left join will always contain the results of the inner join; however it can contain some examples that have no matching examples in the right ExampleSet.
    • right: This is also called right outer join. The resultant ExampleSet will contain all records from the right ExampleSet. If no matching records were found in the left ExampleSet, then its fields will contain the null values i.e. '?' will be stored there. The right join will always contain the results of the inner join; however it can contain some examples that have no matching examples in the left ExampleSet.
    • outer: This is also called full outer join. This type of join combines the results of the left and the right join. All examples from both ExampleSets will be part of the resultant ExampleSet, whether the matching key attribute value exists in the other ExampleSet or not. If no matching key attribute value was found then the corresponding resultant fields will have a null value. The outer join will always contain the results of the inner join; however it can contain some examples that have no matching examples in the other ExampleSet.
    Range: selection
  • use_id_attribute_as_keyThis parameter indicates if the attribute with the id role should be used as the key attribute. This option is checked by default. If unchecked, then you have to specify the key attributes for both left and right ExampleSets. Identical values of the key attributes indicate matching examples Range: boolean
  • key_attributesThis parameter specifies attribute(s) which are used as the key attributes. Identical values of the key attributes indicate matching examples. For each key attribute from the left ExampleSet a corresponding one from the right ExampleSet has to be chosen. Choosing appropriate key attributes is critical for obtaining the desired results. This parameter is available only when the use id attribute as key parameter is unchecked. Range:
  • keep_both_join_attributesIf checked, both columns of a join pair will be kept. Usually this is unneccessary since both attributes are identical. It may be useful to keep such a column if there are missing values on one side. Range:

Tutorial Processes

Different types of join

The last operator of this process is the Join operator. The sequence of operators leading to the left input port of the Join operator is used to generate the left ExampleSet. Similarly, the sequence of operators leading to the right input port of the Join operator is used to generate the right ExampleSet. The sequence of operators leading to the left and right input ports of the Join operator are pretty similar.

In both cases the Retrieve operator is used to load the 'Golf' data set. Then the Generate Attribute operator is applied on it to generate a dummy attribute. All attributes of the 'Golf' data set other than the 'Play' attribute and the newly generated attribute are discarded because the keep all parameter is unchecked. Then the Generate ID operator is applied to generate an attribute with the id role. This attribute will later be used as the key attribute for joining.

The only difference is that for the left ExampleSet, the name of the attribute generated by the Generate Attribute operator is 'Golf 1 attribute' and for the right ExampleSet the name of this attribute is 'Golf 2 attribute'. The other major difference is in the value of the offset parameter of the Generate ID operator. For the left ExampleSet the offset parameter of the Generate ID operator is set to 0 and for the right ExampleSet it is set to 7. Thus the left ExampleSet has id from 1 to 14 and the right ExampleSet has id from 8 to 21. The breakpoints are inserted after the Generate ID operator so that you can have a look at the left and right ExampleSets before application of the Join operator.

The use id attribute as key parameter of the Join operator is set to true. Thus attributes with id role will be used to join the left and right ExampleSets. The remove double attributes parameter is also checked. Thus regular attributes common in both input ExampleSets would appear just once in the resultant ExampleSet. Only the 'Play' and 'id' attributes are common in both the ExampleSets, but as they are not regular attributes so the remove double attributes parameter has no effect on them. As mentioned earlier the key attributes will always be taken from the left ExampleSet. Pease note that this check for double attributes will only be applied for regular attributes. Special attributes of the right ExampleSet which do not exist in the left ExampleSet will simply be added. If they already exist they are simply skipped.

In this example process the join type is set as inner join. You can change it to other values and run the process again. Here is an explanation of results that are obtained on each value of the join type parameter.

If inner join is selected as join type the resultant ExampleSet has examples with ids from 8 to 14. This is because the inner join delivers only those examples where the key attribute of both input ExampleSets have the same values. In this example process, the left ExampleSet has ids from 1 to 14 and the right ExampleSet has ids from 8 to 21. Thus examples with ids from 8 to 14 have equal value of the key attribute (i.e. the id attribute).

If left join is selected as join type the resultant ExampleSet has examples with ids from 1 to 14. This is because the left join delivers all examples of the left ExampleSet with corresponding values of the right ExampleSet. If there is no match present in the right ExampleSet, the null value is placed at its place. This is why you can see that the 'Golf 2 attribute' has null values for ids 1 to 7.

If right join is selected as join type the resultant ExampleSet has examples with ids from 8 to 21. This is because the right join delivers all examples of the right ExampleSet with corresponding values of the left ExampleSet. If there is no match present in the left ExampleSet, a null value is placed at its place. This is why you can see that the 'Golf 1 attribute' has null values for ids 15 to 21.

If outer join is selected as join type the resultant ExampleSet has examples with ids from 1 to 21. This is because the outer join combines the results of the left and right join. All examples from both ExampleSets will be part of the resultant ExampleSet, whether the matching key attribute value exists in the other ExampleSet or not. If no matching key attribute value was found then the corresponding resultant fields will have a null value. In this example process the left ExampleSet has ids from 1 to 14 and the right ExampleSet has ids from 8 to 21. Thus examples with ids from 1 to 21 are part of the resultant ExampleSet. The 'Golf 2 attribute' has null values in examples with ids from 1 to 7. Similarly, the 'Golf 1 attribute' has null values in examples with ids from 15 to 21. There are no null values in examples with ids 8 to 14. The 'Play' attribute has null values in examples with id from 15 to 21. This is because special attributes are taken from the left ExampleSet which in this example process has no values of the 'Play' attribute corresponding to ids 15 to 21.