Cross Distances (RapidMiner Studio Core)
Synopsis
This operator calculates the distance between each example of a 'request set' ExampleSet and each example of a 'reference set' ExampleSet. It can also calculate similarity instead of distance.
Description
The Cross Distances operator takes two ExampleSets as input: the 'reference set' and the 'request set'. It creates an ExampleSet that contains the distance between each example of the 'request set' and each example of the 'reference set'. Both input ExampleSets should have the same attributes in the same order; the operator will not work properly if the order of the attributes differs. If the compute similarities parameter is set to true, similarities are calculated instead of distances. Both input ExampleSets should also have id attributes; if an id attribute is not present, the operator creates one automatically. The measure used for calculating the distances is specified through the parameters. Four types of measures are provided: mixed measures, nominal measures, numerical measures and Bregman divergences.
If data is imported from two different sources that are supposed to represent the same data but have their columns in different orders, the Cross Distances operator will not behave as expected. It is possible to work around this by using the Generate Attributes operator to recreate the attributes in both ExampleSets in the same order.
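To illustrate why attribute order matters, here is a minimal Python sketch (not RapidMiner code) that compares two examples position by position, first without and then with a shared attribute order. The attribute names `height` and `width`, the values, and the helper functions `align` and `euclidean` are all hypothetical and chosen only for illustration.

```python
import math

def align(example, attribute_order):
    """Return the example's values as a list in the given attribute order."""
    return [example[name] for name in attribute_order]

def euclidean(x, y):
    """Plain Euclidean distance between two equally long value lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical examples: identical data, but the two sources list the
# columns in a different order.
request_example = {"height": 3.0, "width": 0.0}
reference_example = {"width": 0.0, "height": 3.0}

# Position-wise comparison without alignment pairs height with width.
naive = euclidean(list(request_example.values()),
                  list(reference_example.values()))

# Impose one shared order on both examples before comparing.
order = sorted(request_example)
aligned = euclidean(align(request_example, order),
                    align(reference_example, order))

print(naive, aligned)
```

In this sketch the two examples describe the same data, so the aligned distance is zero, while the position-wise comparison reports a non-zero distance purely because of the column order.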
Input
- request set (Data Table)
This input port expects an ExampleSet. This ExampleSet will be used as the 'request set'. Please note that both input ExampleSets ('request set' and 'reference set') should have the same attributes in the same order. This operator will not work properly if the order of the attributes is different. Also note that both input ExampleSets should have id attributes. If an id attribute is not present, this operator automatically creates one for that ExampleSet.
- reference set (Data Table)
This input port expects an ExampleSet. This ExampleSet will be used as the 'reference set'. Please note that both input ExampleSets ('request set' and 'reference set') should have the same attributes in the same order. This operator will not work properly if the order of the attributes is different. Also note that both input ExampleSets should have id attributes. If an id attribute is not present, this operator automatically creates one for that ExampleSet.
Output
- result set (Data Table)
An ExampleSet that contains the distance (or similarity, if the compute similarities parameter is set to true) between each example of the 'request set' ExampleSet and each example of the 'reference set' ExampleSet is delivered through this port.
- request set (Data Table)
The 'request set' ExampleSet that was provided at the request set input port is delivered through this port. If the input ExampleSet had an id attribute then the ExampleSet is delivered without any modification. Otherwise an id attribute is automatically added to the input ExampleSet.
- reference set (Data Table)
The 'reference set' ExampleSet that was provided at the reference set input port is delivered through this port. If the input ExampleSet had an id attribute then the ExampleSet is delivered without any modification. Otherwise an id attribute is automatically added to the input ExampleSet.
Parameters
- measure_types: This parameter selects the type of measure used for calculating distances (or similarities). The following options are available: mixed measures, nominal measures, numerical measures and Bregman divergences. Range: selection
- mixed_measure: This parameter is available when the measure type parameter is set to 'mixed measures'. The only available option is the 'Mixed Euclidean Distance'. Range: selection
- nominal_measure: This parameter is available when the measure type parameter is set to 'nominal measures'. This option cannot be applied if the input ExampleSet has numerical attributes. If the input ExampleSet has numerical attributes, the 'numerical measure' option should be selected. Range: selection
- numerical_measure: This parameter is available when the measure type parameter is set to 'numerical measures'. This option cannot be applied if the input ExampleSet has nominal attributes. If the input ExampleSet has nominal attributes, the 'nominal measure' option should be selected. Range: selection
- divergence: This parameter is available when the measure type parameter is set to 'bregman divergences'. Range: selection
- kernel_type: This parameter is available only when the numerical measure parameter is set to 'Kernel Euclidean Distance'. It selects the type of the kernel function. The following kernel types are supported:
- dot: The dot kernel is defined by k(x,y)=x*y, i.e. it is the inner product of x and y.
- radial: The radial kernel is defined by k(x,y)=exp(-g ||x-y||^2), where g is the gamma specified by the kernel gamma parameter. The adjustable parameter gamma plays a major role in the performance of the kernel and should be carefully tuned to the problem at hand.
- polynomial: The polynomial kernel is defined by k(x,y)=(x*y+1)^d, where d is the degree of the polynomial, specified by the kernel degree parameter. Polynomial kernels are well suited for problems where all the training data is normalized.
- neural: The neural kernel is defined by a two-layered neural net, k(x,y)=tanh(a x*y+b), where a is alpha and b is the intercept constant. These parameters can be adjusted using the kernel a and kernel b parameters. A common value for alpha is 1/N, where N is the data dimension. Note that not all choices of a and b lead to a valid kernel function.
- sigmoid: This is the sigmoid kernel. Please note that the sigmoid kernel is not valid under some parameters.
- anova: This is the anova kernel. It has adjustable parameters gamma and degree.
- epachnenikov: The Epanechnikov kernel is the function (3/4)(1-u^2) for u between -1 and 1, and zero for u outside that range. It has two adjustable parameters, kernel sigma1 and kernel degree.
- gaussian_combination: This is the gaussian combination kernel. It has adjustable parameters kernel sigma1, kernel sigma2 and kernel sigma3.
- multiquadric: The multiquadric kernel is defined by sqrt(||x-y||^2 + c^2). It has adjustable parameters kernel sigma1 and kernel shift.
- kernel_gamma: This is the SVM kernel parameter gamma. This parameter is available only when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to radial or anova. Range: real
- kernel_sigma1: This is the SVM kernel parameter sigma1. This parameter is available only when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to epachnenikov, gaussian combination or multiquadric. Range: real
- kernel_sigma2: This is the SVM kernel parameter sigma2. This parameter is available only when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to gaussian combination. Range: real
- kernel_sigma3: This is the SVM kernel parameter sigma3. This parameter is available only when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to gaussian combination. Range: real
- kernel_shift: This is the SVM kernel parameter shift. This parameter is available only when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to multiquadric. Range: real
- kernel_degree: This is the SVM kernel parameter degree. This parameter is available only when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to polynomial, anova or epachnenikov. Range: real
- kernel_a: This is the SVM kernel parameter a. This parameter is available only when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to neural. Range: real
- kernel_b: This is the SVM kernel parameter b. This parameter is available only when the numerical measure parameter is set to 'Kernel Euclidean Distance' and the kernel type parameter is set to neural. Range: real
- only_top_k: This parameter indicates whether only the k nearest examples to each request example should be calculated. Range: boolean
- k: This parameter is available only when the only top k parameter is set to true. It determines how many of the nearest examples should be shown in the result. Range: integer
- search_for: This parameter is available only when the only top k parameter is set to true. It determines whether the nearest or the farthest distances should be selected. Range: selection
- compute_similarities: If this parameter is set to true, similarities are computed instead of distances. All measures remain usable, but a measure that is not originally of the requested kind (a distance or a similarity, respectively) is transformed to match the optimization direction. Range: boolean
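The kernel and top k parameters above can be sketched in plain Python (not RapidMiner code). The dot, radial, polynomial and neural kernels follow the formulas given in the kernel type list. The kernel-induced distance sqrt(k(x,x) - 2 k(x,y) + k(y,y)) is a common textbook construction; that the operator computes 'Kernel Euclidean Distance' in exactly this way is an assumption. The function names, the `search_for` values and the sample data are hypothetical.

```python
import heapq
import math

def dot(x, y):
    """Dot kernel: inner product of x and y."""
    return sum(a * b for a, b in zip(x, y))

def radial(x, y, gamma=1.0):
    """Radial kernel: exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def polynomial(x, y, degree=2):
    """Polynomial kernel: (x*y + 1)^degree."""
    return (dot(x, y) + 1) ** degree

def neural(x, y, a=1.0, b=0.0):
    """Neural kernel: tanh(a * x*y + b)."""
    return math.tanh(a * dot(x, y) + b)

def kernel_distance(x, y, kernel):
    """Distance induced by a kernel (assumed form, see the note above)."""
    squared = kernel(x, x) - 2 * kernel(x, y) + kernel(y, y)
    # Guard against tiny negative values caused by rounding.
    return math.sqrt(max(squared, 0.0))

def top_k(request, references, k, kernel, search_for="nearest"):
    """Return the k nearest (or farthest) references as (distance, id) pairs."""
    scored = [(kernel_distance(request, ref, kernel), ref_id)
              for ref_id, ref in references.items()]
    pick = heapq.nsmallest if search_for == "nearest" else heapq.nlargest
    return pick(k, scored)

# Hypothetical reference set with two numerical attributes.
references = {"id_1": [0.0, 0.0], "id_2": [3.0, 4.0], "id_3": [1.0, 0.0]}
request = [0.0, 0.0]

nearest_two = top_k(request, references, 2, dot, "nearest")
farthest_one = top_k(request, references, 1, dot, "farthest")
print(nearest_two, farthest_one)
```

With the dot kernel, the induced distance reduces to the ordinary Euclidean distance, which makes the sketch easy to check by hand. Passing `radial`, `polynomial` or `neural` instead changes only the kernel, not the selection logic.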
Tutorial Processes
Introduction to the Cross Distances operator
This Example Process starts with a Subprocess operator, which generates the 'request set' and 'reference set' ExampleSets. A breakpoint is inserted here so that you can have a look at the ExampleSets before the Cross Distances operator is applied. You can see that the 'request set' has only one example, with id 'id_1'. The 'reference set' has just two examples, with ids 'id_1' and 'id_2'. Both ExampleSets have three attributes in the same order. It is very important that both ExampleSets have the same attributes in the same order; otherwise the Cross Distances operator will not behave as expected. The Cross Distances operator is applied to these ExampleSets. It calculates the resultant ExampleSet, which contains the distance between each example of the 'request set' and each example of the 'reference set'. The resultant ExampleSet can be viewed in the Results Workspace.
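The shape of this tutorial (one request example, two reference examples, three attributes) can be mirrored in a small Python sketch. This is not the tutorial process itself: the attribute values, the column names `request`, `reference` and `distance`, and the use of plain Euclidean distance are assumptions made for illustration.

```python
import math

def euclidean(x, y):
    """Plain Euclidean distance between two equally long value lists."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cross_distances(request_set, reference_set, measure=euclidean):
    """Build one row per (request example, reference example) pair."""
    return [
        {"request": req_id, "reference": ref_id, "distance": measure(req, ref)}
        for req_id, req in request_set.items()
        for ref_id, ref in reference_set.items()
    ]

# Hypothetical data: ids map to three attribute values in the same order.
request_set = {"id_1": [1.0, 2.0, 3.0]}
reference_set = {"id_1": [1.0, 2.0, 3.0], "id_2": [4.0, 6.0, 3.0]}

rows = cross_distances(request_set, reference_set)
for row in rows:
    print(row)
```

The result has one row for each pairing, so a request set of one example and a reference set of two examples yields two rows.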