Replace Missing Values (Series) (Time Series)
Synopsis
This operator replaces missing values in time series.
Description
The parameters replace type numerical, replace type nominal and replace type date time define the kind of replacement used for time series of the respective type. The parameters skip other missings, replace infinity, replace empty strings and ensure finite values control how neighboring missing values, positive and negative infinity, empty strings and missing values at the start or end of the series are handled. Be aware that only when ensure finite values is set to true is it guaranteed that no invalid values (missing, positive/negative infinity, empty strings) remain in the series after the replacement.
This operator works on all time series (numerical, nominal and time series with date time values).
Differentiation
Replace Missing Values
The standard Replace Missing Values operator from RapidMiner replaces every missing value with a constant value. This series-based operator, on the other hand, replaces missing values based on context: a missing value is replaced according to a selected rule that takes neighboring values into account.
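The difference between the two approaches can be illustrated with a minimal Python sketch. It is not the operator's implementation; the function names `fill_constant` and `fill_previous` are hypothetical, and missing values are represented as NaN for illustration.

```python
import math

series = [1.0, math.nan, 3.0, math.nan, 5.0]

def fill_constant(values, constant):
    """Replace every missing value with the same constant."""
    return [constant if math.isnan(v) else v for v in values]

def fill_previous(values):
    """Replace each missing value with the last valid value seen before it."""
    result, last_valid = [], math.nan
    for v in values:
        if not math.isnan(v):
            last_valid = v
        result.append(last_valid if math.isnan(v) else v)
    return result

constant_filled = fill_constant(series, 0.0)  # same value everywhere
context_filled = fill_previous(series)        # value depends on the neighbor
```

In the constant case both gaps receive 0.0, whereas the context-based rule fills the first gap with 1.0 and the second with 3.0.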
Input
- example set (Data table)
The ExampleSet which contains the time series data as attributes.
Output
- example set (Data table)
The ExampleSet after applying the replacement. If overwrite attributes is true, the original time series attributes are overwritten; otherwise, new attributes with the replaced values are added. The names of the new attributes are formed by appending a postfix, specified by the new attributes postfix parameter, to the names of the original attributes. Other attributes are not changed.
Parameters
- attribute_filter_type
This parameter selects the filter used to choose the time series attributes, i.e. the attributes which hold the time series values. The different filter types are:
- all: This option selects all attributes of the ExampleSet to be time series attributes. This is the default option.
- single: This option allows the selection of a single time series attribute. The required attribute is selected by the attribute parameter.
- subset: This option allows the selection of multiple time series attributes through a list (see parameter attributes). If the meta data of the ExampleSet is known all attributes are present in the list and the required ones can easily be selected.
- regular_expression: This option allows you to specify a regular expression for the time series attribute selection. The regular expression filter is configured by the parameters regular expression, use except expression and except expression.
- value_type: This option allows selection of all the attributes of a particular type to be time series attributes. It should be noted that types are hierarchical. For example real and integer types both belong to the numeric type. The value type filter is configured by the parameters value type, use value type exception, except value type.
- block_type: This option allows the selection of all the attributes of a particular block type to be time series attributes. It should be noted that block types may be hierarchical. For example value_series_start and value_series_end block types both belong to the value_series block type. The block type filter is configured by the parameters block type, use block type exception, except block type.
- no_missing_values: This option selects all attributes of the ExampleSet as time series attributes which do not contain a missing value in any example. Attributes that have even a single missing value are not selected.
- numeric_value_filter: All numeric attributes whose examples all match a given numeric condition are selected as time series attributes. The condition is specified by the numeric condition parameter.
- attribute
The required attribute can be selected from this option. The attribute name can be selected from the drop down box of the parameter if the meta data is known.
- attributes
The required attributes can be selected from this option. This opens a new window with two lists. All attributes are present in the left list. They can be shifted to the right list, which is the list of selected time series attributes.
- regular_expression
Attributes whose names match this expression will be selected. The expression can be specified through the edit and preview regular expression menu. This menu gives a good idea of regular expressions and it also allows you to try different expressions and preview the results simultaneously.
- use_except_expression
If enabled, an exception to the first regular expression can be specified. This exception is specified by the except regular expression parameter.
- except_regular_expression
This option allows you to specify a regular expression. Attributes matching this expression will be filtered out even if they match the first expression (expression that was specified in regular expression parameter).
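The interplay of the regular expression and the except expression can be sketched in Python as follows. This is only an illustration: the helper `select_by_regex` and the attribute names are hypothetical, and the sketch assumes that an expression must match the whole attribute name.

```python
import re

def select_by_regex(names, expression, except_expression=None):
    """Keep names matching expression, then drop those matching except_expression."""
    selected = []
    for name in names:
        if not re.fullmatch(expression, name):
            continue  # does not match the first expression
        if except_expression is not None and re.fullmatch(except_expression, name):
            continue  # matches the first expression but is filtered out again
        selected.append(name)
    return selected

attribute_names = ["temp_inside", "temp_outside", "humidity", "temp_raw"]
selected = select_by_regex(attribute_names, r"temp_.*", r".*_raw")
```

Here `temp_raw` matches the first expression but is removed by the except expression, and `humidity` never matches at all.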
- value_type
This option allows you to select a value type of attribute.
- use_value_type_exception
If enabled, an exception to the selected type can be specified. This exception is specified by the except value type parameter.
- except_value_type
Attributes matching this type are removed from the final output even if they match the type previously selected by the value type parameter.
- block_type
This option allows you to select a block type of attribute.
- use_block_type_exception
If enabled, an exception to the selected block type can be specified. This exception is specified by the except block type parameter.
- except_block_type
Attributes matching this block type are removed from the final output even if they match the block type previously selected by the block type parameter.
- numeric_condition
The numeric condition used by the numeric condition filter type. A numeric attribute is selected if all examples match the specified condition for this attribute. For example the numeric condition '> 6' will keep all numeric attributes having a value of greater than 6 in every example. A combination of conditions is possible: '> 6 && < 11' or '<= 5 || < 0'. But && and || cannot be used together in one numeric condition. Conditions like '(> 0 && < 2) || (>10 && < 12)' are not allowed because they use both && and ||.
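The rule that every example must match the condition, and that && and || may not be mixed, can be sketched in Python. The parser below is a hypothetical illustration, not the operator's own; it assumes only the comparison operators >, >=, < and <= and plain decimal numbers.

```python
import operator
import re

OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}

def parse_condition(text):
    """Turn a condition such as '> 6 && < 11' into a predicate on one value."""
    if "&&" in text and "||" in text:
        raise ValueError("&& and || cannot be combined in one condition")
    combine = any if "||" in text else all
    checks = []
    for part in re.split(r"&&|\|\|", text):
        match = re.fullmatch(r"\s*(>=|<=|>|<)\s*(-?\d+(?:\.\d+)?)\s*", part)
        if match is None:
            raise ValueError(f"cannot parse condition part: {part!r}")
        checks.append((OPS[match.group(1)], float(match.group(2))))
    return lambda value: combine(op(value, bound) for op, bound in checks)

def attribute_matches(values, condition_text):
    """An attribute is selected only if all of its values satisfy the condition."""
    condition = parse_condition(condition_text)
    return all(condition(v) for v in values)

in_range = attribute_matches([7, 8, 10], "> 6 && < 11")
out_of_range = attribute_matches([7, 12], "> 6 && < 11")
```

A single value outside the range, such as 12 in the second call, is enough to exclude the attribute.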
- invert_selection
If this parameter is set to true, the selection is reversed. In that case all attributes not matching the specified condition are selected as time series attributes. Special attributes are not selected, independent of the invert selection parameter, as long as the include special attributes parameter is not set to true. If it is, the condition is also applied to the special attributes and the selection is reversed if this parameter is checked.
- include_special_attributes
Special attributes are attributes with special roles. These are: id, label, prediction, cluster, weight and batch. Also custom roles can be assigned to attributes. By default special attributes are not selected as time series attributes irrespective of the filter conditions. If this parameter is set to true, special attributes are also tested against conditions specified and those attributes are selected that match the conditions.
- has_indices
This parameter indicates if there is an index attribute associated with the time series. If this parameter is set to true, the index attribute has to be selected.
- indices_attribute
If the parameter has indices is set to true, this parameter defines the associated index attribute. It can be either a date, date_time or numeric value type attribute. The attribute name can be selected from the drop down box of the parameter if the meta data is known.
- sort_time_series
If this parameter is selected, the input time series is sorted according to the selected indices attribute before the time series operation is applied. If it is not selected and the input time series is not sorted, a corresponding User Error is thrown.
Keep in mind that the index values still need to be unique. If the values are not unique, a corresponding User Error is thrown.
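The checks described for this parameter can be sketched in Python. The function `prepare_series` is hypothetical, and `ValueError` merely stands in for the User Error raised by the operator.

```python
def prepare_series(indices, values, sort_time_series):
    """Validate the index values and optionally sort the series by them."""
    # Index values must be unique whether or not sorting is enabled.
    if len(set(indices)) != len(indices):
        raise ValueError("index values must be unique")
    pairs = list(zip(indices, values))
    if sort_time_series:
        pairs.sort(key=lambda pair: pair[0])
    elif any(a[0] > b[0] for a, b in zip(pairs, pairs[1:])):
        # Sorting is disabled and the input is out of order.
        raise ValueError("series is not sorted; enable sort_time_series")
    return [i for i, _ in pairs], [v for _, v in pairs]

sorted_indices, sorted_values = prepare_series([3, 1, 2], [30.0, 10.0, 20.0], True)
```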
- overwrite_attributes
This parameter indicates if the original time series attributes are overwritten by the resulting time series. If this parameter is set to false, the resulting new time series are added as new attributes to the ExampleSet. The name of these new attributes will be the name of the original time series with a postfix added. The postfix is specified by the parameter new attributes postfix.
- new_attributes_postfix
If overwrite attributes is false, this parameter specifies the postfix which is added to the names of the original time series to create the new attribute names.
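The naming rule can be sketched in Python with a table held as a dictionary of columns. The helper `store_result`, the attribute name `level` and the postfix `_cleaned` are hypothetical values chosen for illustration only.

```python
def store_result(table, name, new_values, overwrite_attributes, postfix):
    """Write the replaced series back, either in place or under a postfixed name."""
    result = dict(table)  # leave the input table untouched
    target = name if overwrite_attributes else name + postfix
    result[target] = new_values
    return result

table = {"level": [1.0, None, 3.0]}
added = store_result(table, "level", [1.0, 1.0, 3.0], False, "_cleaned")
overwritten = store_result(table, "level", [1.0, 1.0, 3.0], True, "_cleaned")
```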
- replace_type_numerical
The kind of replacement which is used to replace the missing values of numeric time series.
- previous value: The previous value in the series is used as a replacement. If the parameter skip other missings is set to true, neighboring missing values are all replaced by the first previous valid value. Missing values at the start of a series are not replaced, unless the parameter ensure finite values is set to true. Then the next valid value is used as a replacement.
- next value: The next value in the series is used as a replacement. If the parameter skip other missings is set to true, neighboring missing values are all replaced by the next valid value. Missing values at the end of a series are not replaced, unless the parameter ensure finite values is set to true. Then the first previous valid value is used as a replacement.
- average: The average of the neighboring values in the series is used as a replacement. If the parameter skip other missings is set to true, neighboring missing values are all replaced by the average of the neighboring valid values. Missing values at the start and end of a series are not replaced, unless the parameter ensure finite values is set to true. Then the next or previous valid value, respectively, is used as a replacement.
- linear interpolation: A linear interpolation (using the index values from the index attribute) between the two neighboring values in the series is used to calculate the replacement value. If the parameter skip other missings is set to true, the nearest valid values on either side are used for the interpolation, and all missing values between them are replaced by the interpolated values. Missing values at the start and end of a series are not replaced, unless the parameter ensure finite values is set to true. Then the next or previous valid value, respectively, is used as a replacement.
- value: All missing values are replaced by a constant value, specified by the replace value numerical parameter.
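The four neighbor-based replace types listed above can be compared in a short Python sketch. It is an illustration under stated assumptions, not the operator's code: the function names are hypothetical, missing values are represented as NaN, and the behavior corresponds to skip other missings set to true and ensure finite values set to false.

```python
import math

def nearest_valid(values, start, step):
    """Position of the nearest non-missing value from start in direction step, or None."""
    pos = start + step
    while 0 <= pos < len(values):
        if not math.isnan(values[pos]):
            return pos
        pos += step
    return None

def replace_missing(values, indices, replace_type):
    result = list(values)
    for i, v in enumerate(values):
        if not math.isnan(v):
            continue
        prev_pos = nearest_valid(values, i, -1)
        next_pos = nearest_valid(values, i, +1)
        if replace_type == "previous value":
            if prev_pos is not None:
                result[i] = values[prev_pos]
        elif replace_type == "next value":
            if next_pos is not None:
                result[i] = values[next_pos]
        elif prev_pos is not None and next_pos is not None:
            left, right = values[prev_pos], values[next_pos]
            if replace_type == "average":
                result[i] = (left + right) / 2
            elif replace_type == "linear interpolation":
                # Weight the neighbors by the distance between index values.
                span = indices[next_pos] - indices[prev_pos]
                weight = (indices[i] - indices[prev_pos]) / span
                result[i] = left + weight * (right - left)
    return result

indices = [0, 1, 4]
values = [10.0, math.nan, 50.0]
by_average = replace_missing(values, indices, "average")
by_interpolation = replace_missing(values, indices, "linear interpolation")
```

With unevenly spaced index values the two rules differ: the average ignores the spacing, while the interpolation places the replacement a quarter of the way between the two neighbors.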
- replace_type_nominal
The kind of replacement which is used to replace the missing values of nominal time series.
- previous value: The previous value in the series is used as a replacement. If the parameter skip other missings is set to true, neighboring missing values are all replaced by the first previous valid value. Missing values at the start of a series are not replaced, unless the parameter ensure finite values is set to true. Then the next valid value is used as a replacement.
- next value: The next value in the series is used as a replacement. If the parameter skip other missings is set to true, neighboring missing values are all replaced by the next valid value. Missing values at the end of a series are not replaced, unless the parameter ensure finite values is set to true. Then the first previous valid value is used as a replacement.
- value: All missing values are replaced by a constant value, specified by the replace value nominal parameter.
- replace_type_date_time
The kind of replacement which is used to replace the missing values of time series with date time values (this is not used for the indices attribute).
- previous value: The previous value in the series is used as a replacement. If the parameter skip other missings is set to true, neighboring missing values are all replaced by the first previous valid value. Missing values at the start of a series are not replaced, unless the parameter ensure finite values is set to true. Then the next valid value is used as a replacement.
- next value: The next value in the series is used as a replacement. If the parameter skip other missings is set to true, neighboring missing values are all replaced by the next valid value. Missing values at the end of a series are not replaced, unless the parameter ensure finite values is set to true. Then the first previous valid value is used as a replacement.
- average: The average of the neighboring values in the series is used as a replacement. If the parameter skip other missings is set to true, neighboring missing values are all replaced by the average of the neighboring valid values. Missing values at the start and end of a series are not replaced, unless the parameter ensure finite values is set to true. Then the next or previous valid value, respectively, is used as a replacement.
- linear interpolation: A linear interpolation (using the index values from the index attribute) between the two neighboring values in the series is used to calculate the replacement value. If the parameter skip other missings is set to true, the nearest valid values on either side are used for the interpolation, and all missing values between them are replaced by the interpolated values. Missing values at the start and end of a series are not replaced, unless the parameter ensure finite values is set to true. Then the next or previous valid value, respectively, is used as a replacement.
- value: All missing values are replaced by a constant value, specified by the replace value date time parameter.
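For date time values, the average replace type can be pictured as the midpoint between the two neighboring timestamps. The sketch below assumes this interpretation; the function `average_datetime` is hypothetical and not part of the operator.

```python
from datetime import datetime

def average_datetime(left, right):
    """Midpoint between two timestamps, computed via their time difference."""
    return left + (right - left) / 2

before = datetime(2024, 1, 1, 0, 0)
after = datetime(2024, 1, 1, 2, 0)
midpoint = average_datetime(before, after)
```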
- replace_value_numerical
If replace type numerical is set to value, this parameter specifies the replacement value for all missing values of numerical time series.
- replace_value_nominal
If replace type nominal is set to value, this parameter specifies the replacement value for all missing values of nominal time series.
- replace_value_date_time
If replace type date time is set to value, this parameter specifies the replacement value for all missing values of time series with date time values.
- skip_other_missings
If this parameter is set to true, other neighboring values which are also missing are not considered for the determination of the replacement value. If this parameter is set to false and a replacement value would be also a missing value (e.g., replace type numerical is next value and the next value would be missing), the missing value is not replaced.
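The effect of this parameter on a run of adjacent missing values can be sketched in Python for the next value replace type. The function `replace_with_next` is hypothetical, and the sketch assumes that candidates are always looked up in the original series rather than in already replaced values.

```python
import math

def replace_with_next(values, skip_other_missings):
    result = list(values)
    for i, v in enumerate(values):
        if not math.isnan(v):
            continue
        if skip_other_missings:
            # Look past missing neighbors to the next valid value.
            candidates = [x for x in values[i + 1:] if not math.isnan(x)]
            candidate = candidates[0] if candidates else math.nan
        else:
            # Only the direct neighbor counts; it may itself be missing.
            candidate = values[i + 1] if i + 1 < len(values) else math.nan
        result[i] = candidate
    return result

series = [1.0, math.nan, math.nan, 4.0]
skipping = replace_with_next(series, True)
not_skipping = replace_with_next(series, False)
```

With skipping enabled both gaps are filled with 4.0. Without it, the first gap stays missing because its direct successor is missing too.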
- replace_infinity
If this parameter is set to true, positive and negative infinity values are also replaced in numerical time series. Otherwise they are treated as valid values: they are not replaced, and they are taken into account when determining the replacement value for a missing value (e.g. if replace type numerical is average and one neighboring value is positive infinity, then the replacement value is also positive infinity).
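The infinity example can be reproduced in a small Python sketch of the average replace type. The function `average_replace` is hypothetical, and the sketch assumes that invalid neighbors are skipped when searching for valid ones.

```python
import math

def average_replace(values, replace_infinity):
    def invalid(v):
        # Infinity only counts as invalid when replace_infinity is enabled.
        return math.isnan(v) or (replace_infinity and math.isinf(v))

    result = list(values)
    for i, v in enumerate(values):
        if not invalid(v):
            continue
        left = next((x for x in reversed(values[:i]) if not invalid(x)), None)
        right = next((x for x in values[i + 1:] if not invalid(x)), None)
        if left is not None and right is not None:
            result[i] = (left + right) / 2
    return result

series = [1.0, math.inf, math.nan, 3.0]
kept = average_replace(series, False)
replaced = average_replace(series, True)
```

When infinity is kept, it propagates into the replacement for the missing value. When it is replaced, both positions receive the average of 1.0 and 3.0.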
- replace_empty_strings
If this parameter is set to true, empty strings are also replaced in nominal time series. Otherwise they are treated as valid values: they are not replaced, and they are taken into account when determining the replacement value for a missing value (e.g. if replace type nominal is next value and the next value is an empty string, then the replacement value is also an empty string).
- ensure_finite_values
If this parameter is set to true, the operator ensures that no invalid values (missing, positive/negative infinity, empty strings) remain in the series after the replacement. The parameters skip other missings, replace infinity and replace empty strings are automatically set to true. It is also ensured that invalid values at the start/end of a series are replaced with valid ones. See the description of the different replace types for details.
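The handling of the start of a series can be sketched in Python for the previous value replace type. The function `previous_value_fill` is hypothetical; the sketch only considers missing values (NaN) and assumes skip other missings is true in both cases.

```python
import math

def previous_value_fill(values, ensure_finite_values):
    result, last_valid = [], None
    for v in values:
        if not math.isnan(v):
            last_valid = v
            result.append(v)
        elif last_valid is not None:
            result.append(last_valid)
        else:
            result.append(v)  # leading gap: no previous value exists yet
    if ensure_finite_values:
        # Fall back to the next valid value for the leading gap.
        first_valid = next((v for v in values if not math.isnan(v)), None)
        if first_valid is not None:
            result = [first_valid if math.isnan(v) else v for v in result]
    return result

series = [math.nan, math.nan, 3.0, math.nan, 5.0]
default_fill = previous_value_fill(series, False)
finite_fill = previous_value_fill(series, True)
```

Without the fallback the two leading values remain missing. With it they take the value 3.0, so no missing value is left in the result.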
Tutorial Processes
Replacement of missing values in Lake Huron data set
In this tutorial process we randomly set some values (10% of the values) of the Lake Huron data set to missing values. Then the Replace Missing Values (Series) operator is used to replace them again. One Replace operator uses previous value as the replace type, the other uses linear interpolation. Have a look at the result view to investigate the effect of the replacement.