RapidMiner Studio documentation, version 9.5
Create ExampleSet (Utility)
Synopsis
This operator creates an ExampleSet with user-specified attributes and examples. Different data generator types are available.
Description
This operator creates an ExampleSet with user-specified attributes and examples. Different data generator types are available. Currently supported types are:
- attribute functions: The user defines attributes with mathematical expressions and specifies the number of examples to create. This is configured with the function descriptions parameter, which works like the function descriptions parameter of the Generate Attributes operator.
- numeric series: The user specifies the number of examples to create and configures the numeric series (e.g. linear, quadratic, exponential, ...) with the numeric series configuration parameter.
- date series: The user specifies the number of examples to create and configures the date series with the date series configuration parameter.
- comma separated text: The user provides text with comma separated values via the input csv text parameter. The input text is converted to an ExampleSet.
Output
- output (IOObject)
The created ExampleSet.
Parameters
- generator_type
The type of generator to create the ExampleSet.
- attribute_functions: Attributes of the new ExampleSet are created from mathematical expressions given in the function descriptions parameter. This is similar to the function descriptions parameter of the Generate Attributes operator. The parameters number of examples, function descriptions and add id attribute are available to configure this generator type.
- numeric_series: Attributes of the new ExampleSet are created as numeric series of different kinds (e.g. linear, quadratic, exponential, ...). The series is defined either by a range ('start value' and 'stop value') or by a 'start value' and a 'stepsize'. Be aware that this only defines the 'base' series; the series type defines the function that is applied to the 'base' series to generate the resulting attribute values. The parameters number of examples, use stepsize and numeric series configuration are available to configure this generator type.
- date_series: Attributes of the new ExampleSet are created as date series. The series is defined either by a date range ('start date' and 'end date') or by a 'start date' and a 'stepsize' with one of several 'interval types'. The parameters number of examples, use stepsize, date series configuration, date series configuration (interval) and date format are available to configure this generator type.
- comma_separated_text: The new ExampleSet is created from a text input with comma separated values. The first row is interpreted as the attribute names; the other rows contain the values. Attribute names can be trimmed. The types of the attributes are guessed, unless the 'parse all as nominal' parameter is set to true. The parameters input csv text, column separator, parse all as nominal, decimal point character and trim attribute names are available to configure this generator type.
- number_of_examples
The number of examples to generate. Available for generator types: attribute functions, numeric series, date series.
- function_descriptions
List of functions to generate. For more details about how to use this parameter, see the help text of the Generate Attributes operator. Available for generator type: attribute functions.
- add_id_attribute
If this parameter is set to true, an additional (numeric) id attribute is generated, which can be used in the function expressions. Be aware that this attribute is not listed in the expression. The attribute is named 'id' and has the 'id' role. Available for generator type: attribute functions.
- use_stepsize
If this parameter is set to true, a 'start value' and a 'stepsize' are used in the series generation. If this parameter is set to false, a 'start value' and a 'stop value' are used. Available for generator types: numeric series, date series.
- numeric_series_configuration
List of numeric series to generate. For each entry in the list an attribute is created. The settings 'min' and 'max/stepsize' define the equidistant 'base' series x. The 'type' setting defines the function that is applied to the 'base' series to generate the values for the new attribute. See the tutorial process 'Usage of the numeric_series generator' for example configurations. Available for generator type: numeric_series.
- attribute_name: Name of the new attribute.
- type: Function applied on the 'base' series. linear: x, quadratic: x^2, square root: sqrt(x), power of 10: 10^x, power of 2: 2^x, power of E: e^x, ln: ln(x), log10: log10(x), log2: log2(x).
- min: Start value of the 'base' series.
- max/stepsize: If the parameter 'use stepsize' is true, this parameter defines the stepsize between two entries of the 'base' series x. If it is false, this parameter defines the stop value of the 'base' series. In that case the series includes both the start value and the stop value.
- date_series_configuration
List of date series to generate. For each entry in the list an attribute is created. The settings 'start date' and 'end date' define the range of the date series. Both dates are included in the series. The date values in between are distributed equidistantly. Be aware that the values are equidistant on a millisecond level, so depending on leap days and leap seconds the difference between values may differ from known time units like years, days, .... See the tutorial process 'Usage of the date_series generator' for example configurations. This parameter is available if the parameter 'use stepsize' is set to false. If it is set to true, the date series is configured by the similar parameter list 'date series configuration (interval)', described below. Available for generator type: date series.
- attribute_name: Name of the new attribute.
- start date: Start date of the date series. The input is interpreted by the format specified by the 'date format' parameter.
- end date: End date of the date series. The input is interpreted by the format specified by the 'date format' parameter.
- date_series_configuration (interval)
List of date series to generate. For each entry in the list an attribute is created. The settings 'start date', 'stepsize' and 'interval type' define the series. The values of the series start with the 'start date'; then 'stepsize' times the interval type is added for each following value. See the tutorial process 'Usage of the date_series generator' for example configurations. This parameter is available if the parameter 'use stepsize' is set to true. If it is set to false, the date series is configured by the similar parameter list 'date series configuration', described above. Available for generator type: date series.
- attribute_name: Name of the new attribute.
- start date: Start date of the date series. The input is interpreted by the format specified by the 'date format' parameter.
- stepsize: For each value in the date series the time added to the previous value is stepsize times the date unit specified by the 'interval type'.
- interval type: Date unit to add for each value of the series. Possible values: year, month, week, day, hour, minute, second, millisecond.
- date_format
Date format used in the 'start date' and 'end date' parameters. Available for generator type: date series.
- input_csv_text
Specify a text input with comma separated values. The first line is interpreted as the names of the attributes. The remaining lines are interpreted as the values of the attributes, separated by the column separator. By default this is ',' but it can be changed with the 'column separator' parameter. If the parameter 'parse all as nominal' is set to false (default), the type of the attributes is guessed. For this purpose, the character used for the decimal point can be specified with the parameter 'decimal point character' (default: '.'). Available for generator type: comma separated text.
- column_separator
The character used by the operator to separate the columns in the input text. Available for generator type: comma separated text.
- parse_all_as_nominal
If this parameter is set to true, no type guessing is performed for the attributes. All attributes are set to type NOMINAL. If it is set to false, the type of the attributes is guessed after the input csv text is read. Depending on the number of rows this can increase the runtime. Available for generator type: comma separated text.
- decimal_point_character
If the parameter 'parse all as nominal' is set to false, the type guessing uses the character specified by this parameter as the decimal point. Available for generator type: comma separated text.
- trim_attribute_names
If this parameter is set to true, leading and trailing whitespace in the attribute names in the 'input csv text' is removed. Available for generator type: comma separated text.
Tutorial Processes
Usage of the attribute_functions generator
This tutorial process uses the Create ExampleSet operator with the generator type 'attribute_functions' to generate a new ExampleSet. Different attributes are generated using the same expression editor as the Generate Attributes operator. Take a look at the list in the 'function descriptions' parameter of the operator.
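For readers who want to reason about this generator outside of RapidMiner, the following Python sketch mimics the idea: a fixed number of examples is created, an optional 'id' attribute is added, and each function description produces one attribute. It is only an illustration. Python callables stand in for RapidMiner expressions, the helper name `create_example_set` is hypothetical, and the assumption that ids start at 1 is not confirmed by this page.

```python
import math


def create_example_set(number_of_examples, function_descriptions, add_id_attribute=True):
    """Return a list of rows (dicts), one per generated example.

    function_descriptions maps an attribute name to a callable that receives
    the row built so far. This stands in for a RapidMiner expression.
    """
    rows = []
    for i in range(1, number_of_examples + 1):  # assumption: ids start at 1
        row = {"id": i} if add_id_attribute else {}
        for name, func in function_descriptions.items():
            row[name] = func(row)
        rows.append(row)
    return rows


examples = create_example_set(
    number_of_examples=5,
    function_descriptions={
        "double_id": lambda r: 2 * r["id"],
        "sqrt_id": lambda r: math.sqrt(r["id"]),
    },
)
```

As in the operator, the 'id' attribute is available to the expressions without being listed among the function descriptions.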
Usage of the numeric_series generator
This tutorial process uses several Create ExampleSet operators with the generator type 'numeric_series' to generate different ExampleSets. Two operators demonstrate the two ways of configuring the 'base' series. Two other operators show more advanced configurations.
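The relationship between the 'base' series and the series type can be sketched in Python. The sketch follows the parameter descriptions on this page: the 'base' series is equidistant, defined either by 'min' and a stepsize or by 'min' and a stop value that is included, and the type is a function applied to each base value. The function names are hypothetical, and the handling of a single example in range mode is an assumption.

```python
import math

# Functions applied to the 'base' series x, keyed by the documented type names.
SERIES_TYPES = {
    "linear": lambda x: x,
    "quadratic": lambda x: x ** 2,
    "square root": math.sqrt,
    "power of 10": lambda x: 10 ** x,
    "power of 2": lambda x: 2 ** x,
    "power of E": math.exp,
    "ln": math.log,
    "log10": math.log10,
    "log2": math.log2,
}


def base_series(number_of_examples, minimum, max_or_stepsize, use_stepsize):
    """Equidistant 'base' series x."""
    if use_stepsize:
        step = max_or_stepsize
    elif number_of_examples > 1:
        # Range mode: start and stop value are both part of the series.
        step = (max_or_stepsize - minimum) / (number_of_examples - 1)
    else:
        step = 0.0  # assumption: a single example is just the start value
    return [minimum + i * step for i in range(number_of_examples)]


def numeric_series(series_type, number_of_examples, minimum, max_or_stepsize, use_stepsize):
    """Apply the function of the given type to every value of the 'base' series."""
    func = SERIES_TYPES[series_type]
    return [func(x) for x in base_series(number_of_examples, minimum, max_or_stepsize, use_stepsize)]


range_mode = base_series(5, 0, 10, use_stepsize=False)        # [0.0, 2.5, 5.0, 7.5, 10.0]
step_mode = numeric_series("quadratic", 4, 1, 2, use_stepsize=True)  # [1, 9, 25, 49]
```

Note that 'min' and 'max/stepsize' describe x, not the resulting attribute: with type quadratic, a base series from 1 to 7 yields attribute values from 1 to 49.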
Usage of the date_series generator
This tutorial process uses several Create ExampleSet operators with the generator type 'date_series' to generate different ExampleSets. The two different types of configuring the date series are demonstrated.
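The two ways of configuring a date series can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the operator's implementation: the function names are hypothetical, the 'date format' parsing step is left out (Python `datetime` objects are passed directly), and the interval variant only covers the fixed-length interval types. The types year and month need calendar arithmetic and are omitted here.

```python
from datetime import datetime, timedelta

# Fixed-length interval types only; 'year' and 'month' are not covered by this sketch.
FIXED_INTERVALS = {
    "week": timedelta(weeks=1),
    "day": timedelta(days=1),
    "hour": timedelta(hours=1),
    "minute": timedelta(minutes=1),
    "second": timedelta(seconds=1),
    "millisecond": timedelta(milliseconds=1),
}


def date_series_range(number_of_examples, start, end):
    """Series from start to end (both included), equidistant in absolute time."""
    if number_of_examples == 1:
        return [start]  # assumption: a single example is just the start date
    step = (end - start) / (number_of_examples - 1)
    return [start + i * step for i in range(number_of_examples)]


def date_series_interval(number_of_examples, start, stepsize, interval_type):
    """Series that starts at start and adds stepsize * interval for each value."""
    step = stepsize * FIXED_INTERVALS[interval_type]
    return [start + i * step for i in range(number_of_examples)]


# Three values over two calendar years, one of which (2020) is a leap year.
by_range = date_series_range(3, datetime(2019, 1, 1), datetime(2021, 1, 1))
by_interval = date_series_interval(3, datetime(2020, 1, 1), 2, "hour")
```

In the range example the middle value is 2020-01-01 12:00 rather than midnight, because the 731 days are split into two equal halves of 365.5 days. This is the effect described above: values are equidistant in elapsed time, not in calendar units.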
Usage of the comma_separated_text generator
This tutorial process uses several Create ExampleSet operators with the generator type 'comma_separated_text' to generate different ExampleSets.
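The interplay of the parameters of this generator type can be approximated with a short Python sketch. It is a simplification under several assumptions: the function names are hypothetical, columns are split naively without quoting support, and type guessing only distinguishes numeric from nominal columns. The operator's actual guessing rules are not described on this page and may be richer.

```python
def _to_number(value, decimal_point_character):
    """Return value as float, or None if it is not numeric under the given decimal point."""
    value = value.strip()
    if decimal_point_character != "." and "." in value:
        return None  # '.' is not the decimal point here, so do not accept it
    try:
        return float(value.replace(decimal_point_character, "."))
    except ValueError:
        return None


def parse_csv_text(text, column_separator=",", parse_all_as_nominal=False,
                   decimal_point_character=".", trim_attribute_names=True):
    """Convert separated text into a dict of columns (attribute name -> list of values)."""
    lines = [line for line in text.splitlines() if line.strip()]
    names = lines[0].split(column_separator)  # first line holds the attribute names
    if trim_attribute_names:
        names = [name.strip() for name in names]
    rows = [line.split(column_separator) for line in lines[1:]]

    columns = {}
    for index, name in enumerate(names):
        values = [row[index] for row in rows]
        if not parse_all_as_nominal:
            numbers = [_to_number(v, decimal_point_character) for v in values]
            if all(n is not None for n in numbers):
                values = numbers  # every value parsed, so treat the column as numeric
        columns[name] = values
    return columns


text = " name ; price \napple;1,5\npear;2,25"
guessed = parse_csv_text(text, column_separator=";", decimal_point_character=",")
nominal = parse_csv_text(text, column_separator=";", parse_all_as_nominal=True)
```

With type guessing enabled, the 'price' column becomes numeric because ',' is declared as the decimal point. With 'parse all as nominal' set to true, the same values stay as text, which also skips the extra pass over all rows.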