Generate Data (RapidMiner Studio Core)

Synopsis

This operator generates an ExampleSet based on numerical attributes. The number of attributes, number of examples, lower and upper bounds of attributes, and target function can be specified by the user.

Description

The Generate Data operator generates an ExampleSet with a specified number of numerical attributes, controlled by the number of attributes parameter. In addition to these regular attributes, a label attribute is generated automatically by applying the function selected by the target function parameter to the regular attributes (a few target functions, noted in the list below, generate no label). For example, if the number of attributes parameter is set to 3 and the target function is set to 'sum', three regular numerical attributes are created, and the label of each example is the sum of its three regular attribute values.
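
The 'sum' case described above can be illustrated with a minimal Python sketch. The helper name add_sum_label and the dictionary layout are hypothetical and chosen only for illustration; this is not RapidMiner's implementation.

```python
def add_sum_label(attributes):
    """Return a copy of one example with a 'label' equal to the sum of its attributes."""
    example = dict(attributes)
    # The 'sum' target function: label = att1 + att2 + ... + att[n]
    example["label"] = sum(attributes.values())
    return example


# One example with three regular attributes (values picked by hand).
example = add_sum_label({"att1": 1.5, "att2": -2.0, "att3": 4.0})
print(example)  # {'att1': 1.5, 'att2': -2.0, 'att3': 4.0, 'label': 3.5}
```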

The label target functions are calculated as follows (assuming n generated attributes):

  • random: The label is randomly generated.
  • sum (needs at least 3 attributes): The label is the sum of the arguments: att1 + att2 + ... + att[n]
  • polynomial (needs at least 3 attributes): att1^3 + att2^2 + att3
  • non linear (needs at least 3 attributes): att1 * att2 * att3 + att1 * att2 + att2 * att2
  • one variable non linear (needs 1 attribute): 3 * att1^3 - att1^2 + 1000 / |att1| + 2000 * |att1|
  • complicated function (needs at least 3 attributes): att1 * att1 * att2 + att2 * att3 + max(att1,att2) - e^att3
  • complicated function2 (needs at least 3 attributes): att1 * att1 * att1 + att2 * att2 + att1 * att2 + att1 / |att3| - 1 / (att3 * att3)
  • simple sinus (needs 1 attribute): sin(att1)
  • sinus (needs 2 attributes): sin(att1 * att2) + sin(att1 + att2)
  • simple superposition (needs 1 attribute): 5 * sin(att1) + sin(30 * att1)
  • sinus frequency (needs at least 2 attributes): 10 * sin(3 * att1) + 12 * sin(7 * att1) + 11 * sin(5 * att2) + 9 * sin(10 * att2) + 10 * sin(8 * (att1 + att2))
  • sinus with trend (needs 1 attribute): sin(att1) + 0.1 * att1
  • sinc: sin(x) / ||x||, if ||x|| is not 0, else 0.
  • triangular function (needs 1 attribute): The label is the fractional part of the argument.
  • square pulse function (needs 1 attribute): The label is a square pulse in the attribute.
  • random classification: The label is randomly "negative" or "positive".
  • one third classification: The label is "positive" if att1 < 0.3333, else "negative".
  • sum classification: The label is "positive" if the sum of all arguments is positive, else "negative".
  • quadratic classification (needs at least 2 attributes): The label is "positive" if att2 > att1^2, else "negative".
  • simple non linear classification (needs at least 2 attributes): The label is "positive" if 50 < att1*att2 < 80, else "negative".
  • interaction classification (needs at least 3 attributes): The label is "positive" if att1 < 0 or att2 > 0 and att3 < 0, else "negative".
  • simple polynomial classification (needs at least 1 attribute): The label is "positive" if att1^4 > 100, else "negative".
  • polynomial classification (needs at least 4 attributes): The label is "positive" if att1^3 + att2^2 - att3^2 + att4 > 0, else "negative".
  • checkerboard classification (needs 2 attributes): The label is "positive" or "negative", according to a checkerboard pattern, where the size of each tile is 5.
  • random dots classification (needs 2 attributes): Some randomly sized and placed positive and negative dots are generated on the 2D field. The label is "positive" if the example is only contained by positive dots, else "negative".
  • global and local models classification (needs 2 attributes): The label is "positive" if the sum of both arguments is positive, else "negative". In addition, several local patterns in different sizes are placed in the data space.
  • sinus classification (needs at least 2 attributes): The label is "positive" if sin(att1*att2) + sin(att1+att2) > 0, else "negative".
  • multi classification: The label is "one" if the sum of all arguments modulo 2 is 0, "two" if the sum modulo 3 is 0, "three" if the sum modulo 5 is 0, else "four".
  • two gaussians classification: Generates two Gaussian clusters. The label is either "cluster0" or "cluster1".
  • transactions dataset (needs at least 5 attributes): Generates an association function transaction dataset; all attribute values are 0 or 1. The first four attributes are correlated. No label is generated.
  • grid function: Generates a uniformly distributed grid in the given dimensions. A label with zero value is generated.
  • three ring clusters (needs 2 attributes): Generates three concentric ring clusters. The label values are "core", "first_ring" and "second_ring", accordingly.
  • spiral cluster (needs 2 attributes): Generates two interlocking spiral clusters. The label values are "spiral1" and "spiral2", accordingly.
  • single gaussian cluster: Generates a Gaussian cluster. A label with zero value is generated.
  • gaussian mixture clusters: Generates a mixture of Gaussian clusters. Each attribute doubles the cluster amount, so 2^n clusters are generated. A label with the cluster id is generated.
  • driller oscillation timeseries (needs at least 2 attributes): Generates an artificial audio data set (based on real-world data from drilling processes). No label is generated.
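
A few of the formulas listed above can be written out in Python as a rough sketch. The function names, the TARGET_FUNCTIONS dictionary, and the label_for helper are hypothetical; they only restate the documented formulas and do not reflect how RapidMiner implements them. Attributes are passed as a zero-indexed list, so att1 is att[0].

```python
import math


def polynomial(att):
    # att1^3 + att2^2 + att3
    return att[0] ** 3 + att[1] ** 2 + att[2]


def sinus(att):
    # sin(att1 * att2) + sin(att1 + att2)
    return math.sin(att[0] * att[1]) + math.sin(att[0] + att[1])


def sinus_with_trend(att):
    # sin(att1) + 0.1 * att1
    return math.sin(att[0]) + 0.1 * att[0]


def sum_classification(att):
    # "positive" if the sum of all arguments is positive, else "negative"
    return "positive" if sum(att) > 0 else "negative"


def quadratic_classification(att):
    # "positive" if att2 > att1^2, else "negative"
    return "positive" if att[1] > att[0] ** 2 else "negative"


def polynomial_classification(att):
    # "positive" if att1^3 + att2^2 - att3^2 + att4 > 0, else "negative"
    value = att[0] ** 3 + att[1] ** 2 - att[2] ** 2 + att[3]
    return "positive" if value > 0 else "negative"


# Hypothetical lookup table keyed by the names used in the list above.
TARGET_FUNCTIONS = {
    "polynomial": polynomial,
    "sinus": sinus,
    "sinus with trend": sinus_with_trend,
    "sum classification": sum_classification,
    "quadratic classification": quadratic_classification,
    "polynomial classification": polynomial_classification,
}


def label_for(target_function, att):
    """Compute the label of one example for the chosen target function."""
    return TARGET_FUNCTIONS[target_function](att)


print(label_for("polynomial", [2.0, 3.0, 4.0]))          # 21.0
print(label_for("quadratic classification", [1.0, 2.0]))  # positive
```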

Output

  • output (Data Table)

    The ExampleSet generated by the operator is delivered through this port, together with its meta data. This output is the same as the output of the Retrieve operator.

Parameters

  • target_function: This parameter specifies the target function for generating the label attribute. There are different options; users can choose any of them. Range: selection
  • number_examples: This parameter specifies the number of examples to be generated. Range: integer
  • number_of_attributes: This parameter specifies the number of regular attributes to be generated. Please note that the label attribute is generated automatically besides these regular attributes. Range: integer
  • attributes_lower_bound: This parameter specifies the minimum possible value for the attributes to be generated. In other words, this parameter specifies the lower bound of the range of possible values of regular attributes. In case of target functions using Gaussian distribution, the attribute values may exceed this value. Range: real
  • attributes_upper_bound: This parameter specifies the maximum possible value for the attributes to be generated. In other words, this parameter specifies the upper bound of the range of possible values of regular attributes. In case of target functions using Gaussian distribution, the attribute values may exceed this value. Range: real
  • gaussian_standard_deviation: This parameter specifies the standard deviation of the Gaussian distribution used for generating attributes. Range: real
  • largest_radius: This parameter specifies the radius of the outermost ring cluster. Range: real
  • use_local_random_seed: This parameter indicates if a local random seed should be used for randomization. Using the same value of local random seed will produce the same ExampleSet. Changing the value of this parameter changes the way examples are randomized, thus the ExampleSet will have a different set of values. Range: boolean
  • local_random_seed: This parameter specifies the local random seed. This parameter is only available if the use local random seed parameter is set to true. Range: integer
  • data_management: This is an expert parameter. A long list is provided; users can select any option from this list. Range: selection
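
The effect of a local random seed can be illustrated with a small Python sketch. It assumes attribute values are drawn uniformly between the lower and upper bound, which the text above does not state explicitly, and it uses Python's own random generator, so the values will not match those produced by RapidMiner. The function name generate_attributes is hypothetical.

```python
import random


def generate_attributes(number_examples, number_of_attributes, lower, upper, seed):
    """Draw attribute values from a generator seeded locally (assumed uniform)."""
    rng = random.Random(seed)  # local generator, independent of the global one
    return [
        [rng.uniform(lower, upper) for _ in range(number_of_attributes)]
        for _ in range(number_examples)
    ]


first = generate_attributes(5, 3, -10.0, 10.0, seed=1992)
second = generate_attributes(5, 3, -10.0, 10.0, seed=1992)
third = generate_attributes(5, 3, -10.0, 10.0, seed=2024)

print(first == second)  # same seed: identical values
print(first == third)   # different seed: different values
```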

Tutorial Processes

Introduction to the Generate Data operator

The Generate Data operator is applied to generate an ExampleSet. The target function parameter is set to 'sum', so the label attribute is the sum of all regular attribute values. The number examples parameter is set to 100, so the ExampleSet has 100 examples. The number of attributes parameter is set to 3, so three numerical attributes are generated in addition to the label attribute. The attributes lower bound and attributes upper bound parameters are set to -10 and 10 respectively, so the values of the regular attributes lie within this range. You can verify this by viewing the results in the Results Workspace. The use local random seed parameter is set to false in this example process. Set it to true and run the process with different values of the local random seed parameter; you will see that changing the seed changes the randomization.
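
The checks suggested in this tutorial can also be mimicked outside RapidMiner. The following Python sketch uses the same settings (100 examples, 3 attributes, bounds -10 and 10, target function 'sum'). The generate_data function is a hypothetical stand-in that assumes uniformly distributed attribute values; it is not the operator itself.

```python
import random


def generate_data(number_examples, number_of_attributes, lower, upper, seed=None):
    """Hypothetical stand-in for the operator with the 'sum' target function."""
    rng = random.Random(seed)
    examples = []
    for _ in range(number_examples):
        atts = [rng.uniform(lower, upper) for _ in range(number_of_attributes)]
        examples.append({"attributes": atts, "label": sum(atts)})
    return examples


# Same settings as the tutorial process.
example_set = generate_data(100, 3, -10.0, 10.0, seed=1)

# Every regular attribute value should lie within the configured bounds.
in_range = all(
    -10.0 <= value <= 10.0 for ex in example_set for value in ex["attributes"]
)
# Every label should equal the sum of that example's regular attributes.
labels_match = all(ex["label"] == sum(ex["attributes"]) for ex in example_set)

print(len(example_set), in_range, labels_match)
```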