Function Fitting (RapidMiner Studio Core)

Synopsis

Fits a parametrized numeric function to a set of data points.

Description

This operator takes a parametrized numeric function and a set of data points and fits the function to those points. It does so by minimizing the objective function

obj(a) = Σₖ (f_a(x_k) - y_k)²

where (x_k, y_k), k ∈ {1, ..., N} are the N given data points, f_a is the parametrized function, and a is the set of function parameters.

Use the expression parameter to specify a parametrized function. Variables that are not attributes in the input example set are automatically recognized and optimized as function parameters.
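As a sketch of what the operator does internally, the least-squares objective and its minimization can be written in a few lines of Python. The data points, the linear expression a * x + b, and the use of scipy's derivative-free Powell method are illustrative assumptions here, not the operator's actual implementation:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data points (x_k, y_k) roughly following y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.2, 2.9, 5.1, 6.8])

def objective(params):
    """obj(a) = sum over k of (f_a(x_k) - y_k)^2 for f_a(x) = a*x + b."""
    a, b = params
    return np.sum((a * x + b - y) ** 2)

# Derivative-free local minimization from an initial guess; scipy's Powell
# method stands in for the operator's optimizers (BOBYQA / CMA-ES).
result = minimize(objective, x0=[1.0, 0.0], method="Powell")
a_opt, b_opt = result.x
```

The optimized parameters correspond to the ordinary least-squares fit of the line through the data.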

(Please note: This operator is in a beta state and its behavior may change in future releases.)

Input

  • training set (Data Table)

    This input port expects an ExampleSet. It should hold the label and the variables used in the parametrized function.

Output

  • prediction (Data Table)

    The training set with an additional prediction column. The prediction column is the result of applying the generated model to the data points.

  • parameters (Data Table)

    Example set holding the optimized parameter values and the corresponding error.

  • model (Model)

    Model holding the fitting information.

  • original (Data Table)

    The training set is passed without any modifications through this port.

Parameters

  • expression The parametrized numeric function can be specified here. Use the calculator button to the right to open the 'Edit Expression' window. Range: string
  • optimization_algorithm The optimization algorithm used to minimize the objective function.
    • Michael J. D. Powell's BOBYQA (Bound Optimization BY Quadratic Approximation) algorithm. It can be applied to problems of dimension >= 2.
    • CMA-ES (Covariance Matrix Adaptation Evolution Strategy) algorithm. It can be applied to problems of dimension >= 1.
    Range: selection
  • initial_parameter_values The initial parameter values. If you specify bounds for the parameters, then the initial parameter values must lie within these bounds. Range: list
  • parameter_bounds Bounds for the parameter values. Please ensure that the initial values lie within these bounds. Range: list
  • max_iterations The maximum number of iterations to be used for the model fitting. Range: integer
  • max_evaluations The maximum number of function evaluations to be used for the model fitting. Range: integer
  • set_interpolation_points

    BOBYQA optimization parameter:

    Check this parameter to manually set the number of interpolation points.

    Range: boolean
  • interpolation_points

    BOBYQA optimization parameter:

    The number of interpolation points used to locally approximate the objective function.

    (This parameter is only available if the set interpolation points parameter is set to true.)

    Range: integer
  • initial_trust

    BOBYQA optimization parameter:

    The initial trust region radius.

    Range: real
  • stop_trust

    BOBYQA optimization parameter:

    Stopping criterion. The algorithm stops if the trust region radius drops below this threshold.

    Range: real
  • sigma

    CMA-ES optimization parameter:

    The initial standard deviation for sampling new search points. Large values lead to a broader, small values to a more local search.

    Range: real
  • set_population_size

    CMA-ES optimization parameter:

    Check this parameter to manually set the population size. By default the algorithm uses a population size of 4 + 3 * ln(n), where n is the number of optimized function parameters.

    Range: boolean
  • population_size

    CMA-ES optimization parameter:

    The number of offspring used to explore the search space.

    (This parameter is only available if the set population size parameter is set to true.)

    Range: integer
  • use_local_random_seed

    CMA-ES optimization parameter:

    This parameter indicates if a local random seed should be used for randomization. Using the same value for local random seed will produce the same randomization.

    Range: boolean
  • local_random_seed

    CMA-ES optimization parameter:

    This parameter specifies the local random seed.

    (This parameter is only available if the use local random seed parameter is set to true.)

    Range: integer
  • active_cma

    CMA-ES optimization parameter:

    If set to true, the algorithm will use active covariance matrix adaptation.

    Range: boolean
  • diagonal_only

    CMA-ES optimization parameter:

    Number of initial iterations with diagonal covariance matrix. Special case: Setting this parameter to 1 means keeping the covariance matrix always diagonal.

    Range: integer
  • feasible_count

    CMA-ES optimization parameter:

    The number of times new random offspring are generated when they fall outside the defined bounds.

    Range: integer
  • stop_improvement

    CMA-ES optimization parameter:

    Stopping criterion. Algorithm stops if the error improvement is below the given threshold.

    Range: real
  • stop_error

    CMA-ES optimization parameter:

    Stopping criterion. Algorithm stops if the error is below the given threshold.

    Range: real
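The CMA-ES default population size mentioned above (4 + 3 * ln(n)) can be reproduced in a short sketch; the integer truncation of the logarithm term is an assumption about how implementations round the formula:

```python
import math

def default_population_size(n):
    """Default CMA-ES offspring count for n optimized function parameters.

    The documentation states 4 + 3 * ln(n); typical implementations
    truncate the logarithm term to an integer (an assumption here).
    """
    return 4 + int(3 * math.log(n))
```

For a two-parameter expression such as a * x + b this gives a population of 6; for ten parameters it gives 10.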

Tutorial Processes

Applying the Function Fitting operator to a data set

A set of data points ((x,y) pairs) is read and the y attribute is declared to be the label. We use the Function Fitting operator to find a function and the corresponding function parameters that fit the data points well. First, try out a simple linear function. Click on the uppermost operator to see the chosen function:

a * x + b

Since x is the only variable that is part of the training set, a and b are automatically recognized and optimized as function parameters. Run the process to see the results. You should see the optimized values for the parameters a and b as well as the corresponding error. The error is the sum of the squared differences between the label and the predicted label. The second example set holds the predicted values. You can plot them against x to get a feeling for the generated model.

The linear model seems too simple to capture the curved trend of the data points. The second and third operators use parametrized polynomial and sine functions instead. Click on these operators to view the expressions. Then connect their output ports, run the process, and view the results. The error for these models turns out to be much lower. If you plot the predictions, you will see that they match the data points nicely.
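The effect of choosing a model family that matches the shape of the data can be reproduced outside RapidMiner with a small Python sketch; the synthetic sine-shaped data, the parameter names, and the use of scipy's curve_fit are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import curve_fit

# Synthetic curved data following a * sin(b*x) + c with a little noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 6.0, 50)
y = 2.0 * np.sin(1.5 * x) + 0.5 + rng.normal(0.0, 0.05, x.size)

def linear(x, a, b):
    return a * x + b

def sine(x, a, b, c):
    return a * np.sin(b * x) + c

def sse(f, params):
    """Sum of squared errors between prediction and label."""
    return np.sum((f(x, *params) - y) ** 2)

p_lin, _ = curve_fit(linear, x, y)
p_sin, _ = curve_fit(sine, x, y, p0=[1.0, 1.4, 0.0])

# The sine model captures the curved trend, so its error is far lower
# than that of the straight line.
```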

The optimized c parameter for the sine model in the previous example turned out to be negative. Let's assume that we want this parameter to be positive instead. There can be many practical reasons for constraints like this. For example, c could represent the amount of a certain ingredient used in our company's product. If we cannot add a negative amount to a product, it makes sense to rule out such solutions.

Click on the last operator, which uses the same sine function as before. Then click on Show advanced parameters so that you can access the parameter bounds. If you inspect the bounds, you can see that a lower bound of 0 has been added to the c parameter here. Furthermore, an initial value of 0.5 has been set for c. It is important to ensure that the initial parameter values lie within the specified bounds.
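Per-parameter bounds work the same way in the sketch below. The data, the lower bound of 0 on c, and the initial value 0.5 mirror the tutorial's setup, while scipy's curve_fit bounds argument is an illustrative stand-in for the operator's parameter_bounds list:

```python
import numpy as np
from scipy.optimize import curve_fit

# Data whose best offset c is actually negative.
x = np.linspace(0.0, 6.0, 40)
y = 2.0 * np.sin(1.5 * x) - 0.8

def sine(x, a, b, c):
    return a * np.sin(b * x) + c

# Lower bound of 0 on c only; the initial value 0.5 lies inside the bounds.
lower = [-np.inf, -np.inf, 0.0]
upper = [np.inf, np.inf, np.inf]
params, _ = curve_fit(sine, x, y, p0=[1.0, 1.4, 0.5], bounds=(lower, upper))
a_fit, b_fit, c_fit = params

# The constrained fit pushes c toward its bound instead of the true -0.8.
```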

Connect the operator's output ports and run the process. You will see that the results for this bounded problem turn out to be a lot worse. The error is high and the model does not fit the data points well. Maybe the optimizer got stuck in a local optimum?

Let's choose a more robust optimizer that is less likely to fall for local optima. Click on the operator again and choose CMA-ES as optimization algorithm instead. Run the process to get a low error as well as a positive c parameter. Just what we were looking for!
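CMA-ES itself is not available in scipy, so the sketch below uses scipy's differential_evolution, another population-based global optimizer, as a plainly named stand-in to show how a global search handles a bounded problem without getting stuck in a local optimum (data and bounds are illustrative):

```python
import numpy as np
from scipy.optimize import differential_evolution

# Data following a * sin(b*x) + c with a genuinely positive c.
x = np.linspace(0.0, 6.0, 40)
y = 2.0 * np.sin(1.5 * x) + 0.8

def objective(params):
    a, b, c = params
    return np.sum((a * np.sin(b * x) + c - y) ** 2)

# Population-based global search over the bounded parameter space.
result = differential_evolution(objective,
                                bounds=[(0, 5), (0, 5), (0, 5)],
                                seed=42)
a_opt, b_opt, c_opt = result.x
```

The global search reaches a low error while keeping all parameters within their bounds, including a non-negative c.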

This ends the guided tutorial. Next, try out your own expressions. You can use any expression that can be built with our expression parser. For help with that, press the calculator button next to the expression parameter.

Also, try out your own data points. Interested in high-dimensional data? The Function Fitting operator handles data points of higher dimension in just the same way.
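For instance, fitting a plane a * x1 + b * x2 + c to two-dimensional data points uses exactly the same least-squares machinery; the data and the use of scipy's curve_fit are, again, illustrative assumptions:

```python
import numpy as np
from scipy.optimize import curve_fit

# Two-dimensional data points generated from y = 2*x1 - x2 + 0.5.
x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
x2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0])
y = 2.0 * x1 - 1.0 * x2 + 0.5

def plane(X, a, b, c):
    """The fitted function simply takes more variables."""
    x1, x2 = X
    return a * x1 + b * x2 + c

params, _ = curve_fit(plane, (x1, x2), y)
```

Since the data are noise-free and the model is linear in its parameters, the fit recovers a = 2, b = -1, c = 0.5 exactly.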