API changes in RapidMiner 9.8 and 9.9

From ExampleSet to Belt Table

Forget about the ExampleSet class and start using com.rapidminer.belt.table.Table, RapidMiner's new representation of example sets. The corresponding framework is called Belt. It comes with several advantages compared to ExampleSet:

  • Column-oriented design: a column-oriented data layout allows for using compact representations for the different column types.
  • Immutability: all columns and tables are immutable. This not only guarantees data integrity but also allows for safely reusing components, e.g., multiple tables can safely reference the same column.
  • Thread-safety: all public data structures are thread-safe and designed to perform well when used concurrently.
  • Implicit parallelism: much of Belt's built-in functionality, such as the transformations shown in the examples below, automatically scales out to multiple cores.
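
The immutability point can be illustrated with a few lines of plain Java. The following is only a sketch and not Belt code: SimpleColumn and SimpleTable are hypothetical stand-ins that show why two tables can safely reference the same column object once columns can no longer change.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for an immutable numeric column (not a Belt class). */
final class SimpleColumn {
    private final double[] data;

    SimpleColumn(double[] values) {
        // defensive copy so later changes to the caller's array cannot leak in
        this.data = values.clone();
    }

    double get(int row) {
        return data[row];
    }

    int size() {
        return data.length;
    }
}

/** Hypothetical stand-in for an immutable table that maps labels to columns. */
final class SimpleTable {
    private final Map<String, SimpleColumn> columns;

    SimpleTable(Map<String, SimpleColumn> columns) {
        // private copy of the label-to-column mapping; the columns themselves are shared
        this.columns = new LinkedHashMap<>(columns);
    }

    SimpleColumn column(String label) {
        return columns.get(label);
    }

    /** Returns a new table with one more column; existing columns are shared, not copied. */
    SimpleTable with(String label, SimpleColumn column) {
        Map<String, SimpleColumn> extended = new LinkedHashMap<>(columns);
        extended.put(label, column);
        return new SimpleTable(extended);
    }
}
```

Calling with on a table leaves the original table untouched and returns a second table whose existing columns are the very same objects, so no column data is duplicated.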

To learn everything about the Belt framework please refer to the official documentation of the Belt project.

This page focuses on the differences between the old example set and the new Belt framework and presents some examples of how to implement operators using the Belt framework and the Table class.

If you are new to extension development for RapidMiner Studio, then Create your own extension is a great starting point for you.

Sum operator example

Let's start with an example. We will create an operator that takes a table with only numeric columns, calculates the sum for each row and adds these row sums as a new column to the resulting table.

Data transformation

Let's begin with the doWork() method. You receive the input table by calling:

IOTable ioTable = tableInput.getData(IOTable.class);
Table table = ioTable.getTable();

You need not worry whether the actual data at the port is an IOTable or an ExampleSet, since RapidMiner automatically converts it to the requested format. This makes collaboration between new operators working on Tables and old operators working on ExampleSets easy.

Then, to keep the code clean, we delegate the actual work to the calculateSum method.

// read table, calculate sum and return new table
Table result = calculateSum(table);

Now deliver the resulting table to the output port.

IOTable newIOTable = new IOTable(result);
newIOTable.getAnnotations().addAll(ioTable.getAnnotations());
tableOutput.deliver(newIOTable);

Since the Table class itself is not an IOObject we need to wrap it with the IOTable class. Also it is important to copy the annotations of the input IOTable to the new IOTable because otherwise they will be lost.

Finally, it is good practice to also deliver the input table to an output port:

originalOutput.deliver(ioTable);

That's the doWork() method. Let's move on to implementing the calculateSum(Table table) method. First of all, check that the given Table contains only numeric columns. The BeltErrorTools class holds some convenience methods for these kinds of checks.

BeltErrorTools.onlyNumeric(table, getName(), this);

Next, we determine whether the result will be of type real or integer. If any column is of type real, the result will also be of type real. The table provides a ColumnSelector that can be accessed via the select() method. A column selector filters the columns of a table via predicates. The default predicates filter by type, category, capability and meta data (e.g. roles). You can even define your own predicates for custom filter operations. The ofTypeId method does the trick:

boolean resultIsReal = !table.select().ofTypeId(Column.TypeId.REAL).labels().isEmpty();

Since the Column class is immutable, we need a column buffer to fill and instantiate a new column:

NumericBuffer buffer = resultIsReal ? Buffers.realBuffer(table.height())
                : Buffers.integer53BitBuffer(table.height());

Tables can be read column-wise or row-wise. In this case we want to read it row-wise so that we can calculate the sum for each row:

NumericRowReader reader = Readers.numericRowReader(table);
for (int i = 0; i < buffer.size(); i++) {
    // move must be called to advance the reader to the next row
    reader.move();
    double sum = 0;
    for (int j = 0; j < reader.width(); j++) {
        // reader.get(j) returns the value of the j-th column of the row
        sum += reader.get(j);
    }
    buffer.set(i, sum);
}

The move method advances the reader to the next row. Please note that it must be called before the first row is read.
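
The cursor pattern behind move() can be sketched without Belt. MatrixRowCursor below is a hypothetical stand-in, not Belt's NumericRowReader: it starts positioned before the first row, which is why move() has to be called once before the first read. Two things are assumptions of this sketch: the data is kept row-major for brevity, and reading too early throws an IllegalStateException.

```java
/** Hypothetical row-wise cursor over a row-major matrix (not Belt's NumericRowReader). */
final class MatrixRowCursor {
    private final double[][] rows;
    private int position = -1; // starts before the first row

    MatrixRowCursor(double[][] rows) {
        this.rows = rows;
    }

    /** Advances to the next row; must be called once before the first read. */
    void move() {
        position++;
    }

    /** Returns the value of the given column in the current row. */
    double get(int column) {
        if (position < 0) {
            throw new IllegalStateException("move() must be called before the first read");
        }
        return rows[position][column];
    }

    int width() {
        return rows.length == 0 ? 0 : rows[0].length;
    }
}

final class RowSumDemo {
    /** Computes one sum per row, mirroring the loop structure shown above. */
    static double[] rowSums(double[][] rows) {
        MatrixRowCursor cursor = new MatrixRowCursor(rows);
        double[] sums = new double[rows.length];
        for (int i = 0; i < sums.length; i++) {
            cursor.move();
            double sum = 0;
            for (int j = 0; j < cursor.width(); j++) {
                sum += cursor.get(j);
            }
            sums[i] = sum;
        }
        return sums;
    }
}
```

For the rows {1, 2, 3} and {4, 5, 6}, rowSums returns 6.0 and 15.0.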

We have calculated the row sums and filled them into the buffer. Next, copy the original table and add a new column to it. Since the Table class is immutable we will use a table builder:

TableBuilder builder = Builders.newTableBuilder(table);
builder.add("Sum", buffer.toColumn());

Please note that the data stored in the buffer can no longer be modified after calling the toColumn method. Attempting to do so will lead to an exception.
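
Why a buffer is frozen once it has been turned into a column can be shown with a small write-once buffer. FreezableBuffer is a hypothetical class, not Belt's NumericBuffer, and the choice of IllegalStateException is an assumption of this sketch. Because writes are refused after toColumn(), the returned read-only view can share the underlying array without copying it.

```java
import java.util.function.IntToDoubleFunction;

/** Hypothetical write-once buffer (not Belt's NumericBuffer). */
final class FreezableBuffer {
    private final double[] data;
    private boolean frozen;

    FreezableBuffer(int size) {
        this.data = new double[size];
    }

    void set(int index, double value) {
        if (frozen) {
            throw new IllegalStateException("buffer is frozen after toColumn()");
        }
        data[index] = value;
    }

    /** Freezes the buffer and returns a read-only view that shares the underlying array. */
    IntToDoubleFunction toColumn() {
        frozen = true;
        // no copy needed: set() refuses all further writes, so the view can never change
        return row -> data[row];
    }
}
```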

Nearly done! All that's left to do is to build and to return the table. And this is where Belt's implicit parallelism comes into play. The build method takes the operator's context that can be accessed via the BeltTools class and runs the build process in parallel.

Table result = builder.build(BeltTools.getContext(this));
return result;

This concludes the data transformation for the operator.

Meta data transformation

Next, let's implement the meta data transformation. The meta data class for IOTables (called TableMetaData) comes with methods and functionality similar to what the Table class offers:

public SumOperator(OperatorDescription description) {
    super(description);
    // we want TableMetaData with only numeric columns as input
    tableInput.addPrecondition(new TablePrecondition(tableInput, Column.Category.NUMERIC));
    // pass through the original data
    getTransformer().addRule(new TablePassThroughRule(tableInput, originalOutput, SetRelation.EQUAL));
    // generate meta data for new table
    getTransformer().addRule(new TablePassThroughRule(tableInput, tableOutput, SetRelation.EQUAL) {
        @Override
        public TableMetaData modifyTableMetaData(TableMetaData metaData) {
            return SumOperator.this.calculateSumMD(metaData);
        }
    });
}

The first few lines should be familiar to you if you have implemented meta data transformation before. Use a precondition to show warnings to the user if the provided meta data is not TableMetaData or if it holds any non-numeric columns. Columns of category numeric are either integer or real columns. The first TablePassThroughRule passes the table meta data through to the original output without any modifications. Add a second rule and override the modifyTableMetaData method. The meta data transformation can be done similarly to the data transformation:

/**
 * Analogous to {@link #calculateSum(Table)} but for {@link TableMetaData}.
 *
 * @param metaData
 *      the original TableMetaData
 * @return new TableMetaData with the original columns and a sum column
 */
private TableMetaData calculateSumMD(TableMetaData metaData) {
    // If any column is of type real the result will be real. Otherwise, it will be integer.
    boolean resultIsReal = metaData.containsType(ColumnType.REAL, true) != MetaDataInfo.NO;
    // copy original TableMetaData using TableMetaDataBuilder
    TableMetaDataBuilder builder = new TableMetaDataBuilder(metaData);
    // add the new column to the builder
    if (resultIsReal) {
        builder.addReal("Sum", null, SetRelation.UNKNOWN, MDInteger.newPossible());
    } else {
        builder.addInteger("Sum", null, SetRelation.UNKNOWN, MDInteger.newPossible());
    }
    // build the new TableMetaData
    return builder.build();
}

Firstly, check if the table meta data contains any columns of type real. If any of the original columns is of type real, the resulting new column will also be of type real. Otherwise it will be of type integer. Since the table meta data is immutable, use a TableMetaDataBuilder to copy and to modify it.

You can use one of the convenience methods addReal or addInteger to add a new column to the meta data. The first argument is the new column's name. The second and third arguments are the numeric range and a set relation describing the uncertainty regarding that range. Since we know little about the actual data, it is hard to predict the range of the resulting column. The null and SetRelation.UNKNOWN arguments inform the builder that we do not know the resulting range. Lastly, the MDInteger.newPossible() argument sets the number of missing values to >= 0.

Build the new table meta data via the builder's build method. This concludes the example. The complete operator code is shown below.

import com.rapidminer.adaption.belt.IOTable;
import com.rapidminer.belt.buffer.Buffers;
import com.rapidminer.belt.buffer.NumericBuffer;
import com.rapidminer.belt.column.Column;
import com.rapidminer.belt.column.ColumnType;
import com.rapidminer.belt.reader.NumericRowReader;
import com.rapidminer.belt.reader.Readers;
import com.rapidminer.belt.table.Builders;
import com.rapidminer.belt.table.Table;
import com.rapidminer.belt.table.TableBuilder;
import com.rapidminer.operator.Operator;
import com.rapidminer.operator.OperatorDescription;
import com.rapidminer.operator.OperatorException;
import com.rapidminer.operator.UserError;
import com.rapidminer.operator.ports.InputPort;
import com.rapidminer.operator.ports.OutputPort;
import com.rapidminer.operator.ports.metadata.MDInteger;
import com.rapidminer.operator.ports.metadata.MetaDataInfo;
import com.rapidminer.operator.ports.metadata.SetRelation;
import com.rapidminer.operator.ports.metadata.table.TableMetaData;
import com.rapidminer.operator.ports.metadata.table.TableMetaDataBuilder;
import com.rapidminer.operator.ports.metadata.table.TablePassThroughRule;
import com.rapidminer.operator.ports.metadata.table.TablePrecondition;
import com.rapidminer.tools.belt.BeltErrorTools;
import com.rapidminer.tools.belt.BeltTools;


/**
 * This operator takes a {@link Table} with only numeric columns, calculates the sum for each row
 * and adds it as a new column.
 */
public class SumOperator extends Operator {

    private final InputPort tableInput = getInputPorts().createPort("example set input");
    private final OutputPort tableOutput = getOutputPorts().createPort("example set output");
    private final OutputPort originalOutput = getOutputPorts().createPort("original");

    public SumOperator(OperatorDescription description) {
        super(description);
        // we want TableMetaData with only numeric columns as input
        tableInput.addPrecondition(new TablePrecondition(tableInput, Column.Category.NUMERIC));
        // pass through the original data
        getTransformer().addRule(new TablePassThroughRule(tableInput, originalOutput, 
            SetRelation.EQUAL));
        // generate meta data for new table
        getTransformer().addRule(new TablePassThroughRule(tableInput, tableOutput, 
            SetRelation.EQUAL) {
            @Override
            public TableMetaData modifyTableMetaData(TableMetaData metaData) {
                return SumOperator.this.calculateSumMD(metaData);
            }
        });
    }

    @Override
    public void doWork() throws OperatorException {
        // fetch table from input port
        IOTable ioTable = tableInput.getData(IOTable.class);
        Table table = ioTable.getTable();

        // read table, calculate sum and return new table
        Table result = calculateSum(table);

        // wrap the result into an IOTable
        IOTable newIOTable = new IOTable(result);

        // copy the annotations from the original IOTable
        newIOTable.getAnnotations().addAll(ioTable.getAnnotations());

        // deliver the new IOTable to the port
        tableOutput.deliver(newIOTable);

        // deliver original table to corresponding port
        originalOutput.deliver(ioTable);
    }

    /**
     * Takes a {@link Table} with only numeric columns, calculates the sum for each row and adds it
     * as a new column.
     *
     * @param table
     *      the original table
     * @return a new table with the original columns and a sum column
     * @throws UserError
     *      if the table contains non-numeric columns
     */
    private Table calculateSum(Table table) throws UserError {
        // check that all columns are numeric
        BeltErrorTools.onlyNumeric(table, getName(), this);

        // If any column is of type real the result will be real. Otherwise, it will be integer.
        boolean resultIsReal = !table.select().ofTypeId(Column.TypeId.REAL).labels().isEmpty();

        // initialize numeric buffer needed to create sum column
        NumericBuffer buffer = resultIsReal ? Buffers.realBuffer(table.height())
                : Buffers.integer53BitBuffer(table.height());

        // read the table row-wise and store the sum of each row in the buffer
        NumericRowReader reader = Readers.numericRowReader(table);
        for (int i = 0; i < buffer.size(); i++) {
            // move must be called to advance the reader to the next row
            reader.move();
            double sum = 0;
            for (int j = 0; j < reader.width(); j++) {
                // reader.get(j) returns the value of the j-th column of the row
                sum += reader.get(j);
            }
            buffer.set(i, sum);
        }

        // copy original table using table builder
        TableBuilder builder = Builders.newTableBuilder(table);
        // add the new column to the builder
        builder.add("Sum", buffer.toColumn());

        // build the new table in parallel using the operator's context
        Table result = builder.build(BeltTools.getContext(this));
        return result;
    }

    /**
     * Analogous to {@link #calculateSum(Table)} but for {@link TableMetaData}.
     *
     * @param metaData
     *      the original TableMetaData
     * @return new TableMetaData with the original columns and a sum column
     */
    private TableMetaData calculateSumMD(TableMetaData metaData) {
        // If any column is of type real the result will be real. Otherwise, it will be integer.
        boolean resultIsReal = metaData.containsType(ColumnType.REAL, true) != MetaDataInfo.NO;
        // copy original TableMetaData using TableMetaDataBuilder
        TableMetaDataBuilder builder = new TableMetaDataBuilder(metaData);
        // add the new column to the builder
        if (resultIsReal) {
            builder.addReal("Sum", null, SetRelation.UNKNOWN, MDInteger.newPossible());
        } else {
            builder.addInteger("Sum", null, SetRelation.UNKNOWN, MDInteger.newPossible());
        }
        // build the new TableMetaData
        return builder.build();
    }
}

In this example you have seen how to fetch and deliver a table from and to ports, how to read a table and process its data, how to create a new column using a buffer, and how to return a modified table using the TableBuilder class.

There are, of course, alternative ways to implement the operator. Consider, for example, the following version of calculateSum, which additionally requires imports for java.util.function.ToDoubleFunction and Belt's NumericRow interface:

private Table calculateSum(Table table) throws UserError {
    // check that all columns are numeric
    BeltErrorTools.onlyNumeric(table, getName(), this);

    // If any column is of type real the result will be real. Otherwise, it will be integer.
    boolean resultIsReal = !table.select().ofTypeId(Column.TypeId.REAL).labels().isEmpty();

    // this function will be applied in parallel to the table rows
    ToDoubleFunction<NumericRow> sumUpRow = row -> {
        double sum = 0;
        for (int j = 0; j < row.width(); j++) {
            sum += row.get(j);
        }
        return sum;
    };
    // the results will be collected in a numeric buffer
    NumericBuffer buffer;
    if (resultIsReal) {
        buffer = table.transform().applyNumericToReal(sumUpRow, BeltTools.getContext(this));
    } else {
        buffer = table.transform().applyNumericToInteger53Bit(sumUpRow, BeltTools.getContext(this));
    }

    // copy original table using table builder
    TableBuilder builder = Builders.newTableBuilder(table);
    // add the new column to the builder
    builder.add("Sum", buffer.toColumn());

    // build the new table in parallel using the operator's context
    Table result = builder.build(BeltTools.getContext(this));
    return result;
}

This code uses the Table's transform method and a row transformer to achieve the same result as the calculateSum method presented earlier. Details on the transform method can be found in the official Belt documentation. Using the transform method has the additional advantage that the summations can take place in parallel. Belt once again uses the operator's context to decide automatically whether and how to parallelize the computation.
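
The reason a per-row function parallelizes well is that each row's result depends only on that row. The following sketch shows the idea with plain Java parallel streams. It does not reflect how Belt schedules work through the operator's context, and ParallelRowSums is a hypothetical name.

```java
import java.util.stream.IntStream;

final class ParallelRowSums {
    /** Sums each row of a column-major data set; columns[j][i] is row i of column j. */
    static double[] rowSums(double[][] columns, int height) {
        double[] sums = new double[height];
        // each index i is written by exactly one task, so no synchronization is needed
        IntStream.range(0, height).parallel().forEach(i -> {
            double sum = 0;
            for (double[] column : columns) {
                sum += column[i];
            }
            sums[i] = sum;
        });
        return sums;
    }
}
```

For the two columns {1, 2, 3} and {10, 20, 30}, the result is 11.0, 22.0 and 33.0, regardless of how the rows are distributed across threads.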

The next example shows how to use generators to fill columns and how to add column meta data like, for example, roles to a table.

ID generator example

Next, let's implement an operator that takes a table and adds an ID column to it. Here is the code of its doWork() method:

@Override
public void doWork() throws OperatorException {
    // fetch table from input port and initialize builder
    IOTable ioTable = tableInput.getData(IOTable.class);
    Table table = ioTable.getTable();
    TableBuilder builder = Builders.newTableBuilder(table);

    // add id column via generator
    builder.addInt53Bit("ID", i -> i);

    // set column role
    builder.addMetaData("ID", ColumnRole.ID);

    // add annotations and deliver results
    Table result = builder.build(BeltTools.getContext(this));
    IOTable newIOTable = new IOTable(result);
    newIOTable.getAnnotations().addAll(ioTable.getAnnotations());
    tableOutput.deliver(newIOTable);

    // deliver original table to corresponding port
    originalOutput.deliver(ioTable);
}

We fetch the input table and initialize the builder with it just as we did before. Then add the id column via:

builder.addInt53Bit("ID", i -> i);

This line of code makes use of one of the table builder's convenience methods that takes a label and a generator and automatically fills the column. Furthermore, it does not fill the column straight away but does so later when the build method is called. Thereby, the builder can fill all columns in parallel.

Let's take a closer look at the generator. For numeric column types it is represented via an IntToDoubleFunction. The generator consumes a row index and returns the value for that row. Our implementation returns the row index itself as the result and, thereby, generates ids from 0 to the number of rows - 1. Similar generator methods for other column types are also available in the table builder.
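
To make the generator contract concrete, here is a minimal sketch of how a column could be filled from an IntToDoubleFunction. GeneratorDemo and its fill method are hypothetical helpers for illustration, not part of the table builder. The real builder defers this work until build is called and may run it in parallel.

```java
import java.util.function.IntToDoubleFunction;

final class GeneratorDemo {
    /** Fills a column of the given height by asking the generator for each row's value. */
    static double[] fill(int height, IntToDoubleFunction generator) {
        double[] column = new double[height];
        for (int row = 0; row < height; row++) {
            column[row] = generator.applyAsDouble(row);
        }
        return column;
    }
}
```

With this helper, GeneratorDemo.fill(4, i -> i) yields the ids 0.0, 1.0, 2.0 and 3.0, while the generator i -> i + 1 would produce ids starting at 1 instead.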

The next step is to set the column's role to ColumnRole.ID. The builder's addMetaData method takes a column label and meta data to attach to the corresponding column. Since ColumnRole implements ColumnMetaData it can be attached via this method.

Finally, the resulting table is wrapped into an IOTable, the annotations are copied, and the table is delivered to the output port.

Meta data transformation

Start by adding a constructor similar to what we have done in the last example:

public IDOperator(OperatorDescription description) {
    super(description);
    // we want TableMetaData as input
    tableInput.addPrecondition(new TablePrecondition(tableInput));
    // pass through the original data
    getTransformer().addRule(new TablePassThroughRule(tableInput, originalOutput, SetRelation.EQUAL));
    // generate meta data for new table
    getTransformer().addRule(new TablePassThroughRule(tableInput, tableOutput, SetRelation.EQUAL) {
        @Override
        public TableMetaData modifyTableMetaData(TableMetaData metaData) {
            return IDOperator.this.transformMetaData(metaData);
        }
    });
}

Once again, the actual meta data transformation is outsourced to a private method for better readability:

private TableMetaData transformMetaData(TableMetaData metaData) {
    // determine range
    MDInteger numRows = metaData.height();
    Range range = null;
    if (numRows.getNumber() > 0) {
        range = new Range(0, numRows.getNumber() - 1);
    }

    // determine set relation for the range
    SetRelation relation;
    switch (numRows.getRelation()) {
        case AT_LEAST:
            relation = SetRelation.SUPERSET;
            break;
        case EQUAL:
            relation = SetRelation.EQUAL;
            break;
        case AT_MOST:
            relation = SetRelation.SUBSET;
            break;
        case UNKNOWN:
        default:
            relation = SetRelation.UNKNOWN;
    }

    // build id column
    ColumnInfoBuilder columnBuilder = new ColumnInfoBuilder(ColumnType.INTEGER_53_BIT);
    columnBuilder.setNumericRange(range, relation);
    columnBuilder.setMissings(0);
    ColumnInfo idColumn = columnBuilder.build();

    // add id column to table
    TableMetaDataBuilder builder = new TableMetaDataBuilder(metaData);
    builder.add("ID", idColumn);

    // set column role id
    builder.addColumnMetaData("ID", ColumnRole.ID);

    return builder.build();
}

Since the operator generates id values between 0 and table height - 1 we can infer the range of the resulting id column. If we are uncertain about the table height, this translates into uncertainty about the range. Therefore, the appropriate set relation for the range is determined with the switch statement.
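
The reasoning behind the switch statement can be checked with a small worked example. If the height is known to be at least 100, the ids 0 to 99 certainly occur and more may follow, so the true range contains [0, 99] (SUPERSET). If the height is at most 100, the true range lies within [0, 99] (SUBSET). The sketch below reproduces this mapping with hypothetical enums, HeightRelation and RangeRelation, instead of RapidMiner's MDInteger and SetRelation classes.

```java
final class IdRangeInference {
    /** Hypothetical stand-in for the relation attached to the table height. */
    enum HeightRelation { AT_LEAST, EQUAL, AT_MOST, UNKNOWN }

    /** Hypothetical stand-in for the set relation attached to the id range. */
    enum RangeRelation { SUPERSET, EQUAL, SUBSET, UNKNOWN }

    /** Translates uncertainty about the height into uncertainty about the id range. */
    static RangeRelation rangeRelation(HeightRelation heightRelation) {
        switch (heightRelation) {
            case AT_LEAST:
                return RangeRelation.SUPERSET; // more rows than estimated means more ids
            case EQUAL:
                return RangeRelation.EQUAL;
            case AT_MOST:
                return RangeRelation.SUBSET; // fewer rows than estimated means fewer ids
            default:
                return RangeRelation.UNKNOWN;
        }
    }

    /** Returns {lower, upper} of the id range for the given height, or null for an empty table. */
    static int[] idRange(int height) {
        return height > 0 ? new int[] {0, height - 1} : null;
    }
}
```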

The TableMetaData's columns are represented via the immutable ColumnInfo class. Build a new column info of type integer with the calculated range and relation using a ColumnInfoBuilder. Also set the number of missing values to exactly 0, since the operator never generates missing values. Finally, add the new column to the table meta data and set its role to ColumnRole.ID using the table meta data builder's addColumnMetaData method. The complete operator code is shown below.

import com.rapidminer.adaption.belt.IOTable;
import com.rapidminer.belt.column.ColumnType;
import com.rapidminer.belt.table.Builders;
import com.rapidminer.belt.table.Table;
import com.rapidminer.belt.table.TableBuilder;
import com.rapidminer.belt.util.ColumnRole;
import com.rapidminer.operator.Operator;
import com.rapidminer.operator.OperatorDescription;
import com.rapidminer.operator.OperatorException;
import com.rapidminer.operator.ports.InputPort;
import com.rapidminer.operator.ports.OutputPort;
import com.rapidminer.operator.ports.metadata.MDInteger;
import com.rapidminer.operator.ports.metadata.SetRelation;
import com.rapidminer.operator.ports.metadata.table.ColumnInfo;
import com.rapidminer.operator.ports.metadata.table.ColumnInfoBuilder;
import com.rapidminer.operator.ports.metadata.table.TableMetaData;
import com.rapidminer.operator.ports.metadata.table.TableMetaDataBuilder;
import com.rapidminer.operator.ports.metadata.table.TablePassThroughRule;
import com.rapidminer.operator.ports.metadata.table.TablePrecondition;
import com.rapidminer.tools.belt.BeltTools;
import com.rapidminer.tools.math.container.Range;


/**
 * This operator takes a {@link Table} and adds an ID column to it.
 */
public class IDOperator extends Operator {

    private final InputPort tableInput = getInputPorts().createPort("example set input");
    private final OutputPort tableOutput = getOutputPorts().createPort("example set output");
    private final OutputPort originalOutput = getOutputPorts().createPort("original");

    public IDOperator(OperatorDescription description) {
        super(description);
        // we want TableMetaData as input
        tableInput.addPrecondition(new TablePrecondition(tableInput));
        // pass through the original data
        getTransformer().addRule(new TablePassThroughRule(tableInput, originalOutput, 
            SetRelation.EQUAL));
        // generate meta data for new table
        getTransformer().addRule(new TablePassThroughRule(tableInput, tableOutput, 
            SetRelation.EQUAL) {
            @Override
            public TableMetaData modifyTableMetaData(TableMetaData metaData) {
                return IDOperator.this.transformMetaData(metaData);
            }
        });
    }

    @Override
    public void doWork() throws OperatorException {
        // fetch table from input port and initialize builder
        IOTable ioTable = tableInput.getData(IOTable.class);
        Table table = ioTable.getTable();
        TableBuilder builder = Builders.newTableBuilder(table);

        // add id column via generator
        builder.addInt53Bit("ID", i -> i);

        // set column role
        builder.addMetaData("ID", ColumnRole.ID);

        // add annotations and deliver results
        Table result = builder.build(BeltTools.getContext(this));
        IOTable newIOTable = new IOTable(result);
        newIOTable.getAnnotations().addAll(ioTable.getAnnotations());
        tableOutput.deliver(newIOTable);

        // deliver original table to corresponding port
        originalOutput.deliver(ioTable);
    }

    private TableMetaData transformMetaData(TableMetaData metaData) {
        // determine range
        MDInteger numRows = metaData.height();
        Range range = null;
        if (numRows.getNumber() > 0) {
            range = new Range(0, numRows.getNumber() - 1);
        }

        // determine set relation for the range
        SetRelation relation;
        switch (numRows.getRelation()) {
            case AT_LEAST:
                relation = SetRelation.SUPERSET;
                break;
            case EQUAL:
                relation = SetRelation.EQUAL;
                break;
            case AT_MOST:
                relation = SetRelation.SUBSET;
                break;
            case UNKNOWN:
            default:
                relation = SetRelation.UNKNOWN;
        }

        // build id column
        ColumnInfoBuilder columnBuilder = new ColumnInfoBuilder(ColumnType.INTEGER_53_BIT);
        columnBuilder.setNumericRange(range, relation);
        columnBuilder.setMissings(0);
        ColumnInfo idColumn = columnBuilder.build();

        // add id column to table
        TableMetaDataBuilder builder = new TableMetaDataBuilder(metaData);
        builder.add("ID", idColumn);

        // set column role id
        builder.addColumnMetaData("ID", ColumnRole.ID);

        return builder.build();
    }
}

ColumnMetaData

ColumnMetaData represents additional information that can be attached to columns. Classes implementing ColumnMetaData by default are:

  • ColumnRole: Representing the roles used in Studio to mark special columns like, for example, labels.
  • ColumnAnnotation: A textual description of the column.
  • ColumnReference: A reference to another column that is somehow related to the column. An example would be a prediction column referencing the label column that it refers to.

Custom meta data can be added to the columns by implementing the ColumnMetaData interface.

Please note that column annotations and references are not visualized in RapidMiner Studio yet, but we plan on doing so in the near future.

Two important changes have been made to column roles. Firstly, roles need not be unique anymore. A table can have multiple label, prediction and even id columns. This comes in handy, e.g., when working with learners that expect multiple labels. Secondly, in Belt the set of column roles is fixed to BATCH, CLUSTER, ID, LABEL, OUTLIER, PREDICTION, SCORE, WEIGHT, INTERPRETATION, ENCODING, SOURCE and METADATA. While the first eleven of them are the default roles, METADATA stands for anything other than the known roles. Columns marked as METADATA will usually be ignored by operators (e.g. when creating models). Legacy roles that do not exist in Belt will be mapped to METADATA.
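
The fallback to METADATA can be pictured as a lookup with a default. The sketch below uses a hypothetical Role enum that lists the twelve roles named above. Matching legacy role names by upper-casing them is an assumption made for illustration, and the actual conversion in RapidMiner may work differently.

```java
import java.util.Locale;

final class RoleMapping {
    /** Hypothetical copy of the fixed role set described above (not Belt's ColumnRole). */
    enum Role {
        BATCH, CLUSTER, ID, LABEL, OUTLIER, PREDICTION, SCORE, WEIGHT,
        INTERPRETATION, ENCODING, SOURCE, METADATA
    }

    /** Maps a legacy role name to a fixed role; unknown names fall back to METADATA. */
    static Role fromLegacyName(String legacyName) {
        try {
            return Role.valueOf(legacyName.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            // the name is not one of the fixed roles
            return Role.METADATA;
        }
    }
}
```

Here fromLegacyName("label") returns LABEL, whereas a custom legacy role such as "my_custom_role" is mapped to METADATA.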

Automatic conversion between Table and ExampleSet / TableMetaData and ExampleSetMetaData

A Table is converted to an ExampleSet and vice versa, depending on the format in which an operator requests the data from a port. (The same holds true for TableMetaData and ExampleSetMetaData.) This conversion is very efficient, so in most cases it will not impact the overall performance of a process.

Please note:

  • Since ExampleSet expects roles to be unique, non-unique roles will have an index appended to their name when converting from Table to ExampleSet. When such a role is converted back at a later point in the process, the unnecessary index will automatically be removed.
  • Attribute / column types will be mapped to the next best representation in the converted format. Some of the Belt column types do not have a representation in the old API. Therefore, attempting to deliver an IOTable holding column types not included in BeltConverter.STANDARD_TYPES will lead to an exception. This restriction may be removed in one of the future releases.
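
The round trip for non-unique roles described in the first point can be sketched as follows. The suffix format used here ("_2", "_3", ...) is purely an assumption for illustration, since the actual naming scheme of the conversion is not specified on this page. UniqueRoleNames is a hypothetical helper.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class UniqueRoleNames {
    /** Appends an index to every repeated role name so that all names become unique. */
    static List<String> makeUnique(List<String> roles) {
        Map<String, Integer> seen = new HashMap<>();
        List<String> result = new ArrayList<>();
        for (String role : roles) {
            // count how often this role has occurred so far, including this occurrence
            int count = seen.merge(role, 1, Integer::sum);
            result.add(count == 1 ? role : role + "_" + count);
        }
        return result;
    }

    /** Removes a trailing index again, as done when converting back to a table. */
    static String stripIndex(String role) {
        return role.replaceFirst("_\\d+$", "");
    }
}
```

Under this assumed format, the roles label, label, id, label become label, label_2, id, label_3, and stripIndex turns label_2 back into label.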

MetaData class for IOTables

Since RapidMiner version 9.9, there is an IOTable-specific meta data class called TableMetaData that should be used for the meta data transformation. The TableMetaData class is conceptually very similar to the Table class and is therefore easy to use once you have understood the Table class.

For RapidMiner version 9.8, ExampleSetMetaData is the legacy MetaData class used to describe IOTables at the operator ports.