You are viewing the RapidMiner Developers documentation for version 9.8 - Check here for latest version

API changes in RapidMiner 9.8

From ExampleSet to Belt Table

Forget about the ExampleSet class and start using com.rapidminer.belt.table.Table, RapidMiner's new representation of example sets. The corresponding framework is called Belt. It comes with several advantages compared to ExampleSet:

  • Column-oriented design: a column-oriented data layout allows for using compact representations for the different column types.
  • Immutability: all columns and tables are immutable. This not only guarantees data integrity but also allows for safely reusing components, e.g., multiple tables can safely reference the same column.
  • Thread-safety: all public data structures are thread-safe and designed to perform well when used concurrently.
  • Implicit parallelism: much of Belt's built-in functionality, such as the transformations shown in the examples below, automatically scales out to multiple cores.
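
To make the immutability point more concrete, here is a minimal plain-Java sketch of two immutable tables sharing one column. ToyColumn and ToyTable are hypothetical stand-ins invented for this illustration and are not Belt's classes; the sketch only shows why sharing is safe when nothing can be modified after construction.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of immutable, column-oriented tables; NOT Belt's actual classes. */
final class SharedColumnSketch {

    /** An immutable column: the values are copied once and never exposed for writing. */
    static final class ToyColumn {
        private final double[] data;

        ToyColumn(double[] values) {
            this.data = values.clone();
        }

        double get(int row) {
            return data[row];
        }
    }

    /** An immutable table that maps column labels to columns. */
    static final class ToyTable {
        private final Map<String, ToyColumn> columns;

        ToyTable(Map<String, ToyColumn> columns) {
            this.columns = new LinkedHashMap<>(columns);
        }

        ToyColumn column(String label) {
            return columns.get(label);
        }

        int width() {
            return columns.size();
        }

        /** Returns a new table; existing columns are referenced, not copied. */
        ToyTable with(String label, ToyColumn column) {
            Map<String, ToyColumn> extended = new LinkedHashMap<>(columns);
            extended.put(label, column);
            return new ToyTable(extended);
        }
    }

    public static void main(String[] args) {
        ToyTable base = new ToyTable(Map.of("a", new ToyColumn(new double[] {1, 2, 3})));
        ToyTable extended = base.with("b", new ToyColumn(new double[] {4, 5, 6}));
        // The original table is untouched and both tables reference the same column object.
        System.out.println(base.width());
        System.out.println(extended.width());
        System.out.println(base.column("a") == extended.column("a"));
    }
}
```

Because neither the column nor the table offers a way to change its content, handing the same column to several tables cannot lead to surprises, even when the tables are used from different threads.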

To learn everything about the Belt framework please refer to the official documentation of the Belt project.

This page focuses on the differences between the old example set and the new Belt framework and presents some examples of how to implement operators using the Belt framework and the Table class.

If you are new to extension development for RapidMiner Studio, then Create your own extension is a great starting point for you.

Sum operator example

Let's start with an example. We will create an operator that takes a table with only numeric columns, calculates the sum for each row and adds these row sums as a new column to the resulting table.

Let's begin with the doWork() method. You receive the input table by calling:

IOTable ioTable = tableInput.getData(IOTable.class);
Table table = ioTable.getTable();

You need not worry if the actual data at the port is an IOTable or an ExampleSet since RapidMiner will automatically convert it to the requested format. This makes the collaboration between new operators working on Tables and old operators working on ExampleSets easy.

Then, to keep the code a little cleaner, we will move the actual work into the calculateSum method.

// read table, calculate sum and return new table
Table result = calculateSum(table);

Now deliver the resulting table to the output port.

IOTable newIOTable = new IOTable(result);
newIOTable.getAnnotations().addAll(ioTable.getAnnotations());
tableOutput.deliver(newIOTable);

Since the Table class itself is not an IOObject, we need to wrap it with the IOTable class. It is also important to copy the annotations of the input IOTable to the new IOTable because they will be lost otherwise.

Finally, it is good practice to also deliver the input table to an output port:

originalOutput.deliver(ioTable);

That's the doWork() method. Let's move on to implementing the calculateSum(Table table) method. First of all, check that the given Table contains only numeric columns. The BeltErrorTools class holds some convenience methods for these kinds of checks.

BeltErrorTools.onlyNumeric(table, getName(), this);

Next, we will determine whether the result will be of type real or integer. If any column is of type real, the result will also be of type real. The table provides a ColumnSelector that can be accessed via the select() method. A column selector can be used to filter the columns of a table via predicates. The default predicates filter regarding type, category, capability and meta data (e.g. roles). You can even define your own predicates for custom filter operations. The ofTypeId method does the trick:

boolean resultIsReal = !table.select().ofTypeId(Column.TypeId.REAL).labels().isEmpty();

Since the Column class is immutable, we need a column buffer to fill and instantiate a new column:

NumericBuffer buffer = resultIsReal ? Buffers.realBuffer(table.height())
                : Buffers.integer53BitBuffer(table.height());

Tables can be read column-wise or row-wise. In this case we want to read it row-wise so that we can calculate the sum for each row:

NumericRowReader reader = Readers.numericRowReader(table);
for (int i = 0; i < buffer.size(); i++) {
    // move must be called to advance the reader to the next row
    reader.move();
    double sum = 0;
    for (int j = 0; j < reader.width(); j++) {
        // reader.get(j) returns the value of the j-th column of the row
        sum += reader.get(j);
    }
    buffer.set(i, sum);
}

The move method advances the reader to the next row. Please note that it must be called before the first row is read.
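
The move-then-read pattern can be illustrated with a small plain-Java toy reader. ToyRowReader is a hypothetical class written only for this sketch and is not Belt's NumericRowReader; in particular, throwing an exception when reading before the first move is a choice made here to make the mistake visible.

```java
import java.util.Arrays;

/** Toy row reader over column arrays; it only mimics the move-then-read pattern. */
final class RowReaderSketch {

    static final class ToyRowReader {
        private final double[][] columns; // columns[j][i] = value of column j in row i
        private final int height;
        private int row = -1;             // the reader starts before the first row

        ToyRowReader(double[][] columns) {
            this.columns = columns;
            this.height = columns.length == 0 ? 0 : columns[0].length;
        }

        boolean hasRemaining() {
            return row < height - 1;
        }

        /** Advances the reader to the next row. */
        void move() {
            row++;
        }

        int width() {
            return columns.length;
        }

        double get(int column) {
            if (row < 0) {
                // Toy behavior (an assumption of this sketch): fail loudly on a missing move().
                throw new IllegalStateException("call move() before reading the first row");
            }
            return columns[column][row];
        }
    }

    /** Sums up every row of a table that is stored column by column. */
    static double[] rowSums(double[][] columns) {
        ToyRowReader reader = new ToyRowReader(columns);
        double[] sums = new double[columns.length == 0 ? 0 : columns[0].length];
        int i = 0;
        while (reader.hasRemaining()) {
            reader.move();
            double sum = 0;
            for (int j = 0; j < reader.width(); j++) {
                sum += reader.get(j);
            }
            sums[i++] = sum;
        }
        return sums;
    }

    public static void main(String[] args) {
        double[][] columns = {{1, 2, 3}, {10, 20, 30}};
        System.out.println(Arrays.toString(rowSums(columns))); // [11.0, 22.0, 33.0]
    }
}
```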

We have calculated the row sums and filled them into the buffer. Next, copy the original table and add a new column to it. Since the Table class is immutable we will use a table builder:

TableBuilder builder = Builders.newTableBuilder(table);
builder.add("Sum", buffer.toColumn());

Please note that the data stored in the buffer can no longer be modified after calling the toColumn method. Attempting to do so will lead to an exception.
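
The following plain-Java sketch shows one way such a freeze-on-conversion buffer could work. ToyBuffer and ToyColumn are hypothetical classes, and the use of IllegalStateException is an assumption of this sketch; the text above does not say which exception Belt throws. The design idea illustrated here is that the column can take over the buffer's array without copying, which is only safe if the buffer refuses further writes.

```java
/** Toy buffer that freezes once it has been turned into a column; NOT Belt's buffer classes. */
final class FrozenBufferSketch {

    static final class ToyColumn {
        private final double[] data;

        // Takes over the array without copying it.
        private ToyColumn(double[] data) {
            this.data = data;
        }

        double get(int row) {
            return data[row];
        }
    }

    static final class ToyBuffer {
        private final double[] data;
        private boolean frozen;

        ToyBuffer(int size) {
            this.data = new double[size];
        }

        void set(int index, double value) {
            if (frozen) {
                // Assumption of this sketch: the concrete exception type is not taken from Belt.
                throw new IllegalStateException("buffer is frozen after toColumn()");
            }
            data[index] = value;
        }

        /** Freezes the buffer and hands its array over to an immutable column. */
        ToyColumn toColumn() {
            frozen = true;
            return new ToyColumn(data);
        }
    }

    public static void main(String[] args) {
        ToyBuffer buffer = new ToyBuffer(2);
        buffer.set(0, 1.5);
        ToyColumn column = buffer.toColumn();
        System.out.println(column.get(0));
        try {
            buffer.set(1, 2.0);
        } catch (IllegalStateException e) {
            System.out.println("write rejected: " + e.getMessage());
        }
    }
}
```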

Nearly done! All that's left to do is to build and to return the table. And this is where Belt's implicit parallelism comes into play. The build method takes the operator's context that can be accessed via the BeltTools class and runs the build process in parallel.

Table result = builder.build(BeltTools.getContext(this));
return result;

This concludes the example. Since ExampleSetMetaData is, for now, used as the meta data class for Belt tables, we will not go through the meta data transformation in detail. Here is the complete operator:

import com.rapidminer.adaption.belt.IOTable;
import com.rapidminer.belt.buffer.Buffers;
import com.rapidminer.belt.buffer.NumericBuffer;
import com.rapidminer.belt.column.Column;
import com.rapidminer.belt.reader.NumericRowReader;
import com.rapidminer.belt.reader.Readers;
import com.rapidminer.belt.table.Builders;
import com.rapidminer.belt.table.Table;
import com.rapidminer.belt.table.TableBuilder;
import com.rapidminer.operator.Operator;
import com.rapidminer.operator.OperatorDescription;
import com.rapidminer.operator.OperatorException;
import com.rapidminer.operator.UserError;
import com.rapidminer.operator.ports.InputPort;
import com.rapidminer.operator.ports.OutputPort;
import com.rapidminer.operator.ports.metadata.AttributeMetaData;
import com.rapidminer.operator.ports.metadata.ExampleSetMetaData;
import com.rapidminer.operator.ports.metadata.MetaData;
import com.rapidminer.operator.ports.metadata.MetaDataInfo;
import com.rapidminer.operator.ports.metadata.PassThroughRule;
import com.rapidminer.operator.ports.metadata.SimplePrecondition;
import com.rapidminer.tools.Ontology;
import com.rapidminer.tools.belt.BeltErrorTools;
import com.rapidminer.tools.belt.BeltTools;


/**
 * This operator takes a {@link Table} with only numeric columns, calculates the sum for each row
 *  and adds it as a new column.
 *
 * @author Kevin Majchrzak
 * @since 9.8
 */
public class SumOperator extends Operator {

    private final InputPort tableInput = getInputPorts().createPort("example set input");
    private final OutputPort tableOutput = getOutputPorts().createPort("example set output");
    private final OutputPort originalOutput = getOutputPorts().createPort("original");

    public SumOperator(OperatorDescription description) {
        super(description);
        // we want example set meta data as input
        tableInput.addPrecondition(new SimplePrecondition(tableInput, new ExampleSetMetaData()));
        // pass through the original data
        getTransformer().addPassThroughRule(tableInput, originalOutput);
        // generate meta data for new table
        getTransformer().addRule(new PassThroughRule(tableInput, tableOutput, true) {
            @Override
            public MetaData modifyMetaData(MetaData metaData) {
                if (metaData instanceof ExampleSetMetaData) {
                    ExampleSetMetaData emd = (ExampleSetMetaData) metaData;
                    boolean resultIsReal = emd.containsAttributesWithValueType(Ontology.REAL, true)
                            != MetaDataInfo.NO;
                    AttributeMetaData sumAttribute = resultIsReal ? new AttributeMetaData("Sum", Ontology.REAL)
                            : new AttributeMetaData("Sum", Ontology.INTEGER);
                    emd.addAttribute(sumAttribute);
                }
                return metaData;
            }
        });
    }

    @Override
    public void doWork() throws OperatorException {
        // fetch table from input port
        IOTable ioTable = tableInput.getData(IOTable.class);
        Table table = ioTable.getTable();

        // read table, calculate sum and return new table
        Table result = calculateSum(table);

        // wrap the result into an IOTable
        IOTable newIOTable = new IOTable(result);

        // copy the annotations from the original IOTable
        newIOTable.getAnnotations().addAll(ioTable.getAnnotations());

        // deliver the new IOTable to the port
        tableOutput.deliver(newIOTable);

        // deliver original table to corresponding port
        originalOutput.deliver(ioTable);
    }

    /**
     * Takes a {@link Table} with only numeric columns, calculates the sum for each row and adds it as a new column.
     *
     * @param table
     *      the original table
     * @return a new table with the original columns and a sum column
     * @throws UserError
     *      if the table contains non-numeric columns
     */
    private Table calculateSum(Table table) throws UserError {
        // check that all columns are numeric
        BeltErrorTools.onlyNumeric(table, getName(), this);

        // If any column is of type real the result will be real. Otherwise, it will be integer.
        boolean resultIsReal = !table.select().ofTypeId(Column.TypeId.REAL).labels().isEmpty();

        // initialize numeric buffer needed to create sum column
        NumericBuffer buffer = resultIsReal ? Buffers.realBuffer(table.height())
                : Buffers.integer53BitBuffer(table.height());

        // read the table row-wise and store the sum of each row in the buffer
        NumericRowReader reader = Readers.numericRowReader(table);
        for (int i = 0; i < buffer.size(); i++) {
            // move must be called to advance the reader to the next row
            reader.move();
            double sum = 0;
            for (int j = 0; j < reader.width(); j++) {
                // reader.get(j) returns the value of the j-th column of the row
                sum += reader.get(j);
            }
            buffer.set(i, sum);
        }

        // copy original table using table builder
        TableBuilder builder = Builders.newTableBuilder(table);
        // add the new column to the builder
        builder.add("Sum", buffer.toColumn());

        // build the new table in parallel using the operator's context
        Table result = builder.build(BeltTools.getContext(this));
        return result;
    }
}

In this example you have seen how to fetch a table from an input port and deliver one to an output port, how to read a table and process its data, how to create a new column using a buffer, and how to return a modified table using the TableBuilder class.

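There are alternative ways to implement the operator, of course. Look, for example, at the following code (in addition to the imports listed above, it needs imports for java.util.function.ToDoubleFunction and Belt's NumericRow):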

private Table calculateSum(Table table) throws UserError {
    // check that all columns are numeric
    BeltErrorTools.onlyNumeric(table, getName(), this);

    // If any column is of type real the result will be real. Otherwise, it will be integer.
    boolean resultIsReal = !table.select().ofTypeId(Column.TypeId.REAL).labels().isEmpty();

    // this function will be applied in parallel to the table rows
    ToDoubleFunction<NumericRow> sumUpRow = row -> {
        double sum = 0;
        for (int j = 0; j < row.width(); j++) {
            sum += row.get(j);
        }
        return sum;
    };
    // the results will be collected in a numeric buffer
    NumericBuffer buffer;
    if (resultIsReal) {
        buffer = table.transform().applyNumericToReal(sumUpRow, BeltTools.getContext(this));
    } else {
        buffer = table.transform().applyNumericToInteger53Bit(sumUpRow, BeltTools.getContext(this));
    }

    // copy original table using table builder
    TableBuilder builder = Builders.newTableBuilder(table);
    // add the new column to the builder
    builder.add("Sum", buffer.toColumn());

    // build the new table in parallel using the operator's context
    Table result = builder.build(BeltTools.getContext(this));
    return result;
}

This code uses the table's transform method and a row transformer to achieve the same result as the calculateSum method presented earlier. Details on the transform method can be found in the Belt documentation. Using the transform method has the additional advantage that the summation can take place in parallel: Belt once again uses the operator's context to decide automatically whether and how to parallelize the computation.
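
The idea of applying a row function in parallel and collecting the results in order can be sketched in plain Java. This is not how Belt implements its transformations: the sketch uses a parallel stream on the JVM's common pool instead of the operator's context, and applyToRows is a hypothetical helper named only for this illustration.

```java
import java.util.Arrays;
import java.util.function.ToDoubleFunction;
import java.util.stream.IntStream;

/** Applies a row function to every row, potentially in parallel; a sketch, not Belt code. */
final class ParallelRowSumSketch {

    /** Returns one result per row, in row order, even if rows are processed concurrently. */
    static double[] applyToRows(double[][] rows, ToDoubleFunction<double[]> function) {
        return IntStream.range(0, rows.length)
                .parallel()
                .mapToDouble(i -> function.applyAsDouble(rows[i]))
                .toArray();
    }

    /** The row function: it must not depend on other rows or on shared mutable state. */
    static double sumUpRow(double[] row) {
        double sum = 0;
        for (double value : row) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        double[][] rows = {{1, 2}, {3, 4}, {5, 6}};
        System.out.println(Arrays.toString(applyToRows(rows, ParallelRowSumSketch::sumUpRow)));
        // [3.0, 7.0, 11.0]
    }
}
```

The row function is free of side effects, which is what makes it safe to evaluate in any order and on any thread.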

The next example shows how to use generators to fill columns and how to add meta data like, for example, roles to a table.

ID generator example

Next, let's implement an operator that takes a table and adds an ID column to it. Here is the code of its doWork() method:

@Override
public void doWork() throws OperatorException {
    // fetch table from input port and initialize builder
    IOTable ioTable = tableInput.getData(IOTable.class);
    Table table = ioTable.getTable();
    TableBuilder builder = Builders.newTableBuilder(table);

    // add id column via generator
    builder.addInt53Bit("ID", i -> i);

    // set column role
    builder.addMetaData("ID", ColumnRole.ID);

    // add annotations and deliver results
    Table result = builder.build(BeltTools.getContext(this));
    IOTable newIOTable = new IOTable(result);
    newIOTable.getAnnotations().addAll(ioTable.getAnnotations());
    tableOutput.deliver(newIOTable);

    // deliver original table to corresponding port
    originalOutput.deliver(ioTable);
}

We fetch the input table and initialize the builder with it, just as we did before. Then we add the ID column via:

builder.addInt53Bit("ID", i -> i);

This line of code makes use of one of the table builder's convenience methods that takes a label and a generator and automatically fills the column. Furthermore, it does not fill the column straight away but does so later when the build method is called. Thereby, the builder can fill all columns in parallel.

Let's take a closer look at the generator. For numeric column types it is represented via an IntToDoubleFunction. The generator consumes a row index and returns the value for that row. Our implementation returns the row index itself as the result and, thereby, generates ids from 0 to the number of rows - 1. Similar generator methods for other column types are also available in the table builder.
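
To illustrate how a generator and deferred filling fit together, here is a small plain-Java sketch. ToyBuilder is a hypothetical class and not Belt's TableBuilder: it only records the generators when add is called and evaluates them in build. For simplicity the sketch fills the columns one after another, whereas the text above describes the real builder filling them in parallel.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntToDoubleFunction;

/** Toy builder that stores generators and fills the columns only when build() is called. */
final class GeneratorSketch {

    static final class ToyBuilder {
        private final int height;
        private final Map<String, IntToDoubleFunction> generators = new LinkedHashMap<>();

        ToyBuilder(int height) {
            this.height = height;
        }

        /** Registers a generator; nothing is computed yet. */
        ToyBuilder add(String label, IntToDoubleFunction generator) {
            generators.put(label, generator);
            return this;
        }

        /** Evaluates every generator once per row index and returns the filled columns. */
        Map<String, double[]> build() {
            Map<String, double[]> columns = new LinkedHashMap<>();
            generators.forEach((label, generator) -> {
                double[] column = new double[height];
                for (int i = 0; i < height; i++) {
                    column[i] = generator.applyAsDouble(i);
                }
                columns.put(label, column);
            });
            return columns;
        }
    }

    public static void main(String[] args) {
        Map<String, double[]> columns = new ToyBuilder(4)
                .add("ID", i -> i)           // ids from 0 to height - 1
                .add("ID from 1", i -> i + 1) // a variant that starts counting at 1
                .build();
        System.out.println(Arrays.toString(columns.get("ID")));        // [0.0, 1.0, 2.0, 3.0]
        System.out.println(Arrays.toString(columns.get("ID from 1"))); // [1.0, 2.0, 3.0, 4.0]
    }
}
```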

The next step is to set the column's role to ColumnRole.ID. The builder's addMetaData method takes a column label and meta data to attach to the corresponding column. Since ColumnRole implements ColumnMetaData it can be attached via this method.

Finally, the resulting table is wrapped into an IOTable, the annotations are copied and the table is delivered to the output port.

ColumnMetaData

ColumnMetaData represents additional information that can be attached to columns. The classes that implement ColumnMetaData by default are:

  • ColumnRole: Representing the roles used in Studio to mark special columns like, for example, labels.
  • ColumnAnnotation: A textual description of the column.
  • ColumnReference: A reference to another column that is somehow related to the column. An example would be a prediction column referencing the label column that it refers to.

Custom meta data can be added to the columns by implementing the ColumnMetaData interface.

Please note that column annotations and references are not visualized in RapidMiner Studio yet, but we plan on doing so in the near future.

Two important changes have been made to column roles. Firstly, roles need not be unique anymore. A table can have multiple label, prediction and even id columns. This comes in handy, e.g., when working with learners that expect multiple labels. Secondly, in Belt the set of column roles is fixed to BATCH, CLUSTER, ID, LABEL, OUTLIER, PREDICTION, SCORE, WEIGHT, INTERPRETATION, ENCODING, SOURCE and METADATA. While the first eleven of them are the default roles, METADATA stands for anything other than the known roles. Columns marked as METADATA will usually be ignored by operators (e.g. when creating models). Legacy roles that do not exist in Belt will be mapped to METADATA.
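
The fallback rule for legacy roles can be sketched in plain Java as follows. ToyRole mirrors the role names listed above but is not Belt's ColumnRole, and fromLegacyName is a hypothetical helper; the real conversion may apply additional rules that this sketch does not model.

```java
import java.util.Locale;

/** Sketch of mapping legacy role names onto a fixed set of roles; NOT Belt's converter. */
final class RoleMappingSketch {

    /** The fixed set of roles as listed in the text. */
    enum ToyRole {
        BATCH, CLUSTER, ID, LABEL, OUTLIER, PREDICTION, SCORE, WEIGHT,
        INTERPRETATION, ENCODING, SOURCE, METADATA
    }

    /** Returns the role with the same name, or METADATA if the name is not a known role. */
    static ToyRole fromLegacyName(String legacyRole) {
        try {
            return ToyRole.valueOf(legacyRole.toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            // Anything outside the fixed set falls back to METADATA.
            return ToyRole.METADATA;
        }
    }

    public static void main(String[] args) {
        System.out.println(fromLegacyName("label"));          // LABEL
        System.out.println(fromLegacyName("my_custom_role")); // METADATA
    }
}
```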

Automatic conversion between Table and ExampleSet

A Table will be converted to an ExampleSet and vice versa, depending on the format in which the operator requests the data from a port. This conversion is very efficient, so in most cases it will not impact the overall performance of a process.

Please note:

  • Since ExampleSet expects roles to be unique, non-unique roles will have an index appended to their name when converting from Table to ExampleSet. When such a role is converted back at a later point in the process, the unnecessary index will automatically be removed.
  • Attribute / column types will be mapped to the next best representation in the converted format. Some of the Belt column types do not have a representation in the old API. Therefore, attempting to deliver an IOTable holding column types not included in BeltConverter.STANDARD_TYPES will lead to an exception. This restriction may be removed in one of the future releases.
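
As a rough illustration of the first point, the following plain-Java sketch makes a list of role names unique by appending an index to repeated names. The helper makeUnique and the suffix format (an underscore followed by a counter starting at 2) are assumptions made for this sketch; the exact naming scheme used by the actual conversion is not specified here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of making repeated role names unique; the suffix format is an assumption. */
final class UniqueRoleSketch {

    /** Keeps the first occurrence of a role as is and appends an index to later ones. */
    static List<String> makeUnique(List<String> roles) {
        Map<String, Integer> seen = new HashMap<>();
        List<String> result = new ArrayList<>();
        for (String role : roles) {
            int count = seen.merge(role, 1, Integer::sum);
            // Hypothetical naming scheme: "label", "label_2", "label_3", ...
            result.add(count == 1 ? role : role + "_" + count);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(makeUnique(List.of("label", "id", "label", "label")));
        // [label, id, label_2, label_3]
    }
}
```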

MetaData class for IOTables

Up to this point, ExampleSetMetaData is the MetaData class used to describe IOTables at the operator ports. This works reasonably well because ExampleSet and Table both represent data tables and are conceptually similar. Nevertheless, in the near future we will release an IOTable-specific meta data class that can better represent the new Belt tables.