What's new in RapidMiner Radoop 9.5

This page describes the new features of RapidMiner Radoop 9.5.

Radoop Proxy connection to Hadoop 3 based clusters

We have enhanced Radoop Proxy to work seamlessly with clusters based on Hadoop 3 (such as Cloudera CDH 6.x or HDP 3.x).

This means that if your organization runs a Hadoop cluster with a Hadoop 3 based distribution, network administrators will only need to open a few ports on the company firewall to enable data scientists to use RapidMiner Radoop with such a firewalled cluster.

Revamped general and connection-level settings

To make Radoop more user-friendly, we moved most of the settings from the RapidMiner Studio Preferences to Radoop connections. When you work with multiple Hadoop clusters, you can now configure each connection separately, so the settings for one cluster do not interfere with another.

Revamped Radoop general settings

For example, on a production cluster with much more data, you might want a different timeout value for your Hive commands than on the dev/test cluster. In Radoop 9.5 this is easy, because the Hive command timeout setting has moved from Studio Preferences to a connection-level setting.

New location for Hive command timeout in the above example

Don't worry, all existing connections and settings will be preserved during an update to this version of Radoop.

Median and mode in Aggregate (Radoop) operator

To make it even easier to work with big data, we are continuously working on closing the gap between operators built into RapidMiner Studio and the ones optimized for Hadoop.

This time, we added median and mode as two new aggregation functions. Behind the scenes, these aggregations use optimized Hive queries to produce aggregates on large datasets quickly.
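As a reminder of what these two aggregations compute per group, here is a minimal in-memory sketch in Python. It only illustrates the semantics of median and mode on grouped data. It does not show the Hive queries that Radoop generates, and the row data and the `aggregate` helper are hypothetical.

```python
from collections import defaultdict
from statistics import median, mode

# Hypothetical rows of (group key, value); names and data are illustrative only.
rows = [
    ("east", 10), ("east", 30), ("east", 30),
    ("west", 5), ("west", 7), ("west", 9), ("west", 9),
]

def aggregate(rows):
    """Group values by key, then compute the median and mode of each group."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return {
        key: {"median": median(values), "mode": mode(values)}
        for key, values in groups.items()
    }

result = aggregate(rows)
for key, stats in result.items():
    print(key, stats)
```

In Radoop itself the same kind of grouping is configured in the Aggregate (Radoop) operator and executed on the cluster rather than in memory.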

Median and mode

OpenJDK support

To support you and your company in adopting OpenJDK, RapidMiner Radoop now supports OpenJDK Java 8.

Overrides for advanced connection settings

Have you ever needed to tweak settings or advanced parameters for only one part of a process built from Radoop operators? Until now, the only way to do that was to duplicate the Radoop connection, adjust the required setting, and redesign your process with a separate Radoop Nest that used the duplicated connection.

With this new feature, you can define overrides for many of your connection settings and advanced parameters in the Radoop Nest, Subprocess (Radoop), Single Process Pushdown (Radoop) and SparkRM (Radoop) operators. An override only takes effect inside the operator that defines it. These operators can also be nested, each with its own overrides.
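One way to picture scoped overrides is as layered lookups, sketched below in Python with `collections.ChainMap`. This is a conceptual model only, not Radoop's implementation. The setting names and values are hypothetical, and the sketch assumes that an inner operator's override takes precedence over an outer one.

```python
from collections import ChainMap

# Hypothetical setting names and values; these are not Radoop's real parameter keys.
connection = {"hive_command_timeout": 30, "spark_executor_memory": "2g"}

# A Radoop Nest that overrides one setting for everything inside it.
nest_scope = ChainMap({"spark_executor_memory": "4g"}, connection)

# A Subprocess nested inside the Nest, overriding a different setting.
# Assumption: the innermost override wins when the same key is set twice.
subprocess_scope = nest_scope.new_child({"hive_command_timeout": 600})

print(subprocess_scope["hive_command_timeout"])   # looked up in the Subprocess layer
print(subprocess_scope["spark_executor_memory"])  # inherited from the Nest layer
print(nest_scope["hive_command_timeout"])         # falls through to the connection
print(connection)                                 # the connection itself is unchanged
```

The point of the model is that the connection definition stays untouched while each operator sees its own effective settings.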

For example, suppose you have a process containing a Hive operator that runs for a long time and would time out with your default settings.

Hive operator causing timeout

You can now incorporate that Hive operator into a Subprocess and override the timeout value used for that operation.

Hive operator with subprocess level override

You can also export the connection including the overrides as a new Radoop connection, e.g. for testing purposes and easier sharing.

We hope this new feature makes the Radoop connection concept much clearer. It should reduce clutter in your connections list, and it moves the tweaking of job execution from the Radoop connection to the RapidMiner process itself.

Due to a bug that we fixed in version 9.5.2 of Radoop, overrides need to be recreated after upgrading to version 9.5.2. If you already have overrides defined, please make sure to recreate them.

Enhancements and bug fixes