What’s New in RapidMiner Radoop 7.4?

This page describes the new features of RapidMiner Radoop 7.4 as well as its enhancements and bug fixes.

Update / migration

The update is available through the RapidMiner Marketplace.

Introducing SparkRM

RapidMiner Radoop 7.4 introduces SparkRM (available with the “Enterprise” license). With SparkRM, any operator or process available in RapidMiner Studio can be run in parallel in a Hadoop environment, with Spark as the execution framework.

The user-defined Subprocess (i.e. visually defined code) in the new SparkRM meta-operator can contain any in-memory RapidMiner operator, including those from extensions. The operator encapsulates that subprocess and pushes it to Hadoop, where it is automatically executed inside Spark, potentially on multiple Hadoop nodes. Beforehand, the input data provided to the SparkRM operator is partitioned (by the values of an attribute, linearly, or randomly) and distributed to the Hadoop nodes. The RapidMiner subprocess is then run on each of those partitions, potentially on many Hadoop nodes. After execution, the results are merged if they form a coherent data set, or returned as a collection otherwise.
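The partition, run, and merge flow described above can be pictured with a small, purely conceptual Python sketch. It does not use any Radoop or Spark API: `partition_rows`, `run_pushdown`, and their parameters are hypothetical names chosen for illustration, "linear" partitioning is assumed to mean contiguous blocks of rows, and the loop over partitions stands in for what Spark would execute in parallel on the cluster.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]


def partition_rows(rows: List[Row], num_partitions: int, mode: str = "linear",
                   attribute: Optional[str] = None, seed: int = 0) -> List[List[Row]]:
    """Split rows into partitions (hypothetical helper, not a Radoop API)."""
    if mode == "attribute":
        # One partition per distinct value of the chosen attribute.
        groups = defaultdict(list)
        for row in rows:
            groups[row[attribute]].append(row)
        return list(groups.values())
    rows = list(rows)
    if mode == "random":
        random.Random(seed).shuffle(rows)
    # Assumption: "linear" means contiguous blocks of roughly equal size.
    size = max(1, -(-len(rows) // num_partitions))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def run_pushdown(rows: List[Row], subprocess_fn: Callable[[List[Row]], object],
                 **partition_args) -> object:
    """Apply subprocess_fn to every partition, then merge or collect the results."""
    partitions = partition_rows(rows, **partition_args)
    # On a cluster these calls would run in parallel; here they run in sequence.
    results = [subprocess_fn(part) for part in partitions]
    if all(isinstance(result, list) for result in results):
        # Row-shaped results are merged back into a single data set.
        return [row for result in results for row in result]
    # Anything else is returned as a collection, one entry per partition.
    return results


if __name__ == "__main__":
    data = [{"region": "a", "x": 1}, {"region": "b", "x": 2}, {"region": "a", "x": 3}]
    scaled = run_pushdown(data, lambda part: [dict(r, x=r["x"] * 10) for r in part],
                          num_partitions=2, mode="linear")
    print(scaled)
```

A subprocess that returns rows yields one merged data set, while a subprocess that returns, say, one model or one count per partition yields a collection with one entry per partition.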

SparkRM opens up a variety of new use cases that Radoop can now solve natively on Hadoop, especially those that need an extension, such as text analytics, process mining, time series analytics, or forecasting. For a more detailed guide, check the SparkRM: Process Pushdown section in the documentation.

Support for Hadoop user impersonation (“proxy” user)

RapidMiner Radoop 7.4 now also supports Hadoop user impersonation, significantly simplifying Radoop connection setup and management when connecting to a Hadoop cluster using RapidMiner Server. A Radoop connection on RapidMiner Server can be defined using the credentials (password or keytab) of a Hadoop “proxy” super-user. When a RapidMiner Studio user logs in to RapidMiner Server, she is authenticated using her RapidMiner credentials. Once logged in, whenever she runs a Radoop job, the super-user then impersonates the RapidMiner user and the job will have the rights and privileges granted to that same user in Hadoop.

This approach reduces administrative work as a single Radoop connection in RapidMiner Server can be used by multiple users. It is especially useful in multi-user installations. For details on the configuration, see the guide Using Hadoop user impersonation in the Radoop connection.
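On the cluster side, Hadoop itself usually has to be told which super-user may impersonate other users, and from where. The sketch below renders the standard Hadoop `hadoop.proxyuser.<superuser>.hosts` and `hadoop.proxyuser.<superuser>.groups` properties for `core-site.xml`. It is an illustration only: the super-user name, host, and group values are made up, the helper functions are hypothetical, and the configuration guide mentioned above remains the authoritative source for what a Radoop setup actually requires.

```python
from typing import Dict, Iterable
from xml.sax.saxutils import escape


def proxyuser_properties(superuser: str, hosts: Iterable[str],
                         groups: Iterable[str]) -> Dict[str, str]:
    """Build the Hadoop proxy-user properties for one super-user (hypothetical helper)."""
    return {
        # Hosts from which the super-user is allowed to impersonate.
        f"hadoop.proxyuser.{superuser}.hosts": ",".join(hosts),
        # Groups whose members the super-user is allowed to impersonate.
        f"hadoop.proxyuser.{superuser}.groups": ",".join(groups),
    }


def to_core_site_xml(props: Dict[str, str]) -> str:
    """Render the properties as <property> elements for core-site.xml."""
    lines = []
    for name, value in props.items():
        lines.append("<property>")
        lines.append(f"  <name>{escape(name)}</name>")
        lines.append(f"  <value>{escape(value)}</value>")
        lines.append("</property>")
    return "\n".join(lines)


if __name__ == "__main__":
    # All values below are placeholders, not taken from the Radoop documentation.
    props = proxyuser_properties("radoop_proxy",
                                 hosts=["rmserver.example.com"],
                                 groups=["analysts", "datasci"])
    print(to_core_site_xml(props))
```

Restricting the hosts and groups to the RapidMiner Server machine and the relevant user groups, rather than using a wildcard, limits what the super-user credentials can be used for.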

Enhancements and bug fixes

The following improvements are part of RapidMiner Radoop 7.4.

Enhancements

  • Added user impersonation (proxy user) capabilities: a superuser can now impersonate the RapidMiner Server user on the cluster
  • Added SparkRM operator for parallel process pushdown onto the cluster
  • Radoop Proxy is now disabled when running a process through Server
  • Added Spark 2.1 support (new option in the Spark Version list)
  • Type Conversion now accepts an attribute filter, making it easy to convert multiple (or all) attributes
  • Single Process Pushdown no longer warns for certain operators that they may not work properly
  • In case of a Hive connection error, more details may be revealed in the Log
  • Textfile is now the default input format for all Spark operators instead of Parquet (this sometimes gives better performance and lowers the risk of hitting the 2 GB partition limit)
  • Annotations of data sets inside Radoop Nest are now kept even after a Store and a Retrieve operator (stored in Hive metadata)
  • Single Process Pushdown no longer tries to run its subprocess a second time if a well-known process error occurs
  • Add Noise now has a local random seed parameter
  • Generate Data now allows defining the number of partitions of the output data set; by default, this number is calculated heuristically
  • Generate Data now allows specifying the file format of the output; Textfile is now the default instead of Parquet
  • When running on Server, the JBoss configuration and log directories are the primary paths for the radoop_connections.xml and log files
  • When Studio is closed while temporary tables are being dropped, it now waits for the cleanup to finish
  • The Log panel reports when a submitted Spark job has been waiting for free resources for minutes
  • In case of using LDAP for Hive (empty Hive Principal field), Kerberos settings are ignored in the Hive connection
  • A specific error message is shown if there is a timeout in a Hive-on-Spark job
  • Some core operators no longer trigger a design-time warning when they are used inside a Radoop Nest

Bug fixes

  • BUGFIX: Fixed issues with Kerberos ticket renewal in long-running Studio sessions
  • BUGFIX: Fixed accesswhitelist option in Radoop connections
  • BUGFIX: Connection import from Cloudera Manager no longer fails if cluster name contains a space (like Cloudera Quickstart)
  • BUGFIX: Unsupported attribute filter types (block_type, no_missing_values, numeric_value_filter) can no longer be selected for Radoop operators
  • BUGFIX: Single Process Pushdown now returns the missing values correctly for integer, nominal and date attributes
  • BUGFIX: Single Process Pushdown no longer loses the roles when creating an in-memory example set on an IOObject input port
  • BUGFIX: Single Process Pushdown no longer overwrites attributes when "canonical" names collide (e.g. when two attribute names only differ in case)
  • BUGFIX: Single Process Pushdown no longer fails with "getNominalMapping() is not supported" when the input Hive table is in PARQUET format and has TINYINT or SMALLINT columns (see HIVE-14294)
  • BUGFIX: Single Process Pushdown and Generate Data now clean up the temporary tables on their output
  • BUGFIX: Fixed misleading Hive connection error (TTransportException: SASL authentication not complete)
  • BUGFIX: Fixed potential issues caused by reusing Hive connections with different properties
  • BUGFIX: Import from Amazon S3 dialog now only lists supported file formats
  • BUGFIX: Replace with applicable Radoop operator quickfix now adds the multiclass Decision Tree Radoop operator, and not the old binominal version
  • BUGFIX: Changes in Radoop Proxy settings involved in already established connections are now properly applied without a restart