You are viewing the RapidMiner Radoop documentation for version 10.1.

Installing RapidMiner Radoop on RapidMiner Studio

RapidMiner Radoop is client software with an easy-to-use graphical interface for processing and analyzing big data on a Hadoop cluster. It can be installed on RapidMiner Studio and/or RapidMiner Server, and provides a platform for editing and running ETL, data analytics, and machine learning processes in a Hadoop environment. RapidMiner Radoop runs on any platform that supports Java.

Integrating RapidMiner Radoop into the RapidMiner advanced analytics suite is as easy as downloading the extension and making some configuration changes. The following instructions describe the process for installing the RapidMiner Radoop extension.

Prerequisites

The installation instructions assume that you have completed the following tasks. If any of these prerequisites have not yet been met, be sure to finish them before proceeding with the installation.

  • RapidMiner: You need RapidMiner Studio and, optionally, RapidMiner Server installed. If necessary, see the instructions for RapidMiner Studio installation or RapidMiner Server installation.
  • RapidMiner Radoop license: The free Radoop license is downloaded automatically once you log in. (Note that Radoop Basic is not enough to use Radoop.) If you are interested in enabling advanced capabilities and support, contact us to purchase a RapidMiner Radoop license.
  • Hadoop cluster: RapidMiner Radoop requires a connection to a properly configured Hadoop cluster. See Hadoop cluster requirements and supported Hadoop distributions.
  • A distributed data warehouse system: RapidMiner Radoop supports Apache Hive or Impala. The system must be installed on a Hadoop cluster. See the supported data warehouse systems.
  • Networking setup: Make sure that RapidMiner Radoop can connect to your Hadoop cluster. After installing RapidMiner Radoop and creating connections, refer to networking setup for more information.

Verifying port availability for RapidMiner Radoop

RapidMiner Radoop requires access to a variety of ports on the cluster. Make note of your port assignments for later use when configuring cluster connections and security settings. The table in the networking setup section lists the default port assignments for various components.
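
Before configuring connections, it can help to confirm that the ports you noted are reachable from the client machine. The following is a minimal sketch, not part of Radoop: the function names are illustrative, it assumes `bash` and the coreutils `timeout` command are available, and the host name and port numbers in the usage comment are placeholders (take the real values from the table in the networking setup section).

```shell
# Sketch: probe TCP ports from the client machine.
# Assumes bash (for /dev/tcp) and coreutils timeout are installed.

check_port() {
  # $1 = host, $2 = port; succeeds if a TCP connection opens within 3 seconds.
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

report_ports() {
  # $1 = host; remaining arguments = ports to probe, one line of output each.
  host="$1"; shift
  for port in "$@"; do
    if check_port "$host" "$port"; then
      echo "$host:$port reachable"
    else
      echo "$host:$port NOT reachable"
    fi
  done
}

# Example (placeholder host and ports, replace with your own assignments):
#   report_ports namenode.example.com 8020 8032 10000
```

A port reported as not reachable may be closed, filtered by a firewall, or simply assigned differently on your cluster, so treat the output as a starting point for troubleshooting rather than a verdict.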

Hadoop cluster requirements

RapidMiner Radoop requires a connection to a properly configured Hadoop cluster where it will execute all of its main data processing operations and store the data related to these processes. The cluster contains the following components:

RapidMiner Radoop supports most Spark versions 1.6.0 and above. See the table below to find out which Radoop Spark operators work with specific Spark versions.

Spark features Spark version 1.6.x Spark version 2.0.x/2.1.x/2.2.x/2.3.x/2.4.x
Linear Regression
Logistic Regression
Decision Tree (MLlib binominal)
Support Vector Machine
Decision Tree
Random Forest
Single Process Pushdown
SparkRM
Spark Script
K-Means
Isolation Forest

Using all Spark operators

If you want to use every Spark operator and your Hadoop cluster does not have Spark 1.6 or above, then Spark needs to be installed on the cluster manually. You can do so by downloading it from the Apache Spark download page. Make sure that the package type matches your cluster setup.

  • Installing Spark 1.6.0 for Hadoop 2.6 or later (you need to change the download link and the path for older Hadoop or newer Spark versions):

     # Create the target directory on HDFS
     hadoop fs -mkdir -p /tmp/spark
     # Download and unpack the Spark 1.6.0 build for Hadoop 2.6
     wget -O /tmp/spark-1.6.0-bin-hadoop2.6.tgz http://d3kbcqa49mib13.cloudfront.net/spark-1.6.0-bin-hadoop2.6.tgz
     tar xzvf /tmp/spark-1.6.0-bin-hadoop2.6.tgz -C /tmp/
     # Upload the Spark assembly JAR to HDFS
     hadoop fs -put /tmp/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar /tmp/spark/
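
Because the download link and paths change with the Spark and Hadoop versions, the same steps can be parameterized. The sketch below is one way to do that; the function names and argument order are illustrative assumptions, and the file naming pattern is simply inferred from the Spark 1.6.0 commands above. It applies only to Spark builds that ship an assembly JAR under lib/ (Spark 2.x packages are laid out differently).

```shell
# Sketch: derive file names for a Spark 1.6.x install from version numbers.

spark_package() {
  # $1 = Spark version (e.g. 1.6.0), $2 = Hadoop build profile (e.g. 2.6)
  printf 'spark-%s-bin-hadoop%s\n' "$1" "$2"
}

spark_assembly_jar() {
  # $1 = Spark version, $2 = full Hadoop version used in the JAR name (e.g. 2.6.0)
  printf 'spark-assembly-%s-hadoop%s.jar\n' "$1" "$2"
}

install_spark_assembly() {
  # $1 = Spark version, $2 = Hadoop profile, $3 = full Hadoop version,
  # $4 = download base URL, $5 = target HDFS directory
  pkg="$(spark_package "$1" "$2")"
  jar="$(spark_assembly_jar "$1" "$3")"
  hadoop fs -mkdir -p "$5" &&
    wget -O "/tmp/${pkg}.tgz" "$4/${pkg}.tgz" &&
    tar xzf "/tmp/${pkg}.tgz" -C /tmp/ &&
    hadoop fs -put "/tmp/${pkg}/lib/${jar}" "$5/"
}

# Example reproducing the commands above:
#   install_spark_assembly 1.6.0 2.6 2.6.0 http://d3kbcqa49mib13.cloudfront.net /tmp/spark
```

Each step is chained with `&&` so that a failed download or extraction stops the upload instead of putting a partial file on HDFS.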
    

To use the Spark Script operator, you need Python 2.6+ or Python 3.4+ (for PySpark scripts) and R 3.1+ (for SparkR scripts) installed on the cluster nodes. To use MLlib functions in Python, also install the numpy package. Because of PARQUET-136, Hive version 1.2.0 or later is recommended.
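
The version requirements above can be checked on each cluster node with a few helper functions. This is a minimal sketch under stated assumptions: the function names are illustrative, version strings are expected in major.minor[.patch] form, and the commands in the usage comments assume that `python` and `Rscript` are on the PATH of the node.

```shell
# Sketch: check interpreter versions against the Spark Script requirements.

python_version_ok() {
  # $1 = Python version string; accepts 2.x from 2.6 and 3.x from 3.4.
  major="${1%%.*}"; rest="${1#*.}"; minor="${rest%%.*}"
  case "$major" in
    2) [ "$minor" -ge 6 ] ;;
    3) [ "$minor" -ge 4 ] ;;
    *) return 1 ;;
  esac
}

r_version_ok() {
  # $1 = R version string; accepts 3.1 and later.
  major="${1%%.*}"; rest="${1#*.}"; minor="${rest%%.*}"
  [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 1 ]; }
}

# Example usage on a cluster node:
#   python_version_ok "$(python -c 'import platform; print(platform.python_version())')"
#   python -c 'import numpy'    # fails if numpy is missing
#   r_version_ok "$(Rscript -e 'cat(as.character(getRversion()))')"
```

Each function returns exit status 0 when the requirement is met, so the checks can be combined in a script that runs on every node.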

Consider the following differences between using Hive and Impala as the query engine for RapidMiner Radoop.

The following list contains the features unsupported by the Impala 1.2.3 release.

  • Sort operator: Impala does not support the ORDER BY clause without a LIMIT specified (or, since Impala version 1.4.0, only with certain restrictions that Radoop does not comply with). As a workaround, you can perform a sort with the Hive Script operator by adding an explicit LIMIT clause.

  • Add Noise operator: Add Noise is not supported on Impala.

  • Nominal to Numerical operator: Unique integers method of Nominal to Numerical is not supported on Impala.

  • Pivot Table operator: Pivot Table is not supported on Impala.

  • Apply Model operator: Model application with Impala is not supported.

  • Update Model and Naive Bayes operators: On Impala, RapidMiner Radoop does not support Naive Bayes learning or model updating by operator.

  • Correlation Matrix, Covariance Matrix, and Principal Component Analysis operators: The CORR() function is not supported by Impala.

  • Performance operators: The Performance (Regression) operator is not supported on Impala. For the Performance (Classification) operator, only the following criteria are supported on Impala: Accuracy, Classification Error, and Kappa.

  • Aggregation functions: Some aggregation functions are not supported by Impala. This may affect Generate Attributes, Normalize, and Aggregate operators. For these limitations, RapidMiner Radoop provides design-time errors, even though Impala allows you to run them.

  • No advanced Hive settings: You cannot set advanced Hive parameters for an Impala connection.

Hadoop cluster considerations

Although RapidMiner Radoop easily connects to all supported platforms, you may require special settings if you encounter a problem when trying to use it with one of the listed distributions. Details can be found in the Distribution Specific Notes section. This section lists a few considerations that you should be aware of when choosing an HDFS or data warehousing platform:

Cloudera Impala is an open-source query engine over Apache Hadoop. It provides a low-latency interface for SQL queries on data stored in HDFS, making RapidMiner Radoop usage closer to the experience of using it in a single-host environment. While Cloudera Impala can provide much faster response times than Hive, it does not support all the features of HiveQL.

Evaluate the Impala limitations to determine whether it is an acceptable alternative for your organization. For example, if you need advanced features (like model scoring), you must use Hive. If you use both Hive and Impala, consult the Impala Documentation for information on sharing metadata between the two frameworks. If using both, metadata used in Impala must be reloaded to reflect any metadata changes (such as creating new tables) made in Hive. (This can be done by enabling the reload impala metadata parameter of the Radoop Nest.)
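
Outside of a Radoop process, the same metadata reload can be triggered manually with Impala's INVALIDATE METADATA statement. The sketch below only builds the statement; the function name, the host name, and the table name in the usage comment are illustrative placeholders, not values from this documentation.

```shell
# Sketch: build an Impala metadata reload statement.

impala_reload_statement() {
  # With no argument, reload metadata for all tables.
  # With one argument (db.table), reload only that table.
  if [ -n "${1:-}" ]; then
    printf 'INVALIDATE METADATA %s\n' "$1"
  else
    printf 'INVALIDATE METADATA\n'
  fi
}

# Example (placeholder host and table):
#   impala-shell -i impalad.example.com -q "$(impala_reload_statement sales.orders)"
```

Reloading a single table is usually cheaper than reloading everything, which is why the function accepts an optional table name.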

Installing RapidMiner Radoop on RapidMiner Studio

The RapidMiner Radoop client installation is straightforward, assuming the prerequisites are met and the appropriate ports are available. The extension can be easily installed from the RapidMiner Marketplace.

If you want to install the extension manually, follow the steps below.

In Step 3, you will move the files to one of the locations below. There are two installation options; choose one.

  • To enable the plugin for all users on a machine (global install), move the files into the install folder at lib/plugins.

  • For RapidMiner Studio versions 6.4 and later, to enable the plugin for a single user only, move the files to .RapidMiner/extensions/ in the user's home folder. If the extensions folder does not exist, create it.

For Mac users running RapidMiner Studio versions 6.4 and later, move the files into .RapidMiner/extensions/. If the extensions folder does not exist, create it. Note that RapidMiner Studio creates .RapidMiner as a hidden folder, so you must set your Mac to display hidden files and folders if you cannot see it.

For Mac users running RapidMiner Studio versions prior to 6.4, move the files into the install folder at lib/plugins.

The process is as follows:

  1. If necessary, quit RapidMiner Studio.

  2. Download the RapidMiner Radoop plugin, a JAR file, from the location specified in your confirmation email.

  3. Move the downloaded RapidMiner Radoop JAR file (rapidminer-Radoop-onsite-<version>.jar) to the RapidMiner Studio directory on the host system.

  4. With the JAR file moved, start RapidMiner Studio.
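
Step 3 can also be scripted. The following is a minimal sketch based on the folder locations described above; the function names and the install folder in the usage comment are illustrative assumptions, and the JAR file name must be replaced with the file you actually downloaded.

```shell
# Sketch: move the Radoop JAR into the plugin folder for a global or single-user install.

radoop_target_dir() {
  # $1 = "global" or "user"; $2 = RapidMiner Studio install folder (global only)
  case "$1" in
    global) printf '%s/lib/plugins\n' "$2" ;;
    user)   printf '%s/.RapidMiner/extensions\n' "$HOME" ;;
    *)      return 1 ;;
  esac
}

install_radoop_jar() {
  # $1 = path to the downloaded JAR, $2 = scope, $3 = install folder (global only)
  target="$(radoop_target_dir "$2" "${3:-}")" || return 1
  # Create the folder if it does not exist, then move the JAR into it.
  mkdir -p "$target" && mv "$1" "$target/"
}

# Examples (quit RapidMiner Studio first; paths are placeholders):
#   install_radoop_jar ~/Downloads/rapidminer-Radoop-onsite-<version>.jar user
#   install_radoop_jar ~/Downloads/rapidminer-Radoop-onsite-<version>.jar global /opt/rapidminer-studio
```

A global install typically needs write permission on the install folder, so the second form may have to be run with elevated privileges.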

If the extension has been successfully installed, Hadoop Data appears in the middle, as a new view, in the RapidMiner Studio startup window.

That's it. Now that RapidMiner Radoop is installed, see the section on configuring connections to complete the installation.

Considering security

Consider the following security measures to secure your HDFS and data warehouse infrastructure:

  • Apply the firewall settings for your data warehouse system (optional but recommended).
  • Use Kerberos or Apache Sentry for securing your cluster. See the Hadoop security section for security configuration suggestions.