Installing RapidMiner Radoop on RapidMiner Studio
RapidMiner Radoop is client software with an easy-to-use graphical interface for processing and analyzing big data on a Hadoop cluster. It can be installed on RapidMiner Studio and/or RapidMiner Server, and provides a platform for editing and running ETL, data analytics, and machine learning processes in a Hadoop environment. RapidMiner Radoop runs on any platform that supports Java.
Integrating RapidMiner Radoop into the RapidMiner advanced analytics suite is as easy as downloading the extension and making some configuration changes. The following instructions describe the process for installing the RapidMiner Radoop extension.
The installation instructions assume that you have completed the following tasks. If any of these prerequisites have not yet been met, be sure to finish them before proceeding with the installation.
|RapidMiner||You need RapidMiner Studio, and optionally, RapidMiner Server installed. If necessary, see the instructions for RapidMiner Studio installation or RapidMiner Server installation.|
|RapidMiner Radoop license||A free Radoop license is downloaded automatically once you log in. (Note that Radoop Basic is not enough to use Radoop.) If you are interested in enabling advanced capabilities and support, contact us to purchase a RapidMiner Radoop license.|
|Hadoop cluster||RapidMiner Radoop requires a connection to a properly configured Hadoop cluster. See Hadoop cluster requirements and supported Hadoop distributions.|
|A distributed data warehouse system||RapidMiner Radoop supports Apache Hive or Impala. The system must be installed on a Hadoop cluster. See the supported data warehouse systems.|
|Networking Setup||Make sure that RapidMiner Radoop can connect to your Hadoop cluster. After installing RapidMiner Radoop and creating connections, refer to networking setup for more information.|
Verifying port availability for RapidMiner Radoop
RapidMiner Radoop requires access to a variety of ports on the cluster. Make note of your port assignments for later use when configuring cluster connections and security settings. The table in the networking setup section lists the default port assignments for various components.
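As a rough way to confirm that the ports you noted are reachable from the client machine, the sketch below tries a TCP connection to each one. The `CLUSTER_HOST` and `CLUSTER_PORTS` variables and the `check_port` helper are hypothetical names introduced for illustration, and the default port list is only a set of commonly used values (NameNode, ResourceManager, HiveServer2, Impala); substitute the assignments from your own cluster.

```shell
#!/bin/sh
# Sketch: test whether TCP ports on a cluster node accept connections.
# Host name and port list are placeholders, not values taken from this guide.

# check_port HOST PORT: prints "open" or "closed"; returns 0 or 1.
check_port() {
    # bash's /dev/tcp pseudo-device opens a TCP connection; timeout caps the wait.
    if timeout 3 bash -c 'exec 3<>/dev/tcp/$1/$2' _ "$1" "$2" 2>/dev/null; then
        echo "open   $1:$2"
    else
        echo "closed $1:$2"
        return 1
    fi
}

# Runs only when CLUSTER_HOST is set, for example:
#   CLUSTER_HOST=namenode.example.com sh check_ports.sh
# The default ports are common assumptions; replace them with your own.
if [ -n "${CLUSTER_HOST:-}" ]; then
    for port in ${CLUSTER_PORTS:-8020 8032 10000 21050}; do
        check_port "$CLUSTER_HOST" "$port" || true
    done
fi
```

A "closed" result can mean that the service is not running, that a firewall blocks the port, or that the host name does not resolve from the client, so treat it as a prompt to check the networking setup rather than as a diagnosis.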
Hadoop cluster requirements
RapidMiner Radoop requires a connection to a properly configured Hadoop cluster where it will execute all of its main data processing operations and store the data related to these processes. The cluster contains the following components:
- a supported Hadoop distribution, which includes HDFS and YARN
- a distributed data warehouse system (Hive or Impala)
- Java 8 on the cluster nodes (necessary for applying most RapidMiner models in-Hadoop and using Process Pushdown operators)
- optionally, Apache Spark. Below you can find detailed descriptions about the Spark requirements on the cluster.
|Spark features||Spark version 1.2.x/1.3.x/1.4.x||Spark version 1.5.x/1.6.x||Spark version 2.0.x/2.1.x/2.2.x/2.3.x|
|Decision Tree (MLlib binominal)|
|Support Vector Machine|
|Single Process Pushdown|
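The Java 8 requirement in the component list above can be spot-checked on a cluster node with a small script. This is only a sketch: the `java_major` helper is a hypothetical name, and it assumes that `java -version` prints the version in double quotes, as common JDK builds do.

```shell
#!/bin/sh
# java_major VERSION_STRING: reduce a Java version string to its major number.
# "1.8.0_292" -> 8 (legacy 1.x scheme), "11.0.2" -> 11.
java_major() {
    case "$1" in
        1.*) echo "$1" | cut -d. -f2 ;;
        *)   echo "$1" | cut -d. -f1 ;;
    esac
}

# Report the version of the java binary on this node, if one is on PATH.
if command -v java >/dev/null 2>&1; then
    # `java -version` writes to stderr, e.g.: openjdk version "1.8.0_292"
    raw=$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')
    echo "Java major version: $(java_major "$raw")"
else
    echo "java not found on PATH"
fi
```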
Using all Spark operators
If you want to use every Spark operator and your Hadoop cluster does not have Spark 1.5 or later, you must install Spark on the cluster manually. You can download it from the Apache Spark download page. Make sure that the package type matches your cluster setup (for example, the Hadoop version it was built for).
Installing Spark 1.5.2 for Hadoop 2.6 or later (you need to change the download link and the path for older Hadoop or newer Spark versions):
hadoop fs -mkdir -p /tmp/spark
wget -O /tmp/spark-1.5.2-bin-hadoop2.6.tgz http://d3kbcqa49mib13.cloudfront.net/spark-1.5.2-bin-hadoop2.6.tgz
tar xzvf /tmp/spark-1.5.2-bin-hadoop2.6.tgz -C /tmp/
hadoop fs -put /tmp/spark-1.5.2-bin-hadoop2.6/lib/spark-assembly-1.5.2-hadoop2.6.0.jar /tmp/spark/
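Because the download link and paths change with the Spark and Hadoop versions, it can help to derive them from two variables. The sketch below does that for Spark 1.x packages only, which ship a single assembly JAR under `lib/`. The function names, the `RUN_INSTALL`, `SPARK_VERSION`, and `HADOOP_PROFILE` variables, and the Apache release archive URL pattern are assumptions made for illustration, not part of the original instructions.

```shell
#!/bin/sh
# Sketch: the commands above, parameterized by Spark version and Hadoop profile.
# Assumes a Spark 1.x package, which ships one assembly JAR under lib/.

# spark_package SPARK_VERSION HADOOP_PROFILE: prints the package directory name.
spark_package() {
    echo "spark-$1-bin-hadoop$2"
}

# install_spark_assembly SPARK_VERSION HADOOP_PROFILE
install_spark_assembly() {
    pkg=$(spark_package "$1" "$2")
    # Assumed download location (Apache release archive); adjust for a mirror.
    url="https://archive.apache.org/dist/spark/spark-$1/$pkg.tgz"
    # The glob in the last step avoids hard-coding the JAR's Hadoop suffix.
    hadoop fs -mkdir -p /tmp/spark &&
    wget -O "/tmp/$pkg.tgz" "$url" &&
    tar xzf "/tmp/$pkg.tgz" -C /tmp/ &&
    hadoop fs -put /tmp/"$pkg"/lib/spark-assembly-*.jar /tmp/spark/
}

# Nothing is downloaded unless RUN_INSTALL=1 is set, for example:
#   RUN_INSTALL=1 SPARK_VERSION=1.6.3 HADOOP_PROFILE=2.6 sh install_spark.sh
if [ "${RUN_INSTALL:-0}" = "1" ]; then
    install_spark_assembly "${SPARK_VERSION:-1.5.2}" "${HADOOP_PROFILE:-2.6}"
fi
```

Afterwards, `hadoop fs -ls /tmp/spark/` should list the uploaded assembly JAR.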
To use the Spark Script operator, you need Python 2.6+ or Python 3.4+ (for PySpark scripts) and R 3.1+ (for SparkR scripts) installed on the cluster nodes. To use MLlib functions in Python, also install the numpy package. Because of PARQUET-136, Hive version 1.2.0 or later is recommended.
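One way to verify these interpreter requirements on a node is sketched below. The `version_ge` helper and the `PYTHON_BIN` variable are hypothetical names; the comparison relies on `sort -V`, which is available in GNU coreutils but may be missing on other systems.

```shell
#!/bin/sh
# version_ge ACTUAL MINIMUM: succeeds when ACTUAL >= MINIMUM.
# Uses version sort: the minimum must come first (or be equal) after sorting.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

PYTHON_BIN=${PYTHON_BIN:-python}

# Python: 2.6+ on the 2.x line, 3.4+ on the 3.x line.
if command -v "$PYTHON_BIN" >/dev/null 2>&1; then
    py=$("$PYTHON_BIN" -c 'import sys; print("%d.%d" % sys.version_info[:2])')
    case "$py" in
        2.*) min=2.6 ;;
        *)   min=3.4 ;;
    esac
    if version_ge "$py" "$min"; then
        echo "Python $py: ok"
    else
        echo "Python $py: too old (need $min+)"
    fi
    # numpy is only needed for MLlib functions in Python.
    if "$PYTHON_BIN" -c 'import numpy' 2>/dev/null; then
        echo "numpy: ok"
    else
        echo "numpy: missing"
    fi
else
    echo "$PYTHON_BIN not found on PATH"
fi

# R: 3.1+ for SparkR scripts.
if command -v Rscript >/dev/null 2>&1; then
    r=$(Rscript -e 'cat(as.character(getRversion()))')
    if version_ge "$r" 3.1; then
        echo "R $r: ok"
    else
        echo "R $r: too old (need 3.1+)"
    fi
else
    echo "Rscript not found on PATH"
fi
```

Run it on each node where Spark Script tasks may execute, because the requirement applies to the cluster nodes and not to the client machine.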
Consider the following differences between using Hive and Impala as the query engine for RapidMiner Radoop.
Sort operator: Impala does not support the ORDER BY clause without a LIMIT specified (or, since Impala version 1.4.0, only with certain restrictions that Radoop does not comply with). You may use the Hive Script operator to perform a sort by using an explicit LIMIT clause as well.
Add Noise operator: Add Noise is not supported on Impala.
Nominal to Numerical operator: Unique integers method of Nominal to Numerical is not supported on Impala.
Pivot Table operator: Pivot Table is not supported on Impala.
Apply Model operator: Model application with Impala is not supported.
Update Model and Naive Bayes operators: On Impala, RapidMiner Radoop does not support Naive Bayes learning or model updating by operator.
Correlation Matrix, Covariance Matrix, and Principal Component Analysis operators: The CORR() function is not supported by Impala.
Performance operators: The Performance (Regression) operator is not supported on Impala. For the Performance (Classification) operator, only the following criteria are supported on Impala: Accuracy, Classification Error, and Kappa.
Aggregation functions: Some aggregation functions are not supported by Impala. This may affect Generate Attributes, Normalize, and Aggregate operators. For these limitations, RapidMiner Radoop provides design-time errors, even though Impala allows you to run them.
No advanced Hive settings: You cannot set advanced Hive parameters for an Impala connection.
Hadoop cluster considerations
Although RapidMiner Radoop easily connects to all supported platforms, you may need special settings if you encounter a problem when using it with one of the listed distributions. Details can be found in the Distribution Specific Notes section. This section lists a few considerations that you should be aware of when choosing an HDFS or data warehousing platform:
Cloudera Impala is an open-source query engine over Apache Hadoop. It provides a low-latency interface to data stored in the HDFS for SQL queries, making RapidMiner Radoop usage closer to the experience of using it in a single host environment. While Cloudera Impala can provide much faster response time than Hive, it does not support all the features of HiveQL.
Evaluate the Impala limitations to determine whether it is an acceptable alternative for your organization. For example, if you need advanced features (like model scoring), you must use Hive. If you use both Hive and Impala, consult the Impala Documentation for information on sharing metadata between the two frameworks. If using both, metadata used in Impala must be reloaded to reflect any metadata changes (such as creating new tables) made in Hive. (This can be done by enabling the reload impala metadata parameter of the Radoop Nest.)
Installing RapidMiner Radoop on RapidMiner Studio
If you want to install the extension manually, follow the steps below.
There are two options for the installation. Choose one of them.
For enabling the plugin for all users on a machine (global install), move the files into the install folder at
For RapidMiner Studio versions 6.4 and later, to enable the plugin for a single user only, move the files to .RapidMiner/extensions/ in the user's home folder. If the extensions folder does not exist, create it.
For Mac users running RapidMiner Studio versions 6.4 and later, move the files into .RapidMiner/extensions/. If the extensions folder does not exist, create it. Note that RapidMiner Studio creates .RapidMiner as a hidden folder, so you must set your Mac to display hidden files and folders if you cannot see it.
For Mac users running RapidMiner Studio versions prior to 6.4, move the files into the install folder at
The process is as follows:
If necessary, quit RapidMiner Studio.
Download the RapidMiner Radoop plugin, a JAR file, from the location specified in your confirmation email.
Move the downloaded RapidMiner Radoop JAR file (rapidminer-Radoop-onsite-<version>.jar) to the RapidMiner Studio directory on the host system.
With the JAR file moved, start RapidMiner Studio.
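For the single-user option on RapidMiner Studio 6.4 and later, the file-moving step can be sketched as a short script. The `install_radoop_extension` function and the `RADOOP_JAR` variable are hypothetical names, and the sketch copies the JAR instead of moving it so that the download is kept; quit RapidMiner Studio before running it.

```shell
#!/bin/sh
# install_radoop_extension JAR [HOME_DIR]
# Copies the extension JAR into <HOME_DIR>/.RapidMiner/extensions/,
# creating the folder if it does not exist (single-user install).
install_radoop_extension() {
    ext_dir="${2:-$HOME}/.RapidMiner/extensions"
    [ -f "$1" ] || { echo "JAR not found: $1" >&2; return 1; }
    mkdir -p "$ext_dir" && cp "$1" "$ext_dir/" && echo "Installed into $ext_dir"
}

# Runs only when RADOOP_JAR points at the downloaded file, for example:
#   RADOOP_JAR="$HOME/Downloads/rapidminer-Radoop-onsite-<version>.jar" sh install_radoop.sh
if [ -n "${RADOOP_JAR:-}" ]; then
    install_radoop_extension "$RADOOP_JAR"
fi
```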
If the extension has been successfully installed, Hadoop Data appears in the middle, as a new view, in the RapidMiner Studio startup window.
That's it. Now that RapidMiner Radoop is installed, see the section on configuring connections to complete the installation.
Consider the following security measures to secure your HDFS and data warehouse infrastructure:
- Apply the firewall settings for your data warehouse system (optional but recommended).
- Use Kerberos or Apache Sentry for securing your cluster. See the Hadoop security section for security configuration suggestions.