Installing RapidMiner Radoop on RapidMiner Server

Prerequisites

The following requirements must be met before installing the RapidMiner Radoop extension on RapidMiner Server:

  • RapidMiner Radoop Extension installed and tested on RapidMiner Studio. If necessary, see Configuring RapidMiner Radoop Connections to ensure that you have a valid connection to a Hadoop cluster in RapidMiner Studio.

Installing RapidMiner Radoop on RapidMiner Server and the connected Job Agent(s)

Installing the RapidMiner Radoop client on RapidMiner Server requires that you copy files from your RapidMiner Studio configuration into your RapidMiner Server or Job Agent installations. You need the following artifacts to complete the installation:

  1. RapidMiner Radoop Extension (a Jar file). You can download the RapidMiner Radoop extension from the Marketplace, or copy it from the local .RapidMiner configuration directory (created by RapidMiner Studio) on your desktop computer.

  2. Radoop license (a license string and/or a .lic file). The RapidMiner Radoop license must be installed manually on RapidMiner Server (note that a Radoop Basic license is not enough to use Radoop). You can get it from https://my.rapidminer.com, or you can locate the license file in the local .RapidMiner configuration directory (created by RapidMiner Studio) on your desktop computer.

  3. Radoop Connection definitions (an XML file). Locate the radoop_connections.xml in your local .RapidMiner configuration directory (created by RapidMiner Studio).

Installing RapidMiner Radoop on RapidMiner Server

  1. Stop the server.

  2. Add the extension Jar file to the extensions (plugins) directory of RapidMiner Server:

    To determine the location of your RapidMiner Server plugins directory, from the RapidMiner Server home page open Administration and then System Settings. The value of the com.rapidanalytics.plugindir system setting indicates the location of the directory.

    RapidMiner Server 9.0 introduced the RapidMiner Server Home Directory. From that version on, you can alternatively copy the extension Jar file into its home/resources/extensions/ subfolder.

  3. On the Server Web UI, navigate to Administration > Manage Licenses and check your Radoop license under Active licenses. If it is a Radoop Basic license, click on Install License in the Actions menu (located on the right side by default) and paste your Radoop license in the text field.

  4. Restart the server.

Installing RapidMiner Radoop on RapidMiner Server Job Agents

Perform the following steps on each Job Agent connected to your RapidMiner Server.

  1. Stop the Job Agent.

  2. Add the extension Jar to the extensions directory of each Job Agent. For details see the instructions for Job Agents configuration.

  3. In a multi-user Server environment, please see the Configuring and securing multiple connections section. The final radoop_connections.xml must be placed in the container properties folder of Job Agents. Copy or link the file into the home/config/rapidminer/.RapidMiner/ folder.

  4. Copy the installed license files to the Job Agent's home/resources/licenses/radoop/ folder as well.

  5. Start the Job Agent.
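
The copy steps above can be scripted when several Job Agents have to be prepared. The following is only a minimal sketch: the function name and all arguments are hypothetical, the location of the extensions directory is not stated here and must be supplied by the caller, and agent_home is assumed to be the folder referred to as home in the paths above. Run it only while the Job Agent is stopped.

```python
import shutil
from pathlib import Path


def stage_job_agent_files(agent_home, extensions_dir, extension_jar,
                          connections_xml, license_files):
    """Copy the Radoop artifacts into one Job Agent installation.

    agent_home     -- the Job Agent folder called "home" in the steps above
    extensions_dir -- the agent's extensions directory (assumption: passed in
                      explicitly, because its location depends on the setup)
    """
    agent_home = Path(agent_home)
    targets = {
        "extension": Path(extensions_dir),
        "connections": agent_home / "config" / "rapidminer" / ".RapidMiner",
        "licenses": agent_home / "resources" / "licenses" / "radoop",
    }
    for directory in targets.values():
        directory.mkdir(parents=True, exist_ok=True)

    copied = [shutil.copy2(extension_jar, targets["extension"])]
    # The target file name is fixed, whatever the source file is called.
    copied.append(shutil.copy2(
        connections_xml, targets["connections"] / "radoop_connections.xml"))
    for license_file in license_files:
        copied.append(shutil.copy2(license_file, targets["licenses"]))
    return [Path(path) for path in copied]
```

The function returns the list of files it wrote, which can be logged or checked before the Job Agent is started again.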

Managing Radoop connections on RapidMiner Server

Radoop connections are stored in radoop_connections.xml on the server side, but there is no GUI on the server to edit the connections. Connections should be edited on the client side using RapidMiner Studio and added to the server as an XML file.

In a multi-user environment, the RapidMiner Server administrator needs to manually edit the radoop_connections.xml file on the Server and the Job Agents to make sure that all connections are included. The radoop_connections.xml file can list an arbitrary number of connections. These connections may point to the same Hadoop cluster or to different clusters. They may define connections for the same user or for different users (e.g., with different Hadoop username fields).

The connection file on RapidMiner Server should list all connections that may be used by any process submitted to this Server. The connection names must be the same on the Server and in the RapidMiner Studio instance that submits the process.

RapidMiner Server does not need to be restarted if radoop_connections.xml is modified. The XML file is re-read from disk, so every process execution started after the modification uses the modified connection. Processes that are already running are unaffected.

In a multi-user RapidMiner Server environment, two different configuration solutions are available for creating Radoop connections:

  1. a dedicated Radoop connection for each client user on the server side, or
  2. one connection with the credentials of a privileged Hadoop user who is allowed to impersonate other users (see Apache Hadoop user impersonation).

Option #1: Creating dedicated Hadoop connections for the client users

This approach requires a dedicated connection definition for each user, and administrators must take care to avoid connection name conflicts. RapidMiner Studio users only need to have their own connection(s) in the local connection file on their client machine. On the server side, multiple connections are defined in the connection file. One possible naming scheme is clustername_username, where clustername identifies the Hadoop cluster and username identifies the user (e.g., it may be the same as the value of the Hadoop username field). The Edit XML... option on the Connection Settings dialog can be used to copy each user's connection entry into the merged radoop_connections.xml on the Server.
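
Because name conflicts in the merged file have to be avoided by hand, a small check can help. The sketch below is not part of Radoop. It assumes that each radoop-connection-entry carries its name in a name child element, as in the examples later in this document, and both function names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def connection_name(cluster, user):
    """Build a name that follows the clustername_username convention."""
    return f"{cluster}_{user}"


def duplicate_connection_names(xml_text):
    """Return the connection names that occur more than once in the file."""
    root = ET.fromstring(xml_text)
    names = [(entry.findtext("name") or "").strip()
             for entry in root.iter("radoop-connection-entry")]
    return sorted(name for name, count in Counter(names).items() if count > 1)
```

An empty result means that every connection name in the merged file is unique.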

To control the access rights to these connections, e.g. so that users can only use their own connection when submitting processes to the Server, each connection should set the so-called Access Whitelist field to the corresponding username. See Access control on Radoop connections for details.

Option #2: Using Hadoop user impersonation in the Radoop connection

Hadoop user impersonation is available for Radoop connections. This approach enables administrators to add a single connection to RapidMiner Server with the credentials of a privileged Hadoop user who is able to impersonate other Hadoop users. It results in less maintenance and simpler access right management, and the credentials of the individual users (encrypted passwords or keytabs) are not stored on the server. Please note that using a keytab for the privileged superuser is strongly recommended, because ticket renewal is not fully supported when a password is used.

Hadoop-side configuration for impersonation

On the Hadoop side, there should be a dedicated user (for example, privilegeduser) who has the right to impersonate other users. Configure this as described in the Hadoop documentation. In a simple case, add the following snippet to core-site.xml in the Hadoop configuration:

<property>
    <name>hadoop.proxyuser.privilegeduser.hosts</name>
    <value>*</value>
</property>
<property>
    <name>hadoop.proxyuser.privilegeduser.groups</name>
    <value>*</value>
</property>

If HDFS Encryption (and the KMS service) is enabled, similar settings must also be present in kms-site.xml. For detailed information, please visit the KMS Proxyuser Configuration section of the KMS documentation page or follow the instructions of your Hadoop vendor.
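
To confirm that the proxyuser properties shown above are actually present, the core-site.xml content can be inspected with a short script. This is a sketch only: the function name is hypothetical, and it assumes the usual Hadoop layout in which property elements with name and value children sit under a configuration root.

```python
import xml.etree.ElementTree as ET


def proxyuser_settings(core_site_xml, user):
    """Return the hosts and groups proxyuser values configured for `user`.

    A property that is missing from the file is reported as None.
    """
    root = ET.fromstring(core_site_xml)
    props = {p.findtext("name"): p.findtext("value")
             for p in root.iter("property")}
    prefix = f"hadoop.proxyuser.{user}."
    return {key: props.get(prefix + key) for key in ("hosts", "groups")}
```

For the snippet above and the user privilegeduser, both returned values would be "*". A None value indicates that the corresponding property still has to be added.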

Creating and testing the connection for RapidMiner Server

Similar to the other approach, a connection should be constructed using RapidMiner Studio. You can find RapidMiner Server related settings on the RapidMiner Server tab of the Connection Settings dialog.

The Enable impersonation on Server checkbox should be enabled, and the credentials of the superuser should be entered in the Server Principal and Server Keytab File or Server Password fields, similar to the case with client users (presented in section Hadoop security configuration). If LDAP authentication is configured for Hive, the Hive Principal should be empty and the credentials of the privilegeduser should be entered in the Hive Username and Password fields (these two fields are only enabled if Hive Principal is empty).

The connection can be tested from RapidMiner Studio, if the networking setup allows connecting to the Hadoop cluster from the client hosts. If the Impersonated user for local testing field is set (e.g. scott is entered as username), then all the operations are submitted using the privilegeduser credentials, but impersonating the scott user and using its access rights. This field does not have an effect when running on RapidMiner Server: in that case, the Server user will always be the impersonated user.

Securing Radoop connections on RapidMiner Server

RapidMiner Server supports connections to Hadoop clusters with the same security settings as RapidMiner Studio, but you may need to manually edit the connection XML file (e.g. because of different file path settings on the server side). In general, connections should be constructed using RapidMiner Studio (using it as a "connection editor"), and the following additional steps should be considered.

Decrypting connection passwords

By default, RapidMiner Radoop encrypts the passwords in the radoop_connections.xml file with the local cipher.key file and decrypts them with the key attribute of the radoop-entries tag in the XML file. If the radoop_connections.xml contains entries from multiple users, there are two possible solutions:

  1. create every user's connection entry on the same computer (with the same cipher.key file), or
  2. manually add a key attribute to each radoop-connection-entry. Radoop then uses the per-entry key attribute instead of the per-file key.

For example, users John and Scott have the following radoop_connections.xml files:

<radoop-entries key="XkzjmytZW2ffc7+MnU11BdhzomF8355R">
    <radoop-connection-entry>
        <name>connection-john</name>
        ...
    </radoop-connection-entry>
</radoop-entries>
<radoop-entries key="KLS4GvvZta0NhtXfwkXQeSqD11ngXeWP">
    <radoop-connection-entry>
        <name>connection-scott</name>
        ...
    </radoop-connection-entry>
</radoop-entries>

The merged radoop_connections.xml looks like the following:

<radoop-entries key="dontcare">
    <radoop-connection-entry key="XkzjmytZW2ffc7+MnU11BdhzomF8355R">
        <name>connection-john</name>
        ...
    </radoop-connection-entry>
    <radoop-connection-entry key="KLS4GvvZta0NhtXfwkXQeSqD11ngXeWP">
        <name>connection-scott</name>
        ...
    </radoop-connection-entry>
</radoop-entries>
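
The manual merge shown above can also be sketched as a script. The code below is an illustration, not a Radoop tool: the function name is hypothetical, and it only relies on the structure visible in the examples (a radoop-entries root with a key attribute and radoop-connection-entry children). Each entry receives the key of the file it came from, unless it already carries its own key attribute.

```python
import xml.etree.ElementTree as ET


def merge_connection_files(xml_texts, file_key="dontcare"):
    """Merge several radoop_connections.xml documents into one document."""
    merged = ET.Element("radoop-entries", {"key": file_key})
    for text in xml_texts:
        source = ET.fromstring(text)
        source_key = source.get("key")
        for entry in source.findall("radoop-connection-entry"):
            # Keep an existing per-entry key; otherwise inherit the file key.
            if entry.get("key") is None and source_key is not None:
                entry.set("key", source_key)
            merged.append(entry)
    return ET.tostring(merged, encoding="unicode")
```

The per-file key of the merged document is a placeholder here, since every entry ends up with its own key attribute.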

Connection to Hadoop clusters with Kerberos authentication

For configuring a connection to a cluster with Kerberos authentication, see Hadoop security. Please note the following when using these connections through RapidMiner Server.

Connecting with Kerberos password

It is possible to use a password to connect to a Kerberized cluster. To make sure that the encrypted passwords in the connection XML can be decrypted on the Server, please refer to the Decrypting connection passwords section. Please note that on the Server side, using a keytab is recommended, because ticket renewal is not supported when a password is used.

Connecting with keytab file

Connections to a Kerberized cluster should specify the path to the user's keytab file instead of the password. This means that the keytab file must be accessible on the local file system of the Server. The path usually differs from the path on the local file system of the user using RapidMiner Studio. The RapidMiner Server administrator has to ensure that the keytabFile field of the radoop_connections.xml file on the Server points to the appropriate path on the Server. The keytab file itself should only be accessible to the operating system user running RapidMiner Server.
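
Both requirements (the keytab exists at the configured path, and only the Server's user can access it) can be checked before the Server is started. The sketch below is an assumption-laden illustration: it presumes that keytabFile is a direct child element of radoop-connection-entry, it checks POSIX permission bits only, and the function name is hypothetical.

```python
import os
import stat
import xml.etree.ElementTree as ET


def keytab_problems(connections_xml_text):
    """List problems with the keytab files referenced by a connections file."""
    problems = []
    root = ET.fromstring(connections_xml_text)
    for entry in root.iter("radoop-connection-entry"):
        name = entry.findtext("name") or "<unnamed>"
        path = (entry.findtext("keytabFile") or "").strip()
        if not path:
            continue  # this entry does not use a keytab
        if not os.path.isfile(path):
            problems.append(f"{name}: keytab not found at {path}")
            continue
        mode = stat.S_IMODE(os.stat(path).st_mode)
        # Any group or other permission bit means the file is too open.
        if mode & (stat.S_IRWXG | stat.S_IRWXO):
            problems.append(f"{name}: keytab {path} is accessible to group or others")
    return problems
```

Run the check as the operating system user that runs RapidMiner Server, so that the existence test reflects what the Server itself can see.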

Note: A RapidMiner Server instance can only talk to a single Kerberized Hadoop cluster or, more precisely, to a single Kerberos realm. This limitation comes from the architecture of the Java Kerberos implementation. However, multiple users can use this Kerberized Hadoop cluster concurrently through this RapidMiner Server instance.

Connecting to Hive with LDAP authentication

If LDAP is used for authentication to HiveServer2, passwords should be entered in the same way as Kerberos passwords. Please refer to the Decrypting connection passwords section. In case of impersonation, the provided Hive LDAP user should also have Hadoop proxyuser privileges.

Access control on Radoop connections

The availability of a Hadoop connection on RapidMiner Server can be limited to a user or a group of users. This means that a RapidMiner Server user that is not on the optionally specified whitelist of a connection cannot use it when submitting Radoop processes. This way, the Server administrator can make sure that users cannot use connections that they are not permitted to use, and that they cannot evade this restriction by manipulating their connection identifiers in submitted processes.

To define a group (or user) whitelist for a connection, add the accesswhitelist tag to the corresponding radoop-connection-entry in the radoop_connections.xml. The value of this property is an arbitrary regular expression (.* or * can be used for allowing all users). Only RapidMiner Server users whose username or group matches this expression are allowed to use the connection in a submitted process. If this optional accesswhitelist is not specified for a connection, then any user can use it in a process.

<radoop-connection-entry>
    ....
    <accesswhitelist>ds_group|dba_group|john|scott</accesswhitelist>
</radoop-connection-entry>
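
Before deploying a whitelist expression, it can be useful to try it against a few usernames and groups. The sketch below only approximates the behavior described above and is not Radoop's implementation: it assumes that the whole username or group name has to match, that both usernames and groups are tested, and that a bare * is treated as "allow all". The function name is hypothetical.

```python
import re


def is_allowed(whitelist, username, groups=()):
    """Return True if the user or one of their groups matches the whitelist.

    A missing whitelist (None) allows everyone, as does the bare "*" shorthand.
    """
    if whitelist is None or whitelist.strip() == "*":
        return True
    pattern = re.compile(whitelist.strip())
    candidates = (username, *groups)
    return any(pattern.fullmatch(candidate) for candidate in candidates)
```

With the example expression ds_group|dba_group|john|scott, this sketch accepts the user john and any member of ds_group, and rejects a user mary who belongs to neither group.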

Change Radoop Proxy enabled connections

Radoop Proxy is automatically disabled when a process is executed on RapidMiner Server. In a typical setup, RapidMiner Server runs inside the secure zone, so there is no need to route the traffic through the Proxy.

If you have a custom, manually installed Radoop Proxy on an edge node, and RapidMiner Server (like Studio) can only reach the Hadoop cluster via this edge node (so it runs outside the secure zone), you need to enable the Force Radoop Proxy on Server setting on the RapidMiner Server tab. This setting has no effect when running in Studio.

Alternatively, you can manually edit the radoop_connections.xml file on the Server. In this case, add the forceproxyonserver tag with the value T.

<radoop-entries key="XkzjmytZW2ffc7+MnU11BdhzomF8355R">
    <radoop-connection-entry>
        ...
        <forceproxyonserver>T</forceproxyonserver>
        ...
    </radoop-connection-entry>
</radoop-entries>

Please note that the location of the Radoop Proxy connection specified in Studio for this connection needs to be the Remote Repository corresponding to this RapidMiner Server instance. Otherwise, the process cannot find the proxy connection when running on the Server and fails.
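
The manual forceproxyonserver edit described above can also be applied with a script. This is a minimal sketch with a hypothetical function name. It assumes that the connection is identified by its name child element, and it places a newly created tag at the end of the entry.

```python
import xml.etree.ElementTree as ET


def force_proxy_on_server(connections_xml_text, connection_name):
    """Return the XML with forceproxyonserver set to T for one connection."""
    root = ET.fromstring(connections_xml_text)
    for entry in root.iter("radoop-connection-entry"):
        if entry.findtext("name") == connection_name:
            tag = entry.find("forceproxyonserver")
            if tag is None:
                tag = ET.SubElement(entry, "forceproxyonserver")
            tag.text = "T"
            return ET.tostring(root, encoding="unicode")
    raise KeyError(f"no connection named {connection_name!r}")
```

Raising an error for an unknown connection name avoids silently writing back an unchanged file.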