You are viewing the RapidMiner Radoop documentation for version 10.0.

Azure HDInsight 4.0

Configuring the Hadoop cluster

RapidMiner Radoop supports version 4.0 of Azure HDInsight, a cloud-based Hadoop service built on the Hortonworks Data Platform (HDP) distribution.

If you don't have an HDInsight cluster running in the Azure network, you can follow the Azure documentation to create one. Make sure to select Spark as the cluster type.

For HDInsight 4.0, Radoop does not yet support Azure Data Lake Storage Gen2 as primary storage or the Enterprise Security Package.

Hive setup

Radoop achieves part of its more complex functionality by defining custom functions (UDFs, UDAFs and UDTFs) in HiveServer2, which extend its capabilities.

Networking

If your networking allows direct access to all of the cluster nodes (DNS and reverse DNS for all hostnames, including the alias), you can skip this step.

Otherwise, follow the general description of the networking setup for accessing a Hadoop cluster. In an isolated network setup, Radoop users will need the connection details of a deployed Radoop Proxy.
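One way to check the direct-access condition described above is to test forward and reverse DNS resolution for each cluster node from the machine running RapidMiner Studio. The following is a minimal sketch using only the Python standard library. The function name check_host and the hostnames in the usage comment are hypothetical and are not part of Radoop. It only tests name resolution, not whether the required ports are reachable.

```python
import socket


def check_host(hostname, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Resolve hostname to an IP address and resolve that address back to a name.

    The forward and reverse lookups are parameters so they can be replaced
    in tests. 'ok' is True only when both lookups succeed.
    """
    result = {"hostname": hostname, "ip": None, "reverse_name": None, "ok": False}
    try:
        result["ip"] = forward(hostname)
        # gethostbyaddr returns (primary_name, alias_list, ip_address_list)
        result["reverse_name"] = reverse(result["ip"])[0]
        result["ok"] = True
    except OSError as exc:
        # socket.gaierror and socket.herror are both subclasses of OSError
        result["error"] = str(exc)
    return result


# Usage (hypothetical node names, replace with the hostnames of your cluster):
#   for host in ["hn0-example.internal", "wn0-example.internal"]:
#       print(check_host(host))
```

If any node reports ok as False, direct access is probably not available from that machine, and the networking setup or a Radoop Proxy is needed.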

Setting up the connection in RapidMiner Studio

We strongly recommend using the Import from Cluster Manager tool to create the connection, as several advanced properties required for correct operation are seamlessly gathered from the cluster during the import process.

  1. Use Import from Cluster Manager to create the connection directly from the configuration retrieved from Ambari.

  2. On the Hadoop tab, under Advanced Hadoop Parameters, provide the storage credentials for the primary storage of the HDInsight cluster.

    Azure Storage credentials: On the Azure storage dashboard, find the Access keys tab. Copy one of the keys and set it as the value of the fs.azure.account.key.<storage_name>.blob.core.windows.net parameter in your Radoop Connection.

  3. On the Hive tab, enter the Database Name to connect to. Choose a database where privileges for all operations are granted for the given user. Tick UDFs are installed manually.

  4. If you use Radoop Proxy, a Radoop Proxy Connection for this cluster must already exist. As a final step, tick Use Radoop Proxy on the Radoop Proxy tab of the Radoop Connection and select the Radoop Proxy Connection that was created for this cluster.
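The parameter name in step 2 follows a fixed pattern that depends only on the storage account name. The sketch below builds the key/value pair to enter under Advanced Hadoop Parameters. The function names are hypothetical helpers, not part of Radoop, and the name check assumes the usual Azure rule that storage account names consist of 3 to 24 lowercase letters and digits.

```python
import re

# Assumption: Azure storage account names are 3-24 lowercase letters and digits.
_STORAGE_NAME = re.compile(r"[a-z0-9]{3,24}")


def storage_key_property(storage_name):
    """Return the Hadoop parameter name for the given storage account."""
    if not _STORAGE_NAME.fullmatch(storage_name):
        raise ValueError(f"unexpected storage account name: {storage_name!r}")
    return f"fs.azure.account.key.{storage_name}.blob.core.windows.net"


def advanced_hadoop_parameter(storage_name, access_key):
    """Return the (key, value) pair to add under Advanced Hadoop Parameters."""
    if not access_key:
        raise ValueError("access key must not be empty")
    return storage_key_property(storage_name), access_key


# Usage (placeholder account name and key):
#   key, value = advanced_hadoop_parameter("mystorage", "<access key copied from Azure>")
#   print(key)
```

Treat the access key like a password: avoid committing it to version control or pasting it into shared scripts.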