This is the RapidMiner Radoop documentation for version 10.1.

Connecting to a 3.0.1+ Hortonworks Sandbox

As of this writing, the latest available version of Hortonworks Data Platform (HDP) on the Hortonworks Sandbox VM is 3.0.1, and this guide was written for that version.

Start and configure the Sandbox VM

  1. Download the Hortonworks Sandbox VM for VirtualBox from the Download website.

  2. Import the OVA-packaged VM into your virtualization environment (VirtualBox is covered in this guide).

  3. Start the VM. After powering it on, you have to select the first option from the boot menu, then wait for the boot to complete.

  4. Log in to the VM. You can do this by switching to the login console (Alt+F5) or, better, via SSH on localhost port 2122. Note that the VM exposes two SSH ports: 2122 belongs to the VM itself, while 2222 belongs to a Docker container running inside the VM. The username is root and the password is hadoop for both.

  5. Edit /sandbox/proxy/generate-proxy-deploy-script.sh to include the following ports in the tcpPortsHDP array: 8025, 8030, 8050, 10020, 50010.

    1. vi /sandbox/proxy/generate-proxy-deploy-script.sh
    2. Find the tcpPortsHDP variable and, leaving the other values in place, add the following entries to the hashtable assignment:

      [8025]=8025
      [8030]=8030
      [8050]=8050
      [10020]=10020
      [50010]=50010
      
  6. Run the edited generate-proxy-deploy-script.sh via /sandbox/proxy/generate-proxy-deploy-script.sh

    • This will re-create the /sandbox/proxy/proxy-deploy.sh script along with the config files in /sandbox/proxy/conf.d and /sandbox/proxy/conf.stream.d, thus exposing the additional ports added to the tcpPortsHDP hashtable in the previous step.
  7. Run the /sandbox/proxy/proxy-deploy.sh script via /sandbox/proxy/proxy-deploy.sh

    • Running the docker ps command will show an instance named sandbox-proxy and the ports it has exposed. The values added to the tcpPortsHDP hashtable should appear in the output, looking like 0.0.0.0:10020->10020/tcp.
  8. These changes only made sure that the referenced ports of the Docker container are accessible on the respective ports of the VM. Since the network adapter of the VM is attached to NAT, these ports are not accessible from your local machine. To make them available you have to add the port forwarding rules listed below to the VM. In VirtualBox you can find these settings under Machine / Settings / Network / Adapter 1 / Advanced / Port Forwarding.

    Name               Protocol  Host IP    Host Port  Guest IP  Guest Port
    resourcetracker    TCP       127.0.0.1  8025       (empty)   8025
    resourcescheduler  TCP       127.0.0.1  8030       (empty)   8030
    resourcemanager    TCP       127.0.0.1  8050       (empty)   8050
    jobhistory         TCP       127.0.0.1  10020      (empty)   10020
    datanode           TCP       127.0.0.1  50010      (empty)   50010
  9. Edit your local hosts file (on your host operating system, not inside the VM) and add sandbox.hortonworks.com and sandbox-hdp.hortonworks.com to your localhost entry. Afterwards, it should look something like this:

    127.0.0.1 localhost sandbox.hortonworks.com sandbox-hdp.hortonworks.com

  10. Reset Ambari access. Use an SSH client to log in to localhost as root, this time using port 2222. (For example, on OS X or Linux, use the command ssh root@localhost -p 2222, password: hadoop)

    • (At first login you have to set a new root password. Do so and remember it.)
    • Run ambari-admin-password-reset as root user.
    • Provide a new admin password for Ambari.
    • Run ambari-agent restart.
  11. Open the Ambari website: http://sandbox.hortonworks.com:8080

    • Log in as admin with the password you chose in the previous step.
    • Navigate to the YARN / Configs / Memory configuration page.
    • Set the Memory Node setting to at least 7 GB and click Override.
      • You will be prompted to create a new "YARN Configuration Group"; enter a name for it.
      • On the "Save Configuration Group" dialog, click the Manage Hosts button.
      • On the "Manage YARN Configuration Groups" page, move the node from the "Default" group into the group you just created.
      • A "Warning" dialog will open asking for notes; click the Save button.
      • A "Dependent Configurations" dialog will open, with Ambari recommending automatic changes to some related properties. If tez.runtime.io.sort.mb is among them, untick it to keep its original value. Click the Ok button.
        • Ambari may open a "Configurations" page with further suggestions. Reviewing them is outside the scope of this document, so just click Proceed Anyway.
    • Navigate to the Hive / Configs / Advanced configuration page.
    • In the Custom hiveserver2-site section, use Add Property... to add hive.security.authorization.sqlstd.confwhitelist.append and set it to the following value (it must not contain any whitespace):

      radoop\.operation\.id|mapred\.job\.name|hive\.warehouse\.subdir\.inherit\.perms|hive\.exec\.max\.dynamic\.partitions|hive\.exec\.max\.dynamic\.partitions\.pernode|spark\.app\.name|hive\.remove\.orderby\.in\.subquery
      
    • Save the configuration and restart all affected services. More details on hive.security.authorization.sqlstd.confwhitelist.append can be found in the Hadoop Security/Configuring Apache Hive SQL Standard-based authorization section.
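
Because the whitelist value must not contain whitespace, it can help to check the string before pasting it into Ambari. The following is a minimal Python sketch and not part of Radoop or Ambari; the names WHITELIST and whitelist_problems are illustrative only. It uses Python's re module as a rough stand-in for the Java regular expressions that Hive evaluates, which is an assumption that should hold for a simple pattern like this one.

```python
import re

# Value copied from the hive.security.authorization.sqlstd.confwhitelist.append
# step above. The variable and function names are hypothetical.
WHITELIST = (
    r"radoop\.operation\.id|mapred\.job\.name|"
    r"hive\.warehouse\.subdir\.inherit\.perms|"
    r"hive\.exec\.max\.dynamic\.partitions|"
    r"hive\.exec\.max\.dynamic\.partitions\.pernode|"
    r"spark\.app\.name|hive\.remove\.orderby\.in\.subquery"
)


def whitelist_problems(value):
    """Return a list of human-readable problems found in the property value."""
    problems = []
    # Any space, tab or newline (including a trailing one) is reported.
    if re.search(r"\s", value):
        problems.append("value contains whitespace")
    # Python's regex syntax is only an approximation of Java's here.
    try:
        re.compile(value)
    except re.error as exc:
        problems.append("not a valid regular expression: %s" % exc)
    return problems


def is_whitelisted(value, property_name):
    """Check whether a property name is fully matched by the whitelist pattern."""
    return re.fullmatch(value, property_name) is not None
```

An empty list from whitelist_problems(WHITELIST) suggests the value is safe to paste; is_whitelisted(WHITELIST, "spark.app.name") can be used to spot-check individual property names.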

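The port forwarding rules from step 8 can also be added from the command line instead of the VirtualBox settings dialog. The Python sketch below only builds and prints VBoxManage modifyvm --natpf1 command lines; it does not run them. The VM name is hypothetical (use the name shown in VirtualBox), the rule names are arbitrary labels, and natpf1 assumes the NAT adapter is Adapter 1 as described above. Note that modifyvm expects the VM to be powered off.

```python
import shlex

# (rule name, port) pairs taken from the port forwarding table in step 8.
RULES = [
    ("resourcetracker", 8025),
    ("resourcescheduler", 8030),
    ("resourcemanager", 8050),
    ("jobhistory", 10020),
    ("datanode", 50010),
]


def natpf_commands(vm_name, rules=RULES, host_ip="127.0.0.1"):
    """Build one `VBoxManage modifyvm` command line per forwarding rule."""
    commands = []
    for name, port in rules:
        # Rule format: name,protocol,host IP,host port,guest IP,guest port.
        # The guest IP field is left empty, as in the table.
        rule = "%s,tcp,%s,%d,,%d" % (name, host_ip, port, port)
        commands.append(
            "VBoxManage modifyvm %s --natpf1 %s"
            % (shlex.quote(vm_name), shlex.quote(rule))
        )
    return commands


if __name__ == "__main__":
    # "Hortonworks Sandbox HDP" is a placeholder VM name, not taken from the guide.
    for line in natpf_commands("Hortonworks Sandbox HDP"):
        print(line)
```

The quoting produced by shlex.quote targets POSIX shells; on Windows the printed commands may need their quoting adjusted by hand.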
Set up the connection in RapidMiner Studio

  1. Click the New Connection button and choose the Import from Cluster Manager option to create the connection directly from the configuration retrieved from Ambari.

  2. On the Import Connection from Cluster Manager dialog enter

    • Cluster Manager URL: http://sandbox-hdp.hortonworks.com:8080
    • Username: admin
    • Password: the password you set in the Reset Ambari access step.
  3. Click Import Configuration

  4. The Hadoop Configuration Import dialog will open.

    • If the import succeeds, click the Next button; the Connection Settings dialog will open.
    • If it fails, click the Back button, then review the steps above and the logs to resolve the issue(s).
  5. In the Connection Settings dialog, configure the connection as described in the following steps.

  6. Connection Name can keep its default value or be changed.

  7. Global tab

    • Hadoop Version should be Hortonworks HDP 3.x
    • Set Hadoop username to hadoop.
  8. Hadoop tab

    • NameNode Address should be sandbox-hdp.hortonworks.com
    • NameNode Port should be 8020
    • Resource Manager Address should be sandbox-hdp.hortonworks.com
    • Resource Manager Port should be 8050
    • JobHistory Server Address should be sandbox-hdp.hortonworks.com
    • JobHistory Server Port should be 10020
    • Under Advanced Hadoop Parameters, add the following parameters:

      Key Value
      dfs.client.use.datanode.hostname true

      (This parameter is not required when using the Import Hadoop Configuration Files option):

      Key Value
      mapreduce.map.java.opts -Xmx256m
  9. Spark tab

    • For Spark Version, select Spark 2.3 (HDP)
    • Check Use default Spark path
  10. Hive tab

    • Hive Version should be HiveServer3 (Hive 3 or newer)
    • Hive High Availability should be checked
    • ZooKeeper Quorum should be sandbox-hdp.hortonworks.com:2181
    • ZooKeeper Namespace should be hiveserver2
    • Database Name should be default
    • JDBC URL Postfix should be empty
    • Username should be hive
    • Password should be empty
    • UDFs are installed manually and Use custom database for UDFs should both be unchecked
    • Hive on Spark/Tez container reuse should be checked
  11. Click the OK button; the Connection Settings dialog will close.

  12. To test the connection, select it on the Manage Radoop Connections page and click the Quick Test and Full Test... buttons.

If errors occur during testing, confirm that the necessary components have started correctly at http://localhost:8080/#/main/hosts/sandbox-hdp.hortonworks.com/summary.
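
When a test fails, it can also be worth checking from the host machine that the sandbox hostnames resolve to 127.0.0.1 and that the forwarded ports accept connections. The following is a minimal Python sketch using only the standard library; the function names are illustrative, and the port list is an assumption assembled from the ports mentioned in this guide (8020, 8025, 8030, 8050, 8080, 10020, 50010 and 2181).

```python
import socket

# Hostnames added to the hosts file and ports referenced in this guide.
HOSTNAMES = ["sandbox.hortonworks.com", "sandbox-hdp.hortonworks.com"]
PORTS = [8020, 8025, 8030, 8050, 8080, 10020, 50010, 2181]


def resolves_to_loopback(hostname):
    """True if the name resolves to 127.0.0.1, as the hosts file entry intends."""
    try:
        return socket.gethostbyname(hostname) == "127.0.0.1"
    except OSError:
        return False


def port_is_open(host, port, timeout=2.0):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def report(hostnames=HOSTNAMES, ports=PORTS):
    """Collect one status line per hostname and per port."""
    lines = []
    for name in hostnames:
        ok = resolves_to_loopback(name)
        lines.append("%s: %s" % (name, "ok" if ok else "does NOT resolve to 127.0.0.1"))
    for port in ports:
        state = "open" if port_is_open("127.0.0.1", port) else "closed"
        lines.append("port %d: %s" % (port, state))
    return lines


# Example usage on the host machine:
#     print("\n".join(report()))
```

A closed port usually points at a missing port forwarding rule or a missing entry in tcpPortsHDP, while an open port only shows that something is listening, not that the service behind it is healthy.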