You are viewing the RapidMiner Server documentation for version 8.0.

Cold Swap Steps

This section lists, by way of example, the steps needed for a successful cold swap after your primary RapidMiner Server has gone down. In virtually all real-world scenarios, you will want to automate both the monitoring of your RapidMiner Server instance and the cold swap itself. At the bottom of the page you will find a basic example bash script that can take care of the cold swap.

RapidMiner Server became unhealthy, what now?

The reason for the unhealthy state may not be known at this point; it could be either temporary or permanent. Whatever the reason, a backup server needs to be started as soon as possible. Follow these basic step-by-step instructions for a successful cold swap failover:

  1. As soon as the health check indicates an unhealthy state, start a backup server.
  2. Ensure that the backup server is healthy.
  3. If the failover mechanism (e.g. a load balancer) does not do this automatically, change it to point to the address of the now-running backup server instead of the primary server address.
  4. Optional: Keep monitoring the primary server's health to know when to shut down the backup server again and point the failover mechanism back to the primary server.

These are the basic steps of a cold swap. Depending on your actual high availability setup and needs, you can add further steps to adapt the swap to your setup or to make it more sophisticated.
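Step 2 (ensuring that the backup server is healthy) can be automated by polling until a check succeeds. The following is only a minimal sketch: `wait_until_healthy`, `check_backup` and the `BACKUP` placeholder are hypothetical names that are not part of RapidMiner Server, and the check assumes that the backup exposes the same health check URL as the one used by the example script on this page.

```bash
#!/bin/bash

# Placeholder address of the backup server (assumption, adapt to your setup)
BACKUP="backup_ip:backup_port"

# Hypothetical helper: run a check command repeatedly until it succeeds.
# Usage: wait_until_healthy MAX_ATTEMPTS DELAY_SECONDS CHECK_COMMAND [ARGS...]
# Returns 0 as soon as the check succeeds, 1 if every attempt failed.
wait_until_healthy() {
    local max_attempts=$1
    local delay=$2
    shift 2
    local attempt
    for ((attempt = 1; attempt <= max_attempts; attempt++)); do
        if "$@"; then
            return 0
        fi
        # Do not sleep after the final attempt
        if ((attempt < max_attempts)); then
            sleep "$delay"
        fi
    done
    return 1
}

# Hypothetical check: succeeds if the health check URL answers without an HTTP error.
check_backup() {
    curl --silent --fail --max-time 60 --output /dev/null \
        "http://${BACKUP}/api/rest/public/healthcheck"
}
```

A call such as `wait_until_healthy 30 10 check_backup` would then try up to 30 times, 10 seconds apart, before giving up. Only switch the failover mechanism to the backup address once the call has returned successfully.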

Example Cold Swap Script

The script below can be used on a backup machine to monitor the primary RapidMiner Server instance and do the cold swap if the primary server becomes unhealthy. There are four states that are covered by this script:

  • Health check of primary server is ok and local backup JBoss is not running: everything is ok, nothing to do
  • Health check of primary server is ok and local backup JBoss is running: master is back, backup JBoss should be killed (failback)
  • Health check of primary server fails and local backup JBoss is not running: failover is needed, JBoss gets started
  • Health check of primary server fails and local backup JBoss is running: master is dead, but backup is running, nothing to do
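The four states above can also be written as a small decision function, which keeps the decision separate from the commands that act on it. This is a sketch only: `decide_action` and the action names `noop`, `failback` and `failover` are hypothetical and do not appear in the script below. It also ignores the retry counter that the script uses before it actually fails over.

```bash
#!/bin/bash

# Hypothetical helper: map the two observations to an action name.
#   $1 = 1 if the primary health check succeeded, 0 otherwise
#   $2 = 1 if the backup JBoss is running locally, 0 otherwise
# Prints one of: noop, failback, failover
decide_action() {
    local primary_ok=$1
    local backup_running=$2
    case "${primary_ok}${backup_running}" in
        10) echo "noop" ;;      # primary ok, backup not running
        11) echo "failback" ;;  # primary ok again, backup still running
        00) echo "failover" ;;  # primary down, backup not started yet
        01) echo "noop" ;;      # primary down, backup already running
        *)  echo "invalid input: $1 $2" >&2; return 2 ;;
    esac
}
```

For example, `decide_action 0 0` prints `failover`, and `decide_action 1 1` prints `failback`.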

Note: The script is only a very basic example and you will need to adapt it to your needs. There are plenty of options to improve upon it as its purpose is to serve as a starting point for an individual high availability setup. It also assumes a setup where only a single backup server exists, which is a limitation you may not want. Furthermore, this script kills the backup server if the primary server becomes healthy again. This might also be something you want to change and rather swap back to the original primary instance manually.
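If you prefer to swap back manually, one possible approach is to guard the automatic failback with a switch. The sketch below assumes a hypothetical `AUTO_FAILBACK` variable and a hypothetical `maybe_failback` wrapper. Neither exists in the example script, which always fails back automatically.

```bash
#!/bin/bash

# Hypothetical switch, not part of the example script:
# 1 = stop the backup automatically once the primary is healthy, 0 = only log
AUTO_FAILBACK=0

# Hypothetical wrapper: $1 is the name of the function or command that
# performs the failback (for example the script's do_failback function).
maybe_failback() {
    local failback_cmd=$1
    if [ "$AUTO_FAILBACK" == "1" ]; then
        "$failback_cmd"
    else
        echo "Primary is healthy again; automatic failback is disabled, swap back manually"
    fi
}
```

In the example script, the call to `do_failback` in the main loop could then be replaced with `maybe_failback do_failback`.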

Note: As mentioned in the General Setup, it is important that only a single RapidMiner Server instance runs at the same time! If multiple servers run at the same time, a number of undesirable things can happen, e.g. any of the servers might run scheduled jobs and it might not be the one you want.

#!/bin/bash

# Change according to your setup
MASTER="master_ip:master_port"
DEBUG=1

function log
{
    DATESTR=`date`
    if [ "$DEBUG" == "1" ]; then
        echo "$DATESTR - $1"
    fi
}

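# Alternative health check that parses the JSON response (requires jq).
# It is not called by main_loop, which uses do_health_check_http_status instead.
# Note: like the other functions here, it returns 1 on success and 0 on failure.
function do_health_check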
{
    retval=`curl --silent --max-time 60 http://${MASTER}/api/rest/public/healthcheck | jq ".healthy"`
    if [ "$retval" == "true" ]; then
        log "HC: success"
        return 1;
    else
        log "HC: failed"
        return 0;
    fi
}

# Health check based on the HTTP status code; this is the one main_loop uses.
function do_health_check_http_status
{
    retval=`curl --silent -I --max-time 60 http://${MASTER}/api/rest/public/healthcheck | head -n 1 | cut -f 2 -d" "`
    if [ "$retval" == "200" ]; then
        log "HC: success"
        return 1;
    else
        log "HC: failed"
        return 0;
    fi

}

function do_failover
{
    sudo /opt/rmserver/bin/standalone.sh &> /var/log/rmserver.log &
    log "Failover done";
}

function do_failback
{
    kill -9 `ps ax | grep jboss | grep java | awk ' { print $1 }'`
    log "Failback done";
}

function checkjboss
{
    retval=`ps ax | grep jboss | grep rmserver | wc -l`
    if [ "$retval" -gt 0 ]; then
        log "Jboss running on this host"
        return 1;
    else
        log "Jboss not running on this host"
        return 0;
    fi
}

function main_loop
{
    counter=0
    while true;
    do
        checkjboss;
        jbossret=$?

        do_health_check_http_status;
        healthret=$?

        # If master is healthy, then reset the counter
        if [ "$healthret" == "1" ]; then
            counter=0;
        fi

        # If Jboss is running on this (secondary) host, but master is back again, then let's stop this instance
        if [ "$jbossret" == "1" ] && [ "$healthret" == "1" ]; then
            log "Jboss is running on this host but master is healthy again"
            do_failback;
        fi

        # If Jboss is running on this (secondary) host and master is dead, then nothing to do
        if [ "$jbossret" == "1" ] && [ "$healthret" == "0" ]; then
            log "Jboss is running on this host and master is not healthy yet"
        fi

        # Master is healthy, nothing to do
        if [ "$jbossret" == "0" ] && [ "$healthret" == "1" ]; then
            log "Master is healthy, nothing to do"
        fi

        # Master appears to be dead; fail over only after more than 3 consecutive failed health checks
        if [ "$jbossret" == "0" ] && [ "$healthret" == "0" ]; then
            ((counter++))

            log "Master looks like dead ($counter health checks failed)"

            if [ "$counter" -gt 3 ]; then
                log "It is not a joke, we should do a failover asap!"
                do_failover;
            fi
        fi
        sleep 5;
    done
}

main_loop