You are viewing the RapidMiner Radoop documentation for version 9.3 - Check here for latest version

RapidMiner Radoop Property Settings

The table below describes the properties that influences RapidMiner Radoop's operation. They are found in the RapidMiner Studio > Settings > Preferences pull-down menu dialog, under the Radoop tab.

Note that each internal key starts with the prefix rapidminer.radoop.

Miscellaneous

Property	Internal key	Default value	Description
HDFS directory	hdfs_directory	/tmp/radoop/	Defines the path to the location where RapidMiner Radoop stores temporary files on the HDFS directory cluster. The user running RapidMiner Radoop must have permission to create this directory if it does not exist, and must have read and write permission on the directory. Furthermore, read permission is required for the user connecting to the Hive database. Please note that locating this directory in an encrypted zone (different from Hive warehouse directory) is not supported.
Table prefix	table.prefix	Radoop_	Defines the default Hive temporary table prefix for new processes. You can override this prefix using the Radoop Nest operator parameter of a given process so that users can easily differentiate their temporary objects on the cluster.
Auto describe	auto_describe	disabled	Toggles whether to automatically describe all Hive objects after connection or refresh. If enabled, this property saves the state of the toggle button on the Hadoop Data view. All metadata of your Hive objects are immediately fetched, which can be slow if there are many objects.
Describe max errors	describe.max_errors	5	Sets the threshold for errors. The Hadoop Data view considers a connection failed if it encounters, during describing Hive objects, more errors than this limit. You may have to increase this value if, for example, you have many Hive objects erroring when described (e.g., missing custom input/output format classes).
Automatic cleaning interval	cleaning_interval	5	Interval, in days, for the Radoop automatic cleaning service. Radoop cleans every temporary table, file, and directory older than the given threshold. Setting it zero disables automatic cleaning.
Spark Memory Monitor lookback seconds	spark.lookbacksecs	300	Window size in number of seconds the Spark Garbage Collection usage monitor will analyze.
Spark Memory Monitor GC Treshold	spark.gctreshold	0.98	If this percentage of the time in lookback seconds is spent with Garbage Collection, the Memory Monitor will kill the process.
Connection Pool Size	connection_pool_size	8	Size of the Hive JDBC connection pool. Increase it if you want to run many operations in parallel (e.g. on RapidMiner Server).

Sample size

Property	Internal key	Default value	Description
Sample size overall	sample_size.overall	200000	Sets the sample size for Hadoop data sets on nest output. When data arrives onto the output of the Radoop Nest, it is fetched into the client machine's memory. Use this value to limit the size of the data (sample). A value of 0 means full sample.
Sample size breakpoint	sample_size.breakpoint	1000	Sets the sample size for a Hadoop data set after a breakpoint in a process, and in the Hadoop Data view. When you pause a RapidMiner Radoop process using a breakpoint, a sample of the processed data is fetched into the client machine's memory to be manually examined. Use this value to define the number of rows in the sample. The Hadoop Data View also uses this limit when exploring tables. A value of 0 means full sample.

Timeout values

Property	Internal key	Default value	Description
Connection timeout	connection.timeout	30	Sets the timeout, in seconds, for the connection. This setting defines the time after which Radoop may cancel a connection test (and consider it failed). You may want to increase this value if the connection latency is high or if it varies by larger intervals. A value of 0 sets the default value (30 seconds).
Hive command timeout	hive_command.timeout	30	Sets the timeout, in seconds, allowed for simple Hive commands to return. This setting defines the time after which RapidMiner Radoop may cancel an atomic operation on the cluster. Increase this value if the connection latency is high or if it varies by larger intervals. A value of 0 sets the default value (30 seconds).
Log collection timeout	log_collection.timeout	30	Sets the timeout, in seconds, for the collection of YARN aggregated logs. Zero disables the feature. Turning off this feature is recommended if the YARN log aggregation is disabled for your cluster.

Fileformats

Property	Internal key	Default value	Description
Fileformat Hive	fileformat.hive	Default format	Specifies the storage format for Hive connections. The storage format is generally defined by the Radoop Nest hive_file_format parameter, but this property sets a default for the parameter in new Radoop Nests. It also defines the default settings for new table imports on the Hadoop Data view. 'Default format' means to use the Hive server default (usually TEXTFILE).
Fileformat Impala	fileformat.impala	Default format	Specifies the storage format for Impala connections. The storage format is generally defined by the Radoop Nest impala_file_format parameter, but this property sets a default for the parameter in new Radoop Nests. It also defines the default settings for new table imports on the Hadoop Data view. 'Default format' means use the Impala default (usually TEXTFILE).

Logging

Property	Internal key	Default value	Description
Enable log4j logging	log4j	disabled	Determines if log4j logs should be collected into the user folder.
Log4j properties file	log4j.properties		If log4j log collection is enabled and you wish to use your own log4j.properties file, define its location here. The file must contain the 'log4j.rootLogger' property which defines the logging level and the appenders to attach.

JDBC Connection Pool

Property	Internal key	Default value	Description
Connection pool size	connection_pool.fast_statement.size	8	Size of the Hive JDBC connection pool. Increase it if you want to run many operations in parallel (e.g. on RapidMiner Server).
Connection pool timeout	connection_pool.fast_statement.timeout	85	Timeout for waiting for available connection (seconds).

Container pool

Property	Internal key	Default value	Description
Hive on Spark container pool fixed size	connection_pool.container.size	0	Sets the maximum number of Hive on Spark application that Radoop can use. If set to 0, an estimated number of containers will be used based on cluster resources.
Hive on Spark container pool timeout	connection_pool.container.timeout	0	Timeout for waiting for available container (seconds). Use 0 to wait indefinitely for resources.
Hive on Spark / Hive on Tez container idle time	connection_pool.container.idle_time	30	Time after idle Hive on Spark / Hive on Tez containers will be closed (seconds). Use 0 to disable closing idle containers.

Categories

Versions