You are viewing the RapidMiner Studio documentation for version 2024.0 - Check here for latest version
In-Database Processing extension
Introduction
When processing remote data using Altair AI Studio or Altair AI Hub, there are, broadly speaking, three different approaches:
You can download the data from the remote repository and process it in AI Studio / AI Hub, before writing it back again, using operators such as Read Amazon S3, Read Google Storage, Write Database, and Write Salesforce.
You can process the data within the remote repository, using the repository's native tools:
The Execute SQL operator executes SQL directly in a remote database.
The In-Database Processing extension allows you to create native AI Studio processes that are automatically translated into the query language of the remote repository, whether it be Google BigQuery, Oracle, Snowflake, or any other of the supported services.
This document will concentrate on the third method, In-Database Processing.
Supported services
Currently the In-Database Processing extension supports the following services:
- Databricks
- Embedded
- Google BigQuery
- Microsoft SQL Server
- MySQL
- Oracle
- PostgreSQL
- Snowflake
We are continuously adding support for other services based on popular demand.
Note that the extension provides standard SQL support for any SQL database, even if the database is not included in the above list. The differences between supported SQL databases and unsupported SQL databases are the following:
- Supported SQL databases (and indeed any supported services, not just SQL) appear in the SQL dialect parameter list in the In Database Nest operator. Unsupported SQL databases appear as other.
- With supported SQL databases, non-standard functions (e.g.
string_agg
in PostgreSQL) appear in the function list for the operator Generate Attributes (In Database). With unsupported SQL databases, only standard SQL functions are available.
Note too that it is possible to write your own in-database extension.
Although Oracle is now a supported database within the In-Database Processing extension, we provide a sample extension for Oracle DB on github.
Advantages of in-database processing
Moving large data sets over the internet can be both slow and costly.
With the In-Database Processing extension, you can save both time and money by leveraging the computing power and efficiency of your data storage provider, with the following additional benefits:
- you reduce network congestion
- you limit local memory usage
- you keep your data securely in one place
- you can chain operators as you normally would in AI Studio, and the extension translates this chain into a complex query that is then executed remotely. In short, it's an SQL query builder.
Table of Contents
In Database Nest
Synopsis
This meta operator allows you to define a subprocess that extracts data from a database.
Description
This operator translates the steps that its subprocess defines to SQL, and submits that code into the selected database. The operator specifies the database connection to use. Only operators from the In-Database Processing operator group can be used inside this meta operator. The operator has no input ports, but it can have arbitrary number of outputs. Each of those outputs delivers the in-memory data set result of the query generated by the operators connected to that port. If you enable parallelization, these queries may run concurrently in you database. The queries are submitted in the natural order of the output ports. The generated query is available as an Annotation for each data set and can be further processed with Annotations to Data or other Annotation operators.
Macros
There are predefined macros available inside the Nest to be used in parameters. Please check the following list.
- process_start_db: process start date and time in a database-specific format based on the selected database type - note that the value could be a quoted literal, or a date function expression as well
- t_db: current date and time in a database-specific format based on the selected database type - note that the value could be a quoted literal, or a date function expression as well
- db_id: database type, that is, the SQL dialect used (either chosen manually or autodetected)
Read more: In Database Nest
Data Access
Retrieve
This operator defines the source table for the query that is built by subsequent In-Database Processing operators.
Read more: Retrieve
Store
This operator writes the input data into the database. Note that this operator requires proper CREATE TABLE privileges in the selected schema. No data is retrieved to the memory of the machine running this process unless the output of this operator is connected to a Nest output. This way, you can implement ETL processes without worrying about running out of memory, since the entire process will be pushed into the database.
Read more: Store
Blending (16)
Aggregate
This operator performs the well known GROUP BY operation using the specified list of grouping attributes. It then applies the specified aggregate functions on the selected columns. The aggregate functions can be defined in two different ways: either explicitly in a list, or by using the attribute selector and specifying a default aggregation function for these columns. The two methods can be combined. Note that the known HAVING clause from SQL can be simulated by an additional Filter Examples operator following this one.
Read more: Aggregate
Convert Type
Converts the data type of the chosen column. You can specify a default conversion that is applied on all columns that the attribute filter selects. You can also specify individual column data type conversions. The suggested data types are database-specific, and you can manually enter types that are not in the drop-down list. Note that for certain types, you may further specify the type, e.g. DECIMAL(8,2), NUMERIC(6,3), VARCHAR2(150). Data types in the output metadata are only available for these database-specific values, otherwise, the generic attribute_value is used.
Read more: Convert Type
Custom Query
This operator allows to run a custom SQL query as a step in the process. The query may use multiple number of inputs. Use input1, input2, etc. in the query to refer to the inputs. In the SELECT expression, use t1, t2 aliases for referring to these inputs. Take a look at an example query for using multiple inputs:
SELECT t2.* FROM input1 JOIN input2 ON t1.a1 = t2.a1
Note that the metadata on the output port does not automatically reflect the expected result of the specified SQL query. By default, the metadata on the output port simply copies the metadata on the first input port. To change this, use the metadata parameter, and set the expected metadata manually, so that you can configure subsequent operators more easily. Alternatively, simply click on the Read Metadata button to populate this parameters by making the database parse the specified query. Note that you must click on this button every time you change the SQL code and expect to get different result back.
Read more: Custom Query
Filter Example Range
This operator allows you to only keep a smaller set of rows. You can specify the number of rows and the offset, the latter defines at which row the sample should start. Note that the sample is not a random sample, but consists of subsequent rows of the data set.
Read more: Filter Example Range
Filter Examples
This operator filters the set of rows based on the criteria defined either by an expression (where you can use your database-specific functions as well), or by a combination of simple expressions. A dialog that you can access by clicking on the appropriate button helps in both cases.
When constructing the filter expression, you may find the built-in macros like process_start_db and t_db useful. Please check the Nest operator help for more information.
Read more: Filter Examples
Generate Attributes
This operator allows you to generate new columns, or override existing ones. New columns appear as last columns, overridden ones keep their place in the column list. The Expression Editor Dialog allows you to easily construct the column expressions. It shows you the database-specific built-in functions, columns and constants as well. Note that the operator cannot check the syntax of the expression, you only get an error for a wrong expression during process execution. The function list and function descriptions shown only reflect one particular version of the selected database type, their goal is just to help in building expressions with a UI. Otherwise, your database version may easily offer somewhat different functions or syntax.
When constructing the attribute expressions, you may find the built-in macros like process_start_db and t_db useful. Please check the Nest operator help for more information.
Read more: Generate Attributes
Generate Rank
The rank of a row is one plus the count of ranks before the given row. The dense rank of a row is one plus the count of distinct ranks before the given row. Note that if no partitioning is defined, the entire data set is sorted, a potentially slow operation for a large data set. For adding a unique ID variable to the data set, you may want to use your database-specific function instead, e.g. ROWNUM for Oracle.
Read more: Generate Rank
Join
This operator implements the well-known (equi-)join operation. It can be inner, left, right or outer join. The list of keys can be defined in a parameter. The operator can remove duplicate columns (based on their name), or keep them.
Read more: Join
Rename
This operator can rename one or more columns of the data set.
Read more: Rename
Rename by Replacing
This operator replaces parts of the column names (like whitespaces, parentheses, or other unwanted characters) by a specified replacement. The replace_what parameter can be defined as a regular expression (please refer to the tutorial for a description). The replace_by parameter can be defined as an arbitrary string. Empty strings are also allowed. Capturing groups of the defined regular expression can be accessed with $1, $2, $3...
Read more: Rename by Replacing
Reorder Attributes
This operator can define the order of the attributes. It is also possible to define an attribute order for a subset of attributes (regular expressions can be used) and SELECT only them and remove the unmatched attributes. You can also to sort the unmatched attributes and append or prepend them to the specified attribute list.
Read more: Reorder Attributes
Replace
This operator replaces parts of the string values of all nominal attributes it is applied on. The attribute filter type gives the possibility to restrict them. For each value of each attribute it is checked if the regular expression of replace what matches the string. Each matching part of the string will be replaced by the value of the replace_what parameter. The replacement might be empty and can contain capturing groups.
Please keep in mind that certain databases may not support regular expressions, in that case a design-time warning will appear if special characters are used. Also note that regular expressions may differ between different database engines. Please consult the documentation of your database regarding more complicated examples.
Note that the output data type for the columns with replaced values may depend on your database (e.g. LONGTEXT in case of MySQL that becomes Text in RapidMiner). In other cases, the replace expression may cause that the value becomes too large for the default return data type, in which case the result may be erroneous (e.g. max 4000 characters in Oracle).
Examples The attribute contains the values "color red", "color green" and "color blue".
- replacing "color" by "" yields: " red", " green", " blue"
- replacing "color" by "colour" yields: "colour red", "colour green", "colour blue"
- replacing "color\s" by "" yields: "red", "green", "blue"
- replacing "\s+" by "_" yields: "color_red", "color_green", "color_blue"
Read more: Replace
Sample
This operator is similar to the Filter Example Range operator. The number of examples in the sample can be specified on absolute or probability basis depending on the setting of the sample parameter. In case of absolute sample, it is possible to define the exact number of rows to be returned. While in case of probability, the required parameter is the sample probability, which is in [0,1] and defines the returned size of the rows compared to all rows. If it is zero, it is equal to setting the absolute sample to 0. If it is 1, the input is returned. In some databases it is possible to define a seed for the random sample generation. Otherwise, the result may not to be deterministic.
Read more: Sample
Select Attributes
This operator only keeps certain columns according to its parameters.
Read more: Select Attributes
Sort
This operator re-orders the rows according to the specified parameters. You can choose multiple sorting attributes and directions (ascending, descending). Note that sorting is typically one of the most costly database operations in terms of time and memory.
Read more: Sort
Union
This operator combines two input data sets with the same or compatible schema, by appending the rows of the second input to the rows of the first input. Duplicate rows can be removed, but they are kept by default.
Read more: Union
Cleansing (3)
Declare Missing Value
The given value will be replaced with NULL throughout the specified subset, so it will be treated as a missing value by subsequent operators.
Read more: Declare Missing Value
Remove Duplicates
This operator only keeps distinct rows in the data set. In other words, it only keeps one instance of duplicate rows.
Read more: Remove Duplicates
Replace Missing Values
Replaces missing values in examples. If a value is missing, it is replaced by one of the functions "minimum", "maximum", "average", and "none", which is applied to the non missing attribute values of the example set. "none" means that the value is not replaced. The function can be selected using the parameter list columns. If an attribute's name appears in this list as a key, the value is used as the function name. If the attribute's name is not in the list, the function specified by the default parameter is used. The replenishment "value" indicates that the user defined parameter should be used for the replacement. It can be a nominal or a numerical value.
Read more: Replace Missing Values
Utility (2)
Multiply
This operator can be used to start multiple branches from this point. The same data set can be processed differently on different branches. Note that as this operator does not write temporary data to the database, the query in the input will be re-executed for every output. Of course, the optimizer of the database almost certainly improves on this.
Read more: Multiply
Subprocess
A simple operator chain which can have an arbitrary number of inner operators. The operators are subsequently applied and their output is used as input for the succeeding operator. The input of the operator chain is used as input for the first inner operator and the output of the last operator is used as the output of the operator chain.
Read more: Subprocess