Categories

Versions

You are viewing the RapidMiner Studio documentation for version 2024.0 - Check here for latest version

Database Best Practices and Useful Knowledge

Altair RapidMiner and SQL Databases

Altair AI Studio and Altair AI Hub access relational (SQL) databases with the JDBC driver standard. Any database system that has a JDBC driver can be used with Altair RapidMiner. The information on this page refers to AI Studio and AI Hub unless otherwise noted, as they both share the same engine.

There are other types of databases available, often called “NoSQL” databases. Some of these are also supported by Altair RapidMiner, but this page is not relevant for them.

Table of contents

Relational Databases and the SQL Standard

Although there is a standard Structured Query Language (SQL), there are many differences between database systems. Switching databases is not always easy and Altair RapidMiner tries to abstract differences by only using standard SQL in many operations. But the portability is limited with hand-written SQL queries, special data types and advanced functionality.

Some areas of interest are:

  • Object naming conventions - Some database systems convert object (table, column, …) names to UPPER or lower case. But they often accept double-quoted object names and preserve their capitalization. This is what Altair RapidMiner does when constructing SQL statements, e. g. for writing into databases.

  • Data types - Some database systems only support long strings with special data types (e. g. “VARCHAR2”), some do it transparently. Some compare strings case insensitively, others expect exact matches. Some support timestamp with or without time zone information. Some have native support for XML and JSON documents.

  • Sorting - While sorting numbers should be well defined, sorting strings is not. Many database systems use the operating system for sorting strings and use the current locale setting. Others support sorting rules to be specified inside the database, even different ones for tables or single queries. Don’t expect that different databases sort the same data in the same way.

  • Functions - This is the largest area of differences between database systems. While the SQL standard offers a lot of common functionality, many database systems implemented additional functions that might be easier to use or do something more advanced. Using non-standard functions will affect portability between databases.

Connecting to Databases

In every repository or project there is a folder called Connections. The user creates connection objects in this folder. Each connection object contains the configuration information and the entire JDBC driver. This makes connections portable between systems, without the need for maintaining a common set of JDBC drivers.

Altair RapidMiner contains many freely available and redistributable JDBC drivers in the lib/jdbc subfolder of the AI Studio installation path. Additional drivers can be downloaded as jar files from the vendor’s websites and the connection can be pointed to them. It is a good idea to have a common folder for downloaded JDBC driver files.

Embedded Databases

Users without access to an external database server provided by, for example, corporate IT can run their own databases inside Altair RapidMiner. Two of these systems are even available in the default AI Studio installation, others can be downloaded and used with AI Studio.

Database How to get Remarks
HyperSQL (HSQLDB) Included in AI Studio. Add a new connection with the type HSQLDB. User name: SA

Custom connection string: jdbc:hsqldb:file:/path/to/file
H2 JDBC driver in “lib” inside the AI Studio installation path. Add a custom database connection and select the driver in the file browser. Custom connection string:

jdbc:h2:file:/path/to/file
SQLite Download the driver JAR file and add a custom database connection with it. Custom connection string:

jdbc:sqlite:/path/to/file

The database files are automatically created upon first connecting to the database. The data are stored in one or a few files that can be freely shared with other people.

If you are searching for a database server for a team and have the resources to maintain such a service, PostgreSQL is a good free choice as a well-supported open source database with advanced features for analytics.

Operators for Database Access

Read Database

The Read Database operator connects to a database and executes an SQL query, expecting it to return tabular data. The SQL query can be specified interactively in AI Studio, read from a file, or created by Altair RapidMiner by pointing to a table name. (Views, materialized views and other tabular object types in some database systems are also “tables” or “relations” in this context.)

The Build SQL Query dialog lists the tables and views in the selected database and the attributes in the selected object, this makes creating SQL queries easier.

Avoiding SQL Injection

SQL injection is an attack vector against software exposed to people with malicious intent. While this is not a frequent threat against Altair RapidMiner projects, such problems can still mess up processes and cause problems in processing data.

The following query illustrates the possibility of SQL injection:

SELECT * FROM Customers WHERE Lastname = '<query parameter>'

This searches for a customer with the given last name. The query parameter would be directly inserted into the statement sent to the database. Without any malicious intent, just by searching for a customer called O’Connor (or any other last name containing the single quote character), calling this SQL statement with the query parameter will lead to a syntax error.

This is easy to avoid. Read Database has a parameter called prepare statement (activate Show advanced parameters if you don’t see it). This enables a parameter dialog in which you can add your parameters and specify their types. This way of passing parameters to the SQL statements avoids unexpected syntax errors in the database and helps identifying problems with data that are in unexpected formats (e. g. a string instead of a number).

With prepared statements the query is written as this:

SELECT * FROM Customers WHERE Lastname = ?

The question mark refers to the first parameter entered in the dialog, a second one to the second parameter, and so on.

Sharing the Connection

Read Database, just as the other database operators, has optional “con” (connection) inputs and outputs. Some processes with multiple database operators can be simplified by retrieving the connection entry in a first step and sharing it among the database operators by passing it between them. This is also a good way to create processes that can work with different database environments (e. g. development, test and production) by passing the connection from outside or using parameters.

Write Database

The Write Database operator takes tabular data and stores them in a database. The table schema and table name can be specified.

It is important to consider the correct setting of the overwrite mode parameter. By default the operator creates the target table as necessary and changes its structure to match the incoming data. This can be the expected behaviour for a single data scientist working alone in a database, but it is absolutely not ideal for other scenarios. Often, the database structure is fixed and can’t even be changed by the Altair RapidMiner user.

In these situations append is the correct overwrite mode. If the table needs to be empty before writing, one can use Execute SQL with a statement like DELETE FROM tablename before Write Database.

Many database tables use generated primary keys like numeric ID values that the database system generates for new records. The parameter add generated primary keys gets these values after the insert operation and adds them to the table.

Depending on the database type, set default varchar length might be necessary in situations where the table structure is created or updated by Altair RapidMiner. By default Altair RapidMiner tries to use a text data type with a large length, if such a type exists in the database system.

In the beginning of developing a process and with full access to the database, the standard behaviour of always recreating the table with the structure matching the incoming data can be helpful. Later it is often better to freeze the table structure, perhaps add some indexes for faster queries, and use the append overwrite mode.

The batch size parameter can be used to improve the write speed of Write Database. Altair RapidMiner groups the records to insert in one step into batches with the size of batch size and submits them together in one transaction. Enterprise databases might be able to write the data faster with much higher settings (e. g. 500 or 1000) compared to the default setting of 100.

Update Database

The Update Database operator takes a data table, tries to find data in the database matching the input rows, and updates or adds them.

The default setting of using all attribute values to identify the row to update is almost never the right one. Usually one would select an ID attribute (attribute filter type = single) or a combination of attributes.

The lookup is done row for row. This has large performance implications.

With larger database tables it is important to have proper indexes on the table. Tables created by Write Database usually don’t have these, so it’s a good idea to add them, referring to the documentation of the database system being used.

There are situations in which an existing index is not used by Altair RapidMiner and the access is still slow. Sometimes the reason is that Altair RapidMiner identifies the ID column as a floating-point numeric value, but the index is integer. Some database systems don’t convert automatically between these representations, don’t find an index matching the criterion, and revert to full table scans. This can be solved in Altair RapidMiner by explicitly changing the data type to Integer or in the database by creating a matching index.

Database Access Latency Considerations

With the row-by-row access pattern that Update Database uses, network latency becomes an important topic. The latency between different computers is the roundtrip time between network packets which contain requests and the replies to these requests. In local networks we see latencies below a millisecond, while on the Internet and with cloud services they can reach hundreds of milliseconds.

This means that the same process can take a lot of time with Update Database even if nothing changes but the location of the database. Depending on the latency, only a few to about 30 operations per second are possible between different network locations, while the same action could be done with thousands of rows per second in a local network. Obviously, it’s not always possible to freely choose the location of the database and Altair RapidMiner, but this is something to watch out for.

Read Database and Write Database are not affected by latency as strongly as Update Database, as they don’t use the one-by-one access pattern, unless the batch size parameter of Write Database is too low.

If the performance of Update Database is a problem, and latency has been identified as the reason, a possible solution is to use Write Database, put the data into a temporary table, and use SQL statements to insert new and update existing data in the target table. This of course requires some SQL knowledge, which is often available in enterprise settings.

Execute SQL

The Execute SQL operator is available to send arbitrary SQL statements to the database. Like the other scripting operators, it requires some knowledge of SQL and the dialect used in the current database.

Multiple statements can be sent together in one step if the target database supports this. The statements should be separated with semicolons. If this doesn’t work, see the documentation of your database system.

Use cases for Execute SQL include table and index creation, removing data from tables before filling them again, making bulk updates and anything else that can be done in the database by experienced users.

Like Read Database, Execute SQL also supports the parameter submission with prepare statement. Avoiding SQL injection is equally important in executing SQL as it is when reading data.

The In-Database Processing Extension

For complex queries from relational databases, the In-Database Processing Extension is available from the Altair RapidMiner Marketplace.

Its purpose is the graphical modeling of queries and read-write pipelines in databases with the well-known Altair RapidMiner concepts and operators. Essentially, it extends Altair RapidMiner to become a database query builder.

The base operator is In Database Nest. Within the nest, the other operators belonging to the extension can be freely arranged. A common approach is to include the following operators:

The operators inside the In Database Nest are not executed directly. Their arrangement and parameters are used to generate one or more SQL statements that are sent to the database directly. Altair RapidMiner then reads the result of this SQL query. This allows users who are experienced with Altair RapidMiner but don’t know SQL to create complex queries and execute them efficiently in the database. This can be much more efficient compared to reading all data and processing (filtering/generating/joining) in Altair RapidMiner.

The extension has extended support for some database systems, including the full list of their functions in Generate Attributes (In-Database). Other database systems are supported with standard SQL functionality.

Users are often interested in seeing the generated SQL statement. It is available as an annotation of the resulting dataset. In the Results view there is an Annotations tab where the statement can be found and copied.

Database Concepts and their Application in Altair RapidMiner

Transactions

In relational databases, transactions group database operations together, making sure that either none of them or all of them succeed. The canonical example for the necessity of transactions is a bank: Consider a process that subtracts money from account A and adds it to account B. If only the first operations succeeds, somebody loses money that simply disappears.

In SQL, a transaction is started with the BEGIN keyword. Any number of other SQL statements can be executed afterwards. They will be made permanent in the databank with COMMIT. If something broke inside the transaction, it either automatically ends or can be canceled with ROLLBACK.

Altair RapidMiner supports transactions in Execute SQL. The user just needs to write BEGIN; when the transaction is expected to start and COMMIT; at the end.

Operators in Altair RapidMiner are executed separately and after each other. This includes the separate opening of database connections in each operator inside a process. So unfortunately it is not possible to use Execute SQL for starting a transaction, doing some operations like Read Database and Write Database, and then commit the transaction at the end.

Indexes

As mentioned, indexes are a great way to make database queries faster. An index is a special structure inside the database that makes finding information faster than reading through the entire table.

Most database systems support indexes, and some even different types of indexes for different use cases. Information about the details can be found in the database system’s documentation.

For most database types, the following command will create an index on a table column with the default settings:

CREATE INDEX ix_table_column ON tablename(columnname);

When to Create Indexes?

There are some criteria that suggest that creating an index is a good idea, but there is no definitive answer to this question.

  • The column is used in joins or filters (WHERE criteria).

  • The column is used in Update Database (which means that it will be repeatedly used in filters).

  • The column has a good distribution of different values instead of just one or two. The smaller the fraction of the selected data relative to the table, the larger the speed improvement through the index.

  • The table is large.

  • The query is frequently executed.

Don’t create a large number of indexes without an actual need. They will just take up disk space and make both reading and writing data slower.

Altair RapidMiner automatically uses indexes in queries when the database determines that it will improve the query performance. In most database systems it is not possible or advisable to try to force index usage.

Security

Authentication and Access Control

Most enterprise database systems support user authentication, user groups or roles, and fine-grained access control on database objects.

Often, there is a common read-only user account for analysts in such databases. In other scenarios, however, the database is integrated with the enterprise authentication system and users need to authenticate themselves with their own user accounts. Altair RapidMiner supports this use case with injected parameters that are entered when creating database connections.

Firewalls

The traditional rule about database security is to limit access to database systems as much as possible. Many enterprise databases are locked down in this way and they only accept connections from specific networks. For example, users might be able to access them while they are in the office but not from home.

This might mean that the same database connection works while the laptop is in the office network but doesn’t otherwise. In such situations it can be helpful to execute the process on an AI Hub and work with its results.

JDBC Connection Parameters

Sometimes the JDBC driver offers parameters to optimize the behaviour of database connections.

For example, the following options can be set for PostgreSQL:

  • stringtype (default: VARCHAR): better set to "undefined". When using prepared statements, which RM does e. g. in Write Database and Update Database, stringtype=VARCHAR forces many data types to be interpreted as a string by the database. If you change it to stringtype=unspecified, you can write data that has a different but compatible data type in the database. E. g. if you have dates represented as strings in your example set, or booleans with true/false, these can be converted to the actual column type by the database but only with stringtype=unspecified.

  • defaultRowFetchSize (default: 0): by default, the Postgres JDBC driver tries to read the whole query result at once. This matches Altair RapidMiner's behaviour, but setting the value to something like 100000 (number of rows read in one batch) might reduce memory usage in some scenarios. Programs reading sequentially (not Altair RapidMiner) might save a lot of memory with this setting changed to a properly chosen number.

  • currentSchema: PostgreSQL has a concept of a search path (just like the command interpreter on your operating system). E. g. the schema "public" and a schema $username (if it exists) are automatically on the search path. If you want to refer to objects in a particular schema (e. g. datamining) without having to specify the schema, you can set this parameter to the schema name. Be careful: your queries referencing just the tables (without schema) will be only portable to other systems if you make the same change there.

  • ApplicationName: it's polite to set this to something like RapidMiner/MyProject. Database administrators will like you for seeing the application name in their tools and knowing who/what is using the database.

JDBC connection parameters are highly database specific. Don’t expect these settings to work for other databases, but the concepts here might be there, just named differently. Refer to the JDBC driver’s documentation for details.