Stream Database (JDBC Connectors)

Synopsis

This operator reads an ExampleSet from an SQL database by incrementally caching it (recommended).

Description

The Stream Database operator is used for reading an ExampleSet from the specified SQL database. You need to have at least basic understanding of databases and database connections in order to use this operator properly. Please go through the parameter description and the attached Example Process to understand the working of this operator.

This operator reads an ExampleSet from an SQL database. The data is loaded from a single table which is defined by the table name parameter. Please note that table and column names are often case sensitive. The most convenient way of defining the necessary parameters is through the configuration wizard. The most important parameters (database URL and username) will be automatically determined by this wizard. You can define the special attributes like labels, ids and weights through corresponding parameters.

In contrast to the Database operator, which loads the data into the main memory, this operator keeps the data in the database and performs the data reading in batches. This allows RapidMiner to access data sets of arbitrary sizes without any size restrictions.

Please note the following important restrictions and notes:

  • Only manifested tables (no views) are allowed as the base for this data caching operator.
  • If primary key and index are not present, a new column named RM_INDEX is created and it is automatically used as the primary key
  • If a primary key is already present in the specified table, a new table named RM_MAPPED_INDEX is created which maps a new index column RM_INDEX to the original primary key.
  • The users can provide the primary key column RM_INDEX themselves. This column should be an integer valued index attribute, counting should start from 1, without any gaps or missing values for all rows.

Besides the new index column or the creation of mapping the table, no writing actions are performed in the database. Moreover, data sets built on top of a cached database table do not support writing actions at all. Users have to materialize the data, change it, and write it back into a new table of the database (e.g. with the Write Database operator).

Differentiation

Execute SQL

The Stream Database operator is used for reading data from a database. The Execute SQL operator cannot be used for reading data from databases. It can be used for executing SQL statements like CREATE or ADD etc on the database.

Read Database

In contrast to the Read Database operator, which loads the data into the main memory, the Stream Database operator keeps the data in the database and performs the data reading in batches. This allows RapidMiner to access data sets of arbitrary sizes without any size restrictions.

Output

  • output (Data Table)

    This port delivers the database table in form of an ExampleSet along with the meta data. This output is similar to the output of the Retrieve operator.

Parameters

  • define_connectionThis parameter indicates how the database connection should be specified. The following options are available: predefined, url and jndi. Range: selection
  • connectionThis parameter is only available when the define connection parameter is set to 'predefined'. This parameter is used for connecting to the database using a predefined connection. You can have many predefined connections. You can choose one of them using the drop down box. You can add a new connection or modify previous connections using the button next to the drop down box. You may also accomplish this by clicking on the Manage Database Connections... from the Tools menu in the main window. A new window appears. This window asks for several details e.g. Host, Port, Database system, schema, username and password. The Test button in this new window will allow you to check whether the connection can be made. Save the connection once the test is successful. After saving a new connection, it can be chosen from the drop down box of the connection parameter. You need to have basic understanding of databases for configuring a connection. Range: string
  • database_systemThis parameter is only available when the define connection parameter is set to 'url'. This parameter is used for selecting the database system in use. It can have one of the following values: MySQL, PostgreSQL, Sybase, HSQLDB, ODBC Bridge (e.g. Access), Microsoft SQL Server (JTDS), Ingres, Oracle. Range: selection
  • database_urlThis parameter is only available when the define connection parameter is set to 'url'. This parameter is used for defining the URL connection string for the database, e.g. 'jdbc:mysql://foo.bar:portnr/database'. Range: string
  • usernameThis parameter is only available when the define connection parameter is set to url. This parameter is used to specify the username of the database. Range: string
  • passwordThis parameter is only available when the define connection parameter is set to 'url'. This parameter is used for specifying the password of the database. Range: string
  • jndi_nameThis parameter is only available when the define connection parameter is set to 'jndi'. This parameter is used for specifying the JNDI name for a data source. Range: string
  • table_nameThis parameter is used for selecting the required table from the specified database. Range: string
  • recreate_indexThis parameter indicates if recreation of the index or index mapping table should be forced. Range: boolean
  • label_attributeThe name (case sensitive) of the label attribute is specified through this parameter. Range: string
  • id_attributeThe name (case sensitive) of the id attribute is specified through this parameter. Range: string
  • weight_attributeThe name (case sensitive) of the weight attribute is specified through this parameter. Range: string

Tutorial Processes

Reading an ExampleSet from a mySQL database

The Stream Database operator is used in this Example Process for reading a mySQL database. The define connection parameter is set to predefined. The define connection parameter was configured using the button next to the drop down box. The name of the connection was set to 'mySQLconn'. The following values were set in the connection parameter's wizard. The Database system was set to 'mySQL'. The Host was set to 'localhost'. The Port was set to '3306'. The Database scheme was set to 'golf'; this is the name of the database. The User was set to 'root'. No password was provided. You will need a password if your database is password protected. Set all the values and test the connection. Make sure that the connection works.

The table name parameter is set to 'golf_table' which is the name of the required table in the 'golf' database. The label attribute parameter is set to 'Play'. Run the process, you will see the entire 'golf_table' in the Results Workspace.