Categories

Versions

You are viewing the RapidMiner Studio documentation for version 9.7 - Check here for latest version

Read Database (JDBC Connectors)

Synopsis

This operator reads an ExampleSet from a SQL database.

Description

The Read Database operator is used for reading an ExampleSet from the specified SQL database. You need to have at least basic understanding of databases, database connections and queries in order to use this operator properly. Go through the parameters and Example Process to understand the flow of this operator.

When this operator is executed, the table delivered by the query will be copied into the memory of your computer. This will give all subsequent operators a fast access on the data. Even learning schemes like the Support Vector Machine with their high number of random accesses will run fast.

The java ResultSetMetaData interface does not provide information about the possible values of nominal attributes. The internal indices the nominal values are mapped to, will depend on the ordering they appear in the table. This may cause problems only when processes are split up into a training process and a testing process. This is not a problem for learning schemes which are capable of handling nominal attributes. If a learning scheme like the SVM is used with nominal data, RapidMiner pretends that nominal attributes are numerical and uses indices for the nominal values as their numerical value. The SVM may perform well if there are only two possible values. If a test set is read in another process, the nominal values may be assigned different indices, and hence the SVM trained is useless. This is not a problem for the label attributes, since the classes can be specified using the classes parameter and hence all learning schemes intended to use with nominal data are safe to use. You might avoid this problem if you first combine both ExampleSets using the Append operator and then split it again using two Filter Examples operators.

Differentiation

Execute SQL

The Read Database operator is used for loading data from a database into RapidMiner. The Execute SQL operator cannot be used for loading data from databases. It can be used for executing SQL statements like CREATE or ADD etc on the database.

Input

  • connection (Connection)

    This input port expects a Connection object if any. See the parameter connection entry for more information.

Output

  • output (IOObject)

    This port delivers the result of the query on database in tabular form along with the meta data. This output is similar to the output of the Retrieve operator.

  • connection (Connection)

    This output port delivers the Connection object from the input port. If the input port is not connected the port delivers nothing.

Parameters

  • define_connectionThis parameter indicates how the database connection should be specified. It gives you four options: repository, predefined, url and jndi. The parameter is not visible if the connection input port is connected. Range: selection
  • connection_entryThis parameter is only available when the define connection parameter is set to repository. This parameter is used to specify a repository location that represents a connection entry. The connection can also be provided using the connection input port. Range: string
  • connectionThis parameter is only available when the define connection parameter is set to predefined. This parameter is used to connect to the database using a predefined connection. You can have many predefined connections. You can choose one of them using the drop down box. You can add a new connection or modify previous connections using the button next to the drop down box. You may also accomplish this by clicking on the Manage Database Connections... from the Tools menu in the main window. A new window appears. This window asks for several details e.g. Host, Port, Database system, schema, username and password. The Test button in this new window will allow you to check whether the connection can be made. Save the connection once the test is successful. After saving a new connection, it can be chosen from the drop down box of the connection parameter. You need to have basic understanding of databases for configuring a connection. Range: string
  • database_systemThis parameter is only available when the define connection parameter is set to url. This parameter is used to select the database system in use. It can have one of the following values: MySQL, PostgreSQL, Sybase, HSQLDB, ODBC Bridge (e.g. Access), Microsoft SQL Server (JTDS), Ingres, Oracle. Range: selection
  • database_urlThis parameter is only available when the define connection parameter is set to url. This parameter is used to define the URL connection string for the database, e.g. 'jdbc:mysql://foo.bar:portnr/database'. Range: string
  • usernameThis parameter is only available when the define connection parameter is set to url. This parameter is used to specify the username of the database. Range: string
  • passwordThis parameter is only available when the define connection parameter is set to url. This parameter is used to specify the password of the database. Range: string
  • jndi_nameThis parameter is only available when the define connection parameter is set to jndi. This parameter is used to give the JNDI a name for a data source. Range: string
  • define_queryQuery is a statement that is used to select required data from the database. This parameter specifies whether the database query should be defined directly, through a file or implicitly by a given table name. The SQL query can be auto generated giving a table name, passed to RapidMiner via a parameter or, in case of long SQL statements, in a separate file. The desired behavior can be chosen using the define query parameter. Please note that column names are often case sensitive and might need quoting. Range: selection
  • queryThis parameter is only available when the define query parameter is set to query. This parameter is used to define the SQL query to select desired data from the specified database. Range: string
  • query_fileThis parameter is only available when the define query parameter is set to query file. This parameter is used to select a file that contains the SQL query to select desired data from the specified database. Long queries are usually stored in files. Storing queries in files can also enhance reusability. Range: filename
  • table_nameThis parameter is only available when the define query parameter is set to table name. This parameter is used to select the required table from the specified database. Range: string
  • prepare_statementIf checked, the statement is prepared, and '?' can be filled in using the parameters parameter. Range: boolean
  • parametersParameters to insert into '?' placeholders when statement is prepared. Range: enumeration

Tutorial Processes

Reading ExampleSet from a mySQL database

The Read Database operator is used to read a mySQL database. The define connection parameter is set to predefined. The define connection parameter was configured using the button next to the drop down box. The name of the connection was set to 'mySQLconn'. The following values were set in the connection parameter's wizard. The Database system was set to 'mySQL'. The Host was set to 'localhost'. The Port was set to '3306'. The Database scheme was set to 'golf'; this is the name of the database. The User was set to 'root'. No password was provided. You will need a password if your database is password protected. Set all the values and test the connection. Make sure that the connection works.

The define query parameter was set to 'table name'. The table name parameter was set to 'golf_table' which is the name of the required table in the 'golf' database. Run the process, you will see the entire 'golf_table' in the Results Workspace. The define query parameter is set to 'table name' if you want to read an entire table from the database. You can also read a selected portion of the database by using queries. Set the define query parameter to 'query' and specify a query in the query parameter. One sample query is already defined in this example. This query reads only those examples from 'golf_table' where the 'Outlook' attribute has the value 'sunny'.