Read Excel (RapidMiner Studio Core)

Synopsis

This operator reads an ExampleSet from the specified Excel file.

Description

This operator can be used to load data from Microsoft Excel spreadsheets. This operator is able to read data from Excel 95, 97, 2000, XP, and 2003. The user has to define which of the spreadsheets in the workbook should be used as data table. The table must have a format such that each row is an example and each column represents an attribute. Please note that the first row of the Excel sheet might be used for attribute names which can be indicated by a parameter. The data table can be placed anywhere on the sheet and can contain arbitrary formatting instructions, empty rows and empty columns. Missing data values in Excel should be indicated by empty cells or by cells containing only "?".

For complete understanding of this operator read the parameters section. The easiest and shortest way to import an Excel file is to use the import configuration wizard from the Parameters panel. The best way, which may require some extra effort, is to first set all the parameters in the Parameters panel and then use the wizard. Please make sure that the Excel file is read correctly before building a process using it.

Input

file (File)
An Excel file is expected as a file object which can be created with other operators with file output ports like the Read File operator.

Output

output (IOObject)
This port delivers the Excel file in tabular form along with the meta data. This output is similar to the output of the Retrieve operator.

Parameters

import_configuration_wizardThis option allows you to configure this operator by means of a wizard. This user-friendly wizard makes the use of this operator easy. Range:
excel_fileThe path of the Excel file is specified here. It can be selected using the choose a file button. Range:
sheet_selectionThis option allows you to change the sheet selection between sheet number and sheet name. Range:
sheet_numberThe number of the sheet which you want to import should be specified here. Range: integer
sheet_nameThe name of the sheet which you want to import should be specified here. Range: string
imported_cell_rangeThis is a mandatory parameter. The range of cells to be imported from the specified sheet is given here. It is specified in 'xm:yn' format where 'x' is the column of the first cell of range, 'm' is the row of the first cell of range, 'y' is the column of the last cell of range, 'n' is the row of the last cell of range. 'A1:E10' will select all cells of the first five columns from row 1 to 10. Range:
first_row_as_namesIf this option is set to true, it is assumed that the first line of the Excel file has the names of attributes. Then the attributes are automatically named and the first line of Excel file is not treated as a data line. Range: boolean
annotationsIf the first row as names parameter is not set to true, annotations can be added using the 'Edit List' button of this parameter which opens a new menu. This menu allows you to select any row and assign an annotation to it. Name, Comment and Unit annotations can be assigned. If row 0 is assigned Name annotation, it is equivalent to setting the first row as names parameter to true. If you want to ignore any rows you can annotate them as Comment. Range:
date_formatThe date and time format is specified here. Many predefined options exist; users can also specify a new format. If text in an Excel file column matches this date format, that column is automatically converted to date type. Some corrections are automatically made in the date type values. For example a value '32-March' will automatically be converted to '1-April'. Columns containing values which can't be interpreted as numbers will be interpreted as nominal, as long as they don't match the date and time pattern of the date format parameter. If they do, this column of the Excel file will be automatically parsed as date and the according attribute will be of date type. Range:
time_zoneThis is an expert parameter. A long list of time zones is provided; users can select any of them. Range:
localeThis is an expert parameter. A long list of locales is provided; users can select any of them. Range:
read_all_values_as_polynominalThis option allows you to disable the type handling for this operator. Every column will be read as a polynominal attribute. To parse an excel date afterwards use 'date_parse(86400000 * (parse(date_attribute) - 25569))' (- 24107 for Mac Excel 2007) in the Generate Attributes operator. Range: boolean
data_set_meta_data_informationThis option is an important one. It allows you to adjust the meta data of the ExampleSet created from the specified Excel file. Column index, name, type and role can be specified here. The Read Excel operator tries to determine an appropriate type of the attributes by reading the first few lines and checking the occurring values. If all values are integers, the attribute will become an integer. Similarly if all values are real numbers, the attribute will become of type real. Columns containing values which can't be interpreted as numbers will be interpreted as nominal, as long as they don't match the date and time pattern of the date format parameter. If they do, this column of the Excel file will be automatically parsed as date and the according attribute will be of type date. Automatically determined types can be overridden using this parameter. Range:
read_not_matching_values_as_missingsIf this value is set to true, values that do not match with the expected value type are considered as missing values and are replaced by '?'. For example if 'abc' is written in an integer column, it will be treated as a missing value. A question mark (?) or an empty cell in the Excel file is also read as a missing value. Range: boolean
data_management This is an expert parameter. A long list is provided; users can select any option from this list. Range:

Tutorial Processes

Reading an ExampleSet from an Excel file

In this Example Process we first store the 'Golf' data set under '%{tempdir}/golf.xlsx'. The data set was copied on sheet 1 of the Excel file thus the sheet number parameter is given value 1. The first cell of the sheet is A1 and last required cell is E15, thus the imported cell range parameter is provided value 'A1:E15'. As the first row of the sheet contains names of attributes, the first row as names parameter is checked. The remaining parameters were used with default values. Run the process, you will see almost the same results as you would have gotten from using the Retrieve operator to retrieve the 'Golf' data set from the Repository. You will see a difference in the meta data though, for example here the types and roles of attributes are different from those in the 'Golf' data set. You can change the role and type of attributes using the data set meta data information parameter. It is always good to make sure that all attributes are of desired role and type. In this example one important change that you would like to make is to change the role of the Play attribute. Its role should be changed to label if you want to use any classification operators on this data set.

Categories

Versions