Categories

Versions

Read Excel (AI Studio Core)

Synopsis

This operator reads an ExampleSet from the specified Excel file.

Description

This operator can be used to load data from Microsoft Excel spreadsheets. This operator is able to read data from Excel 95, 97, 2000, XP, and 2003. The user has to define which of the spreadsheets in the workbook should be used as data table. The table must have a format such that each row is an example and each column represents an attribute. Please note that the first row of the Excel sheet might be used for attribute names which can be indicated by a parameter. The data table can be placed anywhere on the sheet and can contain arbitrary formatting instructions, empty rows and empty columns. Missing data values in Excel should be indicated by empty cells or by cells containing only "?".

For complete understanding of this operator read the parameters section. The easiest and shortest way to import an Excel file is to use the import configuration wizard from the Parameters panel. The best way, which may require some extra effort, is to first set all the parameters in the Parameters panel and then use the wizard. Please make sure that the Excel file is read correctly before building a process using it.

Input

  • file (File)

    An Excel file is expected as a file object which can be created with other operators with file output ports like the Read File operator.

Output

  • output (Data table)

    This port delivers the Excel file in tabular form along with the meta data. This output is similar to the output of the Retrieve operator.

Parameters

  • import_configuration_wizardThis option allows you to configure this operator by means of a wizard. This user-friendly wizard makes the use of this operator easy. Range:
  • excel_fileThe path of the Excel file is specified here. It can be selected using the choose a file button. Range:
  • sheet_selectionThis option allows you to change the sheet selection between sheet number and sheet name. Range:
  • sheet_numberThe number of the sheet which you want to import should be specified here. Range: integer
  • sheet_nameThe name of the sheet which you want to import should be specified here. Range: string
  • imported_cell_rangeThe range of cells to be imported from the specified sheet is given here. It is specified in 'xm:yn' format where 'x' is the column of the first cell of range, 'm' is the row of the first cell of range, 'y' is the column of the last cell of range, 'n' is the row of the last cell of range. 'A1:E10' will select all cells of the first five columns from row 1 to 10. Range:
  • use_header_row

    If this parameter is set to true, the header row parameter needs to specify the row with the names of the Attributes. If this is activated no annotations can be used.

    Range: boolean
  • header_row

    Sets the index of the row that should be used for Attribute names. The Attributes names are determined by the content of this row. As a consequence, data will only be read below this row. The parameter data set meta data information overwrites the data from the header row if it is set.

    Range:
  • annotationsIf the first row as names parameter is not set to true, annotations can be added using the 'Edit List' button of this parameter which opens a new menu. This menu allows you to select any row and assign an annotation to it. Name, Comment and Unit annotations can be assigned. If row 0 is assigned Name annotation, it is equivalent to setting the first row as names parameter to true. If you want to ignore any rows you can annotate them as Comment. Range:
  • date_formatThe date and time format is specified here. Many predefined options exist; users can also specify a new format. If text in an Excel file column matches this date format, that column is automatically converted to date type. Some corrections are automatically made in the date type values. For example a value '32-March' will automatically be converted to '1-April'. Columns containing values which can't be interpreted as numbers will be interpreted as nominal, as long as they don't match the date and time pattern of the date format parameter. If they do, this column of the Excel file will be automatically parsed as date and the according attribute will be of date type. Range:
  • time_zone

    This is an expert parameter. A list of time zones is provided; users can select any of them. Dates (from Excel date columns or text converted dates, see below) are read with the time zone specified in this parameter.

    If text in an Excel file column matches the format provided in the date format parameter, that column is automatically converted to date type (see date format parameter info).

    Range:
  • localeThis is an expert parameter. A long list of locales is provided; users can select any of them. Range:
  • read_all_values_as_polynominalThis option allows you to disable the type handling for this operator. Every column will be read as a polynominal attribute. To parse an Excel date afterwards, use 'date_parse(86400000 * (parse(date_attribute) - 25569))' (- 24107 for Mac Excel 2007) in the Generate Attributes operator. Range: boolean
  • data_set_meta_data_informationThis option is an important one. It allows you to adjust the meta data of the ExampleSet created from the specified Excel file. Column index, name, type and role can be specified here. The Read Excel operator tries to determine an appropriate type of the attributes by reading the first few lines and checking the occurring values. If all values are integers, the attribute will become an integer. Similarly if all values are real numbers, the attribute will become of type real. Columns containing values which can't be interpreted as numbers will be interpreted as nominal, as long as they don't match the date and time pattern of the date format parameter. If they do, this column of the Excel file will be automatically parsed as date and the according attribute will be of type date. Automatically determined types can be overridden using this parameter. Range:
  • read_not_matching_values_as_missingsIf this value is set to true, values that do not match with the expected value type are considered as missing values and are replaced by '?'. For example if 'abc' is written in an integer column, it will be treated as a missing value. A question mark (?) or an empty cell in the Excel file is also read as a missing value. Range: boolean
  • data_management This is an expert parameter. A long list is provided; users can select any option from this list. Range:

Tutorial Processes

Reading an ExampleSet from an Excel file

In this Example Process we first store the 'Golf' data set under '%{tempdir}/golf.xlsx'. The data set was copied on sheet 1 of the Excel file thus the sheet number parameter is given value 1. The first cell of the sheet is A1 and last required cell is E15, thus the imported cell range parameter is provided value 'A1:E15'. As the first row of the sheet contains names of attributes, the first row as names parameter is checked. The remaining parameters were used with default values. Run the process, you will see almost the same results as you would have gotten from using the Retrieve operator to retrieve the 'Golf' data set from the Repository. You will see a difference in the meta data though, for example here the types and roles of attributes are different from those in the 'Golf' data set. You can change the role and type of attributes using the data set meta data information parameter. It is always good to make sure that all attributes are of desired role and type. In this example one important change that you would like to make is to change the role of the Play attribute. Its role should be changed to label if you want to use any classification operators on this data set.