RapidMiner Studio documentation, version 8.0
Read XML (Advanced File Connectors)
Synopsis
This operator reads an XML file.
Description
This operator reads XML files in which each example is represented by an element that matches a given XPath expression. The features are the attributes and text content of that element and its sub-elements.
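The mapping from matched elements to examples can be sketched in a few lines of Python. This is only an illustration of the idea, not the operator's implementation: the sample document, the function name `xml_to_rows` and the XPath `./customer` are assumptions made for this example, and only direct sub-elements are read to keep it short.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; each <customer> element is one example.
XML_TEXT = """
<customers>
  <customer id="1" country="DE"><name>Ada</name><age>36</age></customer>
  <customer id="2" country="US"><name>Bob</name><age>41</age></customer>
</customers>
"""


def xml_to_rows(xml_text, example_xpath):
    """Return one dict per element matched by example_xpath."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for element in root.findall(example_xpath):
        # XML attributes of the matched element become features.
        row = dict(element.attrib)
        # Text content of each direct sub-element becomes a feature too.
        for child in element:
            row[child.tag] = (child.text or "").strip()
        rows.append(row)
    return rows


rows = xml_to_rows(XML_TEXT, "./customer")
for row in rows:
    print(row)
```

All values are still strings at this point. Note that `xml.etree.ElementTree` supports only a subset of XPath, so more complex expressions would need a different library.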
This operator tries to determine an appropriate type for each attribute by reading the first few elements and checking the values that occur. If all values are integers, the attribute becomes integer; if real numbers occur, it becomes real. Columns containing values that cannot be interpreted as numbers become nominal, unless they match the date and time pattern of the date format parameter. In that case the column is automatically parsed as a date and the corresponding feature is of type date.
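The type-guessing order described above (integer, then real, then date, otherwise nominal) can be approximated as follows. This is a rough sketch under stated assumptions: the function name `guess_type`, the sample size of 100 and the Python `strptime` pattern are all hypothetical, and RapidMiner's own date format strings follow a different pattern syntax than Python's.

```python
from datetime import datetime


def _all_parse(values, parser):
    """Return True if parser accepts every value without a ValueError."""
    for value in values:
        try:
            parser(value)
        except ValueError:
            return False
    return True


def guess_type(values, date_format="%Y-%m-%d", sample_size=100):
    """Guess a column type from the first sample_size string values."""
    sample = values[:sample_size]
    if not sample:
        return "nominal"  # nothing to inspect; assumption for this sketch
    if _all_parse(sample, int):
        return "integer"
    if _all_parse(sample, float):
        return "real"
    if _all_parse(sample, lambda v: datetime.strptime(v, date_format)):
        return "date"
    return "nominal"


print(guess_type(["1", "2", "3"]))
print(guess_type(["1", "2.5"]))
print(guess_type(["2020-01-31", "2021-12-01"]))
print(guess_type(["red", "7"]))
```

Because only a sample is inspected, a later value may still fail to match the guessed type, which is the situation the read_not_matching_values_as_missings parameter addresses.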
Input
- file
An XML file is expected as a file object, which can be created by operators that have file output ports, such as the Read File operator.
Output
- output (Data Table)
This port delivers the XML file in tabular form along with the meta data. This output is similar to the output of the Retrieve operator.
Parameters
- parse_numbers
Specifies whether numbers are parsed. Range: boolean
- decimal_character
This character is used as the decimal character. Range: char
- grouped_digits
This option decides whether grouped digits should be parsed. If it is set to true, the grouping character parameter should be specified. Range: boolean
- grouping_character
This character is used as the grouping character. If it is found between digits, the digits are combined and the character is ignored. For example, if "22-14" is present in the XML file and "-" is set as the grouping character, then "2214" will be stored. Range: char
- date_format
The date and time format is specified here. Many predefined options exist; users can also specify a new format. If the text in a column matches this date format, that column is automatically converted to date type. Some corrections are made automatically in date values. For example, a value '32-March' will automatically be converted to '1-April'. Columns containing values that cannot be interpreted as numbers are read as nominal, unless they match the date and time pattern of this parameter. In that case the column is automatically parsed as a date and the corresponding attribute is of date type. Range: string
- first_row_as_names
If this option is set to true, the first row of the data read from the file is assumed to contain the attribute names. The attributes are then named automatically and the first row is not treated as a data row. Range: boolean
- annotations
If the first row as names parameter is not set to true, annotations can be added using the 'Edit List' button of this parameter, which opens a new menu. This menu allows you to select any row and assign an annotation to it. Name, Comment and Unit annotations can be assigned. Assigning a Name annotation to row 0 is equivalent to setting the first row as names parameter to true. To ignore rows, annotate them as Comment. Note that row numbers in this menu do not count commented lines. Range: menu
- time_zone
This is an expert parameter. A long list of time zones is provided; users can select any of them. Range: selection
- locale
This is an expert parameter. A long list of locales is provided; users can select any of them. Range: selection
- read_all_values_as_polynominal
This option allows you to disable the type handling for this operator. Every XPath entry will be read as a polynominal attribute. Range: boolean
- data_set_meta_data_information
This option allows you to adjust the meta data of the imported data. Column index, name, type and role can be specified here. The operator tries to determine an appropriate type for each attribute by reading the first few elements and checking the values that occur. If all values are integers, the attribute becomes integer. Similarly, if all values are real numbers, the attribute becomes real. Columns containing values that cannot be interpreted as numbers are read as nominal, unless they match the date and time pattern of the date format parameter. In that case the column is automatically parsed as a date and the corresponding attribute is of type date. Automatically determined types can be overridden using this parameter. Range: menu
- read_not_matching_values_as_missings
If this parameter is set to true, values that do not match the expected value type are treated as missing values and are replaced by '?'. For example, if 'abc' appears in an integer column, it is treated as a missing value. A question mark (?) in the XML file is also read as a missing value. Range: boolean
- datamanagement
This is an expert parameter. A long list is provided; users can select any option from this list. Range: selection
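The interaction of the decimal character, grouping character and missing-value parameters, as well as the date rollover mentioned for date_format, can be illustrated with a small Python sketch. The function names `parse_number` and `lenient_date`, the use of `None` to stand for a missing value, and the explicit year argument are assumptions made for this example; the operator's actual behavior may differ in detail.

```python
from datetime import date, timedelta


def parse_number(text, decimal_character=".", grouping_character=None,
                 not_matching_as_missing=True):
    """Parse text as a number; return None for a missing value."""
    cleaned = text.strip()
    if cleaned == "?":
        return None  # a question mark is read as a missing value
    if grouping_character:
        # Grouping characters are dropped so the digits are combined.
        cleaned = cleaned.replace(grouping_character, "")
    # Normalize the decimal character to the one Python expects.
    cleaned = cleaned.replace(decimal_character, ".")
    try:
        return float(cleaned)
    except ValueError:
        if not_matching_as_missing:
            return None  # value does not match the expected type
        raise


def lenient_date(year, month, day):
    """Roll an out-of-range day over into the following month(s)."""
    return date(year, month, 1) + timedelta(days=day - 1)


print(parse_number("22-14", grouping_character="-"))
print(parse_number("1.234,5", decimal_character=",", grouping_character="."))
print(parse_number("abc"))
print(lenient_date(2021, 3, 32))
```

The grouping character is removed before the decimal character is normalized, so a format such as "1.234,5" (grouping ".", decimal ",") is read correctly. As the "22-14" example suggests, choosing "-" as the grouping character would also strip the sign from negative numbers in this sketch.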