Read XML (Advanced File Connectors)
Synopsis
This operator is used for reading an XML file.
Description
This operator can read XML files, where examples are represented by elements which match a given XPath and features are attributes and text-content of each element and its sub-elements.
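The mapping from matched elements to examples can be illustrated outside the operator. The following is a minimal sketch in Python, not the operator's implementation: the sample document, the `read_rows` helper and its arguments are all hypothetical, and the standard library's `xml.etree.ElementTree` supports only a limited subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; element and attribute names are made up.
XML_TEXT = """
<catalog>
  <book id="b1"><title>Dune</title><price>9.99</price></book>
  <book id="b2"><title>Emma</title><price>4.50</price></book>
</catalog>
""".strip()


def read_rows(xml_text, example_path, attribute_paths):
    """Return one dict per element that matches example_path."""
    root = ET.fromstring(xml_text)
    rows = []
    for element in root.findall(example_path):
        row = {}
        for name, path in attribute_paths.items():
            if path.startswith("@"):
                # An XML attribute of the matched element itself.
                row[name] = element.get(path[1:])
            else:
                # Text content of a sub-element (None if it is absent).
                row[name] = element.findtext(path)
        rows.append(row)
    return rows


rows = read_rows(
    XML_TEXT,
    "./book",  # plays the role of the XPath for examples
    {"id": "@id", "title": "title", "price": "price"},  # one path per feature
)
```

Each `book` element becomes one row, and every value is still a string at this point; type handling is a separate step.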
This operator tries to determine an appropriate type for each attribute by reading the first few elements and checking the occurring values. If all values are integers, the attribute becomes integer; if real numbers occur, it becomes real. Columns containing values that can't be interpreted as numbers become nominal, as long as they don't match the date and time pattern of the date format parameter. If they do, the attribute is automatically parsed as a date and the corresponding feature is of type date.
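The type-guessing rules described above can be sketched as follows. This is an illustration under stated assumptions, not the operator's actual logic: `guess_type`, `sample_size` and the default `date_format` are hypothetical, and the format string uses Python's `strptime` codes rather than the operator's date pattern syntax.

```python
from datetime import datetime


def guess_type(values, date_format="%Y-%m-%d", sample_size=10):
    """Guess a column type from the first few string values."""
    sample = values[:sample_size]  # only the first few elements are inspected

    def all_parse(parser):
        # True if every sampled value is accepted by the parser.
        for value in sample:
            try:
                parser(value)
            except ValueError:
                return False
        return True

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "real"
    if all_parse(lambda value: datetime.strptime(value, date_format)):
        return "date"
    return "nominal"
```

Because only a sample is inspected, a later value that does not fit the guessed type would need separate handling, which is what the read not matching values as missings parameter addresses.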
Input
- file
An XML file is expected as a file object which can be created with other operators with file output ports like the Read File operator.
Output
- output (Data table)
This port delivers the XML file in tabular form along with the meta data. This output is similar to the output of the Retrieve operator.
Parameters
- parse numbers
Specifies whether numbers are parsed or not.
- decimal character
This character is used as the decimal character.
- grouped digits
This option decides whether grouped digits should be parsed or not. If this option is set to true, a grouping character parameter should be specified.
- grouping character
This character is used as the grouping character. If this character is found between numbers, the numbers are combined and this character is ignored. For example, if "22-14" is present in the XML file and "-" is set as the grouping character, then "2214" will be stored.
- date format
The date and time format is specified here. Many predefined options exist; users can also specify a new format. If the text of a column in the XML file matches this date format, that column is automatically converted to date type. Some corrections are made automatically in date type values. For example, a value '32-March' will automatically be converted to '1-April'. Columns containing values that can't be interpreted as numbers will be interpreted as nominal, as long as they don't match the date and time pattern of the date format parameter. If they do, the column will be automatically parsed as date and the corresponding attribute will be of date type.
- annotations
If first row as names is not set to true, annotations can be added using the 'Edit List' button of this parameter, which opens a new menu. This menu allows you to select any row and assign an annotation to it. Name, Comment and Unit annotations can be assigned. Assigning a Name annotation to row 0 is equivalent to setting the first row as names parameter to true. If you want to ignore any rows, you can annotate them as Comment. Note that row numbers in this menu do not count commented lines.
- time zone
This is an expert parameter. A long list of time zones is provided; users can select any of them.
- locale
This is an expert parameter. A long list of locales is provided; users can select any of them.
- read all values as polynominal
This option allows you to disable the type handling for this operator. Every XPath entry will be read as a polynominal attribute.
- data set meta data information
This option is an important one. It allows you to adjust the meta data of the XML file. Column index, name, type and role can be specified here. This operator tries to determine an appropriate type for each attribute by reading the first few elements and checking the occurring values. If all values are integers, the attribute will become an integer. Similarly, if all values are real numbers, the attribute will be of type real. Columns containing values that can't be interpreted as numbers will be interpreted as nominal, as long as they don't match the date and time pattern of the date format parameter. If they do, the column will be automatically parsed as date and the corresponding attribute will be of type date. Automatically determined types can be overridden using this parameter.
- read not matching values as missings
If this value is set to true, values that do not match the expected value type are considered missing values and are replaced by '?'. For example, if 'abc' is written in an integer column, it will be treated as a missing value. A question mark (?) in the XML file is also read as a missing value.
- file
Name of the file to read the data from.
- xpath for examples
The matches of this XPath expression will form the examples. Each match becomes one example whose attribute values are extracted from the matching part of the XML file.
- xpaths for attributes
These XPath expressions will be evaluated for each match of the XPath expression for examples to derive values for attributes. Each expression forms one attribute in the resulting ExampleSet.
- allow ancestors in xpaths
If not checked, XPaths using parents and ancestors in the xpaths for attributes will lead to an empty result. Checking this might slow down the operator.
- use namespaces
If not checked, namespaces in the XML document will be completely ignored. This might make formulating XPath expressions easier, but elements with the same name might collide if separated by namespace.
- namespaces
Specifies pairs of identifier and namespace for use in XPath queries. The namespace for (X)HTML is bound automatically to the identifier h.
- use default namespace
If checked, you can specify a namespace URI that will be used when no namespace is specified in the XPath expression.
- default namespace
This is the default namespace that will be assumed for all elements in the XPath expression that have no explicit namespace mentioned.
- infinity representation
This parameter can be set to parse a specific infinity representation (e.g. "Infinity"). If it is not set, the locale-specific infinity representation will be used.
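The interplay of the decimal character, grouping character and read not matching values as missings parameters can be sketched in Python. This is only an illustration of the documented behaviour: the function `parse_number` and its argument names are hypothetical, and `None` stands in for the missing value that the operator shows as '?'.

```python
def parse_number(text, decimal_char=".", grouping_char=None,
                 mismatch_as_missing=True):
    """Parse one text value into a float, or None for a missing value."""
    text = text.strip()
    if text == "?":
        # A literal question mark is read as a missing value.
        return None
    cleaned = text
    if grouping_char:
        # Grouping characters between digits are simply dropped.
        cleaned = cleaned.replace(grouping_char, "")
    if decimal_char != ".":
        # Normalise the decimal character to the one float() expects.
        cleaned = cleaned.replace(decimal_char, ".")
    try:
        return float(cleaned)
    except ValueError:
        if mismatch_as_missing:
            return None  # non-matching value becomes missing
        raise


grouped = parse_number("22-14", grouping_char="-")
european = parse_number("1.234,5", decimal_char=",", grouping_char=".")
mismatch = parse_number("abc")
```

The first call mirrors the "22-14" example from the grouping character parameter. The sketch makes no attempt to tell a grouping "-" from a minus sign, which a real parser would have to handle.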