Read CSV (RapidMiner Studio Core)
Synopsis
This Operator reads an ExampleSet from the specified CSV file.Description
CSV is an abbreviation for Comma-Separated Values. The CSV files store data (both numerical and text) in plain-text form. All values corresponding to an Example are stored as one line in the CSV file. Values for different Attributes are separated by a separator character. The separator remains constant. Each row in the file uses the constant separator for separating Attribute values. The term 'CSV' suggests that the Attribute values would be separated by commas, but other separators can also be used.
The easiest way to import a CSV file is to use the Import Configuration Wizard from the Parameters panel. All parameters can also directly be set in the Parameters panel. For more details about the Operator, see the description of the parameters.
Please make sure that the CSV file is read correctly as an ExampleSet before building a Process that uses it.
Differentiation
There are many Read <source> Operators in the Data Access group and Files/Read sub-group. For example, there is Read Excel, Read URL, Read SPSS, Read XML and other Operators, which can read ExampleSet from different file formats.
Input
- file (File)
A CSV file can be optionally passed in as a file object. This can be created with Operators having file output ports such as the Read File Operator.
Output
- output (Data Table)
This port delivers the ExampleSet created from the CSV file provided at the input port, imported through the Import Configuration Wizard or loaded from the path given to the csv file parameter.
Parameters
- Import_Configuration_Wizard
This user-friendly wizard guides you to easily configure this Operator to import the CSV file.
Range: - csv_file
The path of the CSV file is specified here. It can also be selected using the 'Choose a file' button.
Range: - column_separators
Column separators for CSV files can be specified here. It can also be provided as a regular expression. A good understanding of regular expressions can be developed by studying the description of Select Attributes Operator and its tutorial Processes.
Range: - trim_lines
This parameter indicates if lines should be trimmed (removal of empty spaces at the beginning and the end) before the column split is performed. This option might be problematic if TABs ('\t') are used as separators.
Range: - multiline_text
Indicates if text values may span multiple lines. If checked, line breaks enclosed by quotes are treated as part of the data.
Range: - use_quotes
This parameter indicates if quotes should be regarded. Quotes can be used to store special characters like column separators. For example if (,) is set as column separator and (") is set as quotes character, then a row (a,b,c,d) will be translated as 4 values for 4 columns. On the other hand ("a,b,c,d") will be translated as a single column value a,b,c,d. If this parameter is set to false, the quotes character parameter and the escape character parameter cannot be defined.
Range: - quotes_character
This parameter defines the quotes character and is only available if use quotes is set to true.
Range: - escape_character
This parameter specifies the character used to escape the quotes and is only available if use quotes is set to true. For example, if (") is used as quotes character and ('\') is used as escape character, then ("yes") will be translated as (yes) and (\"yes\") will be translated as ("yes").
Range: - skip_comments
This parameter is used to ignore comments in the CSV file (if any). If this option is set to true, a comment character should be defined using the comment characters parameter.
Range: - comment_characters
This parameter is available if comment characters is set to true. Lines beginning with these characters are ignored. If this character is present in the middle of the line, anything that comes in that line after this character is ignored. The comment character itself is also ignored.
Range: - starting_row
This parameter defines which row is the first to be read (everything before will be skipped). This index is one based and includes the header row into the count if the option is set.
Range: - parse_numbers
This parameter specifies whether numbers are parsed or not.
Range: - decimal_character
This character is used as the decimal character.
Range: - grouped_digits
This parameter decides whether grouped digits should be parsed or not. If this parameter is set to true, a grouping character parameter has to be specified.
Range: - grouping_character
This character is used as the grouping character. If this character is found between numbers, the numbers are combined and this character is ignored. For example if "22-14" is present in the CSV file and "-" is set as the grouping character, then "2214" will be stored.
Range: - infinity_string
This parameter can be set to parse a specific infinity representation (e.g. "Infinity"). If it is not set, the local specific infinity representation will be used.
Range: string - date_format
The parameter specifies the date and time format. Many predefined options exist but users can also specify a new format. If text in a CSV file column matches this date format, that column is automatically converted to date type.
Some corrections are automatically made on invalid date values. For example, a value '32-March' will automatically be converted to '1-April'.
Columns containing values which cannot be interpreted as numbers will be interpreted as nominal, as long as they do not match the date and time pattern of the date format parameter. If they match, this column of the CSV file will be automatically parsed as date and the corresponding Attribute will be of type date.
Range: - use_header_row
If this parameter is set to true, the header row parameter needs to specify the row with the names of the Attributes. If this is activated no annotations can be used.
Range: boolean - header_row
Sets the index of the row that should be used for Attribute names. The Attributes names are determined by the content of this row. As a consequence, data will only be read below this row. The parameter data set meta data information overwrites the data from the header row if it is set.
Range: - annotations
If the use header row is not set to true, annotations can be added using the 'Edit List' button of this parameter, which opens a new menu. This menu allows you to select any row and assign an annotation to it. Name, Comment and Unit annotations can be assigned. If row 0 is assigned a Name annotation, it is equivalent to setting the first row as names parameter to true. If you want to ignore any row, you can annotate them as Comment. Remember that row number in this menu does not count commented lines.
Range: - time_zone
Users can select any time zone from the list of provided time zones.
Range: - locale
Users can select any locale from the list of provided locales.
Range: - encoding
Users can select any encoding from the list of provided encodings.
Range: - read_all_values_as_polynominal
This option allows you to disable the type handling for this operator. Every column will be read as a polynominal attribute.
Range: - data_set_meta_data_information
This parameter allows to adjust or override the meta data of the CSV file. Column index, name, type and role can be specified here.
The Read CSV Operator automatically tries to determine an appropriate data type of the Attributes by reading the first few lines and checking the occurring values. Integer values are assigned the integer data type, real values the real data type. Values which cannot be interpreted as numbers are assigned the nominal data type, as long as they do not match the format of the date format parameter.
With the data set meta data information parameter, this automatic assignment can be adjusted or overwritten. This parameter also overwrites the information from the header row.
Range: - read_not_matching_values_as_missings
If this parameter is set to true, values that do not match with the expected value type are considered as missing values and are replaced by '?'. For example, if 'abc' is written in an integer column, it will be treated as a missing value. A question mark (?) in the CSV file is also read as a missing value.
Range: - data_management
This parameter determines how the data is represented internally. Users can select any option from the provided list.
Range:
Tutorial Processes
Read a CSV file
(Optional) Save the following text in a text file:
att1,att2,att3,att4 # row 1
80.6, yes , 1996.JAN.21 ,22-14 # row 2
12.43,"yes",1997.MAR.30,23-22 # row 3
13.5,\"no\",1998.AUG.22,23-14 # row 4
23.3,yes,1876.JAN.32,42-65# row 5
21.6,yes,2001.JUL.12,xyz # row 6
12.56,",_?",2002.SEP.18,15-90# row 7
This is a sample CSV file.
(Optional) You can load this with the given tutorial process by providing its path in the csv file parameter or by using the 'Choose a file' button.
Run the Process and compare the results in the Results view with the CSV file. The Process performs the following actions:
'#' is defined as a comment character so 'row {number}' is ignored in all rows. As the first row as names parameter is set to true, att1, att2, att3 and att4 are set as Attribute names. The Attribute att1 is set as real , att2 as polynominal, att3 as date and att4 as real. For Attribute att4, the '-' character is ignored in all rows because the grouped digits parameter is set to true and '-' is specified as the grouping character. In row 2, the white spaces at the start and end of values are ignored because trim lines parameter is set to true. In row 3, quotes are not ignored because use quotes is set to true, the content inside the quotes is taken as the value for Attribute att2. In row 4, (\"no\") is taken as a (no) in quotes, cause the escape character is set to '\'. In row 5, the date value is automatically corrected from 'JAN.32' to 'Feb.1'. In row 6, an invalid real value for the Attribute att4 is replaced by '?' because the read not matching values as missings parameter is set to true. In row 7, quotes are used to retrieve special characters as values including the column separator (,) and a question mark.