Read CSV (RapidMiner Studio Core)

Synopsis

This operator is used to read CSV files.

Description

CSV is an abbreviation for Comma-Separated Values. The CSV files store data (both numerical and text) in plain-text form. CSV files have all values of an example in one line. Values for different attributes are separated by a constant separator. It may have many rows. Each row uses a constant separator for separating attribute values. CSV name suggests that the attributes values would be separated by commas, but other separators can also be used.

For complete understanding of this operator read the parameters section thoroughly. The easiest and shortest way to import a CSV file is to use the import configuration wizard from the Parameters panel. The best way, which may require some extra effort, is to first set all the parameters in the Parameters panel and then use the wizard. Please make sure that the CSV file is read correctly before building a process using it.

Input

  • file (File)

    A CSV file is expected as a file object which can be created with other operators with file output ports like the Read File operator.

Output

  • output (Data Table)

    this port delivers the CSV file in tabular form along with the meta data. This output is similar to the output of the Retrieve operator.

Parameters

  • Import Configuration WizardThis option allows you to configure this operator by means of a wizard. This user-friendly wizard makes the use of this operator easy. Range: menu
  • csv_fileThe path of the CSV file is specified here. It can be selected using the choose a file button. Range: string
  • column_separatorsColumn separators for CSV files can be specified here in a regular expression format. A good understanding of regular expression can be developed from studying the Select Attributes operator's description and Example Processes. Range: string
  • trim_linesThis option indicates if lines should be trimmed (empty spaces are removed at the beginning and the end) before the column split is performed. This option might be problematic if TABs are used as separators. Range: boolean
  • use_quotesThis option indicates if quotes should be regarded. Quotes can be used to store special characters like column separators. For example if (,) is set as column separator and (") is set as quotes character. (a,b,c,d) will be translated as 4 values for 4 columns. On the other hand ("a,b,c,d") will be translated as a single column value a,b,c,d. If this option is set to false, the quotes character parameter and the escape character parameter for quotes cannot be defined. Range: boolean
  • quotes_character This option defines the quotes character. Range: char
  • escape_character_for_quotesThis is the character that is used to escape quotes. For example if (") is used as quotes character and (\)is used as escape character. ("yes") will be translated as (yes) and (\"yes\") will be translated as ("yes"). Range: char
  • skip_commentsThe skip comments option is used to ignore comments in the CSV file. This is only useful if the CSV file has comments. If this option is set to true, a comment character should be defined using the comment characters parameter. Range: boolean
  • comment_charactersLines beginning with these characters are ignored. If this character is present in the middle of the line, anything that comes in that line after this character is ignored. Remember that the comment character itself is also ignored. Range: string
  • parse_numbersSpecifies whether numbers are parsed or not. Range: boolean
  • decimal_characterThis character is used as the decimal character. Range: char
  • grouped_digitsThis option decides whether grouped digits should be parsed or not. If this option is set to true, a grouping character parameter should be specified. Range: boolean
  • grouping_characterThis character is used as the grouping character. If this character is found between numbers, the numbers are combined and this character is ignored. For example if "22-14" is present in the CSV file and "-" is set as grouping character, then "2214" will be stored. Range: char
  • date_formatThe date and time format is specified here. Many predefined options exist; users can also specify a new format. If text in a CSV file column matches this date format, that column is automatically converted to date type. Some corrections are automatically made in date type values. For example a value '32-March' will automatically be converted to '1-April'. Columns containing values which can't be interpreted as numbers will be interpreted as nominal, as long as they don't match the date and time pattern of the date format parameter. If they do, this column of the CSV file will be automatically parsed as date and the according attribute will be of date type. Range: string
  • first_row_as_namesIf this option is set to true, it is assumed that the first line of the CSV file has the names of the attributes. Then the attributes are automatically named and first line of the CSV file is not treated as a data line. Range: boolean
  • annotationsIf first row as names is not set to true, annotations can be added using the 'Edit List' button of this parameter which opens a new menu. This menu allows you to select any row and assign an annotation to it. Name, Comment and Unit annotations can be assigned. If row 0 is assigned a Name annotation, it is equivalent to setting the first row as names parameter to true. If you want to ignore any rows you can annotate them as Comment. Remember row number in this menu does not count commented lines. Range: menu
  • time_zoneThis is an expert parameter. A long list of time zones is provided; users can select any of them. Range: selection
  • localeThis is an expert parameter. A long list of locales is provided; users can select any of them. Range: selection
  • encodingThis is an expert parameter. A long list of encodings is provided; users can select any of them. Range: selection
  • data_set_meta_data_informationThis option is an important one. It allows you to adjust the meta data of the CSV file. Column index, name, type and role can be specified here. The Read CSV operator tries to determine an appropriate type of the attributes by reading the first few lines and checking the occurring values. If all values are integers, the attribute will become an integer. Similarly if all values are real numbers, the attribute will become of type real. Columns containing values which can't be interpreted as numbers will be interpreted as nominal, as long as they don't match the date and time pattern of the date format parameter. If they do, this column of the CSV file will be automatically parsed as date and the according attribute will be of type date. Automatically determined types can be overridden using this parameter. Range: menu
  • read_not_matching_values_as_missingsIf this value is set to true, values that do not match with the expected value type are considered as missing values and are replaced by '?'. For example if 'abc' is written in an integer column, it will be treated as a missing value. A question mark (?) in the CSV file is also read as a missing value. Range: boolean
  • data_managementThis is an expert parameter. A long list is provided; users can select any option from this list. Range: selection

Tutorial Processes

Reading a CSV file

Save the following text in a text file and load it with the given Read CSV Example Process. Run the process and compare the results in the Results Workspace (data view) with the CSV file.

att1,att2,att3,att4 # row 1

80.6, yes , 1996.JAN.21 ,22-14 # row 2

12.43,"yes",1997.MAR.30,23-22 # row 3

13.5,\"no\",1998.AUG.22,23-14 # row 4

23.3,yes,1876.JAN.32,42-65# row 5

21.6,yes,2001.JUL.12,xyz # row 6

12.56,",_?",2002.SEP.18,15-90# row 7

Here is some explanation of what happens in this process: '#' is defined as comment character so 'row no.' is ignored in all rows. As the first row as names parameter is set to true att1,att2,att3 and att4 are set as names of attributes att1 is set as real , att2 as polynomial, att3 as date and att4 as real in attribute att4 ,'-' are ignored because the grouped digits parameter is set to true and '-' is the grouping character In row 2 the white spaces at start and at end of values are ignored because trim lines parameter is set to true. In row 3 quotes are used but they are ignored because the escape character is not used. In row 4 the escape quote is used, so quotes are not ignored. In row 5 the date value is automatically corrected, 'jan.32' is changed to 'feb.1'. In row 6 an invalid real value in forth column is replaced by '?' because the read not matching values as missings parameter is set to true. In row 7 quotes are used to store special characters including the column separator and a question mark.