RapidMiner Studio documentation, version 8.2
Read Sparse (Advanced File Connectors)
Synopsis
This operator is used for reading files written in sparse formats.
This operator reads sparse format files. The lines of a sparse file have the form:
label index:value index:value index:value...
Here, index may be an integer (starting with 1) for the regular attributes or one of the prefixes specified by the prefix map parameter. The following formats are supported:
- xy format: The label is the last token in each line.
- yx format: The label is the first token in each line.
- prefix format: The label is prefixed by 'l:'.
- separate file format: The label is read from a separate file specified by the label file parameter.
- no label: The ExampleSet is unlabeled.
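The label positions described above can be illustrated with a minimal Python sketch. It is not RapidMiner's implementation: the function name `parse_sparse_line` and the format names passed as `fmt` are hypothetical, and the sketch assumes that attributes not listed on a line take the value 0.0.

```python
def parse_sparse_line(line, fmt, dimension):
    """Parse one sparse line into (label, dense_values).

    fmt is one of "xy", "yx", "prefix", "no_label" (hypothetical names).
    Indices are 1-based; unlisted attributes are assumed to be 0.0.
    """
    tokens = line.split()
    label = None
    if fmt == "xy":            # label is the last token
        label, tokens = tokens[-1], tokens[:-1]
    elif fmt == "yx":          # label is the first token
        label, tokens = tokens[0], tokens[1:]
    elif fmt == "prefix":      # label is the token prefixed by 'l:'
        rest = []
        for token in tokens:
            if token.startswith("l:"):
                label = token[2:]
            else:
                rest.append(token)
        tokens = rest
    elif fmt != "no_label":
        raise ValueError(f"unknown format: {fmt}")

    values = [0.0] * dimension
    for token in tokens:
        index, value = token.split(":", 1)
        values[int(index) - 1] = float(value)  # convert 1-based index to 0-based
    return label, values


print(parse_sparse_line("1:0.5 4:2 yes", "xy", 4))
# ('yes', [0.5, 0.0, 0.0, 2.0])
```

The separate file format is not covered here; in that case the labels would be read from a second file, one per example.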
- output (Data Table)
This port delivers the contents of the file in tabular form along with the meta data. This output is similar to the output of the Retrieve operator.
- format: This parameter specifies the format of the sparse data file. Range: selection
- attribute_description_file: The name of the attribute description file is specified here. An attribute description file (extension: .aml) is required to retrieve the meta data of the ExampleSet. This file is a simple XML document defining the properties of the attributes (like their name and range) and their source files. The data may be spread over several files. Because this file also contains the names of the files to read the data from, the actual data files do not have to be specified as a parameter of this operator. Range: filename
- data_file: This parameter specifies the name of the data file. It is necessary if the data file is not specified in the attribute description file. Range: filename
- label_file: This parameter specifies the name of the file containing the labels. It is necessary if the format parameter is set to 'format separate file'. Range: filename
- dimension: This parameter specifies the dimension of the example space. It is necessary if the attribute description file parameter is not set. Range: integer
- sample_size: This parameter specifies the maximum number of examples that should be read. If it is set to -1, all examples are read. Range: integer
- use_quotes: This parameter indicates whether quotes should be recognized. If it is set to true, the quotes character parameter can be used to specify the quotes character. Range: boolean
- quotes_character: This parameter defines the quotes character. Range: char
- datamanagement: This parameter determines how the data is represented internally. This is an expert parameter. Several options are available; users can choose any of them. Range: selection
- decimal_point_character: This character is used as the decimal character. Range: string
- prefix_map: This parameter maps prefixes to names of special attributes. Range: list
- encoding: This is an expert parameter. A long list of encodings is provided; users can select any one of them. Range: selection
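To make the interplay of a few of these parameters more concrete, here is a small Python sketch of how sample_size, prefix_map and decimal_point_character might affect reading. It only illustrates the semantics described above and is not the operator's actual code: the function `read_sparse`, its return structure, and the example prefixes 'l' and 'w' are assumptions made for this illustration.

```python
from itertools import islice


def read_sparse(lines, dimension, sample_size=-1, prefix_map=None,
                decimal_point_character="."):
    """Read sparse lines into a list of {"regular": [...], "special": {...}}.

    Hypothetical helper: tokens whose key appears in prefix_map become
    special attributes (kept as strings); all other keys are treated as
    1-based indices of regular attributes.
    """
    prefix_map = prefix_map or {}
    # sample_size == -1 means "read everything"
    limit = None if sample_size == -1 else sample_size
    non_blank = (line for line in lines if line.strip())

    examples = []
    for line in islice(non_blank, limit):
        regular = [0.0] * dimension  # assumption: unlisted attributes are 0.0
        special = {}
        for token in line.split():
            key, raw = token.split(":", 1)
            if key in prefix_map:
                special[prefix_map[key]] = raw
            else:
                number = float(raw.replace(decimal_point_character, "."))
                regular[int(key) - 1] = number
        examples.append({"regular": regular, "special": special})
    return examples


rows = read_sparse(
    ["l:yes 1:0,5 w:2", "l:no 3:1,25", "l:yes 2:4"],
    dimension=3,
    sample_size=2,                              # only the first two examples
    prefix_map={"l": "label", "w": "weight"},   # prefix -> special attribute
    decimal_point_character=",",
)
for row in rows:
    print(row)
```

With sample_size set to 2, the third line is never parsed; setting it to -1 would return all three examples.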
Writing and Reading a sparse file
This Example Process shows how the Write AML operator can be used for writing a sparse file and how the Read Sparse operator can be used for reading a sparse file. The 'Golf' data set is loaded using the Retrieve operator. This ExampleSet is provided as input to the Write AML operator.
The example set file parameter is set to 'D:\golf_data', so a file named 'golf_data' is created (if it does not already exist) on the 'D' drive of your computer. This file holds the instances of the ExampleSet. The attribute description file parameter is set to 'D:\golf_att', so a file named 'golf_att' is created (if it does not already exist) on the same drive. This file holds the meta data of the ExampleSet. You can open either written file and make changes to it if required. The format parameter is set to 'sparse_xy' to write the file in xy sparse format.
The Read Sparse operator is applied next to read the ExampleSet from the files. The attribute description file and data file parameters are set to 'D:\golf_att' and 'D:\golf_data' respectively. The format parameter is set to 'xy' because the file was written in xy format. All other parameters are used with their default values. The resultant ExampleSet can be seen in the Results Workspace.
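The write-then-read round trip of this process can be mimicked outside RapidMiner with a short Python sketch. It uses a hypothetical toy data set instead of the 'Golf' data and a temporary file instead of the 'D' drive; the helpers `write_sparse_xy` and `read_sparse_xy` are illustrative assumptions, not the behaviour of the Write AML or Read Sparse operators, and no attribute description file is produced.

```python
import os
import tempfile


def write_sparse_xy(path, rows, labels):
    """Write dense rows in xy sparse form: index:value pairs, label last."""
    with open(path, "w", encoding="utf-8") as handle:
        for row, label in zip(rows, labels):
            # Only non-zero values are written; indices start with 1.
            pairs = [f"{i}:{v}" for i, v in enumerate(row, start=1) if v != 0]
            handle.write(" ".join(pairs + [label]) + "\n")


def read_sparse_xy(path, dimension):
    """Read an xy sparse file back into dense rows and labels."""
    rows, labels = [], []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            tokens = line.split()
            if not tokens:
                continue
            labels.append(tokens[-1])          # xy format: label is last
            row = [0.0] * dimension
            for token in tokens[:-1]:
                index, value = token.split(":", 1)
                row[int(index) - 1] = float(value)
            rows.append(row)
    return rows, labels


# Hypothetical toy data with three numeric attributes.
rows = [[85.0, 0.0, 1.0], [0.0, 90.0, 0.0]]
labels = ["no", "yes"]

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "toy_data")
    write_sparse_xy(path, rows, labels)
    restored = read_sparse_xy(path, dimension=3)

print(restored == (rows, labels))
```

As in the process above, the reader has to be told the format and the dimension, because the sparse file itself carries neither.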