You are viewing the RapidMiner Studio documentation for version 10.1.

Statistics (Model Simulator)

Synopsis

This operator calculates some basic statistics and distributions for all columns of the input data.

Description

This operator provides basic statistics such as the average, standard deviation, and value counts. It also provides quality measurements for all Attributes of your data set. You may want to discard data columns (Attributes) that provide little value. But how do you know which Attributes are valuable and which are worthless? This operator provides the quality measurements known from Auto Model as a summary table for your input data. They include:

  • ID-ness: columns where nearly all values are different,
  • Stability: columns where nearly all values are identical,
  • Missing: columns with missing values,
  • Text-ness: columns which look like they contain free text.

Here are more details about the calculations for those measurements:

  • ID-ness (I): measures the degree to which this Attribute resembles an ID. The number of different values for the Attribute divided by the number of data rows.
  • Stability (S): measures how stable or constant this column is. The number of rows with the most frequent non-missing value divided by the total number of data rows with non-missing values.
  • Missing (M): the number of missing values in this column as a fraction of the total number of data rows.
  • Text-ness (T): this is the average of the ID-ness, the fraction of cells containing token limiters, and a length-based score of the cell contents.
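
The first three definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the operator's actual implementation: the function name `quality_measures` is hypothetical, missing values are represented as `None`, and ID-ness is assumed to count only distinct non-missing values. Text-ness is left out because the length-based score is not specified here.

```python
from collections import Counter


def quality_measures(values):
    """Compute ID-ness, Stability and Missing for one column.

    `values` is a list of cell values; None marks a missing value.
    Assumption: ID-ness counts distinct non-missing values only.
    """
    n_rows = len(values)
    if n_rows == 0:
        raise ValueError("column is empty")

    present = [v for v in values if v is not None]
    counts = Counter(present)

    # ID-ness: number of different values divided by the number of rows.
    id_ness = len(counts) / n_rows

    # Stability: rows with the most frequent non-missing value divided by
    # the number of rows with non-missing values (0.0 if all are missing).
    stability = counts.most_common(1)[0][1] / len(present) if present else 0.0

    # Missing: missing values as a fraction of all rows.
    missing = (n_rows - len(present)) / n_rows

    return {"id_ness": id_ness, "stability": stability, "missing": missing}


if __name__ == "__main__":
    # Four rows: two distinct values, one missing cell.
    print(quality_measures(["a", "a", "b", None]))
```

For the sample column, ID-ness is 2/4, Stability is 2/3, and Missing is 1/4.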

In general, you should prefer Attributes with low values for Missing, Stability, and ID-ness. In some cases, you may want to keep text columns as well.
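
As a rough sketch of how this guidance could be applied, the snippet below filters columns by their quality measurements. The cut-off values, the function name `columns_to_keep`, the column names, and the sample numbers are all made up for illustration; none of them come from the operator or from Auto Model.

```python
# Hypothetical cut-offs, chosen only for illustration; tune them for your data.
MAX_ID_NESS = 0.9
MAX_STABILITY = 0.9
MAX_MISSING = 0.5


def columns_to_keep(measures, text_columns=()):
    """Return the names of columns whose measurements are all below the cut-offs.

    `measures` maps a column name to a dict with the keys
    "id_ness", "stability" and "missing" (fractions between 0 and 1).
    Columns listed in `text_columns` are kept regardless of their values.
    """
    keep = []
    for name, m in measures.items():
        if name in text_columns:
            keep.append(name)
        elif (
            m["id_ness"] <= MAX_ID_NESS
            and m["stability"] <= MAX_STABILITY
            and m["missing"] <= MAX_MISSING
        ):
            keep.append(name)
    return keep


if __name__ == "__main__":
    # Made-up measurements for four illustrative columns.
    sample = {
        "col_id": {"id_ness": 1.0, "stability": 0.01, "missing": 0.0},
        "col_constant": {"id_ness": 0.01, "stability": 0.99, "missing": 0.0},
        "col_sparse": {"id_ness": 0.2, "stability": 0.3, "missing": 0.8},
        "col_useful": {"id_ness": 0.1, "stability": 0.4, "missing": 0.05},
    }
    print(columns_to_keep(sample))
```

Passing a column name in `text_columns` keeps it even when its ID-ness is high, which mirrors the note about keeping text columns in some cases.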

Input

  • example set (Data Table)

    This input port expects a data set for which the statistics and quality measurements will be calculated.

Output

  • statistics

    This port delivers all statistics.

  • example set (Data Table)

    This port delivers the unchanged input data.

Tutorial Processes

Statistics for Titanic

This process calculates all statistics for the Titanic data set. The results show that the passenger name and the ticket number have high values for ID-ness. We can also see that the cabin information is missing in most cases.