Quality Measures (Model Simulator)

Synopsis

This operator calculates quality measures for the columns of the input data. These measures can help you decide which attributes to keep for modeling.

Description

This operator provides quality measurements for all attributes of your data set. You may want to consider discarding data columns (Attributes) that provide less value. But how do you know which Attributes are valuable, and which are worthless? This operator provides the quality measurements known from Auto Model as a summary table for your input data. They include:

  • ID-ness: columns where nearly all values are different,
  • Stability: columns where nearly all values are identical,
  • Missing: columns with missing values,
  • Text-ness: columns which look like they contain free text.

Here are more details about the calculations for those measurements:

  • ID-ness (I): measures the degree to which this Attribute resembles an ID. It is the number of different values for the Attribute divided by the number of data rows.
  • Stability (S): measures how stable or constant this column is. It is the number of rows with the most frequent non-missing value divided by the total number of data rows with non-missing values.
  • Missing (M): the number of missing values in this column as a fraction of the total number of data rows.
  • Text-ness (T): this is the average of the ID-ness, the fraction of cells containing token limiters, and a length-based score of the cell contents.
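
The first three definitions above can be sketched in a few lines of Python. This is only an illustration of the formulas as stated, not the operator's actual implementation. The function name `quality_measures` and the sample columns are hypothetical. The sketch assumes that missing values are marked with `None`, that missing values do not count as a distinct value for ID-ness, and that a column with only missing values gets a Stability of 0. Text-ness is left out because its length-based score is not specified here.

```python
from collections import Counter


def quality_measures(values):
    """Return ID-ness, Stability, and Missing for one column.

    `values` is a list of cell values; None marks a missing value.
    """
    n_rows = len(values)
    if n_rows == 0:
        raise ValueError("column has no rows")

    present = [v for v in values if v is not None]
    counts = Counter(present)

    # ID-ness: number of different values divided by the number of rows.
    # Assumption: missing values are not counted as a distinct value.
    id_ness = len(counts) / n_rows

    # Stability: rows with the most frequent non-missing value divided by
    # the rows with non-missing values (assumed 0.0 if all are missing).
    stability = counts.most_common(1)[0][1] / len(present) if present else 0.0

    # Missing: missing values as a fraction of all rows.
    missing = (n_rows - len(present)) / n_rows

    return {"id_ness": id_ness, "stability": stability, "missing": missing}


# Made-up sample columns, for illustration only.
ticket_like = ["A1", "B2", "C3", "D4"]      # every value differs
cabin_like = ["C85", None, None, "C85"]     # half missing, otherwise constant

print(quality_measures(ticket_like))
print(quality_measures(cabin_like))
```

In this sketch the first column gets an ID-ness of 1.0 because every value is different, while the second column gets a Stability of 1.0 and a Missing value of 0.5.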

In general, you should prefer Attributes with low values for Missing, Stability, and ID-ness. In some cases, you may want to keep text columns as well.

Input

  • example set (Data Table)

    This input port expects a data set for which the quality measures will be calculated.

Output

  • example set (Data Table)

    This port delivers a data set with the calculated quality measures.

Tutorial Processes

Quality Measures for Titanic

This process calculates the four quality measurements ID-ness, Stability, Missing, and Text-ness for the Titanic data set. The passenger name and the ticket number have high values for ID-ness, and the names also somewhat resemble text columns. The cabin information is missing in most cases.