Generate TFIDF (RapidMiner Studio Core)

Synopsis

This operator calculates TF-IDF values for the given ExampleSet. TF-IDF is a numerical statistic that reflects how important a word is to a document.

Description

The Generate TFIDF operator generates TF-IDF values from the given ExampleSet. The ExampleSet must contain either the binary occurrences (which are normalized when the term frequency, TF, is calculated) or the already calculated term frequency values (in which case no normalization is done). This behavior is selected with the calculate term frequencies parameter.

TF-IDF (term frequency–inverse document frequency) is a numerical statistic that reflects how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining. The TF-IDF value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to control for the fact that some words are generally more common than others.
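The weighting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the operator's actual implementation: it assumes the common textbook variant in which TF is a term's count divided by the total count in its document and IDF is the natural logarithm of (number of documents / number of documents containing the term). The function name tf_idf and the sample data are hypothetical.

```python
import math


def tf_idf(counts):
    """Return TF-IDF weights for a small corpus.

    counts: dict mapping a document name to a dict of term -> raw count.
    Assumption: tf = count / total count in the document,
                idf = ln(number of documents / documents containing the term).
    """
    n_docs = len(counts)

    # Document frequency: in how many documents does each term occur?
    df = {}
    for terms in counts.values():
        for term, count in terms.items():
            if count > 0:
                df[term] = df.get(term, 0) + 1

    weights = {}
    for doc, terms in counts.items():
        total = sum(terms.values())
        weights[doc] = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in terms.items()
            if count > 0
        }
    return weights


# Hypothetical sample corpus: "the" occurs in every document.
corpus = {
    "Doc1": {"data": 2, "the": 2},
    "Doc2": {"the": 3, "mining": 1},
    "Doc3": {"the": 1},
}
weights = tf_idf(corpus)
```

In this sketch a term that occurs in every document, such as "the", receives a weight of zero because its IDF is ln(1), while "data", which occurs only in Doc1, keeps a positive weight. That is the offsetting effect the paragraph above describes.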

Input

  • example set input (IOObject)

    This input port expects an ExampleSet. It is the output of the Read CSV operator in the attached Example Process.

Output

  • example set output (IOObject)

    The TF-IDF is calculated and the resultant ExampleSet is returned through this port.

  • original (IOObject)

    The ExampleSet that was given as input is passed to the output through this port without any changes. This is usually used to reuse the same ExampleSet in further operators or to view the ExampleSet in the Results Workspace.

Parameters

  • calculate_term_frequencies

    This parameter indicates if term frequency values should be generated. This parameter must be set to true if the input data is given as simple occurrence counts. Range: boolean
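The effect of this parameter can be illustrated with a short Python sketch. It is a hedged illustration only: the helper name to_term_frequencies is hypothetical, and the normalization shown (dividing each count by the sum of the counts) is an assumption, since this page does not state which normalization the operator applies.

```python
def to_term_frequencies(values, calculate_term_frequencies=True):
    """Mimic the parameter's two modes for one document's values.

    values: occurrence counts (or ready-made term frequencies) of the terms.
    Assumption: normalization divides each count by the sum of all counts.
    """
    if not calculate_term_frequencies:
        # Input already holds term frequencies: pass it through unchanged.
        return list(values)

    total = sum(values)
    if total == 0:
        # An empty document has no term frequencies to compute.
        return [0.0 for _ in values]
    return [value / total for value in values]


# Occurrence counts are normalized when the flag is true ...
from_counts = to_term_frequencies([2, 1, 1], calculate_term_frequencies=True)

# ... while precomputed frequencies are left as they are when it is false.
from_frequencies = to_term_frequencies([0.5, 0.25, 0.25], calculate_term_frequencies=False)
```

Both calls yield the same term frequencies, which shows why the flag has to match the kind of data in the input ExampleSet: normalizing values that are already frequencies, or skipping normalization for raw counts, would distort the resulting TF-IDF values.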

Tutorial Processes

Introduction to the Generate TFIDF operator

This Example Process starts with a Subprocess operator which generates a sample ExampleSet. A breakpoint is inserted here so that you can have a look at the ExampleSet. This is a very simple ExampleSet. It has a text attribute that contains different words. There are three integer attributes named Doc1, Doc2 and Doc3 that hold the count of the corresponding words in these documents. The Generate TFIDF operator is applied on this ExampleSet to calculate the TF-IDF values. The resultant ExampleSet can be seen in the Results Workspace.