Process Documents (Text Processing)
Synopsis
Generates word vectors from a text object.
Description
This operator takes a single TextObject as input and generates a term vector from it. The resulting example set therefore consists of exactly one example, which makes the operator especially useful for applying a model to a single text. Because the SingleTextInputOperator also provides a parameter for specifying the text directly, this operator is the more appropriate choice when it is used from a program, where a TextObject can simply be constructed and passed to the process.
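As a rough illustration of why a single text yields a single example, the following Python sketch builds one term vector from one text against a fixed word list. It is not the operator's implementation: the function name `term_vector`, the lowercase whitespace tokenization, and the use of raw term counts as vector values are all assumptions made for this example.

```python
from collections import Counter


def term_vector(text, word_list):
    """Return one row of term counts for `text`, one column per word in `word_list`."""
    # Assumed tokenization: lowercase the text and split on whitespace.
    counts = Counter(text.lower().split())
    # Tokens outside the word list are ignored; listed words that are absent count as 0.
    return [counts[word] for word in word_list]


word_list = ["data", "mining", "text"]
row = term_vector("Text mining is data mining on text", word_list)
print(row)  # [1, 2, 2]
```

One input text produces exactly one row, with one column per entry in the word list, which mirrors the single-example result described above.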
Input
- word list
The word list port.
- documents (Collection)
The documents port.
Output
- example set (Data table)
The example set port.
- word list
The word list port.
Parameters
- create word vector
If checked, the tokens of a document will be used to generate a vector numerically representing the document.
- vector creation
Select the schema for creating the word vector.
- add meta information
If checked, available meta information about the text, such as filename and date, is added as attributes.
- keep text
If checked, the input text will be stored as a special String attribute with the role text.
- prune method
Specifies whether too frequent or too infrequent words should be ignored when building the word list, and how the frequencies are specified.
- prune below percent
Ignore words that appear in less than this percentage of all documents.
- prune above percent
Ignore words that appear in more than this percentage of all documents.
- prune below absolute
Ignore words that appear in fewer than this many documents.
- prune above absolute
Ignore words that appear in more than this many documents.
- prune below rank
Words are ordered by frequency, and words with a frequency lower than the frequency at the rank given by this percentage will be pruned.
- prune above rank
Words are ordered by frequency, and words with a frequency higher than the frequency at the rank given by this percentage will be pruned.
- datamanagement
Determines how the data is represented internally.
- parallelize vector creation
Determines whether the execution of vector creation should be parallelized.
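
The percentage-based pruning parameters can be pictured with a small Python sketch. It is a minimal illustration under stated assumptions, not the operator's actual code: the function name `prune_word_list` is hypothetical, documents are assumed to be already tokenized, and the thresholds are treated as inclusive bounds, which the description above does not specify.

```python
def prune_word_list(documents, below_percent, above_percent):
    """Keep words whose document share lies within [below_percent, above_percent]."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each word at least once.
    doc_freq = {}
    for tokens in documents:
        for word in set(tokens):
            doc_freq[word] = doc_freq.get(word, 0) + 1

    kept = []
    for word, df in doc_freq.items():
        share = 100.0 * df / n_docs
        # Drop words that appear in less than below_percent
        # or more than above_percent of all documents (bounds assumed inclusive).
        if below_percent <= share <= above_percent:
            kept.append(word)
    return sorted(kept)


docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
    ["the", "cat", "flew"],
]
print(prune_word_list(docs, below_percent=30.0, above_percent=90.0))
# ['cat', 'flew', 'sat']
```

In this toy corpus, "the" appears in every document and is pruned as too frequent, while "dog" and "bird" appear in only a quarter of the documents and are pruned as too infrequent. The absolute variants would compare the raw document count `df` against the thresholds instead of the percentage share.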