Process Documents from Files (Text Processing)

Synopsis

Generates word vectors from a text collection stored in multiple files.
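
As a rough illustration of the idea outside RapidMiner, the Python sketch below reads text files from labeled directories and builds one term-count vector per document. It is not the operator's implementation: the function names `load_documents` and `build_word_vectors`, the whitespace tokenizer, the UTF-8 encoding, and the use of raw counts are all assumptions made for the example.

```python
import fnmatch
from collections import Counter
from pathlib import Path


def load_documents(text_directories, file_pattern="*"):
    """Return (label, text) pairs for files whose names match file_pattern.

    text_directories maps a class label to a directory path; this is a
    hypothetical stand-in for the operator's text_directories list.
    """
    documents = []
    for label, directory in text_directories.items():
        for path in sorted(Path(directory).iterdir()):
            # fnmatch supports the usual ? and * wildcards
            if path.is_file() and fnmatch.fnmatch(path.name, file_pattern):
                documents.append((label, path.read_text(encoding="utf-8")))
    return documents


def build_word_vectors(documents):
    """Build a sorted word list and one term-count vector per document."""
    # Assumption: lowercase + whitespace split stands in for real tokenization
    counts = [Counter(text.lower().split()) for _, text in documents]
    word_list = sorted(set().union(*counts))
    vectors = [[c[word] for word in word_list] for c in counts]
    return word_list, vectors
```

Each vector has one entry per word in the word list, so the two return values loosely correspond to the word list and example set outputs described below.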

Input

word list
The word list port.

Output

example set (Data table)
The example set port.
word list
The word list port.

Parameters

text_directories
A list of directories to read. All files matching the given file pattern are loaded and assigned the class value specified for their directory.
file_pattern
A pattern for the names of the files to read. The usual wildcards ? and * are supported.
extract_text_only
If checked, structural information such as XML or HTML tags is ignored and discarded.
use_file_extension_as_type
If checked, the type of each file is determined by its extension. Files with unknown extensions are treated as text files.
content_type
The content type of the input texts.
encoding
The encoding used for reading or writing files.
create_word_vector
If checked, the tokens of each document are used to generate a vector that represents the document numerically.
vector_creation
Selects the schema for creating the word vector.
add_meta_information
If checked, available meta information about the text, such as file name and date, is added as attributes.
keep_text
If checked, the input text is stored as a special String attribute with the role text.
prune_method
Specifies whether too frequent or too infrequent words are ignored when building the word list, and how the frequency thresholds are specified.
prune_below_percent
Ignore words that appear in less than this percentage of all documents.
prune_above_percent
Ignore words that appear in more than this percentage of all documents.
prune_below_absolute
Ignore words that appear in fewer than this many documents.
prune_above_absolute
Ignore words that appear in more than this many documents.
prune_below_rank
Words are ordered by frequency, and words with a frequency lower than that of the rank given by this percentage are pruned.
prune_above_rank
Words are ordered by frequency, and words with a frequency higher than that of the rank given by this percentage are pruned.
datamanagement
Determines how the data is represented internally.
parallelize_vector_creation
Determines whether vector creation is executed in parallel.
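
The percentual pruning parameters can be read as document-frequency thresholds. The Python sketch below is one hedged interpretation, not RapidMiner's code: `prune_by_document_frequency` is a hypothetical helper, and treating both bounds as exclusive ("less than" / "more than") follows the wording of the parameter descriptions rather than verified operator behavior.

```python
from collections import Counter


def prune_by_document_frequency(tokenized_docs, below_percent=None, above_percent=None):
    """Return the sorted words that survive percentual pruning.

    A word is dropped if it appears in less than below_percent or in more
    than above_percent of all documents (both given on a 0-100 scale).
    """
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each word once per document

    kept = []
    for word, df in doc_freq.items():
        share = 100.0 * df / n_docs
        if below_percent is not None and share < below_percent:
            continue  # too infrequent
        if above_percent is not None and share > above_percent:
            continue  # too frequent
        kept.append(word)
    return sorted(kept)
```

Under the same assumptions, an absolute variant would compare `df` directly against the document counts given by prune_below_absolute and prune_above_absolute instead of computing a percentage.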
