Process Documents from Files (Text Processing)

Synopsis

Generates word vectors from a text collection stored in multiple files.
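
As a rough illustration of what a word vector is, the Python sketch below builds a word list and one term-occurrence vector per document. It is not the operator's implementation: the names `tokenize` and `build_word_vectors`, the letters-only tokenizer, and the raw-count weighting are all assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase the text and keep runs of ASCII letters.
    return re.findall(r"[a-z]+", text.lower())


def build_word_vectors(documents):
    """Return (word_list, vectors) for a dict mapping document name to text."""
    counts = {name: Counter(tokenize(text)) for name, text in documents.items()}
    # The word list is the sorted vocabulary across all documents.
    word_list = sorted({word for c in counts.values() for word in c})
    # Each vector holds the occurrence count of every word in the word list.
    vectors = {name: [c[word] for word in word_list] for name, c in counts.items()}
    return word_list, vectors


docs = {"a.txt": "the cat sat", "b.txt": "The dog sat and sat"}
word_list, vectors = build_word_vectors(docs)
print(word_list)
print(vectors)
```

Each document becomes one row whose columns follow the shared word list, which mirrors the relationship between the example set and word list outputs.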

Input

word list
The word list port.

Output

example set (Data table)
The example set port.
word list
The word list port.

Parameters

text directories
A list of arbitrary directories. All files matching the given file ending are loaded and assigned the class value provided with their directory.
file pattern
A pattern for the files to be read. The usual wildcards ? and * are supported.
extract text only
If checked, structural information such as XML or HTML tags is ignored and discarded.
use file extension as type
If checked, the type of each file is determined by its extension. Files with unknown extensions are treated as text files.
content type
The content type of the input texts.
encoding
The encoding used for reading or writing files.
create word vector
If checked, the tokens of a document are used to generate a vector that represents the document numerically.
vector creation
Selects the schema for creating the word vector.
add meta information
If checked, available meta information about the text, such as filename and date, is added as attributes.
keep text
If checked, the input text is stored as a special String attribute with the role text.
prune method
Specifies whether too frequent or too infrequent words should be ignored when building the word list, and how the frequencies are specified.
prune below percent
Ignore words that appear in less than this percentage of all documents.
prune above percent
Ignore words that appear in more than this percentage of all documents.
prune below absolute
Ignore words that appear in fewer than this many documents.
prune above absolute
Ignore words that appear in more than this many documents.
prune below rank
Words are ordered by frequency, and words with a frequency lower than the frequency at the rank given by this percentage are pruned.
prune above rank
Words are ordered by frequency, and words with a frequency higher than the frequency at the rank given by this percentage are pruned.
datamanagement
Determines how the data is represented internally.
parallelize vector creation
Determines whether the execution of vector creation should be parallelized.
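
The percentage-based pruning parameters can be illustrated with a short Python sketch. It is a sketch under stated assumptions, not the operator's code: the function name `prune_by_percent` is hypothetical, and treating both bounds as inclusive is a choice made here for illustration.

```python
def prune_by_percent(doc_tokens, below_percent, above_percent):
    """Keep words whose document frequency, in percent, lies within the bounds.

    doc_tokens is a list of token lists, one per document.
    Assumption: both bounds are inclusive.
    """
    n_docs = len(doc_tokens)
    doc_freq = {}
    for tokens in doc_tokens:
        for word in set(tokens):  # count each word at most once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1

    kept = []
    for word, df in doc_freq.items():
        percent = 100.0 * df / n_docs
        if below_percent <= percent <= above_percent:
            kept.append(word)
    return sorted(kept)


docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
    ["the", "cat", "ran"],
]
# "the" appears in 100% of documents and is pruned as too frequent;
# words appearing in only 25% of documents are pruned as too infrequent.
print(prune_by_percent(docs, below_percent=30.0, above_percent=90.0))
```

The absolute variants would follow the same pattern, comparing the raw document count `df` against the thresholds instead of the percentage.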
