Process Documents from Files (Text Processing)

Synopsis

Generates word vectors from a text collection stored in multiple files.
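
As a rough illustration of the idea outside RapidMiner, the following Python sketch reads a set of text files, builds a word list, and produces one vector of term occurrences per file. The function names, the letters-only tokenizer, and the choice of raw term occurrences as the vector values are assumptions made for this example; they are not the operator's implementation.

```python
import re
from collections import Counter
from pathlib import Path


def tokenize(text):
    # Assumption: a token is a lowercase run of ASCII letters.
    return re.findall(r"[a-z]+", text.lower())


def word_vectors_from_files(paths, encoding="utf-8"):
    """Return (word_list, vectors) for the given text files.

    word_list is the sorted list of all tokens seen in any file.
    vectors holds one list of term occurrences per file, aligned
    with word_list.
    """
    counts = [
        Counter(tokenize(Path(p).read_text(encoding=encoding)))
        for p in paths
    ]
    word_list = sorted(set().union(*counts))
    # Counter returns 0 for words that do not occur in a document.
    vectors = [[c[word] for word in word_list] for c in counts]
    return word_list, vectors
```

Each row of `vectors` corresponds to one file and each column to one entry of `word_list`, which mirrors the two outputs of the operator (the example set and the word list).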

Input

  • word list

    The word list port.

Output

  • example set (Data table)

    The example set port.

  • word list

    The word list port.

Parameters

  • text directories

    In this list, arbitrary directories can be specified. All files matching the given file pattern are loaded and assigned the class value provided with the directory.

  • file pattern

    A pattern for the files to be read. The usual wildcards ? and * are supported.

  • extract text only

    If checked, structural information such as XML or HTML tags is ignored and discarded.

  • use file extension as type

    If checked, the type of each file is determined by its extension. Unknown extensions are treated as text files.

  • content type

    The content type of the input texts.

  • encoding

    The encoding used for reading or writing files.

  • create word vector

    If checked, the tokens of a document are used to generate a vector that represents the document numerically.

  • vector creation

    Selects the schema for creating the word vector.

  • add meta information

    If checked, available meta information about the text, such as file name and date, is added as attributes.

  • keep text

    If checked, the input text is stored as a special String attribute with the role text.

  • prune method

    Specifies whether too frequent or too infrequent words should be ignored when building the word list, and how the frequencies are specified.

  • prune below percent

    Ignore words that appear in less than this percentage of all documents.

  • prune above percent

    Ignore words that appear in more than this percentage of all documents.

  • prune below absolute

    Ignore words that appear in fewer than this many documents.

  • prune above absolute

    Ignore words that appear in more than this many documents.

  • prune below rank

    Words are ordered by frequency, and words with a frequency lower than the frequency at the rank given by this percentage are pruned.

  • prune above rank

    Words are ordered by frequency, and words with a frequency higher than the frequency at the rank given by this percentage are pruned.

  • datamanagement

    Determines how the data is represented internally.

  • parallelize vector creation

    Determines whether vector creation is executed in parallel.
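
The interplay of text directories, file pattern, and the percentage-based pruning parameters can be sketched in Python as follows. This is a simplified illustration, not the operator's implementation: the helper names `load_labeled_files` and `prune_by_percent`, the use of `fnmatch` for the ? and * wildcards, and the inclusive handling of the boundary values are all assumptions.

```python
from collections import Counter
from fnmatch import fnmatch
from pathlib import Path


def load_labeled_files(text_directories, file_pattern="*"):
    """Yield (class_value, path) for every file matching the pattern.

    text_directories is a list of (class_value, directory) pairs,
    a hypothetical stand-in for the operator's directory list.
    """
    for class_value, directory in text_directories:
        for path in sorted(Path(directory).iterdir()):
            if path.is_file() and fnmatch(path.name, file_pattern):
                yield class_value, path


def prune_by_percent(documents, below_percent, above_percent):
    """Return the sorted words that survive percentage-based pruning.

    documents is a list of token lists. A word is dropped if it
    appears in less than below_percent or in more than above_percent
    of all documents.
    """
    total = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        # Count each word once per document (document frequency).
        doc_freq.update(set(tokens))
    kept = []
    for word, freq in doc_freq.items():
        percent = 100.0 * freq / total
        if below_percent <= percent <= above_percent:
            kept.append(word)
    return sorted(kept)
```

For example, with four documents in which "a" occurs in all four, "b" in two, and "c" in one, `prune_by_percent(docs, 30, 90)` keeps only "b": "c" falls below 30 percent and "a" exceeds 90 percent. The absolute variants would compare `freq` directly against document counts instead of percentages.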