Split File by Content (Text Processing)

Synopsis

Segments documents based on regular expressions, XPath, or simple string matching.

Description

This operator extracts segments from a set of text documents in a directory, using regular expressions, XPath, or simple string matching. It supports several formats, such as XML, HTML, plain text, and PDF, although XPath works only on XML and HTML documents. Where possible, the written files keep the same extension as the input files. PDF input, for example, is always converted to text files.
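
To make the directory-level behavior concrete, here is a minimal Python sketch of the same idea for plain text input in regular expression mode. It is an illustration, not the operator's implementation: the function name split_directory, the output naming scheme (stem, running index, original extension), and the choice to let "." match line breaks are all assumptions.

```python
import re
from pathlib import Path


def split_directory(texts, output, pattern, encoding="utf-8"):
    """Write every regex match found in each file under `texts` to its own file in `output`.

    Hypothetical helper: the naming scheme for the written files is an assumption.
    """
    texts, output = Path(texts), Path(output)
    output.mkdir(parents=True, exist_ok=True)
    # Assumption: segments may span several lines, so "." also matches line breaks.
    regex = re.compile(pattern, re.DOTALL)
    written = []
    for source in sorted(p for p in texts.iterdir() if p.is_file()):
        content = source.read_text(encoding=encoding)
        for index, match in enumerate(regex.finditer(content)):
            # Keep the input extension, mirroring "same extension as the input files".
            target = output / f"{source.stem}_{index}{source.suffix}"
            target.write_text(match.group(0), encoding=encoding)
            written.append(target)
    return written
```

Each match becomes one output file, so a document with three matches yields three segment files. PDF, HTML, and XML handling is deliberately left out of this sketch.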

Input

  • through (File)

    The through port.

Output

  • through (File)

    The through port.

Parameters

  • preview: Shows a preview of the results that the current configuration will produce.
  • matching mode: Determines which mode is used to select the segments.
  • xpath query: Specifies the XPath expression that matches the parts of the content to be treated as individual segments. Each match is treated as a single segment.
  • namespaces: Specifies pairs of identifier and namespace for use in XPath queries. The namespace for (X)HTML is bound automatically to the identifier h.
  • ignore cdata: Specifies whether CDATA should be ignored when parsing HTML.
  • assume html: If checked, a more tolerant XML parser is used, which copes with invalid HTML constructs but always assumes HTML and adds missing tags. Uncheck this for plain XML.
  • regular expression: Specifies the regular expression that matches the substrings of the content to be treated as individual segments. Each match is treated as a single segment.
  • segment expression: Specifies the expression used to replace each match of the regular expression above. Matching groups can be used, for example, to output the content of an attribute without the surrounding markup.
  • start string: Specifies the string used as the start point in string matching. The text between the start string and the end string, both exclusive, is treated as a segment.
  • end string: Specifies the string used as the end point in string matching. The text between the start string and the end string, both exclusive, is treated as a segment.
  • json path query: Specifies the JSONPath expression that matches the parts of the content to be treated as individual segments. Each match is treated as a single segment.
  • texts: A directory containing the documents to be segmented.
  • output: The directory to which the segments are written.
  • use file extension as type: If checked, the type of each file is determined by its extension. Unknown extensions are treated as text files.
  • content type: The content type of the input texts.
  • encoding: The encoding used for reading or writing files.
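
The three main matching modes described above can be sketched in Python as follows. This is a hedged illustration only: the function names regex_segments, string_segments, and xpath_segments are hypothetical, the group reference syntax is Python's (\1), which may differ from the syntax the operator expects, and xml.etree.ElementTree supports only a limited subset of XPath. The XPath variant returns the text content of each matched element rather than its markup, which is a simplification.

```python
import re
import xml.etree.ElementTree as ET

# Mirrors the automatic binding of the (X)HTML namespace to the identifier h.
XHTML_NAMESPACES = {"h": "http://www.w3.org/1999/xhtml"}


def regex_segments(content, regular_expression, segment_expression):
    """Return one segment per match, rewritten by the segment expression."""
    regex = re.compile(regular_expression, re.DOTALL)
    # Match.expand substitutes group references such as \1 in the template.
    return [m.expand(segment_expression) for m in regex.finditer(content)]


def string_segments(content, start, end):
    """Return the text between each start/end pair, excluding both delimiters."""
    if not start or not end:
        raise ValueError("start and end strings must be non-empty")
    segments, position = [], 0
    while True:
        begin = content.find(start, position)
        if begin == -1:
            break
        begin += len(start)  # skip the start string itself (exclusive)
        finish = content.find(end, begin)
        if finish == -1:
            break
        segments.append(content[begin:finish])
        position = finish + len(end)
    return segments


def xpath_segments(content, query, namespaces=XHTML_NAMESPACES):
    """Return the text content of every element matched by the query."""
    root = ET.fromstring(content)
    return ["".join(node.itertext()) for node in root.findall(query, namespaces)]


links = '<a href="x.html">X</a><a href="y.html">Y</a>'
print(regex_segments(links, r'<a href="([^"]*)">', r"\1"))  # ['x.html', 'y.html']
print(string_segments("<b>one</b> and <b>two</b>", "<b>", "</b>"))  # ['one', 'two']
```

In the first call the matching group keeps only the attribute value, which is the use case the segment expression parameter describes. For XHTML input, a query such as .//h:p selects paragraph elements through the h prefix.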