Extract Content (Web Mining)

Synopsis

Extracts content from an HTML document.

Description

This operator extracts textual content from a given HTML document and returns the extracted text blocks as documents. Only text blocks that contain at least a given minimum number of words are extracted, which prevents single words (e.g. in navigation bars) from being kept.
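
The minimum-length filter described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the operator's actual implementation: the function name `extract_text_blocks` and the parameter `min_words` are hypothetical, and treating every tag as a block divider is a simplifying assumption (scripts, styles and comments are not handled).

```python
import re


def extract_text_blocks(html, min_words=5):
    """Return text blocks with at least `min_words` words (illustrative sketch)."""
    # Assumption: every tag acts as a block divider.
    candidates = re.split(r"<[^>]+>", html)
    # Collapse runs of whitespace inside each candidate block.
    blocks = [" ".join(candidate.split()) for candidate in candidates]
    # Keep only blocks that reach the minimum word count.
    return [block for block in blocks if len(block.split()) >= min_words]


page = (
    "<html><body>"
    "<ul><li>Home</li><li>About</li></ul>"
    "<p>This paragraph has more than five words in it.</p>"
    "</body></html>"
)
blocks = extract_text_blocks(page, min_words=5)
```

In this sketch the single-word navigation entries "Home" and "About" fall below the threshold and are dropped, while the paragraph is kept.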

Input

  • document

    The HTML document from which content is extracted.

Output

  • document

    The extracted text blocks, returned as documents.

Parameters

  • extract content

    Specifies whether content is extracted or not.

  • minimum text block length

    The minimum length (in words/tokens) of text blocks.

  • override content type information

    Specifies whether any existing content type and encoding information should be overridden using the HTML meta http-equiv tag.

  • neglegt span tags

    Specifies whether <span> tags should be neglected or used as text block dividers.

  • neglect p tags

    Specifies whether <p> tags should be neglected or used as text block dividers.

  • neglect b tags

    Specifies whether <b> tags should be neglected or used as text block dividers.

  • neglect i tags

    Specifies whether <i> tags should be neglected or used as text block dividers.

  • neglect br tags

    Specifies whether <br> tags should be neglected or used as text block dividers.

  • ignore non html tags

    Specifies whether tags that are not common HTML should be ignored.
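
The difference between neglecting a tag and using it as a text block divider can be illustrated with a small Python sketch built on the standard library's `html.parser`. It is a rough approximation under stated assumptions, not the operator's code: the class `TextBlockParser`, the function `text_blocks`, and the `neglect` and `min_words` arguments are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class TextBlockParser(HTMLParser):
    """Splits text into blocks at every tag that is not neglected (sketch)."""

    def __init__(self, neglect=()):
        super().__init__()
        self.neglect = set(neglect)  # tag names that must NOT divide blocks
        self.blocks = []
        self._buffer = []

    def flush(self):
        # Close the current block and normalise its whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def _divide(self, tag):
        # A neglected tag is skipped; any other tag ends the current block.
        if tag not in self.neglect:
            self.flush()

    def handle_starttag(self, tag, attrs):
        self._divide(tag)

    def handle_endtag(self, tag):
        self._divide(tag)

    def handle_data(self, data):
        self._buffer.append(data)


def text_blocks(html, neglect=(), min_words=1):
    parser = TextBlockParser(neglect)
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any text left after the last tag
    return [b for b in parser.blocks if len(b.split()) >= min_words]


snippet = "<p>RapidMiner is <b>a data science</b> platform.</p>"
merged = text_blocks(snippet, neglect={"b"})
divided = text_blocks(snippet)
```

With `<b>` neglected, the sentence stays in one block; with `<b>` used as a divider, it is cut into three short fragments, which a minimum text block length greater than three would then discard entirely.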