Extract Content (Web Mining)

Synopsis

Extracts content from an HTML document.

Description

This operator extracts textual content from a given HTML document and returns the extracted text blocks as documents. Only text blocks containing at least a given minimum number of words are extracted, which prevents single words (e.g. in navigation bars) from being kept.
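The filtering idea described above can be illustrated with a minimal Python sketch. It is not the operator's actual implementation: the class `TextBlockParser`, the function `extract_text_blocks`, the default of 5 words, and the choice to skip `script` and `style` content are all assumptions made for illustration. Here every tag is treated as a text block divider.

```python
from html.parser import HTMLParser


class TextBlockParser(HTMLParser):
    """Collects text runs and starts a new block at every tag (hypothetical helper)."""

    SKIPPED = {"script", "style"}  # assumption: these never count as content

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._parts = []
        self._skip_depth = 0

    def _flush(self):
        # Join the collected text, normalize whitespace, and store non-empty blocks.
        text = " ".join("".join(self._parts).split())
        if text:
            self.blocks.append(text)
        self._parts = []

    def handle_starttag(self, tag, attrs):
        self._flush()
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        self._flush()
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._parts.append(data)

    def close(self):
        super().close()
        self._flush()  # keep any text that follows the last tag


def extract_text_blocks(html, minimum_text_block_length=5):
    """Return the text blocks that contain at least the given number of words."""
    parser = TextBlockParser()
    parser.feed(html)
    parser.close()
    return [b for b in parser.blocks if len(b.split()) >= minimum_text_block_length]


page = (
    "<html><body><ul><li>Home</li><li>About</li></ul>"
    "<p>This paragraph has more than five words in it.</p></body></html>"
)
print(extract_text_blocks(page))
# ['This paragraph has more than five words in it.']
```

With a minimum of 1 the navigation entries "Home" and "About" would be returned as well, which is the case the minimum length is meant to avoid.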

Input

document
The HTML document from which content is extracted.

Output

document
The extracted text blocks, each returned as a document.

Parameters

extract content
Specifies whether content is extracted or not.

minimum text block length
The minimum length (in words/tokens) of text blocks.

override content type information
Specifies whether potentially existing content type information and used encoding information should be overridden using the HTML meta http-equiv tag.

neglect span tags
Specifies whether <span> tags should be neglected or used as text block dividers.

neglect p tags
Specifies whether <p> tags should be neglected or used as text block dividers.

neglect b tags
Specifies whether <b> tags should be neglected or used as text block dividers.

neglect i tags
Specifies whether <i> tags should be neglected or used as text block dividers.

neglect br tags
Specifies whether <br> tags should be neglected or used as text block dividers.

ignore non html tags
Specifies whether tags that are not common HTML should be ignored.
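
The effect of the neglect parameters can be sketched in Python as follows. This is only an illustration under assumptions, not the operator's code: `BlockSplitter` and `split_into_blocks` are hypothetical names, and inserting a space for neglected `p` and `br` tags is a choice made here so that adjacent words do not run together. A neglected tag leaves the surrounding text in one block, while any other tag ends the current block.

```python
from html.parser import HTMLParser


class BlockSplitter(HTMLParser):
    """Splits text at tags, except at the tags listed as neglected (hypothetical helper)."""

    def __init__(self, neglected_tags):
        super().__init__()
        self.neglected = set(neglected_tags)
        self.blocks = []
        self._parts = []

    def _flush(self):
        # Close the current block, normalizing whitespace and dropping empty blocks.
        text = " ".join("".join(self._parts).split())
        if text:
            self.blocks.append(text)
        self._parts = []

    def _on_tag(self, tag):
        if tag in self.neglected:
            # Neglected tag: stay in the same block.
            # Assumption: p and br imply a word boundary, so add a space.
            if tag in ("p", "br"):
                self._parts.append(" ")
        else:
            # Any other tag acts as a text block divider.
            self._flush()

    def handle_starttag(self, tag, attrs):
        self._on_tag(tag)

    def handle_endtag(self, tag):
        self._on_tag(tag)

    def handle_data(self, data):
        self._parts.append(data)

    def close(self):
        super().close()
        self._flush()


def split_into_blocks(html, neglected_tags=()):
    """Return the text blocks of html, keeping neglected tags inside a block."""
    splitter = BlockSplitter(neglected_tags)
    splitter.feed(html)
    splitter.close()
    return splitter.blocks


snippet = "<div>Read the <b>full</b> report<br>before Monday</div>"
print(split_into_blocks(snippet, neglected_tags={"b", "br"}))
# ['Read the full report before Monday']
print(split_into_blocks(snippet))
# ['Read the', 'full', 'report', 'before Monday']
```

In the second call every fragment is shorter than a typical minimum text block length, so using b and br tags as dividers can cause an otherwise complete sentence to be dropped.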
