Process Documents from Web (Web Mining)
Synopsis

This operator crawls the web and preprocesses the individual pages before storing them, together with additional information, in an ExampleSet.
This operator is quite similar to the Crawl Web operator, but it additionally allows extracting information from web pages without having to store the complete pages first. This behavior is more appropriate if you are going to crawl a huge number of pages but discard most of their content.
An advanced use of this operator arises when the crawled pages consist of sub-parts that are interesting to you. You can cut the document of the web page inside this operator using a Cut operator and deliver the whole collection of documents to the inner sink of this operator. Each document of the collection will become one example. If you have attached additional meta information, e.g. by using the Extract Information operator, it will be stored as an additional attribute.
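The idea of cutting a page into sub-parts, each of which becomes one example with attached meta information, can be sketched outside RapidMiner as well. The following Python sketch uses invented page content and a made-up section pattern (this is not RapidMiner's API; the Cut and Extract Information operators do the equivalent steps inside the process):

```python
import re

# Hypothetical page content; inside RapidMiner the Cut operator
# would perform this splitting step.
page = """
<h2>Intro</h2><p>Welcome to the site.</p>
<h2>Products</h2><p>We sell widgets.</p>
<h2>Contact</h2><p>Mail us anytime.</p>
"""

# Cut the document into sub-parts: one per <h2> section.
sections = re.findall(r"<h2>(.*?)</h2><p>(.*?)</p>", page)

# Each sub-part becomes one example (row); the extracted heading is
# kept as an additional meta-information attribute.
examples = [{"title": title, "text": text} for title, text in sections]

for ex in examples:
    print(ex["title"], "->", ex["text"])
```

Each dictionary here plays the role of one example in the resulting ExampleSet, with `title` as the extra attribute.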
The internal crawler will start on the specified starting URL to load pages and follow all links as commanded by the rules. There are different types of rules, each one applied in different situations:
- store_with_matching_url: If the regular expression matches the URL, this page will be stored in the resulting ExampleSet.
- store_with_matching_content: If the regular expression matches the page content, this page will be stored in the resulting ExampleSet.
- follow_link_with_matching_url: If the regular expression matches the URL, the crawler will follow the link and load the URL.
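How these three rule types interact can be sketched as follows; the rule regexes and URLs are invented for illustration and are not part of the operator itself:

```python
import re

# Invented rule set: store product pages or pages advertising offers,
# and follow links into the catalogue.
rules = {
    "store_with_matching_url": re.compile(r"/product/\d+"),
    "store_with_matching_content": re.compile(r"Special offer"),
    "follow_link_with_matching_url": re.compile(r"/catalogue/"),
}

def decide(url, content):
    """Return (store, follow) decisions for one crawled page."""
    store = bool(rules["store_with_matching_url"].search(url)) or \
            bool(rules["store_with_matching_content"].search(content))
    follow = bool(rules["follow_link_with_matching_url"].search(url))
    return store, follow

print(decide("https://example.com/product/42", "plain text"))        # (True, False)
print(decide("https://example.com/catalogue/toys", "Special offer"))  # (True, True)
```

A page is stored if either storing rule matches, while the follow rule only controls whether the crawler descends into the link.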
To avoid crawling a potentially unlimited number of pages, the maximal number of pages and the maximal depth the crawler will retrieve can be specified with the max pages and max depth parameters. To speed up loading, the delay can be lowered, but please be friendly to web site owners and avoid causing high traffic on their sites; otherwise you may get blacklisted. Note that while crawling makes use of your available CPU cores (license limits apply), crawling speed is usually limited by your bandwidth, the crawling delay, and the fact that this crawler is benign and queries the robots.txt for each page it visits.
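The bounds described above can be sketched as a breadth-first crawl loop. In this Python sketch, `get_links` is a stub standing in for the real HTTP request and link extraction (a real crawler would also consult robots.txt at that point); the parameter names mirror the operator's max pages, max depth, and delay:

```python
import time
from collections import deque

def crawl(start_url, get_links, max_pages=10, max_depth=2, delay=0.0):
    """Breadth-first crawl bounded by page count and link depth.

    get_links(url) stands in for fetching a page and extracting its
    links; a benign crawler would also check robots.txt here.
    """
    seen, stored = {start_url}, []
    queue = deque([(start_url, 0)])          # (url, depth)
    while queue and len(stored) < max_pages:
        url, depth = queue.popleft()
        time.sleep(delay)                    # be friendly to site owners
        stored.append(url)
        if depth < max_depth:                # depth 1 = direct links only
            for link in get_links(url):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return stored

# Toy link graph instead of real HTTP requests.
site = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
print(crawl("a", lambda u: site.get(u, []), max_pages=3, max_depth=1))
```

Raising `max_depth` to 2 in this toy graph would also reach page "d"; lowering `delay` speeds the loop up at the cost of heavier load on the crawled host.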
Please leave the ignore robot exclusion parameter unchecked unless you are going to crawl your own sites. Some site owners forbid crawling of their content, and for legal reasons you may be bound to their wishes.
- example set (Data Table)
The example set port which returns the crawling results.
- url: The root page from which the crawler will start.
- crawling_rules: Specifies a set of rules that determine which links to follow and which pages to process.
- retrieve_as_html: If selected, the actual HTML is returned instead of the textual representation.
- enable_basic_auth: If selected, all requests will send basic auth information in their header. Use only when crawling HTTPS pages!
- username: The username for basic authentication.
- password: The password for basic authentication.
- add_content_as_attribute: Specifies whether the pages' content should be added as a text attribute.
- max_crawl_depth: Specifies the maximal depth of the crawling process. A depth of 1 means 'only crawl direct links on the initial page'.
- max_pages: The maximal number of pages to store.
- max_page_size: Specifies the maximum page size (in KB); pages larger than this limit are not downloaded.
- delay: Specifies the delay, in milliseconds, between visits to pages.
- max_concurrent_connections: The maximum number of HTTP connections used at the same time.
- max_connections_per_host: The maximum number of simultaneous HTTP connections to a single host. Increasing this parameter can put heavy load on a host, so please be careful!
- user_agent: The identity the crawler uses while accessing a server.
- ignore_robot_exclusion: Specifies whether the crawler should ignore the robot exclusion rules set by the crawled page. Enable this only for your own sites; otherwise you may end up violating laws!