Crawl Web (Web Mining)
Synopsis
This operator crawls the web and stores the retrieved links and pages in an ExampleSet or on disk.
This crawler starts on the specified start URL, loads pages, and follows all links as directed by the rules. There are different types of rules, each applied in a different situation:
- store_with_matching_url: If the regular expression matches the URL, this page will be stored in the resulting ExampleSet and on disk (if selected).
- store_with_matching_content: If the page content contains the given term, this page will be stored in the resulting ExampleSet. Note: using this filter slows down crawling considerably! Also note that this is NOT a regular expression but a simple 'contains' filter.
- follow_link_with_matching_url: If the regular expression matches the URL, the crawler will follow the link and load the URL.
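The interplay of the three rule types can be sketched as follows (illustrative Python with made-up example patterns; the rule names mirror those above, but this is not the operator's implementation):

```python
import re

# Hypothetical rule set: store blog pages or pages mentioning a term,
# and only follow links that stay on the example.com site.
rules = {
    "store_with_matching_url": re.compile(r".*\.example\.com/blog/.*"),
    "store_with_matching_content": "machine learning",   # plain 'contains', not a regex
    "follow_link_with_matching_url": re.compile(r".*\.example\.com/.*"),
}

def should_store(url, content):
    # A page is stored when its URL matches the store regex
    # OR its content contains the given term.
    return (rules["store_with_matching_url"].fullmatch(url) is not None
            or rules["store_with_matching_content"] in content)

def should_follow(url):
    # A link is followed only when its URL matches the follow regex.
    return rules["follow_link_with_matching_url"].fullmatch(url) is not None
```

Note that the content rule requires the page to be downloaded before it can be evaluated, which is why it slows crawling down far more than the URL rules.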
To avoid crawling a potentially unlimited number of pages, the maximal number of pages and the maximal depth the crawler will retrieve can be specified with the parameters max pages and max crawl depth. To speed up loading, the delay can be lowered, but please be friendly to web site owners and avoid causing high traffic on their sites; otherwise you may get blacklisted. Note that while the crawling makes use of your available CPU cores (license limits apply), crawling speed is usually limited by your bandwidth, disk I/O (if applicable), the crawling delay, and the fact that this crawler is benign and queries robots.txt for each page it visits.
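The bounded, polite crawl loop described above can be sketched as follows (illustrative Python; `fetch` and `extract_links` are hypothetical caller-supplied callbacks, not part of the operator):

```python
import time

def crawl(start_url, fetch, extract_links,
          max_pages=100, max_depth=2, delay_ms=1000):
    # Breadth-first crawl bounded by max_pages and max_depth, with a
    # politeness delay between page visits. A production crawler would
    # also consult each host's robots.txt (e.g. via urllib.robotparser)
    # before fetching a page.
    stored, queue, seen = [], [(start_url, 0)], {start_url}
    while queue and len(stored) < max_pages:
        url, depth = queue.pop(0)
        page = fetch(url)                  # caller-supplied download function
        stored.append((url, page))
        time.sleep(delay_ms / 1000.0)      # be friendly to site owners
        if depth < max_depth:              # depth 1 = only links on the start page
            for link in extract_links(page):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return stored
```

The delay is applied once per fetched page, which is why lowering it speeds up the crawl but raises the request rate seen by the target site.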
Please leave the ignore robot exclusion parameter unchecked unless you are going to crawl your own sites. Some site owners forbid crawling of their content, and for legal reasons you may be bound to their wishes.
- Example Set (Data Table): The example set port which returns the crawling results.
- url: The root page from which the crawler will start.
- crawling_rules: Specifies a set of rules that determine which links to follow and which pages to process.
- retrieve_as_html: If selected, the actual HTML is returned instead of a textual representation.
- enable_basic_auth: If selected, all requests will send basic authentication information in their header. Use only when crawling HTTPS pages!
- username: The username for basic authentication.
- password: The password for basic authentication.
- add_content_as_attribute: Specifies whether the pages' content should be added as a text attribute.
- write_pages_to_disk: Specifies whether the crawled pages should be saved as files.
- include_binary_content: If selected, the crawler will also consider binary content instead of only text pages. This can be useful, for example, to download all .pdf files from a web site by making use of the crawling rules parameter.
- output_dir: Specifies the directory on disk into which the files are written if write pages to disk is selected.
- output_file_extension: Specifies the file extension of the stored files.
- max_crawl_depth: Specifies the maximal depth of the crawling process. A depth of 1 means 'only crawl direct links on the initial page'.
- max_pages: The maximal number of pages to store.
- max_page_size: Specifies the maximum page size (in KB); pages larger than this limit are not downloaded.
- delay: Specifies the delay between page visits in milliseconds.
- max_concurrent_connections: The maximum number of HTTP connections used at the same time.
- max_connections_per_host: The maximum number of simultaneous HTTP connections used to connect to a single host. Increasing this parameter can put heavy load on a host, so please be careful!
- user_agent: The identity the crawler uses while accessing a server.
- ignore_robot_exclusion: Specifies whether the crawler should ignore the robot exclusion rules set by the crawled site. Enable only for your own sites; otherwise you may end up violating laws!
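When basic authentication is enabled, each request carries an Authorization header whose value is, per RFC 7617, "Basic " followed by the Base64 encoding of username:password. A minimal sketch (the function name is illustrative):

```python
import base64

def basic_auth_header(username, password):
    # RFC 7617: Authorization: Basic base64(username:password)
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}
```

Base64 is an encoding, not encryption, which is why the parameter description warns to send these credentials only over HTTPS.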