Enrich Data by Webservice (Web Mining)

Synopsis

This operator retrieves information from a webservice and attaches the extracted result to the examples.

Description

This operator extracts additional attributes from structured or unstructured web service results, or from simpler HTTP requests, using regular expressions, XPath, or simple string matching. The input texts are requested from the specified URL. To include information from each example in the request, every <%attribute name%> is replaced by the current example's value of the attribute with the given name. Since some web service providers restrict the number of requests per second, a delay can be set to stay within this limit.
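The placeholder replacement and the delay can be pictured with a minimal Python sketch. It is not the operator's implementation: the names fill_url and build_requests, the example.org URL, and the URL-encoding of inserted values are all assumptions made for illustration.

```python
import re
import time
from urllib.parse import quote

# Matches placeholders of the form <%attribute name%>.
PLACEHOLDER = re.compile(r"<%(.+?)%>")

def fill_url(template, example):
    """Replace each <%name%> with the value of example[name].

    Assumption: values are URL-encoded before insertion; the operator
    documentation does not say whether it does this.
    """
    def substitute(match):
        name = match.group(1)
        return quote(str(example[name]), safe="")
    return PLACEHOLDER.sub(substitute, template)

def build_requests(template, examples, delay_ms=0):
    """Yield one URL per example, waiting delay_ms between consecutive ones."""
    for i, example in enumerate(examples):
        if i > 0 and delay_ms > 0:
            time.sleep(delay_ms / 1000.0)
        yield fill_url(template, example)

# Hypothetical examples, each with a "city" attribute.
examples = [{"city": "New York"}, {"city": "Berlin"}]
urls = list(build_requests("http://example.org/weather?q=<%city%>",
                           examples, delay_ms=10))
```

Here urls holds one request URL per example, with the city value substituted into the query string.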

The query type for extracting information from the response can be XPath for XML documents or regular expressions for less structured texts. The XPath expression directly specifies which part of the XML document is retrieved, and that part is used as the value of the new attribute. If you use regular expressions, the first matching group is used as the value. For example, the expression "Name:\s*(.*)\n" applied to the text "Name: Paul" followed by a line break yields "Paul" as the new attribute value.
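The two query types can be tried outside the operator with a short Python sketch. The helper names extract_regex and extract_xpath and the <person> document are hypothetical, and Python's xml.etree.ElementTree supports only a subset of XPath, so this approximates the behavior described above rather than reproducing it.

```python
import re
import xml.etree.ElementTree as ET

def extract_regex(pattern, text):
    """Return the first matching group of the first match, or None."""
    match = re.search(pattern, text)
    return match.group(1) if match else None

def extract_xpath(path, xml_text):
    """Return the text of the first element selected by path, or None."""
    node = ET.fromstring(xml_text).find(path)
    return node.text if node is not None else None

# The regular expression from the text above, applied to "Name: Paul"
# followed by a line break.
name_from_regex = extract_regex(r"Name:\s*(.*)\n", "Name: Paul\n")

# A hypothetical XML response queried with a simple path expression.
name_from_xpath = extract_xpath("./name", "<person><name>Paul</name></person>")
```

Both calls return "Paul". The regular expression drops the space after the colon because \s* consumes it before the group starts.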

Note that the default content type for POST requests is text/xml (HTTP header Content-Type: text/xml). To use other types, define the property Content-Type in the request properties (e.g., application/json for JSON based services).
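As an illustration of the Content-Type override, the following Python sketch builds (but does not send) a POST request whose default content type is text/xml unless a request property replaces it. The function build_post, the example.org URL, and the JSON body are assumptions for this sketch, not part of the operator.

```python
import json
import urllib.request

def build_post(url, body, properties=None):
    """Build a POST request; Content-Type defaults to text/xml."""
    headers = {"Content-Type": "text/xml"}
    # Request properties override the default, mirroring the note above.
    headers.update(properties or {})
    return urllib.request.Request(
        url, data=body.encode("utf-8"), headers=headers, method="POST"
    )

# Default: an XML body sent as text/xml.
xml_request = build_post("http://example.org/service", "<query>Paul</query>")

# Override for a JSON-based service.
json_request = build_post(
    "http://example.org/service",
    json.dumps({"query": "Paul"}),
    {"Content-Type": "application/json"},
)
```

Passing either request object to urllib.request.urlopen would send it; that step is left out here so the sketch runs without network access.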

String matching is a fast and easy-to-use alternative to regular expressions, but it is less powerful. You only specify a start string and an end string; everything between the two is extracted. For example, if the start string is "Name:" and the end string is a line break, the result for the text above is " Paul", including the leading space.
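String matching as described above amounts to two substring searches. A minimal Python sketch, with the hypothetical helper name extract_between:

```python
def extract_between(text, start, end):
    """Return the text between the first start string and the next end string."""
    begin = text.find(start)
    if begin == -1:
        return None
    begin += len(start)
    stop = text.find(end, begin)
    if stop == -1:
        return None
    return text[begin:stop]

# Start string "Name:", end string a line break.
value = extract_between("Name: Paul\n", "Name:", "\n")
```

The result is " Paul" with the leading space preserved, which is the main practical difference from the regular expression example, where \s* removes it.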

The response might contain a separated list of results, for example an XML tag like this: <languages>en,de,fr,sp</languages>. In that case you can enter a query yielding "en,de,fr,sp" multiple times, using a different attribute name each time. If the separator parameter contains ",", the first attribute is filled with "en", the second with "de", and so on. This can also be used to get only the first enumerated value. Be careful with this feature, though: other results might be split as well, even if you do not enter a query twice. You can avoid this by adding a second operator in which you do not specify a separator.
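The effect of the separator can be sketched in Python as follows. The helper split_into_attributes and the attribute names lang1 and lang2 are hypothetical, and treating every character of the separator parameter as a possible delimiter is an assumption based on its description ("characters used to separate entries").

```python
import re
import xml.etree.ElementTree as ET

def split_into_attributes(value, separator, attribute_names):
    """Assign the i-th entry of the separated value to the i-th attribute."""
    if separator:
        # Assumption: each character in separator acts as a delimiter.
        parts = re.split("[" + re.escape(separator) + "]", value)
    else:
        parts = [value]  # no separator: the value is left unsplit
    return {
        name: (parts[i] if i < len(parts) else None)
        for i, name in enumerate(attribute_names)
    }

raw = ET.fromstring("<languages>en,de,fr,sp</languages>").text

# The same query result is entered twice under different attribute names.
row = split_into_attributes(raw, ",", ["lang1", "lang2"])

# Without a separator, the whole list ends up in a single attribute.
unsplit = split_into_attributes(raw, "", ["languages"])
```

With the separator set, row maps lang1 to "en" and lang2 to "de". Without it, the single attribute keeps the full string "en,de,fr,sp".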

Input

in (Data table)

Output

out (Data table)

Parameters

query type Specifies the type of the query. Range: selection
string machting queries Specifies a list of string matching start and end sequences. Everything between will be used as result. See the operator documentation for details on string matching. Range: list
attribute type Specifies the type of the resulting attributes. If numerical or binomial is chosen, ensure that the returned result can be interpreted as that type. Range: selection
regular expression queries Specifies a list of attribute names and their corresponding regular expressions. The first matching group is used as value. See the operator documentation for details on regular expressions. Range: list
regular region queries Specifies a list of attribute names and their corresponding regular expressions. Two regular expressions might be specified in order to define the start and the end of a region. Everything in between the two matches will be delivered as result. Range: list
xpath queries Specifies a list of attribute names and their corresponding XPath queries. See the operator documentation for details on XPath. Range: list
namespaces Specifies pairs of identifier and namespace for use in XPath queries. The namespace for (x)html is bound automatically to the identifier h. Range: list
ignore CDATA Indicates whether CDATA should be ignored when evaluating the XPath expression. Range: boolean
assume html If checked, a more tolerant XML parser is used, which copes with forbidden HTML constructions but always assumes HTML and adds missing tags. Uncheck this for plain XML. Range: boolean
index queries Specifies a list of attribute names and the regions. Regions are specified as offset index and length of the match. Range: list
url The URL of the HTTP GET based service. This URL may contain terms of the form <attributeName>, including the braces, that are replaced by the value of the corresponding attribute before invoking the query. Range: String
separator Characters used to separate entries in the result field obtained by XPath or regular expression. Range: String
delay Number of milliseconds to wait between requests. Range: Integer
encoding The encoding used for reading or writing files. Range: selection
user_agent The User-Agent HTTP request header property. Range: String
keep_sensitive_headers Keep the "Authorization" and "Cookie" headers during a redirect to a different domain or subdomain. Range: boolean
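
The regular region queries and index queries parameters are described only briefly above. The following Python sketch shows one plausible reading of them. The helper names extract_region and extract_index are hypothetical, and interpreting the index as a zero-based character offset is an assumption.

```python
import re

def extract_region(text, start_pattern, end_pattern):
    """Return the text between a start-pattern match and the next end-pattern match."""
    start = re.search(start_pattern, text)
    if start is None:
        return None
    # Search for the end pattern only after the end of the start match.
    end = re.compile(end_pattern).search(text, start.end())
    if end is None:
        return None
    return text[start.end():end.start()]

def extract_index(text, offset, length):
    """Return length characters starting at a zero-based offset (assumed)."""
    return text[offset:offset + length]

region = extract_region("<b >Paul</b>", r"<b\s*>", r"</b>")
by_index = extract_index("Name: Paul", 6, 4)
```

Both calls return "Paul". Region queries behave like string matching, except that the start and end are regular expressions instead of fixed strings.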
