Extract Document (Text Processing)

Synopsis

This operator converts a value stored in an ExampleSet into a Document.

Description

The Extract Document operator creates a new Document, containing a specified data value from the input ExampleSet. It can be used on any value type but is typically used on attributes containing text values. The entry in the ExampleSet is referenced by specifying an attribute name and a row index.

Input

  • example set (Data table)

    This input port expects an ExampleSet.

Output

  • example set (Data table)

    The ExampleSet that was given as input is passed unchanged to the output through this port. This is usually done to reuse the same ExampleSet in further operators or to view it in the Results Workspace.

  • output

    The extracted Document, containing the specified data value from the ExampleSet, is delivered through this port.

Parameters

  • attribute_name

    The name of the attribute from which the data value should be extracted. The attribute name must exist in the ExampleSet that was given as input. Range: string

  • example_index

    This parameter lets you specify the index in the ExampleSet from which the Document should be extracted. Numbering starts at 1. Range: integer
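
The way the two parameters address a single entry can be illustrated with a small sketch in plain Python. This is not the operator's implementation or any RapidMiner API: the function extract_document and the list-of-dicts stand-in for an ExampleSet are hypothetical, and raising an error for an out-of-range index is an assumption. It only shows that attribute_name selects the column and example_index selects the row, counting from 1.

```python
def extract_document(example_set, attribute_name, example_index):
    """Return the value at (attribute_name, example_index) as text.

    example_set: list of dict rows (a hypothetical stand-in for an ExampleSet).
    example_index: 1-based row number, as in the operator's parameter.
    """
    # Assumption: an index outside the data is treated as an error.
    if not 1 <= example_index <= len(example_set):
        raise IndexError(f"example_index {example_index} is out of range")
    row = example_set[example_index - 1]  # convert 1-based to 0-based
    if attribute_name not in row:
        raise KeyError(f"attribute {attribute_name!r} does not exist")
    return str(row[attribute_name])  # any value type is turned into text


example_set = [
    {"id": 1, "Text": "First entry"},
    {"id": 2, "Text": "Second entry"},
]
document = extract_document(example_set, "Text", 2)
print(document)
```

With example_index set to 2, the sketch returns the value from the second row, not the third, because numbering starts at 1.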

Tutorial Processes

Working with text values using the Extract Document operator

Let us assume that we have an ExampleSet with attributes containing text. With the help of the Extract Document operator, we can extract this text and work with it as a Document, using operators from the Text Processing Extension.

In our Example Process, we first generate an ExampleSet using the Generate Data by User Specification operator. This ExampleSet contains a single row of text. The data is then transposed, so that the text values appear under a single attribute. For the sake of clarity, the attribute is renamed to 'Text' using the Rename operator. A breakpoint is inserted after the Rename operator so you can see the resulting ExampleSet.

Now the Extract Document operator is used to extract the content of the 'Text' attribute in the second row of our ExampleSet. As the resulting Document contains the desired information, we can now work with it, for example by using the Tokenize operator to divide the text into words.
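
The steps of this tutorial process can be mirrored in a few lines of plain Python, as a rough analogue and not the process itself. The sample text values, the column names att1 to att3 and att_1, and the rule of splitting on non-letter characters are all assumptions made for illustration; none of them come from the Example Process.

```python
import re

# One row with several text values, standing in for the generated data
# (attribute names and values are hypothetical).
attributes = ["att1", "att2", "att3"]
row = ["Hello world", "Text mining is fun", "Goodbye"]

# Transpose: each former column becomes a row holding its name and value.
transposed = [{"id": name, "att_1": value} for name, value in zip(attributes, row)]

# Rename the value column to 'Text'.
renamed = [{"id": r["id"], "Text": r["att_1"]} for r in transposed]

# Extract the 'Text' value at the second row (indices start at 1).
example_index = 2
document = str(renamed[example_index - 1]["Text"])

# Tokenize: split on runs of non-letter characters (an assumed rule).
tokens = [t for t in re.split(r"[^A-Za-z]+", document) if t]
print(tokens)
```

After the transpose step every text value sits in its own row under one attribute, which is why a single attribute name plus a row index is enough to pick out the text to tokenize.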