Text Vectorization (Model Simulator)
Synopsis
This operator can be used for basic feature extraction from text columns like TFIDF vectorization, adding sentiments, or detecting languages.
Description
This operator is a simplified version of the text processing operators available in extensions. It takes one or several text or nominal columns and transforms them into a vectorized format using TFIDF. While simpler to use, this operator offers only a subset of the features of the text mining extension. One major advantage is that users can simply select the number of features to be added and, if a label column is defined, only the most relevant features will be added to the example set. If no label is defined, pruning based on frequency is applied to bring the number of columns down to the desired number.
This operator only performs tokenization, conversion to lower case, and TFIDF calculation. In addition, it can extract the sentiment of each text column and detect its language out of a set of 20 languages. Please note, however, that a generic sentiment analysis often delivers only directionally correct results and is not comparable in accuracy to specific domain-based models. These two additional columns for each input text column will become special attributes and need to be transformed to regular attributes afterwards if they are to be used as inputs for machine learning models.
The operator delivers a pre-processing model which can be applied to new data sets to perform the same processing on this data. This is necessary for transforming scoring data sets in the same way as training data sets.
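The fit-then-apply idea described above can be illustrated with a minimal Python sketch. It is not the operator's implementation: the function names `fit` and `transform`, the `\W+` token split, and the textbook weighting tf * log(N / df) are all assumptions made for illustration, and the operator may use a different TFIDF variant.

```python
import math
import re
from collections import Counter

TOKEN_SPLIT = re.compile(r"\W+")  # assumption: split on runs of non-word characters


def tokenize(text):
    """Lower-case the text and split it into non-empty tokens."""
    return [t for t in TOKEN_SPLIT.split(text.lower()) if t]


def fit(documents):
    """Learn the vocabulary and inverse document frequencies from training texts."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))  # count each word once per document
    # Textbook IDF, log(N / df); hypothetical stand-in for the operator's formula.
    return {word: math.log(n_docs / df) for word, df in sorted(doc_freq.items())}


def transform(model, documents):
    """Apply a fitted model to any texts; words unseen during fit are ignored."""
    rows = []
    for doc in documents:
        tokens = tokenize(doc)
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty texts
        rows.append({word: (counts[word] / total) * idf for word, idf in model.items()})
    return rows


training = ["A great movie", "A boring movie", "Great acting"]
model = fit(training)

# Scoring data gets exactly the columns learned from the training data.
scored = transform(model, ["Great great plot"])
```

Because `transform` only emits columns stored in the fitted model, the scoring data always has the same structure as the training data, which is the purpose of the pre-processing model.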
Input
- example set input (Data table)
This input port expects a data set. At least one of the columns should contain free text. It can be either nominal or of type text.
Output
- example set output (Data table)
This output port provides the transformed data, i.e. the original data set with the extracted TFIDF columns, sentiment scores, or languages.
- original (Data table)
The original input data without any changes.
- preprocessing model
The text processing model which shows useful information about the extracted features and can be applied on new (scoring) data sets in order to perform the same transformations there.
Parameters
- add sentiment Indicates if a column with the most likely sentiment (positive, negative, neutral) should be added for each of the processed text columns.
- add language Indicates if a column with the most likely content language should be added for each of the processed text columns.
- keep original Indicates if the original text attributes should be kept or removed.
- store training documents Indicates if the documents used for building the word vectors should be stored as part of the model. This is useful for visualizations but will increase memory usage.
- store scoring documents Indicates if the documents which are transformed during scoring should also be stored as part of the model. This is useful for visualizations but will increase memory usage, especially in continuous use.
- document class attribute The name of the nominal attribute which should be used for deriving the document classes (typically the label attribute).
- token split This regular expression is used to split the tokens from each other. Default is word boundaries.
- apply pruning Indicates if pruning should be applied to the resulting columns.
- max number of new columns The maximum number of columns after pruning generated for each of the input text columns.
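How the token split, apply pruning, and max number of new columns parameters could interact is sketched below in Python. This is an illustration under assumptions, not the operator's code: the helper names are hypothetical, `\W+` stands in for the default word-boundary split, and "keep the tokens that occur in the most documents" is only one plausible reading of frequency-based pruning.

```python
import re
from collections import Counter


def document_frequencies(documents, token_split=r"\W+"):
    """Count in how many documents each lower-cased token occurs."""
    splitter = re.compile(token_split)
    freq = Counter()
    for doc in documents:
        freq.update({t for t in splitter.split(doc.lower()) if t})
    return freq


def prune_by_frequency(freq, max_columns):
    """Keep the max_columns most frequent tokens (ties broken alphabetically)."""
    ranked = sorted(freq.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:max_columns]]


docs = ["good film, good cast", "bad film", "good plot; bad film"]
freq = document_frequencies(docs)
kept = prune_by_frequency(freq, max_columns=3)

# A different split expression changes what counts as a token.
whitespace_tokens = document_frequencies(["a-b c"], token_split=r"\s+")
```

With a whitespace split, "a-b" stays a single token, whereas the non-word split would break it into "a" and "b".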
Tutorial Processes
Text Vectorization and Model Application
This process loads a data set with movie reviews from the community samples repository. Each review is stored as full text together with the information whether it was a positive or a negative review. The goal is to build a machine learning model which can predict whether a new text is positive or negative. This task is also known as sentiment analysis.
After loading the data, the process splits it into two parts, one for training and one for testing the model. The training part is delivered to the Text Vectorization operator. Please note the settings of this operator. We have specified that we want to work on the input column named "text". We also defined that we have a label we want to predict (this only works for nominal attributes, i.e. for classification tasks). This can help the operator to make smarter pruning decisions for the selected bag of words which will be used as the output columns of this operator.
Talking about pruning: we also defined that we want to bring the set of extracted output features down to 1500. Without pruning, each word in the analyzed language could become a column in your data set, which often introduces too much noise. The pruning tries to bring down the number of columns to the most meaningful ones. If you have defined a class to be predicted (like we do here), it will take this information into account, too.
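The text does not state which relevance measure the operator uses when a class is defined. As a hedged illustration only, the Python sketch below scores each word by how unevenly it is spread across the classes and keeps the top-scoring words. The functions `relevance_scores` and `select_columns` and the scoring rule itself are assumptions, not the operator's method.

```python
import re
from collections import Counter, defaultdict


def relevance_scores(documents, labels):
    """Score each word by how unevenly it is spread across the classes."""
    class_sizes = Counter(labels)
    per_class = defaultdict(Counter)  # word -> class -> number of documents
    for doc, label in zip(documents, labels):
        for word in {t for t in re.split(r"\W+", doc.lower()) if t}:
            per_class[word][label] += 1
    scores = {}
    for word, counts in per_class.items():
        # Share of each class's documents that contain the word.
        rates = [counts[c] / class_sizes[c] for c in class_sizes]
        scores[word] = max(rates) - min(rates)  # 0 means no class preference
    return scores


def select_columns(scores, max_columns):
    """Keep the max_columns highest-scoring words (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:max_columns]]


docs = ["great film", "great cast", "boring film", "boring plot"]
labels = ["positive", "positive", "negative", "negative"]
scores = relevance_scores(docs, labels)
selected = select_columns(scores, max_columns=2)
```

In this toy data, "film" appears equally often in both classes and scores zero, so it is pruned first, while "great" and "boring" each appear in only one class and are kept.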
We then train a prediction model on the transformed training data. We use a fast variant of a support vector machine here, but in general linear SVMs work well for most text classification tasks after some tuning of the parameter C.
Next, we apply the preprocessing model delivered by the Text Vectorization operator to the scoring data as well. This makes sure that we extract exactly the same columns from the scoring data that we have trained the model on. Finally, we use another operator, Apply Model, to create the predictions on the test data.
Examining the output from each of the visualization tools, we find the following:
- Text vectorization: This is the visualization of the preprocessing model delivered by the Text Vectorization operator. It shows which words are ultimately used and how often they occurred in the training data. It also allows you to browse through the documents and shows other helpful information.
- Prediction model: The prediction model used to predict if a review is positive or negative. You need to make sure that the model is applied to data with the same structure it has been trained on. This can be achieved by applying the preprocessing model above to new data sets first.
- Data with predictions: The data with all extracted columns as well as the predictions for each review.