# Build models

Auto Model Web is designed to help you build predictive models from your data – fast and simple. All you need is a data set (like an Excel spreadsheet) and something you want to predict. It's that simple!

As discussed in the introduction, we will guide you through the following steps:

- Upload Data -- upload all the data that's possibly relevant
- Choose Column -- choose the column whose values you want to predict
- Select Inputs -- decide what's relevant and eliminate what's irrelevant
- Select Models -- select and build one or more models

By the end of Step 4, you will have created one or more models. After that, you can inspect the models and decide which one best suits your purpose.

## Step 1: Upload Data

Your privacy is important. Please do not upload data containing personally identifiable information.

We recommend you either remove columns containing such information, or use anonymization or pseudonymization.
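As an illustration of pseudonymization, the sketch below replaces the values of an identifying column with a keyed hash before the file is uploaded. It is only one possible approach, not a feature of Auto Model Web. The helper names, the example rows, and the key are hypothetical. The key should be kept secret and stored separately from the data.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed hash (HMAC-SHA256), truncated to 16 hex characters."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def pseudonymize_column(rows, column, secret_key):
    """Return copies of the rows with one column pseudonymized; the input rows are not modified."""
    return [{**row, column: pseudonymize(row[column], secret_key)} for row in rows]


# Hypothetical example rows; real data would be read from your CSV or Excel file.
rows = [
    {"Phone": "382-4657", "Churn": "no"},
    {"Phone": "371-7191", "Churn": "yes"},
    {"Phone": "382-4657", "Churn": "no"},
]
key = b"replace-with-a-secret-key"  # assumption: you manage this key yourself
safe_rows = pseudonymize_column(rows, "Phone", key)
```

The same input always maps to the same pseudonym, so rows belonging to one customer can still be grouped, but the original value is not stored in the uploaded file.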

Auto Model Web accepts data in a spreadsheet format, either Excel or CSV.
To use Auto Model Web, find a data set with data whose values you would like to predict, and make sure it has this format.
Then you can upload that data file to Auto Model Web by clicking the **Upload Data** button, or simply drag and drop your file in the box.

If you don't have a data set available, and you simply want to take a quick look at the application,
press the button **Use Sample Dataset**, and select "Churn Prediction Data".

## Step 2: Choose Column

In what follows, we'll discuss the consequences of choosing the sample data set, "Churn Prediction Data". The data concerns customers of a phone company, who may or may not give up on their subscription. Who will stay and who will go? And why? If we can answer these questions, the phone company can make changes to improve their customer satisfaction.

One of the data columns -- we'll call it the *target column* -- has values that you want to predict.
In our current example, the target column is "Churn", since we want to predict who will churn.
From the dropdown menu, choose "Churn" before clicking **Next**.

In general, the values of the target column can be numerical (like "CustServ Calls") or categorical (like "Churn"). Depending on your target column, the problem will fall into one of the three following categories:

- **Binary classification** - Categorical data, two possible values (like "Churn")
- **Multiclass classification** - Categorical data, three or more possible values
- **Regression** - Numerical data (like "CustServ Calls")

Choose a column, and Auto Model Web will automatically detect what type of problem it has to solve. Additional details for each type of problem are given below.

**Binary Classification** (predicting one of exactly two possible values)

Some questions have a yes-or-no answer. For example, if you take a medical test, the results are often described as *positive* or *negative*:

- **Positive**: the test found what you were looking for (e.g., an infection)
- **Negative**: the test did not find what you were looking for (e.g., no infection)

If the result is positive, a more thorough investigation may be necessary; if the result is negative, no more work is needed. Arguably, the positive result is more important and deserves a higher degree of attention, because the focus of medical work is to treat the infection.

Our current problem, where "Churn" takes the values "yes" or "no", is an example of a binary classification problem, with the focus on "yes", since we want to predict which customers will churn.

**Multiclass Classification** (predicting one of three or more possible values)

If your target column has three or more *non-numerical* values, your problem is called a multiclass classification problem.

**Regression** (predicting numerical values)

If your target column is numerical, and you want to predict the numbers in that column, your problem is called a regression problem. For example, in our "Churn Prediction Data", there is a column called "CustServ Calls" whose value is the number of times a customer has called customer service.
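The three categories can be told apart by looking at the values in the target column. The sketch below shows one way to do that. It is not Auto Model Web's actual detection logic, and the function names are hypothetical. It assumes that any column whose values all parse as numbers is a regression target, which a real tool may handle differently (for example, a numeric column containing only 0 and 1).

```python
def is_number(value):
    """Return True if the value is numeric or a string that parses as a number."""
    if isinstance(value, bool):
        return False
    if isinstance(value, (int, float)):
        return True
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def detect_problem_type(target_values):
    """Classify the prediction task from the values found in the target column."""
    # Ignore missing cells (assumed to appear as None or an empty string).
    values = [v for v in target_values if v is not None and v != ""]
    if not values:
        raise ValueError("the target column has no values")
    if all(is_number(v) for v in values):
        return "regression"
    distinct = set(values)
    if len(distinct) == 2:
        return "binary classification"
    if len(distinct) >= 3:
        return "multiclass classification"
    raise ValueError("the target column has only one distinct value")


churn_type = detect_problem_type(["yes", "no", "no", "yes"])
calls_type = detect_problem_type(["1", "0", "4", "5"])
```

With the sample data, a "Churn" column of "yes" and "no" values is treated as binary classification, and a "CustServ Calls" column of counts as regression.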

## Step 3: Select Inputs

Not all of your data columns will help you make a prediction. By discarding some of the columns, you may speed up your model-building and/or improve the model's performance. But how do you make that decision? A key point is that you're looking for patterns. Without some variation in the data and some discernible patterns, the data is not likely to be useful.

### Selection Criteria

The four criteria that Auto Model Web uses to determine if a particular column is useful are:

- **Correlation** - how closely do the values resemble the target column?
- **ID-ness** - how different are the values from one another?
- **Stability** - how similar are the values to one another?
- **Missing** - how many missing values are in the column relative to the total?

Auto Model Web takes each column of your data and assesses its quality according to these four criteria. Each of the criteria is measured as a percentage from 0 to 100%. In general, for data to be useful, it should have few missing values, low ID-ness, and not-too-high stability. Correlation is trickier: very low values of the correlation imply low quality, because the data has no relation to the target column, but high values of the correlation may also be problematic, as discussed below.
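To make the four criteria concrete, here is a minimal sketch of how such measures could be computed for a single column. The exact formulas used by Auto Model Web are not given here, so the definitions below are assumptions: ID-ness is the share of distinct values, stability is the share of the most frequent value, and correlation is the absolute Pearson correlation between a numeric column and a numerically encoded target. All function names and example values are hypothetical. Results are fractions between 0 and 1; multiply by 100 for a percentage.

```python
from collections import Counter
from math import sqrt

MISSING = (None, "")  # assumption: how missing cells appear after loading


def _present(values):
    """Return the non-missing values of a column."""
    return [v for v in values if v not in MISSING]


def missing_ratio(values):
    """Share of cells in the column that are missing."""
    return sum(v in MISSING for v in values) / len(values)


def id_ness(values):
    """Share of distinct values among the non-missing cells."""
    present = _present(values)
    return len(set(present)) / len(present) if present else 0.0


def stability(values):
    """Share of non-missing cells that hold the most frequent value."""
    present = _present(values)
    if not present:
        return 0.0
    return Counter(present).most_common(1)[0][1] / len(present)


def abs_correlation(xs, ys):
    """Absolute Pearson correlation between two numeric sequences of equal length."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no information about the target
    return abs(cov) / sqrt(var_x * var_y)


# Hypothetical example columns
phone = ["382-4657", "371-7191", "358-1921", "375-9999"]
intl_charge = [None, None, None, "2.7"]
state = ["KS", "KS", "KS", "OH"]

phone_id_ness = id_ness(phone)              # every value is unique
intl_missing = missing_ratio(intl_charge)   # three of four cells are missing
state_stability = stability(state)          # "KS" fills three of four cells
```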

### Quality Tags

Each column is marked with a quality tag:

- Red - Poor quality data
- Yellow - Needs further examination

A column with a red tag has one or more of the following problems:

- High Missing Values - More than 70% of all values in this data column are missing.
- High ID-ness - The data column has many different values (e.g. an ID column) relative to the number of rows in your data set.
- High Stability - More than 90% of all values in the data column are the same (stable).
- No correlation - This data column has no relation to the target column.
- Perfect correlation - The information in this data column is redundant. See below.

A column with a yellow tag has one of the following problems:

- Low Correlation - a correlation of less than 0.01% indicates that this data column has no relation to the target column and won't be useful for prediction.
- High Correlation - can indicate high quality or low quality! Keep reading.

To understand the issue with high correlation, consider an extreme example: perfect correlation. If you have two columns called X and Y, and X = Y, then the correlation is 100% and X is just another name for Y. If you are predicting X, you would discard the column called Y, because it's redundant. It may be redundant even if the correlation is less than 100%. Ask yourself the following question: will I have access to the data in the highly-correlated column prior to making a prediction? If not, the data is not useful.

In some cases, however, the column is useful for prediction, precisely because it is highly correlated with the target column. Only you can tell for certain. In case of doubt, you can create two models: one with the highly-correlated column and one without, to help you decide which is best.
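The tagging rules in this section can be summarized in a short sketch. The 70% missing-value, 90% stability, and 0.01% low-correlation thresholds come from the lists above. The cut-offs for "high ID-ness" and "high correlation" are not stated, so the two constants below are placeholder assumptions, and the function itself is hypothetical rather than the product's implementation. All inputs are fractions between 0 and 1.

```python
HIGH_ID_NESS = 0.9      # assumption: threshold not stated in this guide
HIGH_CORRELATION = 0.4  # assumption: threshold not stated in this guide


def quality_tag(missing, id_ness, stability, correlation):
    """Return (tag, reasons) for one column; tag is "red", "yellow", or None."""
    red = []
    if missing > 0.70:
        red.append("High Missing Values")
    if id_ness > HIGH_ID_NESS:
        red.append("High ID-ness")
    if stability > 0.90:
        red.append("High Stability")
    if correlation == 0.0:
        red.append("No correlation")
    if correlation == 1.0:
        red.append("Perfect correlation")
    if red:
        return "red", red

    yellow = []
    if correlation < 0.0001:  # less than 0.01%
        yellow.append("Low Correlation")
    elif correlation >= HIGH_CORRELATION:
        yellow.append("High Correlation")
    if yellow:
        return "yellow", yellow

    return None, []  # no issue found


# Hypothetical column summaries
mostly_missing = quality_tag(missing=0.75, id_ness=0.1, stability=0.5, correlation=0.2)
strong_signal = quality_tag(missing=0.0, id_ness=0.01, stability=0.3, correlation=0.57)
```

A yellow tag does not mean the column must be dropped. As discussed above, a highly correlated column can still be a legitimate input if its values are known before the prediction is made.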

### Select inputs, Churn Prediction Data

Auto Model Web identifies the following issues with our Churn Prediction Data:

- **High ID-ness**: the "Phone" number is an ID, unique to each customer. It has no value in predicting churn.
- **Many missing values**: only 3% of the customers have international charges ("Intl Charge"), so this data column won't tell us much.
- **Low correlation**: there is zero correlation between "Account Length" and "Churn". It seems that there is little or no relation between the time a customer has been with the phone company and the probability that he will churn, so "Account Length" is unlikely to be useful.

By default, all of these data columns are deselected. There is one additional column that has been deselected, but it requires further discussion.

**High correlation**: "CustServ Calls" has a 57% correlation with "Churn"

Apparently, the number of customer service calls is a good indicator of churn.
The phone company would be well-advised to take proactive steps to keep
the customer if the customer has called customer service repeatedly.
But do you want to include "CustServ Calls" when building your model?
Let's return to the question we asked a moment ago:
will I have access to the data in the highly-correlated column prior to making a prediction?
In this case, the answer is *yes*.
We choose therefore to include "CustServ Calls" in our model, with the understanding that the
predictions of the model will be heavily weighted towards the value in that column.

If it was already obvious to you that a large number of customer service calls leads to churn, you might exclude this column, and extract what information you can from the rest of the data -- you might even try to predict the number of customer service calls!

Jump ahead to see the results with and without the customer service call data

## Step 4: Select Models

Auto Model Web provides some of the more popular machine learning algorithms, including the following:

**Decision Tree**: a simple, tree-like flowchart model which is easy to understand

**Naïve Bayes**: a simple and fast probabilistic model based on Bayes' Theorem

**Logistic Regression**: a widely-used statistical model for binary classification

**Generalized Linear Model (GLM)**: a generalization of multiple linear regression models

Depending on the type of data in your target column, only a subset of these algorithms may be available.

| | Binary classification | Multiclass classification | Regression |
|---|---|---|---|
| Decision Tree | | | |
| Naïve Bayes | | | |
| Logistic Regression | | | |
| Generalized Linear Model | | | |

Select the models you want to include, and press **Run Analysis**.

**Next**: Inspect models

### Further Reading

The links to the RapidMiner Documentation below provide more information about the predictive model algorithms used in Auto Model Web: