Is your Analytic Model Good?
In business analytics, we often build many models to predict or inform. These models may use different classifiers or analytical techniques. How do we determine which analytic models are better? In this post, we will explore several model comparison techniques for choosing a good analytic model.
The Practical, Business-minded Model Comparison Approach
First and foremost, it is always important to separate your training and validation datasets before building models. Depending on the size of your dataset, you may set aside 10-50% of your data to simply see how your model performs on data that was not used in model building. If you are using time-series data, always choose the most recent data to use in validation. After your model is developed on the training dataset, you calculate predictions for the validation dataset. Based on your predictions, you can determine the accuracy of your model.
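As a minimal sketch of the holdout idea for time-ordered data, the snippet below keeps the most recent rows for validation. The function name holdout_split, the 30% fraction, and the toy observations are assumptions made for illustration; for data with no time order you would typically shuffle the rows before splitting.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def holdout_split(
    rows: Sequence[T], validation_fraction: float = 0.2
) -> Tuple[List[T], List[T]]:
    """Split rows sorted oldest-to-newest; the newest rows become validation."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be strictly between 0 and 1")
    # Always hold out at least one row so the validation set is never empty.
    n_validation = max(1, round(len(rows) * validation_fraction))
    cut = len(rows) - n_validation
    return list(rows[:cut]), list(rows[cut:])


# Hypothetical data: ten observations already sorted from oldest to newest.
observations = list(range(10))
train, validation = holdout_split(observations, validation_fraction=0.3)
print("train:", train)
print("validation:", validation)
```

The model would then be fit on train only, and its predictions scored against validation.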
Model Accuracy = # of correct decisions made / Total number of decisions made = 1 – error rate
However, for most analytical business problems, this does not provide enough context to properly identify the best models. The next step is to separate these correct and incorrect decisions into sub-groups that allow for further insight into model comparison and performance. We classify each decision based on the actual value of the datapoint. Assuming our model is predicting a positive or negative response, we would create a 2×2 matrix with the prediction in the columns and the actual value in the rows. This contingency table is called the confusion matrix.
The confusion matrix identifies which predictions are correct or true, and which predictions are incorrect or false. For example, the false negative, denoted in the top right square in the image above, represents an occurrence where the model classified the observation as a negative outcome, but the actual value of the outcome was positive. The confusion matrix allows us to think more analytically about model comparison and how the predictions that were made may impact our objective.
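To make the four cells concrete, here is a small sketch that counts them and derives accuracy from the counts. The labels (1 for positive, 0 for negative), the function names, and the eight example observations are all assumptions chosen for illustration.

```python
from typing import Dict, Sequence


def confusion_counts(actual: Sequence[int], predicted: Sequence[int]) -> Dict[str, int]:
    """Count the cells of a 2x2 confusion matrix (1 = positive, 0 = negative)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["TP"] += 1  # predicted positive, actually positive
        elif a == 1 and p == 0:
            counts["FN"] += 1  # predicted negative, actually positive
        elif a == 0 and p == 1:
            counts["FP"] += 1  # predicted positive, actually negative
        else:
            counts["TN"] += 1  # predicted negative, actually negative
    return counts


def accuracy(counts: Dict[str, int]) -> float:
    """Correct decisions divided by all decisions."""
    return (counts["TP"] + counts["TN"]) / sum(counts.values())


# Hypothetical validation results.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(actual, predicted)
print(counts)
print("accuracy:", accuracy(counts))
```

Two models with the same accuracy can still differ in how their errors split between false positives and false negatives, which is why the cell counts are worth keeping.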
To build on the confusion matrix, you should perform a cost/benefit analysis by identifying the costs and benefits associated with each decision based on the business context of the model’s outcomes. A simple example is a fund-raising mail campaign. We would calculate the cost of the mailing and materials, which would be associated with each positive prediction of donation. A negative prediction would not receive any mailing, so the cost would be 0. If we were soliciting a $20 donation, the true positives would equate to $20 less the cost of the mailing. The false positives would equal the cost of the mailings. A true negative would have no impact, but a false negative would be a loss of the $20 donation. This extra context may enable better model comparison by showing which models perform better with respect to the actual costs and benefits associated with the various model outcomes.
Further, we can use the cost/benefit analysis to determine the expected value of our business action. To do this, we would multiply the cost or benefit of each cell of the confusion matrix by the number of observations in that cell, sum the results, and divide by the total to get an expected value per observation. In our example, we would be able to state that the expected donation for each mailing would be X dollars, depending on the costs and the number of people in each cell.
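The calculation can be sketched as follows for the mailing example. The $1 mailing cost and the cell counts are made-up numbers, and false negatives are valued at 0 here so that the $20 donation is counted only once; both choices are assumptions of this sketch rather than figures from a real campaign.

```python
from typing import Dict


def total_value(counts: Dict[str, int], values: Dict[str, float]) -> float:
    """Sum of (cell count x cell value) over the four confusion matrix cells."""
    return sum(counts[cell] * values[cell] for cell in counts)


# Hypothetical campaign numbers.
donation = 20.0
mailing_cost = 1.0
counts = {"TP": 30, "FP": 70, "FN": 10, "TN": 890}
values = {
    "TP": donation - mailing_cost,  # mailed and donated
    "FP": -mailing_cost,            # mailed, no donation
    "FN": 0.0,                      # not mailed; assumption: no cost assigned
    "TN": 0.0,                      # not mailed, would not have donated
}

net = total_value(counts, values)
ev_per_person = net / sum(counts.values())
ev_per_mailing = net / (counts["TP"] + counts["FP"])
print("expected value per person:", ev_per_person)
print("expected value per mailing:", ev_per_mailing)
```

Running the same calculation on each candidate model's confusion matrix gives a dollar figure to compare, rather than accuracy alone.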
One important thing to note is that you should be aware of how you are assigning costs and benefits. This will protect you from double counting a cost or benefit. For example, suppose you decide that a donation is worth $20 and assign a true positive a value of $20 and a false negative a value of -$20. What, then, is the improvement in benefit from correctly classifying a mailing? You would calculate $20 for the donation minus (-$20) for the missed opportunity, which is $40. However, intuitively we know that a donation is only worth $20. We are double counting by including both a benefit and a cost for the same donation. To correct this error, we should set one of the two values to zero.
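The arithmetic is easy to check with a few lines of code. The dictionaries and the helper name below are hypothetical; they simply encode the three ways of valuing the donation discussed above.

```python
from typing import Dict


def gain_from_correct_classification(values: Dict[str, float]) -> float:
    """Value gained by turning a false negative into a true positive."""
    return values["TP"] - values["FN"]


double_counted = {"TP": 20.0, "FN": -20.0}          # benefit and cost both assigned
donation_as_benefit = {"TP": 20.0, "FN": 0.0}       # only the benefit is assigned
missed_donation_as_cost = {"TP": 0.0, "FN": -20.0}  # only the cost is assigned

for name, values in [
    ("double counted", double_counted),
    ("donation as benefit", donation_as_benefit),
    ("missed donation as cost", missed_donation_as_cost),
]:
    print(name, gain_from_correct_classification(values))
```

Either of the last two conventions yields the intuitive $20; what matters is picking one and applying it to every model being compared.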
The Academic Approach
Another model comparison technique is to use statistics to gain insight into your model's performance. There are various statistics for measuring a model, so in this post we will focus only on the Akaike Information Criterion (AIC) and Receiver Operating Characteristic (ROC) curves.
The AIC is a statistic grounded in information theory that evaluates models based on their relative complexity and the amount of information lost from choosing different model parameters. The basic notions that you need to know to use AIC are:
- It penalizes models for using more parameters, and
- It considers the likelihood of actually obtaining your observed outcomes given the fitted parameter values.
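Both notions appear directly in the standard formula, AIC = 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood. The sketch below applies it to two candidate models; the model names, log-likelihoods, and parameter counts are invented for illustration and would normally come from your fitting software.

```python
def aic(log_likelihood: float, n_parameters: int) -> float:
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * n_parameters - 2 * log_likelihood


# Hypothetical fits: (maximized log-likelihood, number of parameters).
candidates = {
    "simple_model": (-120.0, 3),
    "complex_model": (-118.5, 8),
}

scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
for name, score in scores.items():
    print(name, score)
```

With these numbers the complex model fits slightly better (higher log-likelihood) but pays a larger penalty for its five extra parameters, so it ends up with the higher score. AIC values are only meaningful relative to one another, for models fit to the same data.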
The general rule of thumb for model comparison purposes is that the model with the lower AIC is the better performing model. It may be less complex, more likely to be accurate, or both. There are many statistics similar to the AIC, so please research your modeling method or look for future posts on actionable-business-analytics to learn more. A list of various model comparison statistics, each with its own Wikipedia article, is shown below:
- Akaike information criterion
- Bayes factor
- Bayesian information criterion
- Deviance information criterion
- False discovery rate
- Focused information criterion
- Likelihood-ratio test
- Mallows’s Cp
- Minimum description length (Algorithmic information theory)
- Minimum message length (Algorithmic information theory)
- Structural Risk Minimization
- Stepwise regression
Another method of model comparison is using an ROC curve. An ROC curve is a graph that plots the True Positive rate against the False Positive rate for one or more models. It is essentially a plot of confusion matrices, one for each setting of a model's decision criterion.
The basic idea behind using ROC curves is that you want to choose the model that maximizes the True Positive rate and minimizes the False Positive rate, which corresponds to the upper left region of the graph. Again, it is ideal to incorporate the costs and benefits associated with each scenario into your decision-making logic. As you may notice in the graph, the dotted diagonal red line denotes a baseline or random model. This essentially compares our model's performance to a 50-50 decision. In the ROC curve, the bottom left point in the graph (0,0) represents the strategy of assigning all observations to negative. Similarly, the top right point on the graph (1,1) represents the strategy of assuming every observation is a positive outcome.
Furthermore, you can calculate the area under the ROC curve to use as a summary statistic for the curve itself. A curve with a larger area underneath theoretically represents a better model.
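As a rough sketch of how the curve and its area can be computed, the code below sweeps the decision threshold over a model's scores, records the False Positive and True Positive rates at each step, and sums the area with the trapezoidal rule. The function names, labels, and scores are assumptions for illustration; in practice a statistics library would usually do this for you.

```python
from typing import List, Sequence, Tuple


def roc_points(actual: Sequence[int], scores: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    n_pos = sum(actual)
    n_neg = len(actual) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative observation")
    points = [(0.0, 0.0)]  # threshold above every score: everything negative
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for a, s in zip(actual, scores) if s >= threshold and a == 1)
        fp = sum(1 for a, s in zip(actual, scores) if s >= threshold and a == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points  # last point is (1.0, 1.0): everything positive


def area_under_curve(points: Sequence[Tuple[float, float]]) -> float:
    """Trapezoidal area under points ordered by increasing false positive rate."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical validation labels (1 = positive) and model scores.
actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
points = roc_points(actual, scores)
auc = area_under_curve(points)
print(points)
print("area under curve:", auc)
```

A random model would trace the diagonal and score about 0.5, so an area well above that suggests the model ranks positives ahead of negatives more often than chance.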
In conclusion, we identified several methods to determine which of our analytical business models are better than others. There are business-minded approaches that incorporate the effect of business decisions and academic approaches that utilize the field of statistics. Ultimately, you have to try different model comparison techniques and logically decide which works best based on your situation. There are occasions where each technique makes more sense to use, and often the various techniques are in agreement.