Why Errors may be Beneficial in Analytics
Have you ever made a mistake on purpose? In analytics, there are occasions where it is beneficial to be wrong. To understand how this could be, let’s start by defining the types of errors that can occur during hypothesis testing, the statistical methodology for deciding whether an observed result is extreme enough to conclude it did not occur by random chance.
Errors and Hypothesis Testing
You might perform a hypothesis test to see if a new machine is more accurate than its counterpart. In this example, the H0, or null hypothesis, claims that the new machine’s accuracy is no better than the old machine’s, while the alternative hypothesis claims it is more accurate. If the test concludes that the accuracy is worse or equal when the new machine really is more accurate, we have committed a Type II Error (false negative). We might return our new machine and re-install the old one, only to achieve less accurate results. On the other hand, if the test concludes that the new machine is more accurate when the old machine is actually better or equal, we have committed a Type I Error (false positive). In this case, we might keep the new machine and realize that accuracy has diminished only after the return policy has expired.
You can control for both types of errors. Statistically speaking, the α of a test, or the p-value cutoff, is the Type I Error rate. The β of a test is the Type II Error rate, and it is tied to the test’s power: power = 1 − β. In our example, we may decide that keeping the expensive new machine even though it isn’t producing more accurate results is the worse scenario. To account for this, we may try to reduce β by increasing the test’s power through various methods, such as increasing the sample size. But what does this have to do with making errors on purpose?
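As a quick sketch of this trade-off, the simulation below estimates empirical Type I and Type II Error rates for a one-sided t-test on two machines' accuracy measurements, and shows how a larger sample shrinks β. The accuracy figures (90% baseline, a 3-point true improvement) are illustrative assumptions, not numbers from the example above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05          # Type I Error rate we are willing to accept
n_simulations = 2000

def new_machine_wins(n, true_diff):
    """Simulate n accuracy measurements per machine; return True if a
    one-sided t-test declares the new machine more accurate."""
    old = rng.normal(loc=0.90, scale=0.05, size=n)
    new = rng.normal(loc=0.90 + true_diff, scale=0.05, size=n)
    _, p = stats.ttest_ind(new, old, alternative="greater")
    return p < alpha

# Type I Error rate: reject H0 when the machines are truly identical
type_1 = np.mean([new_machine_wins(n=30, true_diff=0.0)
                  for _ in range(n_simulations)])

# Type II Error rate: fail to detect a real 3-point improvement
type_2 = np.mean([not new_machine_wins(n=30, true_diff=0.03)
                  for _ in range(n_simulations)])

# Increasing the sample size raises power, i.e. lowers beta
type_2_big_n = np.mean([not new_machine_wins(n=120, true_diff=0.03)
                        for _ in range(n_simulations)])

print(f"empirical alpha:  {type_1:.3f}")   # close to the 0.05 cutoff
print(f"beta at n=30:     {type_2:.3f}")
print(f"beta at n=120:    {type_2_big_n:.3f}")
```

Note that the empirical α lands near the chosen cutoff regardless of sample size, while β falls sharply as n grows.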
Beneficial Errors in Regression
Hypothesis tests are also conducted on the coefficients in statistical models to say with confidence that a coefficient is different from zero. If a coefficient is not statistically different from zero, you cannot tell whether its true effect on the response variable(s) is positive, negative, or absent altogether. In Linear Regression, Multiple Linear Regression, or Multivariate Regression models, these coefficients on the independent variables determine the outcome of your predictions.
Generally, it is an error to leave a predictor in the model if it is not statistically significant; however, there are occasions where it is beneficial to keep it. For example, during the model validation process, you might decide through a literature review that a certain variable is important for predicting your target variable. In this instance, your model may benefit from erroneously retaining that predictor, regardless of the statistical calculations, because it may improve the model’s credibility with regulators or stakeholders. This is often the case during stress testing at large banks, because the Federal Reserve requires the use of macroeconomic variables in the models.
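A minimal sketch of such a coefficient test, fit by hand with NumPy rather than a modeling library: the data are simulated so that a hypothetical predictor `x2` (think of it as a required "macro" variable) has no true effect, yet a regulator might still insist it stay in the model.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200

# Simulated data: x1 truly drives y; x2 has a true coefficient of zero
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 1.5 * x1 + 0.0 * x2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])        # design matrix with intercept
beta = np.linalg.lstsq(X, y, rcond=None)[0]      # OLS coefficient estimates

# Standard errors, t-statistics, and two-sided p-values for "coef != 0"
resid = y - X @ beta
dof = n - X.shape[1]
sigma2 = resid @ resid / dof
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
p_values = 2 * stats.t.sf(np.abs(beta / se), dof)

for name, b, p in zip(["intercept", "x1", "x2"], beta, p_values):
    print(f"{name}: coef={b:+.3f}, p={p:.3f}")
```

Typically `x2`'s p-value will be large, i.e. not statistically different from zero, which is exactly the predictor the article suggests you might keep anyway.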
Another example of a beneficial error in regression is the Biased Regression technique, which is used to account for Multicollinearity in statistical models. Essentially, Multicollinearity occurs when two or more independent variables carry similar information, which inflates the variance of the coefficient estimates and makes them unstable. In Biased Regression, you use a variable reduction technique such as Principal Components Analysis to transform your independent variables into uncorrelated components. Next, you re-run your regression on the new component variables, but you omit the component summarizing the least variation (i.e., the one with the smallest eigenvalue).
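The steps above can be sketched in plain NumPy. The two nearly collinear predictors and their coefficients are illustrative assumptions; the procedure is the one described: run PCA on the predictors, drop the smallest-eigenvalue component, regress on what remains, then map the result back to the original variables.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

# Two nearly collinear predictors -- multicollinearity on purpose
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)
y = 3.0 * x1 + 3.0 * x2 + rng.normal(size=n)

X = np.column_stack([x1, x2])
Xc = X - X.mean(axis=0)

# PCA via eigendecomposition of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]                # sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep the leading component; omit the one with the smallest eigenvalue
k = 1
scores = Xc @ eigvecs[:, :k]

# Regress y on the retained component, then translate the slope
# back into (biased) coefficients on the original predictors
gamma = np.linalg.lstsq(np.column_stack([np.ones(n), scores]), y,
                        rcond=None)[0]
beta_pcr = eigvecs[:, :k] @ gamma[1:]

print("eigenvalue dropped:", round(eigvals[-1], 4))
print("biased (PCR) coefficients:", np.round(beta_pcr, 2))
```

Because the dropped component carries almost no variation here, the biased coefficients land close to the true values while avoiding the wild estimates that ordinary least squares can produce on collinear inputs.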
When you leave out this component, you are committing a beneficial error by deliberately biasing your regression, meaning your predictions will be slightly off target. The trade-off is less variance in your predictions: overall, they will cluster more tightly around a target that sits slightly off the mark. The following image illustrates the results of a Biased Regression:
Beneficial Errors in Sampling
Analysts and statisticians typically prefer simple random samples, stratified random samples, or cluster samples for their models and hypothesis tests, although in certain cases it may be beneficial to make errors in these sampling methods. This generally occurs when the target variable is a rare occurrence. To demonstrate, suppose you are predicting the probability of fraud with a logistic regression model. Fraud rarely occurs (we would hope), so the model would perform “best” by classifying every case as not fraud: if fraud occurs 1% of the time, that model would be correct 99% of the time. Unfortunately, it provides no actionable analytics.
Another option would be to oversample the fraud cases, or use a sample with a higher proportion of fraud than normal. The goal is to distinguish the characteristics of fraudulent activity. Although oversampling is an error in traditional sampling methodology, it can prove beneficial in rare outcome studies. There are also methods to translate the results back to the population of interest.
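One standard way to translate oversampled results back to the population is a prior correction on the model's intercept. The sketch below uses simulated fraud data (the ~1% fraud rate and single predictor are assumptions for illustration): fit a logistic regression on a deliberately 50/50 oversampled set, then subtract ln[((1−τ)/τ)·(ȳ/(1−ȳ))] from the intercept, where τ is the population fraud rate and ȳ the sample fraud rate.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
n = 20_000

# Simulated population: fraud occurs about 1% of the time
x = rng.normal(size=n)
fraud = rng.random(n) < 1 / (1 + np.exp(-(-5.2 + 1.0 * x)))

def fit_logistic(X, y):
    """Maximum-likelihood logistic regression via numerical optimization."""
    def nll(b):
        z = X @ b
        return np.sum(np.logaddexp(0, z)) - y @ z
    return minimize(nll, np.zeros(X.shape[1]), method="BFGS").x

# Oversample: keep every fraud case, subsample an equal number of clean ones
fraud_idx = np.flatnonzero(fraud)
clean_idx = rng.choice(np.flatnonzero(~fraud), size=len(fraud_idx),
                       replace=False)
idx = np.concatenate([fraud_idx, clean_idx])

X = np.column_stack([np.ones(len(idx)), x[idx]])
beta = fit_logistic(X, fraud[idx].astype(float))

# Prior correction: shift the intercept back to the population base rate
tau = fraud.mean()            # population fraud rate (~1%)
ybar = fraud[idx].mean()      # oversampled fraud rate (0.5)
beta[0] -= np.log((1 - tau) / tau * ybar / (1 - ybar))

print("corrected intercept:", round(beta[0], 2), "(true: -5.2)")
print("slope:", round(beta[1], 2), "(true: 1.0)")
```

The slope is unaffected by this kind of outcome-based sampling; only the intercept needs the adjustment, after which predicted probabilities again reflect the population of interest.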
Quantifying your Errors
At this point, we’ve explored how and why you might prefer to make deliberate errors in analytics, but truth be told, not all errors are equal. When possible, always quantify your errors so you can make sound business decisions. In our first example, we weighed the cost of the new machinery against the accuracy it delivered to guide our decision. Similarly, you can assign values to most erroneous situations. These quantifications require justification from different points of view, such as:
- Customer Perceptions
Always understand what actions the statistical model will provoke. In a new drug study, you should be extra cautious about Type I Errors (false positives) because the consequences of wrongly approving a new drug could be fatal. By contrast, if we are reviewing a potential ad campaign, you might not worry about sending 1,000 extra ads to customers who likely will not buy (e.g., compare the $0.15 cost of an ad with the $150 in revenue from a new customer).
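The ad-campaign arithmetic can be made explicit. Using the numbers from the example above, each error type gets a dollar value, which makes the asymmetry obvious:

```python
# Numbers from the ad-campaign example above
ad_cost = 0.15    # cost of one Type I Error: ad sent to a non-buyer
revenue = 150.00  # cost of one Type II Error: a would-be customer never
                  # receives the ad and the sale is lost

wasted_ads_cost = 1000 * ad_cost   # 1,000 false positives
print(f"1,000 wasted ads: ${wasted_ads_cost:.2f}")
print(f"one missed customer: ${revenue:.2f}")
# Break-even: a thousand false positives cost about as much as a single
# false negative, so here we happily tolerate Type I Errors.
```

With costs this lopsided, the rational cutoff for the model is far from the symmetric default, which is exactly why the quantification step matters.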
We are raised and educated to minimize errors, and taught to always do what is right. In analytics, by contrast, there are occasions where making mistakes benefits both models and business decision-making. If anyone asks, be sure to blame Actionable-Business-Analytics.com for all your future mistakes!