On Automated Mining
One of the things I love the most about using statistical modeling software (especially Veera Predict) is that so much of the process is automated. Although automation has made the lives of statisticians much easier (calculating individual standard errors by hand would take hours for each variable), it is still important to be familiar with the methods and thinking that go into the variable selection process. One tab that does a lot of statistical heavy lifting for us is the Automated Mining tab, and I thought it would be good to explore some of the tests that are being used in that tab.
The function of the Automated Mining tab is to determine, variable by variable, which variables are statistically related to the selected y-variable, and which are not. The statistical test will vary from pair to pair depending on the types of variables being compared. One thing that is important to note is that we’re not doing any modeling or looking at the relationships between x-variables yet. The Automated Mining tab and its tests are only deciding which variables have the possibility of being in the predictive model, not which ones will be.
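Test used, by type of Y-Variable (rows) and type of Variable Under Evaluation (columns):

| Y-Variable | Binary | Continuous | Categorical |
|---|---|---|---|
| Binary | Z-Test | Decile Chi-Square | Z-Test |
| Continuous | Z-Test | Decile F-Test/ANOVA | Z-Test |
| Categorical | n/a | n/a | n/a |
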
Chi-Square Test
A chi-square test is performed for any continuous x-variable used to predict a binary y-variable. In our Automated Mining tab, the x-variable is sorted into 10 deciles, and the test determines whether or not the ‘ones’ are randomly distributed across those deciles. This test is more robust than using a linear correlation, as it captures nonlinear relationships as well as relationships that are not well fit by a curve or line.
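To make the decile idea concrete, here is a minimal Python sketch of how such a test could be computed by hand. It is not Veera's implementation: the helper name `decile_chi_square`, the equal-count binning, and the 0.05 significance level are all my assumptions for illustration. The example uses a U-shaped relationship, where the ones sit only at the low and high ends of x, which a linear correlation would largely miss.

```python
# Sketch of a decile chi-square test (hypothetical helper, not Veera's code).
# Assumes y is coded 0/1 and deciles are equal-count bins of the sorted x.

CHI2_CRITICAL_DF9_05 = 16.919  # chi-square critical value, 9 df, alpha = 0.05


def decile_chi_square(x, y, n_bins=10):
    """Return (statistic, degrees of freedom) for ones vs. zeros across x-bins."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n < n_bins:
        raise ValueError("need at least one observation per bin")
    overall_rate = sum(y) / n  # share of ones in the whole sample
    if overall_rate in (0, 1):
        raise ValueError("y must contain both zeros and ones")

    # Rank observations by x, then slice the ranking into equal-count bins.
    order = sorted(range(n), key=lambda i: x[i])
    statistic = 0.0
    for b in range(n_bins):
        members = order[b * n // n_bins:(b + 1) * n // n_bins]
        size = len(members)
        ones = sum(y[i] for i in members)
        zeros = size - ones
        # Expected counts if the ones were spread randomly across the bins.
        expected_ones = size * overall_rate
        expected_zeros = size - expected_ones
        statistic += (ones - expected_ones) ** 2 / expected_ones
        statistic += (zeros - expected_zeros) ** 2 / expected_zeros
    return statistic, n_bins - 1


# U-shaped example: ones appear only in the lowest and highest deciles of x.
x = list(range(1000))
y_u_shaped = [1 if (v < 100 or v >= 900) else 0 for v in x]
stat, df = decile_chi_square(x, y_u_shaped)
print(stat > CHI2_CRITICAL_DF9_05)  # True: the ones are not randomly spread
```

Comparing the statistic to a fixed critical value keeps the sketch dependency-free. A statistics library would normally report a p-value instead.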
Z-Test
A Z-test is used for any binary or categorical predictor, regardless of whether the y-variable it is trying to predict is binary or continuous. It tests whether any one category differs significantly in its y-values from all other categories combined.
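As a rough illustration of the "one category versus all others" comparison, here is a small Python sketch for the binary y-variable case, using a pooled two-proportion Z-test. The function name `category_z_test` and the pooled standard error are my assumptions, not a description of what Veera computes internally. A continuous y-variable would call for comparing means rather than proportions.

```python
import math

# Sketch of a one-vs-rest Z-test for a 0/1 y-variable (hypothetical helper).


def category_z_test(categories, y, target):
    """Compare the rate of ones in `target` with the rate in all other categories."""
    inside = [yi for c, yi in zip(categories, y) if c == target]
    outside = [yi for c, yi in zip(categories, y) if c != target]
    n_in, n_out = len(inside), len(outside)
    if n_in == 0 or n_out == 0:
        raise ValueError("target must split the data into two non-empty groups")

    rate_in = sum(inside) / n_in
    rate_out = sum(outside) / n_out
    # Pooled rate under the null hypothesis that both groups share one rate.
    pooled = (sum(inside) + sum(outside)) / (n_in + n_out)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_in + 1 / n_out))
    if std_err == 0:
        raise ValueError("y must contain both zeros and ones")

    z = (rate_in - rate_out) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up data: category A has a 60% rate of ones, B and C have 40% each.
categories = ["A"] * 100 + ["B"] * 100 + ["C"] * 100
y = [1] * 60 + [0] * 40 + [1] * 40 + [0] * 60 + [1] * 40 + [0] * 60
z_a, p_a = category_z_test(categories, y, "A")
print(p_a < 0.05)  # True: A stands apart from the other categories
```

Running the same function once per category gives one test per level of the predictor, which matches the "any category" wording above.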
F-Test
An F-test is used whenever a continuous x-variable is used to predict a continuous y-variable. In our Automated Mining tab, the data is sorted into deciles of the x-variable and an ANOVA test is run on these deciles to determine whether the mean of the y-variable differs significantly among them. This is more robust than a linear correlation, as it captures nonlinear relationships and those that do not fit a standard curve.
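The decile ANOVA can be sketched in a few lines of Python as well. Again, this is only an illustration under my own assumptions (equal-count deciles and a hypothetical helper named `decile_f_test`), not the software's actual code. It returns the one-way ANOVA F statistic and its degrees of freedom, which you would then compare against an F table or pass to a statistics library for a p-value.

```python
# Sketch of a decile F-test / one-way ANOVA (hypothetical helper).
# Assumes deciles are equal-count bins of the observations sorted by x.


def decile_f_test(x, y, n_bins=10):
    """Return (F, df_between, df_within) for the mean of y across x-bins."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n <= n_bins:
        raise ValueError("need more observations than bins")

    order = sorted(range(n), key=lambda i: x[i])
    grand_mean = sum(y) / n
    ss_between = 0.0  # variation of the bin means around the grand mean
    ss_within = 0.0   # variation of the observations around their bin mean
    for b in range(n_bins):
        group = [y[i] for i in order[b * n // n_bins:(b + 1) * n // n_bins]]
        group_mean = sum(group) / len(group)
        ss_between += len(group) * (group_mean - grand_mean) ** 2
        ss_within += sum((v - group_mean) ** 2 for v in group)

    df_between = n_bins - 1
    df_within = n - n_bins
    if ss_within == 0:
        raise ValueError("no variation within bins; F is undefined")
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# U-shaped example: y is high at both ends of x and low in the middle,
# so the decile means differ even though there is no linear trend.
x = list(range(200))
y = [(v - 99.5) ** 2 for v in x]
f_stat, df_b, df_w = decile_f_test(x, y)
print(f_stat, df_b, df_w)
```

A large F relative to the critical value for those degrees of freedom says the decile means are not all equal, whatever shape the relationship takes.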
Caitlin Garrett, Statistical Analyst at Rapid Insight
