Questions for Future X’s: 3 things you should ask your predictive variables

jon macmillan
senior data analyst

Just like dating, predictive modeling is an iterative process. We sometimes have to go back and revise some of the steps we took, or wipe the slate clean and start over. To prevent some of the common mistakes that come with predictive modeling, there are three questions we should ask our future x-variables before we decide to use them in our analysis.

1. Are you too good to be true?

Variables that have a very high correlation with the outcome, or a very large test statistic, are variables to keep an eye on. These variables are often directly tied to the outcome itself, and while they look highly predictive, they are generally not things you would want to use to help predict the outcome.
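One way to keep an eye on such variables is a quick screening pass before modeling. The sketch below is only an illustration: the feature names, the toy values, and the 0.9 threshold are all assumptions, and a flag means "investigate this variable", not "drop it".

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def flag_suspicious(features, outcome, threshold=0.9):
    """Return names of features whose |correlation| with the outcome meets the threshold."""
    return [
        name
        for name, values in features.items()
        if abs(pearson(values, outcome)) >= threshold
    ]

# Toy data, invented for illustration (1 = outcome occurred).
outcome = [0, 0, 1, 1, 0, 1]
features = {
    "address_count": [1, 1, 3, 2, 1, 3],
    "age": [34, 22, 29, 41, 25, 38],
}

flagged = flag_suspicious(features, outcome)
print(flagged)
```

Anything that lands on the flagged list deserves the question "how was this value recorded, and could the outcome itself have produced it?"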

Example: In law enforcement and corrections, it can be valuable to try to predict an individual’s likelihood of committing a second crime after their first arrest. This gives you an opportunity to identify the risk of recidivism and hopefully intervene to prevent it. One of our customers was trying to do just that, and had a hunch that individuals who change residences a lot may be more likely to commit a second crime.

We were able to count the distinct addresses on file for each individual. We did find a very strong correlation between having multiple addresses and recidivism rates.

This was not because the relationship was real, but because these individuals were in the system multiple times for multiple offenses, and each appearance could add another address to the file. The vast majority of individuals showed multiple addresses simply because they had committed more than one crime, which made the address count an unreliable x-variable.
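To make the problem concrete, here is a minimal sketch with invented booking records. The record layout and the function names are assumptions for illustration, not the customer's actual data or our actual process. It compares the address count taken over the whole file with the count as it would have looked at the time of the first arrest.

```python
from datetime import date

# Hypothetical booking records: (person_id, booking_date, address_on_file).
records = [
    ("A", date(2015, 1, 10), "12 Oak St"),
    ("A", date(2016, 3, 2), "98 Elm Ave"),
    ("A", date(2017, 7, 19), "5 Pine Rd"),
    ("B", date(2015, 5, 4), "40 Main St"),
    ("C", date(2016, 2, 11), "7 Lake Dr"),
    ("C", date(2016, 9, 30), "7 Lake Dr"),
]

def first_booking(records):
    """Earliest booking date for each person."""
    first = {}
    for person, day, _ in records:
        if person not in first or day < first[person]:
            first[person] = day
    return first

def distinct_addresses(records, cutoff=None):
    """Distinct addresses per person, optionally ignoring records after each person's cutoff date."""
    seen = {}
    for person, day, address in records:
        if cutoff is not None and day > cutoff[person]:
            continue
        seen.setdefault(person, set()).add(address)
    return {person: len(addresses) for person, addresses in seen.items()}

naive = distinct_addresses(records)                                  # uses the whole file
at_first = distinct_addresses(records, cutoff=first_booking(records))  # only what was known at first arrest

print(naive)
print(at_first)
```

In this toy data, the naive count is above one only for a person who was booked again, and the count at first arrest is the same for everyone. That is the pattern described above: the variable is mostly a restatement of the outcome.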

2. Would you be available when I need you?

Variables that are time sensitive are always something to be extra cautious with. That isn’t to say that you can never use these types of variables, but we need to make specific provisions for these fields during data preparation. Always plan the data around what will be available when you intend to use the results of the model. This may mean making sure that certain events are captured only up to a particular point in time, which is determined by when you intend to apply the model.

Example: Student success is something that every college in the country is focused on. When students don’t succeed, it negatively impacts the student, the institution, and society as a whole. Currently, four out of every ten students pursuing a bachelor’s degree end up dropping out. To improve these figures, institutions are predicting each student’s likelihood of being retained so they can proactively reach out and help students avoid potential roadblocks. When building these models, it is enticing to include things like the student’s term GPA. However, if you want to apply the model at the start of the term, when you would have more time to react, this information will not yet be available.

A more subtle and possibly more challenging example of this would be something like how often a student meets with an advisor. If you are able to track this information, you need to be sure that a student would have had an opportunity to meet with an advisor by the point at which you are applying the model.

If you want to be able to apply a model at the start of the term, pick a date on which you would like to do so, or decide on a certain number of days from the start of the term. When preparing the data for modeling, limit the collection of that information to the specific point in time you chose, so that you are only looking at advisor contacts on or before that date.
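As a rough sketch of that cutoff rule, the snippet below counts advisor contacts only up to a chosen scoring date. The term start date, the 14-day offset, the student IDs, and the contact log are all made up for illustration.

```python
from datetime import date, timedelta

# Assumed values: a term start date and a decision to score students 14 days in.
TERM_START = date(2024, 8, 26)
SCORING_DAY = TERM_START + timedelta(days=14)

# Hypothetical advisor contact log: (student_id, contact_date).
advisor_contacts = [
    ("s1", date(2024, 8, 28)),
    ("s1", date(2024, 10, 3)),   # after the cutoff, must not be counted
    ("s2", date(2024, 9, 9)),    # on the cutoff date, counted
    ("s2", date(2024, 9, 10)),   # one day late, not counted
    ("s3", date(2024, 11, 1)),
]
students = ["s1", "s2", "s3"]

def contacts_as_of(contacts, students, cutoff):
    """Number of advisor contacts per student on or before the cutoff date."""
    counts = {student: 0 for student in students}
    for student, day in contacts:
        if student in counts and day <= cutoff:
            counts[student] += 1
    return counts

features = contacts_as_of(advisor_contacts, students, SCORING_DAY)
print(features)
```

The same cutoff should be applied to the historical terms used for training, so that the model learns from the information it will actually have on scoring day.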

3. Would I introduce you to my parents?

This last question is something that we should ask all of our future X’s. Does the variable make sense in relation to what you are predicting and, if so, can you explain the relationship? One of the hardest aspects of predictive modeling is sharing the results and conveying what they mean to those who will be using them. You need to be able to show, easily and clearly, how the relationship makes sense and what it means for your organization.

Example: Let’s look at it from a sales perspective. Imagine how beneficial it would be to receive a list of leads in which each prospect has been assigned a probability of purchasing. This is exactly what some of our customers are doing. When looking at a customer’s likelihood to purchase, or to respond to a particular marketing campaign, sometimes the distance between customer and retailer shows the inverse of the relationship you might expect. For instance, you may find that customers who live closer to a retailer are actually less likely to purchase. Now we need to come up with a hypothesis as to why, and be able to test it and prove or disprove the theory.

In this case, we could hypothesize that customers living closer to the retailer were living in urban locations, meaning there is more competition for in-store purchases. We could then show that the higher purchase rate among more distant customers was driven by our online sales, giving evidence to support our original hypothesis.
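One simple way to check a hypothesis like this is to split the purchase rate by sales channel within each distance group. The sketch below uses invented customers and two assumed distance bands ("near" and "far"); it illustrates the kind of breakdown, not our customers' data.

```python
from collections import defaultdict

# Hypothetical customers: (distance_band, purchase_channel), with None meaning no purchase.
customers = [
    ("near", "store"), ("near", None), ("near", None), ("near", None),
    ("far", "online"), ("far", "online"), ("far", "store"), ("far", None),
]

def purchase_rates(customers):
    """Share of customers in each distance band who purchased, split by channel."""
    totals = defaultdict(int)
    by_channel = defaultdict(lambda: defaultdict(int))
    for band, channel in customers:
        totals[band] += 1
        if channel is not None:
            by_channel[band][channel] += 1
    return {
        band: {channel: n / totals[band] for channel, n in by_channel[band].items()}
        for band in totals
    }

rates = purchase_rates(customers)
print(rates)
```

In this toy data the in-store rate is the same for both bands, and the entire gap comes from online purchases by distant customers. A breakdown like that is much easier to explain to the people using the model than a bare coefficient on distance.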

Are there any additional questions that you think you should ask your future X’s? Please let us know in the comments below!