Eystx Blog: Understanding and Correcting Sample Bias in Data Science

In the realm of applied data science, a prevalent challenge is inferring accurate conclusions about a population from a dataset that may not adequately represent it. Take, for instance, political polling via landline, often resulting in a disproportionate number of older and female respondents. This type of sampling bias can significantly distort summary statistics, such as the overall support for a political candidate, and impact the accuracy of predictive models.

A parallel issue arises in the commercial sector when companies base their models on current customer data while aiming to attract a different demographic segment. Relying solely on existing customer data for statistical analyses and business growth strategies can lead to misleading insights about potential new customer segments. Furthermore, models developed from this data tend to be fine-tuned for the demographic they represent, rather than the new target demographic.

The accompanying figure illustrates how sample bias can shift the decision boundary between two classes, using predictors X1 and X2. It highlights how the distribution of the predictors differs in each scenario.

In this blog post, we delve into the challenge of modeling political preferences for one of two candidates using a non-representative training set, illustrated by the example data mentioned earlier. We will explore strategies to correct summary statistics, such as the overall support for a candidate, impacted by this bias.

Imagine we have two variables, X1 and X2, for each individual in our database. Our goal is to develop a model predicting preference for Candidate A over Candidate B, leveraging these variables. This model should also accurately represent a specific group of interest, like likely voters.

For model creation, we gather candidate preferences from numerous individuals to understand the interplay between X1, X2, and candidate preference. We’ll call this collection of data the ‘training set’, while the ‘test set’ represents the broader population we’re interested in.

Ideally, we would want our training set to mirror the test population, but obtaining a representative sample is often impractical. There could be issues like low survey response rates in the test set, or the survey might have been completed before identifying our target group. Consequently, we need to devise a way to build a classifier using this skewed training set that still performs accurately on the test set.

The subsequent plot illustrates both the training and test sets, and introduces the classifier we aim to develop. Note that in real-world applications, the labels for the test set are typically unknown.

We can see that the differences between these datasets make the classifier that’s best on the training set perform suboptimally on the test set. What can be done about this?

A common approach is to modify the training of the classifier, using individual sample weights that alter the penalty imposed by the loss function when samples are misclassified. The classifier will suffer an increased penalty when up-weighted samples are misclassified and a lower penalty when down-weighted samples are misclassified. In this manner, the classifier will ‘see’ the training set as if it were properly distributed over X1 and X2, and learn a classification scheme which is optimal for the test set distribution.

Perhaps the most straightforward approach to this problem is to separately estimate the joint covariate distributions for the test set and the training set, and then assign each training point in covariate space a weight given by the ratio of its test set density to its training set density.
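Written out, with p_test and p_train denoting the two joint covariate densities and ℓ the classifier's per-sample loss (notation introduced here for illustration only, not taken from the original analysis), the weight and the reweighted training objective look roughly like this:

```latex
w(x_i) = \frac{p_{\text{test}}(x_i)}{p_{\text{train}}(x_i)},
\qquad
\hat{f} = \arg\min_{f} \sum_{i \in \text{train}} w(x_i)\, \ell\bigl(y_i, f(x_i)\bigr)
```

The reason this can work is that, when the relationship between covariates and response is the same in both sets, the weighted average loss over the training distribution equals the unweighted average loss over the test distribution.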

So how do we do this here? In general, it usually works best to model the joint distributions nonparametrically, and below we discuss the Iterative Proportional Fitting Procedure (IPFP) for this purpose. However, our simple example lends itself to modelling the covariates of the training and test sets as bivariate Gaussians, so we just need to estimate the mean vectors and covariance matrices. Each training point’s weight is then the ratio of the fitted test density to the fitted training density at that point, and we refit the classifier on the training set with these weights. The result is in the next figure.

The outcome is that the decision boundary is far closer to the ideal boundary for the test classifier than where we started.
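As a rough illustration of the Gaussian approach, the sketch below fits a mean vector and covariance matrix to each set and weights every training point by the ratio of the two fitted densities. The function names and the synthetic data are assumptions made for this example; this is not the code behind the figures.

```python
import numpy as np

def gaussian_logpdf(X, mean, cov):
    """Log density of a multivariate Gaussian at each row of X."""
    d = X.shape[1]
    diff = X - mean
    # Solve a linear system instead of inverting the covariance matrix.
    solved = np.linalg.solve(cov, diff.T).T
    mahalanobis = np.sum(diff * solved, axis=1)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + mahalanobis)

def gaussian_ratio_weights(X_train, X_test):
    """Weight each training row by p_test(x) / p_train(x) under Gaussian fits."""
    mu_tr, cov_tr = X_train.mean(axis=0), np.cov(X_train, rowvar=False)
    mu_te, cov_te = X_test.mean(axis=0), np.cov(X_test, rowvar=False)
    log_w = (gaussian_logpdf(X_train, mu_te, cov_te)
             - gaussian_logpdf(X_train, mu_tr, cov_tr))
    w = np.exp(log_w)
    return w * len(w) / w.sum()  # rescale so the weights average to one

# Synthetic stand-in for the example data (hypothetical values).
rng = np.random.default_rng(0)
X_train = rng.normal([0.0, 0.0], 1.0, size=(2000, 2))
X_test = rng.normal([1.0, 0.5], 1.0, size=(2000, 2))
weights = gaussian_ratio_weights(X_train, X_test)

# The weighted training mean of X1 should move towards the test mean.
print(np.average(X_train[:, 0], weights=weights), X_test[:, 0].mean())
```

The resulting array can then be passed to any classifier that accepts per-sample weights when fitting, for example through the sample_weight argument of scikit-learn’s LogisticRegression.fit.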

Estimating Weights with IPFP

In general, the covariate distributions cannot be well approximated by a simple parametric model. In that more general case, the binning-based Iterative Proportional Fitting Procedure (IPFP) can be used to arrive at weights.

IPFP is an iterative process in which an initial weight close to one is first assigned to every sample. We then iterate over lower-dimensional marginal (binned) distributions. Within each marginal, we calculate the sample weights which would be necessary for each bin to have exactly the same probability in the weighted training set as in the test set. We then update each sample’s weight in proportion to the difference between the current weight and the optimal weight for this marginal, according to some learning rate. Iteration continues until an error tolerance condition on the difference between the population distribution and the weighted sample distribution is met.
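The procedure described above might be sketched as follows, assuming the features have already been binned into integer codes and that only one-dimensional marginals are matched. The function name, its parameters, and the example data are all hypothetical; with learning_rate=1.0 the update reduces to the classical form in which each marginal is matched exactly in turn.

```python
import numpy as np

def ipfp_weights(train_bins, target_marginals, learning_rate=1.0,
                 tol=1e-6, max_iter=1000):
    """
    train_bins: int array (n_samples, n_features) of bin indices.
    target_marginals: list of 1-D arrays; entry j holds the test-set
        probability of each bin of feature j.
    """
    n, k = train_bins.shape
    w = np.ones(n)
    for _ in range(max_iter):
        max_err = 0.0
        for j in range(k):
            target = np.asarray(target_marginals[j], dtype=float)
            # Weighted training probability of each bin for this marginal.
            current = np.bincount(train_bins[:, j], weights=w,
                                  minlength=len(target)) / w.sum()
            max_err = max(max_err, np.abs(current - target).max())
            # Per-bin factor that would match this marginal exactly.
            # Bins with no weighted training mass are left unchanged.
            ratio = np.divide(target, current,
                              out=np.ones_like(target), where=current > 0)
            optimal = w * ratio[train_bins[:, j]]
            # Move part of the way from the current to the optimal weights.
            w = w + learning_rate * (optimal - w)
        if max_err < tol:
            break
    return w * n / w.sum()  # rescale so the weights average to one

# Hypothetical binned training data and test-set marginals.
rng = np.random.default_rng(1)
train_bins = np.column_stack([
    rng.choice(3, size=5000, p=[0.6, 0.3, 0.1]),
    rng.choice(2, size=5000, p=[0.7, 0.3]),
])
targets = [np.array([0.2, 0.3, 0.5]), np.array([0.4, 0.6])]
w = ipfp_weights(train_bins, targets)
```

Note that a test-set bin with no training samples can never be matched, which is the binned counterpart of a zero training density.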

This algorithm has in its favor extreme simplicity, numerical stability and a long history of publications demonstrating its algorithmic properties such as conditions of convergence. Nevertheless, authors have erred towards modesty, restricting themselves to corrections over a small number of features.

Perhaps that is wise. An intrepid data scientist who uses IPFP to correct over a large feature set will be implicitly dealing with a very noisy high-dimensional PDF and will find their diligence punished by an enormous penalty in standard error. The data scientist may want to try limiting the noise through a dimensionality reduction algorithm, arriving at approximately two orthogonal feature embeddings to correct over.

Pitfalls of Bias Correction with Weights

Deriving explicit weights has its limitations:

High-dimensional joint density estimation is a very hard problem in general and can introduce a lot of uncertainty into the point estimates unless we restrict ourselves to considering only a few features (but if we have to first select those features then we still have added uncertainty).
It may not be very natural to handle a mix of continuous and discrete variables without discretizing the continuous variables.
It is possible to derive extremely large weights that give certain observations too much influence, severely harming the effective sample size and causing an explosion in model variance. We find that better results are often obtained by shrinking weights towards uniformity.
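To make the last point concrete, the sketch below measures the damage done by extreme weights using Kish’s effective sample size approximation, and applies one simple form of shrinkage: a linear blend of each weight with the mean weight. The blend and its shrinkage parameter are assumptions chosen for illustration; other shrinkage schemes, such as clipping, are equally plausible.

```python
import numpy as np

def effective_sample_size(w):
    """Kish's approximation: (sum of weights)^2 / (sum of squared weights)."""
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / np.sum(w ** 2)

def shrink_weights(w, shrinkage=0.5):
    """Blend weights with their mean: 0 keeps them as-is, 1 makes them uniform."""
    w = np.asarray(w, dtype=float)
    return (1.0 - shrinkage) * w + shrinkage * w.mean()

# Hypothetical weights with one very large value dominating the sample.
raw = np.array([0.1, 0.2, 0.5, 1.0, 20.0])
shrunk = shrink_weights(raw, shrinkage=0.5)

print(effective_sample_size(raw))     # barely more than one observation
print(effective_sample_size(shrunk))  # larger after shrinkage
```

Shrinkage trades some remaining bias for lower variance, so the amount of shrinkage is a tuning choice rather than something the data dictates.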

Finally, when all we care about is the ratio of two densities, it seems inefficient to separately estimate two joint densities as a precursor. Indeed, there are other techniques such as kernel mean matching that directly approximate the quantity of interest.

Necessary Assumptions for Bias Correction

More generally, when does correction work and when doesn’t it work? There are two assumptions that need to mostly hold for our approaches to be successful.

Nested supports: every element of the population of interest needs to have a non-zero probability of appearing in the training set.
Covariate shift: conditioned upon all of our covariates, the responses are independent of the sampling bias.

Nested supports are necessary for our weights to be well-defined, because otherwise the denominator in our ratio of densities will be zero in some places. Covariate shift is more subtle: in our example, if sampling were biased by X1 and X2 but not by the actual response, then this assumption would be satisfied. Because the response may also be correlated with X1 and X2, we may find that the response is marginally associated with the sampling bias, but covariate shift is only violated if the sampling bias is correlated with the response even after we’ve controlled for the predictors. This assumption is important, because otherwise the optimal correcting weights would require knowledge of the test set labels, which we don’t have.
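The two assumptions can be stated compactly. In the notation below, which is introduced only for this illustration, x stands for the covariates (X1, X2), y for the response, s = 1 for the event that an individual ends up in the training sample, and p_test and p_train for the covariate densities of the two sets:

```latex
\text{Nested supports:} \quad p_{\text{test}}(x) > 0 \;\Longrightarrow\; p_{\text{train}}(x) > 0

\text{Covariate shift:} \quad P(y \mid x, s = 1) = P(y \mid x)
```

The first condition keeps the density ratio finite wherever the test set has mass. The second says that, once x is known, learning whether someone was sampled tells us nothing more about their response.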

Correcting a Statistic

What if instead of a corrected classifier, we just want to correct our estimate of the proportion of Candidate A supporters? In the survey sampling literature, a popular technique for this is multilevel regression and poststratification (MRP). If we were to use this technique, we would first discretize our sample bias correction features to form a grid. We would then use a multilevel logistic regression to assign a modelled level of support to each cell of the grid. The final result would come from an average of these support levels, weighted according to each cell’s probability under the test distribution. In the standard MRP setting, the test distribution must be externally known, because no individual observations are available from the population of interest. Often, the external distribution is derived from a census.
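The final poststratification step is just a weighted average over grid cells, which the sketch below illustrates. The per-cell support levels and cell probabilities are made-up numbers, and the multilevel regression that would produce the support levels is not shown.

```python
import numpy as np

def poststratify(cell_support, cell_prob):
    """Average the modelled support per cell, weighted by cell probability."""
    cell_support = np.asarray(cell_support, dtype=float)
    cell_prob = np.asarray(cell_prob, dtype=float)
    cell_prob = cell_prob / cell_prob.sum()  # guard against unnormalised input
    return float(np.sum(cell_support * cell_prob))

# Hypothetical four-cell grid.
support = [0.30, 0.45, 0.60, 0.70]     # modelled support for Candidate A
train_prob = [0.40, 0.30, 0.20, 0.10]  # cell shares in the biased sample
test_prob = [0.10, 0.20, 0.30, 0.40]   # cell shares in the target population

print(poststratify(support, train_prob))  # approximately 0.445
print(poststratify(support, test_prob))   # approximately 0.58
```

The same per-cell support levels give quite different overall estimates depending on which distribution over cells is used, which is the whole point of the correction.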

In our case, however, we have individual-level data from the population of interest. This allows us to present a slightly modified approach, which does not depend upon sourcing population marginal or joint distributions and does not require binning of the correction features.

Under covariate shift, the conditional probability that a person prefers Candidate A over B given their values of X1 and X2 does not depend on the sampling bias. We can therefore estimate these probabilities using an uncorrected logistic regression on the training set. Next, we take a large number of samples with replacement from the test set, compute the predicted conditional probability for each, and average them. Provided that our test set is sizeable enough that these bootstrapped samples closely follow the true underlying distribution, this average converges to the marginal probability of support for Candidate A in the test set. This relies upon the additional assumption that the logistic regression predictions are consistent. This consistency assumption is extremely unlikely to hold fully in practice, but if the model performs decently then this provides a simple way to approximately correct this statistic without needing any external distribution information.
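A minimal sketch of this recipe follows, using synthetic data generated so that covariate shift holds by construction. The logistic regression is fitted with a small hand-written Newton solver to keep the example self-contained; the function names, coefficients, and data are assumptions and will not reproduce the numbers quoted in this post.

```python
import numpy as np

def predict_proba(beta, X):
    """P(prefers Candidate A | X1, X2) under a logistic model with intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    return 1.0 / (1.0 + np.exp(-A @ beta))

def fit_logistic(X, y, n_iter=25):
    """Unweighted logistic regression fitted by Newton's method."""
    A = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-A @ beta))
        grad = A.T @ (y - p)
        hess = (A * (p * (1.0 - p))[:, None]).T @ A
        # Tiny ridge term keeps the linear solve well conditioned.
        beta = beta + np.linalg.solve(hess + 1e-8 * np.eye(len(beta)), grad)
    return beta

# Synthetic data: same response model in both sets, different covariates.
rng = np.random.default_rng(0)
true_beta = np.array([-0.5, 1.0, 0.5])  # hypothetical coefficients
X_train = rng.normal([-0.5, 0.0], 1.0, size=(5000, 2))
X_test = rng.normal([0.5, 0.0], 1.0, size=(5000, 2))
y_train = rng.binomial(1, predict_proba(true_beta, X_train))
y_test = rng.binomial(1, predict_proba(true_beta, X_test))  # unknown in practice

beta_hat = fit_logistic(X_train, y_train)

# Resample test covariates with replacement and average the predictions.
resampled = X_test[rng.integers(0, len(X_test), size=20000)]
corrected = predict_proba(beta_hat, resampled).mean()
naive = y_train.mean()

print(naive, corrected, y_test.mean())
```

Here the test labels are generated only so that the corrected estimate can be checked; the estimate itself uses nothing from the test set except its covariates.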

Looking at our example, the sample proportion of Candidate A supporters is 49.9% in the test set, but only 36.8% in the training set. But if we use the above approximation then we get a point estimate of 50.7%, a marked improvement.

Note that we have not touched on how to compute correct standard errors. Frequently, the required covariance computations become intractable with estimated weights because the weights cannot sensibly be assumed to be independent. Indeed, the need for proper inference is a big part of the reason for the difference between the approaches in the machine learning community, such as kernel mean matching, and approaches like MRP that are used in a more inferential survey sampling context.