How to Handle Missing Values?

Practical Advice from Experts: Preventing, Managing Missing Data + Using Synthetic Data

Under the 'Community Twist' tag, we offer practical solutions to questions frequently asked on various forums. Usually, there are 2-3 commentators revealing how they address the issue in their work. Enjoy, and share if you find it useful!

{Technical level: Advanced}

Dealing with missing data is a frequent and tricky issue you'll face when working with real-world datasets. In this article, we'll explain why missing values happen, how to reduce them, and what to do when you can't avoid them. We'll also look at how synthetic data can help solve this problem.

For a well-rounded view, we've consulted three data experts:

  1. Ryan Kearns, a Founding Data Scientist at Monte Carlo

  2. David Berenstein, a Developer Advocate at Argilla

  3. Abhishek Pawar, a Senior Data Scientist at Precisely

Let’s dive in!

1. What is missing data?

Missing values are like missing puzzle pieces in an otherwise complete picture. In a dataset that tracks the height and weight of people, missing values show up when one or both of these details are missing. Just like a puzzle missing a piece can't give you the full image, these gaps in data can make it difficult to get a complete understanding of what you're studying. While these gaps can occur because some people choose not to share this info, that's not the only reason.
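To make the height-and-weight example concrete, here is a minimal sketch in plain Python. The records, the field names (`height_cm`, `weight_kg`), and the helper `count_missing` are all made up for illustration, and `None` is assumed to mark a value that was never captured.

```python
# Toy records: None marks a value that was never captured (an assumption
# of this sketch; real systems may use NaN, empty strings, or sentinels).
people = [
    {"name": "Ana", "height_cm": 170.0, "weight_kg": 65.0},
    {"name": "Ben", "height_cm": None, "weight_kg": 80.0},
    {"name": "Cy", "height_cm": 182.0, "weight_kg": None},
    {"name": "Di", "height_cm": None, "weight_kg": None},
]


def count_missing(rows, columns):
    """Return {column: number of rows where the value is None}."""
    return {
        col: sum(1 for row in rows if row.get(col) is None)
        for col in columns
    }


missing_counts = count_missing(people, ["height_cm", "weight_kg"])

# Rows where both measurements are present.
complete_rows = [
    row for row in people
    if row["height_cm"] is not None and row["weight_kg"] is not None
]

print(missing_counts)      # {'height_cm': 2, 'weight_kg': 2}
print(len(complete_rows))  # 1
```

Only one of the four records is complete, which is the "missing puzzle piece" problem in miniature.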

In general, missing values can be attributed to several factors, including:

  1. Human Error: Mistakes made during data collection or processing. These errors may arise from oversight, misinterpretation, or other inadvertent actions.

  2. Machine Error: Technical glitches or equipment malfunctions. When data collection relies on machinery or automated systems, any disruptions in their functioning can lead to gaps in the dataset.

  3. Respondent Refusal: Respondents decline to answer a question. This refusal may be due to privacy concerns, discomfort, or other personal reasons.

  4. Drop-Outs: Respondents in longitudinal studies are lost over time. When research spans a long period, people may either withdraw or be lost to follow-up. This is particularly problematic if they contribute crucial information.

The presence of missing values in your data can have a significant impact on the quality of your predictions. But is there a way to minimize the occurrence of missing data in your business?

2. Strategies for Minimizing Missing Data

There are several best practices focused on ensuring data quality and avoiding missing data. Ryan Kearns has shared how these practices are implemented within their company.

The best practice is a combination of being declarative about your invariants and using automated data observability to handle scale.

High-importance, load-bearing data assets need to be treated with priority (you'll hear phrases like "certified gold data," Airbnb has their "Midas Certified" standard, etc.). For these datasets, employ code-based checks such as:

"I expect these 5 categorical values in this column; so if I ever see 4 or 6 distinct values, that's a problem."

On the scalability point, we actually invest a lot into our Monte Carlo Monitors-as-Code setup, which configures monitors for things like the above. dbt tests can also provide this sort of coverage. We benefit from a mix of monitoring where critical assets, like the feature table immediately serving an ML model, have detailed testing and validation on every important field, while the rest of the pipeline upstream of this has "lighter" monitoring like freshness and volume metadata monitoring that MC has out-of-the-box.

Ryan Kearns, a Founding Data Scientist at Monte Carlo
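The kind of declarative check Ryan describes can be sketched in a few lines of plain Python. This is not Monte Carlo's Monitors-as-Code or a dbt test; the function names, the five plan values, and the `min_rows` threshold are hypothetical and only illustrate the idea of stating an invariant and flagging violations.

```python
def check_expected_categories(values, expected):
    """Return a list of problems if the distinct non-null values differ from `expected`."""
    seen = {v for v in values if v is not None}
    problems = []
    unexpected = seen - expected
    absent = expected - seen
    if unexpected:
        problems.append(f"unexpected categories: {sorted(unexpected)}")
    if absent:
        problems.append(f"expected categories never seen: {sorted(absent)}")
    return problems


def check_volume(row_count, min_rows):
    """A 'lighter' volume check: flag loads that are suspiciously small."""
    if row_count < min_rows:
        return [f"only {row_count} rows, expected at least {min_rows}"]
    return []


# Hypothetical invariant: this column should hold exactly these 5 values.
EXPECTED_PLANS = {"free", "basic", "pro", "team", "enterprise"}
plan_column = ["free", "pro", "basic", "team", None, "trial"]

problems = (
    check_expected_categories(plan_column, EXPECTED_PLANS)
    + check_volume(len(plan_column), min_rows=5)
)
for problem in problems:
    print(problem)
```

Here the column still has five distinct values, but one is unexpected and one expected value never appears, so comparing against the declared set catches what a simple distinct count would miss.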

Ryan stressed some important steps for getting good data and building a data-focused culture in your company. For startups or teams new to data, there are two more key points to consider:

  • Training Data Users: Teaching your data team the right skills is crucial. Training and classes can help them understand why good data matters, follow the rules and use the best tools effectively.

  • Building a Data Quality Mindset: It means everyone is on the same page about the vision, values, and goals related to data quality. It involves fostering a sense of ownership and accountability for data quality among data users.

However, there are situations in which avoiding missing values is simply not feasible. This can occur for various reasons, including the irreversibility of data collection, occasional errors, or the nature of the data itself. Here is what Abhishek Pawar told us:

Missing data often carries inherent signals or patterns within the dataset. Every instance of missing data is typically indicative of a specific underlying cause. Therefore, to effectively mitigate these biases, it is imperative to gain a comprehensive understanding of the data's origin, generation process, and the mechanisms by which it was captured within the system.

Abhishek Pawar, a Senior Data Scientist at Precisely
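One way to act on the idea that missingness carries a signal is to record where values were missing and to check whether the gaps cluster by origin. The sketch below is an illustration only: the `source` and `income` fields, the sample rows, and both helper functions are hypothetical.

```python
def add_missing_indicators(rows, columns):
    """Copy each row and add a `<column>_was_missing` flag for every listed column."""
    flagged = []
    for row in rows:
        new_row = dict(row)
        for col in columns:
            new_row[f"{col}_was_missing"] = row.get(col) is None
        flagged.append(new_row)
    return flagged


def missing_rate_by_group(rows, group_col, value_col):
    """Share of rows with a missing `value_col`, computed per group."""
    totals, missing = {}, {}
    for row in rows:
        group = row[group_col]
        totals[group] = totals.get(group, 0) + 1
        missing[group] = missing.get(group, 0) + (row.get(value_col) is None)
    return {group: missing[group] / totals[group] for group in totals}


# Hypothetical records captured through two different channels.
records = [
    {"source": "web_form", "income": 52000},
    {"source": "web_form", "income": 48000},
    {"source": "phone", "income": None},
    {"source": "phone", "income": None},
    {"source": "phone", "income": 61000},
]

flagged = add_missing_indicators(records, ["income"])
rates = missing_rate_by_group(records, "source", "income")
print(rates)
```

If one channel accounts for most of the gaps, as the phone channel does in this toy data, that points to a cause in how the data was captured rather than to random loss.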

In such scenarios, our objective shifts to mitigating the impact of these missing values.

3. Potential biases due to missing data

In the realm of modern machine learning, which heavily relies on data, the lack of or inadequate representation of specific data types can severely degrade the performance of predictive systems. Ryan Kearns provided examples:

For example, if one of your APIs is misconfigured and sending junk data, you might end up serving irrelevant ads to customers using the browser where that API is supposed to be working.

You can also make misinformed decisions on the business intelligence side: if data fails to reach a downstream Tableau asset, you might end up convinced that your sales in some segments have tanked when in reality the volume just isn't getting through the pipeline.

Ryan Kearns

David Berenstein reminded us about the biases we’ve seen during the rise of pre-trained language models and later large language models.

Since the introduction of fine-tuning by transfer learning, missing data can be the root cause of significant biases and inequalities in core models like BERT.

We’ve seen this with new models that were first, and sometimes solely, available within their own domain, language, or cultural space, and, despite their more general nature, we can see it ever so clearly with the rise of LLMs.

David Berenstein, a Developer Advocate at Argilla

One of the strategies to mitigate the effect of missing values involves model tracking and deployment monitoring. What else can you do? Let's explore more ways to manage missing data and how to choose the best one for you.

4. Handling missing values

Dealing with missing values can involve:

  1. Removing data points that have gaps.

  2. Filling in those gaps with guesses or estimates, which is known as "imputation."
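The two options above can be sketched side by side in plain Python. The `weights` list is made-up data, `None` is assumed to mark a gap, and mean imputation is used only because it is the simplest estimate to show; in practice a library such as pandas would usually handle this.

```python
from statistics import mean

# Hypothetical weight measurements in kilograms; None marks a gap.
weights = [65.0, None, 80.0, 72.0, None]

# Option 1: remove the data points that have gaps.
dropped = [w for w in weights if w is not None]

# Option 2: impute, here with the mean of the observed values.
fill_value = mean(dropped)
imputed = [fill_value if w is None else w for w in weights]

print(len(weights), len(dropped), len(imputed))  # 5 3 5
print(round(fill_value, 2))                      # 72.33
```

Deletion shrinks the dataset from five points to three, while imputation keeps all five but makes two of them identical estimates. That is the trade-off the rest of this section deals with.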

There are many imputation methods from both traditional statistics and machine learning. But the big question is:

How do we pick the best one from all these options?

This is an important question that requires domain knowledge because each approach introduces different kinds of bias.

For example, with a time series where data is missing from 08:00 to 12:00, the decision to delete, impute, or backfill data categorically depends on what the data represents and what you want your model to be doing.

For time series anomaly detection with a model like Holt-Winters, you need the data either imputed or backfilled because the model can't fit any seasonal predictions if periods are missing. If, instead, you're treating the time axis as a feature and not as a time series, you can probably get away with deleting the missing data or simply not including it.

Also, deciding between imputation and backfilling should be context-sensitive – if the data in question represents a meter reading, linear interpolation or median imputation might work, but if it represents page views per hour instead you need to consider whether filling with zeros makes more sense. Even linear interpolation vs. median vs. mean imputation each introduces different biases to the model and statistical details here can get quite complicated.

Ryan Kearns
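To see how much the choice matters, the sketch below fills the same morning gap three ways: linear interpolation, median imputation, and zeros. The hourly values are invented, and `fill_linear` and `fill_constant` are hypothetical helpers written for this illustration in plain Python.

```python
from statistics import median


def fill_linear(series):
    """Linearly interpolate runs of None that have a known value on both sides."""
    out = list(series)
    i = 0
    while i < len(out):
        if out[i] is None:
            j = i
            while j < len(out) and out[j] is None:
                j += 1
            if i > 0 and j < len(out):
                left, right = out[i - 1], out[j]
                step = (right - left) / (j - i + 1)
                for k in range(i, j):
                    out[k] = left + step * (k - i + 1)
            # Gaps touching either end of the series are left as None.
            i = j
        else:
            i += 1
    return out


def fill_constant(series, value):
    """Replace every None with the same constant."""
    return [value if v is None else v for v in series]


# Hypothetical hourly readings for 06:00 to 13:00; the four readings
# covering 08:00 to 12:00 were never recorded.
hourly = [10.0, 12.0, None, None, None, None, 22.0, 24.0]
observed = [v for v in hourly if v is not None]

as_meter_reading = fill_linear(hourly)
as_median = fill_constant(hourly, median(observed))
as_page_views = fill_constant(hourly, 0.0)

print(as_meter_reading)  # [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0]
print(as_median[2:6])    # [17.0, 17.0, 17.0, 17.0]
print(as_page_views[2:6])  # [0.0, 0.0, 0.0, 0.0]
```

The three filled series tell three different stories about the same morning, which is why the decision has to come from what the data represents rather than from convenience.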

As Ryan mentioned, the effectiveness of handling missing data ultimately hinges on thoughtful design and rigorous analysis. How well a given method performs depends on various factors, including the choice of algorithm, attribute selection, and sampling techniques.

We asked David Berenstein to share his expertise regarding the handling of missing values within the domain of natural language processing (NLP), particularly focusing on the increasingly popular large language models (LLMs) that have garnered significant attention in recent times.

Besides the general push towards creating core models for lower resource languages and domains, we have also seen solutions that impute knowledge based on:

  • Smart heuristics like ensemble model workflows,

  • Cross-lingual model capabilities,

  • and (even) more general models.

An example of an ensemble model workflow that I have worked with during my graduation thesis is the usage of paraphrasing methods to diversify data. There are several direct paraphrase models available but another great approach for doing this is back-translation.

Here n-hop translations between different languages cause the text to slightly deform. Using Google Translate, we can even do this with a span-preserving setting, which might be useful for span-based tasks like entity extraction. We can even apply a 1-hop translation to use lower-resource-language data for an English transformer model.

Due to lexical and semantic overlap in languages, it has proven to be possible to train models with cross-lingual properties. These models can be used to make predictions on multiple languages; hence, fine-tuning a model on language A will ensure it is possible to make decent predictions for language B as well. This approach cannot easily be benchmarked, but it does offer a way to start with higher-quality data, which in turn can be curated for training a language-specific model.

Lastly, we’ve come to the section of (even) more general models, like LLMs.

Simply said, this is achieved with more data and parameters.

Everyone reading this article has heard stories about the vast volume of data and number of parameters of the recent OpenAI models, but researchers behind the previously mentioned cross-lingual models and even “Attention is All You Need” found that this direction was a viable solution to significantly improve performance.

David Berenstein
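The back-translation idea David describes can be sketched as a small function that sends text through one or more pivot languages and back. Everything here is an assumption made for illustration: `back_translate` is a hypothetical helper, and `toy_translate` is a word-lookup stand-in for a real translation model or service, not an actual translation API.

```python
def back_translate(text, pivots, translate, source="en"):
    """Translate `text` through each pivot language in turn, then back to `source`.

    `translate(text, src_lang, dst_lang)` is supplied by the caller.
    """
    current, current_lang = text, source
    for lang in list(pivots) + [source]:
        current = translate(current, current_lang, lang)
        current_lang = lang
    return current


# Toy stand-in for a translation system: a word-level lookup table.
# The asymmetric tables imitate how a round trip slightly deforms the text.
TOY_TABLES = {
    ("en", "fr"): {"quick": "rapide", "reply": "réponse"},
    ("fr", "en"): {"rapide": "fast", "réponse": "answer"},
}


def toy_translate(text, src, dst):
    table = TOY_TABLES.get((src, dst), {})
    return " ".join(table.get(word, word) for word in text.split())


original = "a quick reply"
paraphrase = back_translate(original, ["fr"], toy_translate)
print(paraphrase)  # a fast answer
```

Passing more pivot languages gives the n-hop variant. With a real translator plugged in, each paraphrase would still need review before being added to training data.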

Abhishek Pawar continued the topic of the importance of domain knowledge:

I've refrained from imputing any data that relies on domain-specific expertise, especially in fields like healthcare. In such cases, it's always advisable to seek guidance and consultation from Subject Matter Experts (SMEs) prior to undertaking any imputation processes.

Abhishek Pawar

5. Synthetic data

Synthetic data seems to be a new fuel for data science. But to what extent can synthetic data serve as a viable solution for handling missing values? And what are the potential pitfalls or challenges associated with it?

Synthetic data is an incredibly powerful tool. 

Famously, DeepMind used a hybrid training approach with AlphaZero using lots of synthetic data, so there is precedent for this being done correctly. But you need to be careful (we're not all DeepMind), especially if the real data generating process you're trying to emulate is complicated.

Again, this question is very contextual since certain model architectures will respond to synthetic data in different ways. If your architecture is interrogable, like a gradient-boosted decision tree, you can probably get away with more synthetic data since your model engineers can identify gaps in the representation and iterate on better synthetic data generation in response.

Ryan Kearns
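As a very small illustration of the caution Ryan recommends, the sketch below fills gaps with values sampled from the observed data and keeps a flag for every generated value so that it can be audited later. Random-sample filling is swapped in here as the simplest possible generator; it is not the approach DeepMind used or one the experts prescribe, and `fill_with_sampled_values` and the sample data are hypothetical.

```python
import random


def fill_with_sampled_values(values, seed=0):
    """Replace None with values drawn at random from the observed ones.

    Returns (filled_values, is_synthetic_flags) so generated entries stay traceable.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to sample from")
    filled, flags = [], []
    for v in values:
        if v is None:
            filled.append(rng.choice(observed))
            flags.append(True)
        else:
            filled.append(v)
            flags.append(False)
    return filled, flags


# Hypothetical measurements with two gaps.
measurements = [3.1, None, 2.8, 3.4, None, 3.0]
filled, is_synthetic = fill_with_sampled_values(measurements)

synthetic_share = sum(is_synthetic) / len(is_synthetic)
print(filled)
print(is_synthetic)
print(round(synthetic_share, 2))
```

Keeping the flags makes it possible to compare model behaviour on real versus generated values, which is the kind of inspection and iteration the quote above has in mind.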

Conclusion

In the labyrinth of real-world data science, missing data is less an anomaly and more a constant companion. Whether you opt for minimizing data gaps through a data observability framework, make calculated decisions on imputations based on domain-specific expertise, or experiment with synthetic data, the key lies in understanding the origin and nature of your dataset. Ultimately, a thorough understanding of your data's origins, the context in which it exists, and the specific objectives of your analytical tasks is essential for choosing the most effective approach to managing missing values.

Feel free to share your own strategies or thoughts by reaching out to us, as this is an evolving discourse in the data science community: You can reply to this email, leave a comment, or send us a note at [email protected]

If you find it useful, please share it across your social networks or use our referral system 👇 Your referrals will be growing, and eventually, it will lead to some great gifts 🤍 We are working on them right now!

Thank you for your support!
