In the domain of Data Mining, it is important to handle missing values and outliers in a dataset, since they can seriously distort the analysis and the business decisions built on it if they are not taken care of. Missing Values and Outliers may seem negligible to beginners, but it is good practice to keep an eye on them from the earliest days of your Data Science journey.
Let's simply discuss what types of Missing Data we can come across when working with datasets.
Types of Missing Data
Missing Completely at Random (MCAR)
MCAR means that the missing data are independent of any of the variables in the dataset. In simple terms, the values are missing completely at random: the missingness does not depend on either the observed or the unobserved variables of the dataset.
Imagine collecting data on customer purchases but accidentally losing some records due to a technical issue with the data storage system. This could be considered MCAR, as the missing data is unrelated to any specific customer or purchase characteristics.
In these instances, the missing data reduce the analyzable population of the study and consequently the statistical power, but they do not introduce bias: when data are MCAR, the data that remain can be considered a simple random sample of the full dataset of interest.
Missing at Random (MAR)
MAR is a little different from MCAR. MAR occurs when the probability of data being missing is related to some observed variables in the dataset. In other words, missingness depends on some known characteristics, but not on the unobserved values themselves.
For example, a survey examining depression may encounter data that are MAR if male participants are less likely to complete a survey about depression severity than female participants. That is, if the probability of completing the survey is related to their sex (which is fully observed) but not to the severity of their depression, then the data may be regarded as MAR. Here we know that the missingness in the severity column is explained by an observed variable: the participant being male.
Missing Not at Random (MNAR)
When data are MNAR, the fact that the data are missing is systematically related to the unobserved data, that is, the missingness is related to events or factors that are not measured by the researcher. MNAR means that the probability of being missing varies for reasons that are unknown to us.
MNAR is the most problematic type of missing data because it's very difficult to account for and can introduce significant bias into your analysis.
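To make the three mechanisms concrete, here is a minimal sketch that injects missingness into a hypothetical toy dataset. The sex/severity columns echo the survey example above, and the probabilities are made up purely for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

# Hypothetical toy dataset: sex and a depression-severity score
df = pd.DataFrame({
    "sex": rng.choice(["male", "female"], size=n),
    "severity": rng.normal(50, 10, size=n),
})

# MCAR: every value has the same 10% chance of being missing
mcar = df.copy()
mcar.loc[rng.random(n) < 0.10, "severity"] = np.nan

# MAR: missingness depends on an observed variable (sex), not on severity itself
mar = df.copy()
p_miss = np.where(mar["sex"] == "male", 0.30, 0.05)
mar.loc[rng.random(n) < p_miss, "severity"] = np.nan

# MNAR: missingness depends on the unobserved value itself
# (high-severity scores are hidden more often)
mnar = df.copy()
p_miss = np.where(mnar["severity"] > 60, 0.40, 0.05)
mnar.loc[rng.random(n) < p_miss, "severity"] = np.nan

print(mcar["severity"].isna().mean(),
      mar["severity"].isna().mean(),
      mnar["severity"].isna().mean())
```

Notice that in the MNAR case the mechanism cannot be recovered from the observed columns alone, which is exactly what makes it so hard to correct for.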
Handling Missing Values: Ad-Hoc Solutions
Missing Values can be handled in several ways. The method that suits your dataset best should be picked only after examining the nature of the missingness. Let us dive into the kinds of solutions we can apply when data are missing.
- Listwise deletion
This method removes entire observations from the dataset if they contain a missing value for even a single variable. It is simple to implement and is what most beginners reach for. However, if the proportion of data deleted this way is large, the method becomes impractical. It can also introduce bias if the missing data are not MCAR (missing completely at random) and the missingness is related to the variables of interest, leading to underestimating or overestimating the true relationships between variables.
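As a rough illustration, assuming a tiny made-up pandas DataFrame, `dropna()` performs listwise deletion by dropping every row that contains at least one missing value:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29],
    "income": [50000, np.nan, 62000, 58000, 47000],
    "purchases": [3, 5, 2, np.nan, 4],
})

# Listwise deletion: keep only the fully complete rows
complete_cases = df.dropna()

print(f"Kept {len(complete_cases)} of {len(df)} rows")
```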
- Pairwise deletion
Here we take a pair of variables in the dataset at a time and analyze only the non-missing values in them. Once the analysis is done for that pair, we move on to the next pair of variables. In other words, this method analyzes each pair of variables separately, using only the observations with complete data for that specific pair, and effectively repeats the analysis once for every pair.
It may be useful for simple analyses involving a small number of variables. However, the analysis reduces the sample size for each particular pair of variables, since the missing values are dropped, and the effective sample size can differ from pair to pair. This can lead to biased estimates if the data are not MCAR.
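A quick sketch of the pairwise idea, again on made-up data: pandas' `corr()` already computes each pairwise correlation from only the rows that are complete for that specific pair, so the effective sample size behind each cell can differ:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29, 37],
    "income": [50000, np.nan, 62000, 58000, 47000, 61000],
    "purchases": [3, 5, 2, np.nan, 4, 6],
})

# Each pairwise correlation uses only the rows complete for that pair
corr = df.corr()
print(corr)

# Effective sample size behind each pairwise correlation
counts = df.notna().astype(int).T.dot(df.notna().astype(int))
print(counts)
```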
- Dropping variables
This is intuitive. If the variable we are exploring has a high percentage of missing values (e.g., exceeding a threshold like 60%), the entire variable is dropped from the dataset. This can be useful if the variable is not critical to the analysis and has a large amount of missing data.
This leads to a loss of information and can be problematic if the missing data is informative (related to other variables). It can also introduce bias if the missingness is not random.
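The sketch below, on a hypothetical DataFrame, drops any column whose share of missing values exceeds an arbitrary 60% cut-off; the threshold itself is a judgment call, not a rule:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, 41, 29, 37],
    "income": [50000, np.nan, np.nan, np.nan, np.nan],  # 80% missing
    "purchases": [3, 5, 2, np.nan, 4],
})

threshold = 0.60  # drop any column missing more than 60% of its values
missing_share = df.isna().mean()
df_reduced = df.drop(columns=missing_share[missing_share > threshold].index)

print(missing_share)
print(df_reduced.columns.tolist())  # ['age', 'purchases']
```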
- Mean Imputation
This method replaces missing values with the average (mean) of the observed values for that variable. It does relatively little harm to the dataset, and because the replacement is the mean of the remaining values, it leaves the variable's mean unchanged.
What if the variable is a categorical one? Then we use the mode, the most frequently occurring value in that variable, as the replacement for missing values.
It can be inappropriate for skewed distributions, where the mean is not representative of the typical value. It also assumes the missing values are randomly distributed around the mean, which may not be true. This can mask underlying patterns and underestimate the variability in the data.
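A minimal pandas sketch of mean and mode imputation on a made-up frame (scikit-learn's SimpleImputer offers the same strategies if you prefer to do this inside a pipeline):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [50000, np.nan, 62000, 58000, np.nan],
    "city": ["Colombo", "Kandy", np.nan, "Colombo", "Galle"],
})

# Numeric column: replace missing values with the column mean
df["income"] = df["income"].fillna(df["income"].mean())

# Categorical column: replace missing values with the mode (most frequent value)
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```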
- Regression Imputation
Finding the missing values becomes a regression task in itself. This method builds a regression model on the observed data to predict the missing values, and the predicted values then replace the missing ones.
It requires choosing an appropriate model and can be complex to implement. The accuracy of the imputation depends on the quality of the model, and simple regression imputation also assumes the relationships between variables are linear, which may not always hold.
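A simple single-pass sketch, assuming a made-up age/income relationship: fit a linear model on the complete rows, then predict the missing values. (scikit-learn's IterativeImputer generalises this idea across several variables at once.)

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200

# Toy data: income roughly linear in age, with some income values missing
age = rng.uniform(20, 60, size=n)
income = 1000 * age + rng.normal(0, 5000, size=n)
df = pd.DataFrame({"age": age, "income": income})
df.loc[rng.random(n) < 0.2, "income"] = np.nan

# Fit a regression on the complete rows, then predict the missing values
observed = df.dropna(subset=["income"])
model = LinearRegression().fit(observed[["age"]], observed["income"])

missing_mask = df["income"].isna()
df.loc[missing_mask, "income"] = model.predict(df.loc[missing_mask, ["age"]])

print(df["income"].isna().sum())  # 0 missing after imputation
```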
It's a wrap for now. Handling missing data in datasets is often treated too casually. Understanding the nature of the variable that has missing values, and the reason the values are missing, is the first step toward handling them. The ad-hoc solutions above need to be applied with care, since they can worsen the reliability of the dataset. It is often advised to bring in the domain knowledge of the specific dataset you are working with.
Missing you fellas till the next article.