1. Missing Data - Why does it matter so much?
Have you ever worked on an analytics project, found blank, NaN or undefined values in the records, and needed to deal with them correctly? This is a routine situation when working with real-world data. Choosing a sound technique for handling these missing values is a crucial step, and it should come only after understanding what analysis the data is meant to support, because data that is valuable to one party can be noise to another. Data can be missing because of corruption, an incomplete extraction process, data entry errors, or simply because the value is rare and genuinely absent. Handling such data well is a real challenge, yet it is necessary for making sound decisions and building robust predictive models or reports. This article summarizes the key steps for handling missing values using Smarten Augmented Analytics and illustrates them with the Employee Salary Prediction dataset.
2. Just leave it or impute it!!
The two most common ways to handle missing data are:
2.1. Remove records with missing values:
Deleting the records that contain missing values is the most common practice. Training only on complete records can help keep a machine learning model robust, and it is reasonable to remove records for which we lack adequate information, since they contribute little to the analysis. However, deletion loses data, and if a large share of the data is missing, the analysis will perform poorly. Consider the employee salary prediction example: employees on, say, the engineering team might not disclose their income and bonus percentage, while team is the main determining factor for salary prediction. In such a scenario, it can be better to drop the incomplete records than to impute unrealistic values.
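Outside Smarten, the same idea can be sketched in a few lines of Python with pandas. This is only an illustration under stated assumptions: the column names and the five toy records are invented for the example and are not taken from the Employee Salary Prediction dataset.

```python
import numpy as np
import pandas as pd

# Toy records invented for illustration; not the article's dataset.
df = pd.DataFrame({
    "Team": ["Sales", "Engineering", None, "Sales", "Finance"],
    "Bonus %": [5.0, np.nan, 7.5, 6.0, np.nan],
    "Salary": [52000, 88000, 61000, np.nan, 70000],
})

# Drop every record that has at least one missing value.
complete = df.dropna()

# Alternatively, drop a record only when a chosen column is missing.
has_team = df.dropna(subset=["Team"])

# Share of records lost by full deletion; a large share is a warning sign.
loss = 1 - len(complete) / len(df)
print(f"Records kept: {len(complete)} of {len(df)} (lost {loss:.0%})")
```

Checking the share of records lost before committing to deletion makes the trade-off described above visible: here most of the toy records would be discarded, which would be too costly on a real dataset.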
2.2. Replace missing values:
Imputing the missing values of a variable reduces the loss of data and preserves the sample size, which can lead to better results. This technique is particularly useful when the dataset is small, because in that case deleting records shrinks the data further and leaves us with less information for decision making.
2.2.1. Replace numeric variables with median
When replacing missing numeric values with a constant, the median is often a better choice than the mean, the mode or other summary statistics, because it is robust to skewed data and to outliers. When data is missing completely at random, it is reasonable to assume that the missing values lie close to the median of the distribution, and imputing the median is a fast way to complete the dataset. However, if a substantial amount of data is missing, this technique distorts the distribution of the data and shrinks its original variance.
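A minimal pandas sketch of median imputation, assuming a hypothetical "Bonus %" column with one outlier; the values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Invented bonus percentages with one outlier (40.0) and two missing values.
bonus = pd.Series([4.0, 5.0, np.nan, 6.0, 40.0, np.nan], name="Bonus %")

# Both statistics skip missing values by default.
median_value = bonus.median()  # barely moved by the outlier
mean_value = bonus.mean()      # pulled upward by the outlier

# Replace every missing value with the median.
filled = bonus.fillna(median_value)

# Compare the spread before and after imputation.
print("median:", median_value, "mean:", mean_value)
print("std before:", bonus.std(), "std after:", filled.std())
```

The standard deviation after imputation is smaller than before, which is the variance distortion mentioned above: every imputed value sits exactly at the center of the distribution.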
2.2.2. Replace categorical variables with mode
When data is missing from a categorical column, it is good practice to replace it with the mode (i.e., the most frequent category). Say that, of all the employee team categories, the most frequently occurring one is Sales. To prevent data loss, we can replace the missing values in the team column with Sales and carry on. However, when there are many categories and several of them occur with more or less the same frequency, this technique may perform poorly.
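In pandas, mode imputation can be sketched as follows. The "Team" values are invented for the example, and the tie check is an extra precaution suggested here, not a step from the Smarten workflow:

```python
import pandas as pd

# Invented team labels with two missing entries.
team = pd.Series(
    ["Sales", "Engineering", None, "Sales", "Finance", None], name="Team"
)

# mode() ignores missing values and returns all tied modes, sorted.
modes = team.mode()
most_frequent = modes.iloc[0]

# More than one mode means the "most frequent" category is ambiguous.
is_tied = len(modes) > 1

# Replace missing entries with the most frequent category.
filled = team.fillna(most_frequent)
print(most_frequent, is_tied)
print(filled.value_counts())
```

When `is_tied` is true, or when the leading category is only marginally ahead, the imputed label is close to arbitrary, which is the weakness described above.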
3. Smarten Assisted Predictive Modelling: Take the Guesswork out of Planning!
Every organization must plan and forecast results. If the enterprise is to succeed, it must strive for accuracy and identify trends and patterns in the market and industry that will help it to predict future results, plan for growth and capitalize on opportunities. Smarten Insight provides predictive modeling capability, along with auto-recommendations and auto-suggestions, to simplify use and allow business users to leverage predictive algorithms without the expertise and skill of a data scientist.
4. Above all else, show the data
Let’s look through the employee salary prediction dataset.
Employee Salary Prediction Dataset
We intend to predict the Salary of employees based on their Gender, whether they belong to Senior Management, the Team they are associated with, and the Bonus percentage they are offered. The dataset has many missing values, which need to be dealt with at the pre-processing stage. Note also that Bonus percentage is the only measure among the predictors; the rest are dimensions. Let’s see how to handle such data using Smarten Augmented Analytics.
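For readers who want to run the same kind of first inspection in code, here is a hedged pandas sketch. The column names mirror the fields described above, but their exact spelling and the four records are assumptions made for illustration, not the real dataset:

```python
import numpy as np
import pandas as pd

# Invented records; column names are assumed, not taken from the real file.
df = pd.DataFrame({
    "Gender": ["F", None, "M", "F"],
    "Senior Management": [True, False, None, True],
    "Team": ["Sales", "Engineering", None, None],
    "Bonus %": [5.0, np.nan, 7.5, 6.0],
    "Salary": [52000.0, 88000.0, np.nan, 61000.0],
})

# Number and share of missing values per column.
missing_count = df.isna().sum()
missing_share = df.isna().mean()

# Numeric columns (measures); here this includes the Salary target as well.
measures = df.select_dtypes(include="number").columns.tolist()

print(pd.DataFrame({"missing": missing_count, "share": missing_share}))
print("measures:", measures)
```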
4.1. Create a new Smarten Insight
Creating a new Smarten Insight.
4.2. Select the data of your interest and click NEXT
Selecting the dataset to be handled for missing values
4.3. Perform Sampling and Filtering if required and click NEXT
Sampling and Filtering using Smarten
4.4. And here we go, perform data cleaning to handle missing data
Look at the data and choose the strategy for handling missing values accordingly. For employee salary prediction, we can safely remove records with missing values in the Gender, Senior Management and Team fields, because the classes are more or less equally distributed and imputing them with the mode would not serve us well. Moreover, the numeric attributes Salary and Bonus percentage also contain missing values, and those records can be removed as well.
Handling missing values using Smarten
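The reasoning behind this choice (roughly balanced classes make the mode a poor stand-in) can be checked in code. The following pandas sketch is an analogue of the decision, not the Smarten implementation; the records and the helper `top_share` are hypothetical:

```python
import pandas as pd

# Invented records for illustration only.
df = pd.DataFrame({
    "Gender": ["F", "M", None, "F", "M", "M", "F"],
    "Team": ["Sales", "Finance", "Engineering", None,
             "Sales", "Finance", "Engineering"],
})

def top_share(column):
    """Share of non-missing rows held by the most frequent category."""
    return column.value_counts(normalize=True).iloc[0]

# A top share near 1 / (number of categories) means no category dominates.
shares = {name: top_share(df[name]) for name in ["Gender", "Team"]}
print(shares)

# With no dominant category, drop the incomplete records instead of imputing.
cleaned = df.dropna(subset=["Gender", "Team"])
print(len(cleaned), "of", len(df), "records kept")
```

How dominant a category must be before mode imputation becomes reasonable is a judgment call that depends on the analysis; the sketch only makes the class balance visible.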
We have to learn to interrogate our data collection process, not just our algorithms! With too little data, we cannot draw conclusions that can be trusted. Making replacements in the data without understanding it will, in turn, give us information that leads to wrong decisions. Hence a healthy trade-off between the two, together with an understanding of why the data is missing, is important for handling the remaining data correctly!
Note: This article is based on Smarten Version 5.2. This may or may not be relevant to the Smarten version you may be using.