Data Preprocessing - Handling Missing Values

Methods to handle the missing values in datasets

In this blog, I would like to cover methods for dealing with missing cells or values in a dataset. While working with any real-world dataset, you have probably run into missing entries and wondered whether the affected rows or columns are still worth using. There are many ways to handle such values. This is part of the data cleaning step in data preprocessing, which makes the data ready for model training and helps produce better results.

Libraries such as pandas and scikit-learn provide pre-defined methods, functions, and modules that help us deal with such datasets by handling all the missing values, whether by filling them, removing them, or applying some other technique.

Let us see what different options we have to deal with those missing values.

  • Get rid of corresponding rows

  • Get rid of the whole attribute

  • Set the values to some value (zero, mean, median, etc.)

We will discuss each option one by one.

  1. Get rid of corresponding rows

    One way to deal with missing values is to delete every row that has an empty value in one of its columns. Assuming the dataset is loaded as a pandas DataFrame, pandas provides the following function to remove such rows:

    dropna() - This function removes every row that contains at least one missing value from the dataset.

    dropna(subset=[column name(s)]) - This function removes only the rows that have missing values in the column(s) listed in the subset parameter (see the sketch after the code below).

     import pandas as pd

     data = pd.read_csv('data.csv')
     # Drop every row that contains at least one missing value
     data = data.dropna()

     print(data)
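
    If you only want to drop rows that have missing values in specific columns, the subset parameter described above can be used. A minimal sketch, assuming data.csv has a column named 'column_name':

     import pandas as pd

     data = pd.read_csv('data.csv')
     # Drop only the rows where 'column_name' is missing; other columns may keep NaNs
     data = data.dropna(subset=['column_name'])

     print(data)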
    
  2. Get rid of the whole attribute

    If an attribute does not contribute much to your analysis and is of little use, it is better to remove it rather than keep it as an extra feature that only adds complexity during computation.

    The drop() function removes an entire attribute (column) from a pandas DataFrame. Pass the column name as the first argument and set the axis parameter to 1.

     import pandas as pd

     data = pd.read_csv('data.csv')
     # Remove the whole column named 'attribute_name'
     data = data.drop('attribute_name', axis=1)

     print(data)
    

    In this example, we load a dataset from a CSV file and use drop() to remove an entire attribute (column) from the DataFrame. The resulting DataFrame contains all rows but without the specified column.
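
    As a small variation, drop() also accepts a list of column names, so several attributes can be removed in one call. A minimal sketch, where 'attribute_one' and 'attribute_two' are placeholder column names:

     import pandas as pd

     data = pd.read_csv('data.csv')
     # Remove two columns at once by passing a list of names
     data = data.drop(['attribute_one', 'attribute_two'], axis=1)

     print(data)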

  3. Set the values to some value (zero, mean, median, etc.)

    In this option, we fill the attribute's missing values with some substitute value that the model can still work with during training and inference. Depending on the requirement, the user can fill those missing values with the mean, the median, zero, or any other value. Filling missing values instead of dropping rows means no data is lost, so the model can be trained on more data. To fill in the missing values, pandas provides the fillna() function: pass the replacement value as the first argument and assign the result back to the column (this avoids relying on inplace=True on a selected column, which recent pandas versions discourage).

     import pandas as pd

     data = pd.read_csv('data.csv')
     # Fill missing values in 'column_name' with that column's median
     median = data['column_name'].median()
     data['column_name'] = data['column_name'].fillna(median)

     print(data)
    

    In this example, we load a dataset from a CSV file and use fillna() to replace any missing values in the specified column with the median of that column. The resulting DataFrame keeps all rows, and the specified column no longer contains missing values.
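
    As another variation, fillna() also accepts a dictionary that maps column names to fill values, so different columns can be filled with different values in a single call. A minimal sketch, where 'age' and 'income' are placeholder column names assumed to exist in data.csv:

     import pandas as pd

     data = pd.read_csv('data.csv')
     # Hypothetical columns: fill 'age' with its mean and 'income' with zero
     data = data.fillna({'age': data['age'].mean(), 'income': 0})

     print(data)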

    To take care of missing values, Scikit-learn provides a handy class called SimpleImputer. First, we create an instance of SimpleImputer and pass the strategy parameter, which decides how the missing values in the corresponding columns will be filled (for example 'mean' or 'median').

    After this, we fit the numerical attributes of the dataset to that SimpleImputer instance. The imputer simply computes the median or mean (whichever strategy was provided) for each attribute and stores the results in its statistics_ instance variable. We can then use this imputer to transform the training set, replacing every missing value with the computed statistic of its corresponding column.

     from sklearn.impute import SimpleImputer
     import pandas as pd

     data = pd.read_csv('data.csv')
     imputer = SimpleImputer(strategy='mean')
     # Fit on the column with missing values, then replace them with the column mean
     imputer.fit(data[['column_name']])
     data[['column_name']] = imputer.transform(data[['column_name']])

     print(data)
    

    Here imputer is an instance of SimpleImputer; we fit it on the column and then transform that same column, filling its missing values with the computed mean.
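
    The description above also mentions fitting the imputer on all the numerical attributes at once and reading the computed values from statistics_. A minimal sketch of that workflow, assuming data.csv may contain non-numeric columns that we leave untouched:

     from sklearn.impute import SimpleImputer
     import pandas as pd

     data = pd.read_csv('data.csv')

     # Keep only the numerical attributes
     numeric_data = data.select_dtypes(include='number')

     # Learn the median of each numeric column; the values end up in statistics_
     imputer = SimpleImputer(strategy='median')
     imputer.fit(numeric_data)
     print(imputer.statistics_)

     # Replace missing values in every numeric column at once
     data[numeric_data.columns] = imputer.transform(numeric_data)

     print(data)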

With all these techniques and methods, one can easily deal with missing values in a dataset and carry out the data cleaning step, an important part of data preprocessing for model training, which helps produce a better model and, in turn, better results.

That's all in this blog.

Thank you!

Akhil Soni

You can connect with me - Linkedin