Handling Text and Categorical Attributes

A data preprocessing step in machine learning

Have you ever noticed? Most of the time, when we work with machine learning models, the data comes as numerical attributes. But suppose our dataset has a categorical attribute. How can a model perform any calculation on that data?

Most machine learning algorithms prefer to work with numbers. So to meet this requirement, we need to convert these categories from text to numbers. Let us frame this as a problem and a solution.

Problem - Is there any way to convert text to numbers as our ML model requires numerical attributes only?

Solution - Scikit-learn provides the OrdinalEncoder class to convert text to numbers.

Ordinal Encoder

Function - Encode categorical features as an integer array.

Input - The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features.

Output - The features are converted to ordinal integers. This results in a single column of integers (0 to n_categories - 1) per feature.

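from sklearn.preprocessing import OrdinalEncoder
# Two categorical columns: a text label and an integer code
X = [['Male', 1], ['Female', 3], ['Female', 2]]
enc = OrdinalEncoder()
enc.fit(X)  # learn the categories of each column
print(enc.categories_)  # one array of sorted categories per column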

Methods

fit(X[, y])

Fit the OrdinalEncoder to X.

fit_transform(X[, y])

Fit to data, then transform it.

get_feature_names_out([input_features])

Get output feature names for transformation.

get_params([deep])

Get parameters for this estimator.

inverse_transform(X)

Convert the data back to the original representation.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Transform X to ordinal codes.

The categories_ attribute gives the list of categories learned during fit. It is a list containing one 1D array of categories for each categorical attribute.
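To see how the methods above fit together, here is a small sketch that reuses the same toy data. It fits the encoder, transforms the data to ordinal codes, and converts the codes back. The variable names (codes, restored) are my own choices for illustration.

```python
from sklearn.preprocessing import OrdinalEncoder

# Same toy data as above: a text column and an integer column
X = [['Male', 1], ['Female', 3], ['Female', 2]]

enc = OrdinalEncoder()
codes = enc.fit_transform(X)  # fit and transform in one step

# Each column gets its own sorted list of categories
print(enc.categories_)

# Every value is replaced by its index in that column's category list
print(codes)

# inverse_transform maps the codes back to the original values
restored = enc.inverse_transform(codes)
print(restored)
```

Here 'Female' becomes 0 and 'Male' becomes 1 because the categories are sorted before being numbered.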

Issue with OrdinalEncoder - The problem with this representation is that ML algorithms will assume that two nearby values are more similar than two distant values.

This representation works well for ordered categories such as "bad", "average", "good", "excellent". But it does not work for unordered categories that have no relation to each other.

Fixing the issue of OrdinalEncoder - A common solution is to create one binary attribute per category: the attribute equals 1 when the sample belongs to that category and 0 otherwise. This is called one-hot encoding.
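One thing to watch for with ordered categories: by default the encoder numbers categories in sorted (alphabetical) order, which may not match their real order. A minimal sketch, assuming a made-up ratings column, shows how the categories parameter of OrdinalEncoder can be used to pass the order explicitly.

```python
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ratings column, used only for illustration
ratings = [["good"], ["bad"], ["excellent"], ["average"]]

# Default behaviour: categories are sorted alphabetically
default_codes = OrdinalEncoder().fit_transform(ratings)
print(default_codes.ravel())

# Passing the order explicitly keeps bad < average < good < excellent
order = ["bad", "average", "good", "excellent"]
enc = OrdinalEncoder(categories=[order])
ordered_codes = enc.fit_transform(ratings)
print(ordered_codes.ravel())
```

With the explicit order, a larger code really does mean a better rating, which is what the model will assume anyway.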
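Before using any encoder class, it may help to see the idea by hand. The sketch below builds one binary column per category with plain NumPy, using a made-up colors column. It is only an illustration of the idea, not how scikit-learn implements it internally.

```python
import numpy as np

# Hypothetical unordered categorical column
colors = np.array(["red", "green", "blue", "green"])

# Sorted unique categories: one output column per category
categories = np.unique(colors)

# Compare every sample against every category; a match gives 1, otherwise 0
onehot = (colors[:, None] == categories[None, :]).astype(int)

print(categories)
print(onehot)
```

Each row contains exactly one 1, sitting in the column of the category that the sample belongs to.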

One-hot Encoding

This is called one-hot encoding because only one attribute will be equal to 1 (hot), while the others will be 0 (cold). The new attributes are sometimes called dummy attributes. Scikit-learn provides a OneHotEncoder class to convert categorical values to one-hot vectors. This encoding is needed for feeding categorical data to many scikit-learn estimators, notably linear models and SVMs with standard kernels.

Function - Encode categorical features as a one-hot numeric array.

Input - The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features.

Output - The features are encoded using a one-hot (aka ‘one-of-K’ or ‘dummy’) encoding scheme. This creates a binary column for each category and returns a sparse matrix or dense array (depending on the sparse_output parameter).

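from sklearn.preprocessing import OneHotEncoder
# handle_unknown='ignore': categories not seen during fit are encoded as all zeros
enc = OneHotEncoder(handle_unknown='ignore')
X = [['Male', 1], ['Female', 3], ['Female', 2]]
enc.fit(X)  # learn the categories of each column
print(enc.categories_)  # one array of sorted categories per column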

Methods

fit(X[, y])

Fit OneHotEncoder to X.

fit_transform(X[, y])

Fit to data, then transform it.

get_feature_names_out([input_features])

Get output feature names for transformation.

get_params([deep])

Get parameters for this estimator.

inverse_transform(X)

Convert the data back to the original representation.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Transform X using one-hot encoding.

By default, the output is a SciPy sparse matrix instead of a NumPy array. This is very useful when a categorical attribute has thousands of categories. After one-hot encoding, we get a matrix with thousands of columns in which each row is all zeros except for a single 1 per encoded attribute. Storing all those zeros would waste a lot of memory, so a sparse matrix stores only the locations of the non-zero elements. It can mostly be used like a normal 2D array, and to convert it to a dense NumPy array we call its toarray() method.
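A small sketch of the sparse output, using a made-up pets column. It relies on the encoder's default settings, so the result of fit_transform is expected to be sparse; scipy.sparse.issparse is used only to check that.

```python
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single categorical column
pets = [['cat'], ['dog'], ['bird'], ['dog']]

enc = OneHotEncoder()
onehot = enc.fit_transform(pets)  # sparse by default

is_sparse = sparse.issparse(onehot)
print(is_sparse)
print(onehot.shape)  # (number of samples, number of categories)
print(onehot.nnz)    # number of stored non-zero entries

# Convert to a dense NumPy array when it is small enough to inspect
dense = onehot.toarray()
print(dense)
```

Only four non-zero entries are stored here, one per row, no matter how many category columns there are.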

In this way, one can deal with text and categorical attributes in a dataset and let the machine learning algorithm perform the calculation and computation to get the desired result.

That's all for this blog! Hope you liked it.

Thank you

- Akhil Soni

You can connect with me on LinkedIn

https://www.linkedin.com/in/akhil-soni-9827181a1/