As someone learning data science with Python, you will find that feature selection is central to constructing effective machine learning models. In practical data science scenarios, it's uncommon for every variable in a dataset to contribute meaningfully to model building. Including redundant variables can diminish a model's ability to generalize and may lower classifier accuracy. Moreover, each additional variable increases the overall complexity of the model.

According to the Principle of Parsimony, often referred to as Occam's Razor, the best explanation of a problem is the one that makes the fewest assumptions. This principle underscores the importance of feature selection in machine learning model development.

- Understanding the significance of feature selection.
- Familiarizing with various feature selection methods.
- Putting feature selection approaches into practice and evaluating performance.

- Introduction to Feature Selection in Machine Learning

- Feature Selection Techniques in Supervised Learning
- Feature Selection Techniques in Unsupervised Learning

- Types of Feature Selection Methods:

- Filter
- Wrapper
- Embedded

Feature selection techniques in machine learning aim to identify the most effective set of features for constructing optimized models of observed phenomena.

These techniques are commonly grouped by whether the data is labeled:

Supervised techniques are suitable for labeled datasets and aim to pinpoint relevant features that enhance the performance of supervised models for classification and regression, such as linear regression, decision trees, and SVM.

Unsupervised techniques are applicable to data without labels. They are used alongside methods such as K-Means Clustering, Principal Component Analysis, and Hierarchical Clustering.

From a classification perspective, these techniques fall into categories such as filter, wrapper, embedded, and hybrid methods.

Next, we will delve into detailed explanations of these widely used feature selection methods in machine learning.

We'll be using the Pima Indians Diabetes dataset. Please download the dataset from the provided link to follow along with the examples.

Code: https://colab.research.google.com/drive/1Wxm9RGuDtV_0kBWgEjax5TZTPbKa8Kvq

Filter methods score features on their intrinsic statistical properties, typically with univariate statistics, rather than on cross-validated model performance. They are faster and less computationally intensive than wrapper methods, which makes them the cheaper choice for high-dimensional data.

Some commonly used filter methods include:

Information gain measures the reduction in entropy when transforming a dataset. Entropy is a measure of randomness or uncertainty in the data; higher entropy means more disorder, and lower entropy means less disorder. In the context of feature selection, information gain evaluates how much knowing a feature helps in predicting the target variable. By calculating the information gain of each variable, we can determine which features reduce the uncertainty (entropy) the most and are therefore the most informative for our model.

`import pandas as pd`

from sklearn.feature_selection import mutual_info_classif

import matplotlib.pyplot as plt

%matplotlib inline

`# Load the dataset`

dataframe = pd.read_csv('/content/diabetes.csv')

`# Separate the features and the target variable`

X = dataframe.iloc[:, :-1]

`# All columns except the last one`

Y = dataframe.iloc[:, -1]

`# The last column`

# Compute the mutual information

importances = mutual_info_classif(X, Y)

`# Create a series with the feature importances`

feat_importances = pd.Series(importances, index=X.columns)

`# Plot the feature importances`

feat_importances.plot(kind='barh', color='teal')

plt.xlabel('Mutual Information')

plt.ylabel('Features')

plt.title('Feature Importances')

plt.show()
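To make the entropy reduction behind information gain concrete, here is a minimal sketch that computes it by hand for discrete features. The toy lists `perfect` and `useless` are hypothetical, and scikit-learn's `mutual_info_classif` estimates the same kind of quantity with a different procedure for continuous features, so its numbers will not match this calculation exactly.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(feature, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical toy data: 'perfect' separates the classes, 'useless' does not.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
perfect = ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b']
useless = ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b']

gain_perfect = information_gain(perfect, labels)   # removes all uncertainty
gain_useless = information_gain(useless, labels)   # removes none
print(gain_perfect, gain_useless)
```

A feature that fully determines the label recovers the whole entropy of the target (1 bit here), while a feature that leaves the class proportions unchanged in every group has a gain of zero.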

The Chi-squared test is used for selecting categorical features in a dataset. By calculating the Chi-square statistic between each feature and the target variable, we can choose the features with the highest Chi-square scores. To properly use the Chi-square test to examine the relationship between different features and the output variable, certain conditions must be met: the variables must be categorical, independently sampled, and each cell should have an expected frequency greater than 5.

`from sklearn.feature_selection import SelectKBest`

from sklearn.feature_selection import chi2

`# Convert to categorical data by converting data to integers`

X_cat = X.astype(int)

`# Three features with highest chi-squared statistics are selected`

chi2_features = SelectKBest(chi2, k=3)

X_kbest_features = chi2_features.fit_transform(X_cat, Y)

`# Reduced features`

print('Original feature number:', X_cat.shape[1])

print('Reduced feature number:', X_kbest_features.shape[1])

Fisher's Score is a popular supervised feature selection method that ranks variables based on their Fisher score in descending order. The algorithm we use provides these ranks, allowing us to select variables accordingly for our specific case.

`!pip install skfeature-chappers`

`from skfeature.function.similarity_based import fisher_score`

import matplotlib.pyplot as plt

%matplotlib inline

`# Calculating scores`

ranks = fisher_score.fisher_score (X.values, Y.values)

`# Plotting the ranks`

feat_importances = pd.Series(ranks, dataframe.columns[0:len(dataframe.columns)-1])

feat_importances.plot(kind='barh', color='teal')

plt.show()
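For intuition about what is being ranked, the following sketch computes a Fisher score with NumPy using one common definition: between-class scatter divided by within-class scatter, per feature. The function name `fisher_scores` and the toy arrays are hypothetical, and the library used above may differ in details such as what exactly it returns.

```python
import numpy as np

def fisher_scores(X, y):
    """Between-class scatter divided by within-class scatter, per feature."""
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for label in np.unique(y):
        X_c = X[y == label]
        between += len(X_c) * (X_c.mean(axis=0) - overall_mean) ** 2
        within += len(X_c) * X_c.var(axis=0)
    return between / within

# Hypothetical data: feature 0 separates the classes, feature 1 does not.
X_toy = np.array([[1.0, 5.0], [2.0, 1.0], [1.5, 3.0],
                  [8.0, 4.0], [9.0, 2.0], [8.5, 3.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

scores = fisher_scores(X_toy, y_toy)
ranking = np.argsort(scores)[::-1]  # feature indices, best first
print(scores, ranking)
```

Feature 0 has class means far apart and little spread within each class, so it scores high. Feature 1 has identical class means, so its score is zero.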

The correlation coefficient quantifies the linear relationship between two variables and indicates how well one can be predicted from the other. Using correlation for feature selection relies on the assumption that good features correlate strongly with the target. Additionally, features should correlate with the target but not with each other.

When two variables are correlated, one can be predicted from the other. Thus, if two features are correlated, including both in the model doesn't provide additional information. In this context, we'll apply Pearson Correlation.

Code

`import seaborn as sns`

import matplotlib.pyplot as plt

%matplotlib inline

`# Correlation matrix`

cor = dataframe.corr()

`# Plotting Heatmap`

plt.figure(figsize = (10,6))

sns.heatmap(cor, annot=True)

To proceed, we establish a threshold, such as an absolute value of 0.5, for variable selection. If predictor variables are found to be correlated, we prioritize those with higher correlation coefficients with the target variable. It's also essential to assess multiple correlation coefficients to identify multicollinearity, where several variables may correlate with each other.
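The two-step rule described above can be sketched on synthetic data as follows. The 0.5 threshold comes from the text. The 0.9 cut-off for deciding that two features duplicate each other, the feature names `a`, `b`, `c`, and the generated data are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
b = a + 0.1 * rng.normal(size=n)      # near-duplicate of a
c = rng.normal(size=n)                # unrelated to the target
target = 2 * a + rng.normal(size=n)

features = {"a": a, "b": b, "c": c}
names = list(features)

# Absolute Pearson correlation of each feature with the target.
target_corr = {name: abs(np.corrcoef(col, target)[0, 1]) for name, col in features.items()}

# Step 1: keep features whose correlation with the target clears the threshold.
threshold = 0.5
candidates = [name for name in names if target_corr[name] >= threshold]

# Step 2: among candidates that are highly correlated with each other,
# keep only the one that is more correlated with the target.
selected = []
for name in sorted(candidates, key=target_corr.get, reverse=True):
    if all(abs(np.corrcoef(features[name], features[kept])[0, 1]) < 0.9 for kept in selected):
        selected.append(name)

print(candidates, selected)
```

Both `a` and `b` pass the first step, but because they carry almost the same information only one of them survives the second step.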

The variance threshold offers a straightforward method for feature selection. It eliminates features whose variance falls below a specified threshold. By default, it filters out features with zero variance—those that have the same value across all samples. While higher-variance features are typically presumed to hold more valuable information, this method does not consider relationships between features or between features and the target variable. This limitation is a notable drawback of filter methods.

`from sklearn.feature_selection import VarianceThreshold`

`# Resetting the value of X to make it non-categorical`

X = X.iloc[:, 0:8]

`# Selecting the first 8 columns`

v_threshold = VarianceThreshold(threshold=0)

v_threshold.fit(X)

`# fit finds the features with zero variance`

v_threshold.get_support()

The `get_support` method returns a Boolean mask. A True value indicates that the feature is kept, meaning its variance is not zero.

The Mean Absolute Difference (MAD) calculates the average absolute deviation of data points from their mean. Unlike variance, MAD uses absolute differences instead of squared differences. It provides a robust measure of dispersion that is less influenced by outliers. A higher MAD indicates greater variability in the data.

`import numpy as np`

`# Calculate MAD`

mean_abs_diff = np.sum(np.abs(X - np.mean(X, axis=0)), axis=0) / X.shape[0]

`# Plot the bar chart`

plt.bar(np.arange(X.shape[1]), mean_abs_diff, color='teal')
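As a small worked example of the definition, the sketch below compares a constant feature with one that varies. The arrays `steady` and `spread` are hypothetical values chosen so the arithmetic is easy to check by hand.

```python
import numpy as np

def mean_abs_diff(values):
    """Average absolute deviation from the mean."""
    return np.mean(np.abs(values - np.mean(values)))

steady = np.array([10.0, 10.0, 10.0, 10.0, 10.0])   # constant feature
spread = np.array([2.0, 6.0, 10.0, 14.0, 18.0])     # values vary around the mean of 10

mad_steady = mean_abs_diff(steady)   # no deviation at all
mad_spread = mean_abs_diff(spread)   # (8 + 4 + 0 + 4 + 8) / 5 = 4.8
var_spread = np.var(spread)          # (64 + 16 + 0 + 16 + 64) / 5 = 32.0

print(mad_steady, mad_spread, var_spread)
```

Because the deviations are not squared, the largest deviations contribute to MAD in proportion to their size, whereas they dominate the variance.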

Wrapper methods search the space of possible feature subsets, assessing the quality of each candidate subset by training and evaluating a model on it. The selection is therefore tied to the specific machine learning algorithm we are trying to fit on a given dataset. Most wrapper methods follow a greedy search strategy that adds or removes one feature at a time against the evaluation criterion, while exhaustive search evaluates every possible combination. Wrapper methods usually result in better predictive accuracy than filter methods, at a higher computational cost.

Let's explore some of such techniques:

This method is an iterative approach that starts with an empty set of features and, at each iteration, adds the feature that most improves the model. It stops when adding a new variable no longer improves the performance of the model.

`# Forward Feature Selection`

from sklearn.linear_model import LogisticRegression

from mlxtend.feature_selection import SequentialFeatureSelector

`lr = LogisticRegression()`

ffs = SequentialFeatureSelector(lr, k_features='best', forward=True, n_jobs=-1)

ffs.fit(X, Y)

features = list(ffs.k_feature_names_)

lr.fit(X[features], Y)

y_pred = lr.predict(X[features])
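The greedy loop that a forward selector runs can be written out explicitly. This is a minimal sketch, not mlxtend's implementation: it swaps logistic regression for ordinary least squares scored on a held-out split so that it needs only NumPy, and the helper names, the `min_gain` stopping tolerance, and the synthetic data are assumptions.

```python
import numpy as np

def validation_error(X_tr, y_tr, X_va, y_va, cols):
    """Fit least squares on the chosen columns and return validation MSE."""
    A_tr = np.column_stack([np.ones(len(X_tr)), X_tr[:, cols]])
    A_va = np.column_stack([np.ones(len(X_va)), X_va[:, cols]])
    coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return np.mean((A_va @ coef - y_va) ** 2)

def forward_select(X_tr, y_tr, X_va, y_va, min_gain=1e-3):
    """Start empty and keep adding the feature that lowers validation error most."""
    selected, remaining = [], list(range(X_tr.shape[1]))
    best_err = np.mean((y_va - y_tr.mean()) ** 2)   # baseline: predict the mean
    while remaining:
        errs = {j: validation_error(X_tr, y_tr, X_va, y_va, selected + [j])
                for j in remaining}
        best_j = min(errs, key=errs.get)
        if best_err - errs[best_j] < min_gain:   # stop when no feature helps enough
            break
        selected.append(best_j)
        remaining.remove(best_j)
        best_err = errs[best_j]
    return selected

# Synthetic data: only features 0 and 3 influence the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 6))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 3] + 0.1 * rng.normal(size=400)

chosen = forward_select(X_demo[:300], y_demo[:300], X_demo[300:], y_demo[300:])
print(chosen)
```

The features are returned in the order they were added, so the most useful feature comes first. Backward elimination is the mirror image: start from all columns and drop one per round.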

This method is also iterative: it starts with all features and removes the least significant feature at each iteration. It stops when removing a feature no longer improves the performance of the model.

`# Backward Feature Selection`

from sklearn.linear_model import LogisticRegression

from mlxtend.feature_selection import SequentialFeatureSelector

from sklearn.model_selection import train_test_split

`# Splitting the data into training and testing sets`

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

lr = LogisticRegression(class_weight='balanced', solver='lbfgs', random_state=42, n_jobs=-1, max_iter=500)

lr.fit(X_train, Y_train)

bfs = SequentialFeatureSelector(lr, k_features='best', forward=False, n_jobs=-1)

bfs.fit(X_train, Y_train)

features = list(bfs.k_feature_names_)

lr.fit(X_train[features], Y_train)

y_pred = lr.predict(X_test[features])

This technique is the brute-force approach to evaluating feature subsets. It generates all possible subsets, trains a model on each, and selects the subset whose model performs best.

`from sklearn.ensemble import RandomForestClassifier`

from mlxtend.feature_selection import ExhaustiveFeatureSelector

`# Create an ExhaustiveFeatureSelector object`

efs = ExhaustiveFeatureSelector(RandomForestClassifier(), min_features=4, max_features=8, scoring='roc_auc', cv=2)

`# Fit the ExhaustiveFeatureSelector object to the full dataset`

efs.fit(X, Y)

`# Print the selected feature indices numerically`

print("Selected feature indices:", efs.best_idx_)

`# Print the final prediction score`

print("Best score (ROC AUC):", efs.best_score_)
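To get a feel for why exhaustive search becomes expensive, the sketch below simply counts the candidate subsets using the standard library. It assumes 8 features and the same size bounds as the selector above. The 30-feature figure is a hypothetical comparison.

```python
from itertools import combinations
from math import comb

n_features = 8
min_features, max_features = 4, 8   # same bounds as the selector above

# Enumerate every candidate subset the way a brute-force search would.
subsets = [
    subset
    for size in range(min_features, max_features + 1)
    for subset in combinations(range(n_features), size)
]

n_subsets = len(subsets)                                   # 70 + 56 + 28 + 8 + 1
n_subsets_30 = sum(comb(30, k) for k in range(1, 31))      # all non-empty subsets of 30 features

print(n_subsets)
print(n_subsets_30)
```

With 2-fold cross-validation each of the 163 subsets is fitted twice, which is manageable here. The count doubles with every extra feature, which is why greedy methods are usually preferred for wider datasets.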

This greedy optimization method selects features by recursively considering smaller and smaller sets of features. The estimator is trained on the initial set of features, and the importance of each feature is obtained from its `coef_` or `feature_importances_` attribute. The least important features are then removed from the current set, and the procedure repeats until the required number of features remains.

`# Recursive Feature Elimination`

from sklearn.feature_selection import RFE

rfe = RFE(lr, n_features_to_select=7)

rfe.fit(X_train, Y_train)

y_pred = rfe.predict(X_train)
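The elimination loop itself is short enough to sketch directly. This is an illustration of the idea under simplifying assumptions, not scikit-learn's `RFE`: it uses least-squares coefficients on standardized features as the importance measure, and the function name and synthetic data are hypothetical.

```python
import numpy as np

def recursive_elimination(X, y, n_keep):
    """Repeatedly refit least squares and drop the feature with the smallest |coefficient|."""
    X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize so coefficients are comparable
    y = y - y.mean()
    active = list(range(X.shape[1]))
    while len(active) > n_keep:
        coef, *_ = np.linalg.lstsq(X[:, active], y, rcond=None)
        weakest = int(np.argmin(np.abs(coef)))   # position within the active list
        del active[weakest]
    return active

# Synthetic data: only features 1 and 4 influence the target.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 5))
y_demo = 4 * X_demo[:, 1] + 2 * X_demo[:, 4] + 0.1 * rng.normal(size=300)

kept = recursive_elimination(X_demo, y_demo, n_keep=2)
print(kept)
```

Refitting after every removal matters: once a feature is dropped, the coefficients of the remaining features can change, so importance is re-estimated on each round.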

These methods combine the advantages of the wrapper and filter methods by accounting for feature interactions while remaining computationally efficient. Embedded methods perform selection as part of model training itself, retaining the features that contribute the most at each training iteration.

Let's go over some of these techniques here:

This method adds a penalty on the model's coefficients to avoid overfitting. With an L1 (lasso) penalty, some coefficients are driven to exactly zero, and the features with zero coefficients can be removed from the dataset.

`from sklearn.linear_model import LogisticRegression`

from sklearn.feature_selection import SelectFromModel

`# Set the regularization parameter C to 1`

logistic = LogisticRegression(C=1, penalty="l1", solver='liblinear', random_state=7).fit(X, Y)

`# Create a SelectFromModel object`

model = SelectFromModel(logistic, prefit=True)

`# Transform the original feature set`

X_new = model.transform(X)

`# Retrieve the indices of the features kept by the L1-penalized model`

selected_features_idx = model.get_support(indices=True)

`# Print the selected feature indices numerically`

print("Selected feature indices:", selected_features_idx)
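To see how an L1 penalty pushes coefficients to exactly zero, here is a minimal sketch of lasso regression solved by proximal gradient descent (ISTA). It deliberately swaps the logistic loss used above for a squared loss to keep the code short, and the function name, `alpha` value, and synthetic data are assumptions.

```python
import numpy as np

def lasso_ista(X, y, alpha, n_iter=500):
    """Minimize (1/2n)*||y - Xw||^2 + alpha*||w||_1 by proximal gradient descent."""
    n, p = X.shape
    w = np.zeros(p)
    step = 1.0 / np.linalg.eigvalsh(X.T @ X / n).max()   # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n
        z = w - step * grad
        # Soft-thresholding: shrink every coefficient and set small ones to exactly 0.
        w = np.sign(z) * np.maximum(np.abs(z) - step * alpha, 0.0)
    return w

# Synthetic data: only features 0 and 1 influence the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + 0.1 * rng.normal(size=200)

weights = lasso_ista(X_demo, y_demo, alpha=0.5)
selected = np.flatnonzero(weights)   # indices of features with non-zero coefficients
print(weights.round(2), selected)
```

The soft-thresholding step is what produces exact zeros. The surviving coefficients are also shrunk toward zero, which is the price paid for the sparsity.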

Random forests are a bagging algorithm that aggregates a specified number of decision trees. Each split in a tree is chosen by how much it improves node purity, that is, by how much it decreases impurity (Gini impurity), and averaging this decrease over all trees gives a natural ranking of the features. Nodes with the greatest decrease in impurity tend to appear near the top of the trees, while nodes with the smallest decrease appear near the bottom. Thus, by pruning trees below a specific node, we can extract a subset of the most important features.

`import pandas as pd`

from sklearn.ensemble import RandomForestClassifier

import matplotlib.pyplot as plt

`# Create the random forest model with your hyperparameters`

model = RandomForestClassifier(n_estimators=348, random_state=7)

`# Fit the model to start training`

model.fit(X, Y)

`# Get the importance of the resulting features`

importances = model.feature_importances_

`# Create a data frame for visualization`

final_df = pd.DataFrame({"Features": X.columns, "Importances": importances})

`# Set 'Features' as the index`

final_df.set_index('Features', inplace=True)

`# Sort in descending order for better visualization`

final_df = final_df.sort_values('Importances', ascending=False)

`# Plot the feature importances in bars`

final_df.plot.bar(color='teal')

plt.title("Feature Importances")

plt.ylabel("Importance")

plt.xlabel("Features")

plt.show()
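The impurity decrease that drives these importances can be computed by hand for a single split. The sketch below uses the standard Gini impurity formula on a hypothetical node. A real forest accumulates such decreases over every split and every tree, weighted by how many samples reach each node.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    return gini(parent) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)

# Hypothetical node with 4 positives and 4 negatives, split two different ways.
parent = [1, 1, 1, 1, 0, 0, 0, 0]
good_split = impurity_decrease(parent, [1, 1, 1, 1], [0, 0, 0, 0])   # pure children
weak_split = impurity_decrease(parent, [1, 1, 0, 0], [1, 1, 0, 0])   # children mirror the parent

print(good_split, weak_split)   # 0.5 0.0
```

A feature that produces splits like `good_split` earns a large share of the total importance, while one that only produces splits like `weak_split` earns almost none.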

Feature selection is a fundamental aspect of machine learning that significantly influences the performance and interpretability of your models. By carefully choosing relevant features, you not only enhance model accuracy but also reduce complexity, making your models more efficient and easier to understand.

In this blog, we explored a variety of feature selection methods, ranging from simple filter techniques to more sophisticated wrapper and embedded methods. Each approach has its unique advantages and use cases, whether you are dealing with supervised or unsupervised learning problems. Understanding when and how to apply these techniques is crucial for developing robust machine learning models.

As you continue to refine your skills in data science, remember the principle of Occam's Razor: simpler models with fewer assumptions are often more effective. By judiciously selecting features, you adhere to this principle, ensuring your models are not only powerful but also generalizable.

Incorporate these feature selection strategies into your workflow to tackle real-world datasets efficiently. Experiment with different techniques to find the optimal feature subset for your specific problem, and always validate your choices with proper evaluation metrics.

Happy coding and model building!


This article was written by Karan Shah, and edited by our writers team.
