Feature reduction is the process of selecting and transforming a subset of features from a large and complex dataset to improve the performance, interpretability, and efficiency of machine learning models. In this article, you will learn about some of the most effective feature reduction techniques in machine learning, and how to apply them in different scenarios.

Selected by the community from 26 contributions.







  

  Francky Fouedjio, Ph.D.

    

  Somesh Chatterjee

    

1 Filter methods

Filter methods are based on the statistical properties of the features, such as correlation, variance, or mutual information, to rank and filter out the irrelevant or redundant ones. Filter methods are fast, scalable, and independent of the learning algorithm, but they do not consider the interactions between the features or the predictive power of the feature subset. Some examples of filter methods are Pearson correlation, chi-square test, and ANOVA.
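
As a rough illustration of the ranking idea, the sketch below scores each feature by the absolute value of its Pearson correlation with the target and keeps the top two. The synthetic data, the function name `rank_by_correlation`, and the choice of two features are assumptions made for this example, not part of any particular library.

```python
import numpy as np

def rank_by_correlation(X, y):
    """Return feature indices sorted by |Pearson r| with y, plus the scores."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Pearson r per column: dot product of centred vectors over the product of norms.
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc)
    # Constant columns have a zero norm; dividing by inf gives them a score of 0.
    scores = np.abs(Xc.T @ yc) / np.where(denom == 0, np.inf, denom)
    order = np.argsort(scores)[::-1]
    return order, scores

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 5))
# Hypothetical target that depends only on features 0 and 3.
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=n)

order, scores = rank_by_correlation(X, y)
top_k = sorted(order[:2].tolist())
print(top_k, np.round(scores, 3))
```

Because each feature is scored on its own, a feature that only matters in combination with another would receive a low score here, which is the limitation described above.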

    Filter methods are foundational for feature reduction in machine learning. They work independently of any learning algorithm, focusing on intrinsic properties of the features. Methods like correlation coefficients and chi-square tests evaluate the relevance of features to the target variable. The advantage? Speed and scalability, especially for large datasets. But remember, these methods don't consider feature interactions and could overlook important predictors. It's crucial to balance simplicity with predictive power. My approach often involves combining filter methods with other techniques for a more nuanced selection. Ultimately, the goal is not just to reduce dimensionality but to enhance model performance with the most relevant features.


    

  Somesh Chatterjee
    

    The curse of dimensionality! Steps like feature filtering help with this problem. What it means is: as the number of features grows, you need more and more data to learn patterns from them. It's much easier to find a pattern in one or two features than in a hundred, even for ML. One of the simplest things we can do about it is feature filtering: filter out features that are similar. One way is to use correlation: if two features are highly correlated, they probably carry similar information and we can drop one of them. There are other techniques, such as ANOVA. Nonlinear measures, such as mutual information from sklearn, can also be used. This is a basic technique and doesn't consider the metric to be optimised or the model to be used.
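
The "drop one of two highly correlated features" idea can be sketched as follows. This is a minimal greedy version under assumed settings: the 0.9 threshold, the function name `drop_correlated`, and the toy data are all illustrative choices, and the result depends on column order.

```python
import numpy as np

def drop_correlated(X, threshold=0.9):
    """Greedily keep a column only if its |r| with every kept column is below threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < threshold for k in kept):
            kept.append(j)
    return kept

rng = np.random.default_rng(1)
a = rng.normal(size=300)
b = rng.normal(size=300)
# Column 1 is almost a rescaled copy of column 0; column 2 is unrelated.
X = np.column_stack([a, 2.0 * a + rng.normal(scale=0.05, size=300), b])

kept = drop_correlated(X)
X_reduced = X[:, kept]
print(kept)
```

Note that this step never looks at the target, so it removes redundancy but says nothing about which of the surviving features are useful for prediction.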


    

  Siddhartha Chandra
    

    Knowledge Distillation, where we train a (student) model to replicate the average behaviour of a larger (teacher) model, or an ensemble of teacher models, is one of the most effective ways to reduce model size in Deep Learning. It's conceptually simple to implement, and has been reported to reduce parameter counts by as much as 90% with little loss in performance, although the achievable compression depends on the task and architecture. Another advantage of this strategy is that no ground truth labels are needed to train the student model; the pseudo ground truth comes from the teacher models' predictions, allowing us to throw massive amounts of unlabelled data into the training pipeline, leading to improvements in generalization. Distillation can be coupled with Neural Architecture Search for a more comprehensive search.
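
To make the "student replicates the teacher" objective concrete, here is one common way to write a distillation loss: the KL divergence between temperature-softened teacher and student output distributions. The temperature value, the small epsilon, and the toy logits are assumptions for this sketch; in practice the student would be trained by gradient descent on such a loss inside a deep learning framework, which is not shown here.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0, eps=1e-12):
    """Mean KL(teacher || student) over the batch, on temperature-softened outputs."""
    p = softmax(teacher_logits, temperature)  # soft targets: no ground truth labels needed
    q = softmax(student_logits, temperature)
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)))

# Hypothetical logits for a batch of two examples and three classes.
teacher_logits = np.array([[4.0, 1.0, 0.5], [0.2, 3.0, 0.1]])
student_logits = np.array([[2.0, 1.5, 0.5], [0.5, 1.0, 0.3]])

loss = distillation_loss(student_logits, teacher_logits)
perfect = distillation_loss(teacher_logits, teacher_logits)
print(loss, perfect)
```

The loss is zero only when the student reproduces the teacher's softened distribution, and it uses the teacher's predictions as the only training signal.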


    

    

    Feature Ranking/Scoring: In filter methods, each feature is individually scored or ranked based on some criteria. Common criteria include correlation with the target variable, variance, mutual information, and statistical tests like the chi-squared test.
    Selection Threshold: A selection threshold is set to determine which features should be retained and which should be discarded. Features that meet or exceed this threshold are selected, while those below it are removed.
    Advantages: Filter methods are fast and do not require training a machine learning model. They can be applied to high-dimensional datasets. They can help identify features that have a strong univariate relationship with the target variable.
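
The score-then-threshold pattern can be shown with variance as the score. In this small sketch the data matrix, the function name `variance_filter`, and the threshold of 0.1 are illustrative assumptions; features whose variance meets or exceeds the threshold are kept.

```python
import numpy as np

def variance_filter(X, threshold):
    """Score each column by its variance and keep those at or above the threshold."""
    variances = X.var(axis=0)
    mask = variances >= threshold
    return mask, variances

X = np.array([
    [1.0, 0.0, 10.0],
    [1.0, 1.0, 20.0],
    [1.0, 0.0, 30.0],
    [1.0, 1.0, 40.0],
])

mask, variances = variance_filter(X, threshold=0.1)
X_reduced = X[:, mask]  # the constant first column is discarded
print(variances, mask)
```

Variance depends on the scale of each feature, so a threshold like this is usually only meaningful when features share comparable units or have been rescaled first.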


    

  Awnish Shankar
    

    Feature selection is crucial for optimizing machine learning models by identifying the most relevant variables. Filter methods, such as correlation coefficients, Chi-square tests, and ANOVA, quickly assess feature relevance using statistical tests. They are efficient but don’t account for feature interactions. To address this, wrapper methods like forward selection and recursive feature elimination evaluate feature subsets through model performance, though they can be computationally intensive. Embedded methods, such as LASSO, integrate feature selection into the model training process, balancing efficiency and interaction consideration.


    

2 Wrapper methods

Wrapper methods use a learning algorithm to evaluate the quality of the feature subset based on a predefined objective function, such as accuracy, F1-score, or AUC. Wrapper methods are more computationally expensive than filter methods, but they can capture the interactions between the features and the target variable, and select the optimal feature subset for a specific model. Some examples of wrapper methods are recursive feature elimination, forward selection, and backward elimination.

    Feature Subset Generation: Wrapper methods generate different subsets of features from the original feature space. This can be done exhaustively by considering all possible combinations or through various search strategies, such as forward selection or backward elimination.
    Model Evaluation: For each feature subset, an ML model (e.g. a classifier or regressor) is trained and evaluated using a performance metric (e.g. accuracy, F1-score). The choice of the performance metric depends on the specific ML task (classification or regression).
    Selection Criterion: The performance of the model on a validation dataset (or through cross-validation) is used as the selection criterion to determine which feature subset is the most informative.
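
The three steps above (generate a subset, evaluate a model on it, select by validation performance) can be sketched as forward selection around ordinary least squares. Everything named here is an assumption for the example: the linear model, the 5-fold split without shuffling, mean squared error as the metric, and the synthetic data.

```python
import numpy as np

def cv_mse(X, y, n_folds=5):
    """Mean squared error of ordinary least squares, averaged over contiguous folds."""
    n = len(y)
    errors = []
    for test_idx in np.array_split(np.arange(n), n_folds):
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        # Add an intercept column, fit on the training part, score on the held-out fold.
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        errors.append(np.mean((A_test @ coef - y[test_idx]) ** 2))
    return float(np.mean(errors))

def forward_selection(X, y, max_features):
    """Add, one at a time, the feature that most lowers the cross-validated error."""
    selected, remaining = [], list(range(X.shape[1]))
    best_score = np.inf
    while remaining and len(selected) < max_features:
        scores = {j: cv_mse(X[:, selected + [j]], y) for j in remaining}
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best_score:
            break  # no candidate improves the validation error
        best_score = scores[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return selected, best_score

rng = np.random.default_rng(2)
n = 200
X = rng.normal(size=(n, 6))
# Hypothetical target that depends only on features 1 and 4.
y = 2.0 * X[:, 1] - 3.0 * X[:, 4] + rng.normal(scale=0.5, size=n)

selected, score = forward_selection(X, y, max_features=2)
print(selected, round(score, 3))
```

Each round refits the model once per remaining feature and per fold, which is where the computational cost of wrapper methods comes from.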


    

  Shailendra Kumar
    

    The Wrapper method is another feature reduction technique that evaluates different subsets of features by training and testing the model on each subset. It uses a specific machine learning algorithm to determine which subset of features produces the best performance. This method can be computationally expensive, especially for large datasets, but it takes into account the interactions between features and can potentially select the most relevant features for the model.


    

  Awnish Shankar
    

    Wrapper methods in feature selection are like fine-tuning a recipe—testing different combinations to find the perfect mix. These methods train a model on various feature subsets and evaluate performance using metrics like accuracy, precision, recall, F1-score, or AUC-ROC. You can start with forward selection, adding features one by one, or use backward elimination to remove the least impactful ones. Recursive feature elimination takes it further, systematically removing the weakest links. While thorough and effective, this approach can be computationally heavy, as it requires training and validating multiple models to ensure the best outcome.


    

  Sandeep Sharma
    

    Wrapper methods in machine learning, crucial for feature selection, use different tactics:
    Recursive Feature Elimination (RFE): Eliminates features step by step to find the best mix.
    Forward Selection: Starts with none, adds features one by one, checking performance each time.
    Backward Elimination: Begins with all features, then removes them to improve the model.
    Stepwise Regression: Combines forward and backward, adding and removing for optimal results.
    These methods are time-consuming but effective, tailoring your model for peak accuracy. They're like thorough auditions to find your model's star features.
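
Recursive feature elimination can be sketched in a few lines: fit a model, drop the feature it ranks as weakest, and repeat. The version below assumes a linear model and uses the absolute size of the least-squares coefficient on standardised features as the importance measure; the function name and synthetic data are also assumptions, and columns are assumed to be non-constant.

```python
import numpy as np

def recursive_feature_elimination(X, y, n_keep):
    """Repeatedly refit least squares and drop the feature with the smallest |coefficient|."""
    # Standardise so coefficient magnitudes are comparable across features.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = y - y.mean()
    active = list(range(X.shape[1]))
    while len(active) > n_keep:
        coef, *_ = np.linalg.lstsq(Xs[:, active], yc, rcond=None)
        weakest = int(np.argmin(np.abs(coef)))
        active.pop(weakest)
    return active

rng = np.random.default_rng(7)
n = 300
X = rng.normal(size=(n, 6))
# Hypothetical target: strong effects from features 0 and 5, a weak one from feature 2.
y = 4.0 * X[:, 0] + 0.5 * X[:, 2] - 3.0 * X[:, 5] + rng.normal(size=n)

active = recursive_feature_elimination(X, y, n_keep=2)
print(active)
```

Refitting after every removal lets the ranking adapt as features leave, at the price of one model fit per eliminated feature.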


3 Embedded methods

Embedded methods combine the advantages of filter and wrapper methods by incorporating the feature selection process into the learning algorithm. Embedded methods are more efficient than wrapper methods, and usually capture feature relevance better than filter methods, but they are specific to the learning algorithm, so the selected features may not transfer well to other models. Some examples of embedded methods are LASSO, decision trees, and, more loosely, ridge regression, which shrinks coefficients without setting them exactly to zero.

    L1 Regularization (Lasso): L1 regularization adds a penalty term based on the absolute values of the model's coefficients to the cost function. It encourages some feature coefficients to be exactly zero, effectively eliminating those features from the model. Features with non-zero coefficients are considered the most important for the model's performance.
    L2 Regularization (Ridge): L2 regularization adds a penalty term based on the squared values of the model's coefficients to the cost function. While it does not eliminate features like L1 regularization, it can reduce the magnitude of less important feature coefficients.
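
The difference between the two penalties shows up clearly on a small example. The sketch below fits a lasso by cyclic coordinate descent and a ridge regression in closed form on the same synthetic data. The scaling of the objectives (written in the docstrings), the penalty strength of 0.2, the iteration count, and the data are assumptions chosen for illustration.

```python
import numpy as np

def soft_threshold(value, lam):
    return np.sign(value) * max(abs(value) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/2n)||y - Xw||^2 + lam * ||w||_1 by cyclic coordinate descent."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(d):
            residual = y - X @ w + X[:, j] * w[j]  # residual with feature j left out
            rho = X[:, j] @ residual / n
            w[j] = soft_threshold(rho, lam) / col_sq[j]
    return w

def ridge(X, y, lam):
    """Closed-form minimiser of (1/2n)||y - Xw||^2 + (lam/2) * ||w||^2."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

rng = np.random.default_rng(3)
n = 200
X = rng.normal(size=(n, 6))
X = (X - X.mean(axis=0)) / X.std(axis=0)
# Hypothetical target that depends only on features 0 and 3.
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=n)
y = y - y.mean()

w_lasso = lasso_coordinate_descent(X, y, lam=0.2)
w_ridge = ridge(X, y, lam=0.2)
print(np.round(w_lasso, 3))
print(np.round(w_ridge, 3))
```

The soft-thresholding step is what produces exact zeros in the lasso solution; the ridge solution has no such step, so its coefficients for the irrelevant features come out small but non-zero.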


    

  Awnish Shankar
    

    Embedded methods in feature selection are like pruning a tree as it grows, ensuring only the strongest branches flourish. These methods build the selection process into model training: LASSO uses regularization to drive the coefficients of less relevant features to zero, while Ridge regression shrinks them without removing any. This approach reduces overfitting and creates a more streamlined model. The beauty of embedded methods lies in their computational efficiency: they optimize feature selection during model learning, saving time and resources. They're perfect for building models that are both accurate and efficient.


    

  Sandeep Sharma
    

    Embedded methods in machine learning are the insiders, the cool mix of brains and brawn.
    Regularization (like LASSO, Ridge): These are the savvy shrinkers, cutting down on over-the-top features while training.
    Decision Trees (like Random Forest, Gradient Boosting): These guys are tough. They pick and choose features as they build the model, real-time decision-makers.
    Embedded methods are slick 'cause they're part of the learning process, weeding out the weak while they're on the job. It's efficient, smart, and keeps your model from going off the rails with too much fluff. Perfect for keeping things tight and on point.

  Ammar Jawad D.
    

    In contrast to the filtering above, wrapper methods run the features through a search algorithm that proposes a subset of features and asks the learning algorithm to train on it. The learning algorithm reports how well that subset does, and the search uses that feedback to choose the next subset to try. Embedded methods, by comparison, learn which features contribute to the model's accuracy while the model is being created. The most commonly used are regularisation methods; examples of regularisation algorithms are LASSO, ElasticNet, and Ridge Regression.


4 Dimensionality reduction

Dimensionality reduction is a technique that transforms the original features into a lower-dimensional space, preserving as much information as possible. Dimensionality reduction can reduce the complexity, noise, and redundancy of the data, and improve the visualization and interpretation of the features. However, dimensionality reduction may also lose some information and introduce distortion or bias to the data. Some examples of dimensionality reduction are principal component analysis, linear discriminant analysis, and autoencoder.

  Shailendra Kumar
    

    Some of the most effective feature reduction techniques include:
    1. Principal Component Analysis (PCA): PCA is a statistical technique that is used to reduce the dimensionality of the data by transforming the original variables into a new set of variables.
    2. Recursive Feature Elimination (RFE): RFE is a technique that works by recursively removing the least important features from a model until the desired number of features is reached. This is done by training the model on the full set of features and then ranking the importance of each feature.
    3. L1 Regularization (Lasso): L1 regularization is a technique that adds a penalty to the loss function of a model based on the absolute values of the coefficients of the features.


    

  Rana M. Abbaszadeh
    

    One method of reducing dimensionality is principal component analysis (PCA). Let's say you have a data set with 20 variables that you want to use to predict an outcome. Some of these variables may be redundant and not add any value to building a predictive model or developing insights. PCA looks at how the variables vary together and builds new variables, the principal components, each of which is a linear combination of the original ones. The first principal component is the direction along which the data varies the most, the second is the direction of greatest remaining variance that is uncorrelated with the first, and so on. Components with low variance carry less information about the spread of the data and can be dropped. Using this technique, you could pare a 20-variable data set down to, say, 3 principal components that retain most of the variance of the original data, though not necessarily all of its predictive information, since PCA never looks at the outcome.
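
The 20-variables-to-3-components scenario can be reproduced with a short PCA sketch based on the singular value decomposition of the centred data. The synthetic data set (20 observed variables driven by 3 hidden factors plus a little noise) and the function name `pca` are assumptions made for the example.

```python
import numpy as np

def pca(X, n_components):
    """Project centred data onto its top principal directions."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    explained_ratio = (S ** 2) / np.sum(S ** 2)
    return Xc @ components.T, components, explained_ratio[:n_components]

rng = np.random.default_rng(4)
Z = rng.normal(size=(200, 3))   # 3 hidden factors
W = rng.normal(size=(3, 20))    # how each factor loads on the 20 variables
X = Z @ W + 0.05 * rng.normal(size=(200, 20))

scores, components, ratio = pca(X, n_components=3)
print(scores.shape, np.round(ratio, 3), round(float(ratio.sum()), 4))
```

Each row of `components` mixes all 20 original variables, which is why principal components are harder to interpret than a selected subset of the original features.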


    

  Sapan H Mankad
    

    Singular Value Decomposition (SVD) is also widely used in certain applications. It decomposes a matrix representation of the given data into three simpler matrices, and keeping only the largest singular values gives a low-rank approximation of the original matrix, that is, a latent representation with fewer dimensions. In NLP this technique is used for Latent Semantic Indexing (LSI), where the objective is to ensure that semantically similar words are represented close to each other in the new lower-dimensional space.
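
A tiny term-document example shows the low-rank idea at work. The matrix below is made up for illustration: "car" and "automobile" each dominate a different document but both co-occur with "engine", while "banana" and "fruit" appear in a separate document. Keeping two singular values is also an assumption of this sketch.

```python
import numpy as np

# Hypothetical term-document count matrix: rows are terms, columns are documents.
terms = ["car", "automobile", "engine", "banana", "fruit"]
A = np.array([
    [2, 0, 1, 0],
    [0, 2, 1, 0],
    [1, 1, 2, 0],
    [0, 0, 0, 3],
    [0, 0, 0, 2],
], dtype=float)

U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k]  # rank-k approximation of A
term_vectors = U[:, :k] * S[:k]           # each term as a k-dimensional latent vector

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

raw_similarity = cosine(A[0], A[1])                               # car vs automobile, original space
latent_similarity = cosine(term_vectors[0], term_vectors[1])      # car vs automobile, latent space
unrelated_similarity = cosine(term_vectors[0], term_vectors[3])   # car vs banana, latent space
print(raw_similarity, latent_similarity, unrelated_similarity)
```

In the original space the two vehicle terms look only weakly similar because they rarely share a document; in the two-dimensional latent space they end up pointing in nearly the same direction, while the fruit term stays separate.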


    


    

    Linear Discriminant Analysis (LDA): LDA is a supervised dimensionality reduction technique that focuses on maximizing class separability for classification tasks. It identifies linear combinations of features (discriminant functions) that best separate different classes in the data.
    Autoencoders: Autoencoders are neural network models used for unsupervised dimensionality reduction. They consist of an encoder network that maps the input data to a lower-dimensional representation (encoding) and a decoder network that attempts to reconstruct the original data from the encoding. Autoencoders can capture complex, nonlinear patterns in the data.
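
For the two-class case, the discriminant function described above has a closed form, usually called Fisher's linear discriminant. The sketch below computes it and projects four-dimensional data onto a single axis. The Gaussian toy data, the class means, and the midpoint decision rule are assumptions for this example; the multi-class version and autoencoders are not covered.

```python
import numpy as np

def fisher_lda_direction(X, y):
    """Two-class Fisher discriminant: w is proportional to Sw^{-1} (mu1 - mu0)."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter: summed outer products of the centred samples of each class.
    Sw = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
    w = np.linalg.solve(Sw, mu1 - mu0)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(5)
X0 = rng.normal(size=(200, 4))
X1 = rng.normal(size=(200, 4)) + np.array([3.0, 3.0, 0.0, 0.0])
X = np.vstack([X0, X1])
y = np.repeat([0, 1], 200)

w = fisher_lda_direction(X, y)
z = X @ w  # the data reduced from 4 dimensions to 1
threshold = 0.5 * (z[y == 0].mean() + z[y == 1].mean())
accuracy = float(np.mean((z > threshold) == (y == 1)))
print(np.round(w, 3), accuracy)
```

Unlike PCA, the direction is chosen using the class labels, so the single retained dimension is the one that separates the classes rather than the one with the most overall variance.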


    

  Ammar Jawad D.
    

    Could you take a picture of your hand flat on a table? If you turned that image into black and white and reduced it from 3D to 2D, you would have applied the concept of dimensionality reduction to your image. In other words, we throw out data related to the picture (colours and dimensions) in return for a significantly reduced file size. In ML, PCA, linear discriminant analysis and autoencoders are techniques used to achieve the same goal on a dataset.


    

5 Feature extraction

Feature extraction is a technique that creates new features from the original features, based on some domain knowledge or heuristic rules. Feature extraction can enhance the meaning and relevance of the features, and reduce the dimensionality and complexity of the data. However, feature extraction may also introduce noise or errors to the data, and require more expertise and effort to design and implement. Some examples of feature extraction are polynomial features, binning, and text vectorization.
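
Two of the examples named above, polynomial features and binning, can be sketched directly. The function names, the input values, and the bin edges are illustrative assumptions. Note that polynomial expansion on its own increases the number of columns, so it is typically paired with a selection step afterwards.

```python
import numpy as np
from itertools import combinations_with_replacement

def polynomial_features(X, degree=2):
    """Build all monomials of the input columns up to the given degree (no bias term)."""
    columns = []
    for d in range(1, degree + 1):
        for idx in combinations_with_replacement(range(X.shape[1]), d):
            columns.append(np.prod(X[:, list(idx)], axis=1))
    return np.column_stack(columns)

def bin_feature(x, edges):
    """Map each value to the index of the interval it falls in."""
    return np.digitize(x, edges)

X = np.array([[1.0, 2.0], [3.0, 4.0]])
poly = polynomial_features(X, degree=2)  # columns: x0, x1, x0^2, x0*x1, x1^2

ages = np.array([15, 25, 40, 70])
age_bins = bin_feature(ages, edges=[18, 30, 65])  # hypothetical age brackets
print(poly)
print(age_bins)
```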


    Neural networks can also work as general feature extractors. If you train the encoder part of a network, you obtain embeddings that represent your data in fewer dimensions. The space of these embeddings is also known as the latent space. These features can then be fed into a decoder network or used as input for any other ML algorithm.


    

  Siddhartha Chandra
    

    Pruning, where we typically remove parameters from a deep learning model after training it, is another popular approach for dimensionality reduction in Deep Learning. In the presence of weight decay, during training, there are two forces acting on any model parameter: 1. the gradients coming from the loss function, and 2. the weight decay regularisation. Over the course of training, parameters that do not contribute to the loss keep getting smaller because their gradients from the loss function are small and dominated by the decay due to regularisation. Hence, a parameter's magnitude after training correlates with its contribution to the loss, and parameters with the smallest magnitudes are removed in vanilla pruning.
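
Vanilla magnitude pruning, as described above, amounts to zeroing the weights with the smallest absolute values. The sketch below applies it to a single weight matrix. The function name, the 50% sparsity level, and the example weights are assumptions; real pipelines usually prune layer by layer and fine-tune afterwards, which is not shown.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out roughly the given fraction of entries with the smallest absolute value."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy(), np.ones(weights.shape, dtype=bool)
    # Magnitude of the k-th smallest entry; everything at or below it is pruned,
    # so ties at the threshold can remove slightly more than k entries.
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask, mask

# Hypothetical trained weights of one small layer.
W = np.array([[0.8, -0.05, 0.3],
              [-0.01, 1.2, 0.02]])

pruned, mask = magnitude_prune(W, sparsity=0.5)
print(pruned)
```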


    

6 Feature engineering

Feature engineering is the art and science of selecting, transforming, and creating features that can improve the performance and interpretability of machine learning models. Feature engineering requires a deep understanding of the data, the problem, and the model, and involves a lot of experimentation and iteration. It is often considered one of the most important and challenging aspects of machine learning, and can make a significant difference in the results.

  Sandeep Sharma
    

    "Feature engineering in machine learning is like the art of a master chef. First, clean the data, scrubbing out all the grime and gunk. Then, get those features in tune – scale or normalize them for harmony. Next up, whip up some new features, like blending unique ingredients for a signature dish. Finally, cherry-pick the best ones, the flavors that really make the dish pop. This process is the secret sauce – it turns good models into great ones, making sure they hit all the right notes."


    

  Hariprasad K.


    

    Human interpretability plays a big role for understanding the problem and its goals, the features and the algorithms used and the interaction between all of these. Visualization is useful for better interpretation, by capturing the geometric and topological nature of the data. For all practical problems the data is in a high-dimensional space. Interestingly, in many problems, the data lies in a low dimensional manifold within the high dimensional space. t-SNE (mentioned by Palak Awasthi) and dimensionality reduction (PCA, kernel PCA etc) along with clustering are some ways for obtaining a lower dimensional representation that can be visualized. A presentation from NeurIPS 2018 on "Visualization for Machine Learning" is very useful.


    

7 Here’s what else to consider

This is a space to share examples, stories, or insights that don’t fit into any of the previous sections. What else would you like to add?

  Francky Fouedjio, Ph.D.
    

    Clustering can be used as an effective technique for feature reduction in machine learning. The idea is to group similar features together, creating a set of representative cluster centroids or cluster assignments that can be used as a reduced set of features. This strategy can simplify the data representation and improve the performance of machine learning models.
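
One simple way to realise this idea is to group features by their pairwise correlation and replace each group with the average of its members. The greedy grouping rule, the 0.8 threshold, the function names, and the synthetic data below are assumptions for the sketch; standard clustering algorithms applied to the feature columns would be an alternative.

```python
import numpy as np

def cluster_features(X, threshold=0.8):
    """Greedy grouping: a feature joins the first cluster whose seed it correlates with."""
    corr = np.corrcoef(X, rowvar=False)
    clusters = []  # each cluster is a list of column indices; the first entry is its seed
    for j in range(X.shape[1]):
        for members in clusters:
            if corr[j, members[0]] >= threshold:
                members.append(j)
                break
        else:
            clusters.append([j])
    return clusters

def cluster_centroids(X, clusters):
    """Replace each cluster by the mean of its standardised member columns."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    return np.column_stack([Xs[:, members].mean(axis=1) for members in clusters])

rng = np.random.default_rng(6)
n = 300
a = rng.normal(size=n)
b = rng.normal(size=n)

def noisy(signal):
    return signal + 0.2 * rng.normal(size=n)

# Five observed features driven by two underlying signals.
X = np.column_stack([noisy(a), noisy(a), noisy(b), noisy(b), noisy(a)])

clusters = cluster_features(X)
X_reduced = cluster_centroids(X, clusters)
print(clusters, X_reduced.shape)
```

Only positively correlated features are grouped here, so that averaging the members of a cluster does not cancel them out.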


    

    

    1. Principal Component Analysis (PCA): Reduces dimensionality by transforming features into uncorrelated components.
    2. Feature Selection: Identifies and retains relevant features, e.g., using Recursive Feature Elimination (RFE).
    3. LASSO (Least Absolute Shrinkage and Selection Operator): Applies a penalty to shrink some coefficients to zero, effectively selecting features.
    4. t-SNE: Reduces dimensionality while preserving pairwise similarities between data points, often for visualization.
    5. Autoencoders: Neural network models that learn a compressed representation, reducing the number of features.


    


