Peer-Reviewed

Dimensionality Reduction of Data with Neighbourhood Components Analysis

Received: 9 April 2022 | Accepted: 25 April 2022 | Published: 10 May 2022
Abstract

In most research fields, the volume of data produced is growing rapidly. The analysis of big data offers potentially unlimited opportunities for information discovery; however, high dimensionality and the presence of outliers call for a suitable dimensionality reduction algorithm. By performing dimensionality reduction, we can learn low-dimensional embeddings that capture most of the variability in the data. This study proposes the use of Neighbourhood Components Analysis (NCA), a nearest-neighbour-based, non-parametric method for learning low-dimensional linear embeddings of labelled data; that is, the approach uses class labels to guide the dimensionality reduction (DR) process. NCA learns a low-dimensional linear projection of the feature space that improves the performance of a nearest-neighbour classifier in the projected space. Because the method makes no parametric assumptions about the data, it can work well with complex or multi-modal data, which is the case with most real-world data. We evaluated the efficiency of the method by comparing the classification errors and class separability of the embedded data with those of Principal Component Analysis (PCA). The results show a substantial reduction in the dimensionality of the data, from 754 to 55 dimensions, and NCA achieved a lower classification error than PCA across a range of embedding dimensions. Analyses conducted on real and simulated datasets showed that the proposed algorithm is largely insensitive to increases in the number of outliers and irrelevant features and consistently outperformed the classical PCA method.
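
In the standard formulation of NCA due to Roweis, Hinton and Salakhutdinov, a linear map A is learned so that each point x_i stochastically selects a neighbour x_j with probability p_ij = exp(−||Ax_i − Ax_j||²) / Σ_{k≠i} exp(−||Ax_i − Ax_k||²), with p_ii = 0, and A is chosen to maximise the expected number of correctly classified points, f(A) = Σ_i Σ_{j: c_j = c_i} p_ij. Taking A to be rectangular, with fewer rows than the original dimension, yields the low-dimensional embedding Ax.

The sketch below is a minimal illustration of the comparison described in the abstract, not the authors' own code. It assumes scikit-learn, whose NeighborhoodComponentsAnalysis and PCA estimators implement the two methods compared here, and it substitutes a synthetic 754-feature dataset for the labelled data used in the study; the 55-dimensional target mirrors the reduction reported above.

    # Minimal sketch (assumed setup, not the authors' code): compare NCA and
    # PCA as the dimensionality-reduction step ahead of a k-NN classifier.
    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier, NeighborhoodComponentsAnalysis
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Synthetic stand-in for the study's labelled data: 754 features,
    # most of them irrelevant noise, two classes.
    X, y = make_classification(n_samples=1000, n_features=754,
                               n_informative=20, n_classes=2, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, stratify=y, random_state=0)

    reducers = [
        ("NCA", NeighborhoodComponentsAnalysis(n_components=55, random_state=0)),
        ("PCA", PCA(n_components=55, random_state=0)),
    ]
    for name, reducer in reducers:
        # Standardise, embed into 55 dimensions, then classify with 3-NN.
        model = make_pipeline(StandardScaler(), reducer,
                              KNeighborsClassifier(n_neighbors=3))
        model.fit(X_train, y_train)
        error = 1.0 - model.score(X_test, y_test)
        print(f"{name}: test classification error at 55 dimensions = {error:.3f}")

Because NCA uses the class labels when fitting the projection while PCA does not, the NCA pipeline would typically attain the lower test error on data of this kind, consistent with the comparison reported in the abstract.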

Published in International Journal of Data Science and Analysis (Volume 8, Issue 3)
DOI 10.11648/j.ijdsa.20220803.11
Page(s) 72-81
Creative Commons

This is an Open Access article, distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium or format, provided the original work is properly cited.

Copyright

Copyright © The Author(s), 2022. Published by Science Publishing Group

Keywords

Dimensionality Reduction, Neighbourhood Components Analysis (NCA), Principal Component Analysis (PCA), Outlier Detection

Cite This Article
  • APA Style

    Hannah Kariuki, Samuel Mwalili, Anthony Waititu. (2022). Dimensionality Reduction of Data with Neighbourhood Components Analysis. International Journal of Data Science and Analysis, 8(3), 72-81. https://doi.org/10.11648/j.ijdsa.20220803.11


    ACS Style

    Hannah Kariuki; Samuel Mwalili; Anthony Waititu. Dimensionality Reduction of Data with Neighbourhood Components Analysis. Int. J. Data Sci. Anal. 2022, 8(3), 72-81. doi: 10.11648/j.ijdsa.20220803.11


    AMA Style

    Hannah Kariuki, Samuel Mwalili, Anthony Waititu. Dimensionality Reduction of Data with Neighbourhood Components Analysis. Int J Data Sci Anal. 2022;8(3):72-81. doi: 10.11648/j.ijdsa.20220803.11


  • BibTeX

    @article{10.11648/j.ijdsa.20220803.11,
      author = {Hannah Kariuki and Samuel Mwalili and Anthony Waititu},
      title = {Dimensionality Reduction of Data with Neighbourhood Components Analysis},
      journal = {International Journal of Data Science and Analysis},
      volume = {8},
      number = {3},
      pages = {72-81},
      doi = {10.11648/j.ijdsa.20220803.11},
      url = {https://doi.org/10.11648/j.ijdsa.20220803.11},
      eprint = {https://article.sciencepublishinggroup.com/pdf/10.11648.j.ijdsa.20220803.11},
      year = {2022}
    }
    


  • RIS

    TY  - JOUR
    T1  - Dimensionality Reduction of Data with Neighbourhood Components Analysis
    AU  - Hannah Kariuki
    AU  - Samuel Mwalili
    AU  - Anthony Waititu
    Y1  - 2022/05/10
    PY  - 2022
    N1  - https://doi.org/10.11648/j.ijdsa.20220803.11
    DO  - 10.11648/j.ijdsa.20220803.11
    T2  - International Journal of Data Science and Analysis
    JF  - International Journal of Data Science and Analysis
    JO  - International Journal of Data Science and Analysis
    SP  - 72
    EP  - 81
    PB  - Science Publishing Group
    SN  - 2575-1891
    UR  - https://doi.org/10.11648/j.ijdsa.20220803.11
    VL  - 8
    IS  - 3
    ER  - 


Author Information
  • Hannah Kariuki, Department of Statistics and Actuarial Science, Jomo Kenyatta University of Agriculture and Technology, Nairobi, Kenya

  • Samuel Mwalili, Department of Statistics and Actuarial Science, Jomo Kenyatta University of Agriculture and Technology, Nairobi, Kenya

  • Anthony Waititu, Department of Statistics and Actuarial Science, Jomo Kenyatta University of Agriculture and Technology, Nairobi, Kenya
