Arpita Joshi1, 2, * Nurit Haspel2
1 Department of Computer Science, Purdue University Northwest, Hammond, IN, USA
Email: [email protected]/[email protected]
2 Department of Computer Science, University of Massachusetts, Boston, MA, USA
Email: [email protected]
* Corresponding Author: Arpita Joshi, Email: [email protected]
Representation of structurally significant data is indispensable to modern research. The need for dimensionality reduction finds its foray in varied genres viz-a-viz, Structural Bioinformatics, Machine Learning, Robotics, Artificial Intelligence, to name a few. The number of points required to effectively capture the essence of a structure is an intuitive decision. Feature reduction methods like Principal Component Analysis (PCA) have already been explored and proven to be an aid in classification and regression. In this work we present a novel approach that first performs PCA on a data set for reduction of features and then attempts to reduce the number of points itself to get rid of the points that have nothing or very little new to offer. The algorithm was tested on various kinds of data (points representing a spiral, protein coordinates, the Iris dataset prevalent in Machine Learning, face image) and the results agree with the quantitative tests applied. In each case, it turns out that a lot of data instances need not be stored to make any kind of decision. Matlab and R simulations were used to assess the structures with reduced data points. The time complexity of the algorithm is linear in the degrees of freedom of the data if the data is in a natural order.
Dimensionality Reduction, PCA, Protein Conformations, Non-linear feature reduction, Data instance reduction
Arpita Joshi and Nurit Haspel (2020). A Novel Data Instance Reduction Technique using Linear Feature Reduction. Journal of Artificial Intelligence and Systems, 2, 191–206. https://doi.org/10.33969/AIS.2020.21012.
 P. K., “On lines and planes of closest fit to systems of points in space.” Mag A, pp. 59–572, 1901.
 M. Benito and D. Pena, “A fast approach for dimensionality reduction with image data.” Pattern recognition, pp. 2400–2408, 2005.
 I. T. Jolliffe and J. Cadima, “Principal component analysis: a review and recent developments,” Philos Trans A Math Phys Eng Sci., vol. 374, 2016. [Online]. Available: www.ncbi.nlm.nih.gov/pmc/articles/PMC4792409/
 E. J. Candes, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?” Journal of the ACM (JACM), vol. 58(3), 2011.
 N. Locantore, J. Marron, D. Simpson, N. Tripoli, J. Zhang, and K. Cohen, “Robust principal component analysis for functional data,” Socicdad de Estadistica e Investigacion Operativa Test, vol. 8, 1999.
 P. Das, M. Moll, H. Stamati, L. Kavraki, and C. Clementi, “Low dimensional, free energy landscapes of protein folding reactions by nonlinear dimensionality reduction,” Proc. Nat. Acad. Sci., vol. 103, no. 26, pp. 9885–9890, 2006.
 A. Vajdi and N. Haspel, “A new dynamic programming algorithm for comparing gene expression data using geometric similarity,” IEEE International Conference on Bioinformatics and Biomedicine, pp. 1157 – 1161, 2016.
 N. Haspel, M. Moll, M. Baker, W. Chiu, and L. E. Kavraki, “Tracing conformational changes in proteins,” BMC Structural Biology, vol. Suppl1, p. S1, 2010.
 B. Raveh, A. Enosh, O. Furman-Schueler, and D. Halperin, “Rapid sampling of molecular motions with prior information constraints,” Plos Comp. Biol., vol. 5(2), p. e1000295, 2009.
 A. Shehu and B. Olson, “Guiding the search for native-like protein conformations with an ab-initio tree-based exploration,” The International Journal of Robotics Research, vol. 29, no. 8, pp. 1106–1127, 2010.
 I. Al-Bluwi, M. Vaisset, T. Sim´eon, and J. Cort´es, “Modeling protein conformational transitions by a combination of coarse-grained normal mode analysis and robotics-inspired methods,” BMC structural biology, vol. 13, no. Suppl 1, p. S2, 2013.
 A. Joshi and N. Haspel, “Clustering of protein conformations using parallelized dimensionality reduction,” Journal of Advances in Information Technology, 2019.
 M. Greberman, “Image processing applications: An overview,” Proc Annu Symp Comput Appl Med Care, vol. Nov 7, pp. 257–259, 1984. [Online]. Available: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2578700/
 A. Bovick, “Handbook of image and video processing,” Academic Press New York, 2000.
 D. R. Wilson and T. R. Martinez, “Reduction techniques for instance-based learning algorithms,” Machine Learning, 2000.
 A. A. Gonzalez, J.-F. Diez-Pastor, J. J. Rodriguez, and C. G. Osorio, “Instance selection of linear complexity for big data,” Knowledge Based Systems, 2016.
 S. Garcia, J. Derrac, J. Cano, and F. Herrera, “Prototype selection for nearest neighbor classification: Taxonomy and empirical study,” IEEE’s Transactions on Pattern Analysis and Machine Intelligence, pp. 417–435, 2012.
 I. Czarnowski and P. Jedrzejowicz, “Instance reduction approach to machine learning and multi-database mining,” Annales UMCS Informatica, 2006.
 S.-H. Son and J.-Y. Kim, “Data reduction for instance based learning using entropy-based partitioning,” in ICCSA: International Conference on Computational Science and Its Applications, 2006, pp. 590–599.
 A. Joshi, “High performance computing techniques to better understand protein conformational space,” Ph.D. dissertation, University of Massachusetts, Boston, 2019.
 J. Fujiki and S. Akaho, “Spherical pca with euclideanization,” ACCV’07 Workshop Subspace, November 2007.
 M. Fontes and C. Sonesan, “The projection score – an evaluation criterion for variable subset selection in pca visualization,” BMC Bioinformatics, vol. 12, p. 307, 2011.
 J. Tenenbaum, V. de Silva, and J. Langford, “A global geometric framework for nonlinear dimensionality reduction.” Science, pp. 2319–2323, 2000.
 V. D. Silva and J. B. Tenenbaum, “Global versus local methods in nonlinear dimensionality reduction,” Advances in neural information processing systems, 2003.
 J. C. Phillips, R. Braun, W. Wang, J. Gumbart, E. Tajkhorshid, E. Villa, C. Chipot,
R. D. Skeel, L. Kale, and K. Schulten, “Scalable molecular dynamics with namd,” Journal of computational chemistry, vol. 26, no. 16, pp. 1781–1802, 2005.
 K. Wennerberg, K. L. Rossman, and C. J. Der, “The ras superfamily at a glance,” Journal of Cell Science, vol. 118, no. 5, pp. 843–846, 2005. [Online]. Available: http://jcs.biologists.org/content/118/5/843
 A. Joshi, N. haspel, and E. Gonzalez, “Sampling of intermediate protein conformations using efficient dimensionality reduction and topological analysis of structures,” submitted to IEEE Transactions on Computational Biology and Bioinformatics, 2020.
 D. Luo and N. Haspel, “Multi-resolution rigidity-based sampling of protein conformational paths,” pp. 787–793, September 2013.
 A. Vajdi, A. Joshi, and N. Haspel, “Integrating co-evolutionary information in monte carlo based method for proteins trajectory simulation,” in Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, ser. BCB ’19. New York, NY, USA: Association for Computing Machinery, 2019, p. 598–603. [Online]. Available: https://doi.org/10.1145/3307339.3343867
 R. C. Gonzalez and R. E. Woods, Digital Image Processing. Pearson, 2018.
 A. Joshi, A. Boyat, and B. K. Joshi, “Impact of wavelet transform and median filtering on removal of salt and pepper noise in digital images,” Proc. of International Conference on Issues and Challenges in Intelligent Computing Techniques, 2014.
 A. H. Ashtari, “Pic/plot images to coordinate points.” 2015, national University of Malaysia, Center for Artificial Intelligence Technology (CAIT), Faculty Of Information Science and Technology.
 P.-A. Cazade, W. Zheng, D. Prada-Gracia, G. Berezovska, F. Rao, C. Clementi, and M. Meuwly, “A comparative analysis of clustering algorithms: O2 migration in truncated hemoglobin i from transition networks,” The Journal of Chemical Physics, vol. 142, no. 2, pp. –, 2015. [Online]. Available: http://scitation.aip.org/content/aip/journal/jcp/142/2/10.1063/1.4904431
 R. Fisher, “The use of multiple measurements in taxonomic problems,” Annual Eugenics 7, Part II, 1936.
 B. Dasarathy, “Nosing around the neighborhood: A new system structure and classification rule for recognition in partially exposed environments,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-2, No. 1, 67-71,, 1980.
 D. Simovici, Mathematical Analysis for Machine Learning and Data Mining. World Scientific, 2018.
 M. Fanty and R. Cole, “Spoken letter recognition.” Advances in Neural Information Processing Systems, 1991.
 D. Dua and C. Graff, “UCI machine learning repository,” 2017. [Online]. Available: http://archive.ics.uci.edu/ml
 S. Kirkpatrick, C. D. G. Jr., and M. P. Vecchi, “Optimization by simulated annealing,” Science, vol. 220, pp. 671–680, 1983.
 R. Vetro, N. Haspel, and D. Simovici, “Characterizing intermediate conformations in protein conformational space,” Proc. of the Ninth International Meeting on Computational Intelligence Methods for Bioinformatics and Biostatistics, July 2012.