Authorship Classification in Academic and Scientific Documents: A Machine Learning-Based Approach
DOI:
https://doi.org/10.14482/inde.44.01.215.568
Keywords:
Machine learning, data mining, natural language processing, writing style, prediction
Abstract
This paper presents a machine learning-based system that uses text mining to analyze and classify writing styles in scientific reports authored by faculty members at the Pontificia Universidad Católica del Ecuador, Esmeraldas. The system aims to strengthen academic integrity by identifying potential cases of false authorship. A dataset of research papers written in Spanish by faculty members was processed using TF-IDF (Term Frequency-Inverse Document Frequency) and word embeddings for feature extraction. To assess classification performance, seven supervised learning models were tested: Linear Support Vector Classifier (SVC), SVC with RBF kernel, Random Forest, Decision Tree, Logistic Regression, k-Nearest Neighbors (k-NN), and Naïve Bayes. Logistic Regression yielded the highest accuracy (89.62%), closely followed by Linear SVC (87.36%) and RBF SVC (86.59%); these models outperformed the tree-based and probabilistic methods with statistical significance (p < 0.05). The Wilcoxon test showed no significant performance differences among the best classifiers, supporting their reliability for authorship attribution. The findings highlight the promise of incorporating writing style analysis into institutional systems to complement conventional plagiarism detection methods.
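To make the feature-extraction and classification steps concrete, the following is a minimal sketch of TF-IDF weighting followed by a nearest-neighbor classifier (k-NN with k = 1) using cosine similarity. It is not the authors' implementation: the toy English corpus, the author labels, the smoothed IDF variant, and all function names are assumptions chosen for illustration, and the paper's actual preprocessing and hyperparameters are not given in the abstract.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and split on whitespace. A real pipeline would likely also
    # remove stop words and lemmatize (assumption, not stated in the abstract).
    return text.lower().split()


def fit_idf(docs_tokens):
    # Document frequency: number of documents that contain each term.
    n_docs = len(docs_tokens)
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))
    # Smoothed IDF variant (an assumption): log((1 + N) / (1 + df)) + 1.
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }


def tfidf_vector(tokens, idf):
    # Raw term counts times IDF, ignoring terms unseen during fitting,
    # then scaled to unit Euclidean length.
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def cosine(a, b):
    # Both vectors are unit length, so cosine similarity is the dot product.
    return sum(v * b.get(t, 0.0) for t, v in a.items())


def predict_author(text, train_vectors, train_labels, idf):
    # 1-nearest-neighbor: return the label of the most similar training text.
    query = tfidf_vector(tokenize(text), idf)
    sims = [cosine(query, vec) for vec in train_vectors]
    best = max(range(len(sims)), key=sims.__getitem__)
    return train_labels[best]


# Hypothetical toy corpus; the study used Spanish-language research papers.
train_texts = [
    "we propose a novel ontology for semantic agents",
    "the ontology supports semantic reasoning in agents",
    "soil samples were collected along the coastal mangrove",
    "mangrove soil salinity was measured in coastal plots",
]
train_labels = ["author_a", "author_a", "author_b", "author_b"]

train_tokens = [tokenize(t) for t in train_texts]
idf = fit_idf(train_tokens)
train_vectors = [tfidf_vector(t, idf) for t in train_tokens]

prediction = predict_author(
    "semantic agents use an ontology", train_vectors, train_labels, idf
)
print(prediction)
```

With a corpus this small the classifier mostly separates topics rather than writing style; capturing style would presumably need many documents per author and richer features, such as the word embeddings mentioned above.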
License
Copyright (c) 2025 Revista Científica Ingeniería y Desarrollo.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.




