ARTÍCULO DE INVESTIGACIÓN / RESEARCH ARTICLE

Authorship Classification in Academic and Scientific Documents: A Machine Learning-Based Approach

Clasificación de autoría en documentos académicos y científicos: un enfoque basado en aprendizaje automático

Pablo Pico-Valencia*

Sahory Maila-Herrera**

* Universidad de Granada (España). Software Engineering Department, Research Centre for Information and Communication Technologies (CITIC-UGR). Ph.D. in Information and Communication Technologies (ICT). Orcid-ID: https://orcid.org/0000-0003-3518-3313. pablo.pico@ugr.es

** Pontificia Universidad Católica del Ecuador Sede Esmeraldas. Systems and Computing Engineering. Eng. in System and Computing. Orcid-ID: https://orcid.org/0009-0007-1702-9749. sahory.maila@pucese.edu.ec

Corresponding author: Pablo Pico-Valencia, “Periodista Daniel Saucedo Aranda” Street, 18071, Granada (España).

Abstract

This paper presents a machine learning-based system that incorporates text mining to analyze and classify writing styles in scientific reports authored by faculty members at the Pontificia Universidad Católica del Ecuador, Esmeraldas. The system aims to enhance academic integrity by identifying potential cases of false authorship. A dataset of research papers written in Spanish by faculty professors was processed using TF-IDF (Term Frequency-Inverse Document Frequency) and Word Embeddings for feature extraction. To assess classification performance, seven supervised learning models were tested: Linear Support Vector Classifier (SVC), SVC with RBF kernel, Random Forest, Decision Tree, Logistic Regression, k-Nearest Neighbors (k-NN), and Naïve Bayes. The Logistic Regression model yielded the highest accuracy (89.62%), closely followed by Linear SVC (87.36%) and RBF SVC (86.59%), outperforming tree-based and probabilistic methods with statistical significance (p<0.05). The Wilcoxon test showed no significant performance differences among the best classifiers, confirming their reliability in authorship attribution. The findings highlight the promise of incorporating writing style analysis into institutional systems to enhance conventional methods for detecting plagiarism.

Keywords: data mining, machine learning, natural language processing, prediction, writing style.

Resumen

Este artículo presenta un sistema basado en aprendizaje automático que implementa minería de texto para analizar y clasificar estilos de escritura en informes científicos elaborados por docentes de la Pontificia Universidad Católica del Ecuador, sede Esmeraldas. El objetivo del sistema es fortalecer la integridad académica mediante la identificación de posibles casos de autoría falsa. Se procesó un conjunto de datos compuesto por artículos de investigación redactados en español por profesores universitarios, aplicando TF-IDF (Frecuencia de Término - Frecuencia Inversa de Documento) y Word Embeddings para la extracción de características. Para evaluar el rendimiento en la clasificación, se probaron siete modelos de aprendizaje supervisado: Clasificador Lineal de Vectores de Soporte (SVC), SVC con kernel RBF, Random Forest, Árbol de Decisión, Regresión Logística, k-Vecinos más Cercanos (k-NN) y Naïve Bayes. El modelo de Regresión Logística obtuvo la mayor precisión (89.62 %), seguido de cerca por el SVC Lineal (87.36 %) y el SVC RBF (86.59 %), superando con significancia estadística a los métodos basados en árboles y probabilísticos (p<0.05). La prueba de Wilcoxon no mostró diferencias significativas en el rendimiento entre los mejores clasificadores, lo que confirma su fiabilidad en la atribución de autoría. Los hallazgos subrayan el potencial de incorporar el análisis del estilo de escritura en los sistemas institucionales para mejorar los métodos convencionales de detección de plagio.

Palabras clave: aprendizaje automático, estilo de redacción, minería de datos, predicción, procesamiento del lenguaje natural.

INTRODUCTION

Text mining enables machines to efficiently search and extract valuable information from text documents [1]. This is achieved by identifying characteristic patterns in the natural language used in these documents. Machines can now acquire explicit and structured information through text mining [2]. However, interpreting deeper meaning, as humans do, remains a significant challenge [1].

Text mining applications span multiple fields. In security, for example, text mining can help anticipate and counteract terrorist activities by identifying connections among individuals and entities and by analyzing patterns in social and economic behavior [3], [4]. In politics, it has been used to analyze public opinion on social networks such as X (formerly Twitter) [5]. In business and marketing, text mining helps evaluate customer feedback and comments. It goes beyond simple sentiment classification (positive, neutral, or negative) to identify customer needs, analyze product reviews, and track emerging trends [6]. For instance, it has been applied to improve or realign services in the tourism sector based on customer reviews [7], to identify key attributes of banking services and analyze customer sentiment in online user reviews [8], and to enhance healthcare service design and delivery in hospitals by leveraging customer satisfaction metrics, which vary significantly depending on the service context [9].

Moreover, text mining has important applications in education and healthcare. In education, it is increasingly combined with machine learning to automate the extraction and classification of bibliographic materials in online learning environments. This enables personalized learning experiences and the early identification of students who may require additional support [10], [11]. Additionally, text mining is employed to detect plagiarism in student essays and assess writing styles [12]. In healthcare, it is utilized in biomedical and clinical settings to analyze clinical data, identify potential drug interactions, and track disease outbreaks [13]. Overall, text mining serves as a valuable tool for extracting and interpreting text from various sources, including books, articles, reports, theses, websites, and social networks.

The global pandemic, technological advancements, and the increasing demand for online education have driven a shift toward telematics-based work and study models. This transformation has led education to move from traditional classrooms to virtual learning environments, which facilitate digital content delivery and enable interaction with students worldwide [14]. However, the ease of accessing information online has contributed to a rise in plagiarism among students. Common practices include copying and pasting verbatim text, paraphrasing without proper citation, and using online tools to bypass originality checks. These behaviors hinder the development of critical thinking, analytical reasoning, and writing skills—key competencies for academic success. This issue is particularly concerning in assignments such as undergraduate and master’s theses, where originality and independent research are paramount.

Two primary ethical concerns affect academic writing, particularly in undergraduate and postgraduate reports: plagiarism and false authorship. Plagiarism occurs when content is copied from existing sources without proper citation, often assessed by measuring textual similarity to existing databases. Universities typically establish thresholds for acceptable similarity levels (e.g., a maximum of 15% similarity without citation). False authorship, on the other hand, involves presenting someone else’s work as one’s own, either by commissioning it or by using another person’s ideas without proper credit. Both issues undermine the integrity of academic research.

To address plagiarism, universities employ similarity detection tools such as Turnitin, which compare submitted work against existing sources. However, detecting false authorship—such as contract cheating—is more complex and difficult to prove, even when suspected. This raises a critical research question: Can an artificial intelligence (AI) tool based on text mining and machine learning classify writing styles in academic reports to identify potential cases of false authorship?

This study aims to develop a predictive system that analyzes content to classify scientific writing styles in digital academic reports. Although this research is motivated by concerns regarding student theses, it utilizes reports and papers authored by professors at the Pontificia Universidad Católica del Ecuador Sede Esmeraldas (PUCESE) to train the AI model. The model is designed for future application to student reports, with the potential for expansion to undergraduate and graduate theses. Due to the lack of historical reports tracking students’ academic progress, this study does not involve actual student cases. However, the classifier’s logic can be applied similarly to student papers. By analyzing professors’ manuscripts, this research establishes a foundation for addressing academic integrity issues in student work.

PUCESE stands to benefit from validating not only textual similarity but also the authenticity of writing styles in academic reports. This novel system holds promise for universities worldwide, with PUCESE’s Academic and Research Department, administrative bodies, and faculty advisors as the primary beneficiaries.

This paper is organized as follows: Section 2 presents the theoretical framework of the study. Section 3 introduces authorship attribution and machine learning-based classification systems. Section 4 describes the system’s design. Section 5 details the training results of the classifier in terms of the accuracy metric. Finally, Section 6 presents the conclusions and future work.

THEORETICAL BACKGROUND

Attribution of Authorship

Authorship analysis, a field of growing interest, has made significant contributions to areas such as homeland security, intelligence, and market analysis [15]. It focuses on the automatic classification of texts based on authors’ writing styles, encompassing tasks such as authorship attribution (identifying the author) and plagiarism detection. This study leverages the concept of authorship analysis, specifically authorship attribution, to identify potential cases of false authorship in academic reports.

Within authorship analysis, two primary approaches have been proposed for determining authorship based on writing style. The first approach, known as the profile-based method, aggregates all of an author’s documents into a single dataset for training. This method creates a characteristic profile of the author’s writing style, as illustrated in Figure 1 from Ramírez et al. [16]. However, it requires a substantial volume of documents from a single author, which may not always be available.
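The profile-based idea can be illustrated with a minimal sketch. It is not the implementation used in this study or in Ramírez et al. [16]: the choice of character trigrams as style features, the cosine similarity measure, the function names, and the toy English sentences are all assumptions made for illustration only.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams after lowercasing and collapsing whitespace."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def build_profiles(docs_by_author, n=3):
    """Profile-based step: merge all documents of an author into ONE n-gram profile."""
    return {
        author: char_ngrams(" ".join(docs), n)
        for author, docs in docs_by_author.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def attribute(text, profiles, n=3):
    """Return the author whose aggregated profile is most similar to the text."""
    query = char_ngrams(text, n)
    return max(profiles, key=lambda author: cosine(query, profiles[author]))


# Hypothetical toy corpus: two authors, two short documents each.
docs_by_author = {
    "author_a": [
        "the model was trained on the corpus",
        "the model was evaluated on the test set",
    ],
    "author_b": [
        "we argue that ethics matters greatly",
        "we argue that integrity matters in education",
    ],
}

profiles = build_profiles(docs_by_author)
predicted = attribute("the model was trained on the test set", profiles)
print(predicted)
```

Because every author is reduced to a single aggregated profile, the quality of that profile depends directly on how much text is available per author, which is the limitation noted above.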

The second approach, instance-based, focuses on individual documents. Each document is transformed into a vector representation, from which features are extracted to train the model. This method enables the model to predict the authorship of unknown texts. While Ramírez et al. [16] suggested using a single document, such as an abstract, to capture an author’s writing style, our study proposes leveraging a larger set of documents to provide a more comprehensive representation of each author’s writing style.
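As a contrast with the profile-based method, the following minimal sketch treats every document as a separate training instance. It is an illustrative assumption rather than the pipeline evaluated in this paper: whitespace tokenization, a smoothed TF-IDF weighting, a 1-nearest-neighbor decision by cosine similarity, the function names, and the toy English sentences are all hypothetical choices.

```python
from collections import Counter
from math import log, sqrt


def tokenize(text):
    """Very simple tokenizer (assumption): lowercase and split on whitespace."""
    return text.lower().split()


def fit_idf(token_lists):
    """Smoothed inverse document frequency for each term seen in training."""
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    return {term: log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Instance-based step: ONE sparse TF-IDF vector per document; unseen terms are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    return {term: count * idf[term] for term, count in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def predict_author(text, train_vectors, train_labels, idf):
    """Label an unknown text with the author of its most similar training document."""
    query = tfidf_vector(tokenize(text), idf)
    scores = [cosine(query, vec) for vec in train_vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return train_labels[best]


# Hypothetical toy corpus: each (author, document) pair is its own instance.
train_docs = [
    ("author_a", "the model was trained on the corpus"),
    ("author_a", "the classifier was evaluated on the test set"),
    ("author_b", "we argue that integrity matters in education"),
    ("author_b", "we discuss ethics and plagiarism in universities"),
]

train_labels = [author for author, _ in train_docs]
token_lists = [tokenize(text) for _, text in train_docs]
idf = fit_idf(token_lists)
train_vectors = [tfidf_vector(tokens, idf) for tokens in token_lists]

predicted = predict_author("we argue that plagiarism matters", train_vectors, train_labels, idf)
print(predicted)
```

Keeping one vector per document means that adding more documents per author adds more training instances, which is the motivation for using a larger set of documents rather than a single abstract.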