It was already described in Table 11.
EXPERIMENT PLANNING
Context Selection
It was an “in vitro” experiment: the data were taken from the real environment so that they could be transformed and then used in a controlled environment. The data of the students of all undergraduate courses were considered, covering the freshmen enrolled between 2003, the year the first undergraduate courses started, and 2020. The data gathering considered personal, academic, and social attributes.
Research Questions
In the context of school dropout, which of the selected algorithms presented the best efficacy indicators? Does the accuracy exceed the established goal of 90%?
Dependent Variables
The classifications produced by the algorithms, from which the following metrics can be derived: Accuracy, Sensitivity (Recall), Precision, and F1-Measure.
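As an illustration, the sketch below derives these metrics from a set of classifications with Scikit-Learn; the label vectors are hypothetical placeholders, not data from the experiment.

```python
from sklearn.metrics import (accuracy_score, recall_score,
                             precision_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1]  # hypothetical ground-truth labels (1 = dropout)
y_pred = [1, 0, 0, 1, 0, 1]  # hypothetical predicted labels

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))   # Sensitivity
print("Precision:", precision_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
```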
Independent Variables
The dataset described in Table 12 and the algorithms previously listed.
Formulation of Theoretical Hypotheses
• H0: The algorithms (1, 2, …, n) have the same efficacy.
• H1: The algorithms (1, 2, …, n) have different efficacy.
Selection of Participants and Objects
All the undergraduate students of the Institution were selected, totaling 10,949 students, of which 6,961 (63.57%) belonged to the target class (Dropouts) and 3,988 (36.42%) belonged to the control class (Non-dropouts). One of the metrics used was accuracy, which requires balanced classes. Thus, it was necessary to conduct a balancing process, which limited each class to the number of records in the smaller class, 3,988; the resulting total exceeds the required sample size for an infinite population, according to the literature (Pinto, 2015). Considering the population of the Institution, the final sample of 7,976 students has a margin of error of 0.57% at a 95% confidence level.
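A minimal sketch of this balancing step, assuming a hypothetical pandas DataFrame loaded from an export of the student records, with an illustrative binary column dropout (1 = dropout, 0 = non-dropout):

```python
import pandas as pd

# Hypothetical export of the student records; file and column names are illustrative.
df = pd.read_csv("students.csv")

dropouts = df[df["dropout"] == 1]      # majority class (6,961 records)
non_dropouts = df[df["dropout"] == 0]  # minority class (3,988 records)

# Undersample the majority class to the size of the minority class,
# yielding the balanced sample of 7,976 students.
balanced = pd.concat([
    dropouts.sample(n=len(non_dropouts), random_state=42),
    non_dropouts,
]).sample(frac=1, random_state=42)  # shuffle the combined records
```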
Experiment Design
For the evaluation of the model, we used the 10-Fold Cross-Validation approach, in which the data are divided into 10 parts that keep the class proportions. Ten runs were therefore conducted, each one holding out a different part of the data for testing while the remaining parts were used for training. In addition, this design makes it possible to obtain annual, semi-annual, bi-monthly, quarterly, monthly, and half-monthly results.
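A minimal sketch of this scheme with Scikit-Learn's StratifiedKFold, which preserves the class proportions in each fold; the synthetic dataset and the classifier stand in for the experiment's data and algorithms:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic balanced dataset standing in for the 7,976 student records.
X, y = make_classification(n_samples=7976, weights=[0.5, 0.5], random_state=42)

# 10 folds, each keeping the original class proportions.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=cv, scoring="accuracy")
print(scores.mean(), scores.std())
```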
Instrumentation
For the data mining process, we used the Python PyCaret library, a high-level, open-source machine learning library whose purpose is to simplify model comparison and improve the usability of the Scikit-Learn library. For the execution of the Python code, we used the Google Colab cloud environment, which is intended for the creation and execution of Python code directly from a browser, without the need to install any software on local machines. The data used for analysis came from SIGAA, the academic system used in the Institution, which uses PostgreSQL as its DBMS (Database Management System).
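A minimal sketch of this instrumentation, assuming a prepared DataFrame with a hypothetical target column named dropout; file and column names are illustrative only:

```python
import pandas as pd
from pycaret.classification import setup, compare_models

# Hypothetical export of the prepared dataset.
data = pd.read_csv("students_prepared.csv")

# Initialize the experiment with 10-fold cross-validation on the target column.
exp = setup(data=data, target="dropout", fold=10, session_id=42)

# Train the available classifiers and rank them by their validation metrics.
best_model = compare_models()
```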
EXPERIMENT OPERATION
Preparation
It consisted of preparing the dataset used in the mining, following the Prepare Data subprocess of the proposed process.
Execution
It consisted of executing the classification process on the data, as planned in the experiment design, for each selected mining algorithm, using the other independent variables.
Data Validation
Three types of statistical tests were used to assist with the analysis, interpretation, and validation: the Shapiro-Wilk Test, the Levene Test, and the Paired T-Test. The Shapiro-Wilk Test was used to check normality, and the Levene Test to check homoscedasticity. Since the normality assumption was fulfilled and the homoscedasticity assumption was not, the Paired T-Test was used to test the hypotheses.
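These three tests are available in SciPy; a minimal sketch, where scores_a and scores_b stand for the per-fold results of two hypothetical algorithms:

```python
from scipy.stats import shapiro, levene, ttest_rel

# Illustrative per-fold accuracies of two algorithms, not experiment data.
scores_a = [0.91, 0.92, 0.90, 0.93, 0.91, 0.92, 0.90, 0.91, 0.93, 0.92]
scores_b = [0.88, 0.89, 0.87, 0.90, 0.88, 0.89, 0.87, 0.88, 0.90, 0.89]

print(shapiro(scores_a))              # normality of one sample
print(levene(scores_a, scores_b))     # homoscedasticity between samples
print(ttest_rel(scores_a, scores_b))  # paired comparison of the two algorithms
```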
RESULTS
Data Analysis and Interpretation
After the execution of the algorithms, the results of the classifications, which are presented in Figure 10, were obtained by using the 10-Fold Cross-Validation approach.
Threats to Validity
Threats to the internal validity: the current academic system has been present in the Institution since 2017 and inherited the database of the previous academic system, which contained several inconsistent records, mainly up to mid-2007. This threat was mitigated by conducting the data cleaning process, reducing the probability of using older, incorrect information.