It was already described in Table 11.
EXPERIMENT PLANNING
Context Selection
It was an “in vitro” experiment: the data were taken from the real environment so that they could be transformed and then used in a controlled environment. The data of the students of all undergraduate courses were considered, covering the freshmen enrolled between 2003, the year the first undergraduate courses started, and 2020. The data gathering considered personal, academic, and social attributes.
Research Questions
In the context of school dropout, which of the selected algorithms presented the best efficacy indicators? Does the accuracy exceed the established goal of 90%?
Dependent Variables
The classifications produced by the algorithms, from which the following metrics can be derived: Accuracy, Sensitivity (Recall), Precision, and F1-Measure.
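As an illustration, the sketch below derives these metrics from a set of classifications with Scikit-Learn; the label vectors are hypothetical placeholders, not data from the experiment.

```python
from sklearn.metrics import (accuracy_score, recall_score,
                             precision_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1]  # hypothetical ground-truth labels (1 = dropout)
y_pred = [1, 0, 0, 1, 0, 1]  # hypothetical predicted labels

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))   # Sensitivity
print("Precision:", precision_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
```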
Independent Variables
The dataset described in Table 12 and the algorithms previously listed.
Formulation of Theoretical Hypotheses
• H0: The algorithms (1, 2, …, n) have the same efficacy.
• H1: The algorithms (1, 2, …, n) have different efficacy.
Selection of Participants and Objects
All the undergraduate students of the Institution were selected, totaling 10,949 students, of which 6,961 (63.57%) belonged to the target class (Dropouts) and 3,988 (36.42%) belonged to the control class (Non-dropouts). One of the metrics used was accuracy, which requires balanced classes. Thus, it was necessary to conduct a balancing process, which limited each class to the number of records in the smaller class, 3,988; the resulting total exceeds the required sample size for an infinite population, according to the literature (Pinto, 2015). Considering the population of the Institution, the final sample of 7,976 students has a margin of error of 0.57% at a 95% confidence level.
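A minimal sketch of this balancing step, assuming a hypothetical pandas DataFrame loaded from an export of the student records, with an illustrative binary column dropout (1 = dropout, 0 = non-dropout):

```python
import pandas as pd

# Hypothetical export of the student records; file and column names are illustrative.
df = pd.read_csv("students.csv")

dropouts = df[df["dropout"] == 1]      # majority class (6,961 records)
non_dropouts = df[df["dropout"] == 0]  # minority class (3,988 records)

# Undersample the majority class to the size of the minority class,
# yielding the balanced sample of 7,976 students.
balanced = pd.concat([
    dropouts.sample(n=len(non_dropouts), random_state=42),
    non_dropouts,
]).sample(frac=1, random_state=42)  # shuffle the combined records
```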
Experiment Design
For the evaluation of the model, we used the 10-Fold Cross-Validation approach, in which the data are divided into 10 parts that keep the class proportions. Ten runs were therefore conducted, each one holding out a different part of the data for testing while the remaining parts were used for training. In addition, this design makes it possible to obtain annual, semi-annual, bi-monthly, quarterly, monthly, and half-monthly results.
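A minimal sketch of this scheme with Scikit-Learn's StratifiedKFold, which preserves the class proportions in each fold; the synthetic dataset and the classifier stand in for the experiment's data and algorithms:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic balanced dataset standing in for the 7,976 student records.
X, y = make_classification(n_samples=7976, weights=[0.5, 0.5], random_state=42)

# 10 folds, each keeping the original class proportions.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=cv, scoring="accuracy")
print(scores.mean(), scores.std())
```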
Instrumentation
For the data mining process, we used the Python PyCaret library, a high-level, open-source machine learning library whose purpose is to simplify model comparison and improve the usability of the Scikit-Learn library. For the execution of the Python code, we used the Google Colab cloud environment, which is intended for the creation and execution of Python code directly from a browser, without the need to install any software on local machines. The data used for analysis came from SIGAA, the academic system used in the Institution, which uses PostgreSQL as its DBMS (Database Management System).
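A minimal sketch of this instrumentation, assuming a prepared DataFrame with a hypothetical target column named dropout; file and column names are illustrative only:

```python
import pandas as pd
from pycaret.classification import setup, compare_models

# Hypothetical export of the prepared dataset.
data = pd.read_csv("students_prepared.csv")

# Initialize the experiment with 10-fold cross-validation on the target column.
exp = setup(data=data, target="dropout", fold=10, session_id=42)

# Train the available classifiers and rank them by their validation metrics.
best_model = compare_models()
```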
EXPERIMENT OPERATION
Preparation
It consisted of preparing the dataset used in the mining, following the Prepare Data subprocess of the proposed process.
Execution
It consisted of executing the classification process on the data, as planned in the experiment design, for each selected mining algorithm, using the other independent variables.
Data Validation
Three types of statistical tests were used to assist with the analysis, interpretation, and validation: the Shapiro-Wilk Test, the Levene Test, and the Paired T-Test. The Shapiro-Wilk Test was used to check normality, and the Levene Test to check homoscedasticity. Since the normality assumption was fulfilled and the homoscedasticity assumption was not, the Paired T-Test was used to test the hypotheses.
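These three tests are available in SciPy; a minimal sketch, where scores_a and scores_b stand for the per-fold results of two hypothetical algorithms:

```python
from scipy.stats import shapiro, levene, ttest_rel

# Illustrative per-fold accuracies of two algorithms, not experiment data.
scores_a = [0.91, 0.92, 0.90, 0.93, 0.91, 0.92, 0.90, 0.91, 0.93, 0.92]
scores_b = [0.88, 0.89, 0.87, 0.90, 0.88, 0.89, 0.87, 0.88, 0.90, 0.89]

print(shapiro(scores_a))              # normality of one sample
print(levene(scores_a, scores_b))     # homoscedasticity between samples
print(ttest_rel(scores_a, scores_b))  # paired comparison of the two algorithms
```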
RESULTS
Data Analysis and Interpretation
After the execution of the algorithms, the results of the classifications, which are presented in Figure 10, were obtained by using the 10-Fold Cross-Validation approach.
Threats to Validity
Threats to the internal validity: the current academic system has been present in the Institution since 2017 and inherited the database of the previous academic system, which contained several inconsistent records, mainly up to mid-2007. This threat was mitigated by conducting the data cleaning process, reducing the probability of using older, incorrect information.