151

Prediction of academic performance using data

mining in rst year students of peruvian university

abstract

Academic performance is a subject that has

been studied for a long time. First year students

in universities are the most vulnerable to face

performance problems, resulting in possible

desertion. Data mining in education applies data

mining techniques in the information generated in

the education sector. e present research consists of

making the prediction of the academic performance

of the students who entered the Professional School

of Computer and Systems Engineering of the

University of San Martín de Porres in the rst cycle

using data mining. Data were extracted from 1304

entrants who were classied using three factors: social,

economic and academic, and predictions were made

using three techniques: linear regression, decision tree

and support vector machines, having the best result

of 82.87% obtained using the decision tree. Out of

the dierent factors, those that most inuenced the

academic performance were the following: admission

exam grade, gender, age, income and distance from

home to the study center. Using data mining it was

possible to elaborate predictions of the academic

performance of the students, which allowed the

detection of students who could encounter issues in

their studies during the rst semester.

Key words: Academic Performance, prediction,

Educational Data Mining, EDM, Higher Education

resumen

El rendimiento académico es un tema estudiado

desde hace mucho tiempo. Los alumnos ingresantes

de las universidades son los más vulnerables a

enfrentar problemas de rendimiento, resultando en

posible deserción. La minería de datos en educación

aplica técnicas de minería de datos en la información

generada en el sector educación. El presente trabajo

consiste en realizar la predicción del rendimiento

académico de los alumnos que ingresaron a la

Escuela Profesional de Ingeniería de Computación y

Sistemas de la Universidad de San Martín de Porres

en el primer ciclo utilizando minería de datos. Se

extrajeron datos de 1304 ingresantes que fueron

E Y



L C S



R C P



V  J H H



Universidad de San Martín de Porres.

Lima - Perú

eyamao@usmp.pe

2 Universidad de San Martín de Porres.

Lima - Perú

lcelis@usmp.pe

3 Universidad Nacional Federico Villarreal.

Lima - Perú

rcampos@unfv.edu.pe

4 Universidad de San Martín de Porres.

Lima - Perú

vhuancash@usmp.pe

Predicción del rendimiento académico mediante minería de datos en

estudiantes del primer ciclo en una universidad peruana

Recibido: junio 22 de 2018 | Revisado: julio 26 de 2018 | Aceptado: agosto 02 de 2018

https://doi.org/10.24265/campus.2018.v23n26.05

| C | L,  | V. XXIII | N. 26 | PP. - | - |  |  -

152

clasicados en tres factores: sociales, económicos y

académicos y se realizaron predicciones a través de

tres técnicas: regresión lineal, árbol de decisiones y

support vector machines, y el mejor resultado de

82.87% se obtuvo utilizando árbol de decisiones.

De los diferentes factores, los que más inuyeron en

el rendimiento académico fueron los siguientes: nota

de examen de admisión, género, edad, modalidad

de ingreso y distancia desde su casa hasta el centro

de estudios. Utilizando minería de datos fue posible

realizar predicciones del rendimiento académico de los

ingresantes. Esto permitió la detección de ingresantes

que podrían enfrentarse a problemas en sus estudios.

Palabras clave: rendimiento académico, predicción,

educational data mining, educación superior

Introduction

One of the main challenges in

education is the exponential growth of

data generated by information systems

and technology and its use to improve

the quality of educational services

oered with a better decision making.

Educational Data Mining is an emerging

discipline that aims to take advantage of

the new capabilities of data processing and

the maturity of data mining algorithms

to enhance the learning process and

transform existing information into

knowledge. (Han, Kamber & Pei, 2012)

(Romero & Ventura, 2012) (Chalaris,

Gritzalis, Maragoudakis, Sgouropoulou,

y Tsolakidis, 2014) (Romero, Ventura,

Pechenizkiy, y Baker, 2011).

e main subject of study in higher

education is the academic performance

of students. It´s not only a tangible value

to measure the progress of a student in a

given course or subject but it´s also one

of the feature to see the level of success

during and after obtaining a degree. Is

of great importance for the educational

institutions given that the level of

success of their students is a reection

of the quality of the institution. (Calisir,

Basak, y Comertoglu, 2016) (Rodríguez

y Arenas2016) (López Bonilla, López

Bonilla, Serra, y Ribeiro, 2015) (York

2015) (Shahiri y Husain, 2015).

Prediction is one of the oldest

applications of EDM. Multiple studies

have successfully created models to

predict academic performance. It is a

complex process given that multiple

elements have been attributed to impact

academic performance of students. A

proper prediction model can be used to

detect those who might face diculties

in their studies and be in risk of dropping

out. Measures to help those students in

risk by additional tutoring, changes and

improvements in courses or curricula

are some of the adjustments made

previously. (Ramesh, parkavi y ramar,

2013) (Mishra, Kumar y Gupta, 2014)

(Strecht, Cruz, Soares, Merdes-Moreria y

Abren 2015) (ElGamal, 2013).

e most vulnerable students in

universities, who might face diculties

and drop out are found in rst year

students. Adaptation to life in a university

can be a great challenge for many and

| C | V. XXIII | N. 26 |  -  | 2018 |

E Y - L C S - R C P - V  J H H

153

some never manage to adapt completely.

Multiple studies have found that the

number of students with failing grades

are higher in the rst year, which extends

their time in the university and in some

cases dropping out. Student retention and

their graduation are an important goal for

universities, especially in STEM majors

where student dropout rate might exceed

the 30% in rst year students. (Baradwaj

& Pal, 2012) (Sepehrian, 2012) (Li, Rusk

y Song, 2013) (Cheewaprakobkit, 2013).

Previous studies of prediction using

EDM include Mishra et al. (2014) to

predict academic performance of third

year student in computer science major

using J48 and Random Tree algorithm.

Performance in previous semesters and

other courses, leadership and motivation

are inuential in academic performance.

Elakia & Aarthi (2014) uses student

characteristics in high school to predict

in which major they will have the best

performance and discipline in universities

based on behavioral trends in high school.

Pal (2012), predicts low achievement and

chances of dropout using multiple decision

tree algorithm. Ramesh et al. (2013) uses

prediction in nal exam grade to nd those

who could fail a course, being occupation

of the parents a strong impact in the

results. Gray, McGuinness y Owende

(2014) predicted academic achievement of

rst year student, using gender, age, high

school grades, personality, motivation and

learning style. Students under 21 years

had a better prediction results. Sembiring,

Zarlis, Hartama, Ramliana, y Wani

(2011) uses SVM ad clustering to predict

academic performance using interest,

beliefs, family support, attitude and time

spent studying achieving a high prediction

rate of 93.67%.

ere is very little studies made in

Peru related to academic performance

and almost none using EDM. is study

is one of the rst to create a prediction

model from student characteristics found

in Peruvian universities.

Materials and method

Methodology

e present study used a quantitative

method. Scientic studies and related

papers where searched to determine the

proper variables needed to achieve the

objectives of this study and to create

a theoretical relationship between the

variables. e design of this study is

correlational – causal, aimed to describe

the relationships between academic

achievement and characteristics of

rst year students used to validate the

prediction methods.

Population and sample

e population of this study are the

rst year students of the Information

Systems career of the San Martin de

Porres University. e sample has been

taken from the years 2010 to 2015, giving

1304 students who were admitted, from

which transfer students (who did not take

the courses of the rst year), those who

dropped out in the rst weeks or never

enrolled in any course were removed

from the data.

Dataset

All available data from admissions

oce and faculties has been collected

and cleaned to create the dataset. From

the revision of the literature and the

| C | V. XXIII | N. 26 |  -  | 2018 |

P             

154

data found in the data repositories of the

university, the selected variables for the

study are:

• AGE. It´s the age of the student at

the date of the rst class in university.

• GENDER. e gender, male or

female of the student, for this study

a binary representation, zero (0) for

female and one (1) for male is used.

• PROVINCE. To identify those

students that came from another

province to study in the capital.

• SCHOOLTYPE. Dierent types of

school (national, private, religious

and others).

• COLEEXC. Based on a list of the

top 500 schools with student having

the best grades in university.

• ADMISSIONEXAM. Grade of

the student in the admission exam.

Presented in percentage of the total

score.

• DISTANCE. e total distance

calculated from the place of residence

of the student to the faculty where

the classes were given. In some cases,

there was no data for the address so

the average distance from the selected

district is used.

• APPROVED. To indicate the PASS

or FAIL status of the student in the

rst year of university.

Data Mining Methods

ree data mining methods has been

selected to be applied in the dataset.

Regression: e purpose for this model

is to t the data to a model based on

variables. Answers questions like: what is

the forecast of sales for the next month?

(Han et al. 2012)

Decision Trees: Represent a group of

classication rules in shape of a tree,

based on an if-then ruleset. (Han et al.

2012).

For this study C5.0 algorithm was

selected.

Support Vector Machines: Support

vector machines is a type of algorithm

that builds a model to represent simple

point in a higher dimension to dene a

hyperplane to be used to create an optimal

separation between classes to achieve

proper classication. For the denition

of the hyperplane, the algorithm uses

the support vectors to map the data in

a high enough dimension to make the

classication. (Lantz, 2013)

Ethical aspects

e main ethical aspects about this type

of research is the privacy of the personal

data about students and professors.

Information that can be used to identify

a person requires authorization before its

use and publication.

To assure the privacy of the data

used in this study an anonymization

process was applied to the data related

to students and professors. A unique

ID was assigned to each row of the data

and every column that could be used for

individual identication like rst name,

last name, ID card number, address has

been removed. is way, protecting the

privacy and the validity of the study

results is achieved.

| C | V. XXIII | N. 26 |  -  | 2018 |

155

Result and discussion

Results

Regression

Linear Regression models aim to t

into a linear model y=f(x) all the attributes

of the database don the relation between

the dependent variable (y) and the

independent variable (x). For this study,

a logistic regression model has been used

to create a model to predict the PASS/

FAIL condition of the rst year students.

For the selection of the most inuential

variables to be added to the model,

backward selection method has been

used. e variables with the best t are

shown in Table 1. As expected admission

exam score is the most important

variable and the distance has a negative

coecient, meaning that student with

less travel distance from their homes

are more likely to pass the courses. e

gender also has a negative coecient,

indicating that female student are more

likely to earn a passing grade.

Table 1

Results from logistic regression

Coecients:

Estimate Std. Error Z value Pr(>|z|) Signicance

(Intercept) -3.37717 0.87575 -3.856 0.000115 ***

ADMISSIONEXAM 2.70438 0.47570 5.685 1.31e-08 ***

AGE 0.10228 0.04379 2.335 0.019525 *

GENDER -0.55567 0.24382 -2.279 0.022663 *

DISTANCE -0.03678 0.01665 -2.209 0.027194 *

To use this model for prediction, the

dataset has been split randomly into

75% for the training and 25% for the

testing. e results of the prediction used

this model is as seen on Table 2. Further

examination of the results from this

model has shown that there is a signicant

dierence in the rate of passing the

courses on the type of admission exam

taken.

Table 2

Prediction results logistic regression

Exactude Sensibility Specicity AUC

67.4% 69.44% 59.45% 68.82%

Considering this new discovery, the

type of admission exam was used to

perform a further analysis and a new

model with a better prediction capability

was discovered using the variables in Table

3. In this new model, only considering

those who were admitted via the ordinary

type of admission process is considered.

e results from the prediction using this

model is as seen on Table 4.

| C | V. XXIII | N. 26 |  -  | 2018 |

156

Table 3

Results from logistic regression ltered by ordinary admission type

Coecientes:

Esmate Std. Error Z value Pr(>|z|) Signicance

(Intercept) -4.2345 0.5179 -8.176 2.96e-16 ***

ADMISSIONEXAM 10.9480 1.3225 8.278 < 2e-16 ***

GENDER -0.8137 0.3398 -2.395 0.0166 *

COLEEXC -1.1600 0.5839 -1.987 0.0470 *

Table 4

Prediction results logistic regression ltered by ordinary admission type

Exactude Sensibility Specicity AUC

74.4% 74% 76% 82.12%

Decision trees

e C5.0 algorithm was used to create

models based on an IF THEN rules to

study the importance of each variable

and predict the PASS / FAIL outcome.

Boosting techniques has been applied to

make multiple iterations of the algorithm

to enhance the results. Using what was

learned from the regression model, the

type of admission was used a criteria to

study the data. e best model found

using the decision trees method is as seen

on Figure 1.

Figure 1. Decision Tree model ltered by ordinary admission type

Admission exam score, gender, age

and variables related to the type of

school are considered in this model. As

in the regression model, a high admission

exam score and being female are strong

indicators of having a PASS outcome.

Prediction using this model achieved an

exactitude in the prediction of 82.87%.

| C | V. XXIII | N. 26 |  -  | 2018 |

157

Support Vector Machines SVM

Support vector machines method

was used to create models based support

vectors, to create a hyperplane to make

the classication. An important part of

SVM is the selection of the kernel to be

used to model the data. e linear and

polynomial kernel couldn’t create a model

good enough to be used for prediction,

but the Gaussian and sigmoid kernels

managed to create models, achieving the

best results in predicting the outcome

the Gaussian kernel with a exactitude of

75.2%

Discussion

Prediction of the outcome of the rst

year students has been achieved. e

prediction results of the best models for

each DM method used in this study is as

seen on Table 5.

Table 5

Results of exactitude of prediction from Logistic regression, C5.0 decision trees and SVM

algorithms

Algorithm Predicon

Logisc Regression 74.4%

C5.0 82.87%

SVM 75.2%

As in other studies (Ecklund, 2013) (Li,

Swaminathan, & Tang, 2009) (Veenstra,

Dey, & Herrin, 2008) (Honken &

Ralstron, 2013) (Elakia & Aarthi, 2014)

(Gray et al. 2014), admission exam

score is an important evaluation tool for

prediction. e higher the admission

exam score, the more likely for the rst

year student to pass the rst semester.

Gender was also found to be important,

as female students had a higher chance of

passing the rst semester and the closer

the distance between the student’s place

of residence and the university, the more

likely it is to have a passing grade.

e type of elementary and junior

high school attended before studying

in the university had little impact on

student outcome.

Conclusions

Predicting the future outcome of

rst year students is an important way

for universities to detect those who will

most likely face problems to pass the

rst semester of studies, to give them the

necessary support and prevent student

dropout.

Female students in this study were

found to have a higher passing rate than

their female counterpart. Usually, being

a gender minority, as is usual in STEM

careers is considered to have a negative

impact (Bayer, Bydzovska, Geryk,

Obsıvac y Popelinsky, 2012) (Ecklund,

2013), which is a topic for future studies.

e type of school in elementary and

junior high school did not have a strong

| C | V. XXIII | N. 26 |  -  | 2018 |

158

impact in the outcome, and the students

coming from schools that are considered

to the better ones either. It is most likely

that other characteristics like emotional

maturity, motivation, group of friends

and others have a similar or stronger

impact than academic ones.

As it is often mentioned, the data

quality and quantity is important in

References

these type of studies. Given that this

study took place in one career in one

university, it is of interest for future

studies to compare the results with

other careers like accounting (Byrne

& Flood, 2008) or engineering (Li et

al., 2009) (Veenstra, Dey, & Herrin,

2008) (Cheewaprakobkit, 2013) (Li, et

al. 2013), and in other universities in

Peru.

Baradwaj, B. K., & Pal, S. (2012).

Mining educational data to analyze

students’ performance. arXiv

preprint arXiv:1201.3417.

Bayer, J., Bydzovská, H., Géryk, J.,

Obsivac, T., & Popelinsky, L.

(2012). Predicting Drop-Out

from Social Behaviour of Students.

International Educational Data

Mining Society.

Byrne, M., & Flood, B. (2008).

Examining the relationships

among background variables and

academic performance of rst year

accounting students at an Irish

University. Journal of Accounting

Education, 26(4), 202-212.

Calisir, F., Basak, E., & Comertoglu,

S. (2016). Predicting academic

performance of master’s students in

engineering management. College

Student Journal, 50(4), 501-513.

Chalaris, M., Gritzalis, S., Maragoudakis,

M., Sgouropoulou, C., &

Tsolakidis, A. (2014). Improving

quality of educational processes

providing new knowledge using

data mining techniques. Procedia-

Social and Behavioral Sciences, 147,

390-397.

Cheewaprakobkit, P. (2013). Study of

Factors Analysis Aecting Academic

Achievement of Undergraduate

Students in International Program.

In Proceedings of the International

MultiConference of Engineers and

Computer Scientists (Vol. 1, pp. 13-

15).

Ecklund, A. P. (2013) Enhancing

Incoming Male Student Retention:

An Analysis of the Experiences of

Persistence in Engineering.

Elakia, G., & Aarthi, N. J. (2014).

Application of data mining in

educational database for predicting

behavioural patterns of the

students. International Journal of

Computer Science and Information

Technologies (IJCSIT), 5(3), 4649-

4652.

ElGamal, A.F. (2013). An Educational

Data Mining Model for

| C | V. XXIII | N. 26 |  -  | 2018 |

159

Predicting Student Performance

in Programming Course.

International Journal of Computer

Applications, 70(17).

Gray, G., McGuinness, C., & Owende,

P. (2014). An application of

classication models to predict

learner progression in tertiary

education. In Advance Computing

Conference (IACC), 2014 IEEE

International(pp. 549-554). IEEE.

Han, J., Kamber, M., & Pei, J. (2012).

Data Mining: Concepts and

Techniques, Elsevier.

Honken, N. B., & Ralston, P. A.

(2013). High-Achieving High

School Students and Not So

High-Achieving College Students

A Look at Lack of Self-Control,

Academic Ability, and Performance

in College. Journal of Advanced

Academics, 24(2), 108-124.

Lantz, B. (2013). Machine learning with

R. Packt Publishing Ltd.

Li, Q., Swaminathan, H. and Tang,

J. (2009), Development of

a Classication System for

Engineering Student Characteristics

Aecting College Enrollment and

Retention. Journal of Engineering

Education, 98: 361–376. doi:

10.1002/j.2168-9830.2009.

tb01033.x

Li, K. F., Rusk, D., & Song, F. (2013).

Predicting student academic

performance. In Complex,

Intelligent, and Software Intensive

Systems (CISIS), 2013 Seventh

International Conference on (pp. 27-

33). IEEE.

López

Bonilla, J. M., López Bonilla,

L. M., Serra, F., & Ribeiro, C.

(2015). Relación entre actitudes

hacia la actividad física y el

deporte y rendimiento académico

de los estudiantes universitarios

españoles y portugueses. Revista

iberoamericana de psicología del

ejercicio y el deporte, 10(2), 275-284.

Mishra T., Kumar D. & Gupta S.

(2014) Mining Students’ Data

for Prediction Performance

Fourth International Conference

on Advanced Computing &

Communication Technologies,

Rohtak, pp. 255-262.

Ramesh, V., Parkavi, P., & Ramar,

K. (2013). Predicting student

performance: a statistical and data

mining approach. International

journal of computer

applications,63(8).

Rodríguez, Á. P. A., & Arenas, D. A. M.

(2016). Programas de intervención

para Estudiantes Universitarios

con bajo rendimiento académico.

Informes Psicológicos, 16(1), 13-34.

Romero, C., Ventura, S., Pechenizkiy,

M., & Baker, R. S. (Eds.). (2011).

Handbook of educational data

mining. CRC Press.

Romero, C., & Ventura, S. (2013).

Data mining in education. Wiley

Interdisciplinary Reviews: Data

Mining and Knowledge Discovery,

3(1), 12-27.

| C | V. XXIII | N. 26 |  -  | 2018 |

160

Sembiring, S., Zarlis, M., Hartama, D.,

Ramliana, S., & Wani, E. (2011).

Prediction of student academic

performance by an application

of data mining techniques. In

International Conference on

Management and Articial

Intelligence IPEDR (Vol. 6, pp.

110-114).

Sepehrian, F. (2012). Emotional

Intelligence as a predictor of

academic performance in university.

Journal of Educational Sciences and

Psychology, 2(2).

Shahiri, A. M., & Husain, W. (2015).

A review on predicting student’s

performance using data mining

techniques. Procedia Computer

Science,72, 414-422.

Strecht, P., Cruz, L., Soares, C., Merdes-

Moreria, J. & Abren, R. (2015). A

Comparative Study of Classication

and Regression Algorithms for

Modelling Students’ Academic

Performance. In 8th International

Conference on Educational Data

Mining, Madrid, Spain, 392-395.

Veenstra, C. P., Dey, E. L., & Herrin,

G. D. (2008). Is Modeling of

Freshman Engineering Success

Dierent from Modeling of Non

Engineering Success?. Journal of

Engineering Education, 97(4),

467-479.

York, T. T., Gibson, C., & Rankin, S.

(2015). Dening and measuring

academic success. Practical

Assessment, Research & Evaluation,

20(5), 2.

| C | V. XXIII | N. 26 |  -  | 2018 |