Dialect Classification of the Javanese Language Using the K-Nearest Neighbor

Keywords: case folding, Javanese dialect, K-Nearest Neighbor, Natural Language Processing, Synthetic Minority Oversampling Technique, tokenizing

Abstract

Indonesia is rich in ethnic and cultural diversity, which is reflected in the distinct linguistic characteristics of each group. One way to preserve the Javanese language is to study its dialects. This study classifies three main dialects on the island of Java (East Java, Central Java, and West Java) using text data from online sources. The classification process consists of preprocessing (tokenizing, case folding, and word weighting), data balancing with the Synthetic Minority Oversampling Technique (SMOTE), and classification using the K-Nearest Neighbor (K-NN) algorithm. In testing with 10-fold cross-validation, the best-performing configuration reached an accuracy of 94.05%, a precision of 95.83%, and a recall of 94.44%. These results highlight the value of dialect recognition for preserving the Javanese language and for developing linguistic technology applications, and they contribute to computational linguistics research on regional languages.
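The pipeline named in the abstract (case folding, tokenizing, word weighting, SMOTE, K-NN) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the TF-IDF variant, the Euclidean distance, the values of k, the helper names, and the placeholder corpus are all hypothetical, and the 10-fold cross-validation step is omitted because the toy corpus is too small for it.

```python
import math
import random
import re
from collections import Counter


def preprocess(text):
    """Case folding, then tokenizing on runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


def fit_tfidf(token_lists):
    """Learn a vocabulary and smoothed IDF weights (one common TF-IDF variant)."""
    n_docs = len(token_lists)
    df = Counter(t for tokens in token_lists for t in set(tokens))
    vocab = sorted(df)
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    return vocab, idf


def vectorize(tokens, vocab, idf):
    """Word weighting: relative term frequency times IDF; unknown tokens are ignored."""
    tf = Counter(tokens)
    n = max(len(tokens), 1)
    return [tf[t] / n * idf[t] for t in vocab]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def smote(X, y, k=2, seed=0):
    """Grow each minority class to the majority size by interpolating between
    a sample and one of its k nearest neighbours from the same class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        members = [x for x, lab in zip(X, y) if lab == label]
        if len(members) < 2:
            continue  # interpolation needs at least one neighbour
        for _ in range(target - count):
            base = rng.choice(members)
            neighbours = sorted((m for m in members if m is not base),
                                key=lambda m: euclidean(base, m))[:k]
            other = rng.choice(neighbours)
            gap = rng.random()
            X_out.append([b + gap * (o - b) for b, o in zip(base, other)])
            y_out.append(label)
    return X_out, y_out


def knn_predict(X_train, y_train, x, k=3):
    """Majority vote among the k training vectors closest to x."""
    nearest = sorted(zip(X_train, y_train), key=lambda p: euclidean(p[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Placeholder tokens only: these are NOT real Javanese dialect samples.
corpus = [
    ("AA bb cc", "East Java"), ("aa bb dd", "East Java"),
    ("aa cc dd", "East Java"), ("bb cc dd", "East Java"),
    ("ee ff gg", "Central Java"), ("ee ff hh", "Central Java"),
    ("ee gg hh", "Central Java"),
    ("ii jj kk", "West Java"), ("ii jj ll", "West Java"),
]
tokens = [preprocess(text) for text, _ in corpus]
labels = [label for _, label in corpus]
vocab, idf = fit_tfidf(tokens)
X = [vectorize(t, vocab, idf) for t in tokens]

X_bal, y_bal = smote(X, labels, k=2, seed=0)  # every class now has 4 samples
query = vectorize(preprocess("Aa BB zz"), vocab, idf)
prediction = knn_predict(X_bal, y_bal, query, k=3)
print(Counter(y_bal))
print(prediction)
```

In a real evaluation, oversampling would normally be applied only to the training portion of each cross-validation fold, so that synthetic samples never leak into the data used for testing.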

Author Biographies

Brilliant Filby, Universitas Negeri Malang

Department of Informatics Engineering

Utomo Pujianto, Universitas Negeri Malang

Department of Informatics Engineering

Jehad A. H. Hammad, Al-Quds Open University

Department of Computer Information Systems

Aji Prasetya Wibawa, Universitas Negeri Malang

Department of Electrical and Informatics Engineering

Published
2025-01-19
How to Cite
Filby, B., Pujianto, U., Hammad, J. A. H., & Wibawa, A. P. (2025). Dialect Classification of the Javanese Language Using the K-Nearest Neighbor. Journal of Information Technology and Cyber Security, 2(2), 111-122. https://doi.org/10.30996/jitcs.12213
Section
Research Article