SELENIUM–INDOBERT PIPELINE FOR PSEUDO-LABELING SENTIMENT ANALYSIS OF INDONESIAN YOUTUBE COMMENTS
Abstract
YouTube has become a major platform for public discourse in Indonesia, yet large-scale sentiment analysis of its comments remains challenging due to dynamic content, informal language, and limited labeled data. This study proposes a Selenium–IndoBERT pipeline for sentiment analysis of Indonesian YouTube comments using a pseudo-labeling approach. Data were collected from ten YouTube videos discussing the One Piece flag phenomenon, yielding 10,842 comments after preprocessing. Selenium was employed to extract comments from dynamic pages, while IndoBERT was fine-tuned on a small manually labeled dataset and used to generate pseudo-labels for unlabeled data. Model performance was evaluated using probabilistic metrics, including Coverage, Expected Calibration Error (ECE), and Brier Score. At a confidence threshold of 0.75, 78.5% of comments received pseudo-labels, with an ECE of 0.095 and a Brier Score of 0.174. Manual validation showed substantial agreement with human annotations (Fleiss’ kappa = 0.72). The results indicate that the proposed pipeline enables scalable and reliable sentiment analysis with minimal manual annotation.
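The confidence-thresholded pseudo-labeling step and the three probabilistic metrics named above can be sketched as follows. This is a minimal illustration, not the study's actual implementation. The function names, the toy probabilities, the equal-width 10-bin ECE, and the multiclass (sum over classes) form of the Brier Score are all assumptions, since the abstract does not specify them. Only the 0.75 threshold comes from the text.

```python
from typing import List, Optional, Sequence


def pseudo_label(probs: Sequence[Sequence[float]],
                 threshold: float = 0.75) -> List[Optional[int]]:
    """Assign the argmax class when its probability reaches the threshold.

    Comments below the threshold stay unlabeled (None).
    """
    labels: List[Optional[int]] = []
    for p in probs:
        p = list(p)
        conf = max(p)
        labels.append(p.index(conf) if conf >= threshold else None)
    return labels


def coverage(labels: Sequence[Optional[int]]) -> float:
    """Fraction of comments that received a pseudo-label."""
    return sum(label is not None for label in labels) / len(labels)


def brier_score(probs: Sequence[Sequence[float]],
                y_true: Sequence[int]) -> float:
    """Multiclass Brier Score: mean squared distance to the one-hot label."""
    total = 0.0
    for p, y in zip(probs, y_true):
        total += sum((pk - (1.0 if k == y else 0.0)) ** 2
                     for k, pk in enumerate(p))
    return total / len(probs)


def expected_calibration_error(probs: Sequence[Sequence[float]],
                               y_true: Sequence[int],
                               n_bins: int = 10) -> float:
    """ECE: bin-size-weighted mean gap between accuracy and confidence."""
    n = len(probs)
    bins: List[List[tuple]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, y_true):
        p = list(p)
        conf = max(p)
        correct = 1.0 if p.index(conf) == y else 0.0
        # Equal-width bins; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(a for _, a in b) / len(b)
        ece += (len(b) / n) * abs(accuracy - avg_conf)
    return ece


# Toy softmax outputs (invented for illustration) and human labels
# for a small validation subset.
toy_probs = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.10, 0.80, 0.10],
    [0.20, 0.20, 0.60],
]
toy_true = [0, 1, 1, 2]

toy_labels = pseudo_label(toy_probs, threshold=0.75)
print("coverage:", coverage(toy_labels))
print("ECE:", expected_calibration_error(toy_probs, toy_true))
print("Brier:", brier_score(toy_probs, toy_true))
```

Coverage needs only the model's probabilities, whereas ECE and the Brier Score need reference labels, so in practice they would be computed on a manually annotated subset. Raising the threshold trades coverage for cleaner pseudo-labels.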