CLASSIFICATION OF FAKE NEWS IN INDONESIAN LANGUAGE USING SUPPORT VECTOR MACHINE METHOD

Abstract: Since information and communication technology has become ingrained in our daily lives, accessing information has become easier. However, this raises some concerns, one of which is fake news. The aim of this study is to develop an Indonesian fake news detection system that works from news headlines. The methods used are the linear kernel Support Vector Machine and n-grams. According to the performance test that was carried out, the linear kernel Support Vector Machine model employing the term frequency-inverse document frequency (TF-IDF) unigram feature performs better than the bigram variant. The precision value obtained from the model performance test is 1.00, meaning that the system's answers match the requested information about fake news detection very well. The recall value obtained is 0.99, meaning that the linear kernel Support Vector Machine model using unigram news features is very effective for detecting fake news under the text classification approach.


INTRODUCTION
Fake news is any deceptive, envious, and unverified content in both textual and visual form and is used for political, personal, and state interests [1].
According to a survey conducted by [2], internet users in Indonesia in the 2021-2022 period (Q1) reached 210.03 million out of a total Indonesian population of 272.68 million people. This shows that most Indonesians can use the internet to access the latest information and news, whether about education, the economy, politics, or other topics.
With the integration of information and communication technology into our lives, accessing information has become easier. However, this raises several concerns, one of which is fake news.
Because of the sensationalism of its claims, fake news tends to spread quickly on digital and social media platforms. Brands are frequently implicated in the dissemination of fake news, whether as unintentional targets, as direct or indirect sources, or as facilitators [3].
Researchers have developed a number of strategies to address this problem, including human-machine hybrid methods, text classification, and dissemination network analysis [4]. The literature shows that text classification is the most popular approach, and many researchers propose solutions built on machine learning and deep learning models.
Classification is a method for identifying patterns according to categories using a machine learning approach [5]. In practice, classification can accurately predict discrete, unordered category labels [6]. Naïve Bayes, Random Forest, Decision Tree, and Support Vector Machine are examples of classification algorithms.
Several previous studies have proposed various solutions for detecting fake news using text classification. Research by [7] compared machine learning algorithms for classifying fake news based on headlines and on news text. The results showed that the linear kernel Support Vector Machine (SVM) gave the best predictions when detecting fake news from headlines, while bootstrap aggregation gave the best predictions when detecting fake news from the news text.
Research by [8] employed several variants of the Support Vector Machine and Naïve Bayes algorithms to identify fraudulent Indonesian news retrieved via a crawler. The results showed that the SVM with a sigmoid kernel gave the best predictions, with a precision of 95.6%, recall of 100%, F1-score of 97.7%, and accuracy of 96.5%.
Research by [9] integrated n-grams to enhance the performance of deep learning models. The findings demonstrated that the study's deep learning model achieved an accuracy of 99.88%.
Research by [10] used the Naïve Bayes algorithm and Particle Swarm Optimization (PSO) to improve accuracy. The experiments yielded an accuracy of 85.19% for the Naïve Bayes method and 74.67% for the PSO-based Naïve Bayes method, a difference of 10.52% between the two results.
Research by [11] combined the Levenshtein Distance (LD) algorithm with TF-IDF to determine a word's weight in a hoax document and the distance between words in the document. The LD-based hoax detection system consists of multiple stages: text preprocessing, weight calculation, and a final stage in which the Levenshtein Distance method computes the minimal edit distance between words.
Research by [12] proposed adding a feature selection step using information gain before classification with the Naïve Bayes method. The results showed that adding information gain significantly increased the accuracy from 43.14% to 72.71%, indicating that the method is well suited to detecting whether a news item is real or fake.
Research by [13] used the feature expansion method known as Global Vectors for Word Representation (GloVe). On Twitter, GloVe feature expansion is employed to reduce vocabulary mismatches in tweets. Multiple classifiers were employed, including Support Vector Machine (SVM), Naïve Bayes, and a Recurrent Neural Network (RNN). The findings showed that, when employing the GloVe Tweet + News corpus with the Top 10 setting, the hoax detection system with feature expansion achieved an accuracy of 91.92% with the SVM classifier.
Research by [14] used the C4.5 algorithm to categorize news and decide whether it is fake. The C4.5 method achieved an exceptional accuracy of 99.6%.
Research by [15] built and evaluated a Twitter hoax classification system that uses Word2Vec feature expansion in conjunction with Random Forest, SVM, and Logistic Regression, compared against a system without this method. According to the findings, the Random Forest classifier with Word2Vec feature expansion raised the system's accuracy by 1.46%, yielding an accuracy of 89.53%.
Research by [16] employed a modified K-Nearest Neighbor method with TF-IDF to determine whether presidential election news was a hoax. Implementation and testing yielded a precision of 93.75%, recall of 90.90%, accuracy of 92.31%, and F-measure of 92.31%.
In this research, the linear kernel Support Vector Machine (SVM) algorithm and n-grams are used to design an Indonesian fake news detection system based on news headlines. The research focuses on finding the model that gives the best results; the best model is then deployed on the web.

METHOD
To make the flow of the research easier to follow, a research framework was created. The research framework is shown in Image 1.
The first step in the research framework is data collection. The dataset used in this research consists of news title data collected with the Beautiful Soup library from the TurnBackHoax.id website as the fake news source and tempo.co as the real news source. Both websites are shown in Image 2 and Image 3. After collection, the two datasets are merged, and unnecessary attributes such as news date, news content, and news writer are deleted. The merged headline dataset is shown in Table 1.

After data collection, the news data is preprocessed. The data features are then extracted before being used in the model training process. Subsequently, performance evaluation is performed before the model is deployed.

Case folding is the first stage of preprocessing. In this stage, all letters in the text are converted to lowercase, and non-letter characters, or noise, are removed; examples of noise are punctuation marks, numbers, and special characters. Tokenization is the second stage, in which the text (which may be a sentence) is divided into tokens. Stop words removal is the third stage. Stop words usually carry little or no meaning, so they are removed; examples are the words "which", "in", "to", and "from". Removing them prevents these words from affecting the classification results. The preprocessing stage culminates in stemming, which removes affixes from a word to reduce it to its root. This study extracts news title features using n-grams and term frequency-inverse document frequency (TF-IDF).
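The preprocessing stages described above (case folding, tokenization, and stop words removal) can be sketched in Python as follows. This is a minimal, dependency-free illustration, not the study's actual code: the stop word list below is a tiny sample, and the final stemming stage (which the paper performs with a dedicated Indonesian stemmer) is omitted.

```python
import re

# Illustrative sample of Indonesian stop words; the actual study would use a
# full stop word list, not this small hand-picked set.
STOP_WORDS = {"yang", "di", "ke", "dari", "dan", "untuk", "pada", "ini", "itu"}

def case_folding(text: str) -> str:
    """Lowercase the text and replace non-letter noise (punctuation, digits) with spaces."""
    return re.sub(r"[^a-z\s]", " ", text.lower())

def tokenize(text: str) -> list[str]:
    """Split the cleaned text into word tokens."""
    return text.split()

def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop tokens that carry little or no meaning."""
    return [t for t in tokens if t not in STOP_WORDS]

def preprocess(title: str) -> list[str]:
    # Stemming (e.g. with the Sastrawi library mentioned later) would follow
    # as the final stage; it is omitted to keep this sketch self-contained.
    return remove_stop_words(tokenize(case_folding(title)))

print(preprocess("Vaksin COVID-19 di Indonesia Mengandung Chip!"))
# → ['vaksin', 'covid', 'indonesia', 'mengandung', 'chip']
```

Each stage is a separate function so that the pipeline mirrors the four-stage description in the text.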
The study's dataset is split into two categories, training data and testing data, with a 60:40 training-to-testing ratio. The algorithm employed in this research is the Support Vector Machine (SVM) with a linear kernel. This research uses a confusion matrix together with accuracy, precision, recall, and F1-score to illustrate the performance of the linear kernel Support Vector Machine model.
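The training setup described above (TF-IDF features feeding a linear kernel SVM after a 60:40 split) can be sketched with scikit-learn as follows. The eight toy titles and their labels are invented placeholders for illustration only, not the study's dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical toy headlines: 0 = real news, 1 = fake news.
titles = [
    "pemerintah umumkan jadwal libur nasional",
    "bank indonesia catat inflasi bulan ini",
    "menteri resmikan jalan tol baru",
    "presiden hadiri konferensi internasional",
    "vaksin ini mengandung chip pelacak rahasia",
    "minum air panas menyembuhkan semua penyakit",
    "bumi akan gelap total selama tiga hari",
    "uang akan hangus jika tidak ditukar besok",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# 60:40 train/test split, as in the study.
X_train, X_test, y_train, y_test = train_test_split(
    titles, labels, test_size=0.4, random_state=42, stratify=labels)

# TF-IDF unigram features; ngram_range=(2, 2) would give the bigram variant.
vectorizer = TfidfVectorizer(ngram_range=(1, 1))
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Linear kernel Support Vector Machine.
model = SVC(kernel="linear")
model.fit(X_train_tfidf, y_train)
predictions = model.predict(X_test_tfidf)
```

Swapping `ngram_range` is the only change needed to compare the unigram and bigram models, which is how the two configurations in this study differ.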
After the performance evaluation is completed, the model is deployed. The deployment uses the following Python libraries: Pickle to store the trained model and the feature extraction tools, Flask as the web development framework, and machine learning libraries such as scikit-learn, NLTK, and Sastrawi for preprocessing, feature extraction, and news detection.
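The persistence step of the deployment (storing the model and feature extractor with Pickle so the Flask app can load them once at startup) amounts to a save/load round trip. A minimal stdlib-only sketch, with a plain dict standing in for the fitted model and vectorizer, and a hypothetical file name:

```python
import os
import pickle
import tempfile

# Placeholder objects; in the real system these would be the fitted
# linear-kernel SVM and the TF-IDF vectorizer.
model_artifacts = {"model": "linear-SVM placeholder", "vectorizer": "TF-IDF placeholder"}

# Hypothetical file name for the serialized artifacts.
path = os.path.join(tempfile.gettempdir(), "fake_news_model.pkl")

# Serialize the artifacts to disk...
with open(path, "wb") as f:
    pickle.dump(model_artifacts, f)

# ...and restore them, as the web application would do at startup.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model_artifacts)  # → True
```

Serializing the vectorizer together with the model matters: an incoming news title must be transformed with the *same* fitted TF-IDF vocabulary the model was trained on.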

News Titles Data
The dataset used in this study consists of 2184 news titles: 1080 fake news titles and 1104 real news titles. The collected data is shown in Table 1.

Preprocessing
After collecting the news data, the preprocessing stage is carried out. It consists of four stages: case folding, tokenization, stop words removal, and stemming. Table 2 shows the initial news title data alongside the news title data after preprocessing.

Feature Extraction
After the preprocessing stage is complete, news feature extraction is performed with TF-IDF unigram and bigram models. Both models are used in order to see the impact of n-gram features on the performance of the linear kernel Support Vector Machine model; the performance results of the two models are compared with each other. An example of applying n-grams to the research data is shown in Table 3.
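The n-gram features themselves are simply overlapping word windows over the preprocessed tokens. A small sketch, using an invented example title:

```python
def ngrams(tokens: list[str], n: int) -> list[str]:
    """Return the word n-grams of a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical preprocessed title.
tokens = ["berita", "ini", "adalah", "hoax"]

print(ngrams(tokens, 1))  # unigrams → ['berita', 'ini', 'adalah', 'hoax']
print(ngrams(tokens, 2))  # bigrams  → ['berita ini', 'ini adalah', 'adalah hoax']
```

With unigrams the feature space is individual words; with bigrams each feature is a word pair, which captures some word order at the cost of a sparser, larger vocabulary.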

Performance Evaluation
At this stage, the Support Vector Machine model was evaluated with a confusion matrix and four performance metrics. The first evaluation uses the confusion matrix. Table 5 displays the confusion matrix for the linear kernel Support Vector Machine model with TF-IDF unigram features: 439 real news titles are correctly predicted as real, known as True Positives (TP); 3 real news titles are wrongly predicted as fake, known as False Negatives (FN); and 432 fake news titles are correctly predicted as fake, known as True Negatives (TN). News data in the True Negative class is news that the Support Vector Machine model classifies as fake. Table 6 shows the confusion matrix for the bigram model: 431 real news titles are correctly predicted as real (TP), 11 real news titles are wrongly predicted as fake (FN), 119 fake news titles are wrongly predicted as real, known as False Positives (FP), and 313 fake news titles are correctly predicted as fake (TN).
After the confusion matrix evaluation, the same news data is used to calculate the accuracy, precision, recall, and F1-score of the two models. The results of the performance evaluation are shown in Table 7: the linear kernel Support Vector Machine model using TF-IDF unigram features performs better than the one using TF-IDF bigram features, with differences in accuracy and F1-score of 0.15 and 0.12, respectively. The use of unigram news features as training data also affects the precision and recall values of the linear kernel Support Vector Machine model.
The precision value obtained from the model performance test is 1.00, meaning that the system's answers match the requested information about fake news detection very well. The recall value obtained is 0.99, meaning that the linear kernel Support Vector Machine model using unigram news features is very effective for detecting fake news under the text classification approach. Therefore, this model is used in the model deployment stage.
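The reported precision and recall can be reproduced from the unigram model's confusion matrix counts (TP = 439, FN = 3, TN = 432; an FP of 0 is assumed here, consistent with the reported precision of 1.00 and the three counts given):

```python
# Confusion matrix counts for the TF-IDF unigram model (Table 5).
# FP = 0 is an assumption consistent with the reported precision of 1.00.
TP, FN, FP, TN = 439, 3, 0, 432

precision = TP / (TP + FP)               # of titles predicted real, how many are real
recall = TP / (TP + FN)                  # of real titles, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (TP + TN) / (TP + TN + FP + FN)

print(round(precision, 2), round(recall, 2))  # → 1.0 0.99
```

Rounding to two decimals recovers the values stated in the text: precision 1.00 and recall 0.99.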

Deployment Model
At this stage, a web-based Indonesian fake news detection system was developed.
The web-based Indonesian fake news detection system works by analyzing news titles entered as queries with the previously stored model. If the system detects the entered title as a real news title, the news detection page displays the sentence "News Title Detection Results According to the System Are Real"; conversely, the sentence changes to "Fake" if the system detects the headline as fake. Image 4 and Image 5 show, respectively, the home page and the news detection page with the detection results according to the system.

Table 1. News Dataset

Table 2. News Title After Preprocessing

Table 3. Example of Application of N-grams

Table 4. Training Time of Two Models

Table 5. Confusion Matrix of First Model

Table 6. Confusion Matrix of Second Model

Table 7. Performance Test Results of Two Support Vector Machine Models