PERFORMANCE ANALYSIS OF RESNET50 AND INCEPTIONV3 MODELS FOR IMAGE CAPTION GENERATOR

Abstract: Automatically generating image captions is one of the challenges in computer vision. It is useful in many applications, for example search engines. Many image classification models are now available that can serve as the basis of an image captioning model. In this article, we compare the performance of the Resnet50 and InceptionV3 models for image captioning. We train each model on 2000 images (1800 for training and 200 for validation), each with 5 example captions. After training, we evaluate each model on 100 additional images, each with 5 example captions, that were not used in the training and validation process. The results show that the InceptionV3 model performs better than Resnet50. The InceptionV3 model achieves a BLEU-1 of 0.53, BLEU-2 of 0.35, BLEU-3 of 0.18, BLEU-4 of 0.09, and METEOR of 0.35, while the Resnet50 model achieves a BLEU-1 of 0.51, BLEU-2 of 0.31, BLEU-3 of 0.16, BLEU-4 of 0.06, and METEOR of 0.33.


INTRODUCTION
Technology is developing rapidly. Artificial intelligence (AI), for example, is profoundly changing our lives, and it is critical to understand these advances in order to predict future development strategies [1] [2] [3]. One application is the image caption generator, which can help us understand images. Although it is a very difficult task, being able to automatically describe the content of an image using well-constructed English phrases could have significant effects [4] [5].
The main objective of image captioning, a subfield of computer vision, is to produce an accurate and natural text description of the scene depicted in an image [6] [7]. Although this is difficult to do, it can have a substantial impact, for example by helping search engines find relevant images. In this study, we compare Inceptionv3 and Resnet50. The research is based on transfer learning, which has proven useful in other settings: for example, when training transformer models on language pairs with high resource availability, transfer learning can replace the need for a warm-up phase [8].
Resnet50 performs well in image classification. For the detection of brain tumors, Resnet50 has reported accuracies of 92% and 90% [9]. Combined with an LSTM, Resnet50 also performs well for image captioning: experimental results show that the Resnet50 model can produce quality image captions automatically [10]. Inceptionv3 also performs well in classification. In lung image categorization, it obtained a sensitivity of 95.41% and a specificity of 80.09%, outperforming other methods [11]. Combined with an LSTM, Inceptionv3 can also generate captions with a good BLEU score [12].
We perform feature extraction using these two models and then feed the extracted features into an LSTM. We choose LSTM because of its popularity and its capacity to remember long-term dependencies in the generated word sequence [13]. To measure the quality of the LSTM output as human language, we use BLEU (Bilingual Evaluation Understudy). BLEU can significantly increase the correctness of a final translation [14] [15]. To make the comparison between the models more convincing, we also use the METEOR metric.
The final result of this article is a determination of which of the two models (Resnet50 or Inceptionv3) is better at captioning images.

METHOD
Research methodology is an important element of the research process. Image 1 shows the flow of the methodology for this study.

Image Preprocessing
We used the Flickr8K dataset from Kaggle to see which model is better at handling a small amount of training data. We divided the data into 1800 training images and 200 validation images. Image preprocessing is the first step: each image is scaled to the required input size.
Because we compare the performance of two different models, we must perform two conversions with different sizes. The Resnet50 model uses an input size of (224, 224, 3), and Inceptionv3 uses an input size of (299, 299, 3).

Caption Preprocessing
Text preprocessing is used to select the data and make it more structured [16]. The Flickr8K dataset has 5 captions for each image. We use the captions to build the language vocab model, but they must be preprocessed first. This stage includes case folding, removing special characters, removing numbers, and adding special tokens at the beginning and end of each caption.
Case folding is the process of changing the letters in a text to lowercase without changing the meaning or structure of the text. It is used to streamline text processing and boost efficiency. For an example, see Image 2.

Image 2. Case folding
The next process is to remove special characters and numbers. The purpose of this process is to improve the quality of the captions.
For the process of adding special tokens, we add startseq at the beginning of the sentence and stopseq at the end of the sentence.

Image 3. Special Token
The final step of caption preprocessing is to replace double spaces with single spaces. Double spaces can be introduced by the previous steps, for example when a number between two words is removed.
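The caption preprocessing steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the study: the function name clean_caption, the choice to keep only the letters a-z and spaces, and the sample caption are all assumptions made for the example.

```python
import re


def clean_caption(caption: str) -> str:
    """Apply the caption preprocessing steps in the order described in the text."""
    # 1. Case folding: convert every letter to lowercase.
    text = caption.lower()
    # 2. Remove special characters and numbers
    #    (assumption: keep only the letters a-z and spaces).
    text = re.sub(r"[^a-z ]", "", text)
    # 3. Add the special tokens at the beginning and end of the caption.
    text = "startseq " + text + " stopseq"
    # 4. Replace runs of spaces left by the earlier steps with a single space.
    text = re.sub(r" +", " ", text).strip()
    return text


# Made-up caption, for illustration only.
print(clean_caption("A dog runs 2 times, then STOPS!"))
# -> startseq a dog runs times then stops stopseq
```

Removing the number "2" leaves two adjacent spaces, which is why the final space-collapsing step is needed.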

Image Feature Extract
There are two steps required to generate an automatic caption for an image [17]. The first step is to extract information from the image and save it as a vector.
We perform two extraction processes, one with the Resnet50 model and one with the InceptionV3 model. Because the goal of this study is not image classification, we remove the last layer of each model for the feature extraction process. That layer is not needed because it is responsible for classifying images into one of 1000 possible classes.
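One possible way to realise this step with the Keras pretrained models is sketched below. The study does not name its framework, so the use of TensorFlow/Keras, the ImageNet weights, and the helper names build_extractor and extract_features are assumptions. In the Keras versions of both networks, the last layer is the 1000-class softmax and the layer before it is a global pooling layer, so taking the second-to-last layer as output removes the classifier.

```python
# Input sizes stated in the text (height, width); the channel count is 3.
INPUT_SIZE = {"resnet50": (224, 224), "inceptionv3": (299, 299)}


def build_extractor(name: str):
    """Return (model, preprocess_fn) with the final classification layer removed."""
    if name not in INPUT_SIZE:
        raise ValueError(f"unknown model: {name}")
    # Imported lazily so this module can be loaded without TensorFlow installed.
    from tensorflow.keras import Model
    from tensorflow.keras.applications import InceptionV3, ResNet50
    from tensorflow.keras.applications import inception_v3, resnet50

    if name == "resnet50":
        base, preprocess = ResNet50(weights="imagenet"), resnet50.preprocess_input
    else:
        base, preprocess = InceptionV3(weights="imagenet"), inception_v3.preprocess_input
    # layers[-1] is the 1000-way classifier; layers[-2] is the pooled
    # feature vector (2048 values in the Keras versions of both networks).
    return Model(inputs=base.input, outputs=base.layers[-2].output), preprocess


def extract_features(image_path: str, name: str, model, preprocess):
    """Load one image at the model's input size and return its feature vector."""
    import numpy as np
    from tensorflow.keras.utils import img_to_array, load_img

    image = load_img(image_path, target_size=INPUT_SIZE[name])
    batch = preprocess(np.expand_dims(img_to_array(image), axis=0))
    return model.predict(batch, verbose=0)[0]
```

A typical use would build each extractor once with build_extractor and then call extract_features for every image, storing the resulting vectors for the later training steps.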

Language Vocab Model
This process includes counting the number of unique words in the captions and determining the maximum caption length.
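The two quantities can be computed as in the following sketch. The function name build_vocab, the convention of reserving index 0 for padding, and the two sample captions are assumptions made for illustration; the study does not describe how its vocabulary is indexed.

```python
def build_vocab(captions):
    """Return (word_to_index, vocab_size, max_length) for preprocessed captions."""
    # Collect every unique word across all captions.
    words = sorted({word for caption in captions for word in caption.split()})
    # Assumption: index 0 is reserved for padding, so real words start at 1.
    word_to_index = {word: i for i, word in enumerate(words, start=1)}
    vocab_size = len(word_to_index) + 1
    # The longest caption, measured in words, bounds the sequence length.
    max_length = max(len(caption.split()) for caption in captions)
    return word_to_index, vocab_size, max_length


# Made-up captions, for illustration only.
captions = [
    "startseq a dog runs stopseq",
    "startseq a child plays in the park stopseq",
]
word_to_index, vocab_size, max_length = build_vocab(captions)
print(vocab_size, max_length)  # -> 11 8
```

Here the 10 unique words plus the padding index give a vocabulary size of 11, and the longest caption has 8 words including the two special tokens.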

Model Combine Layer
This layer combines the output of the image feature extraction layer with the language vocab model. The output of this process is fed into the LSTM layer.
During training, we use a batch size of 128 and the Adam optimizer. We use a large batch size because a model trained with a large batch size can minimize loss more effectively than one trained with a small batch size [18]. We train the models for up to 20 epochs, but stop training early if the validation loss no longer decreases.
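The stopping rule can be illustrated with a small framework-independent sketch. The function name epochs_to_run and the patience value are hypothetical: the text only states that training stops when the validation loss is no longer reduced, so a patience of one epoch is an assumption, and the loss values are invented.

```python
def epochs_to_run(val_losses, max_epochs=20, patience=1):
    """Return the number of epochs run before training stops.

    val_losses holds the validation loss after each epoch. Training stops
    once the loss has failed to improve for `patience` epochs in a row
    (patience is an assumed parameter), or after max_epochs.
    """
    best = float("inf")
    stale = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return min(len(val_losses), max_epochs)


# Invented validation losses: the loss rises at epoch 4, so training stops there.
print(epochs_to_run([3.9, 3.5, 3.3, 3.4, 3.2]))  # -> 4
```

In Keras, comparable behaviour is commonly obtained with the EarlyStopping callback monitoring val_loss; whether the study used that callback is not stated.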

LSTM
A recurrent neural network (RNN) is a deep learning method used to process sequential data such as sentences [16], [19]. One well-known RNN model is the LSTM (Long Short-Term Memory), which is very useful for generating sentences from image features.
The vector obtained from the image feature extraction process is fed into the LSTM. The LSTM then produces a sentence, which is passed to the next LSTM layer, and finally we obtain the sentence generated for the image.
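At inference time, a caption is usually built one word at a time, starting from startseq and ending at stopseq. The sketch below shows greedy decoding, which is a common choice but an assumption here, since the text does not say how words are selected. The predict_next callable is hypothetical: in a real system it would wrap the trained model together with the image feature vector.

```python
def generate_caption(predict_next, max_length):
    """Greedily build a caption word by word.

    predict_next is a hypothetical callable that receives the list of words
    generated so far and returns the most likely next word.
    """
    words = ["startseq"]
    for _ in range(max_length - 1):
        word = predict_next(words)
        if word == "stopseq":
            break
        words.append(word)
    return " ".join(words[1:])  # drop the start token


# Stub standing in for a trained model, for illustration only.
canned = ["a", "dog", "runs", "stopseq"]
print(generate_caption(lambda words: canned[len(words) - 1], max_length=10))
# -> a dog runs
```

The max_length argument corresponds to the maximum caption length found when building the vocabulary, and it guarantees that generation terminates even if stopseq is never predicted.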

Evaluate Model
The standard metric used for testing is BLEU [14]. Because BLEU bases its score on n-gram precision, it harshly penalizes lexical deviations even when candidates are synonymous: no credit is given if an n-gram does not exactly match the reference.
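The effect of exact n-gram matching can be seen in a small from-scratch sketch of BLEU. It assumes the common formulation with clipped n-gram precision, uniform weights over n = 1..N (cumulative BLEU-N), and the brevity penalty; whether the scores reported in this study are cumulative is not stated. The function names and the example sentences are illustrative.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against one or more references."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    # Each candidate n-gram is credited at most as often as it occurs in a reference.
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n):
    """Cumulative BLEU-N with uniform weights and the brevity penalty."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    # Reference length closest to the candidate length (ties go to the shorter one).
    ref_len = min((len(r) for r in references),
                  key=lambda n_ref: (abs(n_ref - len(candidate)), n_ref))
    bp = 1.0 if len(candidate) > ref_len else math.exp(1 - ref_len / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)


reference = "a dog runs on the grass".split()
candidate = "a puppy runs on the grass".split()  # "puppy" is a near-synonym of "dog"
print(round(bleu(candidate, [reference], 1), 4))  # -> 0.8333
print(round(bleu(candidate, [reference], 2), 4))  # -> 0.7071
```

Although "puppy" and "dog" are close in meaning, the substitution costs one unigram and two bigrams, so the score drops as the n-gram order increases.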
After obtaining the BLEU value, we also measure the model output using METEOR. Besides the semantic accuracy of the output, METEOR also takes recall into account [20].
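The role of recall can be illustrated with the recall-weighted harmonic mean from the original METEOR formulation, Fmean = 10PR / (R + 9P), where P and R are unigram precision and recall. The sketch below is deliberately simplified and is not the full metric: it uses exact word matches only and omits stemming, synonym matching, and the fragmentation penalty. The function name meteor_fmean and the example sentences are illustrative.

```python
from collections import Counter


def meteor_fmean(candidate: str, reference: str) -> float:
    """Recall-weighted harmonic mean of unigram precision and recall.

    Simplified: exact matches only, no stemming, synonyms, or
    fragmentation penalty.
    """
    cand, ref = candidate.split(), reference.split()
    # Exact unigram matches, clipped so a word is not matched more often
    # than it occurs in the other sentence.
    matches = sum((Counter(cand) & Counter(ref)).values())
    if matches == 0:
        return 0.0
    precision = matches / len(cand)
    recall = matches / len(ref)
    return 10 * precision * recall / (recall + 9 * precision)


# Every candidate word is correct (precision 1.0), but half of the
# reference is missing (recall 0.5).
print(round(meteor_fmean("a dog runs", "a dog runs in the park"), 4))  # -> 0.5263
```

A short caption that is correct but incomplete has perfect unigram precision, yet its recall-weighted score is pulled down close to its recall.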
We compare the accuracy of the captions generated by both models (Resnet50 and InceptionV3) against the human reference captions using BLEU and METEOR. This shows which model is better.

RESULT AND DISCUSSION
After the training process, we examine the training history (accuracy and validation) of Resnet50 and InceptionV3. We tested the trained models using 100 images from the Flickr8K dataset that were not used for training or validation. The BLEU and METEOR metrics are used to see which model produces the better caption output.