COMPARISON OF SGD, ADADELTA, ADAM OPTIMIZATION IN GENDER CLASSIFICATION USING CNN

Abstract: Gender classification is one of the most important tasks in video analysis. A machine learning-based approach was presented to identify male and female facial images using a data set of 2,000 images taken from Kaggle. This method plays a role in finding the weight values that give the best output. This study uses the most appropriate learning rate of each optimization method as a criterion for stopping training. The results showed that the Artificial Neural Network with Adam optimization produced the highest accuracy, 91.5%, compared with the SGD and ADADELTA optimization methods. Deep learning techniques that are applied extensively to image recognition therefore benefit from using the Adam optimizer.


INTRODUCTION
Live streaming applications have attracted many users by making it possible to share videos of daily life with others [1]. This constant sharing generates a large number of real-world videos, and most of these videos prominently capture human faces, so facial analysis in videos is becoming increasingly important in real-life applications for video content inspection and recommendation [2]. Classifying facial images in terms of biometric traits, such as gender, age, or ethnicity, has received a lot of attention in the computer vision literature recently, especially in the context of video surveillance. Gender classification is one of the most important tasks in video analysis [3]. It would be advantageous if a computer system or machine could correctly classify a person's gender [4]. For example, a mall's surveillance camera system can be useful for determining a customer's gender to shape the right sales strategy, or a seller robot can use an appropriate and intelligent approach to communicate with customers based on their gender [5].
Much of the work on gender classification is based on facial features and has produced several valuable models, of which neural networks, support vector machines, and boosting methods are the most representative [6]. Moghaddam and Yang compared different gender classification methods on the FERET face database and showed that support vector machines have better recognition performance than other classifiers (such as the nearest-neighbor classifier and linear classifiers) [7]. Transfer learning with deep Convolutional Neural Networks (CNNs trained on the ImageNet dataset) works well in identifying male and female facial images, and this is also inseparable from the optimization function [8]. In computer vision, the CNN has been known as a powerful visual model that produces an accurate hierarchy of segmentation features. The model is also known to make predictions relatively faster than other algorithms while maintaining competitive performance [9].
The Gradient Descent optimization method is often used for Artificial Neural Network (ANN) training. This method plays a role in finding the weight values that give the best output [10]. The working principle of Gradient Descent is to reduce the loss value by changing the parameter values step by step. Three optimization methods have been implemented, namely Stochastic Gradient Descent (SGD), ADADELTA, and Adam, in an Artificial Neural Network system for the classification of arrhythmia data [11].
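The working principle described above can be illustrated with a minimal sketch (not the study's code): a toy one-parameter loss is minimized by repeatedly stepping the weight against its gradient.

```python
# Minimal gradient-descent sketch: minimize L(w) = (w - 3)^2
# by stepping w against the gradient, step by step.

def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    # dL/dw = 2 * (w - 3)
    return 2.0 * (w - 3.0)

def gradient_descent(w0, learning_rate=0.1, steps=100):
    w = w0
    for _ in range(steps):
        w -= learning_rate * grad(w)  # move against the gradient
    return w

w_final = gradient_descent(w0=0.0)
# w converges toward 3, the minimizer of the loss
```

SGD, Adadelta, and Adam all follow this same principle; they differ only in how the step size and direction are adapted from the raw gradient.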

METHOD
At this stage of the study, a classification system for male and female gender in the wild was designed to determine accuracy using digital image processing methods. Figure 1 shows the block diagram of the system designed in this study [12].

Image 1. System Diagram Blocks in General
In general, the systematics of the system block diagram shown in Figure 1 are: 1. Collection of facial imagery. 2. The facial image is preprocessed with stages of resizing and data augmentation. 3. Training is an image learning process with the CNN method to obtain a training image model that will be stored in the database. 4. Testing is performed by classifying the test image data against the model from the training images using the CNN method to obtain the results and accuracy of the system.
The data used in this study are secondary data. The data is sourced from Kaggle, because Kaggle data has been tested as a dataset [13]. Figure 3 shows a flow chart of the data augmentation preprocessing process, starting from inputting images, through resizing and augmentation, to the processing results.

Image 3. Augmentation Data Preprocessing Flow Chart
The explanation of the image preprocessing flowchart is as follows: 1. Image input is the stage of retrieving the image data to be selected before processing. 2. Resize is the process of resizing the image both vertically and horizontally; in this process, the image is resized from its original size to 64 x 64 pixels. 3. Augmentation is the process of modifying the image data; in this system, the augmentation stages carried out are random rotation and random horizontal flip. The preprocessing results are the output images that have gone through the resize process and the augmentation stage.
At the training stage, the learning process is carried out on the images, and the output is a model that will be stored for use in the testing process [16]. Model formation is the process of training on the training image data to identify objects and categorize them according to their class, as shown in Figure 4.

Image 4. Training System Stage Flow Diagram
In this study, the method used is one of the branches of deep learning algorithms, namely CNN, referring to the very popular LeNet-5 architecture, which has been tested here using 2 convolution layers. In general, the training stage referring to LeNet-5 is shown in Figure 4. The image input to the CNN model is an image of size 64 x 64 x 3. The input image is then processed through a convolution process and a pooling process. The first convolution uses 10 kernels with a 3x3 matrix and padding = valid; ReLU activation is used in this convolution process as the non-linearity. Max pooling is then carried out with a size of 2x2. The second convolution stage uses 20 kernels with a 5x5 matrix, still with the ReLU activation function and padding = valid. This is followed by flatten, which changes the output of the convolution process from a matrix into a vector that is then passed to the classification process using an MLP (Multi-Layer Perceptron) with a predetermined number of neurons in the hidden layer. At this stage, the SGD, Adam, and Adadelta optimization algorithms are applied to the nodes for weight and bias optimization with the default learning rate, using the softmax activation function with the desired number of classes; in this study, 2 classes means having 2 output neurons. The class of the image is then determined from the values of these output neurons via the softmax activation function.
Figure 5 shows a flow chart of the testing stage of the system. The testing stage is a gender classification process that tests the test image data and compares it with the model resulting from training on the training image data stored in the database. The image data taken was 2000 original images, which became 4000 images after augmentation.
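The layer sizes described above determine the flattened feature count: a 64x64 input becomes 62x62 after the valid 3x3 convolution, 31x31 after 2x2 max pooling, and 27x27 after the valid 5x5 convolution, giving 20 x 27 x 27 = 14580 features. A PyTorch sketch under those assumptions (the hidden-layer width of 128 is an assumption; the study does not publish its code):

```python
# Sketch of the LeNet-5-style network described in the text.
import torch
import torch.nn as nn

class GenderCNN(nn.Module):
    def __init__(self, hidden=128, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 10, kernel_size=3),   # valid padding: 64 -> 62
            nn.ReLU(),
            nn.MaxPool2d(2),                   # 2x2 max pooling: 62 -> 31
            nn.Conv2d(10, 20, kernel_size=5),  # valid padding: 31 -> 27
            nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                      # 20 * 27 * 27 = 14580 features
            nn.Linear(20 * 27 * 27, hidden),   # MLP hidden layer (width assumed)
            nn.ReLU(),
            nn.Linear(hidden, num_classes),    # 2 output neurons for 2 classes
            nn.LogSoftmax(dim=1),              # log-probabilities, pairs with NLLLoss
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = GenderCNN()
out = model(torch.randn(1, 3, 64, 64))  # log-probabilities for the 2 classes
```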
The image that has been taken will be processed by the CNN algorithm until it then produces a system output in the form of gender class information.

RESULTS AND DISCUSSION
In this section we describe the proposed solution as a selected Convolutional Network (ConvNet) architecture and discuss related design options, evaluation methods, and implementation aspects. The data set provided by Kaggle has 5000 training images and 1500 test images. All images vary in size at 96 dpi. However, this study modified the number and size of the images to avoid an overly long run time, using 800 training images and 400 test images at a size of 64 x 64 pixels. A sample of male images can be seen in Figure 2. The competition is a binary classification problem with the area under the ROC curve between the predicted probability and the observed target as the evaluation metric; because we want the output to be as close as possible to the actual probability, we use data augmentation, i.e., transforms applied before the image is processed. In addition, the researchers used the NLL loss paired with a LogSoftmax last-layer activation.
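The "NLL loss with LogSoftmax" pairing mentioned above has a useful property worth noting: in PyTorch, applying LogSoftmax and then NLLLoss gives numerically the same criterion as CrossEntropyLoss applied to raw logits. A small illustrative sketch (the logits and targets are made up for demonstration):

```python
# LogSoftmax + NLLLoss versus CrossEntropyLoss on the same logits.
import torch
import torch.nn as nn

logits = torch.tensor([[2.0, -1.0], [0.5, 1.5]])  # raw scores, 2 samples x 2 classes
targets = torch.tensor([0, 1])                    # true class indices

log_probs = nn.LogSoftmax(dim=1)(logits)
nll = nn.NLLLoss()(log_probs, targets)            # the paper's choice

ce = nn.CrossEntropyLoss()(logits, targets)       # the same criterion in one step
# nll and ce are equal
```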
To evaluate the model, the researchers had to find the right optimizer for the dataset from among the Adadelta, SGD, Nadam, and Adam methods. In one experiment, the researchers used the Nadam optimization function and observed a fairly good accuracy of 89.5% with a loss of 0.21, shown in Figure 4; the researchers concluded that the Nadam optimization method was not the best fit for this dataset and decided to use the Adam optimization method as a substitute. In all the experiments that the researchers conducted, the only changes made were to the model layers and the image size at the input.
Because the dataset is relatively small, the researchers used transfer learning with a set of models, using an experiment-focused CNN model with multiple optimization functions [17]. There are some general steps: the first is to modify the size of the image before it is used as input, which is necessary to avoid an overly long process. In this case, the researcher uses the following sequence: a. a GlobalAveragePooling2D layer after the convolutions; b. a dropout layer; c. and finally a fully connected layer with an output size of 1 (for binary classification) and a sigmoid activation function.
To make the most of the limited training examples and improve model accuracy, the researchers augmented the data through a number of random transformations; the chosen data augmentation techniques were random rotation (15°), resize crop (scale = 80%), and random horizontal flip. Data augmentation is also expected to avoid overfitting and to improve model generalization [18]. The researchers experimented with several optimizers from the Python libraries, namely Adam, Adadelta, Nadam, and SGD, on CNN models and report the results. Across these experiments, several different accuracy results were obtained.
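The four optimizers compared in this study could be instantiated in PyTorch as follows. This is a hedged sketch: the small linear model is a stand-in for the study's CNN, the default learning rates are used except for SGD (the text reports a best SGD learning rate of 0.01), and the training loop itself is omitted.

```python
# Instantiating the four compared optimizers on a stand-in model.
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for the study's CNN

optimizers = {
    "Adam": torch.optim.Adam(model.parameters()),          # default lr = 1e-3
    "Adadelta": torch.optim.Adadelta(model.parameters()),  # default lr = 1.0
    "Nadam": torch.optim.NAdam(model.parameters()),        # default lr = 2e-3
    "SGD": torch.optim.SGD(model.parameters(), lr=0.01),   # lr reported in the text
}
```

In an actual experiment, each optimizer would be swapped into the same training loop so that only the update rule differs between runs.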

a. Adam
Adam is a replacement optimization algorithm for stochastic gradient descent for training deep learning models. Adam combines the best properties of the AdaGrad and RMSProp algorithms to provide an optimization algorithm that can handle sparse gradients on noisy problems. With this optimizer, the researchers obtained an accuracy of 91.5% with a loss as low as 0.1 at a best learning rate of 1.

b. Adadelta
Adadelta is a stochastic gradient descent method based on an adaptive per-dimension learning rate, designed to overcome the continually decaying learning rate of AdaGrad. With this second model we obtained an accuracy of 90%, as seen in Figure 11.

Image 11. Adadelta Plot Cost and Score Results

c. Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent (SGD) is a solution to the issues of plain gradient descent (GD). SGD updates the weights without waiting for a full epoch to finish. SGD uses a batching concept, dividing the training data into batches [19]; the weights are updated after each completed batch. With this optimizer, we obtained an accuracy of 63% with a still-high loss of 0.6 at the best learning rate of 0.01. This is no better than the two previous optimizers, as shown in Figure 12.
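The per-batch update idea can be sketched without any framework. This toy example (not the study's code) fits a single weight to a constant target: the data is shuffled and split into batches, and the weight is updated after every batch rather than once per epoch.

```python
# Mini-batch SGD sketch: update the weight after each batch.
import random

def sgd_batches(data, batch_size):
    random.shuffle(data)
    return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

def sgd_epoch(w, data, lr=0.01, batch_size=4):
    # toy objective: loss = (w - x)^2 for each sample x
    for batch in sgd_batches(data, batch_size):
        g = sum(2.0 * (w - x) for x in batch) / len(batch)  # batch gradient
        w -= lr * g            # weight updated per batch, not per epoch
    return w

w = 0.0
data = [5.0] * 20              # toy training data with minimizer at 5.0
for _ in range(50):
    w = sgd_epoch(w, list(data))
# w moves toward 5.0 over the epochs
```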
Image 12. SGD Plot Cost and Score Results

d. Data Visualization
To see how the model works and what exactly is learned, we chose to visualize the intermediate activations, consisting of the feature maps produced by the various convolution and pooling layers in the network for a given input (the output of a layer is often called its activation, i.e., the output of the activation function). Figure 13 shows the original images with their predictions; the model learned how to identify male and female genders using the Adam optimizer, especially from the face [20].
Image 13. Image Prediction
In Figure 13 there are 36 predicted data points, of which 32 are correct predictions (green text) and 4 are wrong predictions (red text). From these data, the precision can be computed manually from the confusion matrix with the formula: Precision = TP / (TP + FP) = 32 / (32 + 4) = 32/36 ≈ 0.89, equivalent to about 89%. The 36 predictions can also be visualized as shown in Figure 14.
Image 14. Confusion Matrix Visualization
Table 2 shows the results of the experiments carried out using the Adam, Adadelta, and SGD models in classifying gender.
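A minimal sketch of the precision formula used above. Note that with the 32 correct and 4 incorrect predictions reported, taking the correct predictions as true positives and the errors as false positives gives TP = 32 and FP = 4:

```python
# Precision from confusion-matrix counts: TP / (TP + FP).

def precision(tp, fp):
    return tp / (tp + fp)

p = precision(tp=32, fp=4)  # 32 / 36, approximately 0.89
```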

CONCLUSION
Based on the application of the methods described, the Adam optimization method identifies gender most accurately, with an accuracy of 91.5%. Of 36 predicted data points, 32 were correct predictions and 4 were false predictions. The application of this method can provide a solution to help classify gender quickly and more precisely than the SGD and Adadelta methods [21].