1. Data Download

This cell downloads the necessary data files from Kaggle. The dataset folder contains train and test folders, each of which has one subfolder of images of real faces that are not smiling and another subfolder of smiling faces.
Once I downloaded the dataset, I reorganized the dataset folder so that it has only two folders, for not-smiling and smiling images, without the original train/test separation. The reason for this was to keep the data raw and make it resemble the ordinary datasets available in Scikit-learn; a sketch of this reorganization follows below.
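As a rough illustration, the reorganization could be scripted as follows. The destination folder matches the "datasets" directory mentioned later; the source folder and the exact class-folder names are assumptions based on the description above.

```python
import shutil
from pathlib import Path

# Hypothetical path of the extracted Kaggle archive; adjust as needed.
src_root = Path(r"C:\dev\aicode\smiledetection\download")
# Destination folder referenced later in this notebook.
dst_root = Path(r"C:\dev\aicode\smiledetection\datasets")

# Merge the original train/test split into a single folder per class.
for split in ["train", "test"]:
    for category in ["not_smiling", "smiling"]:
        dst = dst_root / category
        dst.mkdir(parents=True, exist_ok=True)
        for img_path in (src_root / split / category).glob("*"):
            shutil.copy(img_path, dst / img_path.name)
```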
2. Data Formatting

The code above imports the necessary libraries and objects, including tensorflow, numpy, and LabelEncoder. In this code, I create one variable for the directory, which lets the program access the files saved on my operating system, and another variable called categories to distinguish between the two folders named "not_smiling" and "smiling" inside the "datasets" folder. X and y are the lists that store every image in a fixed format and the corresponding labels, respectively.
The nested for loop walks through the file paths so that the images can be retrieved properly for both the "not_smiling" and "smiling" classes. In the outer loop, the program builds a path by concatenating the folder name ("category") to the directory. For example, if the variable category is not_smiling, the path is "C:\dev\aicode\smiledetection\datasets\not_smiling". The inner loop then iterates over the images in that folder: it builds each image path with the same kind of concatenation, loads the image resized to 47 by 67 pixels, and converts it to an array of pixel values with shape (47, 67, 3), where 3 corresponds to the RGB channels, each ranging from 0 to 255. The array is appended to the X list. To update the y list, the program reads the label from the image path, working from right to left and splitting on the backslash that separates the folder name; an additional conditional statement makes sure the correct label is appended for each image.
Lastly, after repeating this operation 4000 times in total, the program ends up with a data array of shape (4000, 47, 67, 3) and an array of 4000 labels. This satisfies the four-dimensional input shape (samples, height, width, channels) that a CNN requires.
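A minimal sketch of this loading loop, assuming Keras' load_img and img_to_array are the loading helpers (the actual helpers used may differ), could look like this:

```python
import os
import numpy as np
from tensorflow.keras.preprocessing.image import load_img, img_to_array

directory = r"C:\dev\aicode\smiledetection\datasets"
categories = ["not_smiling", "smiling"]

X, y = [], []
for category in categories:                      # outer loop: one pass per class folder
    path = os.path.join(directory, category)     # e.g. ...\datasets\not_smiling
    for file_name in os.listdir(path):           # inner loop: every image in the folder
        img_path = os.path.join(path, file_name)
        img = load_img(img_path, target_size=(47, 67))  # resize while loading
        X.append(img_to_array(img))               # (47, 67, 3) array of 0-255 pixel values
        label = img_path.split(os.sep)[-2]        # folder name read from the path
        if label in categories:                   # extra check that the label is valid
            y.append(label)

X = np.array(X)   # shape (4000, 47, 67, 3)
y = np.array(y)   # 4000 labels
```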
3. Data for Training and Testing

Using train_test_split, the program splits the raw data into a training dataset and a testing dataset, along with the corresponding labels for each. Since test_size is 0.20, the training dataset holds 4000 * 0.80 = 3200 images and the testing dataset holds 800 images. Each dataset is then normalized by dividing the pixel values by 255, producing values in the range 0 to 1 for better performance in the neural network.
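The split and normalization described above can be sketched as follows; the random_state value is an assumption added here for reproducibility.

```python
from sklearn.model_selection import train_test_split

# 80/20 split of the images and their labels.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# Scale pixel values from 0-255 to 0-1 before feeding them to the network.
X_train = X_train / 255.0
X_test = X_test / 255.0
```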
4. Data Visualization

Let's check that the images were loaded properly and that everything written above worked by printing out some sample images with matplotlib.pyplot. In this code, I plot 8 images from the training dataset together with their labels.
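A simple version of that plotting cell might look like the sketch below; the 2-by-4 grid layout is an assumption, since the original only states that 8 images are shown.

```python
import matplotlib.pyplot as plt

# Show 8 training images with their labels in a 2x4 grid.
fig, axes = plt.subplots(2, 4, figsize=(10, 6))
for i, ax in enumerate(axes.flat):
    ax.imshow(X_train[i])           # pixel values already scaled to 0-1, which imshow accepts
    ax.set_title(str(y_train[i]))   # folder-name label, e.g. "not_smiling" or "smiling"
    ax.axis("off")
plt.show()
```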
5. One-hot Encoding

Using LabelEncoder, the not_smiling labels are converted to the numerical value 0 and the smiling labels to 1.
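A sketch of this encoding step is shown below. The LabelEncoder part follows the description above; the final to_categorical step is an assumption suggested by the section title, in case full one-hot vectors are needed rather than plain 0/1 integers.

```python
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

# Map the string labels to integers: "not_smiling" -> 0, "smiling" -> 1 (alphabetical order).
encoder = LabelEncoder()
y_train_enc = encoder.fit_transform(y_train)
y_test_enc = encoder.transform(y_test)

# Assumed follow-up: turn the integer labels into one-hot vectors,
# e.g. 0 -> [1, 0] and 1 -> [0, 1].
y_train_onehot = to_categorical(y_train_enc)
y_test_onehot = to_categorical(y_test_enc)
```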