Boost Model Accuracy with Generative Adversarial Networks (GAN)


This article was published as a part of the Data Science Blogathon.

Introduction

This article covers the use of Generative Adversarial Networks (GAN) as an oversampling technique on real-world, skewed COVID-19 data to predict the risk of mortality. It gives us a better understanding of how data preparation steps such as handling imbalanced data can improve model performance.

The data and the core model for this article are taken from the recent study (July 2020) "COVID-19 Patient Health Prediction Using Boosted Random Forest Algorithm" by Celestine Iwendi, Ali Kashif Bashir, Atharva Peshkar, et al. That study used a Random Forest algorithm boosted by the AdaBoost model and predicted the mortality of individual patients with 94% accuracy. In this article, the same model and model parameters are used, so that the improvement over the existing model accuracy obtained with the GAN-based oversampling technique can be clearly analyzed.

One of the best ways for an aspiring Data Scientist to learn good practices is to participate in hackathons on different forums such as Analytics Vidhya, Kaggle, and others. In addition, take solved cases and data from these forums or from published research, understand their methodology, and try to improve the accuracy or reduce the error with additional steps. This builds a strong foundation and makes us think deeply about applying further techniques across the data science value chain.

The model in the study was trained on 222 patient records with 13 features. The data is biased, as 159 (72%) cases belong to class '0', i.e. 'recovered'. Because of its skewed nature, various undersampling/oversampling techniques can be applied to the data. Skewed data can lead to overfitting of the prediction model. To overcome this limitation, many studies have used oversampling techniques to balance the dataset, leading to more accurate model training. Oversampling compensates for the imbalance of a dataset by increasing the number of samples in the minority class. Conventional techniques such as Random Oversampling (ROS), Synthetic Minority Oversampling Technique (SMOTE), and others can be applied. For more information on dealing with imbalanced classes using conventional techniques, refer to:
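
As a quick, hedged illustration of those conventional approaches, the sketch below uses the imbalanced-learn package (an assumption; it is not part of the original study's code) on an already numerically encoded feature matrix X and the binary target y (death):

from collections import Counter
from imblearn.over_sampling import RandomOverSampler, SMOTE  # assumes imbalanced-learn is installed

# Random Oversampling: duplicate minority-class rows until the classes balance
X_ros, y_ros = RandomOverSampler(random_state=42).fit_resample(X, y)

# SMOTE: interpolate new synthetic minority-class rows between nearest neighbours
X_sm, y_sm = SMOTE(random_state=42, k_neighbors=5).fit_resample(X, y)

print(Counter(y))      # e.g. {0: 159, 1: 63} for this dataset
print(Counter(y_ros))  # balanced, e.g. {0: 159, 1: 159}
print(Counter(y_sm))   # balanced, e.g. {0: 159, 1: 159}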

Recently, a machine learning model for building a generative network based on an adversarial learning concept, namely the Generative Adversarial Network (GAN), has been proposed. The characteristics of GANs make them readily applicable to oversampling studies, since a neural network trained adversarially can produce artificial data that is similar to the original data. Oversampling based on GANs overcomes the limitations of conventional techniques, such as overfitting, and allows the development of a highly accurate prediction model for imbalanced data.

 

How does a GAN generate synthetic data?

Two neural networks compete against each other to learn the target distribution and generate synthetic data:

A generator network G: mimics the training samples to fool the discriminator

A discriminator network D: discriminates between training samples and generated samples

(Figure: Generative Adversarial Networks (GAN))

Generative adversarial networks are based on a game-theoretic scenario in which the generator network must compete against an adversary. Because a GAN learns to imitate the distribution of the data, it is used in various fields such as music, video, and natural language, and more recently for imbalanced data problems.
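
Formally, this adversarial game is usually written as the minimax objective from the original GAN formulation, where the discriminator D tries to maximize and the generator G tries to minimize the value function:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]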

The data and the base model used in the study can be found here.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from keras.layers import Input, Dense, Reshape, Flatten, Dropout, BatchNormalization, Embedding
from keras.layers.advanced_activations import LeakyReLU
from keras.layers.merge import concatenate
from keras.models import Sequential, Model
from keras.optimizers import Adam
from keras.utils import to_categorical
from keras.utils.vis_utils import plot_model
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, LabelEncoder
import scipy.stats
import datetime as dt
import pydot
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

df = pd.read_csv('Covid_Train_Oct32020.csv')
df = df.drop('id', axis=1)
df = df.fillna(np.nan, axis=0)
df['age'] = df['age'].fillna(value=df['age'].mean())

df['sym_on'] = pd.to_datetime(df['sym_on'])
df['hosp_vis'] = pd.to_datetime(df['hosp_vis'])
df['sym_on'] = df['sym_on'].map(dt.datetime.toordinal)
df['hosp_vis'] = df['hosp_vis'].map(dt.datetime.toordinal)
df['diff_sym_hos'] = df['hosp_vis'] - df['sym_on']
df = df.drop(['sym_on', 'hosp_vis'], axis=1)
df['location'] = df['location'].astype(str)
df['country'] = df['country'].astype(str)
df['gender'] = df['gender'].astype(str)
df['vis_wuhan'] = df['vis_wuhan'].astype(str)
df['from_wuhan'] = df['from_wuhan'].astype(str)
df['symptom1'] = df['symptom1'].astype(str)
df['symptom2'] = df['symptom2'].astype(str)
df['symptom3'] = df['symptom3'].astype(str)
df['symptom4'] = df['symptom4'].astype(str)
df['symptom5'] = df['symptom5'].astype(str)
df['symptom6'] = df['symptom6'].astype(str)
df.dtypes

 

Data Description

| Column | Description | Values (for categorical variables) | Type |
|---|---|---|---|
| id | Patient ID | NA | Numeric |
| location | The location the patient belongs to | Various cities around the world | String, Categorical |
| country | Patient's native country | Various countries | String, Categorical |
| gender | Patient's gender | Male, Female | String, Categorical |
| age | Patient's age | NA | Numeric |
| sym_on | Date the patient started noticing symptoms | NA | Date |
| hosp_vis | Date the patient visited the hospital | NA | Date |
| vis_wuhan | Whether the patient visited Wuhan, China | Yes (1), No (0) | Numeric, Categorical |
| from_wuhan | Whether the patient belonged to Wuhan, China | Yes (1), No (0) | Numeric, Categorical |
| death | Whether the patient passed away due to COVID-19 | Yes (1), No (0) | Numeric, Categorical |
| recov | Whether the patient recovered | Yes (1), No (0) | Numeric, Categorical |
| symptom1, symptom2, symptom3, symptom4, symptom5, symptom6 | Symptoms noticed by the patient | Various symptoms noticed by the patients | String, Categorical |

 

The study considered 11 categorical and 2 numeric input features for the analysis. The target variable is death/recovered. A new column, "diff_sym_hos", was created to capture the number of days between symptoms being noticed and admission to the hospital. Since the focus of the study is on augmenting the minority class (death == 1), a subset is drawn from the training data. The subset is separated into categorical and numeric features and passed to the GAN model.
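
Before drawing the subset, a quick check of the class counts (a small sketch, assuming the same df prepared above) confirms the imbalance described earlier:

print(df['death'].value_counts())                         # expected roughly 159 recovered vs. 63 deaths
print(df['death'].value_counts(normalize=True).round(2))  # roughly 0.72 vs. 0.28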

df_minority_data = df.loc[df['death'] == 1]

# Subsetting input features without the target variable
df_minority_data_withouttv = df_minority_data.loc[:, df_minority_data.columns != 'death']
numerical_df = df_minority_data_withouttv.select_dtypes("number")
categorical_df = df_minority_data_withouttv.select_dtypes("object")
scaling = MinMaxScaler()
numerical_df_rescaled = scaling.fit_transform(numerical_df)
get_dummy_df = pd.get_dummies(categorical_df)

# Separating each category
location_dummy_col = [col for col in get_dummy_df.columns if 'location' in col]
location_dummy = get_dummy_df[location_dummy_col]
country_dummy_col = [col for col in get_dummy_df.columns if 'country' in col]
country_dummy = get_dummy_df[country_dummy_col]
gender_dummy_col = [col for col in get_dummy_df.columns if 'gender' in col]
gender_dummy = get_dummy_df[gender_dummy_col]
vis_wuhan_dummy_col = [col for col in get_dummy_df.columns if 'vis_wuhan' in col]
vis_wuhan_dummy = get_dummy_df[vis_wuhan_dummy_col]
from_wuhan_dummy_col = [col for col in get_dummy_df.columns if 'from_wuhan' in col]
from_wuhan_dummy = get_dummy_df[from_wuhan_dummy_col]
symptom1_dummy_col = [col for col in get_dummy_df.columns if 'symptom1' in col]
symptom1_dummy = get_dummy_df[symptom1_dummy_col]
symptom2_dummy_col = [col for col in get_dummy_df.columns if 'symptom2' in col]
symptom2_dummy = get_dummy_df[symptom2_dummy_col]
symptom3_dummy_col = [col for col in get_dummy_df.columns if 'symptom3' in col]
symptom3_dummy = get_dummy_df[symptom3_dummy_col]
symptom4_dummy_col = [col for col in get_dummy_df.columns if 'symptom4' in col]
symptom4_dummy = get_dummy_df[symptom4_dummy_col]
symptom5_dummy_col = [col for col in get_dummy_df.columns if 'symptom5' in col]
symptom5_dummy = get_dummy_df[symptom5_dummy_col]
symptom6_dummy_col = [col for col in get_dummy_df.columns if 'symptom6' in col]
symptom6_dummy = get_dummy_df[symptom6_dummy_col]


 

Defining Generator

The generator takes input from the latent space and generates new synthetic samples. The leaky rectified linear activation unit (LeakyReLU) is a good practice to use in both the generator and the discriminator model for handling some negative values. It is used with the default recommended value of 0.2, along with the appropriate weight initializer "he_uniform". Additionally, batch normalization is used across the layers to standardize the activations (zero mean and unit variance) from the prior layer and stabilize the training process.

In the output layer, the softmax activation function is used for categorical variables and sigmoid is used for continuous variables.

def define_generator(catsh1, catsh2, catsh3, catsh4, catsh5, catsh6, catsh7, catsh8, catsh9, catsh10, catsh11, numerical):
    # Input noise from the latent space (dimension matches latent_dim used in training)
    noise = Input(shape=(100,))
    hidden_1 = Dense(8, kernel_initializer="he_uniform")(noise)
    hidden_1 = LeakyReLU(0.2)(hidden_1)
    hidden_1 = BatchNormalization(momentum=0.8)(hidden_1)
    hidden_2 = Dense(16, kernel_initializer="he_uniform")(hidden_1)
    hidden_2 = LeakyReLU(0.2)(hidden_2)
    hidden_2 = BatchNormalization(momentum=0.8)(hidden_2)

    # Branch 1 for generating location data
    branch_1 = Dense(32, kernel_initializer="he_uniform")(hidden_2)
    branch_1 = LeakyReLU(0.2)(branch_1)
    branch_1 = BatchNormalization(momentum=0.8)(branch_1)
    branch_1 = Dense(64, kernel_initializer="he_uniform")(branch_1)
    branch_1 = LeakyReLU(0.2)(branch_1)
    branch_1 = BatchNormalization(momentum=0.8)(branch_1)

    # Output layer 1
    branch_1_output = Dense(catsh1, activation="softmax")(branch_1)

    # Likewise, branches for the remaining 10 categorical features are defined
    # Branch 12 for generating numerical data
    branch_12 = Dense(64, kernel_initializer="he_uniform")(hidden_2)
    branch_12 = LeakyReLU(0.2)(branch_12)
    branch_12 = BatchNormalization(momentum=0.8)(branch_12)
    branch_12 = Dense(128, kernel_initializer="he_uniform")(branch_12)
    branch_12 = LeakyReLU(0.2)(branch_12)
    branch_12 = BatchNormalization(momentum=0.8)(branch_12)

    # Output layer 12
    branch_12_output = Dense(numerical, activation="sigmoid")(branch_12)

    # Combined output
    combined_output = concatenate([branch_1_output, branch_2_output, branch_3_output, branch_4_output, branch_5_output, branch_6_output, branch_7_output, branch_8_output, branch_9_output, branch_10_output, branch_11_output, branch_12_output])

    # Return model
    return Model(inputs=noise, outputs=combined_output)


generator = define_generator(location_dummy.shape[1], country_dummy.shape[1], gender_dummy.shape[1], vis_wuhan_dummy.shape[1], from_wuhan_dummy.shape[1], symptom1_dummy.shape[1], symptom2_dummy.shape[1], symptom3_dummy.shape[1], symptom4_dummy.shape[1], symptom5_dummy.shape[1], symptom6_dummy.shape[1], numerical_df_rescaled.shape[1])
generator.summary()
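
The listing above shows only branch 1 (location) and branch 12 (numeric); branches 2 to 11 follow exactly the same pattern. A hypothetical helper like the one below (not part of the original study's code) illustrates how each remaining categorical branch could be built:

def categorical_branch(hidden, n_classes):
    # Hypothetical helper: one softmax branch per categorical feature,
    # mirroring the Dense -> LeakyReLU -> BatchNormalization pattern above.
    b = Dense(32, kernel_initializer="he_uniform")(hidden)
    b = LeakyReLU(0.2)(b)
    b = BatchNormalization(momentum=0.8)(b)
    b = Dense(64, kernel_initializer="he_uniform")(b)
    b = LeakyReLU(0.2)(b)
    b = BatchNormalization(momentum=0.8)(b)
    return Dense(n_classes, activation="softmax")(b)

# e.g. branch_2_output = categorical_branch(hidden_2, catsh2)   # country branch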

 

Defining Discriminator

The discriminator model takes a sample from our data, such as a vector, and outputs a classification prediction as to whether the sample is real or fake. This is a binary classification problem, so sigmoid activation is used in the output layer and the binary cross-entropy loss function is used when compiling the model. The Adam optimization algorithm is used with a learning rate of 0.0002 and the recommended beta1 momentum value of 0.5.

def define_discriminator(inputs_n):
    # Input from the generator
    d_input = Input(shape=(inputs_n,))
    d = Dense(128, kernel_initializer="he_uniform")(d_input)
    d = LeakyReLU(0.2)(d)
    d = Dense(64, kernel_initializer="he_uniform")(d)
    d = LeakyReLU(0.2)(d)
    d = Dense(32, kernel_initializer="he_uniform")(d)
    d = LeakyReLU(0.2)(d)
    d = Dense(16, kernel_initializer="he_uniform")(d)
    d = LeakyReLU(0.2)(d)
    d = Dense(8, kernel_initializer="he_uniform")(d)
    d = LeakyReLU(0.2)(d)

    # Output layer
    d_output = Dense(1, activation="sigmoid")(d)

    # Compile and return the model
    model = Model(inputs=d_input, outputs=d_output)
    model.compile(loss="binary_crossentropy", optimizer=Adam(lr=0.0002, beta_1=0.5), metrics=["accuracy"])
    return model


inputs_n = location_dummy.shape[1] + country_dummy.shape[1] + gender_dummy.shape[1] + vis_wuhan_dummy.shape[1] + from_wuhan_dummy.shape[1] + symptom1_dummy.shape[1] + symptom2_dummy.shape[1] + symptom3_dummy.shape[1] + symptom4_dummy.shape[1] + symptom5_dummy.shape[1] + symptom6_dummy.shape[1] + numerical_df_rescaled.shape[1]
discriminator = define_discriminator(inputs_n)
discriminator.summary()

The generator and discriminator are now combined into a single GAN model and trained. 7,000 epochs are used, with the entire batch of minority class data considered for training.

def define_complete_gan(generator, discriminator):
    discriminator.trainable = False
    gan_output = discriminator(generator.output)

    # Initialize the GAN
    model = Model(inputs=generator.input, outputs=gan_output)

    # Model compilation
    model.compile(loss="binary_crossentropy", optimizer=Adam(lr=0.0002, beta_1=0.5))
    return model

completegan = define_complete_gan(generator, discriminator)

def gan_train(gan, generator, discriminator, catsh1, catsh2, catsh3, catsh4, catsh5, catsh6, catsh7, catsh8, catsh9, catsh10, catsh11, numerical, latent_dim, n_epochs, n_batch, n_eval):
    # Update the discriminator with half the batch size
    half_batch = int(n_batch / 2)
    discriminator_loss = []
    generator_loss = []
    # Generate class labels for fake and real samples
    valid = np.ones((half_batch, 1))
    y_gan = np.ones((n_batch, 1))
    fake = np.zeros((half_batch, 1))

    # Training
    for i in range(n_epochs):
        # Select a random batch from the real categorical and numerical data
        idx = np.random.randint(0, catsh1.shape[0], half_batch)
        location_real = catsh1[idx]
        country_real = catsh2[idx]
        gender_real = catsh3[idx]
        vis_wuhan_real = catsh4[idx]
        from_wuhan_real = catsh5[idx]
        symptom1_real = catsh6[idx]
        symptom2_real = catsh7[idx]
        symptom3_real = catsh8[idx]
        symptom4_real = catsh9[idx]
        symptom5_real = catsh10[idx]
        symptom6_real = catsh11[idx]
        numerical_real = numerical[idx]

        # Concatenate categorical and numerical data for the discriminator
        real_data = np.concatenate([location_real, country_real, gender_real, vis_wuhan_real, from_wuhan_real, symptom1_real, symptom2_real, symptom3_real, symptom4_real, symptom5_real, symptom6_real, numerical_real], axis=1)

        # Generate fake samples from the noise
        noise = np.random.normal(0, 1, (half_batch, latent_dim))
        fake_data = generator.predict(noise)

        # Train the discriminator and record losses and accuracy
        d_loss_real, da_real = discriminator.train_on_batch(real_data, valid)
        d_loss_fake, da_fake = discriminator.train_on_batch(fake_data, fake)
        d_loss = 0.5 * np.add(d_loss_real, d_loss_fake)
        discriminator_loss.append(d_loss)

        # Generate noise for the generator input and train the generator
        # (so the discriminator labels the generated samples as valid)
        noise = np.random.normal(0, 1, (n_batch, latent_dim))
        g_loss = gan.train_on_batch(noise, y_gan)
        generator_loss.append(g_loss)

        # Evaluate progress
        if (i + 1) % n_eval == 0:
            print("Epoch: %d [Discriminator loss: %f] [Generator loss: %f]" % (i + 1, d_loss, g_loss))

    plt.figure(figsize=(20, 10))
    plt.plot(generator_loss, label="Generator loss")
    plt.plot(discriminator_loss, label="Discriminator loss")
    plt.title("Stats from training GAN")
    plt.grid()
    plt.legend()

latent_dim = 100
gan_train(completegan, generator, discriminator, location_dummy.values, country_dummy.values, gender_dummy.values, vis_wuhan_dummy.values, from_wuhan_dummy.values, symptom1_dummy.values, symptom2_dummy.values, symptom3_dummy.values, symptom4_dummy.values, symptom5_dummy.values, symptom6_dummy.values, numerical_df_rescaled, latent_dim, n_epochs=7000, n_batch=63, n_eval=200)

The trained generator is then used to create 96 additional minority-class records, giving an equal split (159) for each class. The generated numeric data is compared with the original data on mean, standard deviation, and variance; categorical data is compared on the count of each category.

noise = np.random.normal(0, 1, (96, 100))
generated_mixed_data = generator.predict(noise)
columns = list(location_dummy.columns) + list(country_dummy.columns) + list(gender_dummy.columns) + list(vis_wuhan_dummy.columns) + list(from_wuhan_dummy.columns) + list(symptom1_dummy.columns) + list(symptom2_dummy.columns) + list(symptom3_dummy.columns) + list(symptom4_dummy.columns) + list(symptom5_dummy.columns) + list(symptom6_dummy.columns) + list(numerical_df.columns)
mixed_gen_df = pd.DataFrame(data=generated_mixed_data, columns=columns)
mixed_gen_df.iloc[:, :-3] = np.round(mixed_gen_df.iloc[:, :-3])
mixed_gen_df.iloc[:, -2:] = scaling.inverse_transform(mixed_gen_df.iloc[:, -2:])

# Original data
original_df = pd.concat([location_dummy, country_dummy, gender_dummy, vis_wuhan_dummy, from_wuhan_dummy, symptom1_dummy, symptom2_dummy, symptom3_dummy, symptom4_dummy, symptom5_dummy, symptom6_dummy, numerical_df], axis=1)

def normal_distribution(org, noise):
    org_x = np.linspace(org.min(), org.max(), len(org))
    noise_x = np.linspace(noise.min(), noise.max(), len(noise))
    org_y = scipy.stats.norm.pdf(org_x, org.mean(), org.std())
    noise_y = scipy.stats.norm.pdf(noise_x, noise.mean(), noise.std())
    n, bins, patches = plt.hist([org, noise], density=True, alpha=0.5, color=["green", "red"])
    xmin, xmax = plt.xlim()
    plt.plot(org_x, org_y, color="green", label="Original data", alpha=0.5)
    plt.plot(noise_x, noise_y, color="red", label="Generated data", alpha=0.5)
    title = f"Original data mean {np.round(org.mean(), 4)}, Original data std {np.round(org.std(), 4)}, Original data var {np.round(org.var(), 4)}\nGenerated data mean {np.round(noise.mean(), 4)}, Generated data std {np.round(noise.std(), 4)}, Generated data var {np.round(noise.var(), 2)}"
    plt.title(title)
    plt.legend()
    plt.grid()
    plt.show()

numeric_columns = numerical_df.columns

for column in numeric_columns:
    print(column, "comparison between original data and generated data")
    normal_distribution(original_df[column], mixed_gen_df[column])
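
The plots above compare only the numeric columns. A small, hedged sketch for reproducing the categorical counts shown in the table further below is to sum the one-hot columns (column names assumed to come from the get_dummies step earlier; the minority subset has 63 original rows and 96 generated rows):

for col in ['location_Hokkaido', 'gender_female', 'symptom2_ cough']:
    if col in original_df.columns and col in mixed_gen_df.columns:
        print(col,
              'original count of 1s:', int(original_df[col].sum()),
              'generated count of 1s:', int(mixed_gen_df[col].sum()))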

 

(Figure: Age comparison between original data and generated data)

(Figure: diff_sym_hos comparison between original data and generated data)

(Figure: Random categories comparison between original data and generated data)

| Feature | Original Data (count of 0) | Original Data (count of 1) | Generated Data (count of 0) | Generated Data (count of 1) |
|---|---|---|---|---|
| location_Hokkaido | 61 | 2 | 95 | 1 |
| gender_female | 49 | 14 | 60 | 36 |
| symptom2_ cough | 62 | 1 | 96 | 0 |

 

The data generated by the GAN oversampling technique is quite similar to the original data, with an error of roughly 1%. For a few rare categories, data is not generated across all category values.

The same data preparation steps as in the original study are now followed to see how model performance improves over the original methodology when GAN oversampling is used. The one-hot encoded generated sample is first converted back to the original dataframe format.

# Getting back categorical data in original format from dummies
location_filter_col = [col for col in mixed_gen_df if col.startswith('location')]
location = mixed_gen_df[location_filter_col]
location = pd.get_dummies(location).idxmax(1)
location = location.replace('location_', '', regex=True)
df_generated_data = pd.DataFrame()
df_generated_data['location'] = location

country_filter_col = [col for col in mixed_gen_df if col.startswith('country')]
country = mixed_gen_df[country_filter_col]
country = pd.get_dummies(country).idxmax(1)
country = country.replace('country_', '', regex=True)
df_generated_data['country'] = country

gender_filter_col = [col for col in mixed_gen_df if col.startswith('gender')]
gender = mixed_gen_df[gender_filter_col]
gender = pd.get_dummies(gender).idxmax(1)
gender = gender.replace('gender_', '', regex=True)
df_generated_data['gender'] = gender

vis_wuhan_filter_col = [col for col in mixed_gen_df if col.startswith('vis_wuhan')]
vis_wuhan = mixed_gen_df[vis_wuhan_filter_col]
vis_wuhan = pd.get_dummies(vis_wuhan).idxmax(1)
vis_wuhan = vis_wuhan.replace('vis_wuhan_', '', regex=True)
df_generated_data['vis_wuhan'] = vis_wuhan

from_wuhan_filter_col = [col for col in mixed_gen_df if col.startswith('from_wuhan')]
from_wuhan = mixed_gen_df[from_wuhan_filter_col]
from_wuhan = pd.get_dummies(from_wuhan).idxmax(1)
from_wuhan = from_wuhan.replace('from_wuhan_', '', regex=True)
df_generated_data['from_wuhan'] = from_wuhan

symptom1_filter_col = [col for col in mixed_gen_df if col.startswith('symptom1')]
symptom1 = mixed_gen_df[symptom1_filter_col]
symptom1 = pd.get_dummies(symptom1).idxmax(1)
symptom1 = symptom1.replace('symptom1_', '', regex=True)
df_generated_data['symptom1'] = symptom1

symptom2_filter_col = [col for col in mixed_gen_df if col.startswith('symptom2')]
symptom2 = mixed_gen_df[symptom2_filter_col]
symptom2 = pd.get_dummies(symptom2).idxmax(1)
symptom2 = symptom2.replace('symptom2_', '', regex=True)
df_generated_data['symptom2'] = symptom2

symptom3_filter_col = [col for col in mixed_gen_df if col.startswith('symptom3')]
symptom3 = mixed_gen_df[symptom3_filter_col]
symptom3 = pd.get_dummies(symptom3).idxmax(1)
symptom3 = symptom3.replace('symptom3_', '', regex=True)
df_generated_data['symptom3'] = symptom3

symptom4_filter_col = [col for col in mixed_gen_df if col.startswith('symptom4')]
symptom4 = mixed_gen_df[symptom4_filter_col]
symptom4 = pd.get_dummies(symptom4).idxmax(1)
symptom4 = symptom4.replace('symptom4_', '', regex=True)
df_generated_data['symptom4'] = symptom4

symptom5_filter_col = [col for col in mixed_gen_df if col.startswith('symptom5')]
symptom5 = mixed_gen_df[symptom5_filter_col]
symptom5 = pd.get_dummies(symptom5).idxmax(1)
symptom5 = symptom5.replace('symptom5_', '', regex=True)
df_generated_data['symptom5'] = symptom5

symptom6_filter_col = [col for col in mixed_gen_df if col.startswith('symptom6')]
symptom6 = mixed_gen_df[symptom6_filter_col]
symptom6 = pd.get_dummies(symptom6).idxmax(1)
symptom6 = symptom6.replace('symptom6_', '', regex=True)
df_generated_data['symptom6'] = symptom6

df_generated_data['death'] = 1

df_generated_data[['age', 'diff_sym_hos']] = mixed_gen_df[['age', 'diff_sym_hos']]
df_generated_data = df_generated_data.fillna(np.nan, axis=0)
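
As a design note, the repetitive per-prefix blocks above could be collapsed into a loop; a hypothetical, more compact equivalent (writing to a separate frame name so as not to disturb the pipeline) might look like this:

prefixes = ['location', 'country', 'gender', 'vis_wuhan', 'from_wuhan',
            'symptom1', 'symptom2', 'symptom3', 'symptom4', 'symptom5', 'symptom6']
df_generated_alt = pd.DataFrame()
for p in prefixes:
    cols = [c for c in mixed_gen_df.columns if c.startswith(p)]
    # Pick the dummy column with the largest value per row and strip the prefix
    df_generated_alt[p] = mixed_gen_df[cols].idxmax(axis=1).str.replace(p + '_', '', regex=False)
df_generated_alt['death'] = 1
df_generated_alt[['age', 'diff_sym_hos']] = mixed_gen_df[['age', 'diff_sym_hos']]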

# Encoding data
encoder_location = LabelEncoder()
encoder_country = LabelEncoder()
encoder_gender = LabelEncoder()
encoder_symptom1 = LabelEncoder()
encoder_symptom2 = LabelEncoder()
encoder_symptom3 = LabelEncoder()
encoder_symptom4 = LabelEncoder()
encoder_symptom5 = LabelEncoder()
encoder_symptom6 = LabelEncoder()

# Loading and preparing data
df = pd.read_csv('Covid_Train_Oct32020.csv')
df = df.drop('id', axis=1)
df = df.fillna(np.nan, axis=0)
df['age'] = df['age'].fillna(value=df['age'].mean())
df['sym_on'] = pd.to_datetime(df['sym_on'])
df['hosp_vis'] = pd.to_datetime(df['hosp_vis'])
df['sym_on'] = df['sym_on'].map(dt.datetime.toordinal)
df['hosp_vis'] = df['hosp_vis'].map(dt.datetime.toordinal)
df['diff_sym_hos'] = df['hosp_vis'] - df['sym_on']
df = df.drop(['sym_on', 'hosp_vis'], axis=1)

df['location'] = encoder_location.fit_transform(df['location'].astype(str))
df['country'] = encoder_country.fit_transform(df['country'].astype(str))
df['gender'] = encoder_gender.fit_transform(df['gender'].astype(str))
df[['symptom1']] = encoder_symptom1.fit_transform(df['symptom1'].astype(str))
df[['symptom2']] = encoder_symptom2.fit_transform(df['symptom2'].astype(str))
df[['symptom3']] = encoder_symptom3.fit_transform(df['symptom3'].astype(str))
df[['symptom4']] = encoder_symptom4.fit_transform(df['symptom4'].astype(str))
df[['symptom5']] = encoder_symptom5.fit_transform(df['symptom5'].astype(str))
df[['symptom6']] = encoder_symptom6.fit_transform(df['symptom6'].astype(str))

# Encoding generated data
df_generated_data['location'] = encoder_location.transform(df_generated_data['location'].astype(str))
df_generated_data['country'] = encoder_country.transform(df_generated_data['country'].astype(str))
df_generated_data['gender'] = encoder_gender.transform(df_generated_data['gender'].astype(str))
df_generated_data[['symptom1']] = encoder_symptom1.transform(df_generated_data['symptom1'].astype(str))
df_generated_data[['symptom2']] = encoder_symptom2.transform(df_generated_data['symptom2'].astype(str))
df_generated_data[['symptom3']] = encoder_symptom3.transform(df_generated_data['symptom3'].astype(str))
df_generated_data[['symptom4']] = encoder_symptom4.transform(df_generated_data['symptom4'].astype(str))
df_generated_data[['symptom5']] = encoder_symptom5.transform(df_generated_data['symptom5'].astype(str))
df_generated_data[['symptom6']] = encoder_symptom6.transform(df_generated_data['symptom6'].astype(str))
df_generated_data[['diff_sym_hos']] = df_generated_data['diff_sym_hos'].astype(int)

 

Model Comparison

After splitting the original data into train and test sets, the data generated by the GAN is added to the training data to compare performance against the base model. Model performance is tested on the actual (original) test split.

from sklearn.metrics import recall_score as rs
from sklearn.metrics import precision_score as ps
from sklearn.metrics import f1_score as fs
from sklearn.metrics import balanced_accuracy_score as bas
from sklearn.metrics import confusion_matrix as cm
from sklearn.metrics import log_loss
from sklearn import preprocessing
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from scipy import stats
import sklearn
import numpy as np
import pandas as pd
import datetime as dt

rf = RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                            criterion='gini', max_depth=2, max_features='auto',
                            max_leaf_nodes=None, max_samples=None,
                            min_impurity_decrease=0.0, min_impurity_split=None,
                            min_samples_leaf=2, min_samples_split=2,
                            min_weight_fraction_leaf=0.0, n_estimators=100,
                            n_jobs=None, oob_score=False, random_state=None,
                            verbose=0, warm_start=False)
classifier = AdaBoostClassifier(rf, 50, 0.01, 'SAMME.R', 10)

# Separate target variable in generated data
X1 = df_generated_data.loc[:, df_generated_data.columns != 'death']
Y1 = df_generated_data['death']

# Separate target variable in original data
X = df.loc[:, df.columns != 'death']
Y = df['death']

# Splitting original data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

# Appending generated data to X_train
X_train1 = X_train.append(X1, sort=False)
Y_train1 = Y_train.append(Y1)
classifier.fit(X_train1, np.array(Y_train1).reshape(Y_train1.shape[0], 1))
pred = np.array(classifier.predict(X_test))

recall = rs(Y_test, pred)
precision = ps(Y_test, pred)
f1 = fs(Y_test, pred)
ma = classifier.score(X_test, Y_test)
print('*** Evaluation metrics for test dataset ***\n')
print('Recall Score: ', recall)
print('Precision Score: ', precision)
print('F1 Score: ', f1)
print('Accuracy: ', ma)

| Metric | Score of Base Model* | Score with Augmented Generated Data |
|---|---|---|
| Recall Score | 0.75 | 0.83 |
| Precision Score | 1 | 1 |
| F1 Score | 0.86 | 0.9 |
| Accuracy | 0.9 | 0.95 |

*Source: Table 3, Base Model Metrics
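
For reference, a comparable baseline can be obtained by fitting the same classifier on the original training split only, i.e. without the GAN-generated rows appended (a minimal sketch under the same train/test split as above; note the base-model column in the table is sourced from the original paper's Table 3, so exact numbers may differ):

base_classifier = AdaBoostClassifier(rf, 50, 0.01, 'SAMME.R', 10)
base_classifier.fit(X_train, Y_train)
base_pred = base_classifier.predict(X_test)
print('Recall Score: ', rs(Y_test, base_pred))
print('Precision Score: ', ps(Y_test, base_pred))
print('F1 Score: ', fs(Y_test, base_pred))
print('Accuracy: ', base_classifier.score(X_test, Y_test))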

 

Conclusion

The proposed model gives a more accurate and robust result compared to the base model, showing that GAN-based oversampling overcomes the limitations of imbalanced data and appropriately inflates the minority class.

About the Author

Bala Gangadhara Thilak Adiboina

I am currently working as a data scientist with a leading US telecom company. I am a hardcore data science person who loves to solve every problem using data science. I am currently pursuing my Ph.D. from IIM Ranchi in the data science space.

 




