Developing AI/ML applications in insurance

Key Takeaways
  • TensorFlow is a widely used open source framework for developing AI/ML applications
  • AI/ML techniques are applied in the following insurance processes
    • Premium Pricing
    • Risk Adjustment for health plans
    • Claim Processing
    • Fraud Detection

In this article, we discuss several AI/ML frameworks and compare their features. Based on that comparison, TensorFlow comes out as the most popular and most complete framework. We then show how to develop AI/ML applications for typical insurance use cases: premium pricing, risk adjustment for health plans, claim processing, and fraud detection.


AI/ML Frameworks 

AI and ML frameworks are gaining popularity as enterprises build more applications and solutions on them. Among them, TensorFlow is a leading open source package with features for deep learning, a dashboard (TensorBoard), analytics, and mobile deployment. It provides the tooling to train, test, and validate machine learning models on data sets. Other features include APIs in multiple programming languages, good documentation, GPU support, a computational graph abstraction, and strong support from the open source community. Pre-trained models are available through Keras Applications and TensorFlow Hub.

Caffe is another framework. It offers MATLAB support, CNN modeling, plain-text schemas, and an active open source community.

Amazon ML supports self-learning components. Developers can run data analysis, model training, and evaluation. It offers data encryption, integration support, and APIs, but it does not support data visualization.

Torch is a framework with built-in algorithms for deep learning networks. It started with Lua, a language many programmers did not know; PyTorch is the Python version. It has GPU support, iOS/Android integration, optimization routines, and array index/slice/transpose routines. PyTorch's documentation is less extensive, and beginners can find it tough to learn.

Apache Mahout is used to build data analytics and data engineering applications. It supports a Scala DSL and multiple backends, and provides clustering, collaborative filtering, and classification. Spark MLlib is faster than Mahout and integrates with Hadoop. It supports Java, R, Scala, and Python.


AI/ML Applications – Use Cases  

Let us look at AI and ML applications in the insurance vertical: accident damage assessment in auto insurance, health insurance premium prediction, and fraud detection in health care claim processing.


Insurance Damage assessment

This section presents an AI/ML application for accident damage assessment in auto insurance. It applies computer vision and image processing to assess the damage. You can use multi-instance learning with a ResNet convolutional neural network model for image recognition.

An auto insurance provider processes accident claims and covers the damage cost through various plans. The process starts when the auto shop shares the repair estimate and the vehicle owner files the claim with the insurer.

You need a model trained on images of car parts and damaged parts across different vehicle models. The model predicts the cost of the damage and supports claim processing. A repair can replace the part completely or fix it partially. Repair estimation uses the metadata models, and cost prediction uses the visual models. The multi-instance learning method evaluates images of the same damage captured from different perspectives, which improves accuracy.
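As a minimal sketch of the multi-instance idea, a claim can be treated as a bag of images taken from different perspectives, with one damage score per image. The example pools those scores with top-k averaging, one common way to turn instance scores into a bag score. The scores are made-up numbers standing in for the output of a trained ResNet model, and `aggregate_claim_score` and its `top_k` parameter are hypothetical names used only for illustration.

```python
from statistics import mean


def aggregate_claim_score(image_scores, top_k=3):
    """Combine per-image damage scores (0..1) into one claim-level score.

    Averages the top_k highest scores so that a few clear views of the
    damage outweigh the many views in which the damage is not visible.
    """
    if not image_scores:
        raise ValueError("a claim needs at least one image")
    strongest = sorted(image_scores, reverse=True)[:top_k]
    return mean(strongest)


# Hypothetical per-image scores for two claims (not real model output).
damaged_claim = [0.05, 0.10, 0.92, 0.88, 0.07, 0.81]
clean_claim = [0.04, 0.08, 0.03, 0.11, 0.06, 0.02]

print(round(aggregate_claim_score(damaged_claim), 2))
print(round(aggregate_claim_score(clean_claim), 2))
```

Pooling only the strongest views keeps a claim from being scored as undamaged just because most photos show intact parts of the vehicle.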

Auto insurance models can capture the factors listed below:

  • Demographics
    • Age
    • Gender
    • Region Code Type
    • Income Level
    • Family Type
  • Vehicle Age
  • Vehicle Damage
  • Policy premium
  • Policy sourcing channel

You need an extensive data set to train the model and improve the accuracy of the estimate, with separate data sets for testing and validation. An insurance claim uses around 40-50 images for prediction. Typical challenges in this estimation process are real-time inference, high data volume, and high traffic.
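Because one claim contributes 40-50 images, it helps to split the data by claim rather than by image, so that images of the same accident never end up in both the training and the test set. The sketch below illustrates this under stated assumptions: the claim IDs, file names, split ratios, and the `split_claims` helper are all hypothetical.

```python
import random


def split_claims(claim_ids, train=0.7, validation=0.15, seed=42):
    """Shuffle claim IDs and split them into train/validation/test lists.

    Splitting by claim (not by image) keeps all images of one accident
    in the same subset, so the test set contains only unseen claims.
    """
    ids = sorted(set(claim_ids))
    random.Random(seed).shuffle(ids)
    n_train = round(len(ids) * train)
    n_val = round(len(ids) * validation)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


# Hypothetical data set: 100 claims with 40 to 50 images each.
rng = random.Random(0)
images = [(f"claim-{c:03d}", f"img-{i:02d}.jpg")
          for c in range(100) for i in range(rng.randint(40, 50))]

train_ids, val_ids, test_ids = split_claims(claim for claim, _ in images)
train_set = set(train_ids)
train_images = [img for img in images if img[0] in train_set]
print(len(train_ids), len(val_ids), len(test_ids), len(train_images))
```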


Insurance premium prediction

This section looks at premium prediction and risk adjustment for health care plans. A Tweedie generalized linear model (GLM) is typically used for insurance premium prediction. You can also use Lasso regression.
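A Tweedie distribution with a power parameter between 1 and 2 fits claim costs well because it allows many exact zeros (members with no claims) alongside skewed positive amounts. The sketch below computes the mean Tweedie deviance, the loss that a Tweedie GLM minimizes. It is an illustration only: the function name, the power of 1.5, and the cost figures are assumptions, not values from a real portfolio.

```python
def tweedie_deviance(y_true, y_pred, power=1.5):
    """Mean Tweedie deviance for 1 < power < 2 (compound Poisson-gamma).

    y_true may contain zeros (members with no claims); y_pred must be > 0.
    """
    if not 1 < power < 2:
        raise ValueError("this sketch only covers 1 < power < 2")
    total = 0.0
    for y, mu in zip(y_true, y_pred):
        if mu <= 0 or y < 0:
            raise ValueError("need y >= 0 and mu > 0")
        total += 2 * (
            y ** (2 - power) / ((1 - power) * (2 - power))
            - y * mu ** (1 - power) / (1 - power)
            + mu ** (2 - power) / (2 - power)
        )
    return total / len(y_true)


costs = [0.0, 0.0, 120.0, 0.0, 950.0]        # hypothetical annual claim costs
flat = [214.0] * 5                           # predicting the mean for everyone
closer = [20.0, 20.0, 150.0, 20.0, 900.0]    # predictions that track the costs

print(tweedie_deviance(costs, flat))
print(tweedie_deviance(costs, closer))
```

The predictions that track the individual costs give the lower deviance, which is the behavior a fitted GLM is rewarded for.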

Health insurers offer multiple policy plans and use a model to forecast the retention of their policyholders. An insurance policy details the compensation for:

  • Loss
  • Damage
  • Illness
  • Death

Compensation depends on the premium paid by the policyholder. Each policy plan has coverage details for various health conditions and treatments. Insurers charge the premium at a frequency agreed upon by the customer and the insurance provider. The modeling is based on two methods: cost prediction based on the last twelve months, and binary classification of whether a member’s policy cost exceeds a limit.
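The two modeling methods need two different targets. As a rough sketch, a regression target can be the total cost over a twelve-month window, and a classification target can flag whether that total exceeds a limit. The member IDs, the cost figures, the limit of 10,000, and the `build_targets` helper are hypothetical.

```python
def build_targets(monthly_costs, cost_limit):
    """Return (regression target, classification target) for one member.

    monthly_costs: the member's costs for twelve consecutive months.
    cost_limit: hypothetical annual threshold that defines a high-cost member.
    """
    if len(monthly_costs) != 12:
        raise ValueError("expected twelve monthly values")
    annual_cost = sum(monthly_costs)
    exceeds_limit = int(annual_cost > cost_limit)
    return annual_cost, exceeds_limit


members = {
    "member-1": [250.0] * 12,
    "member-2": [0.0] * 10 + [4000.0, 9000.0],
}
for member_id, costs in members.items():
    print(member_id, build_targets(costs, cost_limit=10000.0))
```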

Make sure that the data used for the policy model includes the information below:

  • Personal Risk Factors
    • Personal Data
    • Eligibility
    • Enrollment Coverage
    • Social Determinants of Health


  • Clinical History
    • Procedures
    • Diagnosis
    • Drugs
    • Clinical Events
    • Derived Indices
  • Cost History
    • Cost Indices
    • Cost Trends

Personal data of the customer needs to include the following factors:

  • Age
  • Gender
  • Family Size
  • Industry
  • Income Level


Social determinants of health should include the following features:

  • Social Vulnerability
  • Education
  • Poverty
  • Minority Status
  • English-Speaking ability
  • Housing
  • Transportation


Clinical events need to have the following event types:

  • Emergency Department
  • Ambulatory
  • Hospital Inpatient


You need to add the following derived indices related to clinical history:

  • Mortality rate
  • Actuarial life expectancy
  • Years of Life Lost


The cost history factors for claims are listed below:

  • Cost over the past one year and two years
  • Total cost in the current month and last twelve months
  • Cost of specialty drugs
  • Days of drug supply


The claim cost trends are shown below:


  • Changes in cost over 1 year
  • Changes in 6 and 3-month intervals
  • Predicted one year cost
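The cost history and trend factors above can be derived from a member's monthly cost series. The sketch below assumes 24 months of cost per member, ordered from oldest to newest. The `cost_trends` helper and the feature names are hypothetical, and the predicted one-year cost is left out because it is a model output rather than a derived feature.

```python
def cost_trends(monthly_costs):
    """Derive trend features from 24 months of cost, oldest month first."""
    if len(monthly_costs) != 24:
        raise ValueError("expected 24 monthly values")

    def window(start, end):
        # Total cost for the months in [start, end)
        return sum(monthly_costs[start:end])

    return {
        "cost_last_12m": window(12, 24),
        "change_12m": window(12, 24) - window(0, 12),
        "change_6m": window(18, 24) - window(12, 18),
        "change_3m": window(21, 24) - window(18, 21),
    }


# Hypothetical member whose costs rise over the last six months.
costs = [100] * 18 + [150] * 3 + [200] * 3
print(cost_trends(costs))
```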

Insurance claims fraud detection

This section covers fraud detection in health care claims using AI/ML techniques. Labeled data for claims fraud analysis is typically hard to obtain, so a supervised ML technique, which depends on data labels, is often not suitable. Instead, we use an unsupervised ML technique in which the neural network is trained on unlabeled data. The network groups the data and identifies the patterns behind the grouping; records that do not fit those patterns are treated as anomalies. You can use auto-encoders to translate the data into a learned representation and reconstruct it. Records with a higher reconstruction error are flagged as anomalies.
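The reconstruction-error idea can be illustrated without training a neural network. The sketch below swaps the auto-encoder for PCA, which acts as a linear encoder and decoder: rows are projected onto the top principal component and mapped back, and rows that reconstruct poorly receive a high anomaly score. The data is synthetic, and `reconstruction_scores` is a hypothetical helper, not the auto-encoder used later in this article.

```python
import numpy as np


def reconstruction_scores(X, n_components=1):
    """Score rows by squared reconstruction error, scaled to [0, 1].

    PCA is used here as a linear stand-in for a trained auto-encoder:
    rows are projected onto the top principal components (the learned
    representation) and mapped back; rows that do not fit the dominant
    pattern reconstruct poorly and get a high score.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    centered = X - mean
    # Rows of vt are the principal directions, strongest first.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    encoded = centered @ components.T          # learned representation
    decoded = encoded @ components + mean      # reconstruction
    errors = np.sum((X - decoded) ** 2, axis=1)
    return (errors - errors.min()) / (errors.max() - errors.min())


# Synthetic data: 200 records that follow one pattern, plus one that does not.
rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, 200)
normal = np.column_stack([x, 2 * x + rng.normal(0, 0.1, 200)])
X = np.vstack([normal, [[1.0, -6.0]]])

scores = reconstruction_scores(X)
print(int(scores.argmax()))
```

The last row breaks the linear pattern of the other records, so it is the one with the highest score.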

You can generate synthetic data for valid and fraudulent claims when requirements do not allow the use of public data. We use training, testing, and validation data sets with a mix of valid and fraudulent claims.
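A minimal sketch of such a generator is shown below. The column names Age, BMI, Cost, and Fraud follow the code samples later in this article; the distributions, the 5% fraud rate, the cost inflation factor, and the `make_claims` helper are invented for illustration and carry no actuarial meaning.

```python
import random


def make_claims(n_claims, fraud_rate=0.05, seed=7):
    """Generate labelled synthetic claims; all distributions are invented."""
    rng = random.Random(seed)
    claims = []
    for cust_id in range(n_claims):
        is_fraud = rng.random() < fraud_rate
        age = rng.randint(18, 80)
        bmi = round(rng.uniform(17.0, 40.0), 1)
        # Valid claims cluster around a typical cost; fraudulent ones are inflated.
        inflation = rng.uniform(2.0, 4.0) if is_fraud else 1.0
        cost = rng.gauss(3000, 800) * inflation
        claims.append({"cust_id": cust_id, "Age": age, "BMI": bmi,
                       "Cost": round(max(cost, 0.0), 2), "Fraud": int(is_fraud)})
    return claims


claims = make_claims(2000)
# The records are independent, so a sequential 60/20/20 split is sufficient.
train, validation, test = claims[:1200], claims[1200:1600], claims[1600:]
print(sum(c["Fraud"] for c in train), "fraudulent claims in the training set")
```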


TensorFlow – Implementation

Now, let us look at implementing the above use cases with the TensorFlow framework. TensorFlow has features for training models on input data, support for pre-trained models, and visualization. In insurance, the Tweedie GLM is popular for premium prediction.


Python Code Samples for Insurance Damage Assessment

Let us look at using TensorFlow and Keras to detect auto damage and assess it for insurance coverage.

We start with the data preprocessing and the command-line entry point for auto damage detection.

Data Preprocessing

import os
import sys
import json
import datetime
import numpy as nump
import skimage.draw

ROOT_DIR = os.getcwd()
sys.path.append(ROOT_DIR)  # To find local version of the library

from CNNModel.config import Config
from CNNModel import model as modelpack, utils

# CustomConfig, COCO_WEIGHTS_PATH, and detect_and_color_splash are used below,
# but their definitions are not shown in this article.

if __name__ == '__main__':
    import argparse

    # Parse the command line arguments
    parser = argparse.ArgumentParser(
        description='Train Mask R-CNN to detect the damage.')
    parser.add_argument('command',
                        help="'trainmodel' or 'splashapplying'")
    parser.add_argument('--dataset', required=False,
                        help='Directory of the custom dataset')
    parser.add_argument('--weights', required=True,
                        help="Path to weights .h5 file or 'coco'")
    parser.add_argument('--logs', required=False,
                        help='Logs and checkpoints directory (default=logs/)')
    parser.add_argument('--image', required=False,
                        metavar="path or URL to image",
                        help='Image to apply the color splash effect on')
    args = parser.parse_args()

    # Validate the arguments for the selected command
    if args.command == "trainmodel":
        assert args.dataset, "Argument --dataset is required for training the model"
    elif args.command == "splashapplying":
        assert args.image, "Provide --image to apply the color splash"

    print("Weights are ", args.weights)
    print("Dataset are ", args.dataset)
    print("Logs are ", args.logs)

    # Select the configuration
    if args.command == "trainmodel":
        config = CustomConfig()
    else:
        class InferenceConfig(CustomConfig):
            pass  # the configuration overrides are elided in the source
        config = InferenceConfig()

    # Create the model in training or inference mode
    if args.command == "trainmodel":
        model = modelpack.RCNNModel(mode="trainingmodel", config=config,
                                    model_dir=args.logs)
    else:
        model = modelpack.RCNNModel(mode="inferencemodel", config=config,
                                    model_dir=args.logs)

    # Select the weights file to load
    if args.weights.lower() == "coco":
        weights_path = COCO_WEIGHTS_PATH
        if not os.path.exists(weights_path):
            ...  # the step that handles a missing weights file is elided in the source
    elif args.weights.lower() == "last":
        weights_path = model.findlastCheckPoint()[1]
    elif args.weights.lower() == "imagenet":
        weights_path = model.FindImagenetWeights()
    else:
        weights_path = args.weights

    print("Loading the weights ", weights_path)
    if args.weights.lower() == "coco":
        # Skip the head layers, whose shapes depend on the number of classes
        model.loadModelWeights(weights_path, by_name=True, exclude=[
            "CNNModel_class_logits", "CNNModel_bbox_fc",
            "CNNModel_bbox", "CNNModel_mask"])
    else:
        model.loadModelWeights(weights_path, by_name=True)

    # Train the model or apply it to an image
    if args.command == "trainmodel":
        ...  # the training call is elided in the source
    elif args.command == "splashapplying":
        detect_and_color_splash(model, image_path=args.image)
    else:
        print("'{}' is not recognized. "
              "Use 'trainmodel' or 'splashapplying'".format(args.command))

RCNN Model

You can use a Mask R-CNN model for vehicle damage detection.

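import numpy as nump
import tensorflow as tensorf
from tensorflow.keras import backend as BackendKeras
from tensorflow.keras import layers as kerasLayer
from tensorflow.keras import models as kerasModel


class RCNNModel():
    """Wraps the Keras model, which is stored in self.keras_model."""

    def __init__(self, mode, config, model_dir):
        assert mode in ['trainingmodel', 'inferencemodel']
        self.mode = mode
        self.config = config
        self.model_dir = model_dir
        self.keras_model = self.buildModel(mode=mode, config=config)

    def buildModel(self, mode, config):
        assert mode in ['trainingmodel', 'inferencemodel']

        # The image height and width must be multiples of 2**6 = 64
        h, w = config.IMAGE_SHAPE[:2]
        if h % 2**6 != 0 or w % 2**6 != 0:
            raise Exception("Image size must be divisible by 2 at least 6 times "
                            "to avoid fractions when downscaling and upscaling. "
                            "For example, use 256, 320, 384, 448, 512, etc.")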

        # The inputs for the model are created and initialized below.
        inumput_image = kerasLayer.Input(
            shape=[None, None, config.IMAGE_SHAPE[2]], name="inumput_image")
        inumput_image_meta = kerasLayer.Input(shape=[config.IMAGE_META_SIZE],
                                              name="inumput_image_meta")
        if mode == "trainingmodel":
            # Ground truth for the region proposal network (RPN)
            inumput_rpn_match = kerasLayer.Input(
                shape=[None, 1], name="inumput_rpn_match", dtype=tensorf.int32)
            inumput_rpn_bbox = kerasLayer.Input(
                shape=[None, 4], name="inumput_rpn_bbox", dtype=tensorf.float32)
            # Ground truth for detection: class IDs, boxes, and masks
            inumput_gt_class_ids = kerasLayer.Input(
                shape=[None], name="inumput_gt_class_ids", dtype=tensorf.int32)
            inumput_gt_boxes = kerasLayer.Input(
                shape=[None, 4], name="inumput_gt_boxes", dtype=tensorf.float32)
            gt_boxes = kerasLayer.Lambda(lambda x: applyNorm(
                x, BackendKeras.shape(inumput_image)[1:3]))(inumput_gt_boxes)
            if config.USE_MINI_MASK:
                inumput_gt_masks = kerasLayer.Input(
                    shape=[config.MINI_MASK_SHAPE[0],
                           config.MINI_MASK_SHAPE[1], None],
                    name="inumput_gt_masks", dtype=bool)
            else:
                inumput_gt_masks = kerasLayer.Input(
                    shape=[config.IMAGE_SHAPE[0], config.IMAGE_SHAPE[1], None],
                    name="inumput_gt_masks", dtype=bool)
        elif mode == "inferencemodel":
            inumput_anchors = kerasLayer.Input(shape=[None, 4], name="inumput_anchors")

        # Build the backbone; C2 to C5 are the outputs of its stages 2 to 5
        if callable(config.BACKBONE):
            _, C2, C3, C4, C5 = config.BACKBONE(inumput_image, stage5=True,
                                                train_bn=config.TRAIN_BN)
        else:
            _, C2, C3, C4, C5 = resnet_graph(inumput_image, config.BACKBONE,
                                             stage5=True, train_bn=config.TRAIN_BN)

        # Top-down pathway of the feature pyramid: upsample the coarser level
        # and add the 1x1-convolved output of the matching backbone stage
        P5 = kerasLayer.Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (1, 1), name='fpn_c5p5')(C5)
        P4 = kerasLayer.Add(name="fpn_p4add")([
            kerasLayer.UpSampling2D(size=(2, 2), name="fpn_p5upsampled")(P5),
            kerasLayer.Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (1, 1), name='fpn_c4p4')(C4)])
        P3 = kerasLayer.Add(name="fpn_p3add")([
            kerasLayer.UpSampling2D(size=(2, 2), name="fpn_p4upsampled")(P4),
            kerasLayer.Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (1, 1), name='fpn_c3p3')(C3)])
        P2 = kerasLayer.Add(name="fpn_p2add")([
            kerasLayer.UpSampling2D(size=(2, 2), name="fpn_p3upsampled")(P3),
            kerasLayer.Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (1, 1), name='fpn_c2p2')(C2)])

        # Apply a 3x3 convolution to every level to get the final feature maps
        P2 = kerasLayer.Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (3, 3), padding="SAME", name="fpn_p2")(P2)
        P3 = kerasLayer.Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (3, 3), padding="SAME", name="fpn_p3")(P3)
        P4 = kerasLayer.Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (3, 3), padding="SAME", name="fpn_p4")(P4)
        P5 = kerasLayer.Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (3, 3), padding="SAME", name="fpn_p5")(P5)
        # P6 is P5 subsampled with stride 2
        P6 = kerasLayer.MaxPooling2D(pool_size=(1, 1), strides=2, name="fpn_p6")(P5)

        # The RPN uses all levels; the other feature map list leaves out P6
        rpn_feature_maps = [P2, P3, P4, P5, P6]
        CNNModel_feature_maps = [P2, P3, P4, P5]

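        # Anchors: computed from the image shape in training, passed in as input otherwise
        if mode == "trainingmodel":
            anchors = self.FindAnchors(config.IMAGE_SHAPE)
            # Repeat the anchors for every image in the batch
            anchors = nump.broadcast_to(anchors, (config.BATCH_SIZE,) + anchors.shape)
            anchors = kerasLayer.Lambda(lambda x: tensorf.Variable(anchors), name="anchors")(inumput_image)
        else:
            anchors = inumput_anchors

        rpn = build_rpn_model(config.RPN_ANCHOR_STRIDE,
                              ...)  # the remaining arguments are elided in the source

        # The layer outputs are initialized below.
        layer_outputs = []
        for p in rpn_feature_maps:
            # The loop body is elided in the source; applying the RPN
            # to each pyramid level is an assumption
            layer_outputs.append(rpn([p]))

        # Regroup the per-level outputs by type and concatenate them across levels
        output_names = ["rpn_class_logits", "rpn_class", "rpn_bbox"]
        outputs = list(zip(*layer_outputs))
        outputs = [kerasLayer.Concatenate(axis=1, name=n)(list(o))
                   for o, n in zip(outputs, output_names)]
        rpn_class_logits, rpn_class, rpn_bbox = outputs

        proposal_count = config.POST_NMS_ROIS_TRAINING if mode == "trainingmodel" \
            else ...  # the inference-time value is elided in the source

        rpn_rois = ProposalLayer(
            ...,  # the leading arguments are elided in the source
            config=config)([rpn_class, rpn_bbox, anchors])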

        if mode == "trainingmodel":
            active_class_ids = kerasLayer.Lambda(
                lambda x: FindImageMetaGraph(x)["active_class_ids"]
                )(inumput_image_meta)

            if not config.USE_RPN_ROIS:
                # Use regions of interest supplied as input instead of RPN proposals
                inumput_rois = kerasLayer.Input(shape=[config.POST_NMS_ROIS_TRAINING, 4],
                                                name="inumput_roi", dtype=nump.int32)
                target_rois = kerasLayer.Lambda(lambda x: applyNorm(
                    x, BackendKeras.shape(inumput_image)[1:3]))(inumput_rois)
            else:
                target_rois = rpn_rois

            # Generate the detection targets from the ground truth
            rois, target_class_ids, target_bbox, target_mask = \
                DetectionTargetLayer(config, name="proposal_targets")([
                    target_rois, inumput_gt_class_ids, gt_boxes, inumput_gt_masks])

            # Network heads
            CNNModel_class_logits, CNNModel_class, CNNModel_bbox = \
                fpn_classifier_graph(rois, CNNModel_feature_maps, inumput_image_meta,
                                     config.POOL_SIZE, config.NUM_CLASSES,
                                     ...)  # the remaining arguments are elided in the source
            CNNModel_mask = build_fpn_mask_graph(rois, CNNModel_feature_maps,
                                                 ...)  # the remaining arguments are elided in the source

            output_rois = kerasLayer.Lambda(lambda x: x * 1, name="output_rois")(rois)

            # Losses
            rpn_class_loss = kerasLayer.Lambda(lambda x: rpn_class_loss_graph(*x), name="rpn_class_loss")(
                [inumput_rpn_match, rpn_class_logits])
            rpn_bbox_loss = kerasLayer.Lambda(lambda x: rpn_bbox_loss_graph(config, *x), name="rpn_bbox_loss")(
                [inumput_rpn_bbox, inumput_rpn_match, rpn_bbox])
            class_loss = kerasLayer.Lambda(lambda x: CNNModel_class_loss_graph(*x), name="CNNModel_class_loss")(
                [target_class_ids, CNNModel_class_logits, active_class_ids])
            bbox_loss = kerasLayer.Lambda(lambda x: CNNModel_bbox_loss_graph(*x), name="CNNModel_bbox_loss")(
                [target_bbox, target_class_ids, CNNModel_bbox])
            mask_loss = kerasLayer.Lambda(lambda x: CNNModel_mask_loss_graph(*x), name="CNNModel_mask_loss")(
                [target_mask, target_class_ids, CNNModel_mask])

            # Assemble the training model
            inumputs = [inumput_image, inumput_image_meta,
                        inumput_rpn_match, inumput_rpn_bbox, inumput_gt_class_ids,
                        inumput_gt_boxes, inumput_gt_masks]
            if not config.USE_RPN_ROIS:
                inumputs.append(inumput_rois)
            outputs = [rpn_class_logits, rpn_class, rpn_bbox,
                       CNNModel_class_logits, CNNModel_class, CNNModel_bbox, CNNModel_mask,
                       rpn_rois, output_rois,
                       rpn_class_loss, rpn_bbox_loss, class_loss, bbox_loss, mask_loss]
            model = kerasModel.Model(inumputs, outputs, name='mask_rcnn')
        else:
            # Network heads applied to the RPN proposals
            CNNModel_class_logits, CNNModel_class, CNNModel_bbox = \
                fpn_classifier_graph(rpn_rois, CNNModel_feature_maps, inumput_image_meta,
                                     config.POOL_SIZE, config.NUM_CLASSES,
                                     ...)  # the remaining arguments are elided in the source

            # The detection layer is created from the classifier outputs and inumput_image_meta.
            detections = DetectionLayer(config, name="CNNModel_detection")(
                [rpn_rois, CNNModel_class, CNNModel_bbox, inumput_image_meta])

            # Create masks for the detected boxes
            detection_boxes = kerasLayer.Lambda(lambda x: x[..., :4])(detections)
            CNNModel_mask = build_fpn_mask_graph(detection_boxes, CNNModel_feature_maps,
                                                 ...)  # the remaining arguments are elided in the source

            model = kerasModel.Model([inumput_image, inumput_image_meta, inumput_anchors],
                                     [detections, CNNModel_class, CNNModel_bbox,
                                      CNNModel_mask, rpn_rois, rpn_class, rpn_bbox],
                                     name='mask_rcnn')

        # Add multi-GPU support
        if config.GPU_COUNT > 1:
            from CNNModel.parallel_model import ParallelModel
            model = ParallelModel(model, config.GPU_COUNT)

        return model


Python Code Samples for Insurance premium prediction

We look at health insurance premium prediction with the variables listed below:

  • age
    • age of the primary insurance holder
  • sex
    • male, female
  • bmi
    • Body Mass Index (kg/m^2)
  • children/dependents
    • number of children or number of dependents
  • smoker
    • smoking yes or no
  • region
    • US region: Northeast, Southeast, Southwest, Northwest
  • charges
    • Medical costs covered by the health insurance policy


Data Preparation

The goal here is to predict the health insurance premium that covers medical costs. You can read the data from a CSV file. The data has the variables smoker, sex, age, bmi, children, and region, plus the target variable charges.

import pandas as pand
import numpy as nump
import seaborn as seab
import matplotlib.pyplot as matplt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn import linear_model
from sklearn.linear_model import LassoCV

dataframe = pand.read_csv('health_insurance.csv')

# Compare the mean charges between groups to see which factors matter
charges = dataframe['charges']
mean_smo, mean_non_smo = charges[dataframe['smoker'] == 'yes'].mean(), charges[dataframe['smoker'] == 'no'].mean()
mean_male, mean_female = charges[dataframe['sex'] == 'male'].mean(), charges[dataframe['sex'] == 'female'].mean()
mean_bmi_large, mean_bmi_normal = charges[dataframe['bmi'] > 25].mean(), charges[dataframe['bmi'] <= 25].mean()
mean_young, mean_old = charges[dataframe['age'] < 35].mean(), charges[dataframe['age'] >= 35].mean()
mean_no_child, mean_child = charges[dataframe['children'] == 0].mean(), charges[dataframe['children'] > 0].mean()

dataframe = dataframe.drop(['children', 'region'], axis=1)

# Encode the two-valued categorical variables as 0/1 (male = 1, smoker = 1).
# Assigning the two-column output of get_dummies to a single column fails,
# and the code below expects smoker == 1 to mean "smoker".
cat_var = ['sex', 'smoker']
dataframe['sex'] = (dataframe['sex'] == 'male').astype(int)
dataframe['smoker'] = (dataframe['smoker'] == 'yes').astype(int)

def columnadd(age, smoker, bmi):
    """Return 1 for a high-risk member: older than 35, smoker, and BMI above 25."""
    if age > 35 and smoker == 1 and bmi > 25:
        return 1
    else:
        return 0

dataframe['High_Risk'] = dataframe[['age', 'smoker', 'bmi']].apply(lambda x: columnadd(*x), axis=1)
mean_high_risk, mean_not_high_risk = (
    dataframe['charges'][dataframe['High_Risk'] == 1].mean(),
    dataframe['charges'][dataframe['High_Risk'] == 0].mean())
print(mean_high_risk, mean_not_high_risk)

def columnadd1(smoker, bmi):
    """Return 1 for a medium-risk member: smoker with BMI above 25."""
    if smoker == 1 and bmi > 25:
        return 1
    else:
        return 0

dataframe['Medium_Risk'] = dataframe[['smoker', 'bmi']].apply(lambda x: columnadd1(*x), axis=1)
mean_med_risk, mean_not_med_risk = (
    dataframe['charges'][dataframe['Medium_Risk'] == 1].mean(),
    dataframe['charges'][dataframe['Medium_Risk'] == 0].mean())

# Features and target, split 70% / 30% into training and test sets
X = dataframe[['age', 'sex', 'bmi', 'smoker', 'High_Risk', 'Medium_Risk']]
Y = dataframe['charges']
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=100)


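Lasso Regression Model

The data above is split into 70% for training and 30% for testing.

You can use a linear regression baseline and a Lasso regression model for prediction.

# Baseline: ordinary linear regression
regr = linear_model.LinearRegression().fit(X_train, y_train)
result = regr.predict(X_test)
print("Mean squared error is %.2f" % mean_squared_error(y_test, result))
print('Variance score is %.2f' % r2_score(y_test, result))

# Lasso regression with 5-fold cross-validation over the regularization path
lasso_eps = 0.0001
lasso_nalpha = 100   # value elided in the source; this is scikit-learn's default
lasso_iter = 1000    # value elided in the source; this is scikit-learn's default
# The source passes normalize=True, which scikit-learn 1.2 removed;
# scale the features beforehand if you need the same effect.
model_lasso = LassoCV(eps=lasso_eps, n_alphas=lasso_nalpha,
                      max_iter=lasso_iter, cv=5).fit(X_train, y_train)
y_predited_lasso = model_lasso.predict(X_test)

# The plot call is elided in the source; a scatter plot is assumed from the axis labels
matplt.scatter(y_predited_lasso, y_test)
matplt.xlabel("Predicted Health Insurance")
matplt.ylabel("Actual Health Insurance")
print('Variance score is %.2f' % r2_score(y_test, y_predited_lasso))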


Python Code Samples for Insurance claims fraud detection

Let us look at fraud detection in health insurance claims. We will use an unsupervised ML method with an auto-encoder.


Data Preprocessing

import pandas as pand
import numpy as nump
import seaborn as seab
from matplotlib import rcParams
import tensorflow as tensorf
from tensorflow import keras
from tensorflow.keras import backend as BackendKer
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Activation, Dense, Dropout
from tensorflow.keras.layers import BatchNormalization, Input, Lambda
from tensorflow.keras import regularizers
from tensorflow.keras.losses import mse, binary_crossentropy
from sklearn.metrics import log_loss
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.metrics import roc_curve, auc, roc_auc_score
from sklearn import preprocessing as sklprep

get_ipython().run_line_magic('matplotlib', 'inline')  # notebook only
import matplotlib.pyplot as matlabplt
matlabplt.style.use('seaborn-v0_8')  # named 'seaborn' before Matplotlib 3.6
matlabplt.rcParams['figure.figsize'] = (12, 6)

dataframe = pand.read_csv('medical_insurance_fraud_train.csv', index_col=0)
dataframe_test = pand.read_csv('medical_insurance_fraud_test.csv', index_col=0)

def createBuckets(val, size, count):
    """Return the index of the first bucket of width `size` that contains `val`.

    Values beyond the last bucket fall into the last bucket.
    """
    i = 0
    for i in range(count):
        if val <= (i + 1) * size:
            return i
    return i

def createBucketsForDataFrame(dataframe):
    """Replace the Age and BMI columns with bucket indices."""
    dataframe['Age_group'] = [createBuckets(x, 10, 5) for x in dataframe['Age']]
    dataframe['BMI_group'] = [createBuckets(x, 10, 5) for x in dataframe['BMI']]
    dataframe.drop(['Age'], axis=1, inplace=True)
    dataframe.drop(['BMI'], axis=1, inplace=True)
    return dataframe

dataframe = createBucketsForDataFrame(dataframe)
dataframe_test = createBucketsForDataFrame(dataframe_test)

Auto Encoders

Auto-encoders are widely used to encode data. You can use data variables such as Age and BMI to group the data before encoding.

def singlehotEncode(dataframe):
    """One-hot encode the Age_group and BMI_group columns."""
    dataframe = pand.concat([dataframe, pand.get_dummies(dataframe['Age_group'], prefix='Age')], axis=1)
    dataframe = pand.concat([dataframe, pand.get_dummies(dataframe['BMI_group'], prefix='BMI')], axis=1)
    dataframe.drop(['Age_group'], axis=1, inplace=True)
    dataframe.drop(['BMI_group'], axis=1, inplace=True)
    return dataframe

dataframe = singlehotEncode(dataframe)
dataframe_test = singlehotEncode(dataframe_test)

# Keep the customer ID, fraud label, and cost for later inspection
outliers = pand.DataFrame()
outliers['cust_id'] = dataframe.index.values
outliers['Fraud'] = dataframe['Fraud'].values
outliers['Cost'] = dataframe['Cost'].values

outliers_test = pand.DataFrame()
outliers_test['cust_id'] = dataframe_test.index.values
outliers_test['Fraud'] = dataframe_test['Fraud'].values
outliers_test['Cost'] = dataframe_test['Cost'].values

# Separate the features from the fraud label
dataX = dataframe.copy().drop(['Fraud'], axis=1)
testDataX = dataframe_test.copy().drop(['Fraud'], axis=1)
testDataY = dataframe_test['Fraud'].copy()

# Standardize the features: fit the scaler on the training data only,
# then apply the same transformation to the test data
sX = sklprep.StandardScaler(copy=True, with_mean=True, with_std=True)
dataX = pand.DataFrame(sX.fit_transform(dataX),
                       index=dataX.index, columns=dataX.columns)
testDataX = pand.DataFrame(sX.transform(testDataX),
                           index=testDataX.index, columns=testDataX.columns)


Anomaly Detection

We can detect anomalies and score them within the data during analysis.

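def ScoreAnomalies(originalDF, reducedDF):
    """Return the squared reconstruction error per row, scaled to [0, 1]."""
    loss = nump.sum((nump.array(originalDF) - nump.array(reducedDF))**2, axis=1)
    loss = pand.Series(data=loss, index=originalDF.index)
    loss = (loss - nump.min(loss)) / (nump.max(loss) - nump.min(loss))
    print('Mean for anomaly scores: ', nump.mean(loss))
    return loss

# Auto-encoder with three linear layers; input and output are the 14 features
model = Sequential()
model.add(Dense(units=14, activation='linear', input_dim=14))
model.add(Dense(units=14, activation='linear'))
model.add(Dense(units=14, activation='linear'))

# The compile step is elided in the source; the optimizer and loss are assumptions
model.compile(optimizer='adam', loss='mean_squared_error')

num_epochs = 10
batch_size = 256

# Train the network to reproduce its own input
history = model.fit(x=dataX, y=dataX,
                    epochs=num_epochs,
                    batch_size=batch_size,
                    validation_data=(dataX, dataX))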

def createPlot(trueLabels, anomalyScores, returnPreds=False):
    """Plot the precision-recall and ROC curves for the anomaly scores."""
    preds = pand.concat([trueLabels, anomalyScores], axis=1)
    preds.columns = ['trueLabel', 'anomalyScore']
    precision, recall, thresholds = precision_recall_curve(
        preds['trueLabel'], preds['anomalyScore'])
    average_precision = average_precision_score(
        preds['trueLabel'], preds['anomalyScore'])

    # Precision-recall curve (the axis label calls are elided in the source)
    matlabplt.step(recall, precision, color='k', alpha=0.7, where='post')
    matlabplt.fill_between(recall, precision, step='post', alpha=0.3, color='k')
    matlabplt.xlabel('Recall')
    matlabplt.ylabel('Precision')
    matlabplt.ylim([0.0, 1.05])
    matlabplt.xlim([0.0, 1.0])
    matlabplt.title('Precision-Recall curve: Average Precision = {0:0.2f}'.format(average_precision))

    # ROC curve, drawn in a new figure
    fpr, tpr, thresholds = roc_curve(preds['trueLabel'], preds['anomalyScore'])
    areaUnderROC = auc(fpr, tpr)
    matlabplt.figure()
    matlabplt.plot(fpr, tpr, color='r', lw=2, label='ROC curve')
    matlabplt.plot([0, 1], [0, 1], color='k', lw=2, linestyle='--')
    matlabplt.xlim([0.0, 1.0])
    matlabplt.ylim([0.0, 1.05])
    matlabplt.xlabel('False Positive Rate')
    matlabplt.ylabel('True Positive Rate')
    matlabplt.title('Receiver operating characteristic: Area under the curve = {0:0.2f}'.format(areaUnderROC))
    matlabplt.legend(loc="lower right")

    if returnPreds:
        return preds, average_precision

# Score the test claims by how poorly the auto-encoder reconstructs them
predictions = model.predict(testDataX, verbose=1)
anomalyScoresAE = ScoreAnomalies(testDataX, predictions)
preds, avgPrecision = createPlot(testDataY, anomalyScoresAE, True)

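# Train the auto-encoder ten times and keep the model with the best average precision
test_scores = []
best_precision = 0
for i in range(0, 10):
    model = Sequential()
    model.add(Dense(units=14, activation='linear', input_dim=14))
    model.add(Dense(units=14, activation='linear'))
    model.add(Dense(units=14, activation='linear'))

    # The compile step is elided in the source; the optimizer and loss are assumptions
    model.compile(optimizer='adam', loss='mean_squared_error')

    num_epochs = 10
    batch_size = 256
    history = model.fit(x=dataX, y=dataX,
                        epochs=num_epochs,
                        batch_size=batch_size,
                        validation_data=(dataX, dataX))

    predictions = model.predict(testDataX, verbose=1)
    anomalyScoresAE = ScoreAnomalies(testDataX, predictions)
    preds, avgPrecision = createPlot(testDataY, anomalyScoresAE, True)
    test_scores.append(avgPrecision)

    if avgPrecision > best_precision:
        best_precision = avgPrecision
        print("Saving model with best precision: ", best_precision)
        # Keras 3 expects a '.keras' or '.h5' file name here instead of a directory
        model.save("./fraud_model/")

print("Mean average precision over 10 runs is ", nump.mean(test_scores))
print("Coefficient of variation over 10 runs is ",
      nump.std(test_scores) / nump.mean(test_scores))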

# Load the saved model (the same path that the training loop saves to) and score the test claims
imported_model = tensorf.keras.models.load_model("./fraud_model/")
predictions = imported_model.predict(testDataX, verbose=1)
anomalyScoresAE = ScoreAnomalies(testDataX, predictions)

testDataCost = dataframe_test['Cost'].copy()
dataframe_preds = pand.concat([testDataCost, testDataY, anomalyScoresAE], axis=1)
dataframe_preds.columns = ['Cost', 'Fraud', 'AnomalyScore']

# Label every claim against an anomaly score threshold of 0.01:
# 1 = fraud flagged, 2 = valid claim flagged, 3 = fraud missed, 0 = valid claim not flagged
conditions = [
    (dataframe_preds['Fraud'] == 1) & (dataframe_preds['AnomalyScore'] >= 0.01),
    (dataframe_preds['Fraud'] == 0) & (dataframe_preds['AnomalyScore'] >= 0.01),
    (dataframe_preds['Fraud'] == 1) & (dataframe_preds['AnomalyScore'] < 0.01)]
choices = [1, 2, 3]
dataframe_preds['FraudPredict'] = nump.select(conditions, choices, default=0)




Outlier Detection – Fraud Analysis

We can detect outliers in the data using the Cost, customer ID, and Fraud variables.

def labelPredictions(threshold):
    """Label the test claims against an anomaly score threshold.

    1 = fraud flagged, 2 = valid claim flagged, 3 = fraud missed,
    0 = valid claim not flagged.
    """
    testDataCost = dataframe_test['Cost'].copy()
    dataframe_preds = pand.concat([testDataCost, testDataY, anomalyScoresAE], axis=1)
    dataframe_preds.columns = ['Cost', 'Fraud', 'AnomalyScore']
    conditions = [
        (dataframe_preds['Fraud'] == 1) & (dataframe_preds['AnomalyScore'] >= threshold),
        (dataframe_preds['Fraud'] == 0) & (dataframe_preds['AnomalyScore'] >= threshold),
        (dataframe_preds['Fraud'] == 1) & (dataframe_preds['AnomalyScore'] < threshold)]
    choices = [1, 2, 3]
    dataframe_preds['FraudPredict'] = nump.select(conditions, choices, default=0)
    return dataframe_preds

def plotOutliers(dataframe_preds):
    """Scatter plot of cost per customer, colored by the prediction label."""
    outliers = pand.DataFrame()
    outliers['cust_id'] = dataframe_preds.index.values
    outliers['FraudPredict'] = dataframe_preds['FraudPredict'].values
    outliers['Cost'] = dataframe_preds['Cost'].values
    matlabplt.figure()
    seab.scatterplot(x='cust_id', y='Cost', data=outliers, s=100, legend='brief', hue='FraudPredict')

# Compare the labels for three anomaly score thresholds
for threshold in (0.01, 0.005, 0.003):
    dataframe_preds = labelPredictions(threshold)
    plotOutliers(dataframe_preds)




The AI/ML model is analyzed against several performance metrics. The metrics used to verify the trained model and its predictions are:

  • False Negatives
  • False Positives
  • Accuracy
  • Precision = True Positives / (True Positives + False Positives)
  • Recall = True Positives / (True Positives + False Negatives)
  • AUC = area under the ROC curve = P(score of a random positive sample > score of a random negative sample)
  • ROC (Receiver Operating Characteristic) curve
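The definitions above can be checked on a small example. The sketch below computes the metrics from a confusion matrix and computes the AUC directly from its probabilistic definition by comparing every positive sample with every negative sample. The labels, scores, and the 0.35 threshold are made up, and the helper names are hypothetical.

```python
def classification_metrics(labels, predictions):
    """Compute the confusion counts and the derived metrics for 0/1 labels."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    return {
        "false_positives": fp,
        "false_negatives": fn,
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


def pairwise_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical fraud labels (1 = fraud) and anomaly scores for eight claims.
labels = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.4, 0.5, 0.1, 0.8, 0.3, 0.05]
predictions = [int(s >= 0.35) for s in scores]

print(classification_metrics(labels, predictions))
print(pairwise_auc(labels, scores))
```

Lowering the threshold raises recall at the cost of precision, while the AUC stays the same because it depends only on how the scores rank the claims.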

For more information, you can check out the TensorFlow documentation. There are many other use cases in the insurance domain where AI/ML can be applied.


About the Author

Bhagvan Kommadi is the Founder of Quantica Computacao and has around 20 years’ experience in the industry, ranging from large-scale enterprise development to helping incubate software product start-ups. He holds a Master’s in Industrial Systems Engineering from the Georgia Institute of Technology (1997) and a Bachelor’s in Aerospace Engineering from the Indian Institute of Technology, Madras (1993). He is a member of the IFX Forum, Oracle JCP, and a participant in the Java Community Process. He is a member of the MIT Technology Review Global Panel. He is currently the Director of Product Engineering at ValueMomentum. He reviewed the Manning book “Machine Learning with TensorFlow” and is the author of the Packt Publishing book “Hands-On Data Structures and Algorithms with Go”.
