Developing AI/ML applications in insurance
Key Takeaways
- TensorFlow is the most popular AI/ML framework for developing AI/ML applications
- AI/ML techniques are applied in the following insurance processes:
  - Premium Pricing
  - Risk Adjustment for health plans
  - Claim Processing
  - Fraud Detection
Introduction
In this article, we discuss several AI/ML frameworks and compare their features. Based on those features, TensorFlow comes out as the most popular and capable framework. We then show how to develop AI/ML applications for typical insurance use cases such as premium pricing, risk adjustment for health plans, claim processing, and fraud detection.
AI/ML Frameworks
AI and ML frameworks are gaining popularity as applications and solutions multiply across enterprises. Among them, TensorFlow stands out as an open source software package with features such as deep learning, dashboards, analytics, and mobile access. It provides a framework for training, testing, and validating data sets for machine learning algorithms. Other features include multiple programming interfaces, good documentation, GPU support, computational graph abstraction, and strong support from the open-source community. Pre-trained models are available for it as well.
Caffe is another framework, with MATLAB support, CNN modeling, plaintext schema support, and an active open source community.
Amazon ML supports self-learning components. Developers can run data analysis, model training, and evaluation. It has features such as data encryption, integration support, and APIs. It does not support data visualization.
Torch is another framework, with built-in algorithms for deep learning networks. It started off with Lua, a language many programmers did not know. PyTorch is the Python version. It has GPU support, iOS/Android integration, optimization routines, and array index/slice/transpose routines. Beginners may find PyTorch harder to learn and its documentation less approachable.
Apache Mahout is used for creating applications related to data analytics and data engineering. It supports a Scala DSL and multiple backends, and has features for clustering, collaborative filtering, and classification.
Spark MLlib is faster than Mahout and integrates with Hadoop. It supports Java, R, Scala, and Python.
AI/ML Applications - Use Cases
Let us look at AI and ML applications in the insurance vertical: accident damage assessment in auto insurance, premium prediction in health insurance, and fraud detection in health care claim processing.
Insurance Damage assessment
This section presents an AI/ML application for accident damage assessment in auto insurance. We apply computer vision and image processing to assess the damage. You can use multi-instance learning with a ResNet convolutional neural network for image recognition.
An auto insurance provider processes accident claims and covers the damage cost under various plans. The process starts when the auto shop shares the repair estimate and the vehicle owner submits the claim to the insurer.
You need a model trained on images of car parts and damaged parts, with support for different vehicle models. The model helps predict the cost of the damage and speeds up claim processing. A repair can replace a part completely or repair it partially. Repair estimation uses the metadata models, while cost prediction uses the visual models. Multi-instance learning evaluates several images of the same vehicle captured from different perspectives, which improves accuracy.
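One simple way to picture the multi-instance idea is to score each photo separately and then pool the scores into a single claim-level value. The sketch below is only an illustration under assumptions: the function name `aggregate_claim_scores`, the pooling choices, and the sample scores are hypothetical and are not taken from the article's model.

```python
import numpy as np

def aggregate_claim_scores(image_scores, method="max"):
    """Combine per-image damage probabilities into one claim-level score.

    image_scores: probabilities, one per photo of the same vehicle.
    method: "max" flags the claim if any view shows damage,
            "mean" averages the evidence across all views.
    """
    scores = np.asarray(image_scores, dtype=float)
    if scores.size == 0:
        raise ValueError("a claim needs at least one image score")
    if method == "max":
        return float(scores.max())
    if method == "mean":
        return float(scores.mean())
    raise ValueError(f"unknown method: {method}")

# Hypothetical scores for four photos of one claim (front, rear, left, right).
claim_views = [0.10, 0.85, 0.20, 0.05]
claim_max = aggregate_claim_scores(claim_views, "max")
claim_mean = aggregate_claim_scores(claim_views, "mean")
print(claim_max, claim_mean)
```

Max pooling suits damage that is visible from only one angle, while mean pooling is less sensitive to a single noisy photo.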
Auto insurance models can capture the factors listed below:
- Demographics
- Age
- Gender
- Region Code Type
- Income Level
- Family Type
- Vehicle Age
- Vehicle Damage
- Policy premium
- Policy sourcing channel
You need an extensive data set to train the model and improve the accuracy of the estimate, with separate data sets for testing and validation. A single insurance claim uses around 40-50 images for prediction. Typical challenges in this estimation process are real-time inference, high data volume, and traffic.
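Because one claim contributes dozens of images, it is usually safer to split by claim rather than by image, so that photos of the same vehicle never end up in both the training and the test set. The following is a minimal sketch of that idea; the claim ID format, the 70/15/15 proportions, and the helper name `split_claims` are assumptions made for illustration.

```python
import random

def split_claims(claim_ids, train=0.7, val=0.15, seed=42):
    """Split unique claim IDs into train/validation/test sets."""
    unique_ids = sorted(set(claim_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    n_train = round(len(unique_ids) * train)
    n_val = round(len(unique_ids) * val)
    return (set(unique_ids[:n_train]),
            set(unique_ids[n_train:n_train + n_val]),
            set(unique_ids[n_train + n_val:]))

# Hypothetical data: 20 claims with 40 images each, one claim ID per image.
image_claim_ids = [f"claim_{c:03d}" for c in range(20) for _ in range(40)]
train_ids, val_ids, test_ids = split_claims(image_claim_ids)

# Every image follows its claim into exactly one of the three sets.
train_images = [cid for cid in image_claim_ids if cid in train_ids]
print(len(train_ids), len(val_ids), len(test_ids), len(train_images))
```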
Insurance premium prediction
This section looks at premium prediction and risk adjustment for health care plans. A Tweedie Generalized Linear Model is typically used for insurance premium prediction. You can also use Lasso regression.
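A Tweedie model fits claim data well because most members have zero cost in a period while a few have large positive costs. The sketch below computes the mean Tweedie deviance, the quantity such a model minimises, for a power between 1 and 2. It is an illustration only: the function name, the power of 1.5, and the sample costs are assumptions. In practice, scikit-learn's `TweedieRegressor` can fit the model itself.

```python
import numpy as np

def tweedie_deviance(y_true, y_pred, power=1.5):
    """Mean Tweedie deviance for 1 < power < 2 (compound Poisson-gamma)."""
    y = np.asarray(y_true, dtype=float)
    mu = np.asarray(y_pred, dtype=float)
    if not 1 < power < 2:
        raise ValueError("this sketch only covers 1 < power < 2")
    if np.any(y < 0) or np.any(mu <= 0):
        raise ValueError("need y_true >= 0 and y_pred > 0")
    p = power
    unit = 2 * (
        np.power(y, 2 - p) / ((1 - p) * (2 - p))
        - y * np.power(mu, 1 - p) / (1 - p)
        + np.power(mu, 2 - p) / (2 - p)
    )
    return float(unit.mean())

# Hypothetical claim costs: mostly zero, with a few positive amounts.
claims = np.array([0.0, 0.0, 120.0, 0.0, 950.0])
flat_prediction = np.full_like(claims, claims.mean())

perfect = tweedie_deviance(claims + 1.0, claims + 1.0)  # prediction equals target
flat = tweedie_deviance(claims, flat_prediction)        # same prediction for everyone
print(perfect, flat)
```

The deviance is zero when predictions match the targets and grows as they drift apart, which makes it a natural loss and evaluation metric for premium models.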
Health insurers offer multiple policy plans and use a model to forecast the retention of their policyholders. An insurance policy has coverage details for compensation specific to:
- Loss
- Damage
- Illness
- Death
Compensation is specific to the premium paid by the insurance policyholder. Each policy plan has coverage details for various health conditions and treatments. Insurers charge the premium at a frequency agreed upon by the customer and the insurance provider. We base the modeling on two methods: cost prediction based on the last twelve months, and binary classification of whether a member's policy cost exceeds a limit.
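The two modeling methods need two different targets from the same cost history. The sketch below derives both from a list of monthly costs; the helper name `build_targets`, the sample costs, and the limit of 2500 are hypothetical values chosen for illustration.

```python
def build_targets(monthly_costs, cost_limit):
    """Return (regression target, binary label) from a member's monthly costs.

    monthly_costs: costs ordered from oldest to newest.
    cost_limit: hypothetical annual threshold that marks a high-cost member.
    """
    last_twelve = monthly_costs[-12:]
    annual_cost = sum(last_twelve)
    return annual_cost, int(annual_cost > cost_limit)

# Hypothetical member with 18 months of history.
member_costs = [100] * 6 + [250] * 12
annual, high_cost = build_targets(member_costs, cost_limit=2500)
print(annual, high_cost)
```

The first value feeds a regression model such as a Tweedie GLM, and the second feeds a binary classifier.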
You need to ensure that the data used for policy model has the below information:
- Personal Risk Factors
- Personal Data
- Eligibility
- Enrollment Coverage
- Social Determinants of Health
- Clinical History
- Procedures
- Diagnosis
- Drugs
- Clinical Events
- Derived Indices
- Cost History
- Cost Indices
- Cost Trends
Personal data of the customer needs to have the following factors:
- Age
- Gender
- Family Size
- Industry
- Income Level
Social determinants of health should have the following features:
- Social Vulnerability
- Education
- Poverty
- Minority Status
- English-Speaking ability
- Housing
- Transportation
Clinical events need to have the following event types:
- Emergency Department
- Ambulatory
- Hospital Inpatient
You need to add the following derived indices related to clinical history:
- Mortality rate
- Actuarial life expectancy
- Years of Life Lost
The cost history factors for claims are listed below:
- Cost over the past one year and two years
- Total cost in the current month and last twelve months
- Cost of specialty drugs
- Days of drug supply
The claim cost trends are shown below:
- Changes in cost over 1 year
- Changes in 6 and 3-month intervals
- Predicted one year cost
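The cost history and trend factors above can be derived from a member's monthly cost series. The following is a rough sketch under assumptions: the function name, the feature names, and the sample history are hypothetical, and each change is computed as the difference between a window and the window of the same length just before it.

```python
def cost_trend_features(monthly_costs):
    """Derive cost-history and trend features from at least 24 monthly costs (oldest first)."""
    if len(monthly_costs) < 24:
        raise ValueError("need at least 24 months of history")
    last_12 = sum(monthly_costs[-12:])
    prior_12 = sum(monthly_costs[-24:-12])
    last_6, prior_6 = sum(monthly_costs[-6:]), sum(monthly_costs[-12:-6])
    last_3, prior_3 = sum(monthly_costs[-3:]), sum(monthly_costs[-6:-3])
    return {
        "cost_current_month": monthly_costs[-1],
        "cost_last_12_months": last_12,
        "change_1_year": last_12 - prior_12,
        "change_6_months": last_6 - prior_6,
        "change_3_months": last_3 - prior_3,
    }

# Hypothetical member whose costs rise in the most recent six months.
history = [100] * 12 + [100] * 6 + [200] * 3 + [300] * 3
features = cost_trend_features(history)
print(features)
```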
Insurance claims fraud detection
This section covers fraud detection in health care claims using AI/ML. Labeled data for claims fraud analysis is typically hard to get, so supervised ML techniques, which depend on labels, are not suitable. We use an unsupervised ML technique instead, where the neural network is trained on unlabelled data. You can group the data and identify the patterns behind each group; records that deviate from those patterns are treated as anomalies. Autoencoders translate the data into a learned representation and reconstruct it from there. Records with a higher reconstruction error are flagged as anomalies.
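The reconstruction-error idea can be shown without a neural network. In the sketch below, a linear projection computed with SVD stands in for a trained autoencoder: it compresses two correlated features into one value and reconstructs them. This is a swapped-in technique for illustration, and the data and feature meanings are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical claims: two correlated features (e.g. scaled cost and visit count).
base = rng.normal(size=(200, 1))
normal_claims = np.hstack([base, 2.0 * base]) + rng.normal(scale=0.05, size=(200, 2))
odd_claim = np.array([[2.0, -4.0]])           # breaks the usual relationship
claims = np.vstack([normal_claims, odd_claim])

# "Encoder": project onto the main direction learned from the normal claims.
mean = normal_claims.mean(axis=0)
_, _, vt = np.linalg.svd(normal_claims - mean, full_matrices=False)
direction = vt[:1]                             # shape (1, 2)

codes = (claims - mean) @ direction.T          # one-dimensional learned representation
reconstructed = codes @ direction + mean       # "decoder"
errors = np.sum((claims - reconstructed) ** 2, axis=1)

most_anomalous = int(np.argmax(errors))
print(most_anomalous, errors[most_anomalous])
```

The claim that does not follow the learned pattern cannot be reconstructed well, so its error is far larger than the rest.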
You can create synthetic data for valid and fraudulent claims when requirements do not allow the use of public data. We use training, testing, and validation data sets with a mix of valid and fraudulent claims.
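A small generator is often enough to produce such a mix. The sketch below is only one possible approach: the distributions, the assumption that fraudulent claims carry inflated costs, and the function name `make_synthetic_claims` are all invented for illustration and do not describe real claims data.

```python
import numpy as np

def make_synthetic_claims(n_valid, n_fraud, seed=0):
    """Return (features, labels); every distribution here is a made-up assumption."""
    rng = np.random.default_rng(seed)
    valid_cost = rng.gamma(shape=2.0, scale=500.0, size=n_valid)
    fraud_cost = rng.gamma(shape=2.0, scale=500.0, size=n_fraud) * 3.0  # inflated bills
    cost = np.concatenate([valid_cost, fraud_cost])
    age = rng.integers(18, 90, size=n_valid + n_fraud)
    labels = np.concatenate([np.zeros(n_valid, dtype=int), np.ones(n_fraud, dtype=int)])
    order = rng.permutation(n_valid + n_fraud)   # shuffle so the classes are mixed
    features = np.column_stack([age, cost])[order]
    return features, labels[order]

# 5% of the claims are fraudulent in this hypothetical data set.
X, y = make_synthetic_claims(n_valid=950, n_fraud=50)
print(X.shape, int(y.sum()))
```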
TensorFlow - Implementation
Now, let us look at implementing the above use cases with TensorFlow framework. TensorFlow has features for training the models using data input, pre-trained models support, and visualization. In insurance, Tweedie GLM is popular for premium prediction.
Python Code Samples for Insurance Damage Assessment
Let us look at using TensorFlow and Keras to detect auto damage and assess it for insurance coverage.
We start with the data preprocessing and command-line set-up for detecting auto damage.
Data Preprocessing
import os
import sys
import json
import datetime
import numpy as nump
import skimage.draw

ROOT_DIR = os.getcwd()
sys.path.append(ROOT_DIR)  # To find local version of the library
from CNNModel.config import Config
from CNNModel import model as modelpack, utils

# Assumed locations (not defined in the original listing): the pre-trained COCO
# weights file and the default directory for logs and checkpoints.
COCO_WEIGHTS_PATH = os.path.join(ROOT_DIR, "mask_rcnn_coco.h5")
DEFAULT_LOGS_DIR = os.path.join(ROOT_DIR, "logs")

# CustomConfig (a Config subclass), trainModel, and detect_and_color_splash are
# assumed to be defined earlier in this script; they are not shown here.

if __name__ == '__main__':
    import argparse
    parser = argparse.ArgumentParser(
        description='Train Mask R-CNN to detect the damage.')
    parser.add_argument("command",
                        metavar="<command>",
                        help="'trainmodel' or 'splashapplying'")
    parser.add_argument('--dataset', required=False,
                        metavar="/path/to/custom/dataset/",
                        help='Directory of the custom dataset')
    parser.add_argument('--weights', required=True,
                        metavar="/path/to/weights.h5",
                        help="Path to weights .h5 file or 'coco'")
    parser.add_argument('--logs', required=False,
                        default=DEFAULT_LOGS_DIR,
                        metavar="/path/to/logs/",
                        help='Logs and checkpoints directory (default=logs/)')
    parser.add_argument('--image', required=False,
                        metavar="path or URL to image",
                        help='Image to apply the color splash effect on')
    # --video is read below, so it has to be declared as well.
    parser.add_argument('--video', required=False,
                        metavar="path or URL to video",
                        help='Video to apply the color splash effect on')
    args = parser.parse_args()

    # Validate the arguments for the chosen command.
    if args.command == "trainmodel":
        assert args.dataset, "Argument --dataset is required for training the model"
    elif args.command == "splashapplying":
        assert args.image or args.video,\
            "Provide --image or --video to apply the color splash"
    print("Weights are ", args.weights)
    print("Dataset are ", args.dataset)
    print("Logs are ", args.logs)

    # Training uses the custom configuration; inference processes one image at a time.
    if args.command == "trainmodel":
        config = CustomConfig()
    else:
        class InferenceConfig(CustomConfig):
            GPU_COUNT = 1
            IMAGES_PER_GPU = 1
        config = InferenceConfig()
    config.display()

    # Create the model in training or inference mode.
    if args.command == "trainmodel":
        model = modelpack.RCNNModel(mode="trainingmodel", config=config,
                                    model_dir=args.logs)
    else:
        model = modelpack.RCNNModel(mode="inferencemodel", config=config,
                                    model_dir=args.logs)

    # Select the weights file to load.
    if args.weights.lower() == "coco":
        weights_path = COCO_WEIGHTS_PATH
        if not os.path.exists(weights_path):
            utils.download_trained_weights(weights_path)
    elif args.weights.lower() == "last":
        weights_path = model.findlastCheckPoint()[1]
    elif args.weights.lower() == "imagenet":
        weights_path = model.FindImagenetWeights()
    else:
        weights_path = args.weights

    print("Loading the weights ", weights_path)
    if args.weights.lower() == "coco":
        # Skip the head layers, whose shape depends on the number of classes.
        model.loadModelWeights(weights_path, by_name=True, exclude=[
            "CNNModel_class_logits", "CNNModel_bbox_fc",
            "CNNModel_bbox", "CNNModel_mask"])
    else:
        model.loadModelWeights(weights_path, by_name=True)

    # Train or evaluate.
    if args.command == "trainmodel":
        trainModel(model)
    elif args.command == "splashapplying":
        detect_and_color_splash(model, image_path=args.image,
                                video_path=args.video)
    else:
        print("'{}' is not recognized. "
              "Use 'trainmodel' or 'splashapplying'".format(args.command))
RCNN Model
You can use a Mask R-CNN model for vehicle damage detection.
# Assumed imports for the aliases used below; the original listing omits them.
import numpy as nump
import tensorflow as tensorf
from tensorflow.keras import backend as BackendKeras
from tensorflow.keras import layers as kerasLayer
from tensorflow.keras import models as kerasModel

# The helper graphs and layers (resnet_graph, build_rpn_model, ProposalLayer,
# DetectionTargetLayer, DetectionLayer, fpn_classifier_graph,
# build_fpn_mask_graph, applyNorm, FindImageMetaGraph, the *_loss_graph
# functions) and the methods setLogDirectory and FindAnchors are assumed to be
# defined elsewhere in the CNNModel package; they are not shown here.


class RCNNModel():
    def __init__(self, mode, config, model_dir):
        assert mode in ['trainingmodel', 'inferencemodel']
        self.mode = mode
        self.config = config
        self.model_dir = model_dir
        self.setLogDirectory()
        self.keras_model = self.buildModel(mode=mode, config=config)

    def buildModel(self, mode, config):
        assert mode in ['trainingmodel', 'inferencemodel']
        # The image is halved six times, so both sides must be multiples of 64.
        h, w = config.IMAGE_SHAPE[:2]
        if h / 2**6 != int(h / 2**6) or w / 2**6 != int(w / 2**6):
            raise Exception("Image size must be divisible by 2 at least 6 times "
                            "to avoid fractions when downscaling and upscaling. "
                            "For example, use 256, 320, 384, 448, 512, ... etc. ")

        # Inputs for the model are created and initialized below.
        input_image = kerasLayer.Input(
            shape=[None, None, config.IMAGE_SHAPE[2]], name="input_image")
        input_image_meta = kerasLayer.Input(shape=[config.IMAGE_META_SIZE],
                                            name="input_image_meta")
        if mode == "trainingmodel":
            # Region proposal network targets.
            input_rpn_match = kerasLayer.Input(
                shape=[None, 1], name="input_rpn_match", dtype=tensorf.int32)
            input_rpn_bbox = kerasLayer.Input(
                shape=[None, 4], name="input_rpn_bbox", dtype=tensorf.float32)
            # Ground-truth class IDs, boxes (normalized), and masks.
            input_gt_class_ids = kerasLayer.Input(
                shape=[None], name="input_gt_class_ids", dtype=tensorf.int32)
            input_gt_boxes = kerasLayer.Input(
                shape=[None, 4], name="input_gt_boxes", dtype=tensorf.float32)
            gt_boxes = kerasLayer.Lambda(lambda x: applyNorm(
                x, BackendKeras.shape(input_image)[1:3]))(input_gt_boxes)
            if config.USE_MINI_MASK:
                input_gt_masks = kerasLayer.Input(
                    shape=[config.MINI_MASK_SHAPE[0],
                           config.MINI_MASK_SHAPE[1], None],
                    name="input_gt_masks", dtype=bool)
            else:
                input_gt_masks = kerasLayer.Input(
                    shape=[config.IMAGE_SHAPE[0], config.IMAGE_SHAPE[1], None],
                    name="input_gt_masks", dtype=bool)
        elif mode == "inferencemodel":
            input_anchors = kerasLayer.Input(shape=[None, 4], name="input_anchors")

        # Backbone: a custom callable or a ResNet graph.
        if callable(config.BACKBONE):
            _, C2, C3, C4, C5 = config.BACKBONE(input_image, stage5=True,
                                                train_bn=config.TRAIN_BN)
        else:
            _, C2, C3, C4, C5 = resnet_graph(input_image, config.BACKBONE,
                                             stage5=True, train_bn=config.TRAIN_BN)

        # Feature pyramid: top-down pathway with lateral connections.
        P5 = kerasLayer.Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (1, 1), name='fpn_c5p5')(C5)
        P4 = kerasLayer.Add(name="fpn_p4add")([
            kerasLayer.UpSampling2D(size=(2, 2), name="fpn_p5upsampled")(P5),
            kerasLayer.Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (1, 1), name='fpn_c4p4')(C4)])
        P3 = kerasLayer.Add(name="fpn_p3add")([
            kerasLayer.UpSampling2D(size=(2, 2), name="fpn_p4upsampled")(P4),
            kerasLayer.Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (1, 1), name='fpn_c3p3')(C3)])
        P2 = kerasLayer.Add(name="fpn_p2add")([
            kerasLayer.UpSampling2D(size=(2, 2), name="fpn_p3upsampled")(P3),
            kerasLayer.Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (1, 1), name='fpn_c2p2')(C2)])
        # A 3x3 convolution on each level produces the final feature maps.
        P2 = kerasLayer.Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (3, 3), padding="SAME", name="fpn_p2")(P2)
        P3 = kerasLayer.Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (3, 3), padding="SAME", name="fpn_p3")(P3)
        P4 = kerasLayer.Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (3, 3), padding="SAME", name="fpn_p4")(P4)
        P5 = kerasLayer.Conv2D(config.TOP_DOWN_PYRAMID_SIZE, (3, 3), padding="SAME", name="fpn_p5")(P5)
        # P6 is a subsampled P5, used only by the region proposal network.
        P6 = kerasLayer.MaxPooling2D(pool_size=(1, 1), strides=2, name="fpn_p6")(P5)
        rpn_feature_maps = [P2, P3, P4, P5, P6]
        CNNModel_feature_maps = [P2, P3, P4, P5]

        # Anchors are computed for training and passed in as input for inference.
        if mode == "trainingmodel":
            anchors = self.FindAnchors(config.IMAGE_SHAPE)
            anchors = nump.broadcast_to(anchors, (config.BATCH_SIZE,) + anchors.shape)
            anchors = kerasLayer.Lambda(lambda x: tensorf.Variable(anchors), name="anchors")(input_image)
        else:
            anchors = input_anchors

        rpn = build_rpn_model(config.RPN_ANCHOR_STRIDE,
                              len(config.RPN_ANCHOR_RATIOS), config.TOP_DOWN_PYRAMID_SIZE)

        # Layer outputs are initialized below: run the RPN on every pyramid
        # level and concatenate the results per output type.
        layer_outputs = []
        for p in rpn_feature_maps:
            layer_outputs.append(rpn([p]))
        output_names = ["rpn_class_logits", "rpn_class", "rpn_bbox"]
        outputs = list(zip(*layer_outputs))
        outputs = [kerasLayer.Concatenate(axis=1, name=n)(list(o))
                   for o, n in zip(outputs, output_names)]
        rpn_class_logits, rpn_class, rpn_bbox = outputs

        # Generate region proposals.
        proposal_count = config.POST_NMS_ROIS_TRAINING if mode == "trainingmodel"\
            else config.POST_NMS_ROIS_INFERENCE
        rpn_rois = ProposalLayer(
            proposal_count=proposal_count,
            nms_threshold=config.RPN_NMS_THRESHOLD,
            name="ROI",
            config=config)([rpn_class, rpn_bbox, anchors])

        if mode == "trainingmodel":
            active_class_ids = kerasLayer.Lambda(
                lambda x: FindImageMetaGraph(x)["active_class_ids"]
            )(input_image_meta)
            if not config.USE_RPN_ROIS:
                # Use externally supplied regions of interest instead of RPN output.
                input_rois = kerasLayer.Input(shape=[config.POST_NMS_ROIS_TRAINING, 4],
                                              name="input_roi", dtype=nump.int32)
                target_rois = kerasLayer.Lambda(lambda x: applyNorm(
                    x, BackendKeras.shape(input_image)[1:3]))(input_rois)
            else:
                target_rois = rpn_rois

            # Build the detection targets for the network heads.
            rois, target_class_ids, target_bbox, target_mask =\
                DetectionTargetLayer(config, name="proposal_targets")([
                    target_rois, input_gt_class_ids, gt_boxes, input_gt_masks])

            # Network heads: classifier, box regressor, and mask branch.
            CNNModel_class_logits, CNNModel_class, CNNModel_bbox =\
                fpn_classifier_graph(rois, CNNModel_feature_maps, input_image_meta,
                                     config.POOL_SIZE, config.NUM_CLASSES,
                                     train_bn=config.TRAIN_BN,
                                     fc_layers_size=config.FPN_CLASSIF_FC_LAYERS_SIZE)
            CNNModel_mask = build_fpn_mask_graph(rois, CNNModel_feature_maps,
                                                 input_image_meta,
                                                 config.MASK_POOL_SIZE,
                                                 config.NUM_CLASSES,
                                                 train_bn=config.TRAIN_BN)
            output_rois = kerasLayer.Lambda(lambda x: x * 1, name="output_rois")(rois)

            # Losses.
            rpn_class_loss = kerasLayer.Lambda(lambda x: rpn_class_loss_graph(*x), name="rpn_class_loss")(
                [input_rpn_match, rpn_class_logits])
            rpn_bbox_loss = kerasLayer.Lambda(lambda x: rpn_bbox_loss_graph(config, *x), name="rpn_bbox_loss")(
                [input_rpn_bbox, input_rpn_match, rpn_bbox])
            class_loss = kerasLayer.Lambda(lambda x: CNNModel_class_loss_graph(*x), name="CNNModel_class_loss")(
                [target_class_ids, CNNModel_class_logits, active_class_ids])
            bbox_loss = kerasLayer.Lambda(lambda x: CNNModel_bbox_loss_graph(*x), name="CNNModel_bbox_loss")(
                [target_bbox, target_class_ids, CNNModel_bbox])
            mask_loss = kerasLayer.Lambda(lambda x: CNNModel_mask_loss_graph(*x), name="CNNModel_mask_loss")(
                [target_mask, target_class_ids, CNNModel_mask])

            # Assemble the training model.
            inputs = [input_image, input_image_meta,
                      input_rpn_match, input_rpn_bbox, input_gt_class_ids, input_gt_boxes, input_gt_masks]
            if not config.USE_RPN_ROIS:
                inputs.append(input_rois)
            outputs = [rpn_class_logits, rpn_class, rpn_bbox,
                       CNNModel_class_logits, CNNModel_class, CNNModel_bbox, CNNModel_mask,
                       rpn_rois, output_rois,
                       rpn_class_loss, rpn_bbox_loss, class_loss, bbox_loss, mask_loss]
            model = kerasModel.Model(inputs, outputs, name='mask_rcnn')
        else:
            CNNModel_class_logits, CNNModel_class, CNNModel_bbox =\
                fpn_classifier_graph(rpn_rois, CNNModel_feature_maps, input_image_meta,
                                     config.POOL_SIZE, config.NUM_CLASSES,
                                     train_bn=config.TRAIN_BN,
                                     fc_layers_size=config.FPN_CLASSIF_FC_LAYERS_SIZE)

            # The detection layer is created from the classifier outputs and input_image_meta.
            detections = DetectionLayer(config, name="CNNModel_detection")(
                [rpn_rois, CNNModel_class, CNNModel_bbox, input_image_meta])

            # Masks are predicted only for the detected boxes.
            detection_boxes = kerasLayer.Lambda(lambda x: x[..., :4])(detections)
            CNNModel_mask = build_fpn_mask_graph(detection_boxes, CNNModel_feature_maps,
                                                 input_image_meta,
                                                 config.MASK_POOL_SIZE,
                                                 config.NUM_CLASSES,
                                                 train_bn=config.TRAIN_BN)

            model = kerasModel.Model([input_image, input_image_meta, input_anchors],
                                     [detections, CNNModel_class, CNNModel_bbox,
                                      CNNModel_mask, rpn_rois, rpn_class, rpn_bbox],
                                     name='mask_rcnn')

        # Wrap the model for multi-GPU training when more than one GPU is configured.
        if config.GPU_COUNT > 1:
            from CNNModel.parallel_model import ParallelModel
            model = ParallelModel(model, config.GPU_COUNT)
        return model
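The image-size check at the start of the model-building code requires both sides of the image to be divisible by 2 six times, which means multiples of 64. The small sketch below shows how candidate sizes can be checked and padded before training; the helper names `valid_image_side` and `next_valid_side` are hypothetical and not part of the listing above.

```python
def valid_image_side(side, levels=6):
    """True when `side` can be halved `levels` times without a remainder."""
    return side % (2 ** levels) == 0

def next_valid_side(side, levels=6):
    """Smallest valid side length that is greater than or equal to `side`."""
    step = 2 ** levels
    return ((side + step - 1) // step) * step

checks = {side: valid_image_side(side) for side in (256, 300, 320, 500, 512)}
padded = next_valid_side(500)
print(checks, padded)
```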
Python Code Samples for Insurance premium prediction
We look at health insurance premium prediction using the factors listed below:
- age: age of the primary insurance holder
- sex: male or female
- bmi: Body Mass Index (kg/m^2)
- children/dependents: number of children or dependents
- smoker: whether the holder smokes (yes or no)
- region: area of the US (NorthEast, SouthEast, SouthWest, NorthWest)
- charges: medical costs covered by the health insurance policy
Data Preparation
The goal here is to predict the health insurance premium for coverage of medical costs. You can read the data from a CSV file. The data has the variables smoker, sex, age, bmi, children, region, and charges.
import pandas as pand
import numpy as nump
import seaborn as seab
import matplotlib.pyplot as matplt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn import linear_model
from sklearn.linear_model import LassoCV

dataframe = pand.read_csv('health_insurance.csv')
dataframe.isna().sum()  # check for missing values

# Compare mean charges across groups to see which factors matter.
mean_smo, mean_non_smo = dataframe['charges'][dataframe['smoker'] == 'yes'].mean(), dataframe['charges'][dataframe['smoker'] == 'no'].mean()
mean_male, mean_female = dataframe['charges'][dataframe['sex'] == 'male'].mean(), dataframe['charges'][dataframe['sex'] == 'female'].mean()
mean_bmi_large, mean_bmi_normal = dataframe['charges'][dataframe['bmi'] > 25].mean(), dataframe['charges'][dataframe['bmi'] <= 25].mean()
mean_young, mean_old = dataframe['charges'][dataframe['age'] < 35].mean(), dataframe['charges'][dataframe['age'] >= 35].mean()
mean_no_child, mean_child = dataframe['charges'][dataframe['children'] == 0].mean(), dataframe['charges'][dataframe['children'] > 0].mean()

# Drop the columns that are not used as model features.
dataframe = dataframe.drop(['children', 'region'], axis=1)

# Encode the binary categorical columns as 0/1. LabelEncoder sorts the labels
# alphabetically, so female=0, male=1 and no=0, yes=1.
cat_var = ['sex', 'smoker']
for column in cat_var:
    dataframe[column] = LabelEncoder().fit_transform(dataframe[column])

# High risk: older than 35, smoker, and BMI above 25.
def columnadd(age, smoker, bmi):
    if age > 35 and smoker == 1 and bmi > 25:
        return 1
    else:
        return 0

dataframe['High_Risk'] = dataframe[['age', 'smoker', 'bmi']].apply(lambda x: columnadd(*x), axis=1)
mean_high_risk, mean_not_high_risk = dataframe['charges'][dataframe['High_Risk'] == 1].mean(), dataframe['charges'][dataframe['High_Risk'] == 0].mean()
mean_high_risk, mean_not_high_risk

# Medium risk: smoker with BMI above 25, regardless of age.
def columnadd1(smoker, bmi):
    if smoker == 1 and bmi > 25:
        return 1
    else:
        return 0

dataframe['Medium_Risk'] = dataframe[['smoker', 'bmi']].apply(lambda x: columnadd1(*x), axis=1)
mean_med_risk, mean_not_med_risk = dataframe['charges'][dataframe['Medium_Risk'] == 1].mean(), dataframe['charges'][dataframe['Medium_Risk'] == 0].mean()

# Features and target, split 70% for training and 30% for testing.
X = dataframe[['age', 'sex', 'bmi', 'smoker', 'High_Risk', 'Medium_Risk']]
Y = dataframe['charges']
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=100)
Lasso Regression Model
The data above is split into 70% for training and 30% for testing.
A linear regression model serves as a baseline, and a Lasso regression model with cross-validation is then used for prediction.
# Baseline: ordinary least squares regression.
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
result = regr.predict(X_test)
print("Mean squared error is %.2f"
      % mean_squared_error(y_test, result))
print('Variance score is %.2f' % r2_score(y_test, result))

# Lasso with 5-fold cross-validation over 20 candidate alphas.
# The `normalize` argument was removed from LassoCV in scikit-learn 1.2, so it
# is omitted here; scale the features beforehand if they need normalizing.
lasso_eps = 0.0001
lasso_nalpha = 20
lasso_iter = 10000
model_lasso = LassoCV(eps=lasso_eps, n_alphas=lasso_nalpha, max_iter=lasso_iter, cv=5)
model_lasso.fit(X_train, y_train)
print(list(zip(model_lasso.coef_, X_train.columns)))
print(model_lasso.intercept_)

# Compare predicted and actual charges on the test set.
y_predicted_lasso = model_lasso.predict(X_test)
matplt.scatter(y_predicted_lasso, y_test)
matplt.xlabel("Predicted Health Insurance")
matplt.ylabel("Actual Health Insurance")
print('Variance score is %.2f' % r2_score(y_test, y_predicted_lasso))
Python Code Samples for Insurance claims fraud detection
Let us look at fraud detection in health insurance claims. We will use an unsupervised ML method with an autoencoder.
Data Preprocessing
import pandas as pand
import numpy as nump
import seaborn as seab
from matplotlib import rcParams
import tensorflow as tensorf
from tensorflow import keras
from tensorflow.keras import backend as BackendKer
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Activation, Dense, Dropout
from tensorflow.keras.layers import BatchNormalization, Input, Lambda
from tensorflow.keras import regularizers
from tensorflow.keras.losses import mse, binary_crossentropy
from sklearn.metrics import log_loss
from sklearn.metrics import precision_recall_curve, average_precision_score
from sklearn.metrics import roc_curve, auc, roc_auc_score
from sklearn import preprocessing as sklprep
# In a Jupyter notebook, enable inline plots with: %matplotlib inline
import matplotlib.pyplot as matlabplt
matlabplt.style.use('seaborn')  # newer Matplotlib releases name this style 'seaborn-v0_8'
matlabplt.rcParams['figure.figsize'] = (12, 6)

# The first CSV column is used as the index (the customer ID).
dataframe = pand.read_csv('medical_insurance_fraud_train.csv', index_col=0)
dataframe_test = pand.read_csv('medical_insurance_fraud_test.csv', index_col=0)

def createBuckets(val, size, count):
    # Return the index of the first bucket whose upper edge covers val;
    # values beyond the last edge fall into the final bucket.
    i = 0
    for i in range(count):
        if val <= (i + 1) * size:
            return i
    return i

def createBucketsForDataFrame(dataframe):
    # Replace the raw Age and BMI columns with their bucket indices.
    dataframe['Age_group'] = [createBuckets(x, 10, 5) for x in dataframe['Age']]
    dataframe['BMI_group'] = [createBuckets(x, 10, 5) for x in dataframe['BMI']]
    dataframe.drop(['Age'], axis=1, inplace=True)
    dataframe.drop(['BMI'], axis=1, inplace=True)
    return dataframe

dataframe = createBucketsForDataFrame(dataframe)
dataframe_test = createBucketsForDataFrame(dataframe_test)
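To see what the bucketing does, it helps to run the helper on a few values. The sketch below repeats the createBuckets logic so that it runs on its own; the sample ages are made up.

```python
def createBuckets(val, size, count):
    """Index of the first bucket whose upper edge covers val."""
    i = 0
    for i in range(count):
        if val <= (i + 1) * size:
            return i
    return i  # values past the last edge fall into the final bucket

# Hypothetical ages, bucketed with the same size (10) and count (5) as above.
ages = [8, 10, 25, 49, 50, 75]
age_groups = [createBuckets(a, 10, 5) for a in ages]
print(age_groups)
```

With a size of 10 and a count of 5, every age above 40 lands in the last bucket, so a larger size or count may separate adult age bands better.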
Auto Encoders
Autoencoders learn an encoding of the data. Before training one, you can group data variables such as Age and BMI, one-hot encode the groups, and scale the features.
def singlehotEncode(dataframe):
    # One-hot encode the Age and BMI groups, then drop the group columns.
    dataframe = pand.concat([dataframe, pand.get_dummies(dataframe['Age_group'], prefix='Age')], axis=1)
    dataframe = pand.concat([dataframe, pand.get_dummies(dataframe['BMI_group'], prefix='BMI')], axis=1)
    dataframe.drop(['Age_group'], axis=1, inplace=True)
    dataframe.drop(['BMI_group'], axis=1, inplace=True)
    return dataframe

dataframe = singlehotEncode(dataframe)
dataframe_test = singlehotEncode(dataframe_test)

# Keep customer ID, fraud label, and cost together for later plots.
# .values avoids index misalignment between the two frames.
outliers = pand.DataFrame()
outliers['cust_id'] = dataframe.index.values
outliers['Fraud'] = dataframe['Fraud'].values
outliers['Cost'] = dataframe['Cost'].values
outliers_test = pand.DataFrame()
outliers_test['cust_id'] = dataframe_test.index.values
outliers_test['Fraud'] = dataframe_test['Fraud'].values
outliers_test['Cost'] = dataframe_test['Cost'].values

# Separate the features from the fraud label.
dataX = dataframe.copy().drop(['Fraud'], axis=1)
testDataX = dataframe_test.copy().drop(['Fraud'], axis=1)
testDataY = dataframe_test['Fraud'].copy()

# Fit the scaler on the training data only and reuse it for the test data.
featuresToScale = dataX.columns
sX = sklprep.StandardScaler(copy=True, with_mean=True, with_std=True)
dataX.loc[:, featuresToScale] = sX.fit_transform(dataX[featuresToScale])
featuresToScale = testDataX.columns
testDataX.loc[:, featuresToScale] = sX.transform(testDataX[featuresToScale])
Anomaly Detection
We can detect anomalies and score them within the data during analysis.
def ScoreAnomalies(originalDF, reducedDF):
    # Reconstruction error per row, scaled to the range [0, 1].
    loss = nump.sum((nump.array(originalDF) - nump.array(reducedDF))**2, axis=1)
    loss = pand.Series(data=loss, index=originalDF.index)
    loss = (loss - nump.min(loss)) / (nump.max(loss) - nump.min(loss))
    print('Mean for anomaly scores: ', nump.mean(loss))
    return loss

# Training settings, defined before the first call to fit.
num_epochs = 10
batch_size = 256

# Linear autoencoder: the network is trained to reproduce its 14 input features.
model = Sequential()
model.add(Dense(units=14, activation='linear', input_dim=14))
model.add(Dense(units=14, activation='linear'))
model.add(Dense(units=14, activation='linear'))
model.compile(optimizer='adam',
              loss='mean_squared_error',
              metrics=['accuracy'])
history = model.fit(x=dataX, y=dataX,
                    epochs=num_epochs,
                    batch_size=batch_size,
                    shuffle=True,
                    validation_data=(dataX, dataX),
                    verbose=1)
def createPlot(trueLabels, anomalyScores, returnPreds=False):
    preds = pand.concat([trueLabels, anomalyScores], axis=1)
    preds.columns = ['trueLabel', 'anomalyScore']

    # Precision-recall curve.
    precision, recall, thresholds = precision_recall_curve(preds['trueLabel'], preds['anomalyScore'])
    average_precision = average_precision_score(preds['trueLabel'], preds['anomalyScore'])
    matlabplt.step(recall, precision, color='k', alpha=0.7, where='post')
    matlabplt.fill_between(recall, precision, step='post', alpha=0.3, color='k')
    matlabplt.xlabel('Recall')
    matlabplt.ylabel('Precision')
    matlabplt.ylim([0.0, 1.05])
    matlabplt.xlim([0.0, 1.0])
    matlabplt.title('Precision-Recall curve: Average Precision = {0:0.2f}'.format(average_precision))

    # ROC curve.
    fpr, tpr, thresholds = roc_curve(preds['trueLabel'], preds['anomalyScore'])
    areaUnderROC = auc(fpr, tpr)
    matlabplt.figure()
    matlabplt.plot(fpr, tpr, color='r', lw=2, label='ROC curve')
    matlabplt.plot([0, 1], [0, 1], color='k', lw=2, linestyle='--')
    matlabplt.xlim([0.0, 1.0])
    matlabplt.ylim([0.0, 1.05])
    matlabplt.xlabel('False Positive Rate')
    matlabplt.ylabel('True Positive Rate')
    matlabplt.title('Receiver operating characteristic: Area under the curve = {0:0.2f}'.format(areaUnderROC))
    matlabplt.legend(loc="lower right")
    matlabplt.show()
    if returnPreds:
        return preds, average_precision

# Score the test claims by reconstruction error and plot the curves.
predictions = model.predict(testDataX, verbose=1)
anomalyScoresAE = ScoreAnomalies(testDataX, predictions)
preds, avgPrecision = createPlot(testDataY, anomalyScoresAE, True)
# Train the autoencoder ten times and keep the run with the best average precision.
test_scores = []
best_precision = 0
for i in range(0, 10):
    model = Sequential()
    model.add(Dense(units=14, activation='linear', input_dim=14))
    model.add(Dense(units=14, activation='linear'))
    model.add(Dense(units=14, activation='linear'))
    model.compile(optimizer='adam',
                  loss='mean_squared_error',
                  metrics=['accuracy'])
    num_epochs = 10
    batch_size = 256
    history = model.fit(x=dataX, y=dataX,
                        epochs=num_epochs,
                        batch_size=batch_size,
                        shuffle=True,
                        validation_data=(dataX, dataX),
                        verbose=0)
    predictions = model.predict(testDataX, verbose=1)
    anomalyScoresAE = ScoreAnomalies(testDataX, predictions)
    preds, avgPrecision = createPlot(testDataY, anomalyScoresAE, True)
    test_scores.append(avgPrecision)
    if avgPrecision > best_precision:
        best_precision = avgPrecision
        print("Saving model with best precision: ", best_precision)
        tensorf.saved_model.save(model, "./health_insurance_fraud_model/")
print("Mean average precision over 10 runs is ", nump.mean(test_scores))
print("Coefficient of variation over 10 runs is ", nump.std(test_scores) / nump.mean(test_scores))

# Reload the best model from the same directory it was saved to.
imported_model = tensorf.keras.models.load_model("./health_insurance_fraud_model/")
predictions = imported_model.predict(testDataX, verbose=1)
anomalyScoresAE = ScoreAnomalies(testDataX, predictions)

# Label each test claim at an anomaly-score threshold of 0.01:
# 1 = true positive, 2 = false positive, 3 = false negative, 0 = true negative.
testDataCost = dataframe_test['Cost'].copy()
dataframe_preds = pand.concat([testDataCost, testDataY, anomalyScoresAE], axis=1)
dataframe_preds.columns = ['Cost', 'Fraud', 'AnomalyScore']
conditions = [
    (dataframe_preds['Fraud'] == 1) & (dataframe_preds['AnomalyScore'] >= 0.01),
    (dataframe_preds['Fraud'] == 0) & (dataframe_preds['AnomalyScore'] >= 0.01),
    (dataframe_preds['Fraud'] == 1) & (dataframe_preds['AnomalyScore'] < 0.01)]
choices = [1, 2, 3]
dataframe_preds['FraudPredict'] = nump.select(conditions, choices, default=0)
dataframe_preds.tail()
dataframe_preds['FraudPredict'].value_counts()
Outlier Detection - Fraud Analysis
We can visualise outliers using the Cost, customer ID, and predicted fraud variables, and compare the results at different anomaly-score thresholds.
def labelAndPlot(threshold):
    # Label each test claim at the given threshold:
    # 1 = true positive, 2 = false positive, 3 = false negative, 0 = true negative.
    testDataCost = dataframe_test['Cost'].copy()
    dataframe_preds = pand.concat([testDataCost, testDataY, anomalyScoresAE], axis=1)
    dataframe_preds.columns = ['Cost', 'Fraud', 'AnomalyScore']
    conditions = [
        (dataframe_preds['Fraud'] == 1) & (dataframe_preds['AnomalyScore'] >= threshold),
        (dataframe_preds['Fraud'] == 0) & (dataframe_preds['AnomalyScore'] >= threshold),
        (dataframe_preds['Fraud'] == 1) & (dataframe_preds['AnomalyScore'] < threshold)]
    choices = [1, 2, 3]
    dataframe_preds['FraudPredict'] = nump.select(conditions, choices, default=0)
    print(dataframe_preds['FraudPredict'].value_counts())

    # Plot cost per customer, coloured by the predicted category.
    outliers = pand.DataFrame()
    outliers['cust_id'] = dataframe_preds.index.values
    outliers['FraudPredict'] = dataframe_preds['FraudPredict'].values
    outliers['Cost'] = dataframe_preds['Cost'].values
    matlabplt.figure()
    seab.scatterplot(x='cust_id', y='Cost', data=outliers, s=100, legend='brief', hue='FraudPredict')
    matlabplt.show()
    return dataframe_preds

# Compare the three thresholds used in this analysis.
for threshold in (0.01, 0.005, 0.003):
    dataframe_preds = labelAndPlot(threshold)
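Lowering the threshold catches more fraud but also flags more valid claims. The sketch below counts the four outcomes at each of the three thresholds used above, on a small invented set of labels and scores; the function name `threshold_outcomes` and the sample values are hypothetical.

```python
import numpy as np

def threshold_outcomes(fraud_labels, anomaly_scores, threshold):
    """Count outcomes using the FraudPredict coding (1=TP, 2=FP, 3=FN, 0=TN)."""
    labels = np.asarray(fraud_labels)
    scores = np.asarray(anomaly_scores)
    return {
        "true_positive": int(np.sum((labels == 1) & (scores >= threshold))),
        "false_positive": int(np.sum((labels == 0) & (scores >= threshold))),
        "false_negative": int(np.sum((labels == 1) & (scores < threshold))),
        "true_negative": int(np.sum((labels == 0) & (scores < threshold))),
    }

# Hypothetical labels and normalised anomaly scores for eight claims.
labels = [0, 0, 0, 0, 0, 1, 1, 1]
scores = [0.001, 0.002, 0.004, 0.006, 0.002, 0.012, 0.004, 0.008]
sweep = {t: threshold_outcomes(labels, scores, t) for t in (0.01, 0.005, 0.003)}
for t, outcome in sweep.items():
    print(t, outcome)
```

In this toy data the strictest threshold misses two of the three fraudulent claims, while the loosest one catches all three at the cost of two false alarms.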
Conclusions
The AI/ML model is analyzed for different performance areas and metrics. The metrics used to verify the trained model and its predictions are:
- False Negatives
- False Positives
- Accuracy
- Precision = True Positives / (True Positives + False Positives)
- Recall = True Positives / (True Positives + False Negatives)
- AUC = Area under the ROC curve = P(score of a random positive sample > score of a random negative sample)
- ROC (Receiver Operating Characteristic Curve)
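The formulas above are short enough to compute by hand. The sketch below implements precision, recall, and the ranking definition of AUC directly, using invented counts and scores; it is meant to illustrate the definitions rather than replace the scikit-learn functions used earlier.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

def auc_by_ranking(labels, scores):
    """AUC as P(score of random positive > score of random negative); ties count half."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Hypothetical counts and scores.
precision, recall = precision_recall(tp=2, fp=1, fn=1)
auc_value = auc_by_ranking([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(precision, recall, auc_value)
```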
For more information, you can check out TensorFlow. There are many other use cases in the insurance domain where AI/ML can be applied.
About the Author
Bhagvan Kommadi is the founder of Quantica Computacao and has around 20 years' experience in the industry, ranging from large-scale enterprise development to helping incubate software product start-ups. He holds a Master's in Industrial Systems Engineering from Georgia Institute of Technology (1997) and a Bachelor's in Aerospace Engineering from the Indian Institute of Technology, Madras (1993). He is a member of the IFX Forum, Oracle JCP, and a participant in the Java Community Process. He is a member of the MIT Technology Review Global Panel. He has reviewed the Manning book "Machine Learning with TensorFlow" and is the author of the Packt Publishing book "Hands-On Data Structures and Algorithms with Go". He is currently the Director of Product Engineering at ValueMomentum.