
Loan prediction system using ML models


Problem Statement

A housing finance company wants to automate its loan eligibility process, based on the customer details provided when filling in the online application form.

About the Data Set

Loan_ID: Unique Loan ID

Gender: Male/ Female

Married: Applicant married (Y/N)

Dependents: Number of dependents

Education: Applicant Education (Graduate/ Under Graduate)

Self_Employed: Self employed (Y/N)

ApplicantIncome: Applicant income

CoapplicantIncome: Coapplicant income

LoanAmount: Loan amount in thousands

Loan_Amount_Term: Term of loan in months

Credit_History: Credit history meets guidelines

Property_Area: Urban/ Semi Urban/ Rural

Loan_Status: (Output Variable) Loan approved (Y/N)

Import the required libraries

Load and verify shape of the data
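The original post does not name the data file, so the sketch below reads a few hypothetical rows from an inline string laid out with the columns described above. With the real file we would pass its path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical sample rows in the documented column layout (not real records).
SAMPLE_CSV = """Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
LP001,Male,No,0,Graduate,No,5849,0,,360,1,Urban,Y
LP002,Male,Yes,1,Graduate,No,4583,1508,128,360,1,Rural,N
LP003,Female,Yes,0,Graduate,Yes,3000,0,66,360,1,Urban,Y
LP004,Male,Yes,3+,Under Graduate,No,2583,2358,120,,0,Semi Urban,N
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(df.shape)         # (number of rows, number of columns)
print(df.dtypes)        # quick check that numeric columns were parsed as numbers
print(df.isna().sum())  # missing values per column
```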

Data Preprocessing

1.1 Analysis on Categorical Independent Variable vs Target Variable

  1. The proportion of married applicants is higher among approved loans.
  2. The distribution of applicants with 1 or 3+ dependents is similar across both categories of Loan_Status.
  3. Nothing significant can be inferred from the Self_Employed vs Loan_Status plot.
  4. Applicants with a Credit_History of 1 appear more likely to get their loans approved.
  5. The proportion of approved loans is higher in semi-urban areas than in rural or urban areas.
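One way to back observations like these with numbers is a row-normalised cross-tabulation of a categorical variable against Loan_Status. This is a minimal sketch on made-up values, not the post's original plotting code.

```python
import pandas as pd

# Hypothetical values, only to illustrate the calculation.
df = pd.DataFrame({
    "Married": ["Yes", "Yes", "No", "Yes", "No", "No", "Yes", "Yes"],
    "Loan_Status": ["Y", "Y", "N", "Y", "Y", "N", "N", "Y"],
})

# normalize="index" turns each row into proportions, so every category of
# Married shows its own share of approved (Y) and rejected (N) loans.
approval_by_married = pd.crosstab(
    df["Married"], df["Loan_Status"], normalize="index"
)
print(approval_by_married)
```

Calling `approval_by_married.plot(kind="bar", stacked=True)` would give the stacked bar view of the same table.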

1.2 Remove Insignificant Variables

The column Loan_ID contains only the serial number of the application, which carries no information about the outcome and is redundant for further analysis. Thus, we drop the column.
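A minimal sketch of this step, using a two-row hypothetical frame:

```python
import pandas as pd

# Hypothetical rows for illustration.
df = pd.DataFrame({
    "Loan_ID": ["LP001", "LP002"],
    "ApplicantIncome": [5849, 4583],
    "Loan_Status": ["Y", "N"],
})

# Drop the identifier column; every other column is kept unchanged.
df = df.drop(columns=["Loan_ID"])
print(df.columns.tolist())
```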

1.3 Missing Value Treatment

A few of the variables contain null values. We look at the value counts of each of these variables and then decide on the best way to fill the nulls.

Loan_Amount_Term is a numerical variable, but the value 360 occurs most often, so we replace its missing values with the mode of the variable.

For LoanAmount we use the median to fill the null values. The loan amount has outliers, and the mean is strongly affected by outliers, so it would not be a suitable fill value.
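The two imputations described above can be sketched as follows. The values are hypothetical; the single large loan amount stands in for the outliers that make the mean a poor choice.

```python
import numpy as np
import pandas as pd

# Hypothetical values; 700 plays the role of an outlier in LoanAmount.
df = pd.DataFrame({
    "Loan_Amount_Term": [360, 360, np.nan, 180, 360],
    "LoanAmount": [100.0, 120.0, np.nan, 700.0, 130.0],
})

# Inspect how often each term occurs (including missing values).
print(df["Loan_Amount_Term"].value_counts(dropna=False))

term_mode = df["Loan_Amount_Term"].mode()[0]  # most frequent term
loan_median = df["LoanAmount"].median()       # robust to the outlier
loan_mean = df["LoanAmount"].mean()           # pulled upwards by the outlier

df["Loan_Amount_Term"] = df["Loan_Amount_Term"].fillna(term_mode)
df["LoanAmount"] = df["LoanAmount"].fillna(loan_median)

print(f"median = {loan_median}, mean = {loan_mean}")
```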

1.4 Outlier Treatment

The numerical variables are not normally distributed.

Because of these outliers, the bulk of the loan amount data sits on the left and the right tail is longer. This is called right skewness. One way to reduce the skewness is a log transformation: it changes the smaller values only a little but shrinks the larger values considerably, so the result is closer to a normal distribution. Let’s visualize the effect of the log transformation.
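As a numerical companion to the plot, the sketch below compares the sample skewness before and after taking the log. The loan amounts are hypothetical, and the column name `LoanAmount_log` is our own choice rather than something taken from the original code.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed loan amounts (in thousands), all strictly positive.
df = pd.DataFrame({
    "LoanAmount": [66, 95, 110, 120, 128, 141, 158, 267, 349, 700],
})

# The natural log compresses large values far more than small ones.
df["LoanAmount_log"] = np.log(df["LoanAmount"])

skew_before = df["LoanAmount"].skew()
skew_after = df["LoanAmount_log"].skew()
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

`np.log` is undefined at zero, so for a column that can contain zeros `np.log1p` would be the safer variant.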

We assume that some applicants have a low income of their own but a strong CoapplicantIncome, so it makes sense to combine the two into a TotalIncome column and then drop the CoapplicantIncome variable.
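A minimal sketch of the combined column, on hypothetical incomes:

```python
import pandas as pd

# Hypothetical incomes for illustration.
df = pd.DataFrame({
    "ApplicantIncome": [5849, 4583, 3000],
    "CoapplicantIncome": [0.0, 1508.0, 2358.0],
})

# Household income available to service the loan.
df["TotalIncome"] = df["ApplicantIncome"] + df["CoapplicantIncome"]

# CoapplicantIncome is now contained in TotalIncome, so we drop it.
df = df.drop(columns=["CoapplicantIncome"])
print(df)
```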

1.5 Encode the Categorical Variables
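The post does not say which encoding scheme was used. One reasonable option, shown here as an assumption, is to map two-level variables to 0/1 and one-hot encode Property_Area, which has three levels. The category labels follow the data description above.

```python
import pandas as pd

# Hypothetical rows; labels follow the data description in this post.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Education": ["Graduate", "Under Graduate", "Graduate"],
    "Property_Area": ["Urban", "Rural", "Semi Urban"],
    "Loan_Status": ["Y", "N", "Y"],
})

# Two-level variables: explicit 0/1 mapping keeps the meaning of 1 obvious.
binary_maps = {
    "Gender": {"Male": 1, "Female": 0},
    "Education": {"Graduate": 1, "Under Graduate": 0},
    "Loan_Status": {"Y": 1, "N": 0},
}
for column, mapping in binary_maps.items():
    df[column] = df[column].map(mapping)

# Three-level variable: one indicator column per category.
df = pd.get_dummies(df, columns=["Property_Area"], dtype=int)
print(df)
```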

1.6 Correlation between all the numerical variables

We see that the most strongly correlated pairs of variables are ApplicantIncome with LoanAmount, and Credit_History with Loan_Status.
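A correlation matrix like the one discussed here can be computed with `DataFrame.corr`. The numbers below are hypothetical and will not reproduce the correlations of the real data.

```python
import pandas as pd

# Hypothetical, already-encoded numeric columns.
df = pd.DataFrame({
    "ApplicantIncome": [5849, 4583, 3000, 2583, 6000, 9560],
    "LoanAmount": [128, 128, 66, 120, 141, 191],
    "Credit_History": [1, 1, 1, 0, 1, 0],
    "Loan_Status": [1, 0, 1, 0, 1, 0],
})

# Pearson correlation between every pair of columns.
corr = df.corr()
print(corr.round(2))
```

The matrix is commonly drawn as a heatmap, but the table alone is enough to spot the strongest pairs.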

Create a generalized function to plot roc curve for the test set.
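The original function is not shown, so here is one possible version. It assumes a fitted scikit-learn classifier that exposes `predict_proba`; the name `plot_roc` and its `label` and `draw` parameters are our own. The demo at the end uses synthetic data purely to show the call.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve


def plot_roc(model, X_test, y_test, label="model", draw=True):
    """Plot the test-set ROC curve of a fitted classifier and return its AUC."""
    # Probability of the positive class (second column of predict_proba).
    y_score = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_score)
    auc = roc_auc_score(y_test, y_score)

    if draw:
        # Imported here so the AUC can be computed without a plotting backend.
        import matplotlib.pyplot as plt

        plt.plot(fpr, tpr, label=f"{label} (AUC = {auc:.3f})")
        plt.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance line
        plt.xlabel("False positive rate")
        plt.ylabel("True positive rate")
        plt.legend()
        plt.show()
    return auc


# Demo on synthetic data (not the loan data set).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
demo_model = LogisticRegression(max_iter=1000).fit(X_demo[:150], y_demo[:150])
demo_auc = plot_roc(demo_model, X_demo[150:], y_demo[150:], draw=False)
print(f"AUC: {demo_auc:.3f}")
```

Because the function takes the model as an argument, the same call works for both the logistic regression and the KNN model built later.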

Create Input and target variables
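A minimal sketch of the split into inputs and target, on a hypothetical encoded frame:

```python
import pandas as pd

# Hypothetical encoded rows for illustration.
df = pd.DataFrame({
    "Credit_History": [1, 0, 1, 1],
    "TotalIncome": [5849, 6091, 3000, 4941],
    "Loan_Status": [1, 0, 1, 0],
})

X = df.drop(columns=["Loan_Status"])  # input variables
y = df["Loan_Status"]                 # target variable
print(X.shape, y.shape)
```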

1.7 Distribution of Dependent variable

We can see from the plot that the data is imbalanced. We will use SMOTE for oversampling.
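The imbalance can also be read off as class proportions. The 70/30 split below is hypothetical and only illustrates the check.

```python
import pandas as pd

# Hypothetical target column: 7 approved, 3 rejected.
y = pd.Series(["Y"] * 7 + ["N"] * 3, name="Loan_Status")

class_share = y.value_counts(normalize=True)  # fraction of rows per class
print(class_share)
```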

SMOTE for Balancing Data
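SMOTE creates synthetic minority-class rows by interpolating between a minority point and one of its nearest minority neighbours. In practice this is usually done with the `SMOTE` class from the imbalanced-learn package (via `fit_resample`); the post does not show its code, so the block below is a hand-written NumPy sketch of the core idea rather than the original implementation. The function name `smote_oversample` and all data are hypothetical.

```python
import numpy as np


def smote_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority rows by interpolating between neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    k = min(k, len(X_min) - 1)

    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)             # random minority rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))                               # position on the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Hypothetical data: 20 majority rows (label 1) and 6 minority rows (label 0).
rng = np.random.default_rng(42)
X_major = rng.normal(loc=2.0, size=(20, 2))
X_minor = rng.normal(loc=-2.0, size=(6, 2))

X_new = smote_oversample(X_minor, n_new=len(X_major) - len(X_minor))
X_balanced = np.vstack([X_major, X_minor, X_new])
y_balanced = np.array([1] * len(X_major) + [0] * (len(X_minor) + len(X_new)))
print(np.bincount(y_balanced))  # class counts after oversampling
```

Every synthetic row lies on a line segment between two real minority rows, so no value falls outside the range already seen in that class.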

Train-Test Split
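A minimal sketch of the split with scikit-learn. The 70/30 ratio, the `random_state`, and the use of `stratify` are assumptions; the post does not state the settings it used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical balanced data: 20 rows, 2 features.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,     # assumed ratio: 30% of rows held out for testing
    random_state=10,   # fixed seed so the split is reproducible
    stratify=y,        # keep the class proportions equal in both parts
)
print(X_train.shape, X_test.shape)
```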

Logistic Regression

Build a full logistic model on a training dataset
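A sketch of this step using scikit-learn's `LogisticRegression`, which may differ from the library used in the original notebook. It runs on synthetic data, so its accuracy is unrelated to the figure reported for the loan data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared loan features.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# "Full" model: all input variables are used as predictors.
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

log_reg_pred = log_reg.predict(X_test)
log_reg_accuracy = accuracy_score(y_test, log_reg_pred)
print(f"Test accuracy: {log_reg_accuracy:.2f}")
```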

Interpretation: The logistic regression model gives 72% accuracy.

K Nearest Neighbors (KNN)

Build a knn model on a training dataset using Euclidean distance
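A sketch of a KNN classifier with Euclidean distance in scikit-learn. The number of neighbours (5) is an assumption, since the post does not give the value it used, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the prepared loan features.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# n_neighbors=5 is an assumed value; metric="euclidean" matches the text.
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)

knn_pred = knn.predict(X_test)
knn_accuracy = accuracy_score(y_test, knn_pred)
print(f"Test accuracy: {knn_accuracy:.2f}")
```

Euclidean distance is sensitive to feature scale, so variables on large scales such as income can dominate the neighbour search unless the inputs are scaled or transformed first.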

Interpretation: The KNN model gives 78% accuracy.

Conclusion

Interpretation: The results show that the KNN model gives better accuracy and a better AUC score than the logistic regression model, so we can use KNN for deployment.

Thanks for reading.


Follow me on medium : Bhanuprakash – Medium

