Problem Statement
A housing finance company wants to automate its loan eligibility process based on the customer details provided in the online application form.
About the Data Set
Loan_ID: Unique Loan ID
Gender: Male/ Female
Married: Applicant married (Y/N)
Dependents: Number of dependents
Education: Applicant Education (Graduate/ Under Graduate)
Self_Employed: Self employed (Y/N)
ApplicantIncome: Applicant income
CoapplicantIncome: Coapplicant income
LoanAmount: Loan amount in thousands
Loan_Amount_Term: Term of loan in months
Credit_History: Credit history meets guidelines
Property_Area: Urban/ Semi Urban/ Rural
Loan_Status: (Output Variable) Loan approved (Y/N)
Import the required libraries
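A minimal import block, assuming the standard pandas/Matplotlib/seaborn stack used throughout the rest of this post (the original imports are not shown):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```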

Load the data and verify its shape
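A sketch of the loading step; the file name loan_data.csv and the DataFrame name df are assumptions:

```python
# 'loan_data.csv' is an assumed file name; point it at your copy of the dataset.
df = pd.read_csv('loan_data.csv')
print(df.shape)  # (rows, columns)
df.head()
```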

Data Preprocessing
1.1 Analysis of Categorical Independent Variables vs. the Target Variable
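The plots behind the observations below can be produced roughly like this (a sketch; the original plotting code is not shown):

```python
cat_cols = ['Gender', 'Married', 'Dependents', 'Education',
            'Self_Employed', 'Credit_History', 'Property_Area']
for col in cat_cols:
    # A row-normalized crosstab shows the approval proportion within each category.
    pd.crosstab(df[col], df['Loan_Status'], normalize='index').plot(
        kind='bar', stacked=True, figsize=(5, 3), title=f'{col} vs Loan_Status')
    plt.show()
```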


- The proportion of married applicants is higher for approved loans.
- The distribution of applicants with 1 or 3+ dependents is similar across both categories of Loan_Status.
- There is nothing significant we can infer from the Self_Employed vs Loan_Status plot.
- People with a credit history of 1 seem more likely to get their loans approved.
- The proportion of approved loans is higher in semi-urban areas than in rural or urban areas.
1.2 Remove Insignificant Variables
The Loan_ID column is just a unique serial number for each applicant and carries no predictive information, so we drop it.
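```python
# Loan_ID is a unique identifier with no predictive value, so drop it.
df = df.drop(columns=['Loan_ID'])
```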

1.3 Missing Value Treatment

We have null values in a few of the variables. We will look at the value counts of these variables and decide on the best way to fill the nulls.
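A quick sketch of the inspection step:

```python
# Count nulls per column, then inspect the value counts of an affected variable.
print(df.isnull().sum())
print(df['Loan_Amount_Term'].value_counts())
```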

Looking at Loan_Amount_Term, a numerical variable, the value 360 occurs most often, so we replace the missing values in this variable with its mode.
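A sketch of the mode imputation; extending it to the other categorical columns with nulls is our assumption, not something stated above:

```python
# 360 is by far the most frequent term, so fill the missing terms with the mode.
df['Loan_Amount_Term'] = df['Loan_Amount_Term'].fillna(df['Loan_Amount_Term'].mode()[0])

# (Assumption) the remaining categorical columns with nulls are mode-filled the same way.
for col in ['Gender', 'Married', 'Dependents', 'Self_Employed', 'Credit_History']:
    df[col] = df[col].fillna(df[col].mode()[0])
```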

For LoanAmount, we use the median to fill the null values: the variable has outliers, so the mean, which is highly affected by outliers, is not the right choice.
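```python
# LoanAmount has outliers, so the median is a more robust fill value than the mean.
df['LoanAmount'] = df['LoanAmount'].fillna(df['LoanAmount'].median())
```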

1.4 Outlier Treatment




The variables are not normally distributed. Because of these outliers, the bulk of the LoanAmount data sits on the left and the right tail is longer; this is right skewness. One way to remove the skewness is a log transformation: it barely affects smaller values but shrinks larger ones, giving a distribution closer to normal. Let's visualize the effect of the log transformation.
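A sketch of the transformation and the before/after histograms, assuming NumPy and Matplotlib are imported as above:

```python
# Log-transform LoanAmount and compare the distributions before and after.
df['LoanAmount_log'] = np.log(df['LoanAmount'])

fig, axes = plt.subplots(1, 2, figsize=(10, 3))
df['LoanAmount'].hist(bins=20, ax=axes[0])
axes[0].set_title('LoanAmount')
df['LoanAmount_log'].hist(bins=20, ax=axes[1])
axes[1].set_title('log(LoanAmount)')
plt.show()
```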
We assume that some applicants have a low income but a strong CoapplicantIncome, so a good idea is to combine them into a TotalIncome column and drop the CoapplicantIncome variable, as sketched below.
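```python
# Combine the two income columns, then drop CoapplicantIncome as described above.
df['TotalIncome'] = df['ApplicantIncome'] + df['CoapplicantIncome']
df = df.drop(columns=['CoapplicantIncome'])
```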



1.5 Encode the Categorical Variables
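One possible encoding, using scikit-learn's LabelEncoder (the post does not say which encoder was used, so this is an assumption):

```python
from sklearn.preprocessing import LabelEncoder

# Label-encode every remaining object (string) column.
le = LabelEncoder()
for col in df.select_dtypes(include='object').columns:
    df[col] = le.fit_transform(df[col].astype(str))
```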

1.6 Correlation Between the Numerical Variables
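A typical way to draw the correlation heatmap with seaborn (a sketch; the original styling is not shown):

```python
plt.figure(figsize=(8, 6))
sns.heatmap(df.corr(), annot=True, fmt='.2f', cmap='BuPu')
plt.show()
```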

We see that the most correlated pairs are ApplicantIncome with LoanAmount, and Credit_History with Loan_Status.
Create a generalized function to plot the ROC curve for the test set.
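A sketch of such a function; the name plot_roc and its signature are our choice, not the original code:

```python
from sklearn.metrics import roc_curve, roc_auc_score

def plot_roc(model, X_test, y_test, label):
    """Plot the ROC curve of a fitted classifier on the test set."""
    y_prob = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_prob)
    auc = roc_auc_score(y_test, y_prob)
    plt.plot(fpr, tpr, label=f'{label} (AUC = {auc:.2f})')
    plt.plot([0, 1], [0, 1], linestyle='--', color='grey')  # chance line
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.legend()
```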

Create the input and target variables
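```python
# Loan_Status is the target; everything else is a feature.
X = df.drop(columns=['Loan_Status'])
y = df['Loan_Status']
```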

1.7 Distribution of the Dependent Variable
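A quick look at the class counts (a sketch):

```python
print(y.value_counts())
y.value_counts().plot(kind='bar', title='Loan_Status distribution')
plt.show()
```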

We can see from the plot that the data is imbalanced, so we will use SMOTE for oversampling.
SMOTE for Balancing Data
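A sketch using imbalanced-learn; the random_state is an assumption:

```python
# Requires the imbalanced-learn package (pip install imbalanced-learn).
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X, y)
print(pd.Series(y_res).value_counts())  # both classes now have equal counts
```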



Train-Test Split
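A sketch of the split; the 70/30 ratio and random_state are assumptions, as the original values are not shown:

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_res, y_res, test_size=0.3, random_state=42)
```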

Logistic Regression
Build a full logistic regression model on the training dataset
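A sketch of the fit and evaluation; the solver settings and the reuse of plot_roc from above are our assumptions:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))

plot_roc(logreg, X_test, y_test, 'Logistic Regression')
plt.show()
```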



Interpretation: Using logistic regression, we get 72% accuracy.
K Nearest Neighbors (KNN)
Build a KNN model on the training dataset using Euclidean distance
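A sketch of the KNN fit; metric='euclidean' matches the text, while n_neighbors=5 is an assumed value:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred_knn))

plot_roc(knn, X_test, y_test, 'KNN')
plt.show()
```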





Interpretation: Using KNN, we get 78% accuracy.
Conclusion
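To compare the two models visually, both ROC curves can be overlaid (a sketch reusing the plot_roc function from above):

```python
# Overlay both ROC curves for a side-by-side comparison.
plot_roc(logreg, X_test, y_test, 'Logistic Regression')
plot_roc(knn, X_test, y_test, 'KNN')
plt.show()
```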

Interpretation: We can see from the results that the KNN model gives better accuracy and a higher AUC score than the logistic regression model. We can use KNN for deployment.
Thanks for reading.
Follow me on Medium: Bhanuprakash – Medium