Lending Club allows investors to invest in peer-to-peer loans. The amount invested in each loan varies but can be as small as $25. In other words, with a $1,000 investment, an investor can easily build a portfolio of 40 loans. A well-built portfolio can effectively balance returns and risks.
As an investor myself, I started investing in Lending Club loans in 2015. I enjoyed a healthy return of ~10% at that time, but recently the return has ridden a roller coaster and dropped to ~2.5%. Hearing similar horror stories from people around me, I decided to do a deep dive. The project is purely for exploration and learning purposes. It would be great if it helps me understand more about the driving factors of defaults and late payments; even better if it helps identify a more effective methodology for investing. I would like to share the process and findings here for everyone’s interest.
Data Cleanup and Feature Engineering:
To start with, I collected details on more than 800K loans from 2007 to 2015. Right out of the box, I noted that many data features are categorical in nature. Take home ownership types and credit purposes, for example: they are qualitative features with valuable information. They need to be converted carefully so that the key information is kept without negatively impacting the modeling. In addition, borrowers' aggregated data, such as annual income or debt balance, do not provide a full picture of the loan when looked at alone. Combining features and converting them into ratios, such as an asset-to-debt ratio, more accurately reflects the specific condition of the loan.
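The two ideas above, encoding categorical features and deriving ratios, can be sketched with pandas. This is a minimal illustration and not the exact pipeline used in this project: the column names (`home_ownership`, `purpose`, `annual_inc`, `total_debt`), the toy rows, and the choice of a debt-to-income ratio are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only; the column names and values are hypothetical.
loans = pd.DataFrame({
    "home_ownership": ["RENT", "OWN", "MORTGAGE", "RENT"],
    "purpose": ["credit_card", "car", "debt_consolidation", "car"],
    "annual_inc": [40000.0, 85000.0, 60000.0, 0.0],
    "total_debt": [10000.0, 17000.0, 30000.0, 5000.0],
})

def engineer_features(df):
    """One-hot encode the categorical columns and add a ratio feature."""
    # Each category becomes its own 0/1 column, so no artificial ordering
    # is imposed on values such as RENT / OWN / MORTGAGE.
    out = pd.get_dummies(df, columns=["home_ownership", "purpose"], dtype=int)
    # Treat zero income as missing so the ratio is NaN instead of infinity.
    income = out["annual_inc"].replace(0, np.nan)
    out["debt_to_income"] = out["total_debt"] / income
    return out

features = engineer_features(loans)
print(features.head())
```

One-hot encoding keeps every category's information at the cost of extra columns; for features with many levels, grouping rare categories first may be worth considering.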
I started with Grid Search for model selection. The models compared include Decision Tree, Logistic Regression, SVC, Bagging Classifier, and others. Measured by AUC, the Random Forest model was identified as the best baseline model. I then tuned the Random Forest model further for best performance.
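A comparison like the one described above could look roughly like the following scikit-learn sketch. It runs on synthetic data from `make_classification`, not on the Lending Club data, and the candidate list, parameter grids, and the helper name `compare_models` are assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the loan data (assumption for illustration).
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5,
    weights=[0.8, 0.2], random_state=0,
)

# Each candidate pairs an estimator with a small, hypothetical parameter grid.
candidates = {
    "decision_tree": (DecisionTreeClassifier(random_state=0), {"max_depth": [3, 5]}),
    "logistic": (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0]}),
    "random_forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [50], "max_depth": [5, None]},
    ),
}

def compare_models(X, y, candidates, cv=3):
    """Grid-search each candidate and record its best cross-validated AUC."""
    results = {}
    for name, (estimator, grid) in candidates.items():
        search = GridSearchCV(estimator, grid, scoring="roc_auc", cv=cv)
        search.fit(X, y)
        results[name] = (search.best_score_, search.best_params_)
    return results

results = compare_models(X, y, candidates)
best_name = max(results, key=lambda name: results[name][0])
for name, (auc, params) in results.items():
    print(f"{name}: AUC={auc:.3f} params={params}")
print("best baseline:", best_name)
```

Scoring by `roc_auc` rather than accuracy matters here because bad loans are the minority class, so a model that predicts "good" for everything would still look accurate.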
Conclusion and Findings:
I plotted the confusion matrix for visualization. It turns out that the model correctly identifies 72% of bad loans (loans with negative outcomes, either defaulting or falling behind on payments), whereas the remaining 28% of bad loans go undetected. However, the model tends to be conservative, predicting a negative outcome for some loans that actually turn out fine.
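To make the two observations above concrete: the share of bad loans caught is the recall on the "bad" class, and the conservative behavior shows up as false positives. The sketch below uses a handful of made-up labels, not the project's actual predictions, with 1 standing for a bad loan.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 1 = bad loan, 0 = good loan.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# With labels=[0, 1], ravel() returns the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

recall = tp / (tp + fn)                # share of bad loans the model catches
missed = fn / (tp + fn)                # share of bad loans left undetected
false_positive_rate = fp / (fp + tn)   # good loans wrongly flagged as bad

print(f"recall={recall:.2f} missed={missed:.2f} fpr={false_positive_rate:.2f}")
```

For an investor, a false positive means skipping a loan that would have paid, while a false negative means funding a loan that goes bad, so the two errors carry different costs.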
Overall, the top 3 drivers of defaults are:
- Interest Rate
- Inquiries of Last 6 Months
- Loan Grade
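One common way to rank drivers with a Random Forest is its impurity-based feature importance. Whether the list above was produced this way is an assumption. The sketch below runs on synthetic data, and the feature names are hypothetical labels attached to random columns, so the ranking it prints says nothing about real loans.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical names attached to synthetic columns; the resulting order is
# NOT a finding about Lending Club data.
feature_names = ["int_rate", "inq_last_6mths", "grade", "annual_inc", "dti"]
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0,
)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# feature_importances_ sums to 1; larger values mean the feature contributed
# more to impurity reduction across the trees.
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
top3 = importances.head(3)
print(top3)
```

Impurity-based importance can be inflated for numeric or high-cardinality features, so cross-checking with `sklearn.inspection.permutation_importance` on held-out data is a reasonable precaution.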
Recommended Next Steps:
- Continue to incorporate new data to refine the model
- Monitor how loan behavior changes over time, including during economic crises, to help evaluate the stability and adaptability of the model.