<?xml version="1.0" encoding="utf-8"?><?xml-stylesheet type="text/xml" href="https://travishjames.github.io/feed.xslt.xml"?><feed xmlns="http://www.w3.org/2005/Atom"><generator uri="http://jekyllrb.com" version="3.3.0">Jekyll</generator><link href="https://travishjames.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://travishjames.github.io/" rel="alternate" type="text/html" /><updated>2016-11-21T03:04:08+00:00</updated><id>https://travishjames.github.io//</id><title type="html">Travis James</title><subtitle>Data Scientist</subtitle><entry><title type="html">Predicting Loan Application Decision and Grade Using Lending Club Data</title><link href="https://travishjames.github.io/Metis-Blog-McNulty/" rel="alternate" type="text/html" title="Predicting Loan Application Decision and Grade Using Lending Club Data" /><published>2016-10-28T00:00:00+00:00</published><updated>2016-10-28T00:00:00+00:00</updated><id>https://travishjames.github.io/Metis-Blog-McNulty</id><content type="html" xml:base="https://travishjames.github.io/Metis-Blog-McNulty/">&lt;p&gt;Time truly flies when you’re learning data science! I wish I could say that is always synonymous with having fun, but the bootcamp has proven quite challenging at times. Either way, six weeks are in the books, and we’ve already reached the halfway point in the course. We’ve really started to hit a stride with the amount of material we’re covering on a daily basis, and actually absorbing the lectures has seemed easier as time has gone on. With that being said, our last assignment, Project McNulty, was a step up in terms of difficulty compared to the previous two projects. Not only were we required to build a classification model of our choosing, but we had to turn it into a web app and incorporate an interactive visualization as well. Luckily this one was a group project, so there were four of us to distribute the workload between. 
After much deliberation, we decided to address a classification problem surrounding Lending Club loan application data. Namely, we set out to predict whether someone, given their credit history and demographic information, would be accepted for a loan. If they were accepted, we also wanted to predict the grade of loan they should expect to receive. As peer-to-peer lending has become a useful instrument for many in securing financing, we thought it would be both interesting and helpful for potential applicants to see whether they should expect to be accepted for their ideal loan.&lt;/p&gt;

&lt;p&gt;The first step in this process was building our predictive model. This turned out to be pretty straightforward, given that classification shares many fundamentals with continuous-variable predictive modeling. The main difference is that the target variables (in this case, both application decision and grade) are categorical; most of the tools and techniques, however, are essentially the same. After organizing and cleaning the data, we were ready to begin testing models. After trying a number of different techniques, including support vector machines, Naive Bayes, and logistic regression, we settled on an extreme gradient boosting model for both classification tasks. The first model predicted application decision with 94% accuracy on a test set, with 82% precision and 80% recall. Our second model predicted loan grade with an accuracy of 56.7%, which is not nearly as high as the application decision model, but still significantly better than random guessing (about 20% given the five grade outcomes).&lt;/p&gt;
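&lt;p&gt;For illustration, the train-and-score loop can be sketched on synthetic data, with scikit-learn’s GradientBoostingClassifier standing in for the XGBoost model we actually used (the features below are made up, not real Lending Club fields):&lt;/p&gt;

```python
# Minimal sketch of the train/score loop on synthetic data; scikit-learn's
# GradientBoostingClassifier stands in for the XGBoost model we actually used,
# and the features are synthetic rather than real Lending Club fields.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

# The same three metrics we reported for the application-decision model
acc = accuracy_score(y_test, pred)
prec = precision_score(y_test, pred)
rec = recall_score(y_test, pred)
```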

&lt;p&gt;The next step was to build a web app to make our models interactive for a potential Lending Club applicant. We used Flask, and built a functioning web app that accepts both user-typed values and drop-down categorical selections. The user input was then fed to our trained models, which returned a response on whether the applicant should expect to be accepted. If so, the app would also list their expected loan grade. After hours of trial-and-error tweaking, getting a final product that functioned effectively and provided meaningful results was a really gratifying experience.&lt;/p&gt;
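&lt;p&gt;The request flow of the app can be sketched as follows; the form field name and the accept/reject rule here are hypothetical stand-ins for the real inputs and trained model:&lt;/p&gt;

```python
# Minimal sketch of the Flask request flow; the field name and the
# accept/reject rule are hypothetical stand-ins for the real form
# fields and trained model.
from flask import Flask, request

app = Flask(__name__)

def predict_decision(features):
    # Stand-in for calling model.predict() on the user's inputs
    return "Accepted" if features["annual_income"] > 40000 else "Rejected"

@app.route("/predict", methods=["POST"])
def predict():
    features = {"annual_income": float(request.form.get("annual_income", 0))}
    return predict_decision(features)
```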

&lt;p&gt;&lt;img src=&quot;/images/accepted_loan.png&quot; alt=&quot;Loan App - Accepted&quot; width=&quot;700&quot; height=&quot;500&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/rejected_loan.png&quot; alt=&quot;Loan App - Rejected&quot; width=&quot;700&quot; height=&quot;500&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The last step was to incorporate an interactive visualization using the d3.js library. We ended up creating a map of state-level loan statistics. Specifically, we showed the percentage of applicants from each state that were accepted, their average debt-to-income ratio, and the average amount they applied for. We added a hover feature so that users could simply move their mouse pointer over a state to see a window with all of these statistics. Lastly, we color-coded the states in shades of green according to the percentage of accepted applicants: the higher the acceptance rate, the darker the shade of green. This gave users of the app a quick visual cue for which states are hot spots for Lending Club loans.&lt;/p&gt;
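&lt;p&gt;The per-state statistics feeding the map boil down to a groupby aggregation; a minimal sketch on toy rows (column names are illustrative, not the actual Lending Club schema):&lt;/p&gt;

```python
import pandas as pd

# Toy rows; column names are illustrative, not the actual Lending Club schema
apps = pd.DataFrame({
    "state": ["CA", "CA", "NY", "NY"],
    "accepted": [1, 0, 1, 1],
    "dti": [18.0, 25.0, 12.0, 20.0],
    "loan_amnt": [10000, 5000, 15000, 8000],
})

# One row per state: acceptance rate, average debt-to-income ratio,
# and average amount applied for
state_stats = apps.groupby("state").agg(
    pct_accepted=("accepted", "mean"),
    avg_dti=("dti", "mean"),
    avg_amount=("loan_amnt", "mean"),
).reset_index()

# state_stats can then be serialized to JSON/CSV and joined to the d3.js map
```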

&lt;p&gt;&lt;img src=&quot;/images/d3_us_map.png&quot; alt=&quot;U.S. d3.js Loan Statistics Map&quot; width=&quot;700&quot; height=&quot;500&quot; /&gt;&lt;/p&gt;

&lt;p&gt;While McNulty presented a new set of challenges, especially in terms of the web development components of flask and d3.js, overall it was a very enlightening experience into the process of designing an interactive product. Given that our classification models performed significantly better than random guessing, and we were able to build an app that provided useful insights into the world of peer to peer loan applications, I’m satisfied with the end result of McNulty. With that being said, I’m excited to see what’s in store for the coming weeks, and am looking forward to the new challenges, techniques and tools that lie ahead.&lt;/p&gt;</content><summary type="html">Time truly flies when you’re learning data science! I wish I could say that is always synonymous with having fun, but the bootcamp has proven quite challenging at times. Either way, six weeks are in the books, and we’ve already reached the halfway point in the course. We’ve really started to hit a stride with the amount of material we’re covering on a daily basis, and actually absorbing the lectures has seemed easier as time has gone on. With that being said, our last assignment, Project McNulty, was a step up in terms of difficulty compared to the previous two projects. Not only were we required to build a classification model of our choosing, but we had to turn it into a web app and incorporate an interactive visualization as well. Luckily this one was a group project, so there were four of us to distribute the workload between. After much deliberation, we decided to address a classification problem surrounding Lending Club loan application data. Namely, we set out to predict if someone, given their credit history and demographic information, would be accepted for a loan. If they were to be accepted, we also wanted to predict the grade of loan they should expect to receive. 
As peer to peer lending has become a useful instrument for many in securing financing, we thought it would be both interesting and helpful for potential applicants to see if they should expect to be accepted for their ideal loan.</summary></entry><entry><title type="html">Predicting Movie Review Sentiment Using IMDB Web Data</title><link href="https://travishjames.github.io/Metis-Blog-Luther/" rel="alternate" type="text/html" title="Predicting Movie Review Sentiment Using IMDB Web Data" /><published>2016-10-18T00:00:00+00:00</published><updated>2016-10-18T00:00:00+00:00</updated><id>https://travishjames.github.io/Metis-Blog-Luther</id><content type="html" xml:base="https://travishjames.github.io/Metis-Blog-Luther/">&lt;p&gt;It’s been an arduous, if not exciting last three weeks of bootcamp work. Between exercises, projects, networking and trying to absorb all of the material, the course has shaped up to be a real juggling act of effort and time management. While the entire experience has proved enlightening thus far, the whole reason for signing up for the bootcamp was to gain tangible, applicable skills in data science. I certainly haven’t been disappointed yet, and the fundamentals of data science that we’ve been learning were showcased first hand in our most recent assignment: Project Luther.&lt;/p&gt;

&lt;p&gt;The goal of the project was fairly simple. Using online data from any number of film and cinema websites, we were asked to build a predictive model for a continuous target variable. The variable we chose was not important, so long as it was numerical and (somewhat) continuous. I decided to predict movie review sentiment, or how the public feels about the quality of a given film. As a proxy for review sentiment, I chose IMDB user score, since most reviews are left by casual viewers rather than professional critics, giving a more representative view of public sentiment. One of the nice things about the IMDB website is that each movie page has a number of key features that could be used as regressors in my analysis. Namely, these include domestic total gross revenue, budget, primary and secondary genre, year of release, month of release, MPAA rating, runtime, number of IMDB reviews, number of Oscar wins, and Metacritic score.&lt;/p&gt;

&lt;p&gt;The business case for the project was simple to come up with. A more positive public sentiment for a film leads to more tickets sold at the box office, more rentals and purchases of the film, the possibility of franchising and signing advertising deals, and a stronger reputation for the film’s actors and director. The real challenge was making the model as accurate as possible without overfitting. After scraping all of the relevant features from the IMDB website, I had data on 720 films with which to build my model and test which regressors were statistically significant. I started out by visualizing the data, and noticed that a couple of features seemed to have a logarithmic relationship with the target. Specifically, number of IMDB user reviews and domestic total gross revenue appear to have a nonlinear relationship with IMDB user score.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/gross_rev_log.png&quot; alt=&quot;Gross Revenue Log Plot&quot; width=&quot;500&quot; height=&quot;300&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/imdb_reviews_log.png&quot; alt=&quot;IMDB Reviews Log Plot&quot; width=&quot;500&quot; height=&quot;300&quot; /&gt;&lt;/p&gt;
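&lt;p&gt;Incorporating these nonlinearities amounted to log-transforming the two skewed features before fitting; a minimal sketch on toy values (column names are illustrative, not the actual scraped schema):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Toy values; column names are illustrative, not the actual scraped schema
films = pd.DataFrame({
    "gross_revenue": [1.0e6, 5.0e7, 3.0e8],
    "num_reviews": [120, 4500, 90000],
})

# Log-transform the right-skewed features so their relationship with
# IMDB user score is closer to linear
films["log_gross"] = np.log(films["gross_revenue"])
films["log_reviews"] = np.log(films["num_reviews"])
```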

&lt;p&gt;After incorporating these nonlinearities in the model, I ran some naive linear regressions to pick out which variables were significant predictors. All of my numerical feature variables (Metacritic score, runtime, log of number of reviews, log of domestic total gross revenue, Oscar wins, and budget) turned out to be statistically significant regressors at the 10% significance level. I incorporated the categorical features as fixed effects by transforming them into a series of dummy variables, and whittled them down until all were significant at the 5% level.&lt;/p&gt;
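&lt;p&gt;The dummy-variable encoding can be sketched with pandas; the columns here are hypothetical stand-ins for the scraped categorical features:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical columns standing in for the scraped categorical features
films = pd.DataFrame({
    "metacritic": [75, 40, 88],
    "genre": ["Action", "Comedy", "Drama"],
    "mpaa": ["PG-13", "R", "R"],
})

# drop_first=True drops one category per feature to avoid perfect
# collinearity with the intercept (the dummy-variable trap)
X = pd.get_dummies(films, columns=["genre", "mpaa"], drop_first=True)
```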

&lt;p&gt;&lt;img src=&quot;/images/regression_output.png&quot; alt=&quot;Regression Output&quot; width=&quot;500&quot; height=&quot;300&quot; /&gt;&lt;/p&gt;

&lt;p&gt;(Categorical dummy variables were left out of the table above, as they don’t aid in interpreting the numerical model coefficients and are used only to control for fixed effects.)&lt;/p&gt;

&lt;p&gt;One interesting finding is that Metacritic score and runtime both have positive and statistically significant effects on public review sentiment. My ex ante notion was that if a movie is received well by critics, it will bias a user’s review of the film upward. I was also curious whether a longer runtime would have a positive or negative impact on review sentiment. One could argue that a longer runtime makes a viewer more likely to become bored of the film, but it also allows the plot to develop more and build viewer intrigue. We see that the latter effect wins out in the data: favorable critic reviews and longer runtimes appear to positively influence the general public’s opinion of a movie. It’s also interesting to note that budget and number of Oscar wins negatively influence review sentiment. This could be for a number of reasons, but the most likely explanation I can offer is that action films and Hollywood blockbusters, which command high budgets and win awards for their visual effects, might not resonate as much with viewers as other genres. This is purely speculation, and would be interesting to look into further.&lt;/p&gt;

&lt;p&gt;After completing the feature selection process, the next step was to run several different models and decide which performed best. I started by splitting the data into a 70% training set and a 30% test set. I used the training set to fit the models, and the test set to predict user scores from the test features and compare them to the actual test-set target values. The models I chose to run were OLS, ridge, lasso and elastic net regressions, as well as random forest and gradient boosting models. After tuning the hyperparameters of the parametric regressions using grid-search cross-validation, I tested their accuracies and found that ridge regression with a penalty coefficient of 1 performed best. Specifically, it had a test-set R-squared of 0.687, and thus explains a sizable majority of the variation in IMDB user score.&lt;/p&gt;
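&lt;p&gt;The split-tune-score workflow can be sketched with scikit-learn on synthetic data (the feature matrix and alpha grid below are illustrative, not the actual film data):&lt;/p&gt;

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the film feature matrix and user scores
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -0.5, 0.3, 0.0, 2.0]) + rng.normal(scale=0.5, size=200)

# 70/30 train/test split, as in the project
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Grid-search cross-validation over the ridge penalty coefficient
grid = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
grid.fit(X_train, y_train)

# Test-set R-squared of the best model found
r2 = grid.score(X_test, y_test)
```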

&lt;p&gt;Taking a closer look at the ridge regression’s performance, it seems to perform fairly well for highly rated movies, but less well for more poorly reviewed films. As we can see in the first plot below, the actual and predicted user scores hug the 45-degree line reasonably closely, but the predicted values are a bit inflated for lower-scoring movies. This is apparent in the second plot below, which shows that the distribution of the residuals is slightly left-skewed, implying some degree of upward inflation in the predicted scores.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/predicted_vs_actual.png&quot; alt=&quot;Predicted vs Actual Scores&quot; width=&quot;700&quot; height=&quot;500&quot; /&gt;&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/images/resid_hist.png&quot; alt=&quot;Residuals Histogram&quot; width=&quot;500&quot; height=&quot;300&quot; /&gt;&lt;/p&gt;

&lt;p&gt;With that being said, there are some limitations to the model as specified. As I just mentioned, it tends to overpredict scores for poorer-performing films. This could be a symptom of the fact that I scraped films by genre in descending order of score from the IMDB website. A remedy would be to scrape films with a wider range of user scores, allowing for more robust training of the model. This would also increase the sample size, which wasn’t small by any means, but could have been larger to improve performance. There may also be a simultaneity issue, since the number of IMDB user reviews and the user scores could very well be self-reinforcing. It is therefore difficult to tease out the direction of causality between these two variables, and if the simultaneity exists it could bias all of the coefficient estimates in the model. One remedy would be to use an instrumental variable for number of user reviews, to isolate the variation in this feature that influences user score. Either way, these are speculative considerations, and it remains to be seen whether these potential improvements would increase the predictive capability of an already well-performing model.&lt;/p&gt;

&lt;p&gt;Some general thoughts on Project Luther and the bootcamp as a whole thus far:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The first individual project was a bit of a reality check, but was a very rewarding experience.&lt;/li&gt;
  &lt;li&gt;Presenting your own piece of work is fun and beneficial. The feedback you get on an individual project is very constructive, and should be welcomed at any point in time.&lt;/li&gt;
  &lt;li&gt;The pace of the program is picking up and the amount of material we’ve already covered is piling up. Metis has really started to live up to its title of being a “bootcamp”.&lt;/li&gt;
  &lt;li&gt;When working on a project with a lot of intermediate steps, it is helpful to assign yourself mini deadlines.&lt;/li&gt;
  &lt;li&gt;Trying to get work done on weekends is tough. Finishing my work during the week has allowed me to explore San Francisco and really destress on Saturday and Sunday. This is a practice I hope to keep up, as the city has so much to offer and explore.&lt;/li&gt;
  &lt;li&gt;The NBA season is right around the corner, so let the fanfare and friendly trash talk begin!&lt;/li&gt;
&lt;/ul&gt;</content><summary type="html">It’s been an arduous, if not exciting last three weeks of bootcamp work. Between exercises, projects, networking and trying to absorb all of the material, the course has shaped up to be a real juggling act of effort and time management. While the entire experience has proved enlightening thus far, the whole reason for signing up for the bootcamp was to gain tangible, applicable skills in data science. I certainly haven’t been disappointed yet, and the fundamentals of data science that we’ve been learning were showcased first hand in our most recent assignment: Project Luther.</summary></entry><entry><title type="html">Week 1 at Metis: Blazing Saddles</title><link href="https://travishjames.github.io/Metis-Blog-Week1/" rel="alternate" type="text/html" title="Week 1 at Metis&amp;#58; Blazing Saddles" /><published>2016-09-26T00:00:00+00:00</published><updated>2016-09-26T00:00:00+00:00</updated><id>https://travishjames.github.io/Metis-Blog-Week1</id><content type="html" xml:base="https://travishjames.github.io/Metis-Blog-Week1/">&lt;p&gt;The first week of the boot camp is in the books, and it’s off to a scorching start. While 
the pacing was somewhat of an adjustment, after finishing a year’s worth of 
intense study for my master’s degree it felt a little like riding a bike. The first day 
was complete with introductions, administrative duties, and even a preliminary lecture on 
Pandas. While the lecture was effective and broad, it was different from a traditional 
academic course. Specifically, I really enjoyed how applied and practical the instruction 
was, and this aspect of the boot camp is exactly what I signed up for.&lt;/p&gt;

&lt;p&gt;At the end of the day we were divided into groups of four and given our first project: Project Benson. The premise was more of a lack of premise. We were given turnstile data from the New York City Metropolitan Transportation Authority and asked to act as a data science consultancy. Our fictional client needed help optimizing some feature of their business using this data, and we needed to come up with a recommendation for them through a detailed exploratory data analysis. The point of the project was not only to give us practice exploring and visualizing data, but also to let us express some creativity while gaining valuable group-work experience.&lt;/p&gt;

&lt;p&gt;After throwing a few ideas around, we decided that our fictitious client would be called BSKR (vowels purposely excluded). The company helps professional buskers (street performers) maximize their earning potential through a scheduling platform while increasing viewership for its network as a whole. Essentially, we would find the busiest stations in terms of foot traffic and allocate BSKR’s clients equitably amongst them. Performers would thus be placed in strategic areas to maximize their potential earnings without stepping on each other’s toes. Presentations for the project were held on Friday, so we had little time to waste.&lt;/p&gt;
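&lt;p&gt;Finding the busiest stations comes down to differencing the MTA’s cumulative turnstile counters and summing by station; a minimal sketch on made-up readings:&lt;/p&gt;

```python
import pandas as pd

# Toy slice of MTA turnstile data: the counters are cumulative, so traffic
# is the difference between consecutive readings per station. Station names
# and counts here are made up.
turnstile = pd.DataFrame({
    "station": ["TIMES SQ", "TIMES SQ", "WALL ST", "WALL ST"],
    "entries": [1000, 1800, 500, 650],
})

# Difference consecutive cumulative counts within each station
turnstile["traffic"] = turnstile.groupby("station")["entries"].diff()

# Rank stations by total foot traffic, busiest first
busiest = turnstile.groupby("station")["traffic"].sum().sort_values(ascending=False)
```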

&lt;p&gt;After Monday, the schedule for the next three days was more or less as follows:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Pair programming challenges for about an hour to start the day.&lt;/li&gt;
  &lt;li&gt;Lecture until lunch.&lt;/li&gt;
  &lt;li&gt;An hour and a half break for lunch (definitely a nice time to catch up on work, but also to socialize and get to know one another).&lt;/li&gt;
  &lt;li&gt;After lunch, there would usually be some additional supplementary lectures for an hour or two.&lt;/li&gt;
  &lt;li&gt;The last bit of the day was reserved for working on Benson.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once Friday rolled around we still had a pair programming challenge, but were given the rest of the day to work on the finishing touches for our presentations. After lunch we presented to the cohort, and were asked questions and given feedback (both positive and negative) on our project as a whole. After writing up our final proposal, we submitted all of our materials to GitHub and called it a week. Benson is officially in the books as the first entry into our Metis portfolio of projects, with four more to come throughout the next 11 weeks.&lt;/p&gt;

&lt;p&gt;Some general thoughts on week 1:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Fast paced, but effective nonetheless.&lt;/li&gt;
  &lt;li&gt;Loving the applied nature of the program thus far. We ended up really getting our hands dirty with data munging and exploring a fairly messy data set, which is great experience.&lt;/li&gt;
  &lt;li&gt;Experience presenting is invaluable. A good data science project means little to nothing if you don’t know how to effectively communicate your findings!&lt;/li&gt;
  &lt;li&gt;The cohort is from a diverse set of backgrounds, including Physics, Economics, Biology and Health Care, among others. It’s great getting to know so many interesting people with such varied perspectives.&lt;/li&gt;
  &lt;li&gt;The food in SF is great! The coffee, too.&lt;/li&gt;
  &lt;li&gt;I’m excited for basketball season to start up. The group has quite a few NBA fans, so the communal fandom and banter should be fun.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While the short deadline for Benson and the amount of material made the first week a bit daunting, I have genuinely positive reviews of the experience thus far and am looking forward to what the next 11 weeks have in store.&lt;/p&gt;</content><summary type="html">The first week of the boot camp is in the books, and it’s off to a scorching start. While 
the pacing was somewhat of an adjustment, after finishing a year’s worth of 
intense study for my master’s degree it felt a little like riding a bike. The first day 
was complete with introductions, administrative duties, and even a preliminary lecture on 
Pandas. While the lecture was effective and broad, it was different from a traditional 
academic course. Specifically, I really enjoyed how applied and practical the instruction 
was, and this aspect of the boot camp is exactly what I signed up for.</summary></entry></feed>
