---
output:
  html_document:
    theme: readable
---
# Prosper Loans Exploration by Gabor Sar
```{r echo=FALSE, message=FALSE, warning=FALSE}
library(ggplot2)
library(dplyr)
library(GGally)
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
loans <- read.csv('prosperLoanData.csv')
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
# I am going to use the following ggplot theme to rotate the labels
# of plots on the x axis, to prevent long labels covering each other
rotate_x <- theme(axis.text.x = element_text(angle = 90, hjust = 1))
# I am going to use the following ggplot guide to force legends to have 100%
# visibility on scatterplots where I use alpha less than 1 (to avoid
# overplotting).
legend_alpha_1 <- guides(colour = guide_legend(override.aes = list(alpha = 1)))
# a large custom color palette
custom_colors <- c("#89C5DA", "#DA5724", "#74D944", "#CE50CA", "#3F4921",
"#C0717C", "#CBD588", "#5F7FC7", "#673770", "#D3D93E",
"#38333E", "#508578", "#D7C1B1", "#689030", "#AD6F3B",
"#CD9BCD", "#D14285", "#6DDE88", "#652926", "#7FDCC0",
"#C84248", "#8569D5", "#5E738F", "#D1A33D", "#8A7C64",
"#599861")
```
## 1 Abstract
I am going to analyse the loan data from Prosper. I would like to know what properties can describe a loan and what parameters can affect those properties. Is the amount the most important, or the monthly payment? Is the amount affected by the credit score of the borrower? The key goal of this analysis is to find answers to these questions.
## 2 Introduction
Prosper ([an American peer to peer lending marketplace](https://www.prosper.com/about)) provides a daily snapshot of their loan data. There are 113937 loans in the dataset with 81 features. As there are a lot of features, I am going to subset the data to those that I am interested in:
1) loan original amount
2) monthly loan payment
3) term
4) loan origination date
5) stated monthly income
6) debt to income ratio
7) employment status
8) employment status duration
9) credit score range lower
10) credit score range upper
11) borrower rate
12) borrower APR
13) listing category
14) loan status
Loan original amount, monthly payment and term are the most important features for this analysis. They describe how much people borrow, how much they have to pay back, and how quickly. I would like to know what underlying trends those features have, and what kind of relationships they have with each other and with other features (for example, how they changed over time). I would also like to know who the borrowers are and whether their characteristics have any relation to the first three factors.
Listing category contains integer numbers. In order to make plotting by it easier I am going to convert it into a factor, using the following levels:
Number | Level
--- | ---
0 | Not Available
1 | Debt Consolidation
2 | Home Improvement
3 | Business
4 | Personal Loan
5 | Student Use
6 | Auto
7 | Other
8 | Baby&Adoption
9 | Boat
10 | Cosmetic Procedure
11 | Engagement Ring
12 | Green Loans
13 | Household Expenses
14 | Large Purchases
15 | Medical/Dental
16 | Motorcycle
17 | RV
18 | Taxes
19 | Vacation
20 | Wedding Loans
```{r echo=FALSE, message=FALSE, warning=FALSE}
loans$ListingCategory = factor(loans$ListingCategory..numeric., levels = 0:20)
levels(loans$ListingCategory) <- c('Not Available',
'Debt Consolidation',
'Home Improvement',
'Business',
'Personal Loan',
'Student Use',
'Auto',
'Other',
'Baby&Adoption',
'Boat',
'Cosmetic Procedure',
'Engagement Ring',
'Green Loans',
'Household Expenses',
'Large Purchases',
'Medical/Dental',
'Motorcycle',
'RV',
'Taxes',
'Vacation',
'Wedding Loans')
```
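The same conversion can also be written in a single step by passing `labels` to `factor()`. The following is a minimal sketch on a made-up vector of codes; the `codes` vector and the shortened label set are illustrative assumptions, not the Prosper data.

```r
# Hypothetical numeric category codes (not taken from the dataset)
codes <- c(1, 0, 3, 1, 2)

# Map each code to a readable label in a single call
category <- factor(codes,
                   levels = 0:3,
                   labels = c("Not Available", "Debt Consolidation",
                              "Home Improvement", "Business"))

# Count how often each label occurs
table(category)
```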
Loan origination date is stored as a factor of strings. Converting it to dates also makes plotting easier.
```{r echo=FALSE, message=FALSE, warning=FALSE}
loans$LoanOriginationDate <- as.Date(loans$LoanOriginationDate)
```
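A minimal sketch of what this conversion does, assuming the raw values look like `YYYY-MM-DD hh:mm:ss` timestamps (the two example strings are made up for illustration):

```r
# Hypothetical timestamps stored as a factor of strings
raw_dates <- factor(c("2007-08-26 00:00:00", "2014-02-27 00:00:00"))

# as.Date() keeps the calendar date and drops the time part
dates <- as.Date(raw_dates)

class(dates)
diff(dates)  # difference in days between the two dates
```

Once the column has class `Date`, ggplot2 can place it on a continuous date axis with `scale_x_date()`.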
```{r echo=FALSE, message=FALSE, warning=FALSE}
# subset the dataset to the interesting features
loans <- loans %>% select(LoanOriginalAmount,
MonthlyLoanPayment,
Term,
LoanOriginationDate,
StatedMonthlyIncome,
DebtToIncomeRatio,
CreditScoreRangeLower,
CreditScoreRangeUpper,
BorrowerRate,
BorrowerAPR,
EmploymentStatus,
EmploymentStatusDuration,
ListingCategory,
LoanStatus)
```
## 3 Univariate Analysis
```{r echo=FALSE, message=FALSE, warning=FALSE}
names(loans)
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
str(loans)
```
### 3.1 Loan Original Amount
The most important property of a loan is the original amount.
```{r echo=FALSE, message=FALSE, warning=FALSE}
summary(loans$LoanOriginalAmount)
```
The minimum is $1,000.00, the maximum is $35,000.00, and the median is $6,500.00.
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(x = LoanOriginalAmount), data = loans) +
geom_histogram()
```
Most of the loan original amount values are between $1,000.00 and $10,000.00.
There are some values that are more frequent than the others. To find out which values, I am going to list the most frequent ones:
```{r echo=FALSE, message=FALSE, warning=FALSE}
summary(as.factor(loans$LoanOriginalAmount), maxsum = 11)
```
Based on the list of the ten most frequent values, the peaks seem to occur at approximately every five thousand dollars.
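Listing the most frequent values can also be done by sorting a frequency table. This is a small sketch on invented amounts, not the actual loan data:

```r
# Hypothetical loan amounts (illustrative only)
amounts <- c(5000, 10000, 5000, 15000, 5000, 10000, 2500)

# Count each distinct value and sort from most to least frequent
freq <- sort(table(amounts), decreasing = TRUE)
head(freq, 3)
```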
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(x = LoanOriginalAmount), data = loans) +
geom_histogram(binwidth = 5000)
```
Setting the binwidth to 5000 shows a decreasing trend.
### 3.2 Monthly Loan Payment
The second most important property of a loan is the monthly payment.
```{r echo=FALSE, message=FALSE, warning=FALSE}
summary(loans$MonthlyLoanPayment)
```
The minimum is $0.00, the maximum is $2,252.00, and the median is $217.70.
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(x = MonthlyLoanPayment), data = loans) +
geom_histogram()
```
The majority of the monthly payments are between $0.00 and $500.00.
There are some outliers.
I am going to limit the monthly payment to the 99th percentile and set the binwidth to 10.
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(x = MonthlyLoanPayment), data = loans) +
geom_histogram(binwidth = 10) +
xlim(0, quantile(loans$MonthlyLoanPayment, .99))
```
Based on this plot a lot of loans have a monthly payment of $0.00.
Count 0 values:
```{r echo=FALSE, message=FALSE, warning=FALSE}
table(loans$MonthlyLoanPayment == 0)
```
935 loans have a monthly payment of $0.00. Let's see what the status of those loans is.
```{r echo=FALSE, message=FALSE, warning=FALSE}
table(subset(loans, MonthlyLoanPayment == 0)$LoanStatus)
```
All of those loans are completed, defaulted or the final payment is in progress.
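The filter-then-tabulate step used here can be sketched on a tiny, made-up data frame (`toy_loans` and its values are hypothetical):

```r
# Hypothetical miniature version of the loans data frame
toy_loans <- data.frame(
  MonthlyLoanPayment = c(0, 173.71, 0, 318.93, 0),
  LoanStatus = c("Completed", "Current", "Defaulted", "Current", "Completed")
)

# Keep only the zero-payment loans, then count their statuses
zero_payment <- subset(toy_loans, MonthlyLoanPayment == 0)
table(zero_payment$LoanStatus)
```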
### 3.3 Term
Term is also a critical variable, as that is the most likely to have a strong relationship with the amount or the monthly payment.
Number of loans per term:
```{r echo=FALSE, message=FALSE, warning=FALSE}
table(loans$Term)
```
Proportion of loans per term:
```{r echo=FALSE, message=FALSE, warning=FALSE}
round(table(loans$Term) / nrow(loans), 2)
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
# make a pie chart
# http://docs.ggplot2.org/current/coord_polar.html
ggplot(aes(x = factor(1), fill = factor(Term)), data = loans) +
geom_bar(width = 1) +
coord_polar(theta = 'y') +
scale_fill_grey()
```
77% of the loans last 36 months (3 years), 21% last 60 months (5 years) and 1% last 12 months (1 year).
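Proportions like these can also be obtained with `prop.table()`, which divides each count by the total. A minimal sketch on invented terms:

```r
# Hypothetical loan terms in months (illustrative only)
term <- c(36, 36, 36, 36, 36, 36, 36, 60, 60, 12)

# Share of loans per term; the shares sum to 1
shares <- round(prop.table(table(term)), 2)
shares
```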
### 3.4 Loan Origination Date
I am interested in how the different variables changed over time. First, let's see how the number of loans changed.
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(x = LoanOriginationDate), data = loans) +
geom_histogram(binwidth = 7) +
scale_x_date()
```
There is an increasing trend in the number of loans from 2006 to late 2008 and from late 2009 to 2014, and there is a gap between late 2008 and late 2009.
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(x = LoanOriginationDate), data = loans) +
geom_histogram(binwidth = 7) +
  scale_x_date(limits = c(as.Date('2008-08-01'), as.Date('2009-10-01')))
```
Zooming in on that timeframe shows that almost no loans were registered during a period of approximately 10 months. The most likely reason for this anomaly is the subprime mortgage crisis.
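A quiet period like this can also be checked numerically by counting loans per calendar month. The sketch below uses four invented dates; `cut()` with `breaks = "month"` keeps the empty months as zero counts.

```r
# Hypothetical origination dates on either side of a quiet period
dates <- as.Date(c("2008-09-15", "2008-10-01", "2009-07-20", "2009-07-25"))

# One level per calendar month between the first and last date,
# including months in which no loan was originated
per_month <- table(cut(dates, breaks = "month"))
per_month
```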
### 3.5 Stated Monthly Income
I would like to know how much a borrower's monthly income is.
```{r echo=FALSE, message=FALSE, warning=FALSE}
summary(loans$StatedMonthlyIncome)
```
The minimum is $0.00, the maximum is $1,750,000.00, and the median is $4,667.00. The difference between the median and the maximum is significant.
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(x = StatedMonthlyIncome), data = loans) +
geom_histogram()
```
The histogram of monthly incomes shows a serious outlier issue. This confirms my suspicion about the big difference between the median ($4,667.00) and the maximum ($1,750,000.00).
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(x = StatedMonthlyIncome), data = loans) +
geom_histogram(binwidth = 150) +
xlim(0, quantile(loans$StatedMonthlyIncome, .99))
```
Limiting the values to the .99 quantile and setting the binwidth to 150 shows a much clearer, positively skewed, roughly bell-shaped distribution.
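To illustrate how a quantile-based limit removes extreme values, here is a small sketch on invented incomes (99 ordinary values plus one extreme one):

```r
# Hypothetical incomes: 99 ordinary values plus one extreme outlier
income <- c(seq(1000, 9800, length.out = 99), 1750000)

# The 0.99 quantile is the value below which 99% of the data falls
cutoff <- quantile(income, 0.99)

# Values above the cutoff fall outside xlim() and are dropped from the plot
sum(income > cutoff)
```

Only the single extreme value exceeds the cutoff, so the remaining data can be plotted on a much narrower axis.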
The only thing that does not seem obvious is the high frequency of the 0 values.
Number of 0 values:
```{r echo=FALSE, message=FALSE, warning=FALSE}
table(loans$StatedMonthlyIncome == 0)
```
There are 1394 loans in the dataset with 0 stated monthly income.
```{r echo=FALSE, message=FALSE, warning=FALSE}
# I am going to store the subset of zero StatedMonthlyIncome loans,
# so I can investigate it deeper
loans_zero_monthly_income <- subset(loans, StatedMonthlyIncome == 0)
table(loans_zero_monthly_income$ListingCategory)
```
Listing category does not explain the zero values.
```{r echo=FALSE, message=FALSE, warning=FALSE}
table(loans_zero_monthly_income$LoanStatus)
```
Most of these loans are completed, charged off or defaulted, but still 259 of them are current.
### 3.6 Debt to Income Ratio
Let's have a look at how much a borrower has to spend on paying back debts.
```{r echo=FALSE, message=FALSE, warning=FALSE}
summary(loans$DebtToIncomeRatio)
```
The minimum is 0.000, the maximum is 10.010, and the median is 0.220.
There are some outliers.
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(x = DebtToIncomeRatio), data = loans) +
xlim(0, quantile(loans$DebtToIncomeRatio, .99, na.rm = TRUE)) +
geom_histogram(binwidth = 0.01)
```
Most of the debt to income ratio values are between 0.14 and 0.32.
The histogram shows a positively skewed, roughly bell-shaped distribution.
### 3.7 Employment Status
I would like to know if there is any trend in the employment status of the borrowers, and later whether it has any relationship to the original amount or monthly payment of the loans.
Number of loans per employment status:
```{r echo=FALSE, message=FALSE, warning=FALSE}
table(loans$EmploymentStatus)
```
Proportion of loans per employment status:
```{r echo=FALSE, message=FALSE, warning=FALSE}
round(table(loans$EmploymentStatus) / nrow(loans), 2)
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(x = factor(1), fill = factor(EmploymentStatus)), data = loans) +
geom_bar(width = 1) +
coord_polar(theta = 'y') +
scale_fill_brewer(type = 'div')
```
89% of the borrowers are employed or retired (`Employed`, `Full-time`, `Part-time`, `Retired`, `Self-employed`), <1% are `Not employed`, and there is no useful employment information about 10% of them (`NA`, `Not available`, `Other`).
### 3.8 Employment Status Duration
Let's see the length of the employment statuses.
```{r echo=FALSE, message=FALSE, warning=FALSE}
summary(loans$EmploymentStatusDuration)
```
The minimum is 0.00, the maximum is 755.00, and the median is 67.00.
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(x = EmploymentStatusDuration), data = loans) +
geom_histogram(binwidth = 7)
```
Most of the employment status durations are between 0 and 200 months, and the distribution shows a decreasing trend.
### 3.9 Credit Score
Credit score represents the creditworthiness of a borrower, and it is used by banks and other lenders to evaluate the risk of a loan. Therefore, I would like to see how it correlates with the properties of peer-to-peer loans.
The dataset contains both credit score range lower and credit score range upper variables. I am going to check whether there is any difference between the two, or whether I can omit one of them in my analysis.
```{r echo=FALSE, message=FALSE, warning=FALSE}
# calculate the difference of credit score range upper and
# credit score range lower
credit_score_range_difference = loans$CreditScoreRangeUpper -
loans$CreditScoreRangeLower
```
Summary of (credit score range upper - credit score range lower) values:
```{r echo=FALSE, message=FALSE, warning=FALSE}
summary(credit_score_range_difference)
```
Number of (credit score range upper - credit score range lower) values:
```{r echo=FALSE, message=FALSE, warning=FALSE}
table(credit_score_range_difference)
```
The difference of credit score range upper and credit score range lower is always 19. Therefore, I can omit one of those values. I am going to use credit score range lower in my analysis.
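The redundancy check can be sketched as follows, using invented score ranges (the `lower` and `upper` vectors are hypothetical):

```r
# Hypothetical lower and upper bounds of credit score ranges
lower <- c(640, 680, 720, NA, 600)
upper <- lower + 19

# If the difference takes a single value, one column determines the other
diffs <- unique(na.omit(upper - lower))
diffs
```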
```{r echo=FALSE, message=FALSE, warning=FALSE}
summary(loans$CreditScoreRangeLower)
```
The minimum is 0.0, the maximum is 880.0, and the median is 680.0.
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(x = CreditScoreRangeLower), data = loans) +
geom_histogram(binwidth = 10)
```
Most of the credit score range lower values are between 600 and 800.
All values are multiples of 10.
The distribution seems almost normal.
There are some outliers.
### 3.10 Borrower Rate
Borrower rate is the borrower's interest rate for a loan.
```{r echo=FALSE, message=FALSE, warning=FALSE}
summary(loans$BorrowerRate)
```
The minimum is 0.0000, the maximum is 0.4975, and the median is 0.1840.
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(x = BorrowerRate), data = loans) +
geom_histogram(binwidth = 0.01)
```
Most frequent values:
```{r echo=FALSE, message=FALSE, warning=FALSE}
loans %>%
group_by(BorrowerRate) %>%
summarise(count = n()) %>%
arrange(desc(count))
```
Most borrower rate values are between 0.1 and 0.3.
The values at 0.32 and 0.35 are unexpectedly frequent.
There are some outliers.
### 3.11 Borrower APR
Borrower APR is the borrower's annual percentage rate for a loan. It represents the total cost of a loan, and it includes the borrower rate as well.
```{r echo=FALSE, message=FALSE, warning=FALSE}
summary(loans$BorrowerAPR)
```
The minimum is 0.00653, the maximum is 0.51230, and the median is 0.20980.
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(x = BorrowerAPR), data = loans) +
geom_histogram(binwidth = 0.01)
```
Most frequent values:
```{r echo=FALSE, message=FALSE, warning=FALSE}
# list the most frequent BorrowerAPR values
loans %>%
group_by(BorrowerAPR) %>%
summarise(count = n()) %>%
arrange(desc(count))
```
Most borrower APR values are between 0.1 and 0.3.
The values at 0.32 and 0.35 are unexpectedly frequent.
There are some outliers.
Based on the distributions, there is a clear relationship between borrower rate and borrower APR. As borrower APR includes borrower rate, this is not a surprise.
### 3.12 Listing Category
I would like to see what the borrowers used their loans for.
```{r echo=FALSE, message=FALSE, warning=FALSE}
table(loans$ListingCategory)
```
```{r ListingCategory_histogram, echo=FALSE, message=FALSE, warning=FALSE}
# ListingCategory is categorical, so a bar chart is used to count the levels
ggplot(aes(x = ListingCategory), data = loans) +
  geom_bar() +
  scale_y_sqrt() +
  rotate_x
```
I have used a square-root transformation of the counts to make the bar chart more readable.
Most loans have a listing category of debt consolidation.
A lot of loans do not have a useful listing category (`Not Available`, `Other`).
### 3.13 Loan Status
What is the current status of the loans in the dataset?
```{r echo=FALSE, message=FALSE, warning=FALSE}
table(loans$LoanStatus)
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
# LoanStatus is categorical, so a bar chart is used to count the levels
ggplot(aes(x = LoanStatus), data = loans) +
  geom_bar() +
  scale_y_sqrt() +
  rotate_x
```
I have used a square-root transformation of the counts to make the bar chart more readable.
Most of the loans have a loan status of current, completed, charged off or defaulted. All the other statuses have a very low occurrence in the dataset.
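The effect of a square-root scale on very unequal counts can be shown with a few invented numbers (the counts below are hypothetical, chosen only for round arithmetic):

```r
# Hypothetical category counts spanning several orders of magnitude
counts <- c(Current = 50000, Completed = 30000, Cancelled = 5)

# Ratio of the tallest to the shortest bar on a linear scale
max(counts) / min(counts)

# The same ratio after the square-root transformation (about 100)
max(sqrt(counts)) / min(sqrt(counts))
```

The rare category stays visible because the transformation compresses large counts far more than small ones.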
## 4 Bivariate Analysis
```{r echo=FALSE, message=FALSE, warning=FALSE}
# the following values are not in the following table:
# * LoanOriginationDate
# * EmploymentStatus
# * ListingCategory
# * LoanStatus
cor(select(loans,
LoanOriginalAmount,
MonthlyLoanPayment,
Term,
StatedMonthlyIncome,
DebtToIncomeRatio,
EmploymentStatusDuration,
CreditScoreRangeLower,
CreditScoreRangeUpper,
BorrowerRate,
BorrowerAPR),
y = NULL,
use = 'pairwise.complete.obs')
```
Original amount has a strong correlation with monthly payment (0.93198368), and a weak correlation with term (0.33892746), credit score (0.34087445), borrower rate (-0.32895995) and borrower APR (-0.32288669).
Monthly payment has a weak correlation with credit score (0.29253205), borrower rate (-0.24474235) and borrower APR (-0.22665287).
Credit score has a weak correlation with borrower rate (-0.46156668) and borrower APR (-0.42970732).
There is a very strong correlation between borrower rate and borrower APR (0.989823970). As previously mentioned, borrower APR includes borrower rate, so this correlation is expected.
In this section I am going to analyse the relationships above, the changes in the different features over time, and the relationships of original amount and monthly payment with listing category and employment status. I would also like to see how employment status duration differs between employment status values, to get a better picture of the borrowers.
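The correlation table was computed with `use = 'pairwise.complete.obs'`, which means each pair of columns uses every row where both values are present. A minimal sketch on an invented data frame with missing values:

```r
# Hypothetical data with missing values in different rows
toy <- data.frame(x = c(1, 2, 3, 4, NA),
                  y = c(2, 4, 6, NA, 10),
                  z = c(5, 4, 3, 2, 1))

# x-y uses rows 1-3, x-z uses rows 1-4, y-z uses rows 1, 2, 3 and 5
m <- cor(toy, use = "pairwise.complete.obs")
round(m, 2)
```

A consequence of this choice is that different cells of the matrix may be based on different numbers of observations.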
### 4.1 Loan Original Amount and Monthly Payment
```{r echo=FALSE, message=FALSE, warning=FALSE}
cor.test(loans$MonthlyLoanPayment, loans$LoanOriginalAmount)
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(MonthlyLoanPayment, LoanOriginalAmount), data = loans) +
geom_jitter(alpha = .1)
```
Original amount and monthly payment have a very strong (0.9319837) positive correlation. Based on the scatterplot, there are three strong linear relationships between them. That means that monthly payment alone cannot describe the variation of original amount; something else must be involved. Later in my analysis I am going to try to find that other factor.
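One way to see why several straight lines could appear is the standard amortisation formula for a fixed-rate loan, $M = P \cdot r / (1 - (1 + r)^{-n})$, where $P$ is the principal, $r$ the monthly rate and $n$ the number of payments. The sketch below assumes that payments follow this formula, which the dataset itself does not confirm; the function name `monthly_payment` and the example figures are hypothetical.

```r
# Standard amortisation formula for a fixed-rate loan.
# Assumption: payments follow this formula; the source does not state it.
monthly_payment <- function(principal, annual_rate, term_months) {
  r <- annual_rate / 12                        # monthly interest rate
  if (r == 0) return(principal / term_months)  # no interest: equal instalments
  principal * r / (1 - (1 + r)^(-term_months))
}

# Same amount and rate, three different terms (12, 36 and 60 months)
payments <- sapply(c(12, 36, 60),
                   function(n) monthly_payment(10000, 0.18, n))
payments
```

Under this assumption, for a fixed rate and term the payment is proportional to the principal, so each combination would trace its own straight line.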
### 4.2 Loan Original Amount and Term
```{r echo=FALSE, message=FALSE, warning=FALSE}
cor.test(loans$Term, loans$LoanOriginalAmount)
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(x = LoanOriginalAmount, color = factor(Term)), data = loans) +
geom_density(size = 1) +
scale_color_discrete()
```
There is a weak correlation between original amount and term (0.3389275). The only visible difference is that values under $10,000.00 are less frequent if the term is 60 months. That means that longer loans do not necessarily have a higher original amount, as I had expected.
### 4.3 Loan Original Amount and Credit Score
```{r echo=FALSE, message=FALSE, warning=FALSE}
cor.test(loans$CreditScoreRangeLower, loans$LoanOriginalAmount)
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(CreditScoreRangeLower, LoanOriginalAmount), data = loans) +
geom_jitter(alpha = .1, size = 1)
```
There is a weak correlation between original amount and credit score (0.3408745). The scatterplot does not provide any extra information.
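The coefficients quoted in these subsections are read from `cor.test()` output. A short sketch of pulling the individual pieces out of the result, using invented paired values:

```r
# Hypothetical paired measurements (illustrative only)
score  <- c(600, 640, 680, 700, 720, 760, 800)
amount <- c(3000, 2500, 6000, 4000, 9000, 8000, 15000)

result <- cor.test(score, amount)  # Pearson correlation by default

result$estimate  # the correlation coefficient
result$p.value   # p-value for the null hypothesis of zero correlation
result$conf.int  # 95% confidence interval for the coefficient
```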
### 4.4 Loan Original Amount and Borrower Rate
```{r echo=FALSE, message=FALSE, warning=FALSE}
cor.test(loans$BorrowerRate, loans$LoanOriginalAmount)
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(BorrowerRate, LoanOriginalAmount), data = loans) +
geom_jitter(alpha = .1, size = 1)
```
There is a weak correlation between original amount and borrower rate (-0.3289599). The scatterplot does not provide any extra information.
### 4.5 Loan Original Amount and Borrower APR
```{r echo=FALSE, message=FALSE, warning=FALSE}
cor.test(loans$BorrowerAPR, loans$LoanOriginalAmount)
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(BorrowerAPR, LoanOriginalAmount), data = loans) +
geom_jitter(alpha = .1, size = 1)
```
There is a weak correlation between original amount and borrower APR (-0.3228867). The scatterplot does not provide any extra information.
### 4.6 Monthly Payment and Credit Score
```{r echo=FALSE, message=FALSE, warning=FALSE}
cor.test(loans$CreditScoreRangeLower, loans$MonthlyLoanPayment)
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(CreditScoreRangeLower, MonthlyLoanPayment), data = loans) +
geom_jitter(alpha = .1, size = 1)
```
There is a weak correlation between monthly payment and credit score (0.292532). The scatterplot does not provide any extra information.
### 4.7 Monthly Payment and Borrower Rate
```{r echo=FALSE, message=FALSE, warning=FALSE}
cor.test(loans$BorrowerRate, loans$MonthlyLoanPayment)
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(BorrowerRate, MonthlyLoanPayment), data = loans) +
geom_jitter(alpha = .05, size = 1)
```
There is a weak correlation between monthly payment and borrower rate (-0.2447424). The scatterplot shows multiple underlying relationships between them. Later in my analysis, I would like to look into this deeper.
### 4.8 Monthly Payment and Borrower APR
```{r echo=FALSE, message=FALSE, warning=FALSE}
cor.test(loans$BorrowerAPR, loans$MonthlyLoanPayment)
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(BorrowerAPR, MonthlyLoanPayment), data = loans) +
geom_jitter(alpha = .05, size = 1)
```
There is a weak correlation between monthly payment and borrower APR (-0.2266529). The scatterplot shows very similar relationships to the previous (monthly payment and borrower rate), with a little bit more noise.
### 4.9 Credit Score and Borrower Rate
```{r echo=FALSE, message=FALSE, warning=FALSE}
cor.test(loans$BorrowerRate, loans$CreditScoreRangeLower)
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(BorrowerRate, CreditScoreRangeLower), data = loans) +
geom_jitter(alpha = .1, size = 1)
```
There is a weak correlation between credit score and borrower rate (-0.4615667). The scatterplot does not provide any extra information.
### 4.10 Credit Score and Borrower APR
```{r echo=FALSE, message=FALSE, warning=FALSE}
cor.test(loans$BorrowerAPR, loans$CreditScoreRangeLower)
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(BorrowerAPR, CreditScoreRangeLower), data = loans) +
geom_jitter(alpha = .1, size = 1)
```
There is a weak correlation between credit score and borrower APR (-0.4297073). The scatterplot does not provide any extra information.
### 4.11 Borrower Rate and Borrower APR
```{r echo=FALSE, message=FALSE, warning=FALSE}
cor.test(loans$BorrowerRate, loans$BorrowerAPR)
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(BorrowerRate, BorrowerAPR), data = loans) +
geom_jitter(alpha = .1, size = 1)
```
There is a very strong correlation between borrower rate and borrower APR (0.989824), and the scatterplot supports that. As previously described, this is because borrower APR includes borrower rate. The scatterplot shows multiple relationships. I would like to know what makes them different.
### 4.12 Loan Original Amount and Loan Origination Date
Now let's see how different variables changed over time.
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(LoanOriginationDate, LoanOriginalAmount), data = loans) +
geom_jitter(alpha = .1, size = 1)
```
There was no significant change in the original amount before the gap in late 2008, or between the gap and 2011. In 2011, the smaller loans disappeared, and in 2013 the popularity of larger loans increased.
### 4.13 Monthly Loan Payment and Loan Origination Date
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(LoanOriginationDate, MonthlyLoanPayment), data = loans) +
geom_jitter(alpha = .1, size = 1)
```
The pattern of changes in monthly payment over time is very similar to the pattern of changes in original amount, but with much more noise.
### 4.14 Term and Loan Origination Date
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(LoanOriginationDate, Term), data = loans) +
geom_jitter(alpha = .1, size = 1) +
scale_y_continuous(breaks = c(12, 36, 60))
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(x = LoanOriginationDate, color = factor(Term)), data = loans) +
geom_freqpoly(size = 1, binwidth = 30) +
scale_color_discrete() +
scale_x_date() +
scale_y_sqrt()
```
I have used square-root transformation to make the frequency polygon more readable.
There is a strong relationship between origination date and term. All the loans before the previously mentioned gap (between late 2008 and late 2009) have a 36-month term. 12-month loans only occurred during a 2-year period between 2011 and 2013 and were not popular. 60-month loans first appeared at the same time as 12-month loans (in 2011), and their number has been increasing since then.
### 4.15 Stated Monthly Income and Loan Origination Date
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(LoanOriginationDate, StatedMonthlyIncome), data = loans) +
geom_jitter(alpha = .1, size = 1) +
ylim(0, quantile(loans$StatedMonthlyIncome, .99))
```
There was no significant change in stated monthly income over time.
### 4.16 Debt to Income Ratio and Loan Origination Date
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(LoanOriginationDate, DebtToIncomeRatio), data = loans) +
geom_jitter(alpha = .1, size = 1) +
ylim(0, quantile(loans$DebtToIncomeRatio, .99, na.rm = TRUE))
```
There was no significant change in debt to income ratio over time.
### 4.17 Employment Status and Loan Origination Date
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(LoanOriginationDate, EmploymentStatus), data = loans) +
geom_jitter(alpha = .1, size = 1)
```
There is no useful employment status information before late 2006. The `Employed` category almost replaced the `Full-time` category in late 2010; this is probably due to a change in the way the data was recorded.
### 4.18 Employment Status Duration and Loan Origination Date
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(LoanOriginationDate, EmploymentStatusDuration), data = loans) +
geom_jitter(alpha = .1, size = 1, color = 'blue') +
geom_smooth(color = 'black', size = 1)
```
The black line represents the smoothed conditional mean of employment status duration over time.
The employment status durations show an increasing trend over time.
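As a rough stand-in for the smoothed line, a plain mean per time bin conveys the same idea. Note that this is a simpler technique than the smoother `geom_smooth()` fits, and the values below are invented for illustration.

```r
# Hypothetical origination years and employment durations in months
year     <- c(2006, 2006, 2008, 2008, 2012, 2012, 2014, 2014)
duration <- c(30, 50, 60, 80, 90, 110, 100, 140)

# Mean duration within each year: a crude, unsmoothed conditional mean
cond_mean <- tapply(duration, year, mean)
cond_mean
```

If the per-bin means rise from one bin to the next, the smoothed curve would be expected to rise as well.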
### 4.19 Credit Score and Loan Origination Date
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(LoanOriginationDate, CreditScoreRangeLower), data = loans) +
geom_jitter(alpha = .1, size = 1) +
ylim(400, 900)
```
In 2006, all credit score values became greater than 500, and in late 2013, greater than 630. There was no significant change in the frequency of the values greater than 800.
### 4.20 Borrower Rate and Loan Origination Date
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(LoanOriginationDate, BorrowerRate), data = loans) +
geom_jitter(alpha = .1, size = 1)
```
There was a significant drop in the maximum borrower rate at the beginning of 2006 (from 0.36 to 0.24). The drop was followed by increases in early 2006 (from 0.24 to 0.29) and in late 2007 (from 0.19 to 0.36). There were not many changes between late 2009 and 2011, when the maximum decreased again (to 0.32) and the data became much noisier. Between 2012 and 2013 the data became less noisy again, and in the second half of 2013 the values between 0.1 and 0.2 became significantly more frequent.
### 4.21 Borrower APR and Loan Origination Date
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(LoanOriginationDate, BorrowerAPR), data = loans) +
geom_jitter(alpha = .1, size = 1)
```
Once again, the scatterplot of borrower APR and origination date looks almost the same as the scatterplot of borrower rate and origination date, but with more noise.
### 4.22 Listing Category and Loan Origination Date
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(LoanOriginationDate, ListingCategory), data = loans) +
geom_jitter(alpha = .1, size = 1)
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(x = LoanOriginationDate, color = factor(ListingCategory)), data = loans) +
geom_freqpoly(size = 1, binwidth = 30) +
scale_color_discrete() +
scale_x_date() +
scale_y_sqrt() +
guides(color = guide_legend(nrow = 6)) +
theme(legend.position = 'bottom')
```
I used a square-root transformation of the y axis to make the frequency polygon more readable.
There is no useful listing category information before 2008. New listing categories were introduced in 2008 and again in 2012, which indicates a change in the way the data was recorded, similar to employment status.
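To illustrate why a square-root scale helps here, the following sketch compares the spread of a few counts before and after the transformation. The counts are hypothetical and only stand in for the per-bin frequencies of a rare and a common category.

```r
# Hypothetical per-bin loan counts for three categories (illustrative only).
counts <- c(rare = 4, medium = 100, common = 2500)
sqrt_counts <- sqrt(counts)

# Ratio between the largest and smallest value before and after transforming.
ratio_raw  <- max(counts) / min(counts)
ratio_sqrt <- max(sqrt_counts) / min(sqrt_counts)

print(rbind(counts, sqrt_counts))
print(c(ratio_raw = ratio_raw, ratio_sqrt = ratio_sqrt))
```

The transformation compresses large counts more than small ones, so lines for rare categories are no longer squeezed against the x axis by the dominant categories.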
There is no visible pattern within listing categories.
### 4.23 Loan Status and Loan Origination Date
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(LoanOriginationDate, LoanStatus), data = loans) +
geom_jitter(alpha = .1, size = 1)
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(x = LoanOriginationDate, color = factor(LoanStatus)), data = loans) +
geom_freqpoly(size = 1, binwidth = 30) +
scale_color_discrete() +
scale_x_date() +
scale_y_sqrt()
```
I used a square-root transformation of the y axis to make the frequency polygon more readable.
Almost all the loans that are current were originated after 2011. Most of the completed loans were originated before late 2008, or after late 2009 and before 2013. The majority of the charged-off loans were originated before the gap between late 2008 and late 2009.
### 4.24 Loan Original Amount and Listing Category
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(ListingCategory, LoanOriginalAmount), data = loans) +
geom_jitter(alpha = .1, size = 1) +
rotate_x
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(ListingCategory, LoanOriginalAmount), data = loans) +
geom_boxplot() +
rotate_x
```
Loans for `Debt Consolidation`, `Baby&Adoption` and `Business` have the highest original amounts, while loans for `Motorcycle`, `Vacation` and `Household Expenses` have the lowest.
### 4.25 Monthly Loan Payment and Listing Category
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(ListingCategory, MonthlyLoanPayment), data = loans) +
geom_jitter(alpha = .1, size = 1) +
rotate_x
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(ListingCategory, MonthlyLoanPayment), data = loans) +
geom_boxplot() +
rotate_x
```
Loans for `DebtConsolidation` have the highest monthly payment, followed by `NotAvailable`, `HomeImprovement`, `Business` and `Other`.
### 4.26 Loan Original Amount and Employment Status
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(EmploymentStatus, LoanOriginalAmount), data = loans) +
geom_jitter(alpha = .1, size = 1) +
rotate_x
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(EmploymentStatus, LoanOriginalAmount), data = loans) +
geom_boxplot() +
rotate_x
```
Employed borrowers have the loans with the highest original amounts, and part-time employed borrowers have the loans with the lowest.
### 4.27 Monthly Loan Payment and Employment Status
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(EmploymentStatus, MonthlyLoanPayment), data = loans) +
geom_jitter(alpha = .05, size = 1) +
rotate_x
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(EmploymentStatus, MonthlyLoanPayment), data = loans) +
geom_boxplot() +
rotate_x
```
Employed borrowers have the loans with the highest monthly payments, and part-time employed borrowers have the loans with the lowest.
### 4.28 Employment Status Duration and Employment Status
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(EmploymentStatus, EmploymentStatusDuration), data = loans) +
geom_jitter(alpha = .05, size = 1) +
rotate_x
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(EmploymentStatus, EmploymentStatusDuration), data = loans) +
geom_boxplot() +
rotate_x
```
There is a relationship between employment status and employment status duration. The `Employed`, `Full-time`, `Retired` and `Self-employed` statuses tend to have longer durations than `Not employed` or `Part-time`.
## 5 Multivariate Analysis
In this section I am going to try to answer the questions I raised in the previous section.
### 5.1 Loan Original Amount and Monthly Payment
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(MonthlyLoanPayment, LoanOriginalAmount), data = loans) +
geom_jitter(alpha = .1)
```
There are three linear relationships between original amount and monthly payment. Based on the relatively strong correlation between term and monthly payment, and knowing that there are three possible term values, term seems the best next variable to investigate.
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(MonthlyLoanPayment, LoanOriginalAmount), data = loans) +
geom_jitter(alpha = .1, aes(color = factor(Term))) +
scale_color_discrete()
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(MonthlyLoanPayment, LoanOriginalAmount / Term), data = loans) +
geom_jitter(alpha = .1)
```
```{r echo=FALSE, message=FALSE, warning=FALSE}
with(loans, cor.test(MonthlyLoanPayment, LoanOriginalAmount / Term))
```
Monthly loan payment is almost equal to the original amount divided by the term of the loan. Therefore, monthly loan payment can largely be described by original amount and term together.
### 5.2 Monthly Payment and Borrower Rate
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(BorrowerRate, MonthlyLoanPayment), data = loans) +
geom_jitter(alpha = .05, size = 1)
```
There are multiple linear relationships between monthly payment and borrower rate.
Adding term to the plot may help again.
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(BorrowerRate, MonthlyLoanPayment), data = loans) +
geom_jitter(alpha = .05, size = 1) +
facet_wrap(~Term)
```
Faceting by term shows some difference between the previously seen relationships.
Since both term and original amount are related to monthly loan payment, adding original amount to the previous plot may make the picture clearer.
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(BorrowerRate, MonthlyLoanPayment), data = loans) +
geom_jitter(alpha = .05, size = 1, aes(color = LoanOriginalAmount)) +
scale_color_continuous(low = 'light blue', high = 'dark blue') +
facet_wrap(~Term)
```
Faceting by term and colouring by original amount shows a complex relationship in which borrower rate, original amount and term all contribute to the monthly payment.
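One plausible explanation for this three-way relationship is the standard amortization (annuity) formula for fixed-rate loans. The sketch below is an assumption about how such payments are typically computed, not something confirmed by the data set. The function `monthly_payment()`, the example amount of 10,000, the 20% rate and the term values of 12, 36 and 60 months are all hypothetical and chosen for illustration.

```r
# Standard annuity formula, assuming a fixed annual rate compounded monthly.
# Whether the payments in the data set follow exactly this rule is an assumption.
# `annual_rate` is expected to be a single number; `term_months` may be a vector.
monthly_payment <- function(amount, annual_rate, term_months) {
  r <- annual_rate / 12
  # With a zero rate the payment reduces to amount / term.
  if (r == 0) {
    return(amount / term_months)
  }
  amount * r / (1 - (1 + r)^(-term_months))
}

# Hypothetical loan of 10,000 over three assumed terms, at 0% and at 20%.
terms <- c(12, 36, 60)
no_interest   <- monthly_payment(10000, 0, terms)
with_interest <- monthly_payment(10000, 0.20, terms)
print(data.frame(terms, no_interest, with_interest))
```

Under this assumption the payment equals the original amount divided by the term only when the rate is zero. A positive rate pushes the payment above that baseline, which would be consistent with the monthly payment being close to, but not exactly, `LoanOriginalAmount / Term`.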
### 5.3 Borrower Rate and Borrower APR