-
Notifications
You must be signed in to change notification settings - Fork 0
/
data-cleaning-audible_data.Rmd
538 lines (401 loc) · 19.7 KB
/
data-cleaning-audible_data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
---
title: "Data Cleaning (R) - Audible Dataset"
author: "Shefali C."
date: "2023-10-04"
output:
html_document:
toc: true
theme: united
highlight: tango
css: audible_dataset.css
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Introduction
This notebook answers some of the most common "How To.." questions that pop-up during data cleaning.
I've also cleaned this notebook in Python. You may check out the Python version [here on Kaggle](https://www.kaggle.com/code/cshefali/some-common-how-tos-in-data-cleaning-in-python).
I hope both versions of this work helps beginners to understand corresponding functions in Python and R.
The Audible Dataset used in this notebook can be found [here](https://www.kaggle.com/datasets/snehangsude/audible-dataset/data).
```{r load-libraries, message=FALSE, warning=FALSE}
library(tidyverse) # metapackage of all tidyverse packages
library(lubridate) # to work on dates
```
```{r set-working-dir}
#the working directory
working_dir <- getwd()
```
```{r read-file, message=FALSE, warning=FALSE}
#read data
audible_data = readr::read_csv(paste0(working_dir, "/data/audible_dataset/audible_uncleaned.csv"))
```
```{r data-preview1}
#preview of the data
head(audible_data, 4)
```
From the sample data above and information below, we can list out some inconsistencies in the dataframe.
1. **author-** "Writtenby:" can be removed; multiple author names can be placed in different columns.
2. **narrator-** "Narratedby:" can be removed.
3. **time-** is of type `str`; can be converted to numeric or datetime type. We'll see what should be done here.
4. **releasedate-** is of type `str`; should be of type `date`.
5. **language-** all values should be in uniform case; first letter capital, rest smallcase.
6. **stars-** can be split into i) number of stars and ii) total number of ratings columns.
7. **price-** is of type `str`; should be float.
```{r data-info1}
#information about the data
glimpse(audible_data)
```
```{r data-stats1}
#stats
summary(audible_data)
```
# Data Cleaning Steps
## 1. How to check total count of each category?
```{r books-count}
#total count of each unique book
book_count <- audible_data %>%
group_by(name) %>%
summarize(total_count = n()) %>%
arrange(-total_count)
#books with 10 or more observations.
book_count %>% filter(total_count >= 10)
```
## 2. How to check number of duplicate rows?
```{r duplicate-rows}
#total number of duplicate rows
sum(duplicated(audible_data))
```
## 3. How to remove a string pattern?
#### Changes mades to the **author** column:
1. Removed "Writtenby:".
2. Add space between first, middle(if any) and last names of the authors.
3. Since some books have multiple authors, this column is split into multiple columns with 1 author name in each.
The 3rd step is optional.
I've added it here just to showcase how to split a text column into multiple columns, and assign column names automatically using prefixes like author1, author2 etc.
```{r remove-writtenby}
#Author column
#remove the phrase "Writtenby:"
audible_data$author <- stringr::str_replace(audible_data$author,
pattern = 'Writtenby:', replacement = '')
#after removing "Written By", author column looks like:
head(audible_data[,'author'], 4)
```
## 4. How to add space between 2 or more word blocks? (here first, middle, last name)
#### Regular Expression used:
1. **"([a-z])([A-Z])"-** means match a lowercase letter followed by an uppercase letter. So, group 1 is ***[a-z]*** and group 2 is ***[A-Z]***.
2. **"\\\1 \\\2"-** means add space wherever uppercase letter(group 2) is followed by lowercase letter group(group 1).
So, for example, 'ChandlerBing' becomes 'Chandler Bing'.
**NOTE:**
1. Some rows contains muliple author names. e.g. **"ChandlerBing, RossGeller, PhoebeBuffay"**.
2. If **`str_replace()`** is used below, space will be added to only the first author name, ie. **"Chandler Bing, RossGeller, PhoebeBuffay"**.
3. Applying **`str_replace_all()`** adds space to all author names, i.e. **"Chandler Bing, Ross Geller, Phoebe Buffay"**.
```{r author-add-space}
#Add space between the first, middle and last names of Authors.
#e.g. JaneAustin becomes Jane Austin
audible_data$author <- stringr::str_replace_all(audible_data$author,
pattern = "([a-z])([A-Z])",
replacement = "\\1 \\2")
head(audible_data[,'author'], 4)
```
## 5. How to split a value into multiple columns and assign column name with prefix/suffix?
**`str_count()`** below finds the number of occurances of comma ',' in each row of author column. Then maximum number of occurances is found and 1 is added.
So, if there are three commmas in any given row, it implies that row has 4 names in the author column.
And 4 columns will be created to store each author name.
```{r count-commas}
#maximum number of authors for any book.
#helps to determine max number of columns needed to keep indiviudal authors
max_columns <- max(str_count(audible_data$author, ',')) + 1
```
```{r split-author}
#split the author column.
audible_data <- audible_data %>% separate(col = author,
into = paste0('author', seq_len(max_columns)),
sep = ',',
fill = 'right')
```
```{r tail-preview}
tail(audible_data,3)
```
## 6. How to find total number of NA in each column?
Here, I wanted to see if it's at all useful to split the author column or not.
As we can see from the NA value counts below, the newly created columns ***author2, author3, author4*** are mostly vacant.
Majority of the books have only 1 author and about 14,000 books have 2 authors.
(Total number of unique books in the data- 82767)
```{r total-na}
#find total number of NA in colnames containing 'author'
#(author1, author2, author3, author4)
rbind(colSums(is.na(audible_data %>% select(contains('author')))))
```
#### Changes made to narrator column:
1. Removed "Narratedby:".
2. Add space between first, middle and last names of the narrator.
3. Here, I haven't split this column for multiple narrators.
```{r remove-narratedby}
#remove "narrated by" from the beginning
audible_data$narrator <- stringr::str_replace(audible_data$narrator,
pattern = "Narratedby:",
replacement = "")
```
```{r narrator-addspace}
#add space between first and last names
audible_data$narrator <- stringr::str_replace_all(audible_data$narrator,
pattern = "([a-z])([A-Z])",
replacement = "\\1 \\2")
```
### Time column
1. This column contains duration of audiobooks in text format.
2. Some rows contain ***'xx hrs and yy mins'*** while others contain ***'y mins', 'less than 1 minute'*** etc.
In order to see all unique formats present, a copy of the time column has been created.
```{r copy-time}
#make a copy of time column to understand all kind of formats in which data is present
time_column = audible_data$time
```
All digits have been removed to see what all phrases are present.
```{r unique-time}
#replace all numbers with blanks
time_column = str_replace_all(time_column,
pattern = '[0-9]', replacement = '')
#keep only unique patterns
unique(time_column)
```
- **`Assumption:`** All rows with ***Less than xx minutes*** will be replaced with ***xx minutes***.
- Below, I've filtered out all row indices which have this 'less than..' values to see all distinct durations.
- Since only 1 unique value present, ***less than 1 minute***, it can be simply replaced with '1'.
- If other values were present like ***less than 5 minutes, less than 10 mins*** etc., then we would have used regular expressions to extract and store the digits.
```{r}
#find rows with "less than x minute" values in audiobook duration
less_than_duration = grep(pattern = 'less than', audible_data$time, ignore.case = T)
#check all unique values with 'less than' pattern
audible_data[less_than_duration,'time'] %>% distinct()
```
The objective is to:
1. Create 2 columns **`hour_component`** and **`min_component`** to store the hour and minute duration of the audiobook.
2. Examples:
- ***'17 hrs and 5 mins'*** will split into 17 in hour column and 5 in minute column.
- ***'24 mins'*** will be stored as 0 in hour column and 24 in minute column.
- ***'less than 1 minute'*** will be stored as 0 in hour and 1 in minute column.
```{r}
#create 2 columns- 1 for hour component and the other for minute component.
audible_data$hour_component <- 0
audible_data$min_component <- 0
```
## 7. How to extract digits before a specifc word?
Regular Expression used:
1. **"^-"** indicates start matching the pattern from beginning of the string.
2. **"([0-9]+)"-** is the capture group. Means look for one or more digits.
3. **+** means one or more occurances of digits.
4. **"([0-9]+) hr"** captures digits before the word hr or hrs.
5. **"([0-9]+) min"** captures digits before min/mins/minutes words.
```{r extract-hr-min}
#extract number of hours
audible_data$hour_component <- stringr::str_extract(audible_data$time,
pattern = '^([0-9]+) hr')
#remove the 'hr' part
audible_data$hour_component <- stringr::str_replace(audible_data$hour_component,
pattern = ' hr', replacement = '')
#extract number of minutes
audible_data$min_component <- stringr::str_extract(audible_data$time, pattern = '([0-9]+) min')
#remove the 'min' part
audible_data$min_component <- stringr::str_replace(audible_data$min_component,
pattern = ' min', replacement = '')
```
```{r view-time}
#preview of extracted time
head(audible_data %>% select(time, hour_component, min_component), 5)
```
```{r view-dtype}
#checking the datatype of all cols now
glimpse(audible_data)
```
## 8. How to convert NA values to 0?
Now, after digits extraction, values like ***'24 mins'*** will keep 24 in minute column and NA in hour column.
Similarly, a value like ***'11 hrs'*** will keep NA in minutes column.
Below, these NA values have been replaced with 0.
```{r replace-na}
audible_data$hour_component <- audible_data$hour_component %>% replace_na('0')
audible_data$min_component <- audible_data$min_component %>% replace_na('0')
```
## 9. How to change the datatype of multiple columns?
Now, we can change the datatype of hour and minutes from strings to integers.
```{r change-toint}
#convert the hour and min column to int type
audible_data = audible_data %>%
mutate(hour_component = as.integer(hour_component),
min_component = as.integer(min_component))
```
### Date column
Dates are written in dd-mm-yy format.
They are in character format and will be converted to date type.
But before that, we need to make sure that each of the components are consistent. That is:
- **dd** doesn't exceed 31.
- **mm** doesn't exceed 12.
- In the [datacard](https://www.kaggle.com/datasets/snehangsude/audible-dataset/data) on Audible dataset page, its mentioned that some books are yet to be released, which means we can have year value greater than 2023.
```{r preview-date}
head(audible_data %>% select(releasedate), 4)
```
## 10. How to check for inconsistencies in date column stored as a string?
Regular expressions used:
1. **"^([0-9]+)"-** check for the digit from beginning of string (^) and before hypher -.
2. **"-([0-9]+)-"** check for the digits between the two hyphens.
3. **"-([0-9]+)\$"-** capture the digits after hyphen sign and end of the string.
```{r check-dd}
#day component- dd
#check if any value in day part > 31.
any(
as.integer(
unique(
str_extract(string = audible_data$releasedate, pattern = '^([0-9]+)')
)
) > 31)
```
```{r check-mm}
#check for inconsistencies in month of release date
#Check to see whether middle values (months) contain any number > 12
#all unique months.
#extract all months; E.g. extracted format is: "-05-" for May
unique_months <- unique(str_extract(string = audible_data$releasedate, pattern = '-([0-9]+)-'))
#remove both hyphens
unique_months <- str_replace_all(string = unique_months, pattern = '-', repl = '')
#convert to integer and check if any value exceeds 12 (12 months)
any(as.integer(unique_months) > 12)
```
## 11. How to convert date stored as string to 'date' type?
In the Py version of this notebook [here on Kaggle](https://www.kaggle.com/code/cshefali/some-common-how-tos-in-data-cleaning-in-python), I've added a note in this part.
If date string is in format: **04-08-23**, then separator used in **"format"** parameter should be **`%d-%m-%Y`** and not **`%d/%m/%Y`**.
But no such caution was needed here. Only thing to note in case of R is to make sure the right function is used.
Since the date in our df is of form: dd-mm-yy, hence **`dmy()`** has been used.
If in the date, month was mentioned first, followed by day, then year, then **`mdy()`** would have been the right function.
```{r view-releasedate}
head(audible_data %>% select(name, releasedate), 3)
```
```{r convert-date}
#convert to date type
audible_data$release_date <- lubridate::dmy(audible_data$releasedate)
#view datatype of all cols now
glimpse(audible_data)
```
## 12. How to convert text to one uniform case?
```{r unique-lang}
#all unique languages in the dataset
unique(audible_data$language)
```
```{r capitalize-lang}
#capitalize the language names
audible_data$language <- stringr::str_to_title(audible_data$language)
```
### Stars column
The objective is to create 2 columns out of the stars column:
- **`stars_out_of_5`-** will contain numeric values like 2, 4.5 etc.
- **`total_ratings`-** total number of reviews received on the particular audiobook.
- Example: ***\"4.5 out of 5 stars7 ratings\"***
<br>
- This column will be split after the word 'stars'.
- Then **' out of 5'** will be removed and remaining number will be converted to float type.
- In the `total_ratings` column, **"ratings"** will be removed and values will be converted to float type.
```{r view-stars-col}
head(audible_data %>% select(name, stars), 3)
```
```{r split-stars-col}
#split the stars column after the word 'stars'
#rows with 'Not yet rated' return NA in 'total_ratings' column after separation
audible_data <- audible_data %>%
separate(col = stars,
into = c('stars_out_of_5', 'total_ratings'),
sep = 'stars',
#do not remove the original column
remove = F,
#row containing 'Not yet rated' returns NA.
#NA should be filled starting from right-hand-side column
fill = 'right')
```
```{r view-split-stars-col}
#some unique rows in the data
audible_data[sample(nrow(audible_data),40),c('stars', 'stars_out_of_5', 'total_ratings')] %>% distinct()
```
**Out of 87,489 rows in the df, 72,417 contain "Not yet rated" for books that have not been rated yet!**
There is no point in removing all these rows and perform an analysis. But I want to showcase some more queries that arise while cleaning a dataframe. For e.g. ***How do you extract a floating point number occuring before a text? and so on..***
The objective is to make these columns numeric by extracting the numeric part from the ratings e.g. 4.5 out of 5 becomes 4.5 & 106 ratings becomes 106.
When this is done, the "Not rated yet" rows will get converted to NA.
I wanted to keep a copy of the data cleaned so far.
So, the subsequent cleaning steps have been performed on a copy of this dataframe.
```{r clean-data-copy}
#make a copy of dataframe cleaned so far.
audible_data_copy <- audible_data
```
#### Steps taken below:
1. Remove ***'out of 5'*** from `stars_out_of_5` column to retain only the digit part.
2. Remove ***'ratings'*** from `total_ratings` column to retain only the numeric part.
3. Remove commas from values like ***1,500*** in `total_ratings` column. (Bcz this column has to be converted to numeric.)
```{r clean-stars-col}
#remove 'out of 5' from 'stars_out_of_5' column
audible_data$stars_out_of_5 <- stringr::str_replace(audible_data$stars_out_of_5,
pattern = ' out of 5', replacement = '')
#remove 'ratings' from 'total_ratings' column
audible_data$total_ratings <- stringr::str_replace(audible_data$total_ratings,
pattern = ' ratings', replacement = '')
#remove commas from values like 1,500 from total_ratings
audible_data$total_ratings <- str_replace(audible_data$total_ratings,
pattern = ',', replacement = '')
```
```{r change-dtype-stars}
#finally change the datatype of both columns
audible_data <- audible_data %>%
mutate(stars_out_of_5 = as.numeric(stars_out_of_5),
total_ratings = as.integer(total_ratings))
glimpse(audible_data)
```
### Price column
#### Steps taken:
1. Replace 'Free' with 0.
2. Remove commas from prices. Currently, this column is of type str; it has to be converted to numeric.
3. Convert to float.
```{r clean-price}
#convert price to float type
#replace 'free' with 0
audible_data$price = str_replace(audible_data$price, regex(pattern = 'free',ignore_case = TRUE),
replacement = '')
#remove ',' from the values
audible_data$price = str_replace(audible_data$price, pattern = ',', replacement = '')
#convert to float
audible_data$price = as.numeric(audible_data$price)
```
```{r view-data-structure}
str(audible_data)
```
```{r view-colnames}
colnames(audible_data)
```
And finally, I've created a subset of the dataframe with all relevant columns.
```{r}
#select relevant columns
final_data <- audible_data %>%
select(name, contains('author'), narrator, time, hour_component, min_component,
release_date, language, stars_out_of_5, total_ratings, price)
#view a sample of the final data
final_data[sample(nrow(final_data),5),]
```
## 14. How to extract decimal numbers from a text?
I used this step in initial phase of cleaning this dataframe, but then dropped it from the final draft.
But since this notebook contains "How To.." steps, I've included it below just for the sake of future reference.
1. **`stars_out_of_5` column-** example- ***4.5 out of 5***.
- **What I've done above**- replace 'out of 5' with '', then convert 4.5 from string to float.
- **Another way**- extract the floating-point number from this text, i.e. 4.5 and then convert it to float.
```{r create-example}
#example
ratings <- data.frame(rating1 = c('4.5 out of 5', '3 out of 5', '.5 out of 5'))
ratings$rating2 <- ratings$rating1
```
Regular expression used:
1. **[0-9]\*-** means 0 or more occurance of a digit.
2. **"\.?"-** means presence of a decimal point is optional.
3. So, **"[0-9]\*.?"** means 0 or more occurances of digits before decimal point.
This pattern helps to capture values like **.5** where there are no digits before decimal.
4. **".?[0-9]+"** ensures that there must be one or more digits after the decimal point.
```{r extract-float-nums}
ratings$rating2 <- str_extract(ratings$rating1, regex(pattern = '([0-9]*\\.?[0-9]+)'))
ratings
```
So, this is it!
Ofc, there are many more ways of carrying out the cleaning process.
I hope this will be helpful for the readers in some way.
Thanks for reading :)