
Fix issue 151 #157

Open · wants to merge 21 commits into main

Conversation

Karim-Mane (Member)

This PR contains changes to address issue #151.

@Karim-Mane Karim-Mane requested a review from Bisaloo July 22, 2024 17:27
@Karim-Mane Karim-Mane self-assigned this Jul 22, 2024
@Karim-Mane Karim-Mane added the enhancement New feature or request label Jul 22, 2024
@Karim-Mane Karim-Mane linked an issue Jul 22, 2024 that may be closed by this pull request
@Bisaloo Bisaloo (Member) left a comment


I have added some technical comments, but I still don't understand the logic.

Could you add a comment at the start of scan_columns() to explain, please?

E.g.:

  • what about numeric values encoded as factors?
  • can a value be of type logical but in reality be a date?
  • are you sure dates won't be counted twice?
as.numeric(as.Date("2020-01-01"))
#> [1] 18262

Created on 2024-07-23 with reprex v2.1.1

Please add more examples & tests with more diverse column types.

tmp <- suppressWarnings(as.numeric(x))
are_numeric <- round((sum(!is.na(tmp)) / n_rows), 6L)
are_numeric <- 0L
verdict <- type == "logical" | type == "factor"
Bisaloo (Member)

Suggested change
verdict <- type == "logical" | type == "factor"
verdict <- type == "logical" || type == "factor"
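A minimal illustration of the difference (a sketch, assuming R >= 4.3, where `||` with an operand of length > 1 is an error; R 4.2 only warned):

type <- class(Sys.time())               # c("POSIXct", "POSIXt"), length 2
type == "logical" | type == "factor"    # vectorized: FALSE FALSE
# type == "logical" || type == "factor" # error: 'length = 2' in coercion to 'logical(1)'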

@Karim-Mane Karim-Mane (Member Author) Jul 24, 2024

I accounted for this in commit 828534d.

# The percent of numeric and character values will automatically be set to 0
# to prevent the effects of the conversion to numeric and character.
#
types <- as.character(vapply(data, class, character(1L)))
Bisaloo (Member)

Do we want to use class() or typeof() here?

Suggested change
types <- as.character(vapply(data, class, character(1L)))
types <- vapply(data, class, character(1L))

Karim-Mane (Member Author)

class() will give us the information that we need: the composition in data types of every column.

typeof() will tell us that date values are double, i.e. numeric. This is not what we want scan_data() to report.

Bisaloo (Member)

But class() doesn't always return an output of length 1, so this will error.

You will see this with a test including POSIXct columns, for example.
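Both points can be checked with base objects (an illustration, not code from this PR):

typeof(Sys.Date())  # "double": dates are stored as numbers
class(Sys.Date())   # "Date" (length 1)
class(Sys.time())   # "POSIXct" "POSIXt" (length 2, so vapply(data, class, character(1L)) errors)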

Karim-Mane (Member Author)

I still believe that typeof() is not appropriate for this task.

I have made changes to account for multiple types in commit db420c8.

@Bisaloo Bisaloo (Member) commented Jul 24, 2024

Please add more examples & tests with more diverse column types.

In particular, there is still an issue with dates.

@Karim-Mane Karim-Mane (Member Author) commented

> Please add more examples & tests with more diverse column types.
>
> In particular, there is still an issue with dates.

I have tested the function across multiple datasets. If you have a dataset with odd cases that are not accounted for yet, please share it with me.

@Bisaloo Bisaloo (Member) left a comment

I don't think this is going in the right direction.

The code is getting more and more complicated, which will inevitably lead to more bugs.

I would recommend:

  • centering the different case treatment on typeof(), which has a finite, known set of outputs
  • thinking about which scan_data() types are possible for each typeof() and creating a custom scan() function for each type. For example:
    • it seems that logical can only lead to logical and NA, so why is it grouped with factors?
    • character is at the other extreme of the spectrum, since something that appears as a character can actually be any of the types recognized by scan_data()

It may be easier to start from a blank page to not be influenced by the current code, which seems to push toward more and more complexity.
Pen & paper design, with a flow diagram of the different pathways, may help.

Comment on lines 139 to 141
get_class <- function(x) {
  paste(class(x), collapse = ",")
}
@Bisaloo Bisaloo (Member) Jul 25, 2024

In general, it's a bad idea to manipulate objects / code as text, especially to circumvent explicit safeguards.

are_date <- 0L
if (any(type %in% c("POSIXct", "POSIXt"))) {
Bisaloo (Member)

This solution doesn't scale, as we can't possibly add special cases for every existing type, especially since packages can define their own types.

@Karim-Mane Karim-Mane (Member Author) commented

> I don't think this is going in the right direction.
>
> The code is getting more and more complicated, which will inevitably lead to more bugs.
>
> I would recommend:
>
> * centering the different case treatment on `typeof()`, which has a finite, known set of outputs
>
> * thinking about which `scan_data()` types are possible for each `typeof()` and creating a custom `scan()` function for each type. For example:
>
>   * it seems that logical can only lead to logical and NA, so why is it grouped with factors?

I had identified two special cases (logical and factors) that I wanted to resolve using their levels, but they can be split into two small functions.

>   * character is at the other extreme of the spectrum, since something that appears as a character can actually be any of the types recognized by `scan_data()`.

Character columns can contain data of any type. If you start by determining the proportion of characters in this function, most columns will appear to only contain characters. Detecting the other types first, then assigning the remaining values to character, was the best alternative I could think of. Happy to explore any other suggestions around this.

> It may be easier to start from a blank page to not be influenced by the current code, which seems to push toward more and more complexity. Pen & paper design, with a flow diagram of the different pathways, may help.

I will detach myself from this for a day or two. I will give it a try with typeof() when I get back to it.

@Bisaloo Bisaloo (Member) commented Jul 25, 2024

Maybe as a first question to determine the best way forward: what is the exact role of this function in a data cleaning pipeline? How is its output used to inform the following steps? Why have these specific types been selected as the potential types recognized by scan_data()?

@Karim-Mane Karim-Mane (Member Author) commented Jul 29, 2024

> Maybe as a first question to determine the best way forward: what is the exact role of this function in a data cleaning pipeline? How is its output used to inform the following steps? Why have these specific types been selected as the potential types recognized by scan_data()?

> what is the exact role of this function in a data cleaning pipeline?

If by pipeline you mean a suite of cleaning operations that are executed one after the other, passing the output of one as the input of the next, then scan_data() will not be used within such a pipeline.
It is intended to be used prior to applying any cleaning operations. The aim is to provide the user with a comprehensive view of the content of every column in the dataset, in terms of the proportions of the defined data types.

> How is its output used to inform the following steps?

For example, a column with both numeric and character values might require a call to convert_to_numeric(), or a conversion of the numeric values into character, or setting the values of one of these types to NA. With this information in hand, the user can plan the set of cleaning operations to apply to the dataset, as in the sketch below.
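For instance (an illustrative sketch; the example column and the target_columns argument name are assumptions, not code from this PR):

dat <- data.frame(age = c("25", "32", "forty-six"))
# scan_data(dat) would report `age` as a mix of numeric and character values;
# the user could then decide how to resolve it, e.g. by converting the column:
# dat <- cleanepi::convert_to_numeric(dat, target_columns = "age")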

> Why have these specific types been selected as the potential types recognized by scan_data()?

The selected types are: numeric, date, character, and logical. This is purely based on observation; they are the common types found in most of the datasets I have encountered.
Note that here date is a combination of both Date and date-time (POSIXct and POSIXlt).

How to adjust this function?

These atomic vector types are built on top of the four basic types (integer, double, logical, character). We will define a function that makes use of typeof(). Based on its return value, we will determine the percent of factor, Date, date-time, character, numeric, and logical values in every column of the dataset.
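As a minimal sketch of that idea (the function name and the returned labels are hypothetical, not the merged code):

scan_column_type <- function(x) {
  switch(
    typeof(x),
    logical   = "logical",
    integer   = if (is.factor(x)) "factor" else "numeric",
    double    = if (inherits(x, "Date")) "date"
                else if (inherits(x, "POSIXct")) "date-time"
                else "numeric",
    character = "character",  # character columns then need a deeper, value-level scan
    typeof(x)                 # fall back to the basic type for anything else
  )
}
scan_column_type(Sys.Date())           # "date"
scan_column_type(factor(c("a", "b")))  # "factor"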

@Bisaloo Bisaloo (Member) left a comment

The most important criterion for all the code we produce is that it should be clear and maintainable.

When you write code, please go through it to double-check that this is the case.

In general, this will happen by using standard mechanisms (e.g., to test class) and by being explicit about the values of objects. E.g., if the value is 1, let's write 1.

If standard patterns or mechanisms are unknown to you, a good strategy is to start following the GitHub repo of a known package and look at the changes they are making.

Comment on lines 53 to 54
names(scan_result) <- c("missing", "numeric", "date", "character",
                        "logical", "date-time", "factor")
Bisaloo (Member)

Why did this change? It doesn't give the impression that the list results from an explicit and motivated design choice.

Bisaloo (Member)

It seems that this function is now mostly returning the class(). Can we go back to the drawing board and write a list of requirements:

  • when I have X type of data, I want to know Y to inform Z

> For example, a column with both numeric and character values might require a call to convert_to_numeric(), or a conversion of the numeric values into character, or setting the values of one of these types to NA. With this information in hand, the user can plan the set of cleaning operations to apply to the dataset.

This is a good example for analyzing types in character columns. What about the others?

@Karim-Mane Karim-Mane (Member Author) commented

@Bisaloo - Thanks for taking the time to review this.

I disagree with some of the comments above.
We have been working on this and other packages together. When there is some code that is unclear or unmaintainable, please be specific, as you have been up to now. When the comments are this vague and written in a certain way, it is demoralising.

Let's meet tomorrow to clarify our conceptions of the function and remove all ambiguities around it.

Comment on lines 136 to 144

  # --- get the proportion of character values ---
  are_character <- round((1.0 - (are_na + are_numeric +
                                 are_date + are_logical)), 6L)
  # get the character count
  character_count <- initial_length -
    (na_count + logical_count + numeric_count + date_count)

  # --- return the output ---
  return(c(are_na, are_numeric, are_date, are_character, are_logical))
}

  # transform into proportions
  props <- round(
    c(na_count, numeric_count, date_count, character_count, logical_count) /
      initial_length, 4L)
Bisaloo (Member)

Thinking back about this: are we sure we want to return a proportion? For example, a value could be recognized at the same time as a numeric or as a potential date. Rather than guessing which one it is, shouldn't this be left to the user, who has all the context to make an informed choice?

This means the sum of the returned vector could be more than 1, as we would be counting some values in two categories, but maybe that's not an issue.

Karim-Mane (Member Author)

The guessing part only happens when Date values are identified from the parsing with lubridate::parse_date_time(); otherwise, all numeric values are strictly considered numeric. I think the following could help users:

  1. Add a note about this in the function description.
  2. Add the algorithm used within the function to the details section of the function documentation.

Bisaloo (Member)

With this in mind, I believe my question is still valid: should we do any guessing at all? If we don't have a strong reason to, I'd suggest we don't.

For example, if we have "20210107", should it be a date or a numeric? For a Covid-19 dataset, it could be either. It's probably safer to leave the final choice to the user.
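Both readings do succeed for such a string (an illustrative check):

suppressWarnings(as.numeric("20210107"))
#> [1] 20210107
lubridate::parse_date_time("20210107", orders = "ymd")
#> [1] "2021-01-07 UTC"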

@Karim-Mane Karim-Mane (Member Author) Aug 9, 2024

Here is another proposition from @Degoot-AM:

> I would add a data type called ambiguous, assign the non-differentiable entries to it, and that would make each row sum to 1 (100%).

@Karim-Mane Karim-Mane (Member Author) commented Aug 5, 2024

> Thanks! I think it's going in a good direction. I have left specific technical comments but here are also some general reminders to address at some point before this gets merged:
>
> * [x] Add an explanation in design vignettes and/or function documentation about in which scenarios this function could be useful
>
> * [x] Add an explanation about why only character columns are analyzed
>
> * [ ] Add an explanation about the rationale behind the output format (normalized to 1 or not? Row- vs col-orientation? etc.)

Thanks for listing these.
I have added some text to the design vignette and updated the function documentation. See commit e398860.
I could not, however, provide a rationale behind the use of proportion vs percentage. I guess I am used to proportions...

@Bisaloo Bisaloo (Member) commented Aug 6, 2024

> I could not, however, provide a rationale behind the use of proportion vs percentage. I guess I am used to proportions...

To clarify, I am not talking about proportion vs percentage.

I'm saying that the output could potentially not sum to 1 (or 100%). Given the use case you described, maybe we don't care if we double count some values in different categories (e.g. date and numeric).

We should maybe(?) return a simple "can it be converted to...?" answer, without trying to rely on complex heuristics to differentiate between date & numeric. With this, the user can decide whether the entire column should be date or numeric.

Does this make more sense?

@Karim-Mane Karim-Mane (Member Author) commented

> > I could not, however, provide a rationale behind the use of proportion vs percentage. I guess I am used to proportions...
>
> To clarify, I am not talking about proportion vs percentage.
>
> I'm saying that the output could potentially not sum to 1 (or 100%). Given the use case you described, maybe we don't care if we double count some values in different categories (e.g. date and numeric).
>
> We should maybe(?) return a simple "can it be converted to...?" answer, without trying to rely on complex heuristics to differentiate between date & numeric. With this, the user can decide whether the entire column should be date or numeric.
>
> Does this make more sense?

This is related to the open conversation above. Can we follow up on this topic in that conversation?

@Bisaloo Bisaloo (Member) left a comment

I left two minor comments inline.

I think we're almost there:

  • let's make sure all the points discussed throughout this process are in the design vignette
  • let's finalize the conversation about double counting

@Karim-Mane Karim-Mane (Member Author) commented

> I left two minor comments inline.
>
> I think we're almost there:
>
> * let's make sure all the points discussed throughout this process are in the design vignette
>
> * let's finalize the conversation about double counting

I have accounted for the documentation and double counting in commit 5b405f9.

@Bisaloo Bisaloo (Member) left a comment

I don't think this is behaving as intended:

scan_in_character("20210702", "test")

should return c(0, 1, 1, 0, 0)

scan_in_character(c("20210702", "2021/07/03", "3"), "test")

should return c(0, 0.6666, 0.6666, 0, 0)

It would probably be helpful to add tests that check exact results, in a stricter fashion than just testing class or length.
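For example, a hedged sketch of such a test (the expected values follow the example above; the tolerance absorbs the rounding done inside the function):

testthat::test_that("scan_in_character() counts every detectable type exactly", {
  # expected order: c(missing, numeric, date, character, logical)
  testthat::expect_equal(
    scan_in_character(c("20210702", "2021/07/03", "3"), "test"),
    c(0, 2 / 3, 2 / 3, 0, 0),
    tolerance = 1e-3
  )
})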

# --- get the proportion of logical values ---
are_logical <- round((sum(is.logical(x)) / n_rows), 6L)
# getting the date and numeric count as described above
date_count <- numeric_count <- 0L
Bisaloo (Member)

Suggested change
date_count <- numeric_count <- 0L
date_count <- 0L

Since the value of numeric_count will be set in all possible branches

@Bisaloo Bisaloo (Member) left a comment

I can't post it as a suggestion but I believe this much simpler implementation should work:

scan_in_character <- function(x, x_name) {
  # There might be, within a character column, values of type:
  # character, numeric, date (date or date-time), NA, and logical
  # In this function, we check the presence of these different types within a
  # character column.

  # save the variable length
  # the character count is decreased by the number of occurrence a different
  # data type is found.
  initial_length <- character_count <- length(x)

  # get the count of missing data (NA)
  na_count <- sum(is.na(x))
  x <- x[!is.na(x)]
  character_count <- character_count - na_count

  # We will check whether there are any Date values within the variable by
  # parsing the values, looking for the ones that fit any of the predefined
  # formats.
  #     When there are one or more Date values, they will be converted into
  # numeric. The first numeric count is recorded at this point. The rest of the
  # values are converted into numeric, and the second numeric count is
  # recorded. These will in turn be converted into dates.
  # If any of these numeric values is a Date (a numeric which, after
  # conversion to Date, falls within the interval
  # [50 years back from today's date, today's date]), it will add to the second
  # date count.
  # That way, the Date count is the count of dates identified from the
  # parsing + the count of Dates within the numeric values. Similarly, the
  # numeric count is the count of numeric values within the date values plus
  # the count among the non-date values.
  #
  # NOTE: This is what justifies that the sum across rows in the output object
  # could be > 1.
  #
  #     When no Date values are identified from the parsing, the variable is
  # converted into numeric. The final numeric count is the sum of all the
  # identified numeric values.
  #     The logical count is the number of TRUE and FALSE values, written in
  # either lower or upper case, within the variable.
  #     The remaining values will be considered of type character.

  # parsing the vector, looking for date values
  are_date <- suppressWarnings(
    as.Date(
      lubridate::parse_date_time(
        x,
        orders = c("ymd", "ydm", "dmy", "mdy", "myd", "dym", "Ymd", "Ydm",
                   "dmY", "mdY", "mYd", "dYm")
      )
    )
  )

  date_count <- sum(!is.na(are_date))
  character_count <- character_count - date_count

  # convert everything to numeric and get the numeric count
  are_numeric <- suppressWarnings(as.numeric(x))
  numeric_count <- sum(!is.na(are_numeric))

  numeric_and_date_count <- sum(!is.na(are_date) & !is.na(are_numeric))

  if (numeric_and_date_count > 0) {
    cli::cli_alert_warning(
      "Found {numeric_and_date_count} value{?s} that can be both numeric and \\
      date in column `{x_name}`"
    )
  }

  # We don't want to remove numeric_and_date values twice
  character_count <- character_count - numeric_count + numeric_and_date_count

  # get logical count
  logicals <- toupper(x) == "TRUE" | toupper(x) == "FALSE"
  logical_count <- sum(logicals)
  character_count <- character_count - logical_count

  # transform into proportions
  props <- round(
    c(na_count, numeric_count, date_count, character_count, logical_count) /
      initial_length, 4L
  )

  return(props)
}
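For reference, a call on the example discussed above should give (output worked out by hand from the sketch, so worth double-checking):

scan_in_character(c("20210702", "2021/07/03", "3", "test"), "col1")
#> ! Found 1 value that can be both numeric and date in column `col1`
#> [1] 0.00 0.50 0.50 0.25 0.00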

Please let me know if you prefer that I push it to this branch or as a PR so you can review it more easily.

Comment on lines +192 to +195
cli::cli_alert_warning(
  "Found {ambiguous_count} values that can be either numeric or date in",
  "column `{x_name}`"
)
Bisaloo (Member)

This results in an incomplete sentence:

dat <- data.frame(col1 = c("20210702", "test"))
scan_result <- cleanepi::scan_data(data = dat)
#> ! Found 1 values that can be either numeric or date in

Created on 2024-09-02 with reprex v2.1.1

Let's also handle pluralization while we're at it.

Suggested change
cli::cli_alert_warning(
  "Found {ambiguous_count} values that can be either numeric or date in",
  "column `{x_name}`"
)
cli::cli_alert_warning(
  "Found {ambiguous_count} value{?s} that can be either numeric or date \\
  in column `{x_name}`"
)

Karim-Mane (Member Author)

@Bisaloo - Please create a PR.

# 2 date values (one of them `20210702` is also numeric)
# 2 numeric values in which one is also a date value, hence the
# warning about the presence of ambiguous data
dat <- data.frame(col1 = c(c("20210702", "2021/07/03", "3"), "test"))
Bisaloo (Member)

Suggested change
dat <- data.frame(col1 = c(c("20210702", "2021/07/03", "3"), "test"))
dat <- data.frame(col1 = c("20210702", "2021/07/03", "3", "test"))

Labels: enhancement (New feature or request)
Projects: Status: In Progress
Development: successfully merging this pull request may close issue "scan_data() gives incorrect results"
3 participants