
Fix issue 151 #157

Open · wants to merge 21 commits into main

Conversation

Karim-Mane (Member)

This PR contains changes to address issue #151.

@Karim-Mane Karim-Mane requested a review from Bisaloo July 22, 2024 17:27
@Karim-Mane Karim-Mane self-assigned this Jul 22, 2024
@Karim-Mane Karim-Mane added the enhancement New feature or request label Jul 22, 2024
@Karim-Mane Karim-Mane linked an issue Jul 22, 2024 that may be closed by this pull request
@Bisaloo Bisaloo (Member) left a comment


I have added some technical comments, but I still don't understand the logic.

Could you add a comment at the start of scan_columns() to explain, please?

E.g.:

  • what about numeric values encoded as factors?
  • can a value be of type logical but in reality be a date?
  • are you sure dates won't be counted twice?
as.numeric(as.Date("2020-01-01"))
#> [1] 18262

Created on 2024-07-23 with reprex v2.1.1

Please add more examples & tests with more diverse column types.

tmp <- suppressWarnings(as.numeric(x))
are_numeric <- round((sum(!is.na(tmp)) / n_rows), 6L)
are_numeric <- 0L
verdict <- type == "logical" | type == "factor"
Bisaloo (Member)

Suggested change
verdict <- type == "logical" | type == "factor"
verdict <- type == "logical" || type == "factor"
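A minimal illustration of the difference (a sketch, assuming R >= 4.3, where `||` with an operand of length > 1 is an error; R 4.2 only warned):

type <- class(Sys.time())               # c("POSIXct", "POSIXt"), length 2
type == "logical" | type == "factor"    # vectorized: FALSE FALSE
# type == "logical" || type == "factor" # error: 'length = 2' in coercion to 'logical(1)'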

@Karim-Mane Karim-Mane (Member Author) Jul 24, 2024

I accounted for this in commit 828534d.

# The percent of numeric and character values will automatically be set to 0
# to prevent the effects of the conversion to numeric and character.
#
types <- as.character(vapply(data, class, character(1L)))
Bisaloo (Member)

Do we want to use class() or typeof() here?

Suggested change
types <- as.character(vapply(data, class, character(1L)))
types <- vapply(data, class, character(1L))

Karim-Mane (Member Author)

class() will give us the information that we need: the composition in data types of every column.

typeof() will tell us that date values are double, i.e. numeric. This is not what we want scan_data() to report.

Bisaloo (Member)

But class() doesn't always return an output of length 1, so this will error.

You will see this with a test including POSIXct columns, for example.
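Both points can be checked with base objects (an illustration, not code from this PR):

typeof(Sys.Date())  # "double": dates are stored as numbers
class(Sys.Date())   # "Date" (length 1)
class(Sys.time())   # "POSIXct" "POSIXt" (length 2, so vapply(data, class, character(1L)) errors)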

Karim-Mane (Member Author)

I still believe that typeof() is not appropriate for this task.

I have made changes to account for multiple types in commit db420c8.

@Bisaloo Bisaloo (Member) commented Jul 24, 2024

Please add more examples & tests with more diverse column types.

In particular, there is still an issue with dates.

@Karim-Mane Karim-Mane (Member Author) commented

> Please add more examples & tests with more diverse column types.
>
> In particular, there is still an issue with dates.

I have tested the function across multiple datasets. If you have a dataset with odd cases that are not accounted for yet, please share it with me.

@Bisaloo Bisaloo (Member) left a comment

I don't think this is going in the right direction.

The code is getting more and more complicated, which will inevitably lead to more bugs.

I would recommend:

  • centering the different case treatment on typeof(), which has a finite, known set of outputs
  • thinking about which scan_data() types are possible for each typeof() and creating a custom scan() function for each type. For example:
    • it seems that logical can only lead to logical and NA, so why is it grouped with factors?
    • character is at the other extreme of the spectrum, since something that appears as a character can actually be any of the types recognized by scan_data()

It may be easier to start from a blank page to not be influenced by the current code, which seems to push toward more and more complexity.
Pen & paper design, with a flow diagram of the different pathways, may help.

Comment on lines 139 to 141
get_class <- function(x) {
  paste(class(x), collapse = ",")
}
@Bisaloo Bisaloo (Member) Jul 25, 2024

In general, it's a bad idea to manipulate objects / code as text, especially to circumvent explicit safeguards.

are_date <- 0L
if (any(type %in% c("POSIXct", "POSIXt"))) {
Bisaloo (Member)

This solution doesn't scale, as we can't possibly add special cases for every existing type, especially since packages can define their own types.

@Karim-Mane Karim-Mane (Member Author) commented

> I don't think this is going in the right direction.
>
> The code is getting more and more complicated, which will inevitably lead to more bugs.
>
> I would recommend:
>
> * centering the different case treatment on `typeof()`, which has a finite, known set of outputs
>
> * thinking about which `scan_data()` types are possible for each `typeof()` and creating a custom `scan()` function for each type. For example:
>
>   * it seems that logical can only lead to logical and NA, so why is it grouped with factors?

I had identified two special cases (logical and factors) that I wanted to resolve using their levels, but they can be split into two small functions.

>   * character is at the other extreme of the spectrum, since something that appears as a character can actually be any of the types recognized by `scan_data()`.

Character columns can contain data of any type. If you start by determining the proportion of characters in this function, most columns will appear to only contain characters. Detecting the other types first, then assigning the remaining values to character, was the best alternative I could think of. Happy to explore any other suggestions around this.

> It may be easier to start from a blank page to not be influenced by the current code, which seems to push toward more and more complexity. Pen & paper design, with a flow diagram of the different pathways, may help.

I will detach myself from this for a day or two. I will give it a try with typeof() when I get back to it.

@Bisaloo Bisaloo (Member) commented Jul 25, 2024

Maybe as a first question to determine the best way forward: what is the exact role of this function in a data cleaning pipeline? How is its output used to inform the following steps? Why have these specific types been selected as the potential types recognized by scan_data()?

@Karim-Mane Karim-Mane (Member Author) commented Jul 29, 2024

> Maybe as a first question to determine the best way forward: what is the exact role of this function in a data cleaning pipeline? How is its output used to inform the following steps? Why have these specific types been selected as the potential types recognized by scan_data()?

> what is the exact role of this function in a data cleaning pipeline?

If by pipeline you mean a suite of cleaning operations that are executed one after the other, passing the output of one as the input of the next, then scan_data() will not be used within such a pipeline.
It is intended to be used prior to applying any cleaning operations. The aim is to provide the user with a comprehensive view of the content of every column in the dataset, in terms of the proportions of the defined data types.

> How is its output used to inform the following steps?

For example, a column with both numeric and character values might require a call to convert_to_numeric(), or a conversion of the numeric values into character, or setting the values of one of these types to NA. With this information in hand, the user can plan the set of cleaning operations to apply to the dataset, as in the sketch below.
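For instance (an illustrative sketch; the example column and the target_columns argument name are assumptions, not code from this PR):

dat <- data.frame(age = c("25", "32", "forty-six"))
# scan_data(dat) would report `age` as a mix of numeric and character values;
# the user could then decide how to resolve it, e.g. by converting the column:
# dat <- cleanepi::convert_to_numeric(dat, target_columns = "age")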

> Why have these specific types been selected as the potential types recognized by scan_data()?

The selected types are: numeric, date, character, and logical. This is purely based on observation; they are the common types found in most of the datasets I have encountered.
Note that here date is a combination of both Date and date-time (POSIXct and POSIXlt).

How to adjust this function?

These atomic vector types are built on top of the four basic types (integer, double, logical, character). We will define a function that makes use of typeof(). Based on its return value, we will determine the percent of factor, Date, date-time, character, numeric, and logical values in every column of the dataset.
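As a minimal sketch of that idea (the function name and the returned labels are hypothetical, not the merged code):

scan_column_type <- function(x) {
  switch(
    typeof(x),
    logical   = "logical",
    integer   = if (is.factor(x)) "factor" else "numeric",
    double    = if (inherits(x, "Date")) "date"
                else if (inherits(x, "POSIXct")) "date-time"
                else "numeric",
    character = "character",  # character columns then need a deeper, value-level scan
    typeof(x)                 # fall back to the basic type for anything else
  )
}
scan_column_type(Sys.Date())           # "date"
scan_column_type(factor(c("a", "b")))  # "factor"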

@Bisaloo Bisaloo (Member) left a comment

The most important criterion for all the code we produce is that it should be clear and maintainable.

When you write code, please go through it to double-check that this is the case.

In general, this will happen by using standard mechanisms (e.g., to test class) and by being explicit about the values of objects. E.g., if the value is 1, let's write 1.

If standard patterns or mechanisms are unknown to you, a good strategy is to start following the GitHub repo of a known package and look at the changes they are making.

Comment on lines 53 to 54
names(scan_result) <- c("missing", "numeric", "date", "character",
                        "logical", "date-time", "factor")
Bisaloo (Member)

Why did this change? It doesn't give the impression that the list results from an explicit and motivated design choice.

Bisaloo (Member)

It seems that this function is now mostly returning the class(). Can we go back to the drawing board and write a list of requirements:

  • when I have X type of data, I want to know Y to inform Z

> For example, a column with both numeric and character values might require a call to convert_to_numeric(), or a conversion of the numeric values into character, or setting the values of one of these types to NA. With this information in hand, the user can plan the set of cleaning operations to apply to the dataset.

This is a good example for analyzing types in character columns. What about the others?

@Karim-Mane Karim-Mane (Member Author) commented

@Bisaloo - Thanks for taking the time to review this.

I disagree with some of the comments above.
We have been working on this and other packages together. When there is some code that is unclear or unmaintainable, please be specific, as you have been up to now. When the comments are this vague and written in a certain way, it is demoralising.

Let's meet tomorrow to clarify our conceptions of the function and remove all ambiguities around it.

Comment on lines 136 to 144

  # --- get the proportion of character values ---
  are_character <- round((1.0 - (are_na + are_numeric +
                                 are_date + are_logical)), 6L)
  # get the character count
  character_count <- initial_length -
    (na_count + logical_count + numeric_count + date_count)

  # --- return the output ---
  return(c(are_na, are_numeric, are_date, are_character, are_logical))
}

  # transform into proportions
  props <- round(
    c(na_count, numeric_count, date_count, character_count, logical_count) /
      initial_length, 4L)
Bisaloo (Member)

Thinking back about this: are we sure we want to return a proportion? For example, a value could be recognized at the same time as a numeric or as a potential date. Rather than guessing which one it is, shouldn't this be left to the user, who has all the context to make an informed choice?

This means the sum of the returned vector could be more than 1, as we would be counting some values in two categories, but maybe that's not an issue.

Karim-Mane (Member Author)

The guessing part only happens when Date values are identified from the parsing with lubridate::parse_date_time(); otherwise, all numeric values are strictly considered numeric. I think the following could help users:

  1. Add a note about this in the function description.
  2. Add the algorithm used within the function to the details section of the function documentation.

Bisaloo (Member)

With this in mind, I believe my question is still valid: should we do any guessing at all? If we don't have a strong reason to, I'd suggest we don't.

For example, if we have "20210107", should it be a date or a numeric? For a Covid-19 dataset, it could be either. It's probably safer to leave the final choice to the user.
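Both readings do succeed for such a string (an illustrative check):

suppressWarnings(as.numeric("20210107"))
#> [1] 20210107
lubridate::parse_date_time("20210107", orders = "ymd")
#> [1] "2021-01-07 UTC"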

@Karim-Mane Karim-Mane (Member Author) Aug 9, 2024

Here is another proposition from @Degoot-AM:

> I would add a data type called ambiguous, assign the non-differentiable entries to it, and that would make each row sum to 1 (100%).

@Karim-Mane Karim-Mane (Member Author) commented Aug 5, 2024

> Thanks! I think it's going in a good direction. I have left specific technical comments but here are also some general reminders to address at some point before this gets merged:
>
> * [x] Add an explanation in design vignettes and/or function documentation about in which scenarios this function could be useful
>
> * [x] Add an explanation about why only character columns are analyzed
>
> * [ ] Add an explanation about the rationale behind the output format (normalized to 1 or not? Row- vs col-orientation? etc.)

Thanks for listing these.
I have added some text to the design vignette and updated the function documentation. See commit e398860.
I could not, however, provide a rationale behind the use of proportion vs percentage. I guess I am used to proportions...

@Bisaloo Bisaloo (Member) commented Aug 6, 2024

> I could not, however, provide a rationale behind the use of proportion vs percentage. I guess I am used to proportions...

To clarify, I am not talking about proportion vs percentage.

I'm saying that the output could potentially not sum to 1 (or 100%). Given the use case you described, maybe we don't care if we double count some values in different categories (e.g. date and numeric).

We should maybe(?) return a simple "can it be converted to...?" answer, without trying to rely on complex heuristics to differentiate between date & numeric. With this, the user can decide whether the entire column should be date or numeric.

Does this make more sense?

@Karim-Mane Karim-Mane (Member Author) commented

> > I could not, however, provide a rationale behind the use of proportion vs percentage. I guess I am used to proportions...
>
> To clarify, I am not talking about proportion vs percentage.
>
> I'm saying that the output could potentially not sum to 1 (or 100%). Given the use case you described, maybe we don't care if we double count some values in different categories (e.g. date and numeric).
>
> We should maybe(?) return a simple "can it be converted to...?" answer, without trying to rely on complex heuristics to differentiate between date & numeric. With this, the user can decide whether the entire column should be date or numeric.
>
> Does this make more sense?

This is related to the open conversation above. Can we follow up on this topic in that conversation?

@Bisaloo Bisaloo (Member) left a comment

I left two minor comments inline.

I think we're almost there:

  • let's make sure all the points discussed throughout this process are in the design vignette
  • let's finalize the conversation about double counting

@Karim-Mane Karim-Mane (Member Author) commented

> I left two minor comments inline.
>
> I think we're almost there:
>
> * let's make sure all the points discussed throughout this process are in the design vignette
>
> * let's finalize the conversation about double counting

I have accounted for the documentation and double counting in commit 5b405f9.

@Bisaloo Bisaloo (Member) left a comment

I don't think this is behaving as intended:

scan_in_character("20210702", "test")

should return c(0, 1, 1, 0, 0)

scan_in_character(c("20210702", "2021/07/03", "3"), "test")

should return c(0, 0.6666, 0.6666, 0, 0)

It would probably be helpful to add tests that check exact results, in a stricter fashion than just testing class or length.
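For example, a hedged sketch of such a test (the expected values follow the example above; the tolerance absorbs the rounding done inside the function):

testthat::test_that("scan_in_character() counts every detectable type exactly", {
  # expected order: c(missing, numeric, date, character, logical)
  testthat::expect_equal(
    scan_in_character(c("20210702", "2021/07/03", "3"), "test"),
    c(0, 2 / 3, 2 / 3, 0, 0),
    tolerance = 1e-3
  )
})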

# --- get the proportion of logical values ---
are_logical <- round((sum(is.logical(x)) / n_rows), 6L)
# getting the date and numeric count as described above
date_count <- numeric_count <- 0L
Bisaloo (Member)

Suggested change
date_count <- numeric_count <- 0L
date_count <- 0L

Since the value of numeric_count will be set in all possible branches

@Bisaloo Bisaloo (Member) left a comment

I can't post it as a suggestion but I believe this much simpler implementation should work:

scan_in_character <- function(x, x_name) {
  # There might be, within a character column, values of type:
  # character, numeric, date (date or date-time), NA, and logical
  # In this function, we check the presence of these different types within a
  # character column.

  # save the variable length
  # the character count is decreased by the number of occurrence a different
  # data type is found.
  initial_length <- character_count <- length(x)

  # get the count of missing data (NA)
  na_count <- sum(is.na(x))
  x <- x[!is.na(x)]
  character_count <- character_count - na_count

  # We will check whether there are any Date values within the variable by
  # parsing the values, looking for the ones that fit any of the predefined
  # formats.
  #     When there are one or more Date values, they will be converted into
  # numeric. The first numeric count is recorded at this point. The rest of the
  # values are converted into numeric, and the second numeric count is
  # recorded. These will in turn be converted into dates.
  # If any of these numeric values is a Date (a numeric which, after
  # conversion to Date, falls within the interval
  # [50 years back from today's date, today's date]), it will add to the second
  # date count.
  # That way, the Date count is the count of dates identified from the
  # parsing + the count of Dates within the numeric values. Similarly, the
  # numeric count is the count of numeric values within the date values plus
  # the count among the non-date values.
  #
  # NOTE: This is what justifies that the sum across rows in the output object
  # could be > 1.
  #
  #     When no Date values are identified from the parsing, the variable is
  # converted into numeric. The final numeric count is the sum of all the
  # identified numeric values.
  #     The logical count is the number of TRUE and FALSE values, written in
  # either lower or upper case, within the variable.
  #     The remaining values will be considered of type character.

  # parsing the vector, looking for date values
  are_date <- suppressWarnings(
    as.Date(
      lubridate::parse_date_time(
        x,
        orders = c("ymd", "ydm", "dmy", "mdy", "myd", "dym", "Ymd", "Ydm",
                   "dmY", "mdY", "mYd", "dYm")
      )
    )
  )

  date_count <- sum(!is.na(are_date))
  character_count <- character_count - date_count

  # convert everything to numeric and get the numeric count
  are_numeric <- suppressWarnings(as.numeric(x))
  numeric_count <- sum(!is.na(are_numeric))

  numeric_and_date_count <- sum(!is.na(are_date) & !is.na(are_numeric))

  if (numeric_and_date_count > 0) {
    cli::cli_alert_warning(
      "Found {numeric_and_date_count} value{?s} that can be both numeric and \\
      date in column `{x_name}`"
    )
  }

  # We don't want to remove numeric_and_date values twice
  character_count <- character_count - numeric_count + numeric_and_date_count

  # get logical count
  logicals <- toupper(x) == "TRUE" | toupper(x) == "FALSE"
  logical_count <- sum(logicals)
  character_count <- character_count - logical_count

  # transform into proportions
  props <- round(
    c(na_count, numeric_count, date_count, character_count, logical_count) /
      initial_length, 4L
  )

  return(props)
}
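For reference, a call on the example discussed above should give (output worked out by hand from the sketch, so worth double-checking):

scan_in_character(c("20210702", "2021/07/03", "3", "test"), "col1")
#> ! Found 1 value that can be both numeric and date in column `col1`
#> [1] 0.00 0.50 0.50 0.25 0.00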

Please let me know if you prefer that I push it to this branch or as a PR so you can review it more easily.

Comment on lines +192 to +195
cli::cli_alert_warning(
  "Found {ambiguous_count} values that can be either numeric or date in",
  "column `{x_name}`"
)
Bisaloo (Member)

This results in an incomplete sentence:

dat <- data.frame(col1 = c("20210702", "test"))
scan_result <- cleanepi::scan_data(data = dat)
#> ! Found 1 values that can be either numeric or date in

Created on 2024-09-02 with reprex v2.1.1

Let's also handle pluralization while we're at it.

Suggested change
cli::cli_alert_warning(
  "Found {ambiguous_count} values that can be either numeric or date in",
  "column `{x_name}`"
)
cli::cli_alert_warning(
  "Found {ambiguous_count} value{?s} that can be either numeric or date \\
  in column `{x_name}`"
)

Karim-Mane (Member Author)

@Bisaloo - Please create a PR.

# 2 date values (one of them `20210702` is also numeric)
# 2 numeric values in which one is also a date value, hence the
# warning about the presence of ambiguous data
dat <- data.frame(col1 = c(c("20210702", "2021/07/03", "3"), "test"))
Bisaloo (Member)

Suggested change
dat <- data.frame(col1 = c(c("20210702", "2021/07/03", "3"), "test"))
dat <- data.frame(col1 = c("20210702", "2021/07/03", "3", "test"))

Labels: enhancement (New feature or request)
Projects: Status: In Progress
Development: successfully merging this pull request may close issue "scan_data() gives incorrect results"
3 participants