Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Submission: BaseSet #359

Closed
10 of 29 tasks
llrs opened this issue Jan 20, 2020 · 61 comments
Closed
10 of 29 tasks

Submission: BaseSet #359

llrs opened this issue Jan 20, 2020 · 61 comments

Comments

@llrs
Copy link
Member

llrs commented Jan 20, 2020

Submitting Author: Lluís (@llrs)
Repository: llrs/BaseSet
Version submitted: 0.0.10
Editor: @annakrystalli
Reviewer 1: @arendsee
Reviewer 2: @j23414
Archive: TBD
Version accepted: TBD


  • Paste the full DESCRIPTION file inside a code block below:
Package: BaseSet
Title: Provides classes for working with sets
Version: 0.0.10
Authors@R: 
    person(given = "Lluís ",
           family = "Revilla Sancho",
           role = c("aut", "cre"),
           email = "lluis.revilla@gmail.com")
Description: A set collection, while not "tidy" in itself, can
    be thought of as three tidy data frames describing sets, elements and
    relations respectively. 'BaseSet' provides an approach to manipulate,
    load and use these virtual data frames.
License: MIT + file LICENSE
URL: https://github.com/llrs/BaseSet
BugReports: https://github.com/llrs/BaseSet/issues
Depends: 
    R (>= 3.6.0)
Imports: 
    dplyr (>= 0.7.8),
    magrittr,
    methods,
    rlang,
    utils,
    xml2
Suggests: 
    BiocStyle,
    covr,
    forcats,
    ggplot2,
    GO.db,
    GSEABase,
    knitr,
    org.Hs.eg.db,
    reactome.db,
    rmarkdown,
    spelling,
    testthat (>= 2.1.0),
    Biobase
VignetteBuilder: 
    knitr
Encoding: UTF-8
Language: en-US
LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.0.2
Collate: 
    'AllClasses.R'
    'AllGenerics.R'
    'GMT.R'
    'GeneSetCollection.R'
    'activate.R'
    'add.R'
    'add_column.R'
    'add_relation.R'
    'adjacency.R'
    'arrange.R'
    'basesets-package.R'
    'cartesian.R'
    'complement.R'
    'data_frame.R'
    'deactivate.R'
    'droplevels.R'
    'elements.R'
    'filter.R'
    'group.R'
    'group_by.R'
    'head.R'
    'incidence.R'
    'independent.R'
    'operations.R'
    'intersection.R'
    'length.R'
    'list.R'
    'move_to.R'
    'mutate.R'
    'names.R'
    'naming.R'
    'nested.R'
    'obo.R'
    'power_set.R'
    'print.R'
    'pull.R'
    'relations.R'
    'remove.R'
    'remove_column.R'
    'rename.R'
    'select.R'
    'set.R'
    'size.R'
    'subtract.R'
    'tidy-set.R'
    'union.R'
    'utils-pipe.R'
    'xml.R'
    'zzz.R'

Scope

  • Please indicate which category or categories from our package fit policies this package falls under: (Please check an appropriate box below. If you are unsure, we suggest you make a pre-submission inquiry.):

    • data retrieval
    • data extraction
    • database access
    • data munging
    • data deposition
    • workflow automataion
    • version control
    • citation management and bibliometrics
    • scientific software wrappers
    • database software bindings
    • geospatial data
    • text analysis
  • Explain how and why the package falls under these categories (briefly, 1-2 sentences):

The package implements methods to work on sets, doing intersection, union, complementary and other set operations in a "tidy" way. It also allows to import from several formats used in the life science world. Like the GMT and the GAF or the OBO format file for ontologies.

  • Who is the target audience and what are scientific applications of this package?

The idea is to use the package for working with sets and signatures of genes in scRNAseq or in pathways and ontologies but it might work with other fields.

There is the sets package which implements a more generalized approach, that can store functions or lists as an element of a set (while mine it only allows to store a character or factor), but it is harder to operate in a tidy/long way. Also the operations of intersection and union need to happen between two different objects, while TidySet objects (the class implemented in BaseSet) can store a single set or thousands of them.
In BaseSet is easier to operate and implement new fuzzy logic operations. It is developed openly on github compared to sets which I couldn't track how it is being developed.

The GSEABase partially implements this, but it doesn't allow to store fuzzy sets and it is also quite slow as it creates several classes for annotating each set. Neither does the BiocSets the package, which don't use the fuzzy set logic.

There is also the hierarchicalSets package that is focused on clustering of sets that are inside other sets and visualizations. However, BaseSet is focused on storing and manipulate sets including hierarchical sets.

  • If you made a pre-submission enquiry, please paste the link to the corresponding issue, forum post, or other discussion, or @tag the editor you contacted.

Most of the replies are copied from #339, handeled by @melvidoni.

Technical checks

Confirm each of the following by checking the box. This package:

Publication options

JOSS Options
  • The package has an obvious research application according to JOSS's definition.
    • The package contains a paper.md matching JOSS's requirements with a high-level description in the package root or in inst/.
    • The package is deposited in a long-term repository with the DOI:
    • (Do not submit your package separately to JOSS)
MEE Options
  • The package is novel and will be of interest to the broad readership of the journal.
  • The manuscript describing the package is no longer than 3000 words.
  • You intend to archive the code for the package in a long-term repository which meets the requirements of the journal (see MEE's Policy on Publishing Code)
  • (Scope: Do consider MEE's Aims and Scope for your manuscript. We make no guarantee that your manuscript will be within MEE scope.)
  • (Although not required, we strongly recommend having a full manuscript prepared when you submit here.)
  • (Please do not submit your package separately to Methods in Ecology and Evolution)

Code of conduct

@annakrystalli
Copy link
Contributor

Hi @llrs! 👋

I'll be the editor handling your review. I'm starting the inital editors' checks shortly. Will check back in when they are done. 😊

@annakrystalli
Copy link
Contributor

Hello again @llrs!

Thanks for your patience with the editors checks. I've been traveling but I've also been having some issues with the checks which I'm having to seek some help with. Will be with you shortly, if not with the full checks, at least with those that I am able to perform successfully.

@annakrystalli
Copy link
Contributor

Editor checks:

  • Fit: The package meets criteria for fit and overlap
  • Automated tests: Package has a testing suite and is tested via Travis-CI or another CI service.
  • License: The package has a CRAN or OSI accepted license
  • Repository: The repository link resolves correctly
  • Archive (JOSS only, may be post-review): The repository DOI resolves correctly
  • Version (JOSS only, may be post-review): Does the release version given match the GitHub release (v1.0.0)?

Editor comments

Installation issues

When I try to do a full install and build vignettes, some suggested bioconductor dependencies are not installing and causing errors when attempting to build the vignettes.

devtools::install_github("llrs/BaseSet", dependencies = T, build_vignettes = T)
#> Installing 7 packages: BiocStyle, GO.db, GSEABase, org.Hs.eg.db, reactome.db, Biobase, BH
#> Error: (converted from warning) packages 'BiocStyle', 'GO.db', 'GSEABase', 'org.Hs.eg.db', 'reactome.db', 'Biobase' are not available (for R version 3.6.0)

***
   
   Error: Vignette re-building failed.
   Execution halted
Error in (function (command = NULL, args = character(), error_on_status = TRUE,  : 
  System command error
  

Reviewevers as well as potential future collaborators that will need to run devtools::check() and devtools::document(), need a section in the README or a separate CONTRIBUTING.md documenting the installation instructions required to install all dependencies (including suggested).

eg:

BiocManager::install("BiocStyle")
BiocManager::install("org.Hs.eg.db", type = "source")
BiocManager::install("GO.db", type = "source")
BiocManager::install("reactome.db", type = "source")

# test that package install and docs build successfully
devtools::install_github("llrs/BaseSet", dependencies = T, build_vignettes = T, force = T)

You can check out more of the adventures I had troubleshooting in this issue comment for more details. If you have some extra insight (as I imagine you are more familiar with bioconductor) to add to the discussion it would be more than welcome!

check() and test()

All good

goodpractice::gp() output

A few flags for minor formatting.

It is good practice to

  ✖ avoid long code lines, it is bad for
    readability. Also, many people prefer editor windows
    that are about 80 characters wide. Try make your
    lines shorter than 80 characters

    R/AllClasses.R:105:1
    R/AllGenerics.R:149:1
    R/AllGenerics.R:163:1
    R/length.R:171:1
    R/obo.R:122:1
    ... and 16 more lines

This is actually a bit more important. Is there a reason you are importing methods and utils packages as a whole?

  ✖ not import packages as a whole, as this
    can cause name clashes between the imported
    packages. Instead, import only the specific
    functions you need.
──────────────────────────────────────────────────────────────── 

Test coverage

Test coverage is overall good but is there a reason there are some files that are not tested at all or below 75%?

See the files indicated below:

BaseSet Coverage: 89.13%
R/xml.R: 0.00%
R/elements.R: 62.50%
R/set.R: 62.50%
R/size.R: 71.43%
R/pull.R: 73.33%

@llrs, could you please add the rOpenSci under review badge to your README?

[![](https://badges.ropensci.org/359_status.svg)](https://github.com/ropensci/software-review/issues/359)

Other than these minor issues, I am happy to start seeking reviewers.


Reviewers:
Due date:

@llrs
Copy link
Member Author

llrs commented Jan 29, 2020

Many thanks for your review @annakrystalli

  • Installation problems: Oh, sorry on travis and my computer I didn't had these installation problems. I would follow the BiocViews solution, but usually the easier solution might be to use BiocManager::install("llrs/BaseSet"). Lately there has been some problems with the Bioconductor binaries for Windows related to BiocStyle (Some of your problems might come from this, but I'm not sure as I have not submitted a data package).
  • Regarding long lines: Some of I had considered them easier to read with slightly over 80 character per line, some of them will be me forgetting to reformat them. I'll check each one of those lines.
  • {methods} & {utils}: I thought that it would be easier to work with the S4 class if I imported both packages and I use many functions of them. It is easier for me while writing the package too. They are both installed by default so I did not expect much problems, but I will import only the functions needed.
  • Test coverage: I forgot to delete the xml.R file. I think there isn't a compromise on which format are written with sets on .xml files. The other few packages that are below 75% might be that I didn't check for some exceptions.
  • rOpenSci badge: Will add it.

I'll modify the package to address your points by the beginning of next week I'll have them addressed. I'll let you know when I updated the package.

@annakrystalli
Copy link
Contributor

Hello again @llrs,

Aha, it seems I have been trying to install the package the wrong way all this time, by using devtools::use_github() as I normally do for editors checks when I should be using BiocManager::install()

However, when I try to install using BiocManager::install("llrs/BaseSet", dependencies = TRUE, build_vignettes = TRUE) I still get the same errors for a couple of suggested bioconductor packages I removed to test it out.

I'm really intrigued by the fact that the packages builds successfully for you on TRAVIS. Looking at your travis.yml I see you are testing on r: bioc-devel only. I think this where the testing computational environment and my own (and of reviewers and future collaborators) would differ. I feel also testing on r: release at the very least would give a better idea of what challenges others would encounter. Would you mind trying that out for me?

@llrs
Copy link
Member Author

llrs commented Jan 30, 2020

Some of this errors might be due to a failed built on Bioconductor, they recently made some changes on the checks performed on packages. Could you please post the error you got?

On travis I am only testing on bioc-devel, but on my computer I am using R 3.6.1 with Bioconductor 3.10 and it builds successfully. I'm not sure this difference is the source of the problems you found. However, I'll add the build and test for r release.

@annakrystalli
Copy link
Contributor

So the errors I am getting are the same as I was previously with using devtools::install_github(), namely, any missing bioconductor packages that need installing from source are missing. I've deleted package GO.db from my system and so now, the vignette that loads it is failing. Here's a condensed error output:

BiocManager::install("llrs/BaseSet", dependencies = TRUE, build_vignettes = TRUE)
E  creating vignettes (8.6s)
   --- re-buildingadvanced.Rmdusing rmarkdown
   
   Attaching package: 'dplyr'
   
.
.
.

   Quitting from lines 49-56 (advanced.Rmd) 
   Error: processing vignette 'advanced.Rmd' failed with diagnostics:
   there is no package called 'GO.db'
   --- failed re-buildingadvanced.Rmd--- re-buildingbasic.Rmdusing rmarkdown
   --- finished re-buildingbasic.Rmd--- re-buildingfuzzy.Rmdusing rmarkdown
   --- finished re-buildingfuzzy.RmdSUMMARY: processing the following file failed:advanced.RmdError: Vignette re-building failed.
   Execution halted
Error: Failed to install 'BaseSet' from GitHub:
  System command error, exit status: 1, stdout + stderr (last 10 lines):
E> --- finished re-buildingbasic.RmdE> 
E> --- re-buildingfuzzy.Rmdusing rmarkdown
E> --- finished re-buildingfuzzy.RmdE> 
E> SUMMARY: processing the following file failed:
E>advanced.RmdE> 
E> Error: Vignette re-building failed.
E> Execution halted

The reason this would not fail on your system is because you already have all the necessary package installed, but reviewers or new contributors won't necessarily. The reason I think the travis tests are passing is because of you are testing on bioc-devel. I can't think of why else installation on TRAVIS works while it doesn't for me. Testing on r: release feels like it would give you a more realistic situtation of what manual install steps would be required at any point by developers/ reviewers to be able to fully test and check the package.

Out of interest, you could try removing GO.db locally on your system and see what happens. For info, here's my session info too, ie I'm on a mac not windows.

> session_info()
─ Session info ───────────────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 3.6.0 (2019-04-26)
 os       macOS Mojave 10.14.3        
 system   x86_64, darwin15.6.0        
 ui       RStudio                     
 language (EN)                        
 collate  en_GB.UTF-8                 
 ctype    en_GB.UTF-8                 
 tz       America/Los_Angeles         
 date     2020-01-30Packages ───────────────────────────────────────────────────────────────────────────
 package     * version     date       lib source                             
 assertthat    0.2.1       2019-03-21 [1] CRAN (R 3.6.0)                     
 backports     1.1.5       2019-10-02 [1] CRAN (R 3.6.0)                     
 BiocManager   1.30.10     2019-11-16 [1] CRAN (R 3.6.0)                     
 callr         3.4.1       2020-01-24 [1] CRAN (R 3.6.0)                     
 cli           2.0.1       2020-01-08 [1] CRAN (R 3.6.0)                     
 clipr         0.7.0       2019-07-23 [1] CRAN (R 3.6.0)                     
 crayon        1.3.4       2017-09-16 [1] CRAN (R 3.6.0)                     
 curl          4.3         2019-12-02 [1] CRAN (R 3.6.0)                     
 desc          1.2.0       2018-05-01 [1] CRAN (R 3.6.0)                     
 devtools    * 2.2.1       2019-09-24 [1] CRAN (R 3.6.0)                     
 digest        0.6.23      2019-11-23 [1] CRAN (R 3.6.0)                     
 ellipsis      0.3.0       2019-09-20 [1] CRAN (R 3.6.0)                     
 evaluate      0.14        2019-05-28 [1] CRAN (R 3.6.0)                     
 fansi         0.4.1       2020-01-08 [1] CRAN (R 3.6.0)                     
 fs            1.3.1       2019-05-06 [1] CRAN (R 3.6.0)                     
 glue          1.3.1       2019-03-12 [1] CRAN (R 3.6.0)                     
 htmltools     0.4.0       2019-10-04 [1] CRAN (R 3.6.0)                     
 knitr         1.27        2020-01-16 [1] CRAN (R 3.6.0)                     
 magrittr      1.5         2014-11-22 [1] CRAN (R 3.6.0)                     
 memoise       1.1.0       2017-04-21 [1] CRAN (R 3.6.0)                     
 packrat       0.5.0       2018-11-14 [1] CRAN (R 3.6.0)                     
 pkgbuild      1.0.6       2019-10-09 [1] CRAN (R 3.6.0)                     
 pkgload       1.0.2       2018-10-29 [1] CRAN (R 3.6.0)                     
 prettyunits   1.1.1       2020-01-24 [1] CRAN (R 3.6.0)                     
 processx      3.4.1       2019-07-18 [1] CRAN (R 3.6.0)                     
 ps            1.3.0       2018-12-21 [1] CRAN (R 3.6.0)                     
 R6            2.4.1       2019-11-12 [1] CRAN (R 3.6.0)                     
 Rcpp          1.0.3       2019-11-08 [1] CRAN (R 3.6.0)                     
 remotes       2.1.0       2019-06-24 [1] CRAN (R 3.6.0)                     
 reprex      * 0.3.0       2019-05-16 [1] CRAN (R 3.6.0)                     
 rlang         0.4.3       2020-01-24 [1] CRAN (R 3.6.0)                     
 rmarkdown     2.1         2020-01-20 [1] CRAN (R 3.6.0)                     
 rprojroot     1.3-2       2018-01-03 [1] CRAN (R 3.6.0)                     
 rstudioapi    0.10.0-9000 2019-05-30 [1] Github (rstudio/rstudioapi@31d1afa)
 sessioninfo   1.1.1       2018-11-05 [1] CRAN (R 3.6.0)                     
 testthat    * 2.3.1       2019-12-01 [1] CRAN (R 3.6.0)                     
 usethis     * 1.5.1.9000  2020-01-29 [1] Github (r-lib/usethis@4194fd6)     
 whisker       0.4         2019-08-28 [1] CRAN (R 3.6.0)                     
 withr         2.1.2       2018-03-15 [1] CRAN (R 3.6.0)                     
 xfun          0.12        2020-01-13 [1] CRAN (R 3.6.0)                     

[1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library

@llrs
Copy link
Member Author

llrs commented Feb 1, 2020

I've updated the package as requested on your initial review.

However, I think that the configuration for .travis.yaml that better matches CRAN is using bioc-release. Because otherwise it doesn't install the Bioconductor packages. I hope that with the new instructions to install the Bioconductor packages reviewers and users will be able to install all of its dependencies.

@annakrystalli
Copy link
Contributor

Hi @llrs,

Sorry, I probably didn't make myself clear enough. For sure you should be testing on bioc-release (perhaps on bioc-devel also). What I was suggesting is that you run a matrix of jobs that include testing on r-release and including any manual setup code during the build to both test that any awkward dependencies required for a reviewer/collaborator to work with the source code locally have:
a) been identified
b) their installation has been successfully specified.

Does that make sense? I've opened a PR to your repo with my attempt at implementing this. Let's see if it's successful!

@llrs
Copy link
Member Author

llrs commented Feb 4, 2020

Well, my impression was that CI should mimic the steps done on CRAN, not for the users. The steps are different, given that CRAN installs Bioconductor packages to be able to build packages which depend on Bioconductor.
That's why I modified the .travis.yml to only test on bioc-release, which is also what users need in order to install BaseSet. But it makes sense also to test the specific steps that users need to do.

I will try to modify the github actions to test also on OS, windows and Linux in both conditions to be sure.

@annakrystalli
Copy link
Contributor

annakrystalli commented Feb 10, 2020

Hello again @llrs 👋

I'm still looking for reviewers currently. I had a potential reviewer respond recently that sadly they did not have time at the moment but they also made the following comment:

the description in the Readme doesn't give me enough information to figure out what this package is about.

I generally agree with the comment and was imagining it would be picked up during review but I now think it's probably best to deal with it now as it might be making it difficult to find reviewers.

So could you add a little more detail to the README as well as the description in the DESCRIPTION file? In particular, the information you have included here (in the package submission issue) gives a good idea of the strengths and domains of the package which are not so clear currently in the README.

@llrs
Copy link
Member Author

llrs commented Feb 10, 2020

Hi @annakrystalli,

Thanks for sending the feedback, I appreciate it. I've edited both the DESCRIPTION and the README. Hope this way it will be easier to get reviewers.

@llrs
Copy link
Member Author

llrs commented Feb 14, 2020

@annakrystalli I've been experimenting with macOS builders and using BiocManager::install('llrs/BaseSet', dependencies = TRUE, build_vignettes = TRUE) it could install BaseSet and all its dependencies from Bioconductor without problems.
The test was on Catalina OS, CRAN checks the packages with El Captain, and you are using Mojave version. I'm not sure if I can find a CI with Mojave.

@noamross
Copy link
Contributor

⚠️⚠️⚠️⚠️⚠️

In the interest of reducing load on reviewers and editors as we manage the COVID-19 crisis, rOpenSci is temporarily pausing new submissions for software peer review for 30 days (and possibly longer). Please check back here again after 17 April for updates.

In this period new submissions will not be handled, nor new reviewers assigned. Reviews and responses to reviews will be handled on a 'best effort' basis, but no follow-up reminders will be sent.

Other rOpenSci community activities continue. We express our continued great appreciation for the work of our authors and reviewers. Stay healthy and take care of one other.

The rOpenSci Editorial Board

⚠️⚠️⚠️⚠️⚠️

@llrs
Copy link
Member Author

llrs commented Apr 27, 2020

About the installation problems there have been a discussion about problems installing packages from source and from binaries. See these messages from Bioc-devel mailing list: https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016689.html. In summary it seems that there is a bug in the logic of install.packages when mixing binary and source installs.

Just for future references

@annakrystalli
Copy link
Contributor

⚠️⚠️⚠️⚠️⚠️
In the interest of reducing load on reviewers and editors as we manage the
COVID-19 crisis, rOpenSci new submissions for software peer review are paused.

In this period new submissions will not be handled, nor new reviewers assigned.
Reviews and responses to reviews will be handled on a 'best effort' basis, but
no follow-up reminders will be sent.
Other rOpenSci community activities continue.

Please check back here again after 25 May when we will be announcing plans to slowly start back up.

We express our continued great
appreciation for the work of our authors and reviewers. Stay healthy and take
care of one other.

The rOpenSci Editorial Board
⚠️⚠️⚠️⚠️⚠️

@llrs
Copy link
Member Author

llrs commented Jun 3, 2020

@annakrystalli It's been a week since last message, and I haven't seen any plan announced here, on the website or on twitter. Is there a plan to start again looking for reviewers ?
PS: I am considering submitting directly to CRAN without waiting for a review from the volunteers of rOpenSci if it fits the new scope.

@annakrystalli
Copy link
Contributor

Hi @llrs,

Apologies for the slow motion on here. I've had a reviewer agree to review before lockdown but I'm still having trouble locating the second one. I should have communicated that here last week. Indeed if you have any suggestions for potential reviewers (that would not have a conflict of interest) they would be gratefully received.

Regarding CRAN I think it is fine to submit to CRAN independently of the rOoenSci review. You could always just push a new release to CRAN once the rOpenSci review is done.

As for publicising our return to reviewing, we are currently just gauging submission rates before we make a full announcement next week.

@llrs
Copy link
Member Author

llrs commented Jun 3, 2020

Hi @annakrystalli Thanks for the quick reply, hope that everything is fine at your place.

I don't know how important is to wait for the second reviewer, maybe the first one can start without waiting for the second one? There are some authors of related packages that I would appreciate their feedback even if they have some conflict of interest against the package (I pasted their Github username): RaphaelS1, kevinrue.

@annakrystalli
Copy link
Contributor

The conflict of interest is for us really (rOpenSci) to ensure we get a fair and unbiased review. In any case I could probably start the review with just a single reviewer. Let me just run all this past the rest of the editors and get back to you.

@annakrystalli
Copy link
Contributor

Success!!! We have now both our reviewers! 🥳🙌

Thanks so much for agreeing to review @arendsee and @j23414 👍


Reviewer: @arendsee
Reviewer: @j23414
Due date: 2020-06-29

@arendsee
Copy link

The following statement in the README is not right: "On fuzzy sets, elements have a probability to belong to a set".

In a probabilistic framework, every element is in a set or not in it. We are uncertain which sets an element is in, but the reality is still binary. In a fuzzy framework, the reality is vague. For example, say we are interested in which genes are highly expressed in a tissue. We could transform RNA-seq data for expression levels to a 0-1 scale, where 1 indicates the gene is very highly expressed, 0.6 is fairly highly expressed, 0.3 is weakly expressed, etc. The reality here is not binary.

If you use fuzzy methods on probabilistic data, as you do in the README and the fuzzy vignette, the results will be wrong except in very special cases. The probability that a particular element is in A or B is P(A) + P(B) - P(A and B). This is a law of probability. Application of the fuzzy operator to probabilistic terms would imply that P(A or B) = max(P(A), P(B)), which is correct only when P(B) = P(A and B) or P(A) = P(A and B). In other words, it is true only when one set contains the other set.

As @j23414 mentioned, the probability function (assuming independence) can be defined to work over fuzzy values (Smithson, 2006). Take the statement, "I am tall OR strong". Now say my tall stat is 0.3 and my strong stat is 0.6. The truth value of this statement is 0.6, using the max operator. This intuitively seems reasonable. Similarly, it makes intuitive sense that the statement, "I am tall AND strong" has a truth value of 0.3. We can do similar calculations using the probability function where, assuming independence, we get 0.6 + 0.3 - 0.3 * 0.6 = 0.72. This is a different truth value because we've redefined the fuzzy space by using a different operator. This different definition of fuzziness has some nice properties, so we may choose to use it for certain applications.

However, we don't get to define our own laws of probability. If we are given probabilistic data, then we have to use probabilistic operators. We can't freely change the data from membership probabilities to fuzzy truth values; these mean different things.

These distinctions need to be VERY clear in the README and vignette. If they aren't clear, a careless user could easily assume the operators are probabilistic and get completely wrong results.

@annakrystalli
Copy link
Contributor

Thank you @arendsee and @j23414 for your super timely and thorough reviews! Over to you now @llrs.

@llrs
Copy link
Member Author

llrs commented Jun 29, 2020

@arendsee Perhaps that sentence on the README is not suited for some logics or types of fuzzy sets. I am struggling to introduce the topic for people not familiar with fuzzy sets and at the same time support and explain what the package can do to more expert users. I'll carefully review the wording around the subject.

I agree that it is different to work with membership functions (Gene A 0.3 RNA levels means low, 0.6 means highly expressed) than with probabilities (Gene A is on 60% of cases highly expressed, whatever its value is). However, assigning a set from a measure is, nowadays, out of the scope of the package, this is either provided by the source of data or something the user should do.

About probabilities of A or B, it is still unclear to me over what do you calculate that probability or when would you like to do this. When I'll post a longer reply I'll provide some examples and questions about this, as I tend to understand something better with a practical question or exercise. On your example tall OR strong, you don't use your previous definition that "the probability that an element is in set A or set B". What other elements are present on the height and strength set (and what are their probabilities)? And why do you use height and strength stat as probability or truth value? They are different things, that's why membership functions are used, or are they already the fuzzy values?

While users should not use different logics on the same data the package aims to allow them to use whatever logic they want according to the knowledge they have about the data. But I can't guide them or provide feedback.

The fuzzy set theory extends the probability framework as explained on a source @j23414 provided:

... fuzzy researchers have gone to great pains to distance
themselves from probability. But in so doing, many of them have lost track
of another point, which is that the converse DOES hold: all probability
distributions are fuzzy sets! As fuzzy sets and logic generalize Boolean
sets and logic, they also generalize probability.


Hope to correct the points raised by @j23414 and post about the set theory and fuzzy-set operators that both reviewers had questions by the end of the week.

@j23414
Copy link

j23414 commented Jun 29, 2020

Hmm, that quoted source is new to me (Bezdek, 1993)... I provided links to the set theory chapter (Smithson 2006), membership value definition, and confidence value definition.

No matter : ) thanks for sharing the source! There definitely seem to be some murkiness to when and where is appropriate to use Fuzzy Sets.

For @arendsee's tall OR strong example, he seems to be illustrating the different calculation and final value in:

Case 1) if tall (0.3) and strong (0.6) are treated as fuzzy truth values:
then tall OR strong = max (tall, strong) = max(0.3, 0.6) = 0.6

Case 2) if tall (0.3) and strong (0.6) are treated as probabilities (and assume independence): **
then tall OR strong = P(tall) + P(strong) - P(tall and strong) = 0.3 + 0.6 - (0.3 x 0.6) = 0.72

**You can also think of this as a tall-ness gene where you are "more likely to be tall, but not guaranteed to be tall"

The union of probabilities (Case 2) is usually explained and illustrated as a venn diagram (link with explanation here).

@arendsee
Copy link

arendsee commented Jun 29, 2020

@arendsee Perhaps that sentence on the README is not suited for some logics or types of fuzzy sets. I am struggling to introduce the topic for people not familiar with fuzzy sets and at the same time support and explain what the package can do to more expert users. I'll carefully review the wording around the subject.

The first sentence in the README does not apply to the specific logic you are using in the package. That operator you are using will not return membership probabilities. Using that operator to infer union membership probability will give you a meaningless result.

About probabilities of A or B, it is still unclear to me over what do you calculate that probability or when would you like to do this.

I'm referring to the union operator as it is used in your fuzzy vignette:

set.seed(4567) # To be able to have exact replicates
relations <- data.frame(sets = c(rep("A", 5), "B", "C"),
                          elements = c(letters[seq_len(6)], letters[6]),
                          fuzzy = runif(7))
fuzzy_set <- tidySet(relations)
> fuzzy_set
  elements sets     fuzzy
1        a    A 0.2309186
2        b    A 0.7412554
3        c    A 0.1833834
4        d    A 0.4221682
5        e    A 0.5996399
6        f    B 0.2773313
7        f    C 0.7307312
> BaseSet::union(fuzzy_set, sets = c("C", "B"))
  elements sets     fuzzy
1        f  C∪B 0.7307312

The union function is intended to calculate the probability that f is in C or B.

P(f in B) = 0.277
P(f in C) = 0.731

Now, from the laws of probability: P(f in (B or C)) = 0.277 + 0.731 - P(f in (B and C))
If we assume independence (which is VERY dicey), then P(f in (B or C)) = 0.277 + 0.731 - 0.277 * 0.731

Your union function calculates the wrong probability.

If you want to use the current logic, then you need to strongly state that the fuzzy values are NOT probabilities.

For @arendsee's tall OR strong example, he seems to be illustrating the different calculation and final value in:

Case 1) if tall (0.3) and strong (0.6) are treated as fuzzy truth values:
then tall OR strong = max (tall, strong) = max(0.3, 0.6) = 0.6

Case 2) if tall (0.3) and strong (0.6) are treated as probabilities (and assume independence): **
then tall OR strong = P(tall) + P(strong) - P(tall and strong) = 0.3 + 0.6 - (0.3 x 0.6) = 0.72

@j23414 I think I didn't explain my point here clearly. The tall or strong example was purely fuzzy. The data is fuzzy and probabilistic methods do not apply. I was describing two options for operators over fuzzy data. The first is the "standard" fuzzy operator that is currently used in BaseSet. The second is another fuzzy operator that happens to be the same as the probability of independent events, but the interpretation is different. Specifically, 0.72 is not the probability that the statement, "I am tall or strong", is true, rather it is a measure of how true the statement is. These two cases show two artificial metrics for fuzziness.

My point was that there are different fuzzy union operators. One of the fuzzy operators looks like the probabilistic operator for independent events. But just because this one operator can be used for set membership (assuming independence), does not mean any fuzzy operator can be used. There are many fuzzy metrics, but only one correct probability.

We don't get to choose whether data we are given is probabilities or fuzzy truth values.

While users should not use different logics on the same data the package aims to allow them to use whatever logic they want according to the knowledge they have about the data. But I can't guide them or provide feedback.

The purpose of the README and vignettes is to guide them. But your examples, both in the README and vignettes, incorrectly use fuzzy operators to infer the probabilities of memberships in a set union.

The fuzzy set theory extends the probability framework as explained on a source @j23414 provided:

... fuzzy researchers have gone to great pains to distance
themselves from probability. But in so doing, many of them have lost track
of another point, which is that the converse DOES hold: all probability
distributions are fuzzy sets! As fuzzy sets and logic generalize Boolean
sets and logic, they also generalize probability.

Yes, fuzzy set theory is MORE general than probability. This means that probability is a special case of fuzzy logic. Thus a specific logic within fuzzy set theory applies to probabilistic systems. This does not mean that probability theory equals fuzzy theory. And it emphatically does not mean that any fuzzy logic applies to probability. Specifically, the fuzzy operator you are using for unions is not the same as the probability operator.

in probability theory: P(A or B) = P(A) + P(B) - P(A and B)
in standard fuzzy theory: F(A or B) = max(F(A), F(B))

P(A or B) is not equal to F(A or B), therefore you simply can't use that operator to calculate the probability of membership in a set union.

Maybe we should invite a statistician and/or a fuzzy set theorist to comment.

@annakrystalli
Copy link
Contributor

Hello all. It looks like some guidance from a statistician might indeed be a good idea. This is technically not part of package review but I'll see what other editors think and let you know what we can do.

@llrs
Copy link
Member Author

llrs commented Jul 13, 2020

Review 2:

I'll only reply to comments or points that weren't checked or there were some substantial comments:

  • Vignette(s) demonstrating major functionality that runs successfully locally
    Now the vignettes run succesfully at least on windows, ubuntu, and I think the problems on macOs of the CI are not related to the vignette.

  • Examples for all exported functions in R Help that run successfully locally
    Corrected examples

  • Automated tests: Unit tests cover essential functions of the package
    Fixed the failing tests and added some more

  • Functionality: Any functional claims of the software been confirmed.

  • Pluralized arguments of set_size and element_size to match the expected name of columns

  • Modified the examples so that sets are represented by upper case letters and elements by lower case letters.

  • About the weird behavior of set_size()
    Your first try to modify the set and convert to a fuzzy set you'll need to assign the modified ecoli_set to be able to use the fuzzy values (or pipe it as shown below). The TidySet is of class S4, not S6, and the modifications are not in place.

ecoli_sets %>%
      mutate(
      fuzzy = case_when(sets == "GL" ~ 0.2,    # Add fuzzy values
      sets == "CF" ~ 0.8)) %>%
      set_size()

Yes, the GL has 0.04 probability to have two genes so the glycolysis pathway probably don't have any genes. That's is if we assume that the fuzzy values are probabilities, if they where other things it wouldn't be accurate. I think this is different from cardinality, as cardinality it is just a number for a set, while here I return several values for a single set. I added to the documentation how is this calculated (via length_set).

I provided a new method cardinality to calculate it and allow to provide a fuzzy logic function. The {sets} package mentioned previously provides cardinality as the number of elements in a set regardless of the fuzzy value. However, in this case the default is sum as you found, but length as in {sets} or other functions could be provided.

  • Vignette comments

Changed the automatic names for the full names, also added some links for references to the data and how to interpret it. I tried to use some fuzzy values on the advanced vignette, but it would take too long to calculate for many sets and would make the package fail the checks.


Fuzzy logic & explanation

Many thanks for all the feedback on this, I realized that I assumed too many things. I modified or added explanations about the fuzzy values around the package.

I modified the README section about fuzzy sets, now it reads : "Fuzzy sets are similar to classical sets but there is some vagueness on the relationship between the element and the set.".

Similarly I've added a new section on the README about why this package was developed and why it could be useful to others. Also introduced the topic on the fuzzy vignette about what are they and some links.

I do not want to force an interpretation of the fuzzy theory which sometimes translate to using different semantics: membership, probabilities, truth vales, and others. I think that the fuzzy column is the most neutral and better understood. There seems to be a movement from fuzzy research to differentiate it from probabilities, but at the same time to expand them(probabilities).

One can use any logic according to the specific framework one wish to work. This in encouraged through the package by being able to use any logic. To show the flexibility of the package using the same data I've added some examples with a different logic explaining when this logic could be applied. Hope now it is clear that the package does not forces to use a certain logic.

The defaults parameters of the package are meant to provide sensible results with fuzzy sets but to not give surprises on classical sets. On the specific case of union I added an example on the documentation that I hope it makes it clear how to use other logics outside the default for the specific case of probabilities, union(TS, sets = c("A", "B"), FUN = function(x){y <- sum(x)-prod(x); ifelse(length(x)==1, x, y)}).

Given all the discussion around how to interpret the fuzzy sets, I would appreciate your thoughts on when do fuzzy sets arise on practical cases. How do imagine a case where there are multiple cases which might be related or not to a set? For instance imagine I want to record some cards on an intersection; the first blue car goes right, second blue car goes right, third blue car goes left. Would you record as a fuzzy/probability set (set being here right/left) where a blue car has 0.66 relationship with going right and 0.33 going left? Or as a classical set with all three cars two of them on the right set and one on the left set? If it is a fuzzy set, then which logic would you apply?

@arendsee
Copy link

The term sum(x) - prod(x) is only equal to the probability of the union when x has length 2. Calculating the union probabilities is more complicated for more than 2 sets. For 3 sets it is P(A or B or C) = P(A) + P(B) + P(C) - P(A and B) - P(B and C) - P(C and A) + P(A and B and C). I recommend adding test cases for sets of 2, 3 and 4 (at least).

Think about the case of three sets with probability of membership for each being 1. The probability of the union would also be 1. But sum(x) - prod(x) == 3 - 1 == 2, which of course isn't a valid probability. In general, the union of any vector of numbers between 0 and 1 should also be between 0 and 1 (you could write this into a unit test).

I don't think it is necessary for initial acceptance into rOpenSci, but you should eventually consider how to relax the independence assumption. I'd definitely recommend talking to a statistician (statisticians are great).

I think fuzzy sets have lots of applications within bioinformatics. For example, someone might want to know whether metabolite A and metabolite B both have high concentrations in a given tissue. You could scale the molar concentrations to values between 0 and 1 where 0.5 is the "normal" value.

@llrs
Copy link
Member Author

llrs commented Jul 14, 2020

@arendsee I added a helper function union_probability to calculate the union probabilities for an unlimited size (only tested for 3 and 4 though). However I feel that this is out of the scope of the package.

The validation of the TidySet class includes a test that check that fuzzy values are between 0 and 1 (both included). There is no need to add a unit test as it is automatically checked with each exported method (via validObject).

As previously said I don't assume independence or dependence of sets in the package, the package doesn't even assume that the fuzzy sets are probabilities! This is and will be the responsibility of the user, mainly because from the data stored the package can't make any educated guesses about the type of the relationships between sets. I'll ask a statistician about relaxing independence between sets.

I agree that they have lots of applications within bioinformatics, that's why I wrote the package: to make it easier to use them. Thank you for all the feedback to make it even easier to use it.

What do you think of the questions on the last paragraph on my previous comment?

@llrs
Copy link
Member Author

llrs commented Jul 28, 2020

Hi @annakrystalli, @j23414, @arendsee.
Is there anything else I can do to proceed forward?

@annakrystalli
Copy link
Contributor

Hi @llrs I believe you had requested a response to one of your questions by @arendsee ?

If both @arendsee and @j23414 are happy with your responses to their reviews, I am happy to conclude this review

@llrs
Copy link
Member Author

llrs commented Jul 28, 2020

Yes, along the review I made several questions which I didn't get an answer yet.

Also I don't know what was decided about reaching a statistician (from above):

Hello all. It looks like some guidance from a statistician might indeed be a good idea. This is technically not part of package review but I'll see what other editors think and let you know what we can do.

@j23414
Copy link

j23414 commented Jul 28, 2020

Plan to take a look again this week, thanks for the changes. And thanks @annakrystalli for organizing

@arendsee
Copy link

I'm double-checking some of the math. I'll get back with you in a bit.

@arendsee
Copy link

OK, the math checks out. Also, the extended vignette clarifies the fuzziness relationship.

I think this package will have a good future. There are a lot of neat directions you could take it. I think it will be good to share it with the community and get feedback. The package may incite a lot of debate, which is good. Once you and the community have hammered out the wrinkles and chosen the features and interface that is "best", maybe you can roll out a non-backwards compatible BaseSet2 package.

So, my final thoughts: the package is useful as is and will get much better as an active community grows around it.

@j23414
Copy link

j23414 commented Aug 3, 2020

The new version passes checks so far... took a while to build the advanced vignette but I can see that fuzzy values were added (good). Glad to see more explanations and links in README, Fuzzy, and Advanced. For set_size, the new cardinality function matches what I'd expect in fuzzy sets.

In summary, can see how the package may be useful. Agree that it can continue to be refined with an active community.

@annakrystalli
Copy link
Contributor

Approved! Thanks @llrs for submitting and @arendsee & @j23414 for your reviews! 🎉 🚀

To-dos:

  • Transfer the repo to rOpenSci's "ropensci" GitHub organization under "Settings" in your repo. I have invited you to a team that should allow you to do so. You'll be made admin once you do.
  • Fix all links to the GitHub repo to point to the repo under the ropensci organization.
  • If you already had a pkgdown website and are ok relying only on rOpenSci central docs building and branding,
    • deactivate the automatic deployment you might have set up
    • remove styling tweaks from your pkgdown config but keep that config file
    • replace the whole current pkgdown website with a redirecting page
    • replace your package docs URL with https://docs.ropensci.org/package_name
    • In addition, in your DESCRIPTION file, include the docs link in the URL field alongside the link to the GitHub repository, e.g.: URL: https://docs.ropensci.org/foobar (website) https://github.com/ropensci/foobar
  • Fix any links in badges for CI and coverage to point to the ropensci URL. We no longer transfer Appveyor projects to ropensci Appveyor account so after transfer of your repo to rOpenSci's "ropensci" GitHub organization the badge should be [![AppVeyor Build Status](https://ci.appveyor.com/api/projects/status/github/ropensci/pkgname?branch=master&svg=true)](https://ci.appveyor.com/project/individualaccount/pkgname).
  • We're starting to roll out software metadata files to all ropensci packages via the Codemeta initiative, see https://github.com/ropensci/codemetar/#codemetar for how to include it in your package, after installing the package - should be easy as running codemetar::write_codemeta() in the root of your package.

Should you want to acknowledge your reviewers in your package DESCRIPTION, you can do so by making them "rev"-type contributors in the Authors@R field (with their consent). More info on this here.

Welcome aboard! We'd love to host a post about your package - either a short introduction to it with an example for a technical audience or a longer post with some narrative about its development or something you learned, and an example of its use for a broader readership. If you are interested, consult the blog guide, and tag @stefaniebutland in your reply. She will get in touch about timing and can answer any questions.

We've put together an online book with our best practice and tips, this chapter starts the 3d section that's about guidance for after onboarding. Please tell us what could be improved, the corresponding repo is here.

@annakrystalli
Copy link
Contributor

Thanks for transferring @llrs ! I've made you admin again 👍

@llrs
Copy link
Member Author

llrs commented Aug 4, 2020

Many thanks @annakrystalli for approving the package. I already transferred the package and hope to finish all the other steps either today or tomorrow.

Soon I'll begin my holidays and won't be able to contribute with a post for the blog, but I'd like to write one after September @stefaniebutland. But I'm also hoping to present on the Rstudio conference on January... not sure when I'll manage to write it up.

@annakrystalli
Copy link
Contributor

Many thanks @annakrystalli for approving the package. I already transferred the package and hope to finish all the other steps either today or tomorrow.

No problem. Just ping when you are done. And enjoy your holidays! Well deserved 😎🏝

@stefaniebutland
Copy link
Member

I'd like to write one after September @stefaniebutland. But I'm also hoping to present on the Rstudio conference

@llrs I just saw your tweet about your RStudio talk! This is wonderful. I'll mark my calendar to contact you in late September about a post.

Enjoy your holidays!

@llrs
Copy link
Member Author

llrs commented Aug 5, 2020

@annakrystalli I think I finished with all the required changes. Let me know if I missed anything. (I haven't submitted to CRAN yet. I'll wait until another submitted package is accepted to submit BaseSet to CRAN)

@annakrystalli
Copy link
Contributor

Hey @llrs, everything looks good to me. A couple of things to note:

  • A couple of badges are showing as unknown (even though it looks like you've correctly set the new URLs). Might want to trigger a build to reanimate them.
  • The CMD-CHECK GitHub Action & Travis seem to be failing. You might want to look into that.

Otherwise, I'm happy to conclude the review and close the issue. Now it's time for my holiday too! 😎🏝🍹

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants