about data sets that are used in samples, tests, docs, etc. #276
Replies: 3 comments 8 replies
-
Thank you for bringing up this topic. Since we started testing the functionality of fsharp.stats, we have used various ways of importing the test data. I think we have now reached a point where the data import could be generalized. I recently talked with @mathias-brandewinder and @kMutagene about the lack of a place in fslab for standard datasets. There actually is a dedicated repository (https://github.com/fslaborg/datasets), but it needs to be overhauled. The datasets could be offered here either as record collections or as ready-formatted data frames.
-
Started to play with this idea a bit, focusing on the record collection direction:
-
I've been pondering this a bit on my own, but my current focus is on a solar tracker project (WildernessLabs/IoT). With that caveat, here is my thinking 🧠, on which I'm really of two minds:
Depending on your perspective (and previous experience), either option might be viewed as "one-stop shopping" while the other could be seen as an undue burden.
-
At the recent fsharpConf, Beth Milhollin gave a talk, "My Leap from R to F#", in which she highlights how valuable commonplace datasets are for data science communities: in R they are part of the package ecosystem (and the same applies in Python, AFAICS), and they support learning the various packages in each ecosystem.
I think this repository already contains a few datasets, along with engineering challenges related to:
Overall, I think this repository should take those concerns into consideration, and that we should start exploring something that addresses its primary maintenance needs and makes some of the samples work out of the box, without having to tweak file paths, run FSI from a different directory, etc.
A first approach would be to have a build step (which I'll call a script for now) that scans the folders where, by agreed convention, the "packaged datasets" reside at build and packaging time, and compares the scan result against a dataset that describes each dataset.
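To make the scan-and-compare idea concrete, here is a minimal sketch of such a script. Everything in it is an assumption for illustration: the `DatasetEntry` record, the function names, the `*.csv` filter, and the in-memory manifest are hypothetical and not part of this repository. The demo files at the end are placeholders created in a temp folder.

```fsharp
open System
open System.IO

// Hypothetical manifest entry: one record per packaged dataset.
type DatasetEntry = { Name: string; RelativePath: string }

/// Scan the conventional dataset folder and return the CSV files found,
/// as paths relative to the root with '/' separators.
let scanDatasets (root: string) : Set<string> =
    Directory.EnumerateFiles(root, "*.csv", SearchOption.AllDirectories)
    |> Seq.map (fun path -> Path.GetRelativePath(root, path).Replace('\\', '/'))
    |> Set.ofSeq

/// Compare the scan result with the manifest.
/// Returns (files on disk that the manifest does not describe,
///          manifest entries whose file is missing on disk).
let compareWithManifest (manifest: DatasetEntry list) (found: Set<string>) =
    let described = manifest |> List.map (fun e -> e.RelativePath) |> Set.ofList
    let undescribed = Set.difference found described
    let missing = Set.difference described found
    undescribed, missing

// Demo with placeholder files (assumed layout, not real datasets).
let root = Path.Combine(Path.GetTempPath(), "datasets-demo-" + Guid.NewGuid().ToString("N"))
Directory.CreateDirectory(Path.Combine(root, "demo")) |> ignore
File.WriteAllText(Path.Combine(root, "demo", "demo.csv"), "x,y\n1,2\n")
File.WriteAllText(Path.Combine(root, "orphan.csv"), "x\n1\n")

let manifest =
    [ { Name = "demo"; RelativePath = "demo/demo.csv" }
      { Name = "absent"; RelativePath = "absent/absent.csv" } ]

let undescribed, missing = scanDatasets root |> compareWithManifest manifest
printfn "Not described in manifest: %A" (Set.toList undescribed)
printfn "Described but missing on disk: %A" (Set.toList missing)
```

A build could fail (or just warn) when either set is non-empty, which would keep the folder contents and the describing dataset in sync.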
When we add a dataset, the script would:
We would then have a code generation step that prepares an F# file that knows how to load the data into the expected / supported runtime format defined for it. (Alternatively, we could figure out a solution that relies on a type provider, but since that incurs a runtime cost in the IDE and type providers pose significant engineering challenges to build, I would not favor this.)
This approach is just one idea; the end goal would be to have a separate "datasets" NuGet package.
I can't commit time to this, but I'd like the community to pool its ideas, and to get input from maintainers on whether this topic is a maintenance chore or not much of an issue. This repository has progressed (a lot!) so far; what are the pain points and considerations?
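As a rough illustration of the code generation step, the sketch below turns manifest entries into the source text of an F# module with one loader function per dataset. All names are hypothetical (`DatasetEntry`, `toIdentifier`, `generateLoaderModule`, the generated `Datasets` module, the `data` folder next to the binaries), and the generated loaders only return raw lines; mapping to records or data frames is deliberately left out.

```fsharp
open System

// Hypothetical manifest entry, as assumed for the scanning step.
type DatasetEntry = { Name: string; RelativePath: string }

/// Turn a dataset name such as "demo-set" into a PascalCase identifier ("DemoSet").
/// Simplistic: it does not guard against names starting with a digit.
let toIdentifier (name: string) =
    name.Split([| '-'; '_'; ' '; '.' |], StringSplitOptions.RemoveEmptyEntries)
    |> Array.map (fun part -> string (Char.ToUpperInvariant(part.[0])) + part.Substring(1))
    |> String.concat ""

/// Produce the text of an F# source file exposing one load function per entry.
let generateLoaderModule (entries: DatasetEntry list) : string =
    let header =
        [ "// Generated file, do not edit by hand."
          "module Datasets"
          ""
          "open System.IO"
          ""
          "let private dataRoot = Path.Combine(System.AppContext.BaseDirectory, \"data\")" ]
    let loaders =
        entries
        |> List.collect (fun e ->
            [ ""
              sprintf "let load%s () : string[] =" (toIdentifier e.Name)
              sprintf "    File.ReadAllLines(Path.Combine(dataRoot, \"%s\"))" e.RelativePath ])
    String.concat "\n" (header @ loaders)

// Usage: print the generated source for a single placeholder entry.
let source = generateLoaderModule [ { Name = "demo-set"; RelativePath = "demo/demo-set.csv" } ]
printfn "%s" source
```

Generating plain source this way keeps the IDE cost at zero, at the price of re-running the script whenever a dataset is added or renamed.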