about data sets that are used in samples, tests, docs, etc. #276

smoothdeveloper · 2023-07-06T15:40:12Z

At the recent fsharpconf, there was a talk, "My Leap from R to F#" by Beth Milhollin, in which she highlights how valuable it is for data science communities to have commonplace datasets: they are part of the package ecosystems in R (and the same applies in Python, AFAICS) and support learning the various packages in each ecosystem.

In this repository, I think there are already a few datasets, along with engineering challenges related to:

making sure it is easy to reference them in samples (and actually run those from the IDE when editing the docs)
making sure each file is loaded per a certain protocol (e.g. to turn it into a Deedle data frame with the appropriate index types, or other formats that are frequently used in data exploration settings)
attaching more metadata to the sample data (domain description for the column names, description of classification values, attribution and references, pointers to research papers, etc.) in the form of literal values, or even actual F# data structures

Overall, I think this repository should take those concerns into consideration, and that we should start exploring something that addresses the primary maintenance needs and makes some of the samples basically work out of the box, without having to resort to tweaking file paths, running FSI from a different directory, etc.

One first approach for us would be to have a build step (that I'll call a script for now) that scans the folders where, by agreed convention, the "packaged datasets" reside at build & packaging time, and compares the scan result against a dataset that describes each dataset.

When we add a dataset, the script would:

fail, asking the maintainer to add an entry to this "dataset metadata" dataset, OR create a default entry for the detected file (that can later be fine-tuned)
it would fail for missing files
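
To make the check above concrete, here is a minimal sketch, written in Python only so that it runs anywhere; an actual build step would more likely be an F# script. The folder layout, the `datasets.json` registry file, and the function names are all assumptions for illustration, not something that exists in this repository.

```python
import json
import tempfile
from pathlib import Path


def check_datasets(data_dir, registry_path):
    """Compare dataset files on disk against the metadata registry.

    Returns (undocumented, missing):
      undocumented: files present in data_dir with no registry entry
      missing: registry entries whose file is absent from data_dir
    """
    registry = json.loads(Path(registry_path).read_text(encoding="utf-8"))
    on_disk = {p.name for p in Path(data_dir).iterdir() if p.is_file()}
    registered = set(registry)
    return sorted(on_disk - registered), sorted(registered - on_disk)


def main(data_dir, registry_path):
    """Print one error per problem; return a non-zero code if the build should fail."""
    undocumented, missing = check_datasets(data_dir, registry_path)
    for name in undocumented:
        print(f"error: {name} has no entry in the dataset metadata")
    for name in missing:
        print(f"error: {name} is described in the metadata but was not found")
    return 1 if (undocumented or missing) else 0


if __name__ == "__main__":
    # Self-contained demo on a throwaway folder; file names are hypothetical.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        data = root / "datasets"
        data.mkdir()
        (data / "known.csv").write_text("a,b\n1,2\n", encoding="utf-8")
        (data / "new.csv").write_text("a,b\n1,2\n", encoding="utf-8")
        # The registry lives outside the scanned folder, so it is not
        # itself reported as an undocumented dataset.
        registry = root / "datasets.json"
        registry.write_text(
            json.dumps({"known.csv": {}, "gone.csv": {}}), encoding="utf-8"
        )
        print("exit code:", main(data, registry))
```

The "create a default entry" variant would write a stub entry into the registry for each undocumented file instead of reporting an error for it.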

We would then have a code generation step (or we figure out a solution that relies on a type provider, but since that incurs runtime cost in the IDE and significant engineering challenges, I'd not favor it) that prepares an F# file that knows how to load the data into the expected / supported runtime format defined for it.
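
As a rough illustration of such a generation step, the sketch below (again in Python, purely for portability) turns one metadata entry into the text of an F# module. The `Datasets.<Name>` module naming, the `description` / `source` metadata keys, and the `readLines` helper are hypothetical. The generated code relies on `__SOURCE_DIRECTORY__`, which only resolves sensibly when the file is used from a repository checkout; a packaged NuGet version would need another mechanism.

```python
import re
from pathlib import Path


def fsharp_string(value):
    """Render a Python string as a regular (escaped) F# string literal."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def module_name(file_name):
    """Derive a PascalCase module name from a file name.

    Assumes the file name starts with a letter, so the result is a
    valid F# identifier.
    """
    parts = re.split(r"[^0-9A-Za-z]+", Path(file_name).stem)
    return "".join(part[:1].upper() + part[1:] for part in parts if part)


def generate_loader(file_name, meta):
    """Return F# source for a module exposing metadata and a raw loader."""
    return "\n".join([
        "// Generated from the dataset metadata; do not edit by hand.",
        f"module Datasets.{module_name(file_name)}",
        "",
        "[<Literal>]",
        f"let Description = {fsharp_string(meta.get('description', ''))}",
        "",
        "[<Literal>]",
        f"let Source = {fsharp_string(meta.get('source', ''))}",
        "",
        "// Resolved relative to this generated file, so FSI needs no path tweaking.",
        f"let path = System.IO.Path.Combine(__SOURCE_DIRECTORY__, {fsharp_string(file_name)})",
        "",
        "// Raw lines only; a typed or data frame loader would replace this.",
        "let readLines () = System.IO.File.ReadAllLines(path)",
        "",
    ])


if __name__ == "__main__":
    # Placeholder metadata; real entries would carry attribution and references.
    meta = {"description": "placeholder description", "source": "placeholder attribution"}
    print(generate_loader("sample_data.csv", meta))
```

The build would write each generated string to an `.fs` file next to the data; the per-format loader (records, data frame, and so on) is the part that would need a real design.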

This approach is just one idea; the end game would be to have a separate "datasets" NuGet package.

I can't commit time for this, but I'd like the community to pool their ideas, and to get input from maintainers on whether this topic is a maintenance chore or not much of an issue. So far this repository has progressed (a lot!); what are the pain points and considerations?

bvenn · 2023-07-07T05:45:37Z

Maintainer

Thank you for bringing up this topic. Since we started testing the functionality of FSharp.Stats, we have used various ways of importing the testing data. I think we have now reached a state where one could generalize the data import.

I recently talked with @mathias-brandewinder and @kMutagene about the lack of a place in fslab for standard datasets. Actually, there is a dedicated repository (https://github.com/fslaborg/datasets), but it needs to be overhauled. The datasets could be offered here either as record collections or as ready-formatted data frames.

0 replies

mathias-brandewinder · 2023-07-08T20:59:06Z

Maintainer

Started to play with this idea a bit, focusing on the record collection direction:
https://github.com/mathias-brandewinder/TypedDatasets
Take this repository as a quick-and-dirty first sketch. My idea here is to

make it easy to use/consume a dataset,
make it easy to add a new dataset,
beyond data, provide some source / attribution information as well

3 replies

smoothdeveloper Jul 21, 2023
Author

@mathias-brandewinder do you agree that between the raw data and a typed model, there is a lot that occurs that involves typing / evaluating F# code?

I mean in terms of exploration of the data at a lower level, in context of datasets that are a bit more complex than "school cases" (faux ami?).

I believe there is value in putting this in perspective, if such effort consolidates; I mean having articles / documentation that would explain to people learning this art, steps used, to end up with the typed model.

I know that for simple datasets made of a single CSV, it boils down to not that much code (https://github.com/mathias-brandewinder/TypedDatasets/blob/5208cea12e1fb946679a3ee04e945bde35e73510/src/Iris.fs#L64-L72), but I'm trying to gauge whether there are valuable practices & guidelines that could be fleshed out, to explain how to do something similar on more complex and possibly faulty/imperfect datasets.

smoothdeveloper Jul 22, 2023
Author

Checked Beth's talk again, and it is really up for discussion whether "from tidyverse to sleekverse" is a straight path or multi-hop, along with the consideration that "data exploration is still more difficult in F# right now" (which is for many reasons, but I believe some are related to the fact that it is a multi-hop process, with tons of small items of work yet to happen in the ecosystem of established libraries).

kMutagene Jul 24, 2023
Maintainer

I mean in terms of exploration of the data at a lower level, in context of datasets that are a bit more complex than "school cases" (faux ami?).

This goes a bit in the same direction as my points in fslaborg/datasets#9, especially this comment.

houstonhaynes · 2023-07-21T12:31:29Z


I've been pondering this a bit on my own, but my current focus is on a solar tracker project (WildernessLabs/IoT). With that caveat - my thinking 🧠 where I'm really of two minds:

I come from an R background as much as C# - and from that perspective was thinking an update to the R type provider (to be compatible with R 4.1+) might be a way to by extension have access to whatever's available in that ecosystem. The reasoning is to have a form of "one stop shopping" - a way to ensure a form of repeatability between R and F# because it's accessing the same data source.
I also like the idea of having things made more accessible from a "native" F# point of view. That eliminates the additional cognitive overhead of having to understand R conventions well enough to crawl through that ecosystem to get to the data (and of course eliminates the need to have R bits on the machine).

Depending on your perspective (and previous experience) either option might be viewed as "one stop shopping" while the other could be seen as an adverse burden.

5 replies

smoothdeveloper Jul 21, 2023
Author

Based on @mathias-brandewinder's feedback and what you call "one stop shopping", I see value in having the same datasets easily exposed in different fashions, so the "shop" is not opinionated on how people want to consume the data, nor on what stage the data is at.

F# model: for when there are aspects such as nesting and relational datasets, usage of units of measure, data dehydrated as records / DUs; this is generally the end result of lots of sweat dealing with lower-level data
lower-level data: expose the datasets in the several views that make sense in the context of the fslab organization:
** FSharp.Data.CsvProvider or similar
** Deedle data frame (there is also a port of pandas: https://github.com/SciSharp/Pandas.NET, which might be good to consider as well)
** R datasets exposed through the runtime types of the R Type Provider

There are also similar datasets in the Python ecosystem.

while the other could be seen as an adverse burden.

I see opportunity for the community to design a good infrastructure that leaves all options open, and makes it easy to maintain & contribute to (concerns I've put in the top post, attributions that Mathias brings, etc.).

Given the diverse contexts "data processing" entails, there are several stages before things are ultimately fleshed into a pure F# model; the ecosystem of libraries brings the tools that help analysts get toward that.

Highlighting the diverse contexts and stages, in technical writing, seems relevant, while just having the end result (pure F# model) doesn't bring the full overview of the process IMO.

houstonhaynes Jul 22, 2023

I was afraid to suggest that simply because the scope might cause reasonable people to pause. 😬😉

I definitely think there should be more than one option, but I'm circling back to debate my own thesis. Given the (small) number of people that may be seeking to replicate results from data available in R (or Python) I wonder if that should be set aside until it's a named objective. I imagine it will come down to a question of priority.

Perhaps others have already had this conversation with themselves, and I'm just catching up now that I've turned to it. Either way it could be a good exercise to put these points "on a board" for later reference. That way new folks joining mid-stream will have the rationale to review and catch up quickly - or bring a fresh perspective to weigh in. And I'm sure that personality types that emphasize rigor in their work will appreciate it. 😆

smoothdeveloper Jul 22, 2023
Author

I have high trust in entropy and gradual improvement, and I'm not the best at rigour (I'd rather use, or attempt to design, "pit of success" tools that make it easy to "seem" rigorous, or to obtain results similar to "rigorous" ones without mandating special effort).

I've laid out a few concerns that impact the maintenance of this library in my top post, which were the prompting factor, and opened the discussion for the first stages of brainstorming, and for gathering the more disciplined and apt people 🙂.

If it was my job to design stuff, I'd:

keep accruing requirements/concerns
possibly get acknowledgement for the subset of those that I brought up in the initial post, as being significant and headed in the right direction
summarize it better in documents to validate approval
iterate on the design until it yields fruits that are easy to use in FSI and projects, with no concern about the filesystem, out of the box

But this is a large effort (especially curating data, which needs data scientists / domain experts) for it to become production grade; I can't commit to any of this beyond good intentions on this focal point.

smoothdeveloper Jul 22, 2023
Author

Given the (small) number of people that may be seeking to replicate results from data available in R (or Python)

I think the set of those people is an increasing function and all options are open, until there is consensus that we'd need to close them :)

When I hear Beth's talk, I'm a bit concerned by the "sleekverse is all there is" angle; it is aligned with F# as we know it today, but the point is also to bridge the gap that Beth was explaining in her talk. Not everyone needs to reach "sleekverse" by being "shaken off" from the more "fuzzy" way it is done in Python & R, especially if we put some weight on adoption being a forcing function for the ecosystem to fall into place.

We ought to keep those options open, as I believe they also address the "data exploration is more difficult in F# as of now" concern.

kMutagene Jul 24, 2023
Maintainer

I come from an R background as much as C# - and from that perspective was thinking an update to the R type provider (to be compatible with R 4.1+) might be a way to by extension have access to whatever's available in that ecosystem. The reasoning is to have a form of "one stop shopping" - a way to ensure a form of repeatability between R and F# because it's accessing the same data source.

While I see the use cases and think that the work done on the type provider is absolutely amazing, I think the over-reliance on out-of-ecosystem stuff in the past has made F# less attractive for data science (and overall), although it is in my opinion perfectly suited when looking at the language itself. Efforts to make F# better in the data science domain should in my opinion therefore be focused on implementations in F#. That also means natively implementing things that are theoretically "there for the taking" in other languages via things like type providers. I mean, there are ways of calling Python in .NET as well. But then, why not just do things in Python? F# has unique strengths that are naturally used best in native F# code. I think a good example where this worked well is Plotly.NET, where the core implementation is 100% pragmatic F#, and the library is even used to a large extent in C#. FSharp.Stats is shaping up well too, but has some baggage due to old power pack implementations.

But if you compare the current state to, say, 4 years ago, you can definitely see this "[...] entropy, gradual improvement [...]" working its way. As the niche is still pretty small, though, there are not many people who dedicate much time to core libraries. I think there is potential for this to change, as I sense heightened interest, for example via the data science in fsharp conference that is taking place in September, where I think much of this will be discussed on the fslab panel and hackathon.
