-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BF: Stream sampler #715
BF: Stream sampler #715
Conversation
Hi @detlefarend , I have added a sampler for streams by introducing a Sampler class (bf.streams.models.Sampler) and adding _set_sampler() in the Stream base class. Then, I added a random sampler (bf.streams.samplers.random) and a related howto file (howto_bf_streams_005_sampler.py). The random sampler works fine, as shown in the attached picture. It reduces 1000 instances to 282 instances. However, I am not totally satisfied with the randomizer logic, which I implemented. I would say that it is a little bit dirty. So I just set up a maximum number of step rates as a parameter (maximum step time for time-series data). To decide the next selected instance, the randomizer will take any random numbers between 0 to this parameter. Then, skip all of the next instances based on the selected random number. What do you think about the current implementation? Btw, the documentation is at the moment still missing. Thank you! |
Hi @steveyuwono, I had some time to review your code. Clean as always! Here are some things I found:
Hope that helps... |
Hi @detlefarend , thank you for reviewing this. Concerning the filter_instance method, I will just rename this method to omit_instance. I will update related codes accordingly as well as the refactoring of the native stream for csv_files, as discussed in #716. |
Hi @detlefarend , I have updated the sampler according to our discussion as well as an independent stream csv_file from mlpro stream provider, as in #716. However, to support the big data in this matter, more discussions are necessary. Should we create an issue for this or should keep #716 open? If you have time, you can review this. Thank you Moreover, I found out that the automodule of our RTD's API reference section does not generate complete functionalities of each module, see the pictures below as an example. The same story happens in most modules. I have not found the causes, but I will have a look at them. Shall I also create an issue for this? |
Yes, I'll take a look and let you know.
Did you compare the local and online version of rtd? Is it a problem on both derivates? Yes, please create an issue for it and involve Rizky. Maybe, he has an idea or quick solution... |
It is missing the |
Ahh okay, thank you. I will fix those |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Steve, I just reviewed it. Here are my findings:
- Class StreamMLProCSV
- Could you please add a parameter description on the top level? Just the specific ones.
- Method _init_dataset(): should load the data only at the first call
- Yes, let's add a separate issue for the big data extension.
- Since we decided to make this stream a standalone one, I propose to make parameters p_csv_file and p_path_load mandatory - even on constructor level. Then, we could say: C_TYPE = 'Stream csv file' and C_NAME = p_csv_file.
- RTD bf.stream
Could you please add a brief description for the sampling systematics?
The howtos run properly. Refact of Classes Sampler, SamplerRND goods good.
Furthermore, I changed/refactored some minor things on class Stream. No incompatible changes, more a small clean-up. See code log.
Hi Detlef, thank you for your reviews. I have updated the StreamMLProCSV and our RTD related to the sampler. I also added three other simple sampling methods to the pool and validate them using howto_005, such as weighted random sampler, min-wise sampler, and reservoir sampler. If everything is fine, I will merge this to the main. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Steve, I checked the last changes. The classes look good now. I just changed the exception type in StreamMLProCSV.set_options() from NotImplementedError to ParamError. I also added a Cross reference to the rtd description and refactored the rst files a bit. So far, so good.
Regarding howto_bf_streams_005 I propose some changes to simplify it and to enable unit testing of the sampler functionality:
- Native generated stream (e.g. Rnd10Dx1000) instead csv
- Enabling of backend processing
After that please merge to main directly.
Okay great! I will simplify howto 005 accordingly and merge it into the main branch. Thank you! |
Description
Background
Checklists: