BF: Stream sampler #715

steveyuwono · 2023-04-10T11:54:23Z

Description

Background

Checklists:

Read the CONTRIBUTION guide (required).
Update the history on the source code (required).

steveyuwono · 2023-04-10T15:48:31Z

Hi @detlefarend , I have added a sampler for streams by introducing a Sampler class (bf.streams.models.Sampler) and adding _set_sampler() in the Stream base class. Then, I added a random sampler (bf.streams.samplers.random) and a related howto file (howto_bf_streams_005_sampler.py).

The random sampler works fine, as shown in the attached picture. It reduces 1000 instances to 282 instances.

However, I am not totally satisfied with the randomizer logic, which I implemented. I would say that it is a little bit dirty. So I just set up a maximum number of step rates as a parameter (maximum step time for time-series data). To decide the next selected instance, the randomizer will take any random numbers between 0 to this parameter. Then, skip all of the next instances based on the selected random number.

What do you think about the current implementation? Btw, the documentation is at the moment still missing.

Thank you!

detlefarend · 2023-04-11T19:46:56Z

Hi @detlefarend , I have added a sampler for streams by introducing a Sampler class (bf.streams.models.Sampler) and adding _set_sampler() in the Stream base class. Then, I added a random sampler (bf.streams.samplers.random) and a related howto file (howto_bf_streams_005_sampler.py).

The random sampler works fine, as shown in the attached picture. It reduces 1000 instances to 282 instances.

However, I am not totally satisfied with the randomizer logic, which I implemented. I would say that it is a little bit dirty. So I just set up a maximum number of step rates as a parameter (maximum step time for time-series data). To decide the next selected instance, the randomizer will take any random numbers between 0 to this parameter. Then, skip all of the next instances based on the selected random number.

What do you think about the current implementation? Btw, the documentation is at the moment still missing.

Thank you!

Hi @steveyuwono, I had some time to review your code. Clean as always! Here are some things I found:

Class Stream, constructor:
I would rather let the customer instantiate/parameterize the sampler outside the stream class. Then it's just an optional object in parameter p_sampler and you can initialize the related attribute like self._samper = p_sampler. I think, custom method _set_sampler() is not necessary. The p_kwargs are then clearly related to the stream itself and not the sampler.
Class Stream: new public method set_sampler()
I recommend to add this method. You can use it in the constructor. Furthermore it allows to set a sampler after instantiation of a stream - even if it is taken from a provider.
Method Sampler.filter_instance
My intuition tells me, a return value True means, the instance shall be filtered (=ignored). Others could think: filtering is not ignoring but passing through. To make it more clear, the method could be named skip_instance. What do you think?
Class SamplerRND
I think I would try to simplify it and make it time-independent. And I recommend a small change on the index logic: let self._nxt_idx be an int random value from 1 to n and self._idx a value from 0 to self._nxt_idx. That avoids an integer overflow in big data scenarios (24x7 real operation of whatever stream data).

Hope that helps...

…/sampler

steveyuwono · 2023-04-13T08:23:32Z

Hi @detlefarend , I have added a sampler for streams by introducing a Sampler class (bf.streams.models.Sampler) and adding _set_sampler() in the Stream base class. Then, I added a random sampler (bf.streams.samplers.random) and a related howto file (howto_bf_streams_005_sampler.py).
The random sampler works fine, as shown in the attached picture. It reduces 1000 instances to 282 instances.
However, I am not totally satisfied with the randomizer logic, which I implemented. I would say that it is a little bit dirty. So I just set up a maximum number of step rates as a parameter (maximum step time for time-series data). To decide the next selected instance, the randomizer will take any random numbers between 0 to this parameter. Then, skip all of the next instances based on the selected random number.
What do you think about the current implementation? Btw, the documentation is at the moment still missing.
Thank you!

Hi @steveyuwono, I had some time to review your code. Clean as always! Here are some things I found:

Class Stream, constructor:
I would rather let the customer instantiate/parameterize the sampler outside the stream class. Then it's just an optional object in parameter p_sampler and you can initialize the related attribute like self._samper = p_sampler. I think, custom method _set_sampler() is not necessary. The p_kwargs are then clearly related to the stream itself and not the sampler.

Class Stream: new public method set_sampler()
I recommend to add this method. You can use it in the constructor. Furthermore it allows to set a sampler after instantiation of a stream - even if it is taken from a provider.

Method Sampler.filter_instance
My intuition tells me, a return value True means, the instance shall be filtered (=ignored). Others could think: filtering is not ignoring but passing through. To make it more clear, the method could be named skip_instance. What do you think?

Class SamplerRND
I think I would try to simplify it and make it time-independent. And I recommend a small change on the index logic: let self._nxt_idx be an int random value from 1 to n and self._idx a value from 0 to self._nxt_idx. That avoids an integer overflow in big data scenarios (24x7 real operation of whatever stream data).

Hope that helps...

Hi @detlefarend , thank you for reviewing this. Concerning the filter_instance method, I will just rename this method to omit_instance. I will update related codes accordingly as well as the refactoring of the native stream for csv_files, as discussed in #716.

…/sampler

steveyuwono · 2023-04-14T14:47:35Z

Hi @detlefarend , I have updated the sampler according to our discussion as well as an independent stream csv_file from mlpro stream provider, as in #716. However, to support the big data in this matter, more discussions are necessary. Should we create an issue for this or should keep #716 open? If you have time, you can review this. Thank you

Moreover, I found out that the automodule of our RTD's API reference section does not generate complete functionalities of each module, see the pictures below as an example. The same story happens in most modules. I have not found the causes, but I will have a look at them. Shall I also create an issue for this?

detlefarend · 2023-04-14T17:45:13Z

Hi @detlefarend , I have updated the sampler according to our discussion as well as an independent stream csv_file from mlpro stream provider, as in #716. However, to support the big data in this matter, more discussions are necessary. Should we create an issue for this or should keep #716 open? If you have time, you can review this. Thank you

Yes, I'll take a look and let you know.

Moreover, I found out that the automodule of our RTD's API reference section does not generate complete functionalities of each module, see the pictures below as an example. The same story happens in most modules. I have not found the causes, but I will have a look at them. Shall I also create an issue for this?

Did you compare the local and online version of rtd? Is it a problem on both derivates? Yes, please create an issue for it and involve Rizky. Maybe, he has an idea or quick solution...

rizkydiprasetya · 2023-04-14T18:47:51Z

It is missing the :private-members: option on the automodule.

steveyuwono · 2023-04-14T20:13:53Z

It is missing the :private-members: option on the automodule.

Ahh okay, thank you. I will fix those

detlefarend

Hi Steve, I just reviewed it. Here are my findings:

Class StreamMLProCSV
- Could you please add a parameter description on the top level? Just the specific ones.
- Method _init_dataset(): should load the data only at the first call
- Yes, let's add a separate issue for the big data extension.
- Since we decided to make this stream a standalone one, I propose to make parameters p_csv_file and p_path_load mandatory - even on constructor level. Then, we could say: C_TYPE = 'Stream csv file' and C_NAME = p_csv_file.
RTD bf.stream
Could you please add a brief description for the sampling systematics?

The howtos run properly. Refact of Classes Sampler, SamplerRND goods good.

Furthermore, I changed/refactored some minor things on class Stream. No incompatible changes, more a small clean-up. See code log.

steveyuwono · 2023-04-16T19:55:26Z

Hi Steve, I just reviewed it. Here are my findings:

Class StreamMLProCSV

Could you please add a parameter description on the top level? Just the specific ones.

Method _init_dataset(): should load the data only at the first call

Yes, let's add a separate issue for the big data extension.

Since we decided to make this stream a standalone one, I propose to make parameters p_csv_file and p_path_load mandatory - even on constructor level. Then, we could say: C_TYPE = 'Stream csv file' and C_NAME = p_csv_file.

RTD bf.stream
Could you please add a brief description for the sampling systematics?

The howtos run properly. Refact of Classes Sampler, SamplerRND goods good.

Furthermore, I changed/refactored some minor things on class Stream. No incompatible changes, more a small clean-up. See code log.

Hi Detlef, thank you for your reviews.

I have updated the StreamMLProCSV and our RTD related to the sampler. I also added three other simple sampling methods to the pool and validate them using howto_005, such as weighted random sampler, min-wise sampler, and reservoir sampler. If everything is fine, I will merge this to the main.

detlefarend

Hi Steve, I checked the last changes. The classes look good now. I just changed the exception type in StreamMLProCSV.set_options() from NotImplementedError to ParamError. I also added a Cross reference to the rtd description and refactored the rst files a bit. So far, so good.

Regarding howto_bf_streams_005 I propose some changes to simplify it and to enable unit testing of the sampler functionality:

Native generated stream (e.g. Rnd10Dx1000) instead csv
Enabling of backend processing

After that please merge to main directly.

steveyuwono · 2023-04-17T07:31:45Z

Hi Steve, I checked the last changes. The classes look good now. I just changed the exception type in StreamMLProCSV.set_options() from NotImplementedError to ParamError. I also added a Cross reference to the rtd description and refactored the rst files a bit. So far, so good.

Regarding howto_bf_streams_005 I propose some changes to simplify it and to enable unit testing of the sampler functionality:

Native generated stream (e.g. Rnd10Dx1000) instead csv

Enabling of backend processing

After that please merge to main directly.

Okay great! I will simplify howto 005 accordingly and merge it into the main branch. Thank you!

steveyuwono added 3 commits April 10, 2023 13:48

Update RTD #642

1c91978

BF-ML: Extensions for evaluation #642

b19d574

BF-ML: Extensions for evaluation #642

ada75e6

steveyuwono added enhancement New feature or request v1.4.0 v1.4.0 Separation of Wrapper Code OA Online adaptivitiy BF Basic Functions/Infrastructure labels Apr 10, 2023

steveyuwono self-assigned this Apr 10, 2023

steveyuwono marked this pull request as draft April 10, 2023 11:54

This was linked to issues Apr 10, 2023

BF-ML: Class Model - Extensions for evaluation #642

Closed

BF-Streams: Stream sampler #643

Closed

steveyuwono added 5 commits April 10, 2023 15:19

BF-Streams: Stream sampler #643

fc7ff70

BF-Streams: Stream sampler #643

303615f

BF-Streams: Stream sampler #643

f30cfd6

BF-Streams: Stream sampler #643

4bff773

HowTo for BF-Streams: Stream sampler #643

898b3ad

detlefarend added this to the Sprint 2023/04 milestone Apr 10, 2023

Merge https://github.com/fhswf/MLPro into bf/streams/sampler

7c1609d

steveyuwono and others added 2 commits April 12, 2023 14:24

Merge branch 'main' of https://github.com/fhswf/MLPro into bf/streams…

928dc0a

…/sampler

Merge branch 'main' into bf/streams/sampler

13be816

steveyuwono linked an issue Apr 13, 2023 that may be closed by this pull request

Refact: Native stream csv_file #716

Closed

3 tasks

steveyuwono added 6 commits April 14, 2023 10:30

Merge branch 'main' of https://github.com/fhswf/MLPro into bf/streams…

ce51c07

…/sampler

update sampler and stream #643

514273e

update bf-streams class diagrams #643

7deaf88

Refact: Native stream csv_file #716

4745122

finalizing base classes and howtos #716 #643

e0859ae

debugging #716 #643

4c7a74c

update related docu #716 #643

67b636c

steveyuwono marked this pull request as ready for review April 14, 2023 14:31

steveyuwono requested a review from detlefarend April 14, 2023 14:47

steveyuwono mentioned this pull request Apr 14, 2023

RTD: Missing modules in API References #730

Closed

3 tasks

detlefarend reviewed Apr 15, 2023

View reviewed changes

This comment was marked as duplicate.

Sign in to view

detlefarend and others added 3 commits April 15, 2023 10:32

BF-Streams: Stream sampler #643

e1e306b

updates on class diagram bf.streams

71d4c83

Merge branch 'main' into bf/streams/sampler

ea68600

detlefarend added next release v1.2.0 Extension Hub and removed v1.4.0 v1.4.0 Separation of Wrapper Code labels Apr 16, 2023

steveyuwono added 3 commits April 16, 2023 19:41

Update class StreamMLProCSV #716

afca317

Add three other sampling methods #643

dd53535

rtd documentation #643

13fc94e

detlefarend removed the next release label Apr 17, 2023

BF-Streams: Stream sampler #643

ecf07c3

detlefarend approved these changes Apr 17, 2023

View reviewed changes

howto_bf_streams_005_sampler #643

bcd9e52

steveyuwono merged commit e3c630d into main Apr 17, 2023

steveyuwono deleted the bf/streams/sampler branch April 17, 2023 09:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BF: Stream sampler #715

BF: Stream sampler #715

steveyuwono commented Apr 10, 2023

steveyuwono commented Apr 10, 2023

detlefarend commented Apr 11, 2023 •

edited

Loading

steveyuwono commented Apr 13, 2023

steveyuwono commented Apr 14, 2023

detlefarend commented Apr 14, 2023

rizkydiprasetya commented Apr 14, 2023

steveyuwono commented Apr 14, 2023

detlefarend left a comment •

edited

Loading

This comment was marked as duplicate.

steveyuwono commented Apr 16, 2023

detlefarend left a comment

steveyuwono commented Apr 17, 2023

BF: Stream sampler #715

BF: Stream sampler #715

Conversation

steveyuwono commented Apr 10, 2023

Description

Background

Checklists:

steveyuwono commented Apr 10, 2023

detlefarend commented Apr 11, 2023 • edited Loading

steveyuwono commented Apr 13, 2023

steveyuwono commented Apr 14, 2023

detlefarend commented Apr 14, 2023

rizkydiprasetya commented Apr 14, 2023

steveyuwono commented Apr 14, 2023

detlefarend left a comment • edited Loading

Choose a reason for hiding this comment

This comment was marked as duplicate.

steveyuwono commented Apr 16, 2023

detlefarend left a comment

Choose a reason for hiding this comment

steveyuwono commented Apr 17, 2023

detlefarend commented Apr 11, 2023 •

edited

Loading

detlefarend left a comment •

edited

Loading