Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BF: Stream sampler #715

Merged
merged 26 commits into from
Apr 17, 2023
Merged

BF: Stream sampler #715

merged 26 commits into from
Apr 17, 2023

Conversation

steveyuwono
Copy link
Collaborator

Description

Background

Checklists:

  • Read the CONTRIBUTION guide (required).
  • Update the history on the source code (required).

@steveyuwono steveyuwono added enhancement New feature or request v1.4.0 v1.4.0 Separation of Wrapper Code OA Online adaptivitiy BF Basic Functions/Infrastructure labels Apr 10, 2023
@steveyuwono steveyuwono self-assigned this Apr 10, 2023
@steveyuwono steveyuwono marked this pull request as draft April 10, 2023 11:54
@steveyuwono
Copy link
Collaborator Author

Hi @detlefarend , I have added a sampler for streams by introducing a Sampler class (bf.streams.models.Sampler) and adding _set_sampler() in the Stream base class. Then, I added a random sampler (bf.streams.samplers.random) and a related howto file (howto_bf_streams_005_sampler.py).

The random sampler works fine, as shown in the attached picture. It reduces 1000 instances to 282 instances.
image

However, I am not totally satisfied with the randomizer logic, which I implemented. I would say that it is a little bit dirty. So I just set up a maximum number of step rates as a parameter (maximum step time for time-series data). To decide the next selected instance, the randomizer will take any random numbers between 0 to this parameter. Then, skip all of the next instances based on the selected random number.

What do you think about the current implementation? Btw, the documentation is at the moment still missing.

Thank you!

@detlefarend detlefarend added this to the Sprint 2023/04 milestone Apr 10, 2023
@detlefarend
Copy link
Member

detlefarend commented Apr 11, 2023

Hi @detlefarend , I have added a sampler for streams by introducing a Sampler class (bf.streams.models.Sampler) and adding _set_sampler() in the Stream base class. Then, I added a random sampler (bf.streams.samplers.random) and a related howto file (howto_bf_streams_005_sampler.py).

The random sampler works fine, as shown in the attached picture. It reduces 1000 instances to 282 instances. image

However, I am not totally satisfied with the randomizer logic, which I implemented. I would say that it is a little bit dirty. So I just set up a maximum number of step rates as a parameter (maximum step time for time-series data). To decide the next selected instance, the randomizer will take any random numbers between 0 to this parameter. Then, skip all of the next instances based on the selected random number.

What do you think about the current implementation? Btw, the documentation is at the moment still missing.

Thank you!

Hi @steveyuwono, I had some time to review your code. Clean as always! Here are some things I found:

  • Class Stream, constructor:
    I would rather let the customer instantiate/parameterize the sampler outside the stream class. Then it's just an optional object in parameter p_sampler and you can initialize the related attribute like self._samper = p_sampler. I think, custom method _set_sampler() is not necessary. The p_kwargs are then clearly related to the stream itself and not the sampler.
  • Class Stream: new public method set_sampler()
    I recommend to add this method. You can use it in the constructor. Furthermore it allows to set a sampler after instantiation of a stream - even if it is taken from a provider.
  • Method Sampler.filter_instance
    My intuition tells me, a return value True means, the instance shall be filtered (=ignored). Others could think: filtering is not ignoring but passing through. To make it more clear, the method could be named skip_instance. What do you think?
  • Class SamplerRND
    I think I would try to simplify it and make it time-independent. And I recommend a small change on the index logic: let self._nxt_idx be an int random value from 1 to n and self._idx a value from 0 to self._nxt_idx. That avoids an integer overflow in big data scenarios (24x7 real operation of whatever stream data).

Hope that helps...

@steveyuwono steveyuwono linked an issue Apr 13, 2023 that may be closed by this pull request
3 tasks
@steveyuwono
Copy link
Collaborator Author

Hi @detlefarend , I have added a sampler for streams by introducing a Sampler class (bf.streams.models.Sampler) and adding _set_sampler() in the Stream base class. Then, I added a random sampler (bf.streams.samplers.random) and a related howto file (howto_bf_streams_005_sampler.py).
The random sampler works fine, as shown in the attached picture. It reduces 1000 instances to 282 instances. image
However, I am not totally satisfied with the randomizer logic, which I implemented. I would say that it is a little bit dirty. So I just set up a maximum number of step rates as a parameter (maximum step time for time-series data). To decide the next selected instance, the randomizer will take any random numbers between 0 to this parameter. Then, skip all of the next instances based on the selected random number.
What do you think about the current implementation? Btw, the documentation is at the moment still missing.
Thank you!

Hi @steveyuwono, I had some time to review your code. Clean as always! Here are some things I found:

  • Class Stream, constructor:
    I would rather let the customer instantiate/parameterize the sampler outside the stream class. Then it's just an optional object in parameter p_sampler and you can initialize the related attribute like self._samper = p_sampler. I think, custom method _set_sampler() is not necessary. The p_kwargs are then clearly related to the stream itself and not the sampler.
  • Class Stream: new public method set_sampler()
    I recommend to add this method. You can use it in the constructor. Furthermore it allows to set a sampler after instantiation of a stream - even if it is taken from a provider.
  • Method Sampler.filter_instance
    My intuition tells me, a return value True means, the instance shall be filtered (=ignored). Others could think: filtering is not ignoring but passing through. To make it more clear, the method could be named skip_instance. What do you think?
  • Class SamplerRND
    I think I would try to simplify it and make it time-independent. And I recommend a small change on the index logic: let self._nxt_idx be an int random value from 1 to n and self._idx a value from 0 to self._nxt_idx. That avoids an integer overflow in big data scenarios (24x7 real operation of whatever stream data).

Hope that helps...

Hi @detlefarend , thank you for reviewing this. Concerning the filter_instance method, I will just rename this method to omit_instance. I will update related codes accordingly as well as the refactoring of the native stream for csv_files, as discussed in #716.

@steveyuwono steveyuwono marked this pull request as ready for review April 14, 2023 14:31
@steveyuwono
Copy link
Collaborator Author

Hi @detlefarend , I have updated the sampler according to our discussion as well as an independent stream csv_file from mlpro stream provider, as in #716. However, to support the big data in this matter, more discussions are necessary. Should we create an issue for this or should keep #716 open? If you have time, you can review this. Thank you

Moreover, I found out that the automodule of our RTD's API reference section does not generate complete functionalities of each module, see the pictures below as an example. The same story happens in most modules. I have not found the causes, but I will have a look at them. Shall I also create an issue for this?
image
image

@detlefarend
Copy link
Member

Hi @detlefarend , I have updated the sampler according to our discussion as well as an independent stream csv_file from mlpro stream provider, as in #716. However, to support the big data in this matter, more discussions are necessary. Should we create an issue for this or should keep #716 open? If you have time, you can review this. Thank you

Yes, I'll take a look and let you know.

Moreover, I found out that the automodule of our RTD's API reference section does not generate complete functionalities of each module, see the pictures below as an example. The same story happens in most modules. I have not found the causes, but I will have a look at them. Shall I also create an issue for this? image image

Did you compare the local and online version of rtd? Is it a problem on both derivates? Yes, please create an issue for it and involve Rizky. Maybe, he has an idea or quick solution...

@rizkydiprasetya
Copy link
Contributor

It is missing the :private-members: option on the automodule.

@steveyuwono
Copy link
Collaborator Author

It is missing the :private-members: option on the automodule.

Ahh okay, thank you. I will fix those

Copy link
Member

@detlefarend detlefarend left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Steve, I just reviewed it. Here are my findings:

  • Class StreamMLProCSV
    • Could you please add a parameter description on the top level? Just the specific ones.
    • Method _init_dataset(): should load the data only at the first call
    • Yes, let's add a separate issue for the big data extension.
    • Since we decided to make this stream a standalone one, I propose to make parameters p_csv_file and p_path_load mandatory - even on constructor level. Then, we could say: C_TYPE = 'Stream csv file' and C_NAME = p_csv_file.
  • RTD bf.stream
    Could you please add a brief description for the sampling systematics?

The howtos run properly. Refact of Classes Sampler, SamplerRND goods good.

Furthermore, I changed/refactored some minor things on class Stream. No incompatible changes, more a small clean-up. See code log.

detlefarend

This comment was marked as duplicate.

@detlefarend detlefarend added next release v1.2.0 Extension Hub and removed v1.4.0 v1.4.0 Separation of Wrapper Code labels Apr 16, 2023
@steveyuwono
Copy link
Collaborator Author

Hi Steve, I just reviewed it. Here are my findings:

  • Class StreamMLProCSV

    • Could you please add a parameter description on the top level? Just the specific ones.
    • Method _init_dataset(): should load the data only at the first call
    • Yes, let's add a separate issue for the big data extension.
    • Since we decided to make this stream a standalone one, I propose to make parameters p_csv_file and p_path_load mandatory - even on constructor level. Then, we could say: C_TYPE = 'Stream csv file' and C_NAME = p_csv_file.
  • RTD bf.stream
    Could you please add a brief description for the sampling systematics?

The howtos run properly. Refact of Classes Sampler, SamplerRND goods good.

Furthermore, I changed/refactored some minor things on class Stream. No incompatible changes, more a small clean-up. See code log.

Hi Detlef, thank you for your reviews.

I have updated the StreamMLProCSV and our RTD related to the sampler. I also added three other simple sampling methods to the pool and validate them using howto_005, such as weighted random sampler, min-wise sampler, and reservoir sampler. If everything is fine, I will merge this to the main.

Copy link
Member

@detlefarend detlefarend left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Steve, I checked the last changes. The classes look good now. I just changed the exception type in StreamMLProCSV.set_options() from NotImplementedError to ParamError. I also added a Cross reference to the rtd description and refactored the rst files a bit. So far, so good.

Regarding howto_bf_streams_005 I propose some changes to simplify it and to enable unit testing of the sampler functionality:

  • Native generated stream (e.g. Rnd10Dx1000) instead csv
  • Enabling of backend processing

After that please merge to main directly.

@steveyuwono
Copy link
Collaborator Author

Hi Steve, I checked the last changes. The classes look good now. I just changed the exception type in StreamMLProCSV.set_options() from NotImplementedError to ParamError. I also added a Cross reference to the rtd description and refactored the rst files a bit. So far, so good.

Regarding howto_bf_streams_005 I propose some changes to simplify it and to enable unit testing of the sampler functionality:

  • Native generated stream (e.g. Rnd10Dx1000) instead csv
  • Enabling of backend processing

After that please merge to main directly.

Okay great! I will simplify howto 005 accordingly and merge it into the main branch. Thank you!

@steveyuwono steveyuwono merged commit e3c630d into main Apr 17, 2023
@steveyuwono steveyuwono deleted the bf/streams/sampler branch April 17, 2023 09:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
BF Basic Functions/Infrastructure enhancement New feature or request OA Online adaptivitiy v1.2.0 Extension Hub
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Refact: Native stream csv_file BF-Streams: Stream sampler BF-ML: Class Model - Extensions for evaluation
3 participants