modify cumulative_distribution function #220
Open
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Hi, I have used the package copulas when I need to simulate real data. I used GaussianMultivariate to fit, and then sampled data by GaussianMultivariate.sample(),but I found it was very slow when I used 20000 samples(10 features) for fitting to generate 20000 simulation data. I found that a lot time was spent in calculating the cumulative distribution function. So I modified some source code in the module gaussian_kde.py to improve the speed. In the end, my simulation speed increased by about 100 times.
Of course, as my data are all integers, so for a variable, there are many samples with the same value and I can count each value to redefine the weight. Although copula is used for continuous data, In real situations, data of int type is often used for fitting, and most of the values are the same for one variable, especially when there is lot of training data. In such situation, the simulation speed will increase a lot if we use the modified code.
best wishes