Refactor tasks: GPU/Dask/Ray/Server support #557

Merged: 8 commits, Feb 7, 2020

maartenbreddels
Member

This PR completely refactors how tasks work. A task (e.g. an aggregation) is now defined, can be cheaply serialized, and can be executed in several ways. This PR focuses mostly on server-side and CPU execution, but local experimental branches have shown this to work well with Dask or Ray for distributed computing, although that requires #548 for efficient serialization of data/dataframes, and some refactoring of groupby/join/take etc. It also opens the door to other ways of executing tasks, e.g. on the GPU, for example using the libraries powering cuDF.
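To illustrate the idea of a task that is defined as plain data, serialized cheaply, and then run by different executors, here is a minimal sketch. It is not the API introduced by this PR: `TaskSpec`, `execute_local`, and `execute_remote` are hypothetical names, and the "dataframe" is just a list of dicts standing in for chunks of columns.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical data-only description of an aggregation (not vaex's class)."""
    op: str      # "sum" or "count"
    column: str  # name of the column to aggregate

    def to_json(self) -> str:
        # The spec holds no data, so serializing it is cheap.
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, payload: str) -> "TaskSpec":
        return cls(**json.loads(payload))


def _map_chunk(spec: TaskSpec, chunk: dict) -> float:
    """Compute the partial result for a single chunk of data."""
    values = chunk[spec.column]
    if spec.op == "sum":
        return sum(values)
    if spec.op == "count":
        return len(values)
    raise ValueError(f"unsupported op: {spec.op}")


def execute_local(spec: TaskSpec, chunks: list) -> float:
    """In-process executor: map over chunks, then combine the partials.

    For both "sum" and "count" the combine step is a plain sum.
    """
    return sum(_map_chunk(spec, chunk) for chunk in chunks)


def execute_remote(payload: str, chunks: list) -> float:
    """Stand-in for a server or worker that only receives the serialized spec."""
    return execute_local(TaskSpec.from_json(payload), chunks)


if __name__ == "__main__":
    chunks = [{"x": [1, 2, 3]}, {"x": [4, 5]}]
    spec = TaskSpec(op="sum", column="x")
    print(execute_local(spec, chunks))
    print(execute_remote(spec.to_json(), chunks))
```

Because the executor only needs the spec and access to the data, the same definition could in principle be handed to a different backend (a thread pool, a remote server, a GPU kernel) without changing how the task is described.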

@maartenbreddels maartenbreddels added this to the 3.0 milestone Jan 21, 2020
@maartenbreddels maartenbreddels force-pushed the refactor_tasks branch 3 times, most recently from ad554ea to 218b697 Compare January 22, 2020 13:42
@maartenbreddels maartenbreddels marked this pull request as ready for review January 22, 2020 14:12
@maartenbreddels
Member Author

This failure is so odd:
https://travis-ci.org/vaexio/vaex/jobs/640451568?utm_medium=notification&utm_source=github_status

Sometimes we get 6, sometimes 7 columns.

@maartenbreddels maartenbreddels force-pushed the refactor_tasks branch 9 times, most recently from e14eb59 to ac977a9 Compare February 4, 2020 08:45
@maartenbreddels
Member Author

The failure was due to a cached file in ~/.vaex/data/.
The Windows failure keeps coming back, @JovanVeljanoski:
https://ci.appveyor.com/project/maartenbreddels/vaex-4gh4b/builds/30636573/job/fhpi3r4hnt3lhmmm

    @pytest.mark.skipif(((1,17,0) <= version <= (1,17,5)) and platform.system().lower() == 'windows', reason="strange ref count issue with numpy")
    @pytest.mark.skipif(((1,17,0) <= version <= (1,17,3)) and platform.system().lower() == 'linux' and sys.version_info[:2] == (3,6), reason="strange ref count issue with numpy")
    def test_robust_scaler():
        x = np.array([-2.65395789, -7.97116295, -4.76729177, -0.76885033, -6.45609635])
        y = np.array([-8.9480332, -4.81582449, -3.73537263, -3.46051912,  1.35137275])
        z = np.array([-0.47827432, -2.26208059, -3.75151683, -1.90862151, -1.87541903])
        w = np.zeros_like(x)
    
        ds = vaex.from_arrays(x=x, y=y, z=z, w=w)
        df = ds.to_pandas_df()
    
        features = ['x', 'y']
    
        scaler_skl = RobustScaler()
        result_skl = scaler_skl.fit_transform(df[features])
        scaler_vaex = vaex.ml.RobustScaler(features=features)
        result_vaex = scaler_vaex.fit_transform(ds)
    
>       np.testing.assert_array_almost_equal(scaler_vaex.center_, scaler_skl.center_, decimal=0.2)

@maartenbreddels maartenbreddels deleted the refactor_tasks branch February 7, 2020 09:00
@JovanVeljanoski
Member

Ahh.. so the Windows CI has started using NumPy 1.18.1, and this issue is still there.. :(

Linux also has NumPy 1.18.1, but there it passes..
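
One possible reading of why the test runs again on Windows: the `skipif` marker quoted in the traceback only covers versions 1.17.0 through 1.17.5, so 1.18.1 falls outside it. The sketch below reproduces just that comparison in plain Python. How `version` is actually computed in the test module is not shown in this thread, so `parse_version` is an assumed helper that handles plain numeric versions only.

```python
def parse_version(text: str) -> tuple:
    """Assumed helper: turn '1.18.1' into (1, 18, 1). Numeric parts only."""
    return tuple(int(part) for part in text.split(".")[:3])


def windows_skip_applies(numpy_version: str, system: str) -> bool:
    """Mirror the Windows skipif condition quoted in the traceback above."""
    version = parse_version(numpy_version)
    return (1, 17, 0) <= version <= (1, 17, 5) and system.lower() == "windows"


if __name__ == "__main__":
    for candidate in ("1.17.3", "1.18.1"):
        print(candidate, windows_skip_applies(candidate, "Windows"))
```

Under these assumptions the condition is true for 1.17.3 and false for 1.18.1, which would be consistent with the test no longer being skipped once the CI image moved to 1.18.1.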
