SyntheticDatasets.jl

SyntheticDatasets.jl is a Julia library of functions for generating synthetic datasets.

Installation

The package can be installed with the Julia package manager. From the Julia REPL, type ] to enter the Pkg REPL mode and run:

pkg> add SyntheticDatasets

Or, equivalently, via the Pkg API:

julia> import Pkg; Pkg.add("SyntheticDatasets")

Examples

A set of Pluto notebooks and scripts demonstrating the project's current functionality is available in the examples folder.

Here are a few examples that show the package's capabilities:

using StatsPlots, SyntheticDatasets

blobs = SyntheticDatasets.make_blobs(   n_samples = 1000, 
                                        n_features = 2,
                                        centers = [-1 1; -0.5 0.5], 
                                        cluster_std = 0.25,
                                        center_box = (-2.0, 2.0), 
                                        shuffle = true,
                                        random_state = nothing);

@df blobs scatter(:feature_1, :feature_2, group = :label, title = "Blobs")
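To make the `centers` and `cluster_std` arguments in the call above more concrete, here is a minimal pure-Julia sketch of what an isotropic Gaussian blob generator computes. It is not the package's implementation: `sketch_blobs` is a hypothetical helper, and it assumes that each row of `centers` is one cluster center (the scikit-learn convention) and that samples are split evenly across centers. Labels are 1-based here purely as a choice of this sketch.

```julia
using Random

# Hypothetical helper, NOT part of SyntheticDatasets.jl.
# Draws points from isotropic Gaussians placed at the rows of `centers`.
function sketch_blobs(rng::AbstractRNG, n_samples::Int,
                      centers::AbstractMatrix{<:Real}, cluster_std::Real)
    n_centers, n_features = size(centers)
    X = Matrix{Float64}(undef, n_samples, n_features)
    labels = Vector{Int}(undef, n_samples)
    for i in 1:n_samples
        k = mod1(i, n_centers)  # assign samples to centers round-robin
        # center k plus Gaussian noise with the same std in every dimension
        X[i, :] = centers[k, :] .+ cluster_std .* randn(rng, n_features)
        labels[i] = k
    end
    return X, labels
end

rng = MersenneTwister(42)
centers = [-1 1; -0.5 0.5]  # two centers: (-1, 1) and (-0.5, 0.5)
X, labels = sketch_blobs(rng, 1000, centers, 0.25)
```

With `cluster_std = 0.25`, the two clouds overlap noticeably because their centers are only about 0.7 apart, which is the effect visible in the "Blobs" plot.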

gauss = SyntheticDatasets.make_gaussian_quantiles(  mean = [10,1], 
                                                    cov = 2.0,
                                                    n_samples = 1000, 
                                                    n_features = 2,
                                                    n_classes = 3, 
                                                    shuffle = true,
                                                    random_state = 2);

@df gauss scatter(:feature_1, :feature_2, group = :label, title = "Gaussian Quantiles")

spirals = SyntheticDatasets.make_twospirals(n_samples = 2000, 
                                            start_degrees = 90,
                                            total_degrees = 570, 
                                            noise = 0.1);

@df spirals scatter(:feature_1, :feature_2, group = :label, title = "Two Spirals")

kernel = SyntheticDatasets.make_halfkernel( n_samples = 1000, 
                                            minx = -20,
                                            r1 = 20, 
                                            r2 = 35,
                                            noise = 3.0, 
                                            ratio = 0.6);

@df kernel scatter(:feature_1, :feature_2, group = :label, title = "Half Kernel")

Datasets

Some functions in SyntheticDatasets.jl are interfaces to the dataset generators of scikit-learn; these are listed first, followed by the package's other generators.

ScikitLearn

List of package datasets:

| Dataset | Title | Reference |
|---|---|---|
| make_blobs | Generate isotropic Gaussian blobs for clustering. | link |
| make_moons | Make two interleaving half circles. | link |
| make_s_curve | Generate an S curve dataset. | link |
| make_regression | Generate a random regression problem. | link |
| make_classification | Generate a random n-class classification problem. | link |
| make_friedman1 | Generate the “Friedman #1” regression problem. | link |
| make_friedman2 | Generate the “Friedman #2” regression problem. | link |
| make_friedman3 | Generate the “Friedman #3” regression problem. | link |
| make_circles | Make a large circle containing a smaller circle in 2d. | link |
| make_low_rank_matrix | Generate a mostly low rank matrix with bell-shaped singular values. | link |
| make_swiss_roll | Generate a swiss roll dataset. | link |
| make_hastie_10_2 | Generate data for binary classification used in Hastie et al. | link |
| make_gaussian_quantiles | Generate isotropic Gaussian and label samples by quantile. | link |

Disclaimer: SyntheticDatasets.jl borrows code and documentation from the scikit-learn datasets module, but it is not an official part of that project. It is licensed under the MIT license.

Other Functions

| Dataset | Title | Reference |
|---|---|---|
| make_twospirals | Generate a two-spirals dataset. | link |
| make_halfkernel | Generate a two half-kernel dataset. | link |
| make_outlier | Generate an outlier dataset. | link |
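
For readers curious about what a two-spirals generator does with parameters such as `n_samples`, `start_degrees`, `total_degrees`, and `noise`, the following is a rough pure-Julia sketch. It should be read as an illustration under assumptions, not as the code behind `make_twospirals`: `sketch_twospirals` is a hypothetical name, the spiral is assumed to be Archimedean (radius equal to the angle in radians), the noise is assumed to be Gaussian, and the second arm is taken to be the first arm reflected through the origin.

```julia
using Random

# Hypothetical sketch, NOT the implementation used by SyntheticDatasets.jl.
function sketch_twospirals(rng::AbstractRNG; n_samples::Int = 2000,
                           start_degrees::Real = 90,
                           total_degrees::Real = 570,
                           noise::Real = 0.1)
    n1 = div(n_samples, 2)
    n2 = n_samples - n1
    start = deg2rad(start_degrees)
    total = deg2rad(total_degrees)

    # Angle along each arm; sqrt biases samples toward the outer turns,
    # where the arm is longer.
    t1 = start .+ sqrt.(rand(rng, n1)) .* total
    t2 = start .+ sqrt.(rand(rng, n2)) .* total

    # Radius equals the angle (assumed Archimedean spiral), plus Gaussian jitter.
    arm1 = hcat(-cos.(t1) .* t1, sin.(t1) .* t1) .+ noise .* randn(rng, n1, 2)
    arm2 = hcat(cos.(t2) .* t2, -sin.(t2) .* t2) .+ noise .* randn(rng, n2, 2)

    X = vcat(arm1, arm2)
    labels = vcat(zeros(Int, n1), ones(Int, n2))  # 0 = first arm, 1 = second arm
    return X, labels
end

X, labels = sketch_twospirals(MersenneTwister(1); n_samples = 200)
```

In this sketch a larger `total_degrees` adds more turns to each arm, while a larger `noise` blurs the gap between the two arms and makes the classes harder to separate.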