Provided train_test_set is not correct #43

Open
sudmit0802 opened this issue Feb 6, 2024 · 2 comments

Comments

@sudmit0802

The blog post https://blog.munhou.com/2020/04/05/Pytorch-Implementation-of-Deep-Packet-A-Novel-Approach-For-Encrypted-Tra%EF%AC%83c-Classi%EF%AC%81cation-Using-Deep-Learning/ states: "For each of the application and traffic classification tasks, the dataset is first stratified split into train set and test set with the ratio of 80:20".
However, in the dataset provided at
https://drive.google.com/file/d/1EF2MYyxMOWppCUXlte8lopkytMyiuQu_/view?usp=sharing
the ratio is actually 20:80, so the test set is much larger than the train set:
[image]

@RayCxggg

Hi, did you find out what is wrong? I found the dataset split code in /Deep-Packet/create_train_test_set.py:

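# imports needed by this excerpt; top_n_per_group is defined elsewhere in the same file
from pyspark.sql.functions import col, lit, monotonically_increasing_id


def split_train_test(df, test_size, under_sampling_train=True):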
    # add increasing id for df
    df = df.withColumn("id", monotonically_increasing_id())

    # stratified split
    fractions = (
        df.select("label")
        .distinct()
        .withColumn("fraction", lit(test_size))
        .rdd.collectAsMap()
    )
    test_id = (
        df.sampleBy("label", fractions, seed=9876)
        .select("id")
        .withColumn("is_test", lit(True))
    )

    df = df.join(test_id, how="left", on="id")

    train_df = df.filter(col("is_test").isNull()).select("feature", "label")
    test_df = df.filter(col("is_test")).select("feature", "label")

    # under sampling
    if under_sampling_train:
        # get label list with count of each label
        label_count_df = train_df.groupby("label").count().toPandas()

        # get min label count in train set for under sampling
        min_label_count = int(label_count_df["count"].min())

        train_df = top_n_per_group(train_df, "label", min_label_count)

    return train_df, test_df

But it seems correct to me.
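
A rough way to check that reading without Spark is a plain-Python analogue of the split step. This is only a sketch: `sample_by` is a hypothetical stand-in for `df.sampleBy`, written under the assumption that each row is kept independently with the fraction given for its label, and the toy rows are made up rather than taken from the dataset.

```python
import random
from collections import Counter


def sample_by(rows, fractions, seed):
    """Keep each row with probability fractions[label], independently."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fractions[row["label"]]]


# Hypothetical toy data: two labels with very different sizes.
rows = [{"id": i, "label": "big"} for i in range(8000)]
rows += [{"id": 8000 + i, "label": "small"} for i in range(2000)]

# Same fraction for every label, as in the excerpt above.
test_size = 0.2
fractions = {label: test_size for label in {row["label"] for row in rows}}

# Sampled rows become the test set; everything else is the train set.
test_rows = sample_by(rows, fractions, seed=9876)
test_ids = {row["id"] for row in test_rows}
train_rows = [row for row in rows if row["id"] not in test_ids]

print(Counter(row["label"] for row in train_rows))
print(Counter(row["label"] for row in test_rows))
```

Under these assumptions, roughly 20% of each label ends up in the test set, so the split step on its own should give something close to 80:20.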

@pao0626

pao0626 commented Jun 11, 2024

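My understanding is that only the training set is downsampled. Each class in the training set is cut down to the size of its smallest class, so the training set can end up much smaller than the test set. For details, see the 'top_n_per_group' function in 'create_train_test_set.py'.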
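
The arithmetic behind this can be sketched in a few lines of Python. The class counts below are invented for illustration and are not the real dataset's counts, and `sizes_after_split` is a hypothetical helper rather than anything in the repository. It assumes exactly `test_size` of every class goes to the test set, which the random sampling in the real code only approximates.

```python
def sizes_after_split(class_counts, test_size, under_sampling_train=True):
    """Return (train_total, test_total) for an idealized stratified split."""
    # Idealized split: exactly test_size of each class goes to the test set.
    test_counts = {label: round(n * test_size) for label, n in class_counts.items()}
    train_counts = {label: n - test_counts[label] for label, n in class_counts.items()}

    if under_sampling_train:
        # Cap every class in the train set at the smallest class's train count.
        min_label_count = min(train_counts.values())
        train_counts = {label: min_label_count for label in train_counts}

    return sum(train_counts.values()), sum(test_counts.values())


# Hypothetical, heavily imbalanced class counts (not the real dataset).
counts = {"a": 100_000, "b": 50_000, "c": 1_000}

print(sizes_after_split(counts, 0.2, under_sampling_train=False))  # (120800, 30200)
print(sizes_after_split(counts, 0.2, under_sampling_train=True))   # (2400, 30200)
```

With these made-up counts the split is 80:20 before undersampling, but after undersampling the train set is far smaller than the test set, which would be consistent with the ratio reported in the issue.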
