update docs #39

Merged 1 commit on Aug 20, 2022
96 changes: 47 additions & 49 deletions docs/source/contribute.md
# How to contribute to pyKT?
Everyone is welcome to contribute, and we value everybody's contribution.
pyKT is still under development. More KT models and datasets are going to be added, and we always welcome contributions that help make pyKT better.


## Guidance
Thank you for your interest in contributing to pyKT! You can make the following contributions to pyKT:
1. Bug fixes for outstanding issues.
2. New datasets.
3. New model implementations.

## Install for Development
1、Clone this repository and switch to the dev branch (Notice: do not work on the main branch).

```shell
git clone https://github.com/pykt-team/pykt-toolkit
```


```shell
cd pykt-toolkit
git checkout dev
```

2、Editable Installation

You can use the following command to install the pykt library.

```shell
pip install -e .
```
In this mode, every change made to the `pykt` directory takes effect immediately; you do not need to reinstall the package.

3、Push to remote (dev)

After implementing new models or fixing bugs, you can push your code to the dev branch.


Pushing code directly to the main branch is **not allowed** (the push will fail). You can submit a Pull Request to merge your code from the **dev** branch into the main branch. We will reject Pull Requests to the main branch from any branch other than dev; please merge into the dev branch first.



## Add Your Datasets

In this section, we use the `ASSISTments2015` dataset as an example to show the procedure for adding a dataset. Here we shorten `ASSISTments2015` to `assist2015` as the dataset name; you can replace `assist2015` with your own dataset name.

### Create Data Files
1、Create a new dataset folder in the `data` directory, named after the dataset:

```shell
mkdir -p ./data/assist2015
```

2、Add the raw files of the new dataset to this directory. For example, the file structure of `assist2015` is:

```shell
$ tree data/assist2015/
├── 2015_100_skill_builders_main_problems.csv
```

3、Register the data path of the new dataset in `dname2paths` of `examples/data_preprocess.py`.

![](../pics/dataset-add_data_path.jpg)
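As a sketch, the new entry could look like the following; the existing entries and paths shown here are illustrative, not the actual contents of `examples/data_preprocess.py`:

```python
# Hypothetical sketch of registering a dataset path in examples/data_preprocess.py.
dname2paths = {
    # ... existing datasets (illustrative entry) ...
    "assist2009": "../data/assist2009/raw_data.csv",
    # new entry for the dataset being added:
    "assist2015": "../data/assist2015/2015_100_skill_builders_main_problems.csv",
}
```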

### Data Preprocess File

1、Create the preprocessing file `assist2015_preprocess.py` under the `pykt/preprocess` directory. The preprocessing is suggested to follow the [Data Preprocess Guidelines](#data-preprocess-guidelines), which contain guidelines for processing datasets. The main steps of the `assist2015` preprocessing are as follows:

```python
import pandas as pd
from pykt.utils import write_txt, change2timestamp, replace_text

def read_data_from_csv(read_file, write_file):
    ...  # read the raw csv and build each student sequence (elided)
    data.append(
        [[u, str(seq_len)], seq_problems, seq_skills, seq_ans, seq_start_time, seq_use_time])

    write_txt(write_file, data)
```
The complete code can be found in `pykt/preprocess/algebra2005_preprocess.py`.

2、Import the preprocessing file in `pykt/preprocess/data_proprocess.py`.


![](../pics/dataset-import.jpg)
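The import-and-dispatch idea can be sketched as follows; the function and dictionary names here are illustrative, not pykt's actual API, and the preprocessor is faked so the sketch stays self-contained:

```python
# Hypothetical sketch of how data_proprocess.py dispatches to a
# dataset-specific preprocessor; the real file uses direct imports such as
#   from .assist2015_preprocess import read_data_from_csv

def assist2015_read_data_from_csv(read_file, write_file):
    # stand-in for pykt/preprocess/assist2015_preprocess.py
    return f"preprocessed {read_file} -> {write_file}"

DISPATCH = {
    "assist2015": assist2015_read_data_from_csv,
}

def process_raw_data(dataset_name, read_file, write_file):
    if dataset_name not in DISPATCH:
        raise ValueError(f"unknown dataset: {dataset_name}")
    return DISPATCH[dataset_name](read_file, write_file)
```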



### Data Preprocess Guidelines
#### Field Extraction

For each dataset, we mainly extract 6 fields for model training: user ID, question ID (name), skill ID (name), answering results, answer submission time, and answering duration (if a field does not exist in the dataset, it is represented by "NA").

#### Data Filtering

Interactions lacking any of the five fields user ID, question ID (name), skill ID (name), answering results, or answer submission time will be filtered out.

#### Data Ordering

A student's interaction sequence is ordered by answer submission time. Interactions with the same submission time keep their relative order from the original dataset.

#### Character Processing

- **Field concatenation:** Use `----` as the connecting symbol. For example, Algebra2005 needs to concatenate `Problem Name` and `Step Name` as the final problem name.
- **Character replacement:** If there is an underline `_` in the question and skill of original data, replace it with `####`. If there is a comma `,` in the question and skill of original data, replace it with `@@@@`.
- **Multi-skill separator:** If there are multiple skills in a question, we separate the skills with an underline `_`.
- **Time format:** The answer submission time is a millisecond (ms) timestamp, and the answering duration is in milliseconds (ms).
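The replacement and concatenation rules above can be sketched as a minimal illustration (pykt's own `replace_text` helper may implement this differently):

```python
def replace_text(text):
    # "_" and "," are reserved as separators in the output format,
    # so occurrences in raw question/skill names are replaced.
    return str(text).replace("_", "####").replace(",", "@@@@")

def concat_fields(problem_name, step_name):
    # e.g. Algebra2005 concatenates Problem Name and Step Name with "----"
    return f"{replace_text(problem_name)}----{replace_text(step_name)}"

def join_skills(skills):
    # multiple skills of one question are joined with "_"
    return "_".join(replace_text(s) for s in skills)
```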

#### Output Data Format

After completing the above preprocessing, a `data.txt` file will be generated under the dataset's folder in the `data` directory. Each student sequence contains 6 rows of information as follows:

```
User ID, sequence length
Question ID (name)
Skill ID (name)
Answer result
Answer submission time
Answering duration
```

An example student sequence can be found in the generated `data.txt`.


## Add Your Models
### Create a New Model File
Our models are all in the `pykt/models` directory. When you add a new model, please create a file named `{model_name}.py` in `pykt/models`.
You can write your model file using `pykt/models/dkt.py` as a reference.

### Init Your Model
You need to add your model in `pykt/models/init_model.py` by changing the `init_model` function.
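Conceptually, `init_model` maps a model name to its constructor. A self-contained sketch of the idea follows; the real function in `pykt/models/init_model.py` takes pykt-specific configs, so the class names and signatures below are illustrative stand-ins:

```python
# Illustrative sketch only: DKT / MyNewModel stand in for real model classes.
class DKT:
    def __init__(self, num_c, emb_size):
        self.num_c, self.emb_size = num_c, emb_size

class MyNewModel:
    def __init__(self, num_c, emb_size):
        self.num_c, self.emb_size = num_c, emb_size

def init_model(model_name, model_config):
    if model_name == "dkt":
        return DKT(**model_config)
    elif model_name == "my_new_model":  # add a branch for your model here
        return MyNewModel(**model_config)
    raise ValueError(f"unknown model name: {model_name}")
```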

### Add to the Training Process

1. Change the `model_forward` and `cal_loss` functions in `pykt/models/train_model.py` to add your model to the training process; you can refer to the other models.

2. Run `wandb_train.py` to train the new model.

### Add to the Evaluation Process
You can modify the `evaluate_model.py` file: change the `evaluate` function to get the repeated knowledge concept evaluation, change the `evaluate_question` function to get the question evaluation results, and change the `predict_each_group` functions to get the multi-step prediction results of accumulative and non-accumulative predictions.
4 changes: 2 additions & 2 deletions docs/source/index.rst
Welcome to pyKT's documentation!
========================================
pyKT is a python library built upon PyTorch to train deep learning based knowledge tracing (KT) models. The library consists of a standardized set of integrated data preprocessing procedures on multiple popular datasets across different domains, 5 detailed prediction scenarios, and frequently compared DLKT approaches for transparent and extensive experiments.

Let's Get Started! `English Introduction <./quick_start.html>`_.

More details about the academic background can be found in our paper at https://arxiv.org/abs/2206.11460?context=cs.CY .

2 changes: 1 addition & 1 deletion docs/source/installation.md
## Installation
Since pyKT is a python-based library, you can install it with the following commands:

First, create a conda environment.

1 change: 0 additions & 1 deletion docs/source/pykt.rst
Subpackages
.. toctree::
:maxdepth: 4

pykt.datasets
pykt.models
pykt.preprocess
43 changes: 22 additions & 21 deletions docs/source/quick_start.md
# Quick Start

## Installation
You can install it through `pip`:

```shell
pip install -U pykt-toolkit
```

We recommend creating a new Conda environment with the following command:

```shell
conda create --name=pykt python=3.7.5
pip install -U pykt-toolkit
```

## Train Your First Model
### Prepare a Dataset
**1、Obtain a Dataset**

Let's start by downloading a dataset from [here](datasets.md). Please make sure you have created the `data/{dataset_name}` folder and downloaded the dataset into it.

**2、Data Preprocessing**

`python data_preprocess.py [parameter]`

```shell
python data_preprocess.py --dataset_name=ednet
```

### Training a Model
After the data preprocessing, you can use `python wandb_modelname_train.py [parameter]` to train a model:

```shell
CUDA_VISIBLE_DEVICES=2 nohup python wandb_sakt_train.py --dataset_name=assist2015 --use_wandb=0 --add_uuid=0 --num_attn_heads=2 > sakt_train.txt &
```


## Evaluating Your Model

Now, let's use `wandb_predict.py` to evaluate the model's performance on the testing set.

`python wandb_predict.py`


### Create a Wandb Account

We use Weights & Biases (Wandb) for hyperparameter tuning. It is a machine learning platform for developers to build better models faster with experiment tracking. First, register an account on the [Wandb](https://wandb.ai/) webpage; then you can get your API key from [here](https://wandb.ai/settings):

![](../pics/api_key.png)


Next, add your `uid` and `api_key` into `configs/wandb.json`.
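Assuming `configs/wandb.json` is a flat JSON object with `uid` and `api_key` keys, as the text suggests (the exact schema is an assumption here), its contents would look like:

```python
import json

# Hypothetical contents of configs/wandb.json; replace with your own values.
wandb_config = {
    "uid": "your_wandb_uid",
    "api_key": "your_wandb_api_key",
}
print(json.dumps(wandb_config, indent=4))
```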

### Sweep Configuration

`python generate_wandb.py [parameter]`

```
Args:
--generate_all: "True" or "False", indicating whether to generate the wandb startup files of all datasets and models in the all_dir directory (True: generate startup files for all datasets and models in all_dir; False: only generate the startup file for the currently executed dataset and model), default: "False"
```

### Start Sweep

**Step 1:** `sh [launch_file] [parameter]`

Example:
```shell
python generate_wandb.py --dataset_names="assist2009,assist2015" --model_names="dkt,dkt+"
sh all_start.sh > log.all 2>&1
# You need to define the log file yourself.
```

**Step 2:** `sh run_all.sh [parameter]`
```shell
sh run_all.sh log.all 0 5 assist2015 dkt 0,1,2,3,4
```

```shell
sh start_sweep_0_5.sh
# "0" and "5" denote the start sweep and the end sweep, respectively.
```
### Tuning Protocol

We use the Bayes search method to find the best hyperparameters, since it is expensive to run all hyperparameter combinations. You can run the `get_wandb_new` file to check whether to stop the search. By default, we stop the search once the number of tuned hyperparameter combinations in each data fold is larger than 300 and there has been no AUC improvement on the testing data in the last 100 rounds (output "end!").
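The stopping rule can be sketched as follows: given the chronological list of test AUCs from the tuned combinations in one fold, stop once more than 300 combinations have been tried and the best AUC did not improve during the last 100 of them. This is a simplified illustration of the check, not the actual `get_wandb_new` implementation:

```python
def should_stop(aucs, min_rounds=300, patience=100):
    # aucs: test AUCs of hyperparameter combinations, in search order.
    if len(aucs) <= min_rounds:
        return False
    best_before = max(aucs[:-patience])
    recent_best = max(aucs[-patience:])
    # no improvement within the last `patience` rounds -> "end!"
    return recent_best <= best_before
```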

### Start Evaluation


Run the `get_wandb_new` file to generate the `{model name}_{emb type}_pred.yaml` file, modify the `program` keyword in the YAML file, and change its path to `./wandb_predict.py` or `wandb_predict.py`.

Then, execute the following command:

```shell
WANDB_API_KEY=xxx wandb sweep all_wandbs/dkt_qid_pred.yaml -p pykt_wandb
# xxx is your api_key, pykt_wandb is your wandb project name

# e.g.,
CUDA_VISIBLE_DEVICES=0 WANDB_API_KEY=xxx nohup wandb agent swwwish/pykt_wandb/qn91y02m &
# qn91y02m is the agent name generated after the first command line is executed
```

![](../pics/predict.png)


Only 5 sweeps will be run in this stage, without any parameter tuning; each sweep corresponds to the evaluation of one fold of the training data. Finally, you can export the evaluation results or call the wandb API for the 5-fold statistics, and calculate the mean and standard deviation of each metric. The final result is reported as: ***mean ± standard deviation***
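For example, aggregating one metric across the 5 fold sweeps can be done as below (the fold values are illustrative, and the sample standard deviation is one possible convention; check which one your export uses):

```python
import statistics

def mean_std(values):
    # Aggregate one metric (e.g. AUC) over the 5 fold sweeps.
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation
    return f"{mean:.4f} ± {std:.4f}"

fold_aucs = [0.7246, 0.7312, 0.7288, 0.7259, 0.7301]  # illustrative numbers
print(mean_std(fold_aucs))
```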

If you want to add new models or datasets to pyKT, please follow [Contribute](contribute.md).