- Juan Manuel Muñoz
- Carlos Miguel Patiño
- Juan Manuel Gutierrez
- Camilo Velasquez
- David R. Valencia
We prepared the data with Spark, using the notebooks located in src/preprocessing/notebooks-emr. Two additional scripts produced the topics and the BERT encodings; that step ran independently of the rest of the Spark pipeline.
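For context, the BERT-encoding step looks roughly like the following. This is a minimal sketch, assuming a multilingual BERT checkpoint and CLS-token pooling; the actual script may use a different model and pooling strategy.

```python
# Hedged sketch: encode tweet text with a pretrained BERT model.
# The checkpoint name and CLS pooling are assumptions, not the actual script.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")
model.eval()

def encode_tweets(texts, max_length=128):
    """Return one fixed-size embedding per tweet ([CLS] token)."""
    batch = tokenizer(texts, padding=True, truncation=True,
                      max_length=max_length, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    # Use the [CLS] embedding as the tweet-level representation.
    return out.last_hidden_state[:, 0, :].numpy()

embeddings = encode_tweets(["example tweet text", "another tweet"])
print(embeddings.shape)  # (2, 768)
```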
The notebook order for the preprocessing was as follows (a sketch of the recurring Spark join/aggregation pattern appears after the list):
1. data-pipeline-final-part2-v0: The buckets for the users were created.
2. data-pipelline-emr-final-v0: Most of the raw data was processed, including the creation of the labels.
3. join-bert-code-final-v0: The BERT encodings were joined with the processed data, along with some extra BERT-derived features such as the cluster each tweet belongs to.
4. join-topics-final-v0: Per-user aggregations were computed from the tweet topics seen by each user.
5. data-pipeline-final-part2-v0: Buckets 2.0 were calculated, looking for improvements.
6. data-pipeline-final-features-v0: Additional features were calculated, and some graph features were joined to the processed data.
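The pattern these notebooks repeat is joining precomputed features onto the processed interactions and aggregating per user. Here is a minimal PySpark sketch of that pattern; the paths and column names (user_id, tweet_id, topic, bert_cluster) are hypothetical, not the notebooks' actual schema.

```python
# Hedged sketch of the join/aggregate pattern used in the notebooks.
# All paths and column names are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("preprocessing-sketch").getOrCreate()

interactions = spark.read.parquet("s3://bucket/processed-interactions/")
bert = spark.read.parquet("s3://bucket/bert-encodings/")
topics = spark.read.parquet("s3://bucket/tweet-topics/")

# Join the BERT-derived features (e.g. the cluster each tweet belongs to).
joined = interactions.join(bert.select("tweet_id", "bert_cluster"),
                           on="tweet_id", how="left")

# Per-user aggregations over the topics seen by each user.
user_topics = (joined.join(topics, on="tweet_id", how="left")
                     .groupBy("user_id")
                     .agg(F.countDistinct("topic").alias("n_topics_seen"),
                          F.count("tweet_id").alias("n_tweets_seen")))

features = joined.join(user_topics, on="user_id", how="left")
features.write.mode("overwrite").parquet("s3://bucket/features/")
```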
Our architecture is based on existing recommender systems that were successful on CTR prediction tasks; it is summarized in the image below.
The training inputs are defined in train_main.py, and the models are located in src/utils/models.py.
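To make the setup concrete, here is a minimal sketch of a deep CTR model of this kind, assuming dense features plus a BERT encoding as inputs and four binary engagement targets. The layer sizes, feature split, and number of targets are assumptions; the actual architectures live in src/utils/models.py.

```python
# Hedged sketch of a deep CTR model: dense inputs plus a BERT embedding,
# with one sigmoid head per engagement target. Shapes and layer sizes are
# assumptions; the real models are defined in src/utils/models.py.
import tensorflow as tf
from tensorflow.keras import layers

def build_model(n_dense_features=100, bert_dim=768):
    dense_in = layers.Input(shape=(n_dense_features,), name="dense_features")
    bert_in = layers.Input(shape=(bert_dim,), name="bert_encoding")

    x = layers.Concatenate()([dense_in, bert_in])
    for units in (512, 256, 128):
        x = layers.Dense(units, activation="relu")(x)
        x = layers.Dropout(0.2)(x)

    # One binary head per engagement target (e.g. like, reply, retweet).
    out = layers.Dense(4, activation="sigmoid", name="engagements")(x)

    model = tf.keras.Model([dense_in, bert_in], out)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```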
To train the models, run:

```bash
python train_main.py --model_version "50"
```
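The --model_version flag selects which model configuration to train. A minimal sketch of how train_main.py might parse it follows; the actual script likely defines more options.

```python
# Hedged sketch of the flag handling in train_main.py; the real argument
# set and the mapping from version to model may differ.
import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="Train a CTR model.")
    parser.add_argument("--model_version", type=str, required=True,
                        help='Which model configuration to train, e.g. "50".')
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print(f"Selected model version: {args.model_version}")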
To inspect the learning curves and the TensorFlow training info, run:

```bash
tensorboard --logdir src/models/logs
```
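The logs that TensorBoard reads are written during training. A minimal sketch of how that is typically wired up with a Keras callback, assuming the run-naming scheme shown (the actual logging setup may differ):

```python
# Hedged sketch: write training curves to src/models/logs so the
# tensorboard command above can display them. The run-naming scheme
# is an assumption.
import datetime
import tensorflow as tf

run_name = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tb_callback = tf.keras.callbacks.TensorBoard(
    log_dir=f"src/models/logs/{run_name}", histogram_freq=1)

# model.fit(train_x, train_y, validation_data=val_data,
#           epochs=10, callbacks=[tb_callback])
```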