# Data Preparation

## Survey Data

1. Download the survey responses to `data/raw_responses.tsv`.
2. Run `src/analysis/Response Cleaning.ipynb` to produce `data/responses.tsv` (a sketch of the cleaning pass follows this list).
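
The notebook itself is not reproduced in this file; as a minimal sketch, assuming a pandas-based cleaning pass, the core steps might look like the following. The `response` column name is an assumption, not the actual schema.

```python
# Minimal sketch of a cleaning pass, assuming the notebook uses pandas;
# the "response" column name is illustrative, not the real schema.
import pandas as pd

raw = pd.read_csv("data/raw_responses.tsv", sep="\t")

# Drop exact duplicate submissions and rows with an empty response field.
clean = raw.drop_duplicates()
clean = clean[clean["response"].notna()]

# Normalize whitespace in the free-text answers.
clean["response"] = clean["response"].str.strip()

clean.to_csv("data/responses.tsv", sep="\t", index=False)
```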

## Trace Data

1. Run `src/traces/create_hive_traces.py`. This creates a Hive table of requests grouped by IP and user agent (UA) for the given time span (a sketch of the underlying query follows the command).

```bash
python create_hive_traces.py \
    --start 2016-03-01 \
    --stop 2016-03-08 \
    --db traces \
    --table ${version} \
    --priority
```
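
Conceptually, the script builds a grouping query along these lines; the source table and column names below are assumptions based on the description above, not the script's actual SQL.

```python
# Sketch of the kind of Hive query create_hive_traces.py might build; the
# source table and column names are assumptions, not the actual schema.
start, stop, db, table = "2016-03-01", "2016-03-08", "traces", "v1"

query = """
CREATE TABLE {db}.{table} AS
SELECT
    client_ip AS ip,
    user_agent AS ua,
    COLLECT_LIST(CONCAT(dt, '|', uri_path)) AS requests
FROM wmf.webrequest
WHERE dt >= '{start}' AND dt < '{stop}'
GROUP BY client_ip, user_agent
""".format(db=db, table=table, start=start, stop=stop)

print(query)  # the real script would submit this to Hive
```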

2. Run `src/traces/hash_trace_ips.py`. This takes the Hive table of requests, hashes the IPs using the supplied key, drops the XFF field, and writes the traces as JSON to HDFS (a sketch of the hashing step follows the command).

```bash
spark-submit \
    --driver-memory 5g \
    --master yarn \
    --deploy-mode client \
    --num-executors 4 \
    --executor-memory 10g \
    --executor-cores 4 \
    --queue priority \
hash_trace_ips.py \
    --start 2016-03-01 \
    --stop 2016-03-08 \
    --input_dir /user/hive/warehouse/traces.db/${version} \
    --output_dir /user/ellery/readers/data/hashed_traces/${version} \
    --key
```
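
The hashing itself is not specified in this file; here is a minimal sketch of the per-record transformation, assuming a keyed HMAC so that raw IPs cannot be recovered without the key. Function and field names are illustrative.

```python
# Minimal sketch of per-record anonymization, assuming a keyed HMAC;
# function and field names here are illustrative assumptions.
import hashlib
import hmac

def hash_ip(ip, key):
    """Replace a raw IP with its keyed HMAC-SHA256 digest."""
    return hmac.new(key.encode("utf-8"), ip.encode("utf-8"),
                    hashlib.sha256).hexdigest()

def anonymize(record, key):
    """Hash the ip field and drop xff from one trace record."""
    record["ip"] = hash_ip(record["ip"], key)
    record.pop("xff", None)  # drop X-Forwarded-For entirely
    return record

# Example:
# anonymize({"ip": "203.0.113.7", "xff": "198.51.100.1", "ua": "..."}, "secret")
```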
3. Run `src/traces/join_traces_and_clicks.py`. This joins 'Yes' click events on the survey widget in EventLogging (EL) with the hashed traces and writes a `join_data.tsv` file for each day (a rough sketch of the join follows the command).

```bash
spark-submit \
    --driver-memory 5g \
    --master yarn \
    --deploy-mode client \
    --num-executors 4 \
    --executor-memory 20g \
    --executor-cores 4 \
    --queue priority \
join_traces_and_clicks.py \
    --start 2016-03-01 \
    --stop 2016-03-08 \
    --input_dir /user/ellery/readers/data/hashed_traces/${version} \
    --output_dir /home/ellery/readers/data/click_traces/${version}
```
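
As a rough sketch, the join could be expressed as below, assuming both sides expose the same hashed (ip, ua) pair; the EL click path and field names are assumptions, not the script's actual inputs.

```python
# Rough sketch of the daily join, assuming both sides share the hashed
# (ip, ua) identifiers; the EL click path and field names are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join_sketch").getOrCreate()

traces = spark.read.json("/user/ellery/readers/data/hashed_traces/v1")
clicks = spark.read.json("/tmp/el_survey_clicks")  # hypothetical EL extract

# Keep only 'Yes' clicks on the survey widget before joining.
yes_clicks = clicks.filter(clicks.response == "Yes")

# Inner join on the shared hashed identifiers; the real script would then
# write one join_data.tsv per day from this DataFrame.
joined = yes_clicks.join(traces, on=["ip", "ua"], how="inner")
joined.write.csv("/tmp/join_data", sep="\t", header=True)
```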

## Joining Survey Data and Traces

1. Copy the click traces to your local machine:

```bash
scp -r stat1002.eqiad.wmnet:/home/ellery/readers/data/click_traces/${version} ~/readers/data/click_traces/
```

2. Run `src/traces/Join Survey and Traces.ipynb`. This generates `responses_with_traces.tsv`, which contains the survey responses for which we have traces (a sketch of the join follows).
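
The notebook's logic is not shown in this file; as a minimal sketch, assuming a shared token column links each survey response to its click trace (the `survey_token` column name is an assumption):

```python
# Sketch of the final join, assuming a shared token column links a survey
# response to its click trace; "survey_token" is an assumed column name.
import pandas as pd

responses = pd.read_csv("data/responses.tsv", sep="\t")
traces = pd.read_csv("data/click_traces/join_data.tsv", sep="\t")

# Keep only responses that have a matching trace.
merged = responses.merge(traces, on="survey_token", how="inner")
merged.to_csv("data/responses_with_traces.tsv", sep="\t", index=False)
```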