CXS (originally CXS338) is a fork of MIT Haystack's CorrelX VLBI correlator, which was developed by A.J. Vazquez Alvarez during a postdoctoral research position at MIT Haystack in 2015-2017. The original project's main objectives were "scalability, flexibility and simplicity". This project aims to add "performance" to that list.
CXS started in 2021 as a migration of CorrelX to Apache Spark, carried out by this author as part of a Master's thesis on Big Data at UNED. It was a proof of concept with the following objectives:
- Simplifying architecture and usage (simplicity).
- Migrating from Python 2 to Python 3 (flexibility).
- Migrating from Hadoop to Spark (performance).
- Running a test correlation on a cloud computing service (scalability).
About the naming convention:
- CXH227: CorrelX on Hadoop 2, Python 2.7 (CorrelX legacy).
- CXPL38: CorrelX on Pipeline, Python 3.8.
- CXS338: CorrelX on Spark 3, Python 3.8.
- CXS3311: CorrelX on Spark 3, Python 3.11.
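The naming convention above can be read mechanically. The following is a minimal sketch, not part of the project: `decode_variant`, `Variant`, `PLATFORMS` and `VERSIONED` are hypothetical names, and the parsing rule (platform letters, an optional one-digit platform version, then the Python version digits) is an assumption inferred from the four names listed.

```python
import re
from typing import NamedTuple, Optional


class Variant(NamedTuple):
    platform: str
    platform_version: Optional[str]
    python_version: str


# Platform codes taken from the list of names above.
PLATFORMS = {"H": "Hadoop", "PL": "Pipeline", "S": "Spark"}
# Assumption: only these platforms carry a one-digit major version in the name.
VERSIONED = {"H", "S"}


def decode_variant(name: str) -> Variant:
    """Split a name such as 'CXS3311' into platform, platform version and Python version."""
    match = re.fullmatch(r"CX([A-Z]+)(\d+)", name)
    if match is None or match.group(1) not in PLATFORMS:
        raise ValueError(f"unrecognised variant name: {name!r}")
    code, digits = match.groups()

    platform_version = None
    if code in VERSIONED:
        # The first digit is the platform major version (Hadoop 2, Spark 3).
        platform_version, digits = digits[0], digits[1:]

    if len(digits) < 2:
        raise ValueError(f"missing Python version in {name!r}")
    # The remaining digits are the Python major version followed by the minor version.
    python_version = f"{digits[0]}.{digits[1:]}"
    return Variant(PLATFORMS[code], platform_version, python_version)
```

For example, `decode_variant("CXS3311")` yields the platform `Spark`, platform version `3` and Python version `3.11`.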
Download Apache Spark 3.5.1 pre-built for Apache Hadoop 3:
wget https://ftp.cixug.es/apache/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
tar -xvzf spark-3.5.1-bin-hadoop3.tgz
Create environment and install requirements:
python3.11 -m venv venv3
source venv3/bin/activate
pip install -r requirements.pkg.txt
python cxs/tools/gen_symlinks.py
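Add the following lines to venv3/bin/activate (replace the Spark path as required; the PYTHONPATH entries are built with `pwd`, so activate the environment from the repository root):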
export SPARK_HOME=/home/aj/spark-3.5.1-bin-hadoop3
export PYTHONPATH=$PYTHONPATH:`pwd`/src
export PYTHONPATH=$PYTHONPATH:`pwd`/cxs
Reactivate environment:
source venv3/bin/activate
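The environment variables set above can be sanity-checked before running anything. This is a minimal sketch, not a script shipped with the project: `check_environment` is a hypothetical helper, and it assumes that a usable Spark installation has `bin/spark-submit` under `SPARK_HOME` and that the `src` and `cxs` directories of the repository should both appear on `PYTHONPATH`.

```python
import os
from pathlib import Path
from typing import List, Mapping


def check_environment(env: Mapping[str, str], repo_root: Path) -> List[str]:
    """Return a list of human-readable problems; an empty list means the setup looks fine."""
    problems = []

    # SPARK_HOME must point at an unpacked Spark distribution.
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
    elif not (Path(spark_home) / "bin" / "spark-submit").is_file():
        problems.append(f"no bin/spark-submit under SPARK_HOME={spark_home}")

    # PYTHONPATH must contain <repo>/src and <repo>/cxs (compared as resolved paths).
    entries = {
        Path(p).resolve()
        for p in env.get("PYTHONPATH", "").split(os.pathsep)
        if p
    }
    for sub in ("src", "cxs"):
        if (repo_root / sub).resolve() not in entries:
            problems.append(f"{repo_root / sub} is not on PYTHONPATH")

    return problems
```

Calling `check_environment(os.environ, Path.cwd())` from the repository root after reactivating the environment should return an empty list if the steps above were followed.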
Run the VGOS example:
bash examples/run_example_vgos.sh
Configure Hadoop and run the Hadoop VGOS example:
bash sh/configure_hadoop_cx.sh
bash examples/run_example_vgos_hadoop.sh