sparksampling is a PySpark-based gRPC service for sampling and data quality assessment.
It supports containerized deployments and Spark on K8S.
- Common sampling methods: Random, Stratified, Simple
- Relationship Sampling based on DAG and Topological sorting
- Cloud Native and Spark on K8S support
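As a rough illustration of why relationship sampling relies on a DAG and topological sorting: related tables have to be processed in dependency order, so that a referenced table is sampled before the tables that point at it. The sketch below uses only the Python standard library (`graphlib`, Python 3.9+). The table names and the `sampling_order` helper are hypothetical and are not part of the sparksampling API.

```python
from graphlib import TopologicalSorter

# Hypothetical schema: each table maps to the set of tables it references.
dependencies = {
    "customers": set(),
    "products": set(),
    "orders": {"customers"},
    "order_items": {"orders", "products"},
}


def sampling_order(deps):
    """Return the tables ordered so that every table appears after the tables it references."""
    return list(TopologicalSorter(deps).static_order())


order = sampling_order(dependencies)
print(order)
```

Sampling in this order means that, when a dependent table such as `order_items` is reached, the sampled keys of `orders` and `products` are already known and can be used to filter it.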
To try it out, install the package from PyPI:
pip install sparksampling
Then run:
sparksampling
The service starts and listens on port 8530.
Alternatively, run it with Docker, mapping the same port:
docker run -p 8530:8530 wh1isper/pysparksampling:latest
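To confirm that the service is reachable once it has started, a plain TCP check against port 8530 is usually enough. The sketch below uses only the Python standard library and assumes the service runs on `localhost`. The `is_port_open` helper is hypothetical and only tests that the port accepts connections; it does not make any gRPC call.

```python
import socket


def is_port_open(host="localhost", port=8530, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    print("service reachable:", is_port_open())
```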
For development, install in editable mode with the test extras:
pip install -e .[test]
pre-commit install
Run the tests:
pytest -v