When using a Pool to create VCF files and then index them (~20,000 of them), I noticed file handles were left open, and eventually got a "too many open files" exception. I was able to isolate it to the pysam.tabix_index function. To reproduce this, I made a small script. You can grab one of the process IDs the log prints and run something like lsof -p <PID> | grep -c 'vcf', and you will see the count climb until all processes are done.
"""This script reproduces the tabix index scenario where
file handles are not closed.
"""
import logging
import multiprocessing
import os
import pysam
import sys
import time
logger_fmt = "[%(levelname)s] [%(asctime)s] [%(name)s] - %(message)s"
root_logger = logging.getLogger("example")
def worker(args):
"""Simple worker to make a VCF and tabix index it. Then sleep."""
(
out_directory,
chunk,
) = args
logger = root_logger.getChild("worker")
logger.info(f"Processing chunk {chunk}")
logger.info("Process ID: {}".format(os.getpid()))
header = pysam.VariantHeader()
header.add_line("##contig=<ID=chr1,length=100,assembly=Fake>")
vcf_out_path = os.path.join(out_directory, f"chunk_{chunk}.vcf.gz")
with pysam.VariantFile(vcf_out_path, mode="wz", header=header) as vcf_obj:
for i in range(100):
record = vcf_obj.new_record(
contig="chr1", start=i, stop=i + 1, filter=("PASS"), alleles=("A", "T")
)
vcf_obj.write(record)
idx_path = pysam.tabix_index(vcf_out_path, preset="vcf", force=True)
time.sleep(1)
if __name__ == "__main__":
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter(logger_fmt, datefmt="%Y%m%d %H:%M:%S")
handler.setFormatter(formatter)
root_logger.addHandler(handler)
root_logger.setLevel(level=logging.INFO)
if len(sys.argv) != 2:
root_logger.error(
"ERROR! You must provide the path to the directory to write outputs."
)
root_logger.error("Usage: example.py /path/to/output/dir")
sys.exit(1)
odir = sys.argv[1]
if not os.path.isdir(odir):
root_logger.info(f"Creating output directory {odir}...")
os.mkdir(odir)
chunks = [(odir, i) for i in range(100)]
with multiprocessing.Pool(processes=4) as vcf_pool:
vcf_pool.map(worker, chunks)
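As an alternative to lsof, you can also watch the descriptor count from inside the process itself. This is a hypothetical helper, not part of the script above; it assumes a Linux system with a /proc filesystem:

```python
import os


def open_fd_count():
    """Count the file descriptors currently open in this process.

    On Linux, /proc/<pid>/fd contains one entry per open descriptor,
    so the directory's size is the number of open handles.
    """
    return len(os.listdir(f"/proc/{os.getpid()}/fd"))
```

Calling this before and after each pysam.tabix_index call in the worker would show the count growing by one per call while the leak is present.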
The file descriptor is being leaked from is_gzip_file(), which just needs an os.close(fd). However, there is other code in tabix_index() that also opens the file to do format detection, so I will refactor it to detect both things at once instead.
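For illustration, the leak pattern looks roughly like this. This is a minimal sketch of a gzip magic-byte check, not pysam's actual implementation: a raw descriptor from os.open() is never closed by garbage collection the way a file object is, so the os.close(fd) must be explicit:

```python
import os

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def is_gzip_file(path):
    """Return True if the file at `path` starts with the gzip magic bytes."""
    fd = os.open(path, os.O_RDONLY)
    try:
        magic = os.read(fd, 2)
    finally:
        # Without this close, every call leaks one descriptor -- the bug
        # pattern described above.
        os.close(fd)
    return magic == GZIP_MAGIC
```

Running a check like this once per tabix_index call in a 20,000-file Pool job is exactly how the descriptor limit gets exhausted when the close is missing.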
Version: pysam==0.16.0.1
Python: 3.8.9