
Does the job.process function prevent a job from stalling #299

Closed
timcosta opened this issue May 16, 2016 · 12 comments

@timcosta

My question is basically what's in the title. I'm trying to use Bull as a queue to pipe video streams from one source to another, think moving videos from one storage area to another. When these videos grow large, Bull occasionally reports the job as stalled and restarts it, even though the upload is still in progress and I am calling job.progress every 10 seconds or so. Is there a way to have job.progress act as the "check-in" for the job so that it doesn't get reported as stalled?
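
(Roughly, the shape of what we're doing is the sketch below; the queue wiring and the startUpload helper are placeholders rather than our actual code, but it shows where the progress "check-in" happens.)

var Queue = require('bull');

// Placeholder wiring -- the real queue name and redis config differ.
var videoQueue = Queue('video transfer', 6379, '127.0.0.1');

videoQueue.process(function (job, done) {
  // startUpload is a stand-in for the streaming upload logic.
  startUpload(job.data, {
    // Fires roughly every 10 seconds as upload parts complete.
    onProgress: function (part) {
      job.progress(part); // the intended "check-in"
    },
    onDone: done
  });
});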

@manast
Member

manast commented May 16, 2016

You will need to implement your operation as non-blocking IO, using fibers or some other asynchronous mechanism (streams, maybe); it depends on your use case.

@timcosta
Author

// Inside the queue's process handler; `job` and `done` come from Bull,
// `http` is Node's http module and `s3` is an aws-sdk S3 instance.
http.get(url, function onResponse(res) {
    // Stream the HTTP response body straight into a managed S3 upload.
    var managedUpload = s3.upload({
        Body: res,
        ACL: 'public-read',
        Bucket: 'test-bucket',
        Key: domain + '/prod/video/' + entryId + '.' + fileExt
    });
    managedUpload.on('httpUploadProgress', function (progress) {
        // Report the current multipart part number back to Bull.
        job.progress(progress.part);
    });
    managedUpload.send(function (err, data) {
        if (err) {
            return done(err); // err is already an Error from the aws-sdk
        }
        console.log("Done.");
        return done();
    });
});

So this is what we're doing. http is a reference to the Node http library, and s3 is an S3 instance from the official aws-sdk. From what I understand, this is passing streams back and forth, which is non-blocking IO. I see progress events registering, and then the job stalls and gets retried while the previous upload is still going. Here's what my logs look like:

Attempting to pipe videoX from Y to S3
STALLED
Attempting to pipe videoX from Y to S3
Done.
Done.

The job completes twice, and the Matador web UI shows the progress fluctuating depending on which upload reported progress most recently.

@manast
Member

manast commented May 17, 2016

That's very weird, I will look more deeply into it.

@manast manast added the bug label May 17, 2016
@manast manast added this to the 1.0 milestone May 17, 2016
@xdc0
Contributor

xdc0 commented Jun 15, 2016

@tjsail33 Bull will emit the stalled event when it detects a job is stalled. Do you see this event being triggered? You can register a listener by doing this:

queue.on('stalled', function (job) {
  console.log('Job %s is stalled', job.jobId);
});

@xdc0
Contributor

xdc0 commented Jun 15, 2016

Note: currently the loop that checks for stalled jobs can actually pick up a job for its first processing run, so this may produce false positives.

I think a better signal is to listen for:

queue.on('active', function (job) {
});

That event signals a job that has just started processing. If you see multiple stalled events firing for the same job id, together with multiple active events, then Bull is retrying a job that it shouldn't be retrying.
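
For example, something like this (purely illustrative) logs both events side by side so you can see whether a single job id goes active more than once:

// Illustrative only: count how many times each job id goes active.
var activations = {};

queue.on('active', function (job) {
  activations[job.jobId] = (activations[job.jobId] || 0) + 1;
  console.log('Job %s active (activation #%d)', job.jobId, activations[job.jobId]);
});

queue.on('stalled', function (job) {
  console.log('Job %s reported as stalled', job.jobId);
});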

@manast
Member

manast commented Jun 16, 2016

I will close this issue for now since we are lacking a response from the submitter.

@manast manast closed this as completed Jun 16, 2016
@timcosta
Author

timcosta commented Jun 16, 2016

Hey @manast, sorry for the lack of instant response, but I was essentially sleeping for 80% of the time that passed between comments.

The answer is that yes, there were multiple active events for the same job. Here's what the output looked like:

1: active
1: Uploading to S3
1: Part 1
1: Part 2
1: Part 3
1: stalled
1: Part 4
1: active
1: Uploading to S3
1: Part 5
1: Part 1
1: Part 6
1: Part 2

It was processing the same job twice simultaneously.

@manast
Member

manast commented Jun 16, 2016

What I find strange here is that the job got stalled to begin with. That should not happen if the event loop has not been blocked. Any chance you could post the whole process function? Also, do you have any other code running in the same Node process? Did you also try with the latest version, 1.0.0-rc3?
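
To illustrate what I mean by blocking: the stall detection should only misfire if the process function keeps the event loop busy, for example with long synchronous work as in this generic sketch (not a claim about your code):

// Generic example of a processor that blocks the event loop.
queue.process(function (job, done) {
  // A long synchronous loop keeps Node from running any timers or I/O
  // callbacks, so Bull cannot keep the job marked as active and may flag
  // it as stalled even though it is still "working".
  var sum = 0;
  for (var i = 0; i < 1e10; i++) {
    sum += i;
  }
  done();
});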

@timcosta
Author

timcosta commented Jun 16, 2016

Unfortunately I can't share the rest of the process function due to IP restrictions at work. This was the only job that we had running at the time, and there was nothing else being done by the Node process other than processing these files. I am not able to try the latest version, as we were unfortunately forced to rewrite using another library due to time constraints. Sorry this isn't terribly helpful for debugging purposes.

If it makes a difference, the calls to job.log were appearing in the UI under the same job object, even though it was being processed twice at the same time. The order they appeared in is the same as the order in my prior comment. So there wasn't job duplication or anything like that; it just seemed to start processing the same job twice after deciding the first run had stalled, even though job.log and job.progress were being called.
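
(For anyone trying to reproduce this, tagging each processor invocation as in the sketch below makes the interleaved output unambiguous; the runUpload helper and the tagging are purely illustrative, not our actual code.)

// Illustrative only: tag each invocation of the processor so output from
// two concurrent runs of the same job can be told apart.
queue.process(function (job, done) {
  var attempt = Math.random().toString(36).slice(2, 8);
  console.log('[%s] job %s started', attempt, job.jobId);
  // runUpload is a stand-in for the streaming upload shown earlier.
  runUpload(job.data, function onPart(part) {
    console.log('[%s] job %s part %d', attempt, job.jobId, part);
    job.progress(part);
  }, function onFinished(err) {
    console.log('[%s] job %s finished', attempt, job.jobId);
    done(err);
  });
});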

@xdc0
Contributor

xdc0 commented Jun 16, 2016

1: active
1: Uploading to S3
1: Part 1
1: Part 2
1: stalled
1: Part 4
1: active
1: Uploading to S3
1: Part 5
1: Part 1
1: Part 6
1: Part 2

@tjsail33 I'm assuming the "Part N" lines are progress reports; how come it jumped from Part 2 to Part 4? Did it not report Part 3, or am I missing something?

@timcosta
Author

@chuym Sorry, that was a typo. I corrected it. All parts were correctly reported.

@xdc0
Contributor

xdc0 commented Jun 20, 2016

@tjsail33 @manast: a second issue was opened to track this (#308), and it is fixed in 1.0rc4. Could you help us by testing your upload process against 1.0rc4?
Thanks!
