
Expose swarming job errors and stacktraces when they fail #1815

Closed
rhyolight opened this issue Feb 9, 2015 · 34 comments · Fixed by #3812

Comments

@rhyolight
Member

As reported initially in #1717, swarming jobs that fail sometimes surface a JSON parse error. This stops the entire swarm and dumps an error like this:

Results from all experiments:
----------------------------------------------------------------
Generating experiment files in directory: /tmp/tmpwVsELA...
Writing  313 lines...
Writing  113 lines...
done.
None
json.loads(jobInfo.results) raised an exception.  Here is some info to help with debugging:
jobInfo:  _jobInfoNamedTuple(jobId=1006, client=u'GRP', clientInfo=u'', clientKey=u'', cmdLine=u'$HYPERSEARCH', params=u'{"hsVersion": "v2", "maxModels": null, "persistentJobGUID": "351cf396-8d38-11e4-a734-685d43b983b8", "useTerminators": false, "description": {"includedFields": [{"fieldName": "timestamp", "fieldType": "datetime"}, {"maxValue": 53.0, "fieldName": "kw_energy_consumption", "fieldType": "float", "minValue": 0.0}], "streamDef": {"info": "kw_energy_consumption", "version": 1, "streams": [{"info": "Rec Center", "source": "file://rec-center-hourly.csv", "columns": ["*"]}]}, "inferenceType": "TemporalMultiStep", "inferenceArgs": {"predictionSteps": [1], "predictedField": "kw_energy_consumption"}, "iterationCount": -1, "swarmSize": "medium"}}', jobHash='5\x1c\xfa\x9e\x8d8\x11\xe4\xa74h]C\xb9\x83\xb8', status=u'notStarted', completionReason=None, completionMsg=None, workerCompletionReason=u'success', workerCompletionMsg=None, cancel=0, startTime=None, endTime=None, results=None, engJobType=u'hypersearch', minimumWorkers=1, maximumWorkers=4, priority=0, engAllocateNewWorkers=1, engUntendedDeadWorkers=0, numFailedWorkers=0, lastFailedWorkerErrorMsg=None, engCleaningStatus=u'notdone', genBaseDescription=None, genPermutations=None, engLastUpdateTime=datetime.datetime(2014, 12, 26, 19, 48, 47), engCjmConnId=None, engWorkerState=None,
engStatus=None, engModelMilestones=None)
jobInfo.results:  None
EXCEPTION:  expected string or buffer
Traceback (most recent call last):
  File "swarm.py", line 109, in <module>
    swarm(INPUT_FILE)
  File "swarm.py", line 101, in swarm
    modelParams = swarmForBestModelParams(SWARM_DESCRIPTION, name)
  File "swarm.py", line 78, in swarmForBestModelParams
    verbosity=0
  File "/usr/lib/python2.7/site-packages/nupic/swarming/permutations_runner.py", line 276, in runWithConfig
    return _runAction(runOptions)
  File "/usr/lib/python2.7/site-packages/nupic/swarming/permutations_runner.py", line 217, in _runAction
    returnValue = _runHyperSearch(runOptions)
  File "/usr/lib/python2.7/site-packages/nupic/swarming/permutations_runner.py", line 160, in _runHyperSearch
    metricsKeys=search.getDiscoveredMetricsKeys())
  File "/usr/lib/python2.7/site-packages/nupic/swarming/permutations_runner.py", line 825, in generateReport
    results = json.loads(jobInfo.results)
  File "/usr/lib/python2.7/site-packages/nupic/support/object_json.py", line 163, in loads
    json.loads(s, object_hook=objectDecoderHook, **kwargs))
  File "/usr/lib/python2.7/json/__init__.py", line 351, in loads
    return cls(encoding=encoding, **kw).decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
TypeError: expected string or buffer

This error is misleading because it looks like a JSON parsing error, but it is really caused by one of the swarm jobs failing: the swarming system is not extracting the error properly from that job and displaying it to the user. In this case the jobInfo object returned from the swarm job has no results object, which triggers the error above.

The program should report the original error from the swarm job back to the user instead of this worthless stacktrace.
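For illustration, here is a minimal sketch of what the report-generation step could do instead: check `jobInfo.results` before parsing and surface the job's own failure fields. This is not the actual nupic code; `load_job_results` is a hypothetical helper, and the field names (`results`, `completionMsg`, `lastFailedWorkerErrorMsg`) are taken from the jobInfo dump above.

```python
import json


def load_job_results(jobInfo):
    """Hypothetical replacement for the bare json.loads(jobInfo.results).

    If the job produced no results, raise an error built from the job's
    own failure fields instead of letting json.loads fail with a
    confusing TypeError.
    """
    if jobInfo.results is None:
        # Prefer the worker's own error message; fall back to the
        # job-level completion message.
        reason = (jobInfo.lastFailedWorkerErrorMsg
                  or jobInfo.completionMsg
                  or "no error message recorded")
        raise RuntimeError(
            "Swarm job %s produced no results: %s" % (jobInfo.jobId, reason))
    return json.loads(jobInfo.results)
```

With something like this in place, a failed job would raise "Swarm job 1006 produced no results: ..." carrying the worker's original error, rather than `TypeError: expected string or buffer`.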

@breznak
Member

breznak commented Feb 9, 2015

Maybe this is another issue, but shouldn't an error in a single swarming thread avoid killing the whole swarming process?

@lovekeshvig

This seems to indicate that the swarm is unable to locate the data file; try changing the data filename to a nonexistent one and you get the same error.

@lovekeshvig

Any suggestions on how to rectify this?

@rhyolight
Member Author

@lovekeshvig Sorry, been on vacation for the past week. Just catching up. This could be related to #1805. What happens if you do this?

export NUPIC=/path/to/nupic
export NTA_DATA_PATH=/path/to/nupic/examples/prediction/data

@rhyolight rhyolight self-assigned this Feb 23, 2015
@rhyolight rhyolight modified the milestones: 0.7.0, 0.3.0 Feb 25, 2015
@rhyolight
Member Author

@lovekeshvig ping?

@rhyolight
Member Author

try changing the data filename to a nonexistent one and you get the same error

Actually, if I change the data file name in the swarm_description.py file, I get a different error that makes sense:

Traceback (most recent call last):
  File "/Users/mtaylor/nta/nupic/nupic/swarming/utils.py", line 430, in runModelGivenBaseAndParams
    (completionReason, completionMsg) = runner.run()
  File "/Users/mtaylor/nta/nupic/nupic/swarming/ModelRunner.py", line 237, in run
    maxTimeout=readTimeout)
  File "/Users/mtaylor/nta/nupic/nupic/data/stream_reader.py", line 210, in __init__
    self._openStream(dataUrl, isBlocking, maxTimeout, bookmark, firstRecordIdx)
  File "/Users/mtaylor/nta/nupic/nupic/data/stream_reader.py", line 294, in _openStream
    self._recordStoreName = findDataset(dataUrl[len(FILE_PREF):])
  File "/Users/mtaylor/nta/nupic/nupic/data/datasethelpers.py", line 79, in findDataset
    (datasetPath, os.environ.get('NTA_DATA_PATH', '')))
Exception: Unable to locate: rc-center-hourly.csv using NTA_DATA_PATH of

@pehlert
Contributor

pehlert commented Mar 24, 2015

I can second this. I have tried to run the sine example with a brand new installation of nupic (via pip) on OS X 10.10, and all I'm seeing is the JSON parser error. I have added some debugging and it boils down to jobInfo.results being None:

~/tmp/sine% python sine_experiment.py
Generating sine data into sine.csv
Generated 3000 rows of output data into sine.csv
Generating experiment files in directory: /Users/pascal/tmp/sine...
Writing  313 lines...
Writing  113 lines...
done.
None
Successfully submitted new HyperSearch job, jobID=1029
Evaluated 0 models
HyperSearch finished!
Worker completion message: None

Results from all experiments:
----------------------------------------------------------------
Generating experiment files in directory: /var/folders/41/c8y1r3yd2z50xk9fj4w1zmy40000gn/T/tmp4NLDtw...
Writing  313 lines...
Writing  113 lines...
done.
None
json.loads(jobInfo.results) raised an exception.  Here is some info to help with debugging:
jobInfo:  _jobInfoNamedTuple(jobId=1029, client=u'GRP', clientInfo=u'', clientKey=u'', cmdLine=u'$HYPERSEARCH', params=u'{"hsVersion": "v2", "maxModels": null, "persistentJobGUID": "7b545619-d24d-11e4-b6fe-600308a458fa", "useTerminators": false, "description": {"inferenceType": "TemporalAnomaly", "includedFields": [{"maxValue": 1.0, "fieldName": "sine", "fieldType": "float", "minValue": -1.0}], "inferenceArgs": {"predictionSteps": [1], "predictedField": "sine"}, "streamDef": {"info": "sine", "version": 1, "streams": [{"info": "sine.csv", "source": "file://sine.csv", "columns": ["*"]}]}, "swarmSize": "medium"}}', jobHash='{U4z\xd2M\x11\xe4\x94\xc4`\x03\x08\xa4X\xfa', status=u'notStarted', completionReason=None, completionMsg=None, workerCompletionReason=u'success', workerCompletionMsg=None, cancel=0, startTime=None, endTime=None, results=None, engJobType=u'hypersearch', minimumWorkers=1, maximumWorkers=8, priority=0, engAllocateNewWorkers=1, engUntendedDeadWorkers=0, numFailedWorkers=0, lastFailedWorkerErrorMsg=None, engCleaningStatus=u'notdone', genBaseDescription=None, genPermutations=None, engLastUpdateTime=datetime.datetime(2015, 3, 24, 17, 44, 55), engCjmConnId=None, engWorkerState=None, engStatus=None, engModelMilestones=None)
jobInfo.results:  None
EXCEPTION:  expected string or buffer
Traceback (most recent call last):
  File "sine_experiment.py", line 104, in <module>
    run_sine_experiment()
  File "sine_experiment.py", line 76, in run_sine_experiment
    model_params = swarm_over_data()
  File "sine_experiment.py", line 69, in swarm_over_data
    {'maxWorkers': 8, 'overwrite': True})
  File "/Users/pascal/Library/Python/2.7/lib/python/site-packages/nupic/swarming/permutations_runner.py", line 276, in runWithConfig
    return _runAction(runOptions)
  File "/Users/pascal/Library/Python/2.7/lib/python/site-packages/nupic/swarming/permutations_runner.py", line 217, in _runAction
    returnValue = _runHyperSearch(runOptions)
  File "/Users/pascal/Library/Python/2.7/lib/python/site-packages/nupic/swarming/permutations_runner.py", line 160, in _runHyperSearch
    metricsKeys=search.getDiscoveredMetricsKeys())
  File "/Users/pascal/Library/Python/2.7/lib/python/site-packages/nupic/swarming/permutations_runner.py", line 825, in generateReport
    results = json.loads(jobInfo.results)
  File "/Users/pascal/Library/Python/2.7/lib/python/site-packages/nupic/support/object_json.py", line 163, in loads
    json.loads(s, object_hook=objectDecoderHook, **kwargs))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 351, in loads
    return cls(encoding=encoding, **kw).decode(s)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 365, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
TypeError: expected string or buffer

@passiweinberger
Member

I am getting the exact same one on Ubuntu 14.02 LTS. Reinstalled NuPIC and still get it.


@pehlert
Contributor

pehlert commented Mar 24, 2015

I can also confirm that this happens on Ubuntu 13.10. Installing 14.10 right now to check.

@passiweinberger
Member

It probably has more to do with json... are there older alternatives? Btw. hey Pascal, regards from Pascal :P


@pehlert
Contributor

pehlert commented Mar 24, 2015

Nope, it's most likely not a JSON issue. It occurs when the runner is parsing the individual swarm job results, which are supposed to be JSON (as I understand it). However, instead of a JSON string, jobInfo.results evaluates to None. I'd expect a swarm job to have crashed without printing an error message. And cheers to Germany, Mr. Pascal ;-)

@passiweinberger
Member

completionReason=None, completionMsg=None, workerCompletionReason=u'success', workerCompletionMsg=None, cancel=0, startTime=None, endTime=None, results=None, ...
Might this cause an error? (I'm not familiar with JSON, but all the None values look suspicious.)

@rhyolight
Member Author

Guys, I think the issue you're having might have been fixed with #1902. This has not been included in a binary release yet, so to test it you'll need to compile locally following the README instructions.

@pehlert
Contributor

pehlert commented Mar 24, 2015

This has indeed solved the problem, thank you! Would you mind giving a quick explanation on what happened here?

@rhyolight
Member Author

Sure. We were using an old method of finding data files, which did not work when NuPIC was installed from a binary package because all the search paths depended on environment variables. In the PR linked above, I ripped all that out and put in place the standard Python file packaging method, so that data files packaged within the binary installation can be found.

@pehlert
Contributor

pehlert commented Mar 24, 2015

Okay, little throwback here: I tried to run my swarm script again in a new shell and it failed. This only seems to work as long as the NUPIC env variable is set to nupic's build directory.

@breznak
Member

breznak commented Mar 24, 2015

That is the point of the NUPIC variable, though.

@rhyolight
Member Author

I don't think users should need to set NUPIC in order to run swarms. The new data lookup procedure should find data that is relative to the current working directory. #1947
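For illustration, the lookup order described here might look something like the sketch below: absolute paths win, then the current working directory, then any directories listed in NTA_DATA_PATH. This is an assumed sketch, not the actual nupic implementation, and `find_dataset` is a hypothetical name.

```python
import os


def find_dataset(path):
    """Locate a dataset file without requiring the NUPIC env variable.

    Assumed lookup order: absolute path, then the current working
    directory, then each directory in NTA_DATA_PATH.
    """
    if os.path.isabs(path):
        if os.path.isfile(path):
            return path
    else:
        # Prefer files relative to where the user ran the script.
        cwd_candidate = os.path.join(os.getcwd(), path)
        if os.path.isfile(cwd_candidate):
            return cwd_candidate
        for directory in os.environ.get("NTA_DATA_PATH", "").split(os.pathsep):
            if not directory:
                continue
            candidate = os.path.join(directory, path)
            if os.path.isfile(candidate):
                return candidate
    raise IOError("Unable to locate: %s using NTA_DATA_PATH of %r"
                  % (path, os.environ.get("NTA_DATA_PATH", "")))
```

Checking the working directory first is what lets `swarm.py` find `rec-center-hourly.csv` sitting next to it, with no environment setup.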

@pehlert
Contributor

pehlert commented Mar 24, 2015

For everyone who stumbles across this, you can get swarming to work by setting the NUPIC env variable manually to the package location. If you installed nupic via pip, you can simply do this before you run your script:

export NUPIC="$(pip show nupic | grep 'Location:' | sed 's/Location: //')/nupic"

If, like me, you try to run one of the examples out there (e.g. sine prediction or the gym tutorial) under 0.2.1, also note that support for relative file paths in the swarming spec is broken in that version. Use absolute paths instead and you should be fine.

@lovekeshvig

This is due to your permissions settings; set yourself as root using sudo -s and then run it.


Lovekesh Vig
Assistant Professor
School of Computational and Integrative Sciences
Jawaharlal Nehru University

@rhyolight
Member Author

Thanks guys.


@pehlert
Contributor

pehlert commented Mar 25, 2015

@lovekeshvig Thank you, but I cannot confirm this. Running as root without having the NUPIC variable set fails just like before.

@passiweinberger
Member

This should fix it: #1968

@pehlert
Contributor

pehlert commented Apr 4, 2015

The swarming output is suppressed by assigning a temp file to the outputs here: https://github.com/numenta/nupic/blob/master/nupic/swarming/permutations_runner.py#L632

I am currently working on a better solution, let me know if you have any suggestions.
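One possible direction (a sketch of an approach, not pehlert's actual fix): instead of replacing the output stream with the temp file, tee writes to both, so worker errors still reach the console while the log file is preserved.

```python
import sys


class Tee(object):
    """Duplicate writes to several streams, e.g. the real stdout plus
    the log file that the runner currently redirects output into."""

    def __init__(self, *streams):
        self.streams = streams

    def write(self, data):
        for stream in self.streams:
            stream.write(data)

    def flush(self):
        for stream in self.streams:
            stream.flush()


# Usage sketch (logPath is illustrative):
#   sys.stdout = Tee(sys.stdout, open(logPath, "w"))
```

This way the swarm's stack traces would be visible live and still captured for later inspection.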

@andrewmalta13
Contributor

andrewmalta13 commented Jun 9, 2016

Any way to reliably reproduce this bug?

@rhyolight
Member Author

@andrewmalta13 You might try changing a line of the input CSV later in the file to a different data type, like changing a 0.45 to foo. That should throw a runtime error, and that is the type of error that's being suppressed.
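To script that reproduction, something like the helper below could corrupt one value in the input CSV. This is a hypothetical helper, and the assumption that the CSV carries three header rows (field names, field types, special flags, as in NuPIC's hotgym data) is mine.

```python
import csv


def corrupt_csv_value(path, data_row, col, bad_value="oops"):
    """Replace one cell of a CSV with a non-numeric value so that a
    swarm model hits a runtime error.  data_row counts rows after the
    three assumed header rows (field names, field types, special flags).
    """
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    rows[3 + data_row][col] = bad_value  # skip the 3 header rows
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
```

Running the swarm against the corrupted file should then raise the `could not convert string to float` error whose suppression this issue is about.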

@andrewmalta13
Contributor

@rhyolight Just tried that on examples/opf/clients/hotgym/prediction/one_gym/rec-center-hourly.csv.

I changed one of the entries in the "kw_energy_consumption" column to "oops" and ran swarm.py.

I received the stack trace:

This script runs a swarm on the input data (rec-center-hourly.csv) and
creates a model parameters file in the `model_params` directory containing
the best model found by the swarm. Dumps a bunch of crud to stdout because
that is just what swarming does at this point. You really don't need to
pay any attention to it.

=================================================
= Swarming on rec-center-hourly data...
= Medium swarm. Sit back and relax, this could take awhile.
=================================================
Generating experiment files in directory: /Users/amalta/nta/nupic/examples/opf/clients/hotgym/prediction/one_gym/swarm...
Writing  313 lines...
Writing  114 lines...
done.
None
Successfully submitted new HyperSearch job, jobID=1060
<jobID: 1060> 6  models finished [success: 0; eof: 0; stopped: 0; killed: 0; ERROR: 6; ORPHANED: 0; unknown: 0]
ERROR MESSAGE: Exception occurred while running model 21816: ValueError('could not convert string to float: oops',) (<type 'exceptions.ValueError'>)
Traceback (most recent call last):
  File "/Users/amalta/nta/nupic/src/nupic/swarming/hypersearch/utils.py", line 435, in runModelGivenBaseAndParams
    (completionReason, completionMsg) = runner.run()
  File "/Users/amalta/nta/nupic/src/nupic/swarming/ModelRunner.py", line 241, in run
    fieldStats = self._getFieldStats()
  File "/Users/amalta/nta/nupic/src/nupic/swarming/ModelRunner.py", line 546, in _getFieldStats
    curStats['min'] = self._inputSource.getFieldMin(field)
  File "/Users/amalta/nta/nupic/src/nupic/data/record_stream.py", line 372, in getFieldMin
    stats = self.getStats()
  File "/Users/amalta/nta/nupic/src/nupic/data/stream_reader.py", line 497, in getStats
    recordStoreStats = self._recordStore.getStats()
  File "/Users/amalta/nta/nupic/src/nupic/data/file_record_stream.py", line 541, in getStats
    value = self._adapters[i](f)
  File "/Users/amalta/nta/nupic/src/nupic/data/utils.py", line 88, in floatOrNone
    return float(f)
ValueError: could not convert string to float: oops

##>> UPDATED WORKER STATE: 
{   u'activeSwarms': [   u'modelParams|sensorParams|encoders|kw_energy_consumption',
                         u'modelParams|sensorParams|encoders|timestamp_dayOfWeek',
                         u'modelParams|sensorParams|encoders|timestamp_timeOfDay',
                         u'modelParams|sensorParams|encoders|timestamp_weekend'],
    u'blackListedEncoders': [],
    u'lastGoodSprint': None,
    u'lastUpdateTime': 1466191155.69385,
    u'searchOver': False,
    u'sprints': [   {   u'bestErrScore': None,
                        u'bestModelId': None,
                        u'status': u'active'}],
    u'swarms': {   u'modelParams|sensorParams|encoders|kw_energy_consumption': {   u'bestErrScore': None,
                                                                                   u'bestModelId': None,
                                                                                   u'sprintIdx': 0,
                                                                                   u'status': u'active'},
                   u'modelParams|sensorParams|encoders|timestamp_dayOfWeek': {   u'bestErrScore': None,
                                                                                 u'bestModelId': None,
                                                                                 u'sprintIdx': 0,
                                                                                 u'status': u'active'},
                   u'modelParams|sensorParams|encoders|timestamp_timeOfDay': {   u'bestErrScore': None,
                                                                                 u'bestModelId': None,
                                                                                 u'sprintIdx': 0,
                                                                                 u'status': u'active'},
                   u'modelParams|sensorParams|encoders|timestamp_weekend': {   u'bestErrScore': None,
                                                                               u'bestModelId': None,
                                                                               u'sprintIdx': 0,
                                                                               u'status': u'active'}}}
####>> UPDATED JOB RESULTS: 
{   u'absoluteFieldContributions': {   u'kw_energy_consumption': nan,
                                       u'timestamp_dayOfWeek': nan,
                                       u'timestamp_timeOfDay': nan,
                                       u'timestamp_weekend': nan},
    u'fieldContributions': {   u'kw_energy_consumption': nan,
                               u'timestamp_dayOfWeek': nan,
                               u'timestamp_timeOfDay': nan,
                               u'timestamp_weekend': nan}} (elapsed time: 1.01959 secs)
Evaluated 6 models
HyperSearch finished!
Worker completion message: E10002: Exiting due to receiving too many models failing from exceptions (6 out of 6). 
Model Exception: Exception occurred while running model 21847: ValueError('could not convert string to float: oops',) (<type 'exceptions.ValueError'>)
Traceback (most recent call last):
  File "/Users/amalta/nta/nupic/src/nupic/swarming/hypersearch/utils.py", line 435, in runModelGivenBaseAndParams
    (completionReason, completionMsg) = runner.run()
  File "/Users/amalta/nta/nupic/src/nupic/swarming/ModelRunner.py", line 241, in run
    fieldStats = self._getFieldStats()
  File "/Users/amalta/nta/nupic/src/nupic/swarming/ModelRunner.py", line 546, in _getFieldStats
    curStats['min'] = self._inputSource.getFieldMin(field)
  File "/Users/amalta/nta/nupic/src/nupic/data/record_stream.py", line 372, in getFieldMin
    stats = self.getStats()
  File "/Users/amalta/nta/nupic/src/nupic/data/stream_reader.py", line 497, in getStats
    recordStoreStats = self._recordStore.getStats()
  File "/Users/amalta/nta/nupic/src/nupic/data/file_record_stream.py", line 541, in getStats
    value = self._adapters[i](f)
  File "/Users/amalta/nta/nupic/src/nupic/data/utils.py", line 88, in floatOrNone
    return float(f)
ValueError: could not convert string to float: oops


Results from all experiments:
----------------------------------------------------------------
Generating experiment files in directory: /var/folders/lm/bgmmckjn0xq4nr9t2tbqj3900000gp/T/tmp3nJ5Se...
Writing  313 lines...
Writing  114 lines...
done.
None
Traceback (most recent call last):
  File "swarm.py", line 109, in <module>
    swarm(INPUT_FILE)
  File "swarm.py", line 101, in swarm
    modelParams = swarmForBestModelParams(SWARM_DESCRIPTION, name)
  File "swarm.py", line 78, in swarmForBestModelParams
    verbosity=0
  File "/Users/amalta/nta/nupic/src/nupic/swarming/permutations_runner.py", line 277, in runWithConfig
    return _runAction(runOptions)
  File "/Users/amalta/nta/nupic/src/nupic/swarming/permutations_runner.py", line 218, in _runAction
    returnValue = _runHyperSearch(runOptions)
  File "/Users/amalta/nta/nupic/src/nupic/swarming/permutations_runner.py", line 161, in _runHyperSearch
    metricsKeys=search.getDiscoveredMetricsKeys())
  File "/Users/amalta/nta/nupic/src/nupic/swarming/permutations_runner.py", line 825, in generateReport
    raise Exception(jobInfo.workerCompletionMsg)
Exception: E10002: Exiting due to receiving too many models failing from exceptions (6 out of 6). 
Model Exception: Exception occurred while running model 21847: ValueError('could not convert string to float: oops',) (<type 'exceptions.ValueError'>)
Traceback (most recent call last):
  File "/Users/amalta/nta/nupic/src/nupic/swarming/hypersearch/utils.py", line 435, in runModelGivenBaseAndParams
    (completionReason, completionMsg) = runner.run()
  File "/Users/amalta/nta/nupic/src/nupic/swarming/ModelRunner.py", line 241, in run
    fieldStats = self._getFieldStats()
  File "/Users/amalta/nta/nupic/src/nupic/swarming/ModelRunner.py", line 546, in _getFieldStats
    curStats['min'] = self._inputSource.getFieldMin(field)
  File "/Users/amalta/nta/nupic/src/nupic/data/record_stream.py", line 372, in getFieldMin
    stats = self.getStats()
  File "/Users/amalta/nta/nupic/src/nupic/data/stream_reader.py", line 497, in getStats
    recordStoreStats = self._recordStore.getStats()
  File "/Users/amalta/nta/nupic/src/nupic/data/file_record_stream.py", line 541, in getStats
    value = self._adapters[i](f)
  File "/Users/amalta/nta/nupic/src/nupic/data/utils.py", line 88, in floatOrNone
    return float(f)
ValueError: could not convert string to float: oops
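The root cause in the traceback above is a single non-numeric cell (`oops`) in a float column, which makes every model the swarm spawns fail. A pre-flight validation of the input CSV would surface the bad row before swarming starts. Here is a sketch: `validate_float_column` is a hypothetical helper (not part of NuPIC), and it assumes the usual NuPIC CSV layout of three header rows (field names, field types, special flags).

```python
import csv

def validate_float_column(csv_path, column, header_rows=3):
    """Report (line number, value) pairs in `column` that cannot be
    parsed as floats.

    Assumes the NuPIC-style CSV layout: the first `header_rows` rows
    are metadata (names, types, special flags), not data.
    """
    bad = []
    with open(csv_path) as f:
        reader = csv.reader(f)
        header = next(reader)           # row 1: field names
        idx = header.index(column)
        for _ in range(header_rows - 1):
            next(reader)                # skip type and flag rows
        for lineno, row in enumerate(reader, start=header_rows + 1):
            try:
                float(row[idx])
            except ValueError:
                bad.append((lineno, row[idx]))
    return bad
```

Running this against the input file and aborting when it returns a non-empty list would have reported the offending row directly instead of failing six models deep inside the workers.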

@rhyolight

Member Author

@andrewmalta13 So that did not replicate the problem. Try @lovekeshvig's suggestion above?

@andrewmalta13
Contributor

Also reports the error as I would expect:

... (omitted for length)

IOError: [Errno 2] No such file or directory: u'/Users/amalta/nta/nupic/examples/opf/clients/hotgym/prediction/one_gym/rec-center-houry.csv'
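The same fail-fast idea applies to the missing-file case above: a typo in the stream source (`rec-center-houry.csv`) only surfaces deep inside the workers. A small pre-check of the swarm description's `streamDef` could abort before any job is queued. This is a sketch, not NuPIC API: `check_input_files` is a hypothetical helper, and the dictionary shape it walks mirrors the swarm description quoted in the original report.

```python
import os
import sys

def check_input_files(swarm_description):
    """Exit with a clear message if any file:// stream source in a
    swarm description points at a file that does not exist."""
    for stream in swarm_description["streamDef"]["streams"]:
        source = stream["source"]
        if source.startswith("file://"):
            path = source[len("file://"):]
            if not os.path.isfile(path):
                sys.exit("Input file not found: %s" % path)
```

Calling this at the top of a swarm script turns a buried worker-side IOError into an immediate, readable failure.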

@andrewmalta13

andrewmalta13 commented Jun 20, 2016

@rhyolight are you sure this issue hasn't been addressed? Perhaps by this PR: #2205

@rhyolight

@andrewmalta13 It was reported on HTM Forum a month ago.

@andrewmalta13

Huh, strange. I guess I will keep trying to reproduce it.

@rhyolight

@andrewmalta13 If it gets too tedious, maybe just leave it alone until I get another report of the error; then we can both work with the affected user to try to replicate it.

@andrewmalta13

👍
