
Expose swarming job errors and stacktraces when they fail #1815

Closed
rhyolight opened this issue Feb 9, 2015 · 34 comments · Fixed by #3812

Comments

@rhyolight
Member

As reported initially in #1717, swarming jobs that fail sometimes surface a JSON parse error. This stops the entire swarm and dumps an error like this:

Results from all experiments:
----------------------------------------------------------------
Generating experiment files in directory: /tmp/tmpwVsELA...
Writing  313 lines...
Writing  113 lines...
done.
None
json.loads(jobInfo.results) raised an exception.  Here is some info to help with debugging:
jobInfo:  _jobInfoNamedTuple(jobId=1006, client=u'GRP', clientInfo=u'', clientKey=u'', cmdLine=u'$HYPERSEARCH', params=u'{"hsVersion": "v2", "maxModels": null, "persistentJobGUID": "351cf396-8d38-11e4-a734-685d43b983b8", "useTerminators": false, "description": {"includedFields": [{"fieldName": "timestamp", "fieldType": "datetime"}, {"maxValue": 53.0, "fieldName": "kw_energy_consumption", "fieldType": "float", "minValue": 0.0}], "streamDef": {"info": "kw_energy_consumption", "version": 1, "streams": [{"info": "Rec Center", "source": "file://rec-center-hourly.csv", "columns": ["*"]}]}, "inferenceType": "TemporalMultiStep", "inferenceArgs": {"predictionSteps": [1], "predictedField": "kw_energy_consumption"}, "iterationCount": -1, "swarmSize": "medium"}}', jobHash='5\x1c\xfa\x9e\x8d8\x11\xe4\xa74h]C\xb9\x83\xb8', status=u'notStarted', completionReason=None, completionMsg=None, workerCompletionReason=u'success', workerCompletionMsg=None, cancel=0, startTime=None, endTime=None, results=None, engJobType=u'hypersearch', minimumWorkers=1, maximumWorkers=4, priority=0, engAllocateNewWorkers=1, engUntendedDeadWorkers=0, numFailedWorkers=0, lastFailedWorkerErrorMsg=None, engCleaningStatus=u'notdone', genBaseDescription=None, genPermutations=None, engLastUpdateTime=datetime.datetime(2014, 12, 26, 19, 48, 47), engCjmConnId=None, engWorkerState=None,
engStatus=None, engModelMilestones=None)
jobInfo.results:  None
EXCEPTION:  expected string or buffer
Traceback (most recent call last):
  File "swarm.py", line 109, in <module>
    swarm(INPUT_FILE)
  File "swarm.py", line 101, in swarm
    modelParams = swarmForBestModelParams(SWARM_DESCRIPTION, name)
  File "swarm.py", line 78, in swarmForBestModelParams
    verbosity=0
  File "/usr/lib/python2.7/site-packages/nupic/swarming/permutations_runner.py", line 276, in runWithConfig
    return _runAction(runOptions)
  File "/usr/lib/python2.7/site-packages/nupic/swarming/permutations_runner.py", line 217, in _runAction
    returnValue = _runHyperSearch(runOptions)
  File "/usr/lib/python2.7/site-packages/nupic/swarming/permutations_runner.py", line 160, in _runHyperSearch
    metricsKeys=search.getDiscoveredMetricsKeys())
  File "/usr/lib/python2.7/site-packages/nupic/swarming/permutations_runner.py", line 825, in generateReport
    results = json.loads(jobInfo.results)
  File "/usr/lib/python2.7/site-packages/nupic/support/object_json.py", line 163, in loads
    json.loads(s, object_hook=objectDecoderHook, **kwargs))
  File "/usr/lib/python2.7/json/__init__.py", line 351, in loads
    return cls(encoding=encoding, **kw).decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
TypeError: expected string or buffer

This error is misleading because it looks like a JSON parsing error, but it is really caused by one of the swarm jobs failing: the swarming system is not extracting the error properly from that job and displaying it to the user. In this case the jobInfo object returned from the swarm job has no results object, which triggers the error above.

The program should report the original error from the swarm job back to the user instead of this worthless stacktrace.
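For illustration, here is a minimal sketch of what the report-generation step could do instead: check `jobInfo.results` before parsing and surface the job's own failure fields. This is not the actual nupic code; `load_job_results` is a hypothetical helper, and the field names (`results`, `completionMsg`, `lastFailedWorkerErrorMsg`) are taken from the jobInfo dump above.

```python
import json


def load_job_results(jobInfo):
    """Hypothetical replacement for the bare json.loads(jobInfo.results).

    If the job produced no results, raise an error built from the job's
    own failure fields instead of letting json.loads fail with a
    confusing TypeError.
    """
    if jobInfo.results is None:
        # Prefer the worker's own error message; fall back to the
        # job-level completion message.
        reason = (jobInfo.lastFailedWorkerErrorMsg
                  or jobInfo.completionMsg
                  or "no error message recorded")
        raise RuntimeError(
            "Swarm job %s produced no results: %s" % (jobInfo.jobId, reason))
    return json.loads(jobInfo.results)
```

With something like this in place, a failed job would raise "Swarm job 1006 produced no results: ..." carrying the worker's original error, rather than `TypeError: expected string or buffer`.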

@breznak
Member

breznak commented Feb 9, 2015

Maybe this is another issue, but shouldn't an error in a single swarming thread avoid killing the whole swarming process?

@lovekeshvig

This seems to indicate that the swarm is unable to locate the data file; try changing the data filename to a nonexistent one and you get the same error.

@lovekeshvig

Any suggestions on how to rectify this?

@rhyolight
Member Author

@lovekeshvig Sorry, been on vacation for the past week. Just catching up. This could be related to #1805. What happens if you do this?

export NUPIC=/path/to/nupic
export NTA_DATA_PATH=/path/to/nupic/examples/prediction/data

@rhyolight rhyolight self-assigned this Feb 23, 2015
@rhyolight rhyolight modified the milestones: 0.7.0, 0.3.0 Feb 25, 2015
@rhyolight
Member Author

@lovekeshvig ping?

@rhyolight
Member Author

try changing the data filename to a nonexistent one and you get the same error

Actually, if I change the data file name in the swarm_description.py file, I get a different error that makes sense:

Traceback (most recent call last):
  File "/Users/mtaylor/nta/nupic/nupic/swarming/utils.py", line 430, in runModelGivenBaseAndParams
    (completionReason, completionMsg) = runner.run()
  File "/Users/mtaylor/nta/nupic/nupic/swarming/ModelRunner.py", line 237, in run
    maxTimeout=readTimeout)
  File "/Users/mtaylor/nta/nupic/nupic/data/stream_reader.py", line 210, in __init__
    self._openStream(dataUrl, isBlocking, maxTimeout, bookmark, firstRecordIdx)
  File "/Users/mtaylor/nta/nupic/nupic/data/stream_reader.py", line 294, in _openStream
    self._recordStoreName = findDataset(dataUrl[len(FILE_PREF):])
  File "/Users/mtaylor/nta/nupic/nupic/data/datasethelpers.py", line 79, in findDataset
    (datasetPath, os.environ.get('NTA_DATA_PATH', '')))
Exception: Unable to locate: rc-center-hourly.csv using NTA_DATA_PATH of

@pehlert
Contributor

pehlert commented Mar 24, 2015

I can second this. I have tried to run the sine example with a brand new installation of nupic (via pip) on OS X 10.10, and all I'm seeing is the JSON parser error. I have added some debugging and it boils down to jobInfo.results being None:

~/tmp/sine% python sine_experiment.py
Generating sine data into sine.csv
Generated 3000 rows of output data into sine.csv
Generating experiment files in directory: /Users/pascal/tmp/sine...
Writing  313 lines...
Writing  113 lines...
done.
None
Successfully submitted new HyperSearch job, jobID=1029
Evaluated 0 models
HyperSearch finished!
Worker completion message: None

Results from all experiments:
----------------------------------------------------------------
Generating experiment files in directory: /var/folders/41/c8y1r3yd2z50xk9fj4w1zmy40000gn/T/tmp4NLDtw...
Writing  313 lines...
Writing  113 lines...
done.
None
json.loads(jobInfo.results) raised an exception.  Here is some info to help with debugging:
jobInfo:  _jobInfoNamedTuple(jobId=1029, client=u'GRP', clientInfo=u'', clientKey=u'', cmdLine=u'$HYPERSEARCH', params=u'{"hsVersion": "v2", "maxModels": null, "persistentJobGUID": "7b545619-d24d-11e4-b6fe-600308a458fa", "useTerminators": false, "description": {"inferenceType": "TemporalAnomaly", "includedFields": [{"maxValue": 1.0, "fieldName": "sine", "fieldType": "float", "minValue": -1.0}], "inferenceArgs": {"predictionSteps": [1], "predictedField": "sine"}, "streamDef": {"info": "sine", "version": 1, "streams": [{"info": "sine.csv", "source": "file://sine.csv", "columns": ["*"]}]}, "swarmSize": "medium"}}', jobHash='{U4z\xd2M\x11\xe4\x94\xc4`\x03\x08\xa4X\xfa', status=u'notStarted', completionReason=None, completionMsg=None, workerCompletionReason=u'success', workerCompletionMsg=None, cancel=0, startTime=None, endTime=None, results=None, engJobType=u'hypersearch', minimumWorkers=1, maximumWorkers=8, priority=0, engAllocateNewWorkers=1, engUntendedDeadWorkers=0, numFailedWorkers=0, lastFailedWorkerErrorMsg=None, engCleaningStatus=u'notdone', genBaseDescription=None, genPermutations=None, engLastUpdateTime=datetime.datetime(2015, 3, 24, 17, 44, 55), engCjmConnId=None, engWorkerState=None, engStatus=None, engModelMilestones=None)
jobInfo.results:  None
EXCEPTION:  expected string or buffer
Traceback (most recent call last):
  File "sine_experiment.py", line 104, in <module>
    run_sine_experiment()
  File "sine_experiment.py", line 76, in run_sine_experiment
    model_params = swarm_over_data()
  File "sine_experiment.py", line 69, in swarm_over_data
    {'maxWorkers': 8, 'overwrite': True})
  File "/Users/pascal/Library/Python/2.7/lib/python/site-packages/nupic/swarming/permutations_runner.py", line 276, in runWithConfig
    return _runAction(runOptions)
  File "/Users/pascal/Library/Python/2.7/lib/python/site-packages/nupic/swarming/permutations_runner.py", line 217, in _runAction
    returnValue = _runHyperSearch(runOptions)
  File "/Users/pascal/Library/Python/2.7/lib/python/site-packages/nupic/swarming/permutations_runner.py", line 160, in _runHyperSearch
    metricsKeys=search.getDiscoveredMetricsKeys())
  File "/Users/pascal/Library/Python/2.7/lib/python/site-packages/nupic/swarming/permutations_runner.py", line 825, in generateReport
    results = json.loads(jobInfo.results)
  File "/Users/pascal/Library/Python/2.7/lib/python/site-packages/nupic/support/object_json.py", line 163, in loads
    json.loads(s, object_hook=objectDecoderHook, **kwargs))
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 351, in loads
    return cls(encoding=encoding, **kw).decode(s)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 365, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
TypeError: expected string or buffer

@passiweinberger
Member

I am getting the exact same one on Ubuntu 14.02 LTS. Reinstalled NuPIC and still get it.


@pehlert
Contributor

pehlert commented Mar 24, 2015

I can also confirm that this happens on Ubuntu 13.10. Installing 14.10 right now to check.

@passiweinberger
Member

It probably has more to do with json... are there older alternatives? Btw. hey Pascal, regards from Pascal :P


@pehlert
Contributor

pehlert commented Mar 24, 2015

Nope, it's most likely not a JSON issue. It occurs when the runner is parsing the individual swarm job results, which are supposed to be JSON (as I understand it). However, instead of a JSON string, jobInfo.results evaluates to None. I'd expect a swarm job to have crashed without printing an error message. And cheers to Germany, Mr. Pascal ;-)

@passiweinberger
Member

completionReason=None, completionMsg=None, workerCompletionReason=u'success', workerCompletionMsg=None, cancel=0, startTime=None, endTime=None, results=None, ...
Might this cause an error? (I'm not familiar with JSON, but all the None values look suspicious.)

@rhyolight
Member Author

Guys, I think the issue you're having might have been fixed with #1902. This has not been included in a binary release yet, so to test it you'll need to compile locally following the README instructions.

@pehlert
Contributor

pehlert commented Mar 24, 2015

This has indeed solved the problem, thank you! Would you mind giving a quick explanation on what happened here?

@rhyolight
Member Author

Sure. We were using an old method of finding data files, which did not work when NuPIC was installed from a binary package because all the search paths depended on environment variables. In the PR linked above, I ripped all that out and put in place the standard Python file packaging method, so that data files packaged within the binary installation can be found.

@pehlert
Contributor

pehlert commented Mar 24, 2015

Okay, little throwback here: I tried to run my swarm script again in a new shell and it failed. This only seems to work as long as the NUPIC env variable is set to nupic's build directory.

@breznak
Member

breznak commented Mar 24, 2015

That is the point of the NUPIC variable, though.

@rhyolight
Member Author

I don't think users should need to set NUPIC in order to run swarms. The new data lookup procedure should find data that is relative to the current working directory. #1947
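For illustration, the lookup order described here might look something like the sketch below: absolute paths win, then the current working directory, then any directories listed in NTA_DATA_PATH. This is an assumed sketch, not the actual nupic implementation, and `find_dataset` is a hypothetical name.

```python
import os


def find_dataset(path):
    """Locate a dataset file without requiring the NUPIC env variable.

    Assumed lookup order: absolute path, then the current working
    directory, then each directory in NTA_DATA_PATH.
    """
    if os.path.isabs(path):
        if os.path.isfile(path):
            return path
    else:
        # Prefer files relative to where the user ran the script.
        cwd_candidate = os.path.join(os.getcwd(), path)
        if os.path.isfile(cwd_candidate):
            return cwd_candidate
        for directory in os.environ.get("NTA_DATA_PATH", "").split(os.pathsep):
            if not directory:
                continue
            candidate = os.path.join(directory, path)
            if os.path.isfile(candidate):
                return candidate
    raise IOError("Unable to locate: %s using NTA_DATA_PATH of %r"
                  % (path, os.environ.get("NTA_DATA_PATH", "")))
```

Checking the working directory first is what lets `swarm.py` find `rec-center-hourly.csv` sitting next to it, with no environment setup.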

@pehlert
Contributor

pehlert commented Mar 24, 2015

For everyone who stumbles across this, you can get swarming to work by setting the NUPIC env variable manually to the package location. If you installed nupic via pip, you can simply do this before you run your script:

export NUPIC="$(pip show nupic | grep 'Location:' | sed 's/Location: //')/nupic"

If, like me, you try to run one of the examples out there (e.g. sine prediction or the gym tutorial) under 0.2.1, also note that support for relative file paths in the swarming spec is broken in that version. Use absolute paths instead and you should be fine.

@lovekeshvig

This is due to your permissions settings; set yourself as root using sudo -s and then run it.


Lovekesh Vig
Assistant Professor
School of Computational and Integrative Sciences
Jawaharlal Nehru University

@rhyolight
Member Author

Thanks guys.


@pehlert
Contributor

pehlert commented Mar 25, 2015

@lovekeshvig Thank you, but I cannot confirm this. Running as root without having the NUPIC variable set fails just like before.

@passiweinberger
Member

This should fix it: #1968

@pehlert
Contributor

pehlert commented Apr 4, 2015

The swarming output is suppressed by assigning a temp file to the outputs here: https://github.com/numenta/nupic/blob/master/nupic/swarming/permutations_runner.py#L632

I am currently working on a better solution, let me know if you have any suggestions.
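One possible direction (a sketch of an approach, not pehlert's actual fix): instead of replacing the output stream with the temp file, tee writes to both, so worker errors still reach the console while the log file is preserved.

```python
import sys


class Tee(object):
    """Duplicate writes to several streams, e.g. the real stdout plus
    the log file that the runner currently redirects output into."""

    def __init__(self, *streams):
        self.streams = streams

    def write(self, data):
        for stream in self.streams:
            stream.write(data)

    def flush(self):
        for stream in self.streams:
            stream.flush()


# Usage sketch (logPath is illustrative):
#   sys.stdout = Tee(sys.stdout, open(logPath, "w"))
```

This way the swarm's stack traces would be visible live and still captured for later inspection.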

@andrewmalta13
Contributor

andrewmalta13 commented Jun 9, 2016

Any way to reliably reproduce this bug?

@rhyolight
Member Author

@andrewmalta13 You might try changing a line of the input CSV later in the file to a different data type, like changing a 0.45 to foo. That should throw a runtime error, and that is the type of error that's being suppressed.
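To script that reproduction, something like the helper below could corrupt one value in the input CSV. This is a hypothetical helper, and the assumption that the CSV carries three header rows (field names, field types, special flags, as in NuPIC's hotgym data) is mine.

```python
import csv


def corrupt_csv_value(path, data_row, col, bad_value="oops"):
    """Replace one cell of a CSV with a non-numeric value so that a
    swarm model hits a runtime error.  data_row counts rows after the
    three assumed header rows (field names, field types, special flags).
    """
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    rows[3 + data_row][col] = bad_value  # skip the 3 header rows
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
```

Running the swarm against the corrupted file should then raise the `could not convert string to float` error whose suppression this issue is about.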

@andrewmalta13
Contributor

@rhyolight Just tried that on examples/opf/clients/hotgym/prediction/one_gym/rec-center-hourly.csv.

I changed one of the entries in the "kw_energy_consumption" column to "oops" and ran swarm.py.

I received the stack trace:

This script runs a swarm on the input data (rec-center-hourly.csv) and
creates a model parameters file in the `model_params` directory containing
the best model found by the swarm. Dumps a bunch of crud to stdout because
that is just what swarming does at this point. You really don't need to
pay any attention to it.

=================================================
= Swarming on rec-center-hourly data...
= Medium swarm. Sit back and relax, this could take awhile.
=================================================
Generating experiment files in directory: /Users/amalta/nta/nupic/examples/opf/clients/hotgym/prediction/one_gym/swarm...
Writing  313 lines...
Writing  114 lines...
done.
None
Successfully submitted new HyperSearch job, jobID=1060
<jobID: 1060> 6  models finished [success: 0; eof: 0; stopped: 0; killed: 0; ERROR: 6; ORPHANED: 0; unknown: 0]
ERROR MESSAGE: Exception occurred while running model 21816: ValueError('could not convert string to float: oops',) (<type 'exceptions.ValueError'>)
Traceback (most recent call last):
  File "/Users/amalta/nta/nupic/src/nupic/swarming/hypersearch/utils.py", line 435, in runModelGivenBaseAndParams
    (completionReason, completionMsg) = runner.run()
  File "/Users/amalta/nta/nupic/src/nupic/swarming/ModelRunner.py", line 241, in run
    fieldStats = self._getFieldStats()
  File "/Users/amalta/nta/nupic/src/nupic/swarming/ModelRunner.py", line 546, in _getFieldStats
    curStats['min'] = self._inputSource.getFieldMin(field)
  File "/Users/amalta/nta/nupic/src/nupic/data/record_stream.py", line 372, in getFieldMin
    stats = self.getStats()
  File "/Users/amalta/nta/nupic/src/nupic/data/stream_reader.py", line 497, in getStats
    recordStoreStats = self._recordStore.getStats()
  File "/Users/amalta/nta/nupic/src/nupic/data/file_record_stream.py", line 541, in getStats
    value = self._adapters[i](f)
  File "/Users/amalta/nta/nupic/src/nupic/data/utils.py", line 88, in floatOrNone
    return float(f)
ValueError: could not convert string to float: oops

##>> UPDATED WORKER STATE: 
{   u'activeSwarms': [   u'modelParams|sensorParams|encoders|kw_energy_consumption',
                         u'modelParams|sensorParams|encoders|timestamp_dayOfWeek',
                         u'modelParams|sensorParams|encoders|timestamp_timeOfDay',
                         u'modelParams|sensorParams|encoders|timestamp_weekend'],
    u'blackListedEncoders': [],
    u'lastGoodSprint': None,
    u'lastUpdateTime': 1466191155.69385,
    u'searchOver': False,
    u'sprints': [   {   u'bestErrScore': None,
                        u'bestModelId': None,
                        u'status': u'active'}],
    u'swarms': {   u'modelParams|sensorParams|encoders|kw_energy_consumption': {   u'bestErrScore': None,
                                                                                   u'bestModelId': None,
                                                                                   u'sprintIdx': 0,
                                                                                   u'status': u'active'},
                   u'modelParams|sensorParams|encoders|timestamp_dayOfWeek': {   u'bestErrScore': None,
                                                                                 u'bestModelId': None,
                                                                                 u'sprintIdx': 0,
                                                                                 u'status': u'active'},
                   u'modelParams|sensorParams|encoders|timestamp_timeOfDay': {   u'bestErrScore': None,
                                                                                 u'bestModelId': None,
                                                                                 u'sprintIdx': 0,
                                                                                 u'status': u'active'},
                   u'modelParams|sensorParams|encoders|timestamp_weekend': {   u'bestErrScore': None,
                                                                               u'bestModelId': None,
                                                                               u'sprintIdx': 0,
                                                                               u'status': u'active'}}}
####>> UPDATED JOB RESULTS: 
{   u'absoluteFieldContributions': {   u'kw_energy_consumption': nan,
                                       u'timestamp_dayOfWeek': nan,
                                       u'timestamp_timeOfDay': nan,
                                       u'timestamp_weekend': nan},
    u'fieldContributions': {   u'kw_energy_consumption': nan,
                               u'timestamp_dayOfWeek': nan,
                               u'timestamp_timeOfDay': nan,
                               u'timestamp_weekend': nan}} (elapsed time: 1.01959 secs)
Evaluated 6 models
HyperSearch finished!
Worker completion message: E10002: Exiting due to receiving too many models failing from exceptions (6 out of 6). 
Model Exception: Exception occurred while running model 21847: ValueError('could not convert string to float: oops',) (<type 'exceptions.ValueError'>)
Traceback (most recent call last):
  File "/Users/amalta/nta/nupic/src/nupic/swarming/hypersearch/utils.py", line 435, in runModelGivenBaseAndParams
    (completionReason, completionMsg) = runner.run()
  File "/Users/amalta/nta/nupic/src/nupic/swarming/ModelRunner.py", line 241, in run
    fieldStats = self._getFieldStats()
  File "/Users/amalta/nta/nupic/src/nupic/swarming/ModelRunner.py", line 546, in _getFieldStats
    curStats['min'] = self._inputSource.getFieldMin(field)
  File "/Users/amalta/nta/nupic/src/nupic/data/record_stream.py", line 372, in getFieldMin
    stats = self.getStats()
  File "/Users/amalta/nta/nupic/src/nupic/data/stream_reader.py", line 497, in getStats
    recordStoreStats = self._recordStore.getStats()
  File "/Users/amalta/nta/nupic/src/nupic/data/file_record_stream.py", line 541, in getStats
    value = self._adapters[i](f)
  File "/Users/amalta/nta/nupic/src/nupic/data/utils.py", line 88, in floatOrNone
    return float(f)
ValueError: could not convert string to float: oops


Results from all experiments:
----------------------------------------------------------------
Generating experiment files in directory: /var/folders/lm/bgmmckjn0xq4nr9t2tbqj3900000gp/T/tmp3nJ5Se...
Writing  313 lines...
Writing  114 lines...
done.
None
Traceback (most recent call last):
  File "swarm.py", line 109, in <module>
    swarm(INPUT_FILE)
  File "swarm.py", line 101, in swarm
    modelParams = swarmForBestModelParams(SWARM_DESCRIPTION, name)
  File "swarm.py", line 78, in swarmForBestModelParams
    verbosity=0
  File "/Users/amalta/nta/nupic/src/nupic/swarming/permutations_runner.py", line 277, in runWithConfig
    return _runAction(runOptions)
  File "/Users/amalta/nta/nupic/src/nupic/swarming/permutations_runner.py", line 218, in _runAction
    returnValue = _runHyperSearch(runOptions)
  File "/Users/amalta/nta/nupic/src/nupic/swarming/permutations_runner.py", line 161, in _runHyperSearch
    metricsKeys=search.getDiscoveredMetricsKeys())
  File "/Users/amalta/nta/nupic/src/nupic/swarming/permutations_runner.py", line 825, in generateReport
    raise Exception(jobInfo.workerCompletionMsg)
Exception: E10002: Exiting due to receiving too many models failing from exceptions (6 out of 6). 
Model Exception: Exception occurred while running model 21847: ValueError('could not convert string to float: oops',) (<type 'exceptions.ValueError'>)
Traceback (most recent call last):
  File "/Users/amalta/nta/nupic/src/nupic/swarming/hypersearch/utils.py", line 435, in runModelGivenBaseAndParams
    (completionReason, completionMsg) = runner.run()
  File "/Users/amalta/nta/nupic/src/nupic/swarming/ModelRunner.py", line 241, in run
    fieldStats = self._getFieldStats()
  File "/Users/amalta/nta/nupic/src/nupic/swarming/ModelRunner.py", line 546, in _getFieldStats
    curStats['min'] = self._inputSource.getFieldMin(field)
  File "/Users/amalta/nta/nupic/src/nupic/data/record_stream.py", line 372, in getFieldMin
    stats = self.getStats()
  File "/Users/amalta/nta/nupic/src/nupic/data/stream_reader.py", line 497, in getStats
    recordStoreStats = self._recordStore.getStats()
  File "/Users/amalta/nta/nupic/src/nupic/data/file_record_stream.py", line 541, in getStats
    value = self._adapters[i](f)
  File "/Users/amalta/nta/nupic/src/nupic/data/utils.py", line 88, in floatOrNone
    return float(f)
ValueError: could not convert string to float: oops
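The root cause in the traceback above is a single non-numeric cell (`oops`) in a float column, which makes every model the swarm spawns fail. A pre-flight validation of the input CSV would surface the bad row before swarming starts. Here is a sketch: `validate_float_column` is a hypothetical helper (not part of NuPIC), and it assumes the usual NuPIC CSV layout of three header rows (field names, field types, special flags).

```python
import csv

def validate_float_column(csv_path, column, header_rows=3):
    """Report (line number, value) pairs in `column` that cannot be
    parsed as floats.

    Assumes the NuPIC-style CSV layout: the first `header_rows` rows
    are metadata (names, types, special flags), not data.
    """
    bad = []
    with open(csv_path) as f:
        reader = csv.reader(f)
        header = next(reader)           # row 1: field names
        idx = header.index(column)
        for _ in range(header_rows - 1):
            next(reader)                # skip type and flag rows
        for lineno, row in enumerate(reader, start=header_rows + 1):
            try:
                float(row[idx])
            except ValueError:
                bad.append((lineno, row[idx]))
    return bad
```

Running this against the input file and aborting when it returns a non-empty list would have reported the offending row directly instead of failing six models deep inside the workers.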

@rhyolight

Member Author

@andrewmalta13 So that did not replicate the problem. Try @lovekeshvig's suggestion above?

@andrewmalta13
Contributor

Also reports the error as I would expect:

... (omitted for length)

IOError: [Errno 2] No such file or directory: u'/Users/amalta/nta/nupic/examples/opf/clients/hotgym/prediction/one_gym/rec-center-houry.csv'
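The same fail-fast idea applies to the missing-file case above: a typo in the stream source (`rec-center-houry.csv`) only surfaces deep inside the workers. A small pre-check of the swarm description's `streamDef` could abort before any job is queued. This is a sketch, not NuPIC API: `check_input_files` is a hypothetical helper, and the dictionary shape it walks mirrors the swarm description quoted in the original report.

```python
import os
import sys

def check_input_files(swarm_description):
    """Exit with a clear message if any file:// stream source in a
    swarm description points at a file that does not exist."""
    for stream in swarm_description["streamDef"]["streams"]:
        source = stream["source"]
        if source.startswith("file://"):
            path = source[len("file://"):]
            if not os.path.isfile(path):
                sys.exit("Input file not found: %s" % path)
```

Calling this at the top of a swarm script turns a buried worker-side IOError into an immediate, readable failure.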

@andrewmalta13

andrewmalta13 commented Jun 20, 2016

@rhyolight are you sure this issue hasn't been addressed? Perhaps by this PR: #2205

@rhyolight

@andrewmalta13 It was reported on HTM Forum a month ago.

@andrewmalta13

Huh, strange. I guess I will keep trying to reproduce it.

@rhyolight

@andrewmalta13 If it gets too tedious, maybe just leave it alone until I get another report of the error; then we can both work with the affected user to try to replicate it.

@andrewmalta13

👍
