-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Pydmap Read #2
Pydmap Read #2
Conversation
Fixed the file reading error with few other error in testing and the dmap file. Added some TODO's to look into for the future redesign.
Made some minor modifications to get pydmap to work with python 3.0. This included using memoryview instead of buffer for unpacking the C structs. Using encoding on chars and strings since parsing needs a byte object. Decoding was also needed for string in unpacking data. Stream test did not work and further cleanup of the code needs to happen including a better framework for testing. Performance testing also is needed to try and speed up the reading of files.
Moved the dmap data structures: - scalars - arrays - record to its own file for modularity. Cleaned up the code: - getter/setter property (pythonic) - use ChainMap for data structure on records for performance - got rid of some unused methods
Moved read and write for damp to io for better package readbility. Previous commit also moved various packages around for better organization and readibility. Started cleaning up code with better documentation, variable names, comments and functionality.
Cleaned up dmap.py some more. The following things have changed: - added several new expections for more specific errors - cleaned up error messages to be more descriptive - renamed variable for better clarity - re-wrote log messages for better readability - some code improvements for modularity and encapsulation - byte checking - format converting More work on the reading procedure then writing procedure needs to be cleaned up still.
Did the final clean up with dmap read. Time to test it!
Restructed the testing suite file directory system and file names for better readability. Also, applied pep8 style to the io.py file and restructured how to store records from a dmap file.
Modified the io.py to change the syntax for implementing the new data structures for dmap record, scalar, and array. Dmap records in DmapRead has also changed to deque for memory efficiency and performance reasons. made some minor modifications to init py for proper importing of the classes.
All test run and pass. Added an extra sanity check for empty files. All that is left is documentation.
Documented DmapRead following numpy style guide for compilation to read the docs.
Documented the dmap data structure file and did some spell checking on io.py as well.
Keith point out some speed difference in backscatter vs. pydarn. Main difference was logging being a large factor to time. Possible solution is logging on a separate thread or let it be an option. Minor improvements were also made to pydarn to improve the overall speed of the code. WARNING: logging is commented out for now.
This commit focused on decreasing the time of the DmapRead functionality as much as possible. One problem, to reach minimum amount of time to read dmap files the logging functionality had to be removed. Further investigation is needed on making this optional. Huge improvements were, removing the amount of calls to other library calls. Some fiddling with unpack and numpy.frombuffer. Comments are left in the code indicating non-pythonic coding that allowed for a speed performance. This current version has a 42% increase in performance compared to backscatters dmap in python 2.7 and a 55% increase when compared to an earlier version of pydarn's dmap reader when only what was needed to convert backscatters dmap to python 3.0.
Updated the README for pydarn installation. Added unit tests for pydarn data structures. Update documentation in unit tests. Set logging to be disabled and reduced logging information that is not needed given a enough information in the error messages stored in logging.
Updating README to be more clear on package requirements. Also, added numpy in the setup.py install for virtual environments.
I know I'm not listed as a reviewer, but I couldn't help myself. These are mostly nit-picky comments, but hopefully helpful. Do we need to add numpy to I'll see if I can find time tomorrow to give it a whirl. |
@ksterne No worries, more the marrier for reviewing. Plus others might be busy with the Borealis project, and I would like to push this PR ASAP so I can start building off of it. Sorry, where do you see that numpy has been added to this PR? Yes, I must have written that sentence half awake or near the end of the day. I will re-look over it tomorrow. Sounds awesome! Please let me know what you think. Currently, I am working on the DEVELOPERS guide, some extra documentation and how to set up a nice test suite without having large files. |
Will support for grid files be included at some point? Also, it looks like the |
Sorry, there is two branches with the same name. Please make sure you are using the one that has the complete README file that runs with the following test code I put in the description. @egthomas you are using/looking at the dmapwrite code which is not the scope of the this PR. I left a note above dmapwrite that I need to clean it up and review it. That is legacy code from backscatter. Sorry for the confusion. |
However, I forgot to add grid files in the unit tests so thank you for pointing that out @egthomas |
Ah my mistake, thanks for the clarification! |
Hey @mts299! This looks great! Most of my comments are minor things like typos. I was able to get the example code to run. One major comment I didn't include in my review: I couldn't get the testing to work, except for For both
And finally:
|
@asreimer as the description states:
Also, DmapWrite is also not part of this PR so the Will fix |
Hmmm, the edit to my comment didn't save. I noticed your description after I posted that comment about the testing not working and editted my comment accordingly. 2am might have had something to do with it. |
@asreimer Okay, I mean |
Mainly fixed some README typos, test_dmapread.py error and added some setup features for hopefully better installation.
I tested everything using the examples in the README.md and they work fine. Good to go. That setup.cfg and MANIFEST.in are slick. Really nice work @mts299! |
pydarn_logger.error(self.message) | ||
|
||
|
||
class ZeroByteError(Exception): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this should be renamed to 'InvalidByteError' since its raised if the bytes are 0 or less than zero.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally, I like Zero
because it is concise and something may have set it to zero which might not have made the bytes invalid. However, I get what you mean with the less than zero cases, so I added NegativeByteError
which could indicate something really bad happened :/
""" | ||
class DmapArray(NamedTuple): | ||
name: str | ||
value: np.ndarray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've never seen a file that has this, but as part of the DMAP spec, its technically possible to store strings or even whole dmap records in a dmap value. I'm not sure if thats an issue if the type here is ndarray, but something to think about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First off I really would like to see a DMAP example where DMAP records are stored in records. The Spec may be outdated or have legacy code that I would not like to introduce into newer code. Especially if the legacy introduces potential bugs.
I am more than happy to go back if a given example is present in the current DMAP format.
With that said, ndarrays
can hold any type of object including strings and DmapRecords so there is no actual issue here (The bittersweet of python :p ). It just informs developers that the object field should be a data type of ndarray. Technically if they store a list or something else nothing will really happen because it is python. Just later on in the downstream something might throw a TypeError
but that is the pythonic way.
I think in later python versions type hinting might actually enforce type checking.
dmap_data = dmap_reader.read_records() | ||
|
||
# dmap_data[record number][paramter name].value | ||
print(dmap_data[0]['bmnum'].value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to create a getter or some sort of decorator so that the user doesn't have to explicitly call .value on every data value they want? ex. if i call dmap_data[0]['bm_num'] then under the hood it returns me dmap_data[0]['bm_num'].value by default? I don't think this affects too much, but if you are writing a large script or something you are going to have a lot of extraneous looking code for .value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally, I like the .value
because it is explicit and shows self-documenting code (best practices). It also keeps things consistent. Especially with the complaints that RST and DaVitpy have gotten with unclear variable names and code.
For example:
for i in range(record[0]['beam'].shape[1] ):
amplify *= record[0]['beam'].value
vs.
for i in range(record[0]['beam'].shape[1] ):
amplify *= record[0]['beam']
The former is clearer and easier to understand at first glance (which is me in debugging mode) and the latter is more confusing or black magic.
Or to avoid extraneous looking code you can just store it in a variable ahead of time and then use it later to make the lines smaller.
For Example:
Plot (record[0][velocity'].value, record[0]['time'].value, default .... plot ... settings)
vs.
velocity_data = record[0][velocity'].value
time_data = record[0]['time'].value
Plot (velocity_data, time_data, default .... plot ... settings)
The latter is also considered cleaner code too. I mean at the end of the day you are going to have to access the record[0][velocity']
to get most of your data (assuming no loops are involved to cut some stuff down) so an extra .value
especially with autocomplete seems okay to me :)
Also, I don't know if you can make a getter function/decorator with a collection containing a collection. You are potentially looking at some overhead with a getter method and that means a decrease in performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. I think this is partly a preference thing and I don't think there are necessarily any right or wrong answers here. I see a couple of thumbs ups on my suggestion, so maybe some others can comment on what they think since this is ultimately a decision on top level functionality. Personally, I still stick by my suggestion as a preference provided there is no performance hits.
I don't know if this is the appropriate place to post these links, but in one of the review or code comments I thought I saw the general question "What is dmap format?". Here are a few short guides/tutorials/documentation written by the original developer (Rob Barnes) which may be of some general help: https://superdarn.github.io/rst/superdarn/src.doc/rfc/0006.html |
@egthomas I'm not sure, but with Borealis, we usually use the issue tracker and use the description tags when we ask our questions or comments. "What is dmap format" would be a good issue with the question tag. |
Other than maybe moving that data question to an issue or something, I think everything else is resolved. I can't submit an approval, so I'm just writing it here. |
I don't know where it was in the 100+ comments, but somewhere I'd like to echo Keith's sentiment that ideally the dmap I/O routines should be as general as possible and therefore able to support real-time data streams and files with different formats from the standard SuperDARN rawacf, fitacf, etc. |
@egthomas I was not disagreeing, I have not tested streaming and focused more on file reading. I will be revisiting streaming in the test suite. There is also talk about supporting other file formats. |
@mts299 sorry I wasn't trying to imply anyone was disagreeing, just having trouble keeping up with all of the messages coming through my inbox. This is my last comment on this PR request. |
From Issue #2 suggestion, added `__str__` and `__repr__` for better printing of the object that is being used. Only dmap classes were changed because inherited classes will obtain the same attributes for printing of the object but the class name will be adjusted to be same as the actual class and not the dmap classes.
From Issue #2 suggestion, added `__str__` and `__repr__` for better printing of the object that is being used. Only dmap classes were changed because inherited classes will obtain the same attributes for printing of the object but the class name will be adjusted to be same as the actual class and not the dmap classes.
From Issue #2 suggestion, added `__str__` and `__repr__` for better printing of the object that is being used. Only dmap classes were changed because inherited classes will obtain the same attributes for printing of the object but the class name will be adjusted to be same as the actual class and not the dmap classes.
From Issue #2 suggestion, added `__str__` and `__repr__` for better printing of the object that is being used. Only dmap classes were changed because inherited classes will obtain the same attributes for printing of the object but the class name will be adjusted to be same as the actual class and not the dmap classes.
From Issue #2 suggestion, added `__str__` and `__repr__` for better printing of the object that is being used. Only dmap classes were changed because inherited classes will obtain the same attributes for printing of the object but the class name will be adjusted to be same as the actual class and not the dmap classes.
From Issue #2 suggestion, added `__str__` and `__repr__` for better printing of the object that is being used. Only dmap classes were changed because inherited classes will obtain the same attributes for printing of the object but the class name will be adjusted to be same as the actual class and not the dmap classes.
First pydarn functionality!
DmapRead allows for users to read in dmap files and obtain records in a list of dictionaries.
The original code of dmap was taken from https://github.com/SuperDARNCanada/backscatter
and updated to python 3.6 and improved upon for speed and readability.
README is updated for installation notes and example code.
To test:
This PR is to focus on the DmapRead functionality and the README.
Testing suite setup, performance test suite, DEVELOPERS.md, and DmapWrite are for other PRs