Remove corrupted lines in xvg files #126

hannahbaumann · 2021-05-03T21:27:56Z

Hi,

when I want to analyze free energy differences while simulations are still running, the last line of the xvg files is often corrupted (not fully written yet) and alchemlyb fails to do the analysis. Alchemical analysis has a feature that repairs those files, so I usually run that first and then run alchemlyb on the repaired xvg files. Is it possible to move that feature into alchemlyb?
https://github.com/MobleyLab/alchemical-analysis/blob/master/alchemical_analysis/utils/corruptxvg.py

orbeckst · 2021-05-03T23:59:07Z

@hannahbaumann how would you want this feature to work, if it were in alchemlyb? Can you outline Python code?

hannahbaumann · 2021-05-04T00:21:32Z

I think for right now it would be enough if it checked whether the last line has the correct length and removed it if it is too short. This would be similar to the removeCorruptLines function in this alchemical-analysis script: https://github.com/MobleyLab/alchemical-analysis/blob/master/alchemical_analysis/utils/corruptxvg.py
The function gets the length of the data from the .xvg header and then checks the length of (in this case) all lines, but could also just be the last line.
Another issue that I've had in the past was that I accidentally restarted a simulation although it was still running and both simulations appended the data to the same .xvg file, resulting in duplicates in the file. In that case it would be helpful if alchemlyb can detect the duplicates and remove one of them. But I haven't written the code for that scenario yet.
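
The column-count check described above could be sketched roughly as follows. This is only an illustration, not the code from corruptxvg.py: the function name `remove_corrupt_lines` is hypothetical, and it assumes that the expected number of columns is one x column (time) plus one column per `@ sN legend` line in the header.

```python
import re

# Matches XVG legend lines such as: @ s0 legend "dH/dl"
LEGEND = re.compile(r"^@\s*s\d+\s+legend")


def remove_corrupt_lines(lines):
    """Return the lines of an XVG file without data rows of the wrong length.

    Hypothetical helper. Assumption: the number of data columns equals
    1 (the time column) + the number of '@ sN legend' header lines.
    """
    n_legends = sum(1 for line in lines if LEGEND.match(line))
    expected = n_legends + 1

    kept = []
    for line in lines:
        if line.startswith(("#", "@")) or not line.strip():
            # Header, comment and blank lines are passed through unchanged.
            kept.append(line)
        elif len(line.split()) == expected:
            # Data row with the expected number of fields.
            kept.append(line)
        # Any other data row is treated as incomplete and dropped.
    return kept


lines = [
    "# example header\n",
    '@ s0 legend "dH/dl"\n',
    '@ s1 legend "pV"\n',
    "0.0 1.0 2.0\n",
    "1.0 1.5 2.5\n",
    "2.0 1.7",  # last line was cut off while being written
]
repaired = remove_corrupt_lines(lines)
```

A check like this only catches rows with missing fields. A row whose last number was cut off mid-value still has the right number of fields and would pass.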

orbeckst · 2021-05-04T00:34:48Z

Do you want the alchemlyb XVG parser to just ignore corrupt lines or do you want the function to be "somewhere" in alchemlyb so that you can import it to use it as part of your workflow? I am trying to gauge where this would fit in.

orbeckst · 2021-08-02T01:04:31Z

The current philosophy of the library is to read data and make them available as dataframes. A function that writes out the data does not fit particularly well into this scheme, I feel. However, we could consider adding a slower XVG parser as an alternative to the fast pandas.read_csv() based one

alchemlyb/src/alchemlyb/parsing/gmx.py

Lines 300 to 302 in b068776

    
    df = pd.read_csv(xvg, sep=r"\s+", header=None, skiprows=header_cnt,
                     na_filter=True, memory_map=True, names=cols, dtype=np.float64,
                     float_precision='high')

(which is fairly well optimized and much faster than the simple Python-based XVG reader that we used previously). We could have an option for the extract_* functions that enables reading of corrupt datafiles. This could then switch to the slow line-by-line parser that could be based on the code https://github.com/MobleyLab/alchemical-analysis/blob/master/alchemical_analysis/utils/corruptxvg.py, with the difference that it needs to produce a dataframe in the same way as the existing code, except that incomplete lines are omitted.

I'd be happy to review a PR along the lines above.
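
As a rough illustration of the slow line-by-line fallback described above, a minimal sketch might look like this. It is not alchemlyb's parser: the function name `read_xvg_skipping_incomplete` and the `columns` argument are hypothetical, and it assumes the column names are already known from the header.

```python
import io

import pandas as pd


def read_xvg_skipping_incomplete(handle, columns):
    """Read XVG data rows line by line into a DataFrame.

    Hypothetical helper. Rows are skipped if they have the wrong number
    of fields or contain a field that cannot be parsed as a float.
    """
    rows = []
    for line in handle:
        stripped = line.strip()
        if not stripped or stripped.startswith(("#", "@")):
            continue  # header, comment or blank line
        fields = stripped.split()
        if len(fields) != len(columns):
            continue  # incomplete (or overlong) row
        try:
            rows.append([float(field) for field in fields])
        except ValueError:
            continue  # e.g. a number that was cut off mid-write
    return pd.DataFrame(rows, columns=columns, dtype="float64")


text = (
    "# example header\n"
    '@ s0 legend "dH/dl"\n'
    "0.0 1.0\n"
    "1.0 1.5\n"
    "2.0 1.5e\n"  # unparseable float
    "3.0"         # missing field
)
df = read_xvg_skipping_incomplete(io.StringIO(text), ["time", "dH/dl"])
```

Because every line goes through Python-level string handling, this would be expected to run noticeably slower than the pd.read_csv() call on large files, which is why it is framed as an opt-in alternative.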

xiki-tempula · 2021-08-02T09:05:40Z

This is my understanding of this issue. There are two questions raised on this issue.

"removes the last line if it's too short."

This is quite easy to solve. pd.read_csv fills the missing fields of an incomplete line with NaN. My solution for the Gromacs parser is to add

    # Drop the incomplete rows
    df.dropna(inplace=True)

to alchemlyb/src/alchemlyb/parsing/gmx.py (line 306 in b068776).
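
To make the dropna() approach concrete, here is a small self-contained sketch. The column names and data are made up for illustration, and the read_csv call is a simplified version of the one in the parser (no skiprows or memory_map, since it reads from an in-memory string).

```python
import io

import numpy as np
import pandas as pd

# Made-up XVG body whose last row is missing its third field.
xvg_body = "0.0 1.0 2.0\n1.0 1.5 2.5\n2.0 1.7"
cols = ["time", "dH0", "dH1"]  # hypothetical column names

df = pd.read_csv(io.StringIO(xvg_body), sep=r"\s+", header=None,
                 na_filter=True, names=cols, dtype=np.float64,
                 float_precision='high')

# The short last row is padded with NaN in the missing column,
# so dropping rows that contain any NaN removes it.
clean = df.dropna()
```

Note that dropna() removes any row containing a NaN, so it would also discard rows where a NaN is a legitimate value in the data.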

The other problem, duplicates in the file, is solved with alchemlyb.preprocessing.subsampling.statistical_inefficiency(drop_duplicates=True), which drops the duplicates.
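
For illustration only, the general idea of removing duplicated time frames can be sketched in plain pandas. This is not alchemlyb's implementation of drop_duplicates; the data are made up, and keeping the first occurrence of each time value is an assumption.

```python
import pandas as pd

# Made-up data: two runs appended to the same file, so some
# time values appear twice.
df = pd.DataFrame(
    {"time": [0.0, 1.0, 1.0, 2.0, 2.0, 3.0],
     "dH": [0.1, 0.2, 0.2, 0.3, 0.3, 0.4]}
).set_index("time")

# Keep only the first row for each time value.
deduped = df[~df.index.duplicated(keep="first")]
```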

The only question is how we define the boundary between parser and preprocessing. Should the removal of corrupted lines and duplicates be done in the parser, or should it go into preprocessing so that the parser retains as much of the original information as possible?

orbeckst · 2021-09-17T00:42:23Z

I would consider it a preprocessing step, like cleaning data.

* Fix #126 and #171
* more robust gmx parser: skip NaN and incomplete lines in XVG files with filter=True; performance seems similar (see PR #183)
* filter=True is now DEFAULT
* add tests; set older tests to use filter=False for backwards-compatibility
* Update CHANGES

orbeckst added the enhancement label Aug 2, 2021

orbeckst added the GROMACS MD engine label Oct 20, 2021

xiki-tempula mentioned this issue Dec 30, 2021

Implement a more robust gmx parser #183

Merged

orbeckst closed this as completed in #183 Apr 13, 2022
