Unified dataset IO #13
Conversation
```python
atomic_numbers = np.loadtxt(os.path.join(root, "atomic_numbers.dat"))
if len(atomic_numbers.shape) == 1:
    if atomic_numbers.shape[0] == self.info["natoms"]:
```
What happens if the number of frames happens to equal natoms?
`atomic_numbers.dat` is by design always a 1-D array, with a single int per line.
If there are multiple frames, its total length is always nframes x natoms.
OK, no problem then!
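The convention agreed on above can be captured in a small helper. This is a hypothetical sketch (the function name and signature are illustrative, not from the PR): since `atomic_numbers.dat` is always a flat 1-D file, a length equal to `natoms` means a single frame, and any multiple of `natoms` means multi-frame data that can be reshaped unambiguously.

```python
import numpy as np

def load_atomic_numbers(path, natoms):
    """Load a 1-D atomic_numbers.dat (one int per line).

    A length of exactly `natoms` means a single frame; a length of
    nframes * natoms means multi-frame data, reshaped to (nframes, natoms).
    Hypothetical helper illustrating the logic discussed above.
    """
    atomic_numbers = np.atleast_1d(np.loadtxt(path, dtype=int))
    if atomic_numbers.shape[0] == natoms:
        # single frame: return the flat species list
        return atomic_numbers
    if atomic_numbers.shape[0] % natoms == 0:
        # multiple frames: one row of species per frame
        return atomic_numbers.reshape(-1, natoms)
    raise ValueError(
        f"length {atomic_numbers.shape[0]} is not a multiple of natoms={natoms}"
    )
```

Note that the case worried about above (nframes == natoms) is harmless: the file length would then be natoms * natoms, which is still handled by the reshape branch unless natoms == 1.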
```python
    info = j_loader(os.path.join(root, file, "info.json"))
    info = normalize_setinfo(info)
    info_files[file] = info
elif public_info is not None:
```
Is the purpose of public info to assign the same info to all subfolders? If so, we may need some asserts here to constrain the user's input format, e.g. to rule out a subfolder info and the public info existing at the same time.
Right now the public info mostly serves as a default: when a subfolder has its own info, the subfolder info takes priority.
OK, that works, but it introduces ambiguity: from the user's side it is hard to tell which info is actually used. Would it be better to add a log message, or an assert, to remind the user?
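The priority rule plus the suggested logging could look like the following. This is a minimal sketch, assuming the surrounding variable names from the diff (`root`, `file`, `public_info`, `j_loader`, `normalize_setinfo`); the function itself and its signature are hypothetical.

```python
import logging
import os

log = logging.getLogger(__name__)

def resolve_info(root, file, public_info, j_loader, normalize_setinfo):
    """Pick the info used for one subfolder.

    A subfolder-level info.json takes priority; public_info only acts
    as a default. A log message makes the choice visible to the user,
    as suggested in the review. Hypothetical sketch, not the PR code.
    """
    info_path = os.path.join(root, file, "info.json")
    if os.path.exists(info_path):
        info = normalize_setinfo(j_loader(info_path))
        if public_info is not None:
            log.info("Subfolder %s: using its own info.json, "
                     "ignoring the public info.", file)
        return info
    if public_info is not None:
        log.info("Subfolder %s: no info.json found, "
                 "falling back to the public info.", file)
        return public_info
    raise ValueError(f"No info.json for subfolder {file} and no public info given.")
```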
dptb/data/interfaces/abacus.py (outdated)
```python
              add_overlap=add_overlap, get_eigenvalues=get_eigenvalues)
h5file_names.append(os.path.join(file, "AtomicData.h5"))
_abacus_parse(folder,
              os.path.join(preprocess_dir, os.path.basename(folder)),
```
If folders with the same name appear in different subfolders of input_dir, how will the output be handled here?
The basename obtained here seems to be just the last path component, and DFT results from different folders are quite likely to share the same name.
`folder` is required to always point to a directory containing an "OUT.ABACUS" folder, and basename gives the name of that directory; the point is to keep a label for the corresponding data in the output.
If we are really processing:
- foo/bar/folder1/OUT.ABACUS
- foo/bar/folder2/OUT.ABACUS
- foo/folder1/OUT.ABACUS
it seems unlikely that 1 and 3 would collide in the output of a single preprocess run?
It can happen. For example, foo may now contain several temperatures, foo/T100/frame.0/OUT.ABACUS, foo/T200/frame.0/OUT.ABACUS, where each temperature's frames are taken from one trajectory and are all named frame0-1-2-3.
Got it. Let me think about how to include the preceding naming information.
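One way to carry the preceding path components into the label, avoiding the basename collision described above, is to build the label from the path relative to the input directory. A minimal sketch; the function name and the dot-joining convention are assumptions, not the PR's actual fix.

```python
import os

def label_from_path(folder, input_dir):
    """Build a collision-free output label for a DFT result folder.

    Joins the components of `folder` relative to `input_dir` instead of
    using only the basename, so e.g. foo/T100/frame.0 and foo/T200/frame.0
    map to distinct labels. Hypothetical sketch of the renaming idea.
    """
    rel = os.path.relpath(os.path.normpath(folder), os.path.normpath(input_dir))
    return rel.replace(os.sep, ".")
```

With this, the two colliding examples from the discussion become "T100.frame.0" and "T200.frame.0".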
* add data
* adapt data and nn module of nequip into deeptb
* just modify some imports
* update torch-geometry
* add kpoint eigenvalue support
* add support for nested tensor
* update
* update data and add batchlize hamiltonian
* update se3 rotation
* update test
* update
* debug e3
* update hamileig
* delete nequip nn and write our own based on PyG
* update nn
* nn refactor, write hamiltonian and hop function
* update sk hamiltonian and onsite function
* refactor sktb and add register for descriptor
* update param prototype and dptb
* refactor index mapping to data transform
* debug sktb and e3tb module
* finish debugging sk and e3
* update data interfaces
* update r2k and transform
* remove dash line in file names
* finished debugging deeptb module
* finish debugging hr2hk
* update overlap support
* update base trainer and example quantities
* update build model
* update trainer
* update pyproject.toml dependencies
* update bond reduction and self-interaction
* debug nnsk
* nnsk run succeed, add from v1 json model
* add nnsk test example of AlAs compound system
* Add 'ABACUSDataset' in data module (#9)
* Prototype code for loading Hamiltonian
* add 'ABACUSDataset' in data module
* modified "basis.dat" storage & can load overlap
* recover some original dataset settings
* add ABACUSDataset in init
* debug new dptb and trainer
* debug datasets
* pass cmd line train mod to new model and data
* add some comments in neighbor_list_and_relative_vec
* add overlap fitting support
* update baseline descriptor and debug validationer
* update e3deeph module
* update deephe3 module
* Added ABACUSInMemoryDataset in data module (#11)
* Add the in memory version of ABACUSDataset
* add ABACUSInMemoryDataset in data package
* update dataset and add deephdataset
* gpu support and debugging
* add dptb+nnsk mix model, debugging build, restart
* align run.py, test.py, main.py
* debugging
* final
* add new model backbone on allegro
* add new e3 embedding and lr scheduler
* Added `DefaultDataset` (#12)
* Added `DefaultDataset` and unified `ABACUSDataset`
* improved DefaultDataset & add `dptb data` entrypoint for preprocess
* update `build_dataset`
* aggregating new data class
* debug plugin saver and support atom specific cutoffs
* refactor bond reduction and rme parameterization
* add E3 fitting analysis and E3 rescale
* update LossAnalysis and e3baseline model
* update band calc and debug nnsk add orbitals
* update datatype switch
* Unified dataset IO (#13)
* update `data` entrypoint
* Unified dataset IO & added ASE trajectory support
* Add support to save `.pth` files with different `info.json` settings
* Bug fix in dealing with "ase" info
* updated `argcheck` for setinfo
* added setinfo check when building dataset
* file IO improvements
* bug fix in loading `info.json`
* update e3 descriptor and OrbitalMapper
* Bug fix in reading trajectory data (#15)
* add comment and complete eig loss
* update new embedding and dependencies
* New version of `E3statistics` (#17)
* fix bug in dealing with scalars in `E3statistics`
* add "decay" option in E3statistics to return edge length dependence
* fix bug in getting rmes when doing stat & update argcheck
* adding statistics initialization
* debug nnsk batchlization and eigenvalues loading
* debug nnsk
* optimizing saving best checkpoint
* Pr/44 (#19)
* add comments QG
* debug nnsk add orbital and strain
* update `.npy` files loading procedure in DefaultDataset (#18)
* optimizing init and restart param loading
* update nnsk push thr
* update mix model param and deeptb sktb param
* BUG FIX in loading `kpoints.npy` files with `ndim==3` (#20)
* added tests for nnsk training
* main program for test_train
* refactor test
* update nrl
* denote run

Co-authored-by: Sharp Londe <93334987+SharpLonde@users.noreply.github.com>
Co-authored-by: qqgu <guqq_phy@qq.com>
Co-authored-by: Qiangqiang Gu <98570179+QG-phy@users.noreply.github.com>
This PR unifies dataset IO in the following aspects:
- The `AtomicData_options` setting is now moved to `info.json`.
- Different `info.json` settings can now be used with the same data files.
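For illustration, a minimal `info.json` under the new layout might look like the following. This is a sketch assembled from the keys mentioned in this thread (`natoms`, `nframes`, `AtomicData_options`); the nested option names such as `r_max` and `pbc` are assumptions, not confirmed by the PR.

```json
{
    "nframes": 10,
    "natoms": 64,
    "AtomicData_options": {
        "r_max": 5.0,
        "pbc": true
    }
}
```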