Unified dataset IO #13
Conversation
```python
atomic_numbers = np.loadtxt(os.path.join(root, "atomic_numbers.dat"))
if len(atomic_numbers.shape) == 1:
    if atomic_numbers.shape[0] == self.info["natoms"]:
```
What happens if the number of frames happens to equal natoms?
`atomic_numbers.dat` is by design always a 1-D array, with a single int per line.
If there are multiple frames, its total length is always nframes x natoms.
OK, no problem then!
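The convention agreed on above can be captured in a small helper. This is a hypothetical sketch (the function name and signature are illustrative, not from the PR): since `atomic_numbers.dat` is always a flat 1-D file, a length equal to `natoms` means a single frame, and any multiple of `natoms` means multi-frame data that can be reshaped unambiguously.

```python
import numpy as np

def load_atomic_numbers(path, natoms):
    """Load a 1-D atomic_numbers.dat (one int per line).

    A length of exactly `natoms` means a single frame; a length of
    nframes * natoms means multi-frame data, reshaped to (nframes, natoms).
    Hypothetical helper illustrating the logic discussed above.
    """
    atomic_numbers = np.atleast_1d(np.loadtxt(path, dtype=int))
    if atomic_numbers.shape[0] == natoms:
        # single frame: return the flat species list
        return atomic_numbers
    if atomic_numbers.shape[0] % natoms == 0:
        # multiple frames: one row of species per frame
        return atomic_numbers.reshape(-1, natoms)
    raise ValueError(
        f"length {atomic_numbers.shape[0]} is not a multiple of natoms={natoms}"
    )
```

Note that the case worried about above (nframes == natoms) is harmless: the file length would then be natoms * natoms, which is still handled by the reshape branch unless natoms == 1.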
```python
    info = j_loader(os.path.join(root, file, "info.json"))
    info = normalize_setinfo(info)
    info_files[file] = info
elif public_info is not None:
```
Is the purpose of public info to assign the same info to all subfolders? If so, we may need some asserts here to constrain the user's input format, e.g. to rule out a subfolder info and the public info existing at the same time.
Right now the public info mostly serves as a default: when a subfolder has its own info, the subfolder info takes priority.
OK, that works, but it introduces ambiguity: from the user's side it is hard to tell which info is actually used. Would it be better to add a log message, or an assert, to remind the user?
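The priority rule plus the suggested logging could look like the following. This is a minimal sketch, assuming the surrounding variable names from the diff (`root`, `file`, `public_info`, `j_loader`, `normalize_setinfo`); the function itself and its signature are hypothetical.

```python
import logging
import os

log = logging.getLogger(__name__)

def resolve_info(root, file, public_info, j_loader, normalize_setinfo):
    """Pick the info used for one subfolder.

    A subfolder-level info.json takes priority; public_info only acts
    as a default. A log message makes the choice visible to the user,
    as suggested in the review. Hypothetical sketch, not the PR code.
    """
    info_path = os.path.join(root, file, "info.json")
    if os.path.exists(info_path):
        info = normalize_setinfo(j_loader(info_path))
        if public_info is not None:
            log.info("Subfolder %s: using its own info.json, "
                     "ignoring the public info.", file)
        return info
    if public_info is not None:
        log.info("Subfolder %s: no info.json found, "
                 "falling back to the public info.", file)
        return public_info
    raise ValueError(f"No info.json for subfolder {file} and no public info given.")
```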
dptb/data/interfaces/abacus.py (outdated)
```python
              add_overlap=add_overlap, get_eigenvalues=get_eigenvalues)
h5file_names.append(os.path.join(file, "AtomicData.h5"))
_abacus_parse(folder,
              os.path.join(preprocess_dir, os.path.basename(folder)),
```
If folders with the same name appear in different subfolders of input_dir, how will the output be handled here?
The basename obtained here seems to be just the last path component, and DFT results from different folders are quite likely to share the same name.
`folder` is required to always point to a directory containing an "OUT.ABACUS" folder, and basename gives the name of that directory; the point is to keep a label for the corresponding data in the output.
If we are really processing:
- foo/bar/folder1/OUT.ABACUS
- foo/bar/folder2/OUT.ABACUS
- foo/folder1/OUT.ABACUS
it seems unlikely that 1 and 3 would collide in the output of a single preprocess run?
It can happen. For example, foo may now contain several temperatures, foo/T100/frame.0/OUT.ABACUS, foo/T200/frame.0/OUT.ABACUS, where each temperature's frames are taken from one trajectory and are all named frame0-1-2-3.
Got it. Let me think about how to include the preceding naming information.
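One way to carry the preceding path components into the label, avoiding the basename collision described above, is to build the label from the path relative to the input directory. A minimal sketch; the function name and the dot-joining convention are assumptions, not the PR's actual fix.

```python
import os

def label_from_path(folder, input_dir):
    """Build a collision-free output label for a DFT result folder.

    Joins the components of `folder` relative to `input_dir` instead of
    using only the basename, so e.g. foo/T100/frame.0 and foo/T200/frame.0
    map to distinct labels. Hypothetical sketch of the renaming idea.
    """
    rel = os.path.relpath(os.path.normpath(folder), os.path.normpath(input_dir))
    return rel.replace(os.sep, ".")
```

With this, the two colliding examples from the discussion become "T100.frame.0" and "T200.frame.0".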
* add data
* adapt data and nn module of nequip into deeptb
* just modify some imports
* update torch-geometry
* add kpoint eigenvalue support
* add support for nested tensor
* update
* update data and add batchlize hamiltonian
* update se3 rotation
* update test
* update
* debug e3
* update hamileig
* delete nequip nn and write our own based on PyG
* update nn
* nn refactor, write hamiltonian and hop function
* update sk hamiltonian and onsite function
* refactor sktb and add register for descriptor
* update param prototype and dptb
* refactor index mapping to data transform
* debug sktb and e3tb module
* finish debugging sk and e3
* update data interfaces
* update r2k and transform
* remove dash line in file names
* finished debugging deeptb module
* finish debugging hr2hk
* update overlap support
* update base trainer and example quantities
* update build model
* update trainer
* update pyproject.toml dependencies
* update bond reduction and self-interaction
* debug nnsk
* nnsk run succeed, add from v1 json model
* add nnsk test example of AlAs compound system
* Add 'ABACUSDataset' in data module (#9)
* Prototype code for loading Hamiltonian
* add 'ABACUSDataset' in data module
* modified "basis.dat" storage & can load overlap
* recover some original dataset settings
* add ABACUSDataset in init
* debug new dptb and trainer
* debug datasets
* pass cmd line train mod to new model and data
* add some comments in neighbor_list_and_relative_vec
* add overlap fitting support
* update baseline descriptor and debug validationer
* update e3deeph module
* update deephe3 module
* Added ABACUSInMemoryDataset in data module (#11)
* Add the in memory version of ABACUSDataset
* add ABACUSInMemoryDataset in data package
* update dataset and add deephdataset
* gpu support and debugging
* add dptb+nnsk mix model, debugging build, restart
* align run.py, test.py, main.py
* debugging
* final
* add new model backbone on allegro
* add new e3 embedding and lr scheduler
* Added `DefaultDataset` (#12)
* Added `DefaultDataset` and unified `ABACUSDataset`
* improved DefaultDataset & add `dptb data` entrypoint for preprocess
* update `build_dataset`
* aggregating new data class
* debug plugin saver and support atom specific cutoffs
* refactor bond reduction and rme parameterization
* add E3 fitting analysis and E3 rescale
* update LossAnalysis and e3baseline model
* update band calc and debug nnsk add orbitals
* update datatype switch
* Unified dataset IO (#13)
* update `data` entrypoint
* Unified dataset IO & added ASE trajectory support
* Add support to save `.pth` files with different `info.json` settings
* Bug fix in dealing with "ase" info
* updated `argcheck` for setinfo
* added setinfo check when building dataset
* file IO improvements
* bug fix in loading `info.json`
* update e3 descriptor and OrbitalMapper
* Bug fix in reading trajectory data (#15)
* add comment and complete eig loss
* update new embedding and dependencies
* New version of `E3statistics` (#17)
* fix bug in dealing with scalars in `E3statistics`
* add "decay" option in E3statistics to return edge length dependence
* fix bug in getting rmes when doing stat & update argcheck
* adding statistics initialization
* debug nnsk batchlization and eigenvalues loading
* debug nnsk
* optimizing saving best checkpoint
* Pr/44 (#19)
* add comments QG
* debug nnsk add orbital and strain
* update `.npy` files loading procedure in DefaultDataset (#18)
* optimizing init and restart param loading
* update nnsk push thr
* update mix model param and deeptb sktb param
* BUG FIX in loading `kpoints.npy` files with `ndim==3` (#20)
* added tests for nnsk training
* main program for test_train
* refactor test
* update nrl
* denote run

Co-authored-by: Sharp Londe <93334987+SharpLonde@users.noreply.github.com>
Co-authored-by: qqgu <guqq_phy@qq.com>
Co-authored-by: Qiangqiang Gu <98570179+QG-phy@users.noreply.github.com>
This PR unifies dataset IO in the following aspects:
- The `AtomicData_options` setting is now moved to `info.json`.
- Different `info.json` settings can now be used with the same data files.
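For illustration, a minimal `info.json` under the new layout might look like the following. This is a sketch assembled from the keys mentioned in this thread (`natoms`, `nframes`, `AtomicData_options`); the nested option names such as `r_max` and `pbc` are assumptions, not confirmed by the PR.

```json
{
    "nframes": 10,
    "natoms": 64,
    "AtomicData_options": {
        "r_max": 5.0,
        "pbc": true
    }
}
```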