v2 merge data nequip refactoring into main (#61)
* add data

* adapt data and nn module of nequip into deeptb

* just modify some imports

* update torch-geometric

* add kpoint eigenvalue support

* add support for nested tensor

* update

* update data and add hamiltonian batching

* update se3 rotation

* update test

* update

* debug e3

* update hamileig

* delete nequip nn and write our own based on PyG

* update nn

* nn refactor, write hamiltonian and hop function

* update sk hamiltonian and onsite function

* refactor sktb and add register for descriptor

* update param prototype and dptb

* refactor index mapping to data transform

* debug sktb and e3tb module

* finish debugging sk and e3

* update data interfaces

* update r2k and transform

* remove dashes from file names

* finished debugging deeptb module

* finish debugging hr2hk

* update overlap support

* update base trainer and example quantities

* update build model

* update trainer

* update pyproject.toml dependencies

* update bond reduction and self-interaction

* debug nnsk

* nnsk run succeeds; add loading from v1 json model

* add nnsk test example of AlAs compound system

* Add 'ABACUSDataset' in data module (#9)

* Prototype code for loading Hamiltonian

* add 'ABACUSDataset' in data module

* modified "basis.dat" storage & can load overlap

* recover some original dataset settings

* add ABACUSDataset in init

* debug new dptb and trainer

* debug datasets

* pass cmd-line train mode to new model and data

* add some comments in neighbor_list_and_relative_vec.

* add overlap fitting support

* update baseline descriptor and debug validationer

* update e3deeph module

* update deephe3 module

* Added ABACUSInMemoryDataset in data module (#11)

* Prototype code for loading Hamiltonian

* add 'ABACUSDataset' in data module

* modified "basis.dat" storage & can load overlap

* recover some original dataset settings

* add ABACUSDataset in init

* Add the in memory version of ABACUSDataset

* add ABACUSInMemoryDataset in data package

* update dataset and add deephdataset

* gpu support and debugging

* add dptb+nnsk mix model, debugging build, restart

* align run.py, test.py, main.py

* debugging

* final

* add new model backbone on allegro

* add new e3 embedding and lr scheduler

* Added `DefaultDataset` (#12)

* Prototype code for loading Hamiltonian

* add 'ABACUSDataset' in data module

* modified "basis.dat" storage & can load overlap

* recover some original dataset settings

* add ABACUSDataset in init

* Add the in memory version of ABACUSDataset

* add ABACUSInMemoryDataset in data package

* Added `DefaultDataset` and unified `ABACUSDataset`

* improved DefaultDataset & add `dptb data` entrypoint for preprocessing

* update `build_dataset`

* aggregating new data class

* debug plugin saver and support atom-specific cutoffs

* refactor bond reduction and rme parameterization

* add E3 fitting analysis and E3 rescale

* update LossAnalysis and e3baseline model

* update band calc and debug nnsk add orbitals

* update datatype switch

* Unified dataset IO (#13)

* Prototype code for loading Hamiltonian

* add 'ABACUSDataset' in data module

* modified "basis.dat" storage & can load overlap

* recover some original dataset settings

* add ABACUSDataset in init

* Add the in memory version of ABACUSDataset

* add ABACUSInMemoryDataset in data package

* Added `DefaultDataset` and unified `ABACUSDataset`

* improved DefaultDataset & add `dptb data` entrypoint for preprocessing

* update `build_dataset`

* update `data` entrypoint

* Unified dataset IO & added ASE trajectory support

* Add support to save `.pth` files with different `info.json` settings.

* Bug fix in dealing with "ase" info.

* updated `argcheck` for setinfo.

* added setinfo check when building dataset.

* file IO improvements

* bug fix in loading `info.json`

* update e3 descriptor and OrbitalMapper

* Bug fix in reading trajectory data (#15)

* add comment and complete eig loss

* update new embedding and dependencies

* New version of `E3statistics` (#17)

* new version of `E3statistics` function added in DefaultDataset.

* fix bug in dealing with scalars in `E3statistics`

* add "decay" option in E3statistics to return edge length dependence

* fix bug in getting rmes when doing stat & update argcheck

* adding statistics initialization

* debug nnsk batching and eigenvalue loading

* debug nnsk

* optimizing saving best checkpoint

* Pr/44 (#19)

* add comments QG

* add comment QG

* debug nnsk add orbital and strain

* update `.npy` files loading procedure in DefaultDataset (#18)

* optimizing init and restart param loading

* update nnsk push thr

* update mix model param and deeptb sktb param

* BUG FIX in loading `kpoints.npy` files with `ndim==3` (#20)

* bug fix in loading `kpoints.npy` files with `ndim==3`

* added tests for nnsk training

* main program for test_train

* refactor test

* update nrl

* denote run

---------

Co-authored-by: Sharp Londe <93334987+SharpLonde@users.noreply.github.com>
Co-authored-by: qqgu <guqq_phy@qq.com>
Co-authored-by: Qiangqiang Gu <98570179+QG-phy@users.noreply.github.com>
4 people authored Feb 2, 2024
1 parent 7873b47 commit 7377d96
Showing 202 changed files with 26,503 additions and 2,137 deletions.
993 changes: 993 additions & 0 deletions dptb/data/AtomicData.py

Large diffs are not rendered by default.

233 changes: 233 additions & 0 deletions dptb/data/AtomicDataDict.py
@@ -0,0 +1,233 @@
"""nequip.data.jit: TorchScript functions for dealing with AtomicData.
These TorchScript functions operate on ``Dict[str, torch.Tensor]`` representations
of the ``AtomicData`` class which are produced by ``AtomicData.to_AtomicDataDict()``.
Authors: Albert Musaelian
"""
from typing import Dict, Any

import torch
import torch.jit

from e3nn import o3

# Make the keys available in this module
from ._keys import * # noqa: F403, F401

# Also import the module to use in TorchScript, this is a hack to avoid bug:
# https://github.com/pytorch/pytorch/issues/52312
from . import _keys

# Define a type alias
Type = Dict[str, torch.Tensor]


def validate_keys(keys, graph_required=True):
    # Validate combinations
    if graph_required:
        if not (_keys.POSITIONS_KEY in keys and _keys.EDGE_INDEX_KEY in keys):
            raise KeyError("At least pos and edge_index must be supplied")
    if _keys.EDGE_CELL_SHIFT_KEY in keys and "cell" not in keys:
        raise ValueError("If `edge_cell_shift` given, `cell` must be given.")


_SPECIAL_IRREPS = [None]


def _fix_irreps_dict(d: Dict[str, Any]):
    return {k: (i if i in _SPECIAL_IRREPS else o3.Irreps(i)) for k, i in d.items()}


def _irreps_compatible(ir1: Dict[str, o3.Irreps], ir2: Dict[str, o3.Irreps]):
    return all(ir1[k] == ir2[k] for k in ir1 if k in ir2)


@torch.jit.script
def with_edge_vectors(data: Type, with_lengths: bool = True) -> Type:
    """Compute the edge displacement vectors for a graph.
    If ``data.pos.requires_grad`` and/or ``data.cell.requires_grad``, this
    method will return edge vectors correctly connected in the autograd graph.
    Returns:
        Tensor [n_edges, 3] edge displacement vectors
    """
    if _keys.EDGE_VECTORS_KEY in data:
        if with_lengths and _keys.EDGE_LENGTH_KEY not in data:
            data[_keys.EDGE_LENGTH_KEY] = torch.linalg.norm(
                data[_keys.EDGE_VECTORS_KEY], dim=-1
            )

        return data
    else:
        # Build it dynamically
        # Note that this is
        # (1) backwardable, because everything (pos, cell, shifts)
        #     is Tensors.
        # (2) works on a Batch constructed from AtomicData
        pos = data[_keys.POSITIONS_KEY]
        edge_index = data[_keys.EDGE_INDEX_KEY]
        edge_vec = pos[edge_index[1]] - pos[edge_index[0]]
        if _keys.CELL_KEY in data:
            # ^ note that to save time we don't check that the edge_cell_shifts
            #   are trivial if no cell is provided; we just assume they are
            #   either not present or all zero.
            # -1 gives a batch dim no matter what
            cell = data[_keys.CELL_KEY].view(-1, 3, 3)
            edge_cell_shift = data[_keys.EDGE_CELL_SHIFT_KEY]
            if cell.shape[0] > 1:
                batch = data[_keys.BATCH_KEY]
                # Cell has a batch dimension
                # note the ASE cell vectors as rows convention
                edge_vec = edge_vec + torch.einsum(
                    "ni,nij->nj", edge_cell_shift, cell[batch[edge_index[0]]]
                )
                # TODO: is there a more efficient way to do the above without
                # creating an [n_edge] and [n_edge, 3, 3] tensor?
            else:
                # Cell has either no batch dimension, or a useless one,
                # so we can avoid creating the large intermediate cell tensor.
                # Note that we do NOT check that the batch array, if it is present,
                # is trivial — but this does need to be consistent.
                edge_vec = edge_vec + torch.einsum(
                    "ni,ij->nj",
                    edge_cell_shift,
                    cell.squeeze(0),  # remove batch dimension
                )

        data[_keys.EDGE_VECTORS_KEY] = edge_vec
        if with_lengths:
            data[_keys.EDGE_LENGTH_KEY] = torch.linalg.norm(edge_vec, dim=-1)
        return data

@torch.jit.script
def with_env_vectors(data: Type, with_lengths: bool = True) -> Type:
    """Compute the environment displacement vectors for a graph.
    If ``data.pos.requires_grad`` and/or ``data.cell.requires_grad``, this
    method will return environment vectors correctly connected in the autograd graph.
    Returns:
        Tensor [n_env, 3] environment displacement vectors
    """
    if _keys.ENV_VECTORS_KEY in data:
        if with_lengths and _keys.ENV_LENGTH_KEY not in data:
            data[_keys.ENV_LENGTH_KEY] = torch.linalg.norm(
                data[_keys.ENV_VECTORS_KEY], dim=-1
            )
        return data
    else:
        # Build it dynamically
        # Note that this is
        # (1) backwardable, because everything (pos, cell, shifts)
        #     is Tensors.
        # (2) works on a Batch constructed from AtomicData
        pos = data[_keys.POSITIONS_KEY]
        env_index = data[_keys.ENV_INDEX_KEY]
        env_vec = pos[env_index[1]] - pos[env_index[0]]
        if _keys.CELL_KEY in data:
            # ^ note that to save time we don't check that the env_cell_shifts
            #   are trivial if no cell is provided; we just assume they are
            #   either not present or all zero.
            # -1 gives a batch dim no matter what
            cell = data[_keys.CELL_KEY].view(-1, 3, 3)
            env_cell_shift = data[_keys.ENV_CELL_SHIFT_KEY]
            if cell.shape[0] > 1:
                batch = data[_keys.BATCH_KEY]
                # Cell has a batch dimension
                # note the ASE cell vectors as rows convention
                env_vec = env_vec + torch.einsum(
                    "ni,nij->nj", env_cell_shift, cell[batch[env_index[0]]]
                )
                # TODO: is there a more efficient way to do the above without
                # creating an [n_env] and [n_env, 3, 3] tensor?
            else:
                # Cell has either no batch dimension, or a useless one,
                # so we can avoid creating the large intermediate cell tensor.
                # Note that we do NOT check that the batch array, if it is present,
                # is trivial — but this does need to be consistent.
                env_vec = env_vec + torch.einsum(
                    "ni,ij->nj",
                    env_cell_shift,
                    cell.squeeze(0),  # remove batch dimension
                )
        data[_keys.ENV_VECTORS_KEY] = env_vec
        if with_lengths:
            data[_keys.ENV_LENGTH_KEY] = torch.linalg.norm(env_vec, dim=-1)
        return data

@torch.jit.script
def with_onsitenv_vectors(data: Type, with_lengths: bool = True) -> Type:
    """Compute the onsite-environment displacement vectors for a graph.
    If ``data.pos.requires_grad`` and/or ``data.cell.requires_grad``, this
    method will return onsite-environment vectors correctly connected in the autograd graph.
    Returns:
        Tensor [n_onsitenv, 3] onsite-environment displacement vectors
    """
    if _keys.ONSITENV_VECTORS_KEY in data:
        if with_lengths and _keys.ONSITENV_LENGTH_KEY not in data:
            data[_keys.ONSITENV_LENGTH_KEY] = torch.linalg.norm(
                data[_keys.ONSITENV_VECTORS_KEY], dim=-1
            )
        return data
    else:
        # Build it dynamically
        # Note that this is
        # (1) backwardable, because everything (pos, cell, shifts)
        #     is Tensors.
        # (2) works on a Batch constructed from AtomicData
        pos = data[_keys.POSITIONS_KEY]
        env_index = data[_keys.ONSITENV_INDEX_KEY]
        env_vec = pos[env_index[1]] - pos[env_index[0]]
        if _keys.CELL_KEY in data:
            # ^ note that to save time we don't check that the onsitenv_cell_shifts
            #   are trivial if no cell is provided; we just assume they are
            #   either not present or all zero.
            # -1 gives a batch dim no matter what
            cell = data[_keys.CELL_KEY].view(-1, 3, 3)
            env_cell_shift = data[_keys.ONSITENV_CELL_SHIFT_KEY]
            if cell.shape[0] > 1:
                batch = data[_keys.BATCH_KEY]
                # Cell has a batch dimension
                # note the ASE cell vectors as rows convention
                env_vec = env_vec + torch.einsum(
                    "ni,nij->nj", env_cell_shift, cell[batch[env_index[0]]]
                )
                # TODO: is there a more efficient way to do the above without
                # creating an [n_onsitenv] and [n_onsitenv, 3, 3] tensor?
            else:
                # Cell has either no batch dimension, or a useless one,
                # so we can avoid creating the large intermediate cell tensor.
                # Note that we do NOT check that the batch array, if it is present,
                # is trivial — but this does need to be consistent.
                env_vec = env_vec + torch.einsum(
                    "ni,ij->nj",
                    env_cell_shift,
                    cell.squeeze(0),  # remove batch dimension
                )
        data[_keys.ONSITENV_VECTORS_KEY] = env_vec
        if with_lengths:
            data[_keys.ONSITENV_LENGTH_KEY] = torch.linalg.norm(env_vec, dim=-1)
        return data


@torch.jit.script
def with_batch(data: Type) -> Type:
    """Get batch Tensor.
    If this AtomicDataPrimitive has no ``batch``, one of all zeros will be
    allocated and returned.
    """
    if _keys.BATCH_KEY in data:
        return data
    else:
        pos = data[_keys.POSITIONS_KEY]
        batch = torch.zeros(len(pos), dtype=torch.long, device=pos.device)
        data[_keys.BATCH_KEY] = batch
        # ugly way to make a tensor of [0, len(pos)], but it avoids transfers or casts
        data[_keys.BATCH_PTR_KEY] = torch.arange(
            start=0,
            end=len(pos) + 1,
            step=len(pos),
            dtype=torch.long,
            device=pos.device,
        )

        return data
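For readers skimming the diff, here is a minimal usage sketch of the helpers above (not part of the committed file). It assumes the `_keys` constants carry the usual nequip string values (`pos`, `edge_index`, `cell`, `edge_cell_shift`, `edge_vectors`, `edge_lengths`, `batch`) and that the module is importable as `dptb.data.AtomicDataDict`.

import torch
from dptb.data import AtomicDataDict

# Hypothetical toy input: two atoms in a cubic cell, one directed edge each way.
data = {
    "pos": torch.tensor([[0.0, 0.0, 0.0],
                         [1.2, 0.0, 0.0]], requires_grad=True),
    "edge_index": torch.tensor([[0, 1],
                                [1, 0]]),      # shape [2, n_edges]
    "cell": 5.0 * torch.eye(3),                # ASE convention: lattice vectors as rows
    "edge_cell_shift": torch.zeros(2, 3),      # both edges stay in the home cell
}

# Fills "edge_vectors" (and "edge_lengths") while staying connected to autograd.
data = AtomicDataDict.with_edge_vectors(data, with_lengths=True)
print(data["edge_vectors"].shape)  # torch.Size([2, 3])
print(data["edge_lengths"])        # both lengths are 1.2, differentiable w.r.t. pos

# Adds an all-zeros "batch" (and the matching batch pointer) for a single frame.
data = AtomicDataDict.with_batch(data)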
49 changes: 49 additions & 0 deletions dptb/data/__init__.py
@@ -0,0 +1,49 @@
from .AtomicData import (
    AtomicData,
    PBC,
    register_fields,
    deregister_fields,
    _register_field_prefix,
    _NODE_FIELDS,
    _EDGE_FIELDS,
    _GRAPH_FIELDS,
    _LONG_FIELDS,
)
from .dataset import (
    AtomicDataset,
    AtomicInMemoryDataset,
    NpzDataset,
    ASEDataset,
    HDF5Dataset,
    ABACUSDataset,
    ABACUSInMemoryDataset,
    DefaultDataset
)
from .dataloader import DataLoader, Collater, PartialSampler
from .build import dataset_from_config
from .test_data import EMTTestDataset

__all__ = [
    AtomicData,
    PBC,
    register_fields,
    deregister_fields,
    _register_field_prefix,
    AtomicDataset,
    AtomicInMemoryDataset,
    NpzDataset,
    ASEDataset,
    HDF5Dataset,
    ABACUSDataset,
    ABACUSInMemoryDataset,
    DefaultDataset,
    DataLoader,
    Collater,
    PartialSampler,
    dataset_from_config,
    _NODE_FIELDS,
    _EDGE_FIELDS,
    _GRAPH_FIELDS,
    _LONG_FIELDS,
    EMTTestDataset,
]
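Not part of the diff, but as a pointer for downstream use: the field registry exported here controls how custom quantities are treated during batching. A hypothetical sketch, assuming `register_fields`/`deregister_fields` keep the upstream nequip keyword signature:

from dptb.data import register_fields, deregister_fields, _NODE_FIELDS

# Declare a custom per-atom quantity so collation treats it as a node field
# ("onsite_energy" is a made-up field name for illustration).
register_fields(node_fields=["onsite_energy"])
assert "onsite_energy" in _NODE_FIELDS

# Remove the registration once it is no longer needed.
deregister_fields("onsite_energy")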