one_embedding add doc string #7902

Merged on Apr 9, 2022 (35 commits)
Changes from 14 commits
Commits
712025b
add doc string
guo-ran Mar 25, 2022
888472d
add example
guo-ran Mar 31, 2022
43fc23d
add
guo-ran Mar 31, 2022
5ce3511
Merge branch 'master' into dev_one_embedding_add_doc
guo-ran Mar 31, 2022
ea465c8
Merge branch 'dev_one_embedding_add_doc' of https://github.com/Oneflo…
guo-ran Apr 1, 2022
9b4a0a5
fix doc
guo-ran Apr 1, 2022
9c5b98d
Merge branch 'dev_one_embedding_add_doc' of work24-in:/home/guoran/gi…
guo-ran Apr 1, 2022
5a1c96e
Merge branch 'master' into dev_one_embedding_add_doc
guo-ran Apr 1, 2022
9af820b
refine
guo-ran Apr 2, 2022
f3be202
Merge branch 'dev_one_embedding_add_doc' of work24-in:/home/guoran/gi…
guo-ran Apr 2, 2022
5fec2a5
Merge branch 'dev_one_embedding_add_doc' of https://github.com/Oneflo…
guo-ran Apr 2, 2022
e516292
address review
guo-ran Apr 6, 2022
7bd50ec
Merge branch 'dev_one_embedding_add_doc' of /home/guoran/git_repo/one…
guo-ran Apr 6, 2022
a337604
Merge branch 'master' into dev_one_embedding_add_doc
guo-ran Apr 6, 2022
9234745
mb to MB
guo-ran Apr 6, 2022
e69940c
Merge branch 'dev_one_embedding_add_doc' of https://github.com/Oneflo…
guo-ran Apr 6, 2022
c98545d
Merge branch 'master' into dev_one_embedding_add_doc
guo-ran Apr 6, 2022
4cb08b6
Merge branch 'master' into dev_one_embedding_add_doc
guo-ran Apr 7, 2022
1aa81e6
add make_table_option
guo-ran Apr 7, 2022
ba55ae2
option to options
guo-ran Apr 7, 2022
8a93681
refine
guo-ran Apr 7, 2022
314a06e
Merge branch 'dev_one_embedding_add_doc' of work24-in:/home/guoran/gi…
guo-ran Apr 7, 2022
3ef7de3
Merge branch 'dev_one_embedding_add_doc' of https://github.com/Oneflo…
guo-ran Apr 7, 2022
b240ea2
Merge branch 'master' into dev_one_embedding_add_doc
guo-ran Apr 7, 2022
2ea5dec
Merge branch 'master' into dev_one_embedding_add_doc
mergify[bot] Apr 7, 2022
a3f95c9
Merge branch 'master' into dev_one_embedding_add_doc
mergify[bot] Apr 7, 2022
b141e64
Merge branch 'master' into dev_one_embedding_add_doc
mergify[bot] Apr 7, 2022
7924a87
Merge branch 'master' into dev_one_embedding_add_doc
mergify[bot] Apr 7, 2022
5b8cbb9
Merge branch 'master' into dev_one_embedding_add_doc
mergify[bot] Apr 8, 2022
eeb8f6e
add forward
guo-ran Apr 8, 2022
14ece2d
Merge branch 'dev_one_embedding_add_doc' of work24:/home/guoran/git_r…
guo-ran Apr 8, 2022
94a94b0
Merge branch 'master' into dev_one_embedding_add_doc
mergify[bot] Apr 8, 2022
93d49a1
Merge branch 'master' into dev_one_embedding_add_doc
mergify[bot] Apr 8, 2022
6187faf
Merge branch 'master' into dev_one_embedding_add_doc
mergify[bot] Apr 8, 2022
fc5fff3
Merge branch 'master' into dev_one_embedding_add_doc
mergify[bot] Apr 9, 2022
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -28,6 +28,7 @@ OneFlow API Reference
utils
env
comm
one_embedding



14 changes: 14 additions & 0 deletions docs/source/one_embedding.rst
@@ -0,0 +1,14 @@
oneflow.one_embedding
===================================
OneFlow one_embedding operations.
----------------------------------
.. currentmodule:: oneflow.one_embedding
.. automodule:: oneflow.one_embedding
:members: MultiTableEmbedding,

.. autofunction:: oneflow.one_embedding.make_device_mem_store_options
.. autofunction:: oneflow.one_embedding.make_cached_ssd_store_options
.. autofunction:: oneflow.one_embedding.make_cached_host_mem_store_options
.. autofunction:: oneflow.one_embedding.make_uniform_initializer
.. autofunction:: oneflow.one_embedding.make_normal_initializer
.. autofunction:: oneflow.one_embedding.make_table
212 changes: 212 additions & 0 deletions python/oneflow/one_embedding.py
@@ -53,6 +53,81 @@ def _check_cache(cache):


class MultiTableEmbedding(Module):
r"""MultiTableEmbedding represents multiple embedding tables that share the same embedding_dim, dtype, and key_type.

Args:
name (str): the name of the Embedding
embedding_dim (int): the size of each embedding vector
dtype (flow.dtype): the data type of the embeddings
key_type (flow.dtype): the data type of the feature ids
tables (list): list of table params, each of which can be made by flow.one_embedding.make_table
store_options (dict): the store options of the Embedding
default_initializer (dict, optional): if the tables param is None, default_initializer is used to initialize the table. Defaults to None.

For example:

.. code-block:: python

>>> import oneflow as flow
>>> import numpy as np
>>> import oneflow.nn as nn
>>> # a simple example with 3 tables
>>> table_size_array = [39884407, 39043, 17289]
>>> vocab_size = sum(table_size_array)
>>> num_tables = len(table_size_array)
>>> embedding_size = 128
>>> scales = np.sqrt(1 / np.array(table_size_array))
>>> tables = [
>>> flow.one_embedding.make_table(
>>> flow.one_embedding.make_uniform_initializer(low=-scale, high=scale)
>>> )
>>> for scale in scales
>>> ]
>>> store_options = flow.one_embedding.make_cached_ssd_store_options(
>>> cache_budget_mb=8192, persistent_path="/your_path_to_ssd", capacity=vocab_size,
>>> )
>>> embedding = flow.one_embedding.MultiTableEmbedding(
>>> name="my_embedding",
>>> embedding_dim=embedding_size,
>>> dtype=flow.float,
>>> key_type=flow.int64,
>>> tables=tables,
>>> store_options=store_options,
>>> )
>>> embedding.to("cuda")
>>> mlp = flow.nn.FusedMLP(
>>> in_features=embedding_size * num_tables,
>>> hidden_features=[512, 256, 128],
>>> out_features=1,
>>> skip_final_activation=True,
>>> )
>>> mlp.to("cuda")
>>>
>>> class TrainGraph(flow.nn.Graph):
>>> def __init__(self,):
>>> super().__init__()
>>> self.embedding_lookup = embedding
>>> self.mlp = mlp
>>> self.add_optimizer(
>>> flow.optim.SGD(self.embedding_lookup.parameters(), lr=0.1, momentum=0.0)
>>> )
>>> self.add_optimizer(
>>> flow.optim.SGD(self.mlp.parameters(), lr=0.1, momentum=0.0)
>>> )
>>> def build(self, ids):
>>> embedding = self.embedding_lookup(ids)
>>> loss = self.mlp(flow.reshape(embedding, (-1, num_tables * embedding_size)))
>>> loss = loss.sum()
>>> loss.backward()
>>> return loss
>>> ids = np.random.randint(0, 1000, (100, num_tables), dtype=np.int64)
>>> ids_tensor = flow.tensor(ids, requires_grad=False).to("cuda")
>>> graph = TrainGraph()
>>> loss = graph(ids_tensor)
>>> print(loss)

"""

def __init__(
self,
name,
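The sizing arithmetic in the docstring example above can be checked without OneFlow installed. The following is a minimal sketch using only the standard library; the table sizes and `embedding_size` are copied from the example, while the variable `mlp_in_features` is a name introduced here for illustration.

```python
import math

# Table sizes copied from the docstring example.
table_size_array = [39884407, 39043, 17289]
embedding_size = 128

vocab_size = sum(table_size_array)  # the example passes this as `capacity`
num_tables = len(table_size_array)

# One symmetric uniform range per table: [-sqrt(1/n), +sqrt(1/n)].
scales = [math.sqrt(1.0 / n) for n in table_size_array]

# Width of the flattened lookup result that the example feeds to the MLP
# (illustrative name, not part of the PR).
mlp_in_features = embedding_size * num_tables

for n, s in zip(table_size_array, scales):
    print(f"rows={n:>9d}  low={-s:.6f}  high={s:.6f}")
print("capacity:", vocab_size, "mlp in_features:", mlp_in_features)
```

Larger tables get a narrower initialization range, which is the effect of scaling by the inverse square root of the row count.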
@@ -194,9 +269,38 @@ def _load_from_state_dict(
)

def save_snapshot(self, snapshot_name):
"""Save a snapshot of the Embedding.

Args:
snapshot_name (str): the name of the snapshot. The snapshot is saved in the snapshots directory under the configured persistent_path.

For example:

.. code-block:: python

>>> import oneflow as flow
>>> # use an embedding created by flow.one_embedding.MultiTableEmbedding
>>> embedding.save_snapshot("my_snapshot1")
>>> # a snapshot named "my_snapshot1" has been saved in the "snapshots" directory under the configured persistent_path
>>> # and can be reloaded with embedding.load_snapshot
"""
self.handler.SaveSnapshot(snapshot_name)

def load_snapshot(self, snapshot_name):
"""Load a snapshot of the Embedding.

Args:
snapshot_name (str): the name of the snapshot. The snapshot is loaded from the configured persistent_path.

For example:

.. code-block:: python

>>> import oneflow as flow
>>> # use an embedding created by flow.one_embedding.MultiTableEmbedding
>>> embedding.load_snapshot("my_snapshot1")
>>> # loads the snapshot named "my_snapshot1" from the configured persistent_path
"""
self.handler.LoadSnapshot(snapshot_name)
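The two docstrings above say that snapshots live in a "snapshots" directory under the persistent path. A minimal sketch of that layout follows, assuming the snapshot name is used directly as a subdirectory name; `snapshot_dir` is a hypothetical helper, not part of OneFlow, and the diff does not say whether a per-rank subdirectory sits in between.

```python
import posixpath


def snapshot_dir(persistent_path, snapshot_name):
    """Hypothetical helper: where a named snapshot would live, per the docstring wording.

    Assumption: <persistent_path>/snapshots/<snapshot_name>. The real layout
    may insert a per-rank directory; this PR does not show it.
    """
    return posixpath.join(persistent_path, "snapshots", snapshot_name)


path = snapshot_dir("/your_path_to_ssd", "my_snapshot1")
print(path)
```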

def forward(self, ids, table_ids=None):
Review comment (Contributor):

I hadn't noticed this method before. I think it needs a docstring as well.
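As context for that request: the class example calls the embedding with ids of shape `(batch, num_tables)` and no `table_ids`. One plausible reading, assumed here rather than confirmed by this diff, is that column `j` of `ids` is looked up in table `j`. The sketch below illustrates that convention with plain lists; `default_table_ids` is a hypothetical helper, not a OneFlow function.

```python
def default_table_ids(ids):
    """Hypothetical helper: implied table id for each entry of a 2-D id batch.

    Assumption: entry (i, j) belongs to table j when table_ids is omitted.
    """
    return [[col for col, _ in enumerate(row)] for row in ids]


# Two samples, three tables: one feature id per table in each row.
ids = [[11, 7, 3], [42, 0, 9]]
table_ids = default_table_ids(ids)
print(table_ids)
```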

@@ -216,6 +320,20 @@ def forward(self, ids, table_ids=None):
def make_device_mem_store_options(
persistent_path, capacity, size_factor=1, physical_block_size=512
):
"""Make the GPU-only store_options param of MultiTableEmbedding.

Args:
persistent_path (str, list): persistent storage path of the Embedding. If a str is passed, the current rank's Embedding is saved under path/rank_id-num_ranks. If a list is passed, its length must equal num_ranks, and each element is the path for the Embedding of the corresponding rank_id.
capacity (int): total capacity of the Embedding
size_factor (int, optional): storage size factor relative to embedding_dim. For SGD it should be 1 if momentum = 0 and 2 if momentum > 0; for Adam it should be 3. Defaults to 1.
physical_block_size (int, optional): should be the sector size. Defaults to 512.

Returns:
dict: the GPU-only store_options param of MultiTableEmbedding

See also :func:`oneflow.one_embedding.make_cached_ssd_store_options`
"""

assert isinstance(persistent_path, (str, list, tuple))
assert capacity > 0
options = {
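The `size_factor` rule in the docstring above can be written down as a small lookup. This is a sketch only: `size_factor_for` is a hypothetical helper that encodes the three documented cases (SGD without momentum, SGD with momentum, Adam) and raises for anything else, since the docstring gives no rule for other optimizers.

```python
def size_factor_for(optimizer, momentum=0.0):
    """Hypothetical helper encoding the size_factor rule from the docstring."""
    name = optimizer.lower()
    if name == "sgd":
        # Momentum needs one extra state vector per embedding row.
        return 2 if momentum > 0 else 1
    if name == "adam":
        # Embedding plus two optimizer state vectors.
        return 3
    raise ValueError(f"no size_factor rule documented for {optimizer!r}")


print(size_factor_for("sgd"))                # plain SGD
print(size_factor_for("sgd", momentum=0.9))  # SGD with momentum
print(size_factor_for("adam"))
```

The class example earlier in this file uses SGD with `momentum=0.0`, which matches the default `size_factor=1`.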
@@ -245,6 +363,29 @@ def make_cached_ssd_store_options(
size_factor=1,
physical_block_size=512,
):
"""Make the store_options param of MultiTableEmbedding for SSD storage that uses the GPU as a cache.

Args:
cache_budget_mb (int): the cache budget per GPU, in MB.
persistent_path (str, list): persistent storage path of the Embedding. It must be on a fast SSD because training issues frequent random disk accesses. If a str is passed, the current rank's Embedding is saved under path/rank_id-num_ranks. If a list is passed, its length must equal num_ranks, and each element is the path for the Embedding of the corresponding rank_id.
capacity (int): total capacity of the Embedding
size_factor (int, optional): storage size factor relative to embedding_dim. For SGD it should be 1 if momentum = 0 and 2 if momentum > 0; for Adam it should be 3. Defaults to 1.
physical_block_size (int, optional): should be the sector size. Defaults to 512.

Returns:
dict: the store_options param of MultiTableEmbedding for SSD storage with the GPU as a cache

For example:

.. code-block:: python

>>> import oneflow as flow
>>> store_options = flow.one_embedding.make_cached_ssd_store_options(
>>> cache_budget_mb=8192, persistent_path="/your_path_to_ssd", capacity=vocab_size,
>>> )
>>> # pass the store_options to the "store_options" param of flow.one_embedding.MultiTableEmbedding
>>> # ...
"""
assert isinstance(persistent_path, (str, list, tuple))
assert cache_budget_mb > 0
if capacity is not None:
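To get a feel for what `cache_budget_mb` buys, a rough upper bound on the number of rows that fit in the cache can be computed by hand. The sketch below is an estimate under stated assumptions, not how OneFlow sizes its cache: it assumes 1 MB = 2**20 bytes, that each cached row holds `embedding_dim * size_factor` values, 4-byte float values, and no overhead for keys or metadata. `approx_cached_rows` is a hypothetical helper.

```python
def approx_cached_rows(cache_budget_mb, embedding_dim, size_factor=1, bytes_per_value=4):
    """Hypothetical estimate of how many embedding rows fit in the cache budget.

    Ignores key storage and any internal overhead, so the real number is lower.
    """
    budget_bytes = cache_budget_mb * 1024 * 1024
    row_bytes = embedding_dim * size_factor * bytes_per_value
    return budget_bytes // row_bytes


# Values from the docstring example: 8192 MB budget, 128-dim float embeddings.
rows = approx_cached_rows(8192, 128)
print(rows)
```

Under these assumptions the example's budget covers well under half of its 39,940,739 total rows, which is why the remainder has to be served from the SSD.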
@@ -274,6 +415,20 @@ def make_cached_ssd_store_options(
def make_cached_host_mem_store_options(
cache_budget_mb, persistent_path, capacity, size_factor=1, physical_block_size=512,
):
"""Make the store_options param of MultiTableEmbedding for host memory storage that uses the GPU as a cache.

Args:
cache_budget_mb (int): the cache budget per GPU, in MB.
persistent_path (str, list): persistent storage path of the Embedding. If a str is passed, the current rank's Embedding is saved under path/rank_id-num_ranks. If a list is passed, its length must equal num_ranks, and each element is the path for the Embedding of the corresponding rank_id.
capacity (int): total capacity of the Embedding
size_factor (int, optional): storage size factor relative to embedding_dim. For SGD it should be 1 if momentum = 0 and 2 if momentum > 0; for Adam it should be 3. Defaults to 1.
physical_block_size (int, optional): should be the sector size. Defaults to 512.

Returns:
dict: the store_options param of MultiTableEmbedding for host memory storage with the GPU as a cache

See also :func:`oneflow.one_embedding.make_cached_ssd_store_options`
"""
assert isinstance(persistent_path, (str, list, tuple))
assert cache_budget_mb > 0
assert capacity > 0
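All three store-option docstrings describe the same `persistent_path` convention: a single string gets a `rank_id-num_ranks` subdirectory per rank, while a list supplies one path per rank. A minimal sketch of that resolution follows; `resolve_rank_path` is a hypothetical helper written from the docstring wording, and the actual resolution inside OneFlow may differ.

```python
import posixpath


def resolve_rank_path(persistent_path, rank_id, num_ranks):
    """Hypothetical helper: pick the storage path for one rank.

    str         -> <path>/<rank_id>-<num_ranks>
    list/tuple  -> the entry at index rank_id (length must equal num_ranks)
    """
    if isinstance(persistent_path, str):
        return posixpath.join(persistent_path, f"{rank_id}-{num_ranks}")
    if len(persistent_path) != num_ranks:
        raise ValueError("persistent_path list length must equal num_ranks")
    return persistent_path[rank_id]


print(resolve_rank_path("/your_path_to_ssd", 0, 4))
print(resolve_rank_path(["/ssd0/emb", "/ssd1/emb"], 1, 2))
```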
@@ -303,12 +458,69 @@ def make_cached_host_mem_store_options(


def make_uniform_initializer(low, high):
"""Make a uniform initializer param for make_table.

Args:
low (float): A Python scalar. Lower bound of the range of random values to generate.
high (float): A Python scalar. Upper bound of the range of random values to generate.

Returns:
dict: the initializer param for make_table

For example:

.. code-block:: python

>>> import oneflow as flow
>>> initializer = flow.one_embedding.make_uniform_initializer(low=-scale, high=scale)
>>> # pass the initializer to flow.one_embedding.make_table
doombeaker marked this conversation as resolved.
>>> # ...
"""
return {"type": "uniform", "low": low, "high": high}


def make_normal_initializer(mean, std):
"""Make a normal initializer param for make_table.

Args:
mean (float): A Python scalar. Mean of the random values to generate.
std (float): A Python scalar. Standard deviation of the random values to generate.

Returns:
dict: the initializer param for make_table

For example:

.. code-block:: python

>>> import oneflow as flow
>>> initializer = flow.one_embedding.make_normal_initializer(mean=0, std=0.01)
>>> # pass the initializer to flow.one_embedding.make_table
>>> # ...
"""
return {"type": "normal", "mean": mean, "std": std}


def make_table(initializer):
"""Make a table param for the tables param of MultiTableEmbedding.

Args:
initializer (dict): the initializer param, made by make_uniform_initializer or make_normal_initializer
Review comment (Contributor):

I think these methods need examples as well. As for the initializer functions above, even if not every one of them needs its own example, they could at least carry a cross-reference; a "See xxx" note would be convenient.

These are only rough first impressions, though. The details should follow whatever @Chenqll suggests after trying it out.


Returns:
dict: a table param for the tables param of MultiTableEmbedding

For example:

.. code-block:: python

>>> import oneflow as flow
>>> initializer = flow.one_embedding.make_uniform_initializer(low=-scale, high=scale)
>>> table1 = flow.one_embedding.make_table(initializer)
>>> table2 = flow.one_embedding.make_table(initializer)
>>> tables = [table1, table2]
>>> # pass the tables to the "tables" param of flow.one_embedding.MultiTableEmbedding
>>> # ...

"""
return {"initializer": initializer}
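Since the three helper bodies are fully visible in this diff, the shape of the resulting `tables` list can be shown without importing OneFlow. The sketch below re-declares the helpers locally with the same return values as in the diff; the concrete numbers (`0.05`, `0.01`) are illustrative, and mixing a uniform and a normal initializer in one list is an assumption about what callers may want rather than something this PR demonstrates.

```python
# Local copies of the helpers, with the same return values as in the diff.
def make_uniform_initializer(low, high):
    return {"type": "uniform", "low": low, "high": high}


def make_normal_initializer(mean, std):
    return {"type": "normal", "mean": mean, "std": std}


def make_table(initializer):
    return {"initializer": initializer}


# One table per feature field; each carries its own initializer dict.
tables = [
    make_table(make_uniform_initializer(low=-0.05, high=0.05)),
    make_table(make_normal_initializer(mean=0, std=0.01)),
]

for index, table in enumerate(tables):
    print(index, table)
```

The resulting list is what the docstrings describe passing to the `tables` param of `MultiTableEmbedding`.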