merge Master into zj/develop (#21)
* Multi Tensor apply Optimizer (#8373)

* Add optim_cast and modify sgd

* Remove

* try to add fuseUpdatecast pass logic

* use pass

* still have bug in inplace

* ban inplace and fix sgd update

* fix regst num

* add env var

* remove cuda graph wrong use

* add support for graph

* initialize

* add functional impl

* add simple job rewrite

* delete redundant sgd update kernel

* support half

* add kernel

* use single loop kernel

* refine

* when in eval mode, we turn off multi tensor update

* refine format

* use juncheng kernel

* Refine

* group multi tensor op by some attr

* add parallel conf to key

* refine

* Add unroll logic

* fix bug

* restruct

* use pointer list

* add adam kernel

* support multi tensor adam update

* Remove cpu

* support skip if and scale by tensor

* support sgd adam unittest

* add more check

* Remove config

* Restruct tensorparams

* support fused cast in multi tensor update

* support cast in multi tensor

* fix bug in model update cast pass

* fix multi tensor sgd update with cast Pass check logic

* refine

* support multi tensor adam update with cast

* refine format

* Remove redundant template args

* merge modify for fused cast

* only allow fused cast in train mode

* only support data parallel in multi tensor update

* rewrite fuse update cast pass logic

* remove redundant if

* fix format

* add new line

* rename

* Remove print

* rename and add LOG

* Add more type and test

* still have bug in multi tensor adam

* Fix multi tensor adam update bug

* add multi tensor adam update with cast test

* simplify code

* fix format

* Add model diff datatype in optimizer key

* remove random seed

* fix comment

* fix comment

* fix to use model copy

* use for loop

* Fix comment

* use hashcombine

* fix clang analysis error

* add with cuda macro

* fix env var in unittest

* remove redundant unittest

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
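The grouping steps listed above ("group multi tensor op by some attr", "add parallel conf to key") can be sketched in plain Python; the key fields used here (dtype, placement) are illustrative assumptions, not OneFlow's actual grouping key or kernel code:

```python
from collections import defaultdict

def group_params(params, grads, dtypes, placements):
    # Bucket (param, grad) pairs by a composite key, so one fused update
    # can run per bucket instead of one launch per tensor. The key fields
    # (dtype, placement) are assumptions for illustration.
    groups = defaultdict(list)
    for p, g, dt, pl in zip(params, grads, dtypes, placements):
        groups[(dt, pl)].append((p, g))
    return groups

def multi_tensor_sgd(groups, lr):
    # One single-loop "kernel" per group: p <- p - lr * g for every
    # element of every tensor in the bucket.
    for pairs in groups.values():
        for p, g in pairs:
            for i in range(len(p)):
                p[i] -= lr * g[i]
```

The real kernels pass pointer lists into a single CUDA loop; this only mirrors the bookkeeping.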

* Fix doc and ops template auto gen (#8546)

* fix doc and add op calculator

* fix bug

* fix gen_ops

* fix diag 0size tensor shape infer bug (#8557)

* fix diag 0size tensor shape infer bug

* refine

* refine

* auto format by CI

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Format tensor on cpu (#8548)

* Format tensor on cpu

* use tensor.detach

* Remove useless WITH_CUDAs (#8562)

* unique identity (#8509)

* unique identity

* fix

* add identity name

* rm debug log

* mv identity from class to graph

* auto format by CI

* fix unique identity with multiple stages

* auto format by CI

* Update block.py

Co-authored-by: cheng cheng <472491134@qq.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add GenericStreamContext (#8560)

* Modify some file and add test (#8556)

* Modify some file and add test

* modify the content

* modify the format and test function name

* modify the format and aligned with pytorch

* delete print

* modify the function name

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* Move some op into amp gray list (#8545)

enlarge gray list

Co-authored-by: cheng cheng <472491134@qq.com>

* Refine inplace expand runtime_error (#8561)

* Refine inplace expand runtime_error

* Opt

* Refine

* Add Note

* OneEmbedding use malloc async (#8543)

* in out ptrs

* ops and test

* test pass

* prefetch tmp buffer

* embedding shuffle tmp buffer

* gradient shuffle

* tmp buffer size

* mem pool

* cuda 11.2

* add id_shuffle to setNumunique in update tests

* default not use dynamic alloc

* fix of_tidy

* add fused op

* address review

* init tmp_buffer

* mv memset

* fix

* one_embedding fused_lookup_init_cast and fused_update_put (#8564)

* add fused op

* mv memset

* fix

* address review

* rm fullcache n_missing check

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix cpu aligned_alloc size (#8569)

Signed-off-by: daquexian <daquexian566@gmail.com>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add flow norm (#8535)

* add flow norm

* rm import

* rm  doctest.testmod

* fix pad_packed_sequence method input requires_grad==True (#8574)

* fix pad_packed_sequence method input requires_grad==True

* fix append error when batch_first=True

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix embedding manager tmp buffer (#8585)

* fix embedding manager

* format

* fix reduce_ops 0size bug (#8551)

* fix reduce_ops 0size bug

* fix comment

* auto format by CI

* fix bug

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Align Momentum Optimizer (#8549)

* fix momentum update

* align momentum

* fix bug and finish eager unittest

* Support Graph optimizer

* fix momentum bug

* refine beta

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
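The update being aligned here presumably follows the common PyTorch-style momentum rule v = beta*v + g, p = p - lr*v; a plain-Python sketch of that assumed rule, not OneFlow's actual optimizer code:

```python
def momentum_step(param, grad, velocity, lr=0.1, beta=0.9):
    # One PyTorch-style momentum SGD step (dampening = 0, no Nesterov).
    # Mutates param and velocity in place.
    for i in range(len(param)):
        velocity[i] = beta * velocity[i] + grad[i]
        param[i] -= lr * velocity[i]
    return param, velocity
```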

* Fix Fill GetSbp bug and consistent test bug (#8576)

fix(FillOp): fix GetSbp bug and consistent test bug

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Dev Fully fused MLP Grad[OneEmbedding] (#8462)

* support fully fused mlp grad in eager

* support lazy backward

* fix output size

* add fallback to tmp_buf logic when ones buffer is not enough

* build sbp

* overlap allreduce

* fix overlap order

* fix format

* CUDA Graphs delayed capture

* Add ifcomm create for graph

* insert weight event roughly

* fix dbias allreduce error

* simplify code

* Add 11060 limit

* Remove print

* Rename

* fix fill bug and remove comm to cache

* Rename variable and add debug code for cache

* Use kernel state and fix bug

* remove print

* fix allreduce dbias bug

* fix header file

* fix comment

* remove redundant headerfile

* fix userops build error

* refine

* init nccl comm before execute kernel

* fix comment

Co-authored-by: liujuncheng <liujuncheng1022@gmail.com>

* rename mirrored to local (#8503)

* rename mirrored to local

* rename files

* rename files

* auto format by CI

* revert change of package_mirror.py

* rename LocalObject to Dependence

* rename fn LocalObject to Dependence

* merge master

* handle clang check

* fix

* refine

* rename local_object to dependence

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* Implement BroadcastElementwiseUnary primitive (#8384)

* Add code skeleton for broadcast unary primitive

* first try

* finish impl

* finish impl

* format

* fix build error

* address review

* refine

* address review comments

* use broadcast unary primitive in fill_tensor_ kernel

* handle pack tail statically

* fix

* address review

* address review

* Fix SimplifyBroadcastDims

* fix

* revert fill_kernel

Co-authored-by: Juncheng <liujuncheng1022@gmail.com>

* skip cpu autotest for graph global (#8593)

* TODO

* skip cpu autotest for graph global

* Refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add function_library.h Exception (#8241)

* add RuntimeError for checking

* add RuntimeError to CHECK_EQ

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* Refactor shrink (#8573)

* caching allocator

* auto format by CI

* Update ep_device_context.h

* EpDeviceCtx with CachingAllocator

* rm RawAllocator typename

* auto format by CI

* specific allocator in EpDeviceCtx

* auto format by CI

* rm outdated alloc

* simplify thread safe guard

* auto format by CI

* avoid return mutex

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Speed up SliceKernel (#8589)

* perf(SliceKernel): decrease number of cuda kernels and speed up

* perf(SliceKernel): use old kernel when small tensor is all fullslice

* use std::copy to copy contiguous memory

* fix cpu kernel bug

* Update readme and vsn for 0.8.0 (#8600)

* update version

* remove py3.6

* modify some file and improve error message (#8592)

* modify some file and improve error message

* modify scalar_by_tensor_op.cpp

* Update scalar_by_tensor_op.cpp

* Update slice_op.cpp

* Update test_slice_op.py

* Update test_slice_op.py

* auto format by CI

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* rename consistent to global (#8505)

* rename consistent to global

* rename consistent to global

* rename files

* rename files

* refine

* auto format by CI

* refine

* fix clang check

* fix

* fix

* fix

* rm to_consistent docs

* auto format by CI

* refine

* fix

* fix

* revert changes

* auto format by CI

* revert changes

* revert changes

* rename

* rename

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* add module related container docs (#8580)

* add module related container docs

* auto format by CI

* fix comment

* refine

* refine

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix rnn util extra memory usage when requires_grad=False (#8603)

* fix rnn util extra memory usage when requires_grad=False

* add comments

* refine comments

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* use bracket format slice in tensor str (#8489)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Perf TensorInfo constructor (#8606)

* perf(Autograd): perf TensorInfo constructor

* rename consistent to global

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* print operators' python location when print nn_graph (#8558)

1. add a flag in nn.Graph.debug() named print_op_loc for printing operator location.
2. add a flag in nn.Graph.debug() named only_print_user_code_loc for only printing users' code location.

* Add randint like (#8598)

* add randint_like op

* add docs for random

* refine

* auto format by CI

* add randint_like global test

* refine doc

* refine randint_like docs

* fix bug

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add full_like api (#8595)

* add full_like_op api

* refine

* add test

* refine

* refine docs

* refine

* add consistent_full test

* add full_like op

* fix docs comment

* change scalar sbp return value from list to tuple

* auto format by CI

* merge conflict

* revert

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix cumsum GenBackwardOpConfFn (#8604)

* fix cumsum GenBackwardOpConfFn

* add test case

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
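For reference, the backward rule that cumsum's GenBackwardOpConfFn has to wire up is a reversed cumsum of the upstream gradient (dx[i] = sum of dy[j] for j >= i); a minimal sketch of that rule, not the op-conf code itself:

```python
def cumsum(xs):
    # Inclusive prefix sum over a flat list.
    out, total = [], 0.0
    for x in xs:
        total += x
        out.append(total)
    return out

def cumsum_grad(dy):
    # Backward of cumsum: a cumsum over the reversed upstream gradient,
    # reversed back, so dx[i] = sum_{j >= i} dy[j].
    return cumsum(dy[::-1])[::-1]
```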

* revert change (#8613)

* fix test graph optimization conf CI bug (#8617)

* restore resource config after random tests

* refine

* refine

* Release pod tensor (#8552)

* ThreadLocalGuard

* split ReleaseTensor into ReleasePodTensor and ReleaseNonPodTensor.

* rename

Co-authored-by: luyang <flowingsun007@163.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add param group for optimizer (#8611)

* add add_param_group interface for Optimize

* add test for add_param_group

* revert

* fix comment

* refine

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix broadcast_elementwise_binary cpu (#8625)

fix broadcast_elementwise_binary_cpu

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* align exception msg to torch (#8627)

* align exception msg to torch

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* skip unstable global test in ci, reduce failure rate (#8635)

* fuse embedding interaction (#8586)

* fuse embedding interaction

* fix of_tidy

* refine

* fix

* address review

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix flip gen backward opconf (#8605)

* fix flip gen backward opconf

* use new opconf api

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add ONEFLOW_ONE_EMBEDDING_PERSISTENT_TABLE_SNAPSHOT_LOAD_MMAP_LOCKED (#8597)

* Add ONEFLOW_ONE_EMBEDDING_PERSISTENT_TABLE_SNAPSHOT_LOAD_MMAP_LOCKED

* refine

* use MAP_POPULATE

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Profiling main thread (#8601)

* ThreadLocalGuard

* refactor EagerBlobObjectList

* op_args_reserved_size

* remove useless comments

Co-authored-by: binbinHan <han_binbin@163.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fully Memory Log V2 with more details (#8565)

* Fully Memory Log V2 with more details

* refine log and long op name

* fix clang tidy

* fix test

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Xiaoyu Xu <xiaoyulink@gmail.com>

* Stream policy (#8590)

* ThreadLocalGuard

* refactor signature of StreamType::InitDeviceCtx

* refactor hint

* add StreamPolicy

* remove DeviceCtx args

* refine OpCallInstructionUtil::Prepare & Compute

* merge EpDeviceCtx and LazyJobDeviceCtx into StreamPolicy

* minor fix

* minor fix

* del useless code

* fix error

* fix merge error

* fix segmentation fault bug

* fix compile error

* del methods belonging to subclass

* resolve comment

Co-authored-by: binbinHan <han_binbin@163.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add fully support for broadcast matmul (#6937)

* fix arange bug

* fully support broadcast matmul

* add more check

* remove check

* add fully sbp

* fix full sbp

* Fix broadcast matmul grad

* remove old broadcast matmul grad

* add broadcast grad back and when B numaxes is 2, we use broadcast_gradB instead of matmul+reduce

* add lazy backward

* Add restriction: when transpose_a is false we can use bmatmul_grad_b

* revert

* fix broadcast matmul backward

* fix single client dispatch matmul logic

* revert old bcast matmul grad b kernel

* fix eager functional matmul backward

* add more test case

* remove redundant code

* add more special case

* when b num axes is 2, we only save tensor a

* fix annotation

* fix conflict and format

* remove single client matmul code

* Fix eval error

* fix conflict

* fix unittest

* Add init value

* support matrix vector matmul

* add vector matrix product

* Use matmul primitive to rewrite matrix vector product forward and backward

* Add full support for vector matrix product

* Fix sbp

* fix bug

* add unittest

* Add consistent test for broadcast matmul

* Remove redundant code

* fix userops annotation

* fix

* refine

* Fix clang static analysis

* fix clang analysis

* set check graph as false

* fix

* fix for unittest

* fix broadcast sbp bug

* try to fix unittest

* Fix consistent test

* fix multiplier to 4 for unittest

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
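A sketch of the special case noted above ("when B num axes is 2, we use broadcast_gradB instead of matmul+reduce"): for C = A @ B with batched A and 2-D B, grad_B can be formed by a single matmul over flattened batch dimensions rather than a per-batch matmul followed by a reduce. Pure-Python illustration of the math, not the kernel:

```python
def matmul(a, b):
    # Naive 2-D matmul on nested lists.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(r) for r in zip(*a)]

def grad_b_via_flatten(a_batched, dc_batched):
    # grad_B = flatten(A)^T @ flatten(dC): merge the batch axis into the
    # row axis, so one matmul replaces matmul + reduce over batches.
    a_flat = [row for a in a_batched for row in a]
    dc_flat = [row for dc in dc_batched for row in dc]
    return matmul(transpose(a_flat), dc_flat)
```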

* Revert "skip cpu autotest for graph global" (#8608)

* Revert "skip cpu autotest for graph global (#8593)"

This reverts commit b076be782fd8f21e50ee4915f2d1562f3a9ab4c0.

* cherry pick from master

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* OneEmbedding add tmp_buffer allocator (#8588)

* fix embedding manager

* format

* refine embedding_manager tmp_buffer allocator

* fix

* format

* refine

* refine

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* refine error msg for some user ops (#8579)

* refine error msg for some user ops

* refine error msg for some user ops

* optimize

* optimize the writing

* optimize the writing

* optimize the writing

* auto format by CI

* optimize writing

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add tril fill value (#8655)

add tril fill value

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
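The semantics being added are presumably a tril whose masked-out entries take a configurable value instead of a hard-coded zero; a plain-Python sketch of that assumed behavior:

```python
def tril_with_fill(matrix, fill_value=0.0, diagonal=0):
    # Keep entries on or below the given diagonal; replace entries above
    # it with fill_value rather than always zero.
    return [[matrix[i][j] if j - i <= diagonal else fill_value
             for j in range(len(matrix[i]))] for i in range(len(matrix))]
```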

* fix_non_pod_data_allocate_bug (#8657)

Co-authored-by: Li Xinqi <lixinqi2010@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fix norm (#8629)

* fix norm

* add doc

* add bool &

* update math_functor.cpp

* add note

* fix_decorate_mem_leak_bug_in_eager_boxing (#8661)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add higher order derivative for leaky_relu and negative op (#8643)

* add higher derivative for leaky_relu and negative

* fix a typo

* remove functor

* add initialize alpha

* fix incorrect dim size in global test

* fix incorrect dim size in global test

* optimize testcase

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
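The higher-order derivatives added here are simple in closed form: leaky_relu's first derivative is piecewise constant (1 for x > 0, alpha otherwise), so its second derivative is zero away from x == 0, and negative's derivative is the constant -1. A sketch of those closed forms, for illustration only:

```python
def leaky_relu_grad(x, alpha=0.01):
    # d/dx leaky_relu(x): piecewise constant.
    return 1.0 if x > 0 else alpha

def leaky_relu_grad_grad(x, alpha=0.01):
    # Second derivative: zero almost everywhere (undefined at x == 0).
    return 0.0

def negative_grad(x):
    # d/dx (-x) = -1, so all higher derivatives are zero as well.
    return -1.0
```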

* update oneflow intro to show the difference (#8669)

* update oneflow intro

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine

* refine oneflow intro

* Stacked error (#8671)

* ThreadLocalGuard

* StackedError

* StackedError

Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com>

* Refactor tensor initializer (#8626)

* fix(*): fix xavier_initializer

* refactor(Initializer): refactor initializer

* fix function name

* auto format by CI

* refine

* fix interface in tensor.py

* fix(trunc_normal_): fix init bug and add test

* auto format by CI

* fix bug

* add oneflow.nn.init.normal_ test

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* Fix nn doc (#8650)

* fix hsplit doc

* add doc for module

* fix dtype

* fix formula

* add ref

* fix row length

* Fix reduce max min bool dtype bug (#8651)

* fix reduce_max_min_bool_dtype

* fix bug

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Remove redundant exception wrapper (#8631)

* remove redundant ExceptionWrapper

* refine KeyErrorMessage

* refine

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* Refactor MemoryCase to eliminate determine statements of device_type (#7727)

* ref memory_case_util

* ref BlobObject::CheckMemCase

* ref mem_case using

* address review

* address review

* namespace memcase -> memory

* fix conflict

* address review

* address static analysis

* rm check

* cpu device_id is always 0

* fix conflict

* timeout-minutes: 50

* revert change

* increase thrd limit in container

* skip 2x2 TestEinsumConsistent

* skip failed case of distributed test

* auto format by CI

* fix_non_pod_data_allocate_bug

Co-authored-by: Li Xinqi <lixinqi2010@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: tsai <jackalcooper@gmail.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: clackhan <han_binbin@163.com>

* fix some data races in c++ api and SteadyVector (#8654)

* fix some data races in c++ api and SteadyVector

Signed-off-by: daquexian <daquexian566@gmail.com>

* skip self copy in MutShapeView::ToShape

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* Fix sin/cos higher order derivative (#8648)

* fix(GradGrad): fix sin/cos higher order derivative

* fix(GradGrad): fix calculate error

* refine autograd global test

* auto format by CI

* refine sin/cos grad_grad calculate

* fix static analysis

* merge conflict

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: Ping Zhu <58718936+REYGU@users.noreply.github.com>
Co-authored-by: Zhu, Ping <pingzhuu@outlook.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* refine_eager_boxing_to_adapt_ep (#8568)

* refine_eager_boxing_to_adapt_ep

* fix typo

* refine

* refine symmetric-acyclic-nd-sbp-to-nd-sbp

* refine

* fix error

* fix static check

* add NOLINT

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fix repeat bug (#8645)

* make result contiguous

* add test case

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* Instruction policy (#8583)

* ThreadLocalGuard

* vm::InstructionPolicy

* fix compile error (#8623)

* fix compile error

* change MirroredObject to Dependence

* Modify DependenceVector

* rm include stream type

* fix stream type

* auto format by CI

Co-authored-by: Yu OuYang <xuanjiuye@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* handle non-contiguous input (#8665)

* handle non-contiguous input

* refine

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* rename define CONSISTENT to GLOBAL (#8652)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Refine naive interpret (#8672)

* ThreadLocalGuard

* refactor EagerBlobObjectList

* op_args_reserved_size

* remove useless comments

* rename one::EagerBlobObjectList* to vm::EagerBlobObject*

* refactor signature of InstructionsBuiler::Call

* PhysicalRun

* refactor InstructionsBuilder::Call

* remove unused StatefulOpKernel::need_check_mem_case

* remove EagerLocalTensorImpl::is_shape_synced_

* refactor SoftSync

* move SmallVector from common/container_util.h to framework/instructions_builder.cpp

* explicit scalar initialization

Co-authored-by: clackhan <han_binbin@163.com>

* Rebuild Docs V0.8.0 (#8392)

* rebuild for 5 module

* fix bug

* fix for doctree and content  in nn and

* fix

* fix

* fix

* add some

* fix for oneflow.rst

* update oneflow oneflow.nn

* update tensor

* update tensor module

* update

* test

* update

* update

* fix for undone desc

* docs: oneflow.utils.data (#8485)

* feat(utils.data): add oneflow.utils.data

* docs(dataloader): change the docstring of DataLoader

* docs(tensor): add methods to oneflow.Tensor document

* docs(optim): change docstring of optimizer and add a note to the doucument

* nn.graph

* fix for graph

* fix bug

* review nn and linalg document (#8515)

* docs(nn): add contents to oneflow.nn document

* docs(linalg): refactor oneflow.linalg document

* change attributes.rst and review nn.functional.rst (#8514)

* change attributes.rst and review nn.functional.rst

* reconstruction oneflow.cuda

* fix cuda and rebuild comm demo (#8582)

* update image

* add distributed

* oneembedding & refine graph

* update for distributed one_embedding

* fix rnn.py (#8616)

* Refactor oneflow.nn.init docs (#8622)

docs(nn.init): refactore nn.init document

* docs(nn.init): remove the comments

* docs(utils.data): remove the comments

* update and fix bug

* docs(review): refine the documents (#8646)

* docs(review): refine oneflow, nn, Tensor, nn.init, linalg, utils.data, optim modules

* docs(optim): modify the code examples

* docs(tensor): edit note

* Refactor oneflow.autograd docs (#8594)

* docs(autograd): refactor oneflow.autograd

* docs(autograd): edit "Default gradient layouts".

* docs(autograd): reedit "Default gradient layouts"

* docs(autograd): add comment

* docs(autograd): add reference

* update

* docs(tensor): change autoclass to autosummary

* update

* update

* add oneflow.linalg.diagonal (#8653)

* docs(linalg): add oneflow.linalg.diagonal

* update environment variable

* Update docs/source/distributed.rst

Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com>

* Update docs/source/distributed.rst

Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com>

* update environment variable

* update for ev & distributed

* update distributed

* update ev

* update distribute desc

* Update docs/source/distributed.rst

Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com>

* update

* Modify docstring descriptions (#8656)

* docs: move pytorch refernce to end

* docs: add some docstring

* docs(refs): add refs

* Update docs/source/distributed.rst

* update for distributed details and environment_variable

* docs(docstring): Modify all reference links to version 1.10 (#8663)

* fix bug

* fix bug

* fix all warning

Co-authored-by: Guoliang Cheng <1876953310@qq.com>
Co-authored-by: liu xuan <85344642+laoliu97@users.noreply.github.com>
Co-authored-by: Guoliang Cheng <lmyybh_lazy@163.com>
Co-authored-by: laoliu97 <841637247@qq.com>
Co-authored-by: Yao Chi <later@usopp.net>
Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com>

* Fix zeros like and ones_like api (#8632)

* fix zeros_like and ones_like bug

* refine

* revert

* refine

* fix tensor_slice_view infer physic_shape bug

* add test

* refine

* auto format by CI

* fix bug

* refine

* auto format by CI

* fix import error

* fix bug

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fix sbp print bug (#8689)

* Add a normal priority with no transfer but different sbp

* Fix the bug for printing no boxing edge

* Do not use P for weights

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* eager_local_interpreter_with_infer_cache (#8619)

* ThreadLocalGuard

* refactor EagerBlobObjectList

* op_args_reserved_size

* remove useless comments

* rename one::EagerBlobObjectList* to vm::EagerBlobObject*

* refactor signature of InstructionsBuiler::Call

* PhysicalRun

* refactor InstructionsBuilder::Call

* remove unused StatefulOpKernel::need_check_mem_case

* remove EagerLocalTensorImpl::is_shape_synced_

* eager_local_interpreter_with_infer_cache

* remove useless code

* resolve comments

* refactor TensorMeta::TensorMeta(const TensorMeta)

* use small vector

* add kMaxNumDims

* fix error include

* fix split Symbol LocalTensorMeta error

* refactor SoftSync

* move SmallVector from common/container_util.h to framework/instructions_builder.cpp

* move ONEFLOW_EAGER_ENABLE_LOCAL_INFER_CACHE to eager.h

* add blank line

* resolve comments

* minor fix

* refine

* explicit scalar initialization

* fix static check error

* auto format by CI

* of_format

* resolve comment

* refine

* refine

* refine

Co-authored-by: lixinqi <lixinqi0703106@163.com>
Co-authored-by: Li Xinqi <lixinqi2010@gmail.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix gelu nn.Module bug and support tanh mode. (#8693)

* add gelu2 api

* refine test

* refine docs

* refine

* restruct

* delete useless headfile

* format

* rm doc of tensor.gelu (#8696)

Co-authored-by: Shanshan Zhong <62104945+zhongshsh@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fix bug in CrossFeatureInteraction LazyBackward (#8677)

fix bug

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix floating-point scalar tensor in arange (#8673)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add nn functional fold (#8667)

* add fold

* update fold.py

* add test

* fix doc

* fix comment

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* modify some file and improve the error message (#8566)

* modify some file and improve the error message

* modify the content

* modify the content

* auto format by CI

* Update roi_align_op.cpp

* Update roi_align_op.cpp

* Update reshape_user_op_util.cpp

* auto format by CI

* Update roi_align_op.cpp

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* [OneEmbedding] add id_shuffle_copy_out (#8683)

add id_shuffle_copy_out

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix add_param_group step key not match error (#8698)

* fix add_param_group step key not match error

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add env ONEFLOW_EP_CUDA_DEVICE_FLAGS and ONEFLOW_EP_CUDA_STREAM_FLAGS (#8703)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix for docsv0.8 (#8710)

* fix repeat op 0-size related bug (both in FW and AD) (#8707)

* fix repeat op 0-size related bug (both in FW and AD)

* refine

* refine static check

* refine

* fix comment

* fix comment

* refine

* fix test

* auto format by CI

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Support Dropout Scale in FusedMLPGrad[OneEmbedding] (#8633)

* support alpha list

* Remove redundant modify

* remove redundant alpha set

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fix bug of Tensor.type (#8697)

* fix bug of tensor.type(flow.Tensor)

* fix bug of tensor.type(flow.Tensor) about device

* Fix tensor type doc (#8699)

fix doc of tensor.type

* add test for tensor.type(flow.Tensor)

* move PyTensorMetaCls_CheckExact to header file

Co-authored-by: Shanshan Zhong <62104945+zhongshsh@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* ONEFLOW_GRAPH_PLACE_TRAINING_STATE_ON_ALL_RANKS (#8706)

* ONEFLOW_GRAPH_PLACE_TRAINING_STATE_ON_ALL_RANKS

* auto format by CI

Co-authored-by: liujuncheng <liujuncheng1022@gmail.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* define_mut_output_shape_and_mut_output_stride_in_infer_ctx (#8709)

* define_mut_output_shape_and_mut_output_stride_in_infer_ctx

* fix merge master error

* fix typo

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add qat conv modules (#8368)

* add qat conv modules

* add quantization related modules to doc

* refine qatconv modules doc

* add qat conv module tests

* refine

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add unsqueeze_multiple_op (#8714)

* add unsqueeze_multiple_op

* modify the format

* Update functional_api.yaml

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* modify broadcast_like_op.cpp and add test (#8720)

* modify broadcast_like_op.cpp and add test

* modify broadcast_like_op.cpp

* Update broadcast_like_op.cpp

Co-authored-by: Yinggang Wang <wyg19970408@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* JIT LR (#8500)

* add example code

* Update cosine_annealing_lr.py

* enable self params transformer

* enable pass ast to c++ api

* enable jit backend for lr

* enable jit global register and invoke

* convert Global to Singleton for new merge

* enable pybind11 walk on python ast

* enable test all existent get_lr of oneflow in python

* enable py_ast_wrapper pass ast from python to mlir

* switch all ast to ast-wrapper in mlir scope

* define python ast partially

* partial python ast definition

* trim asdl of python ast

* mlir gen

* add symbol table

* from ast to jit done

* switch llvm::errs() to mlir::emitError and convert switch to typeSwitch

* trim duplicate namespace use

* fix LIT header

* add some docs

* enable comparison with or_else, if-with-return seamlessly in branches, and mutable variables

* trim code and refine struct

* register pybind11 ast node for shared_ptr

* enable cpp class in python

* go through python to mlir to llvm to jit to run

* add addf subf op

* work well on stepLR linearLR exponentialLR cosineDecayLR cosineAnnealingLR constantLR

* enable maxf minf conversion to llvm ir

* rename LR_JIT to LRJITRegister

* remove LR_JIT_Engine and switch Invoke to std::function returned by lookup

* refine struct

* enable bisect_right and give the python register api a dump option arg

* add bisect_left and bisect_transformer specially, delete former test python script

* remove c++17 standard

* restore double hash to iterator

* publish

* publish

* publish

* use llvm classof and typeswitch rightly

* trim

* commit

* commit

* commit

* commit

* commit

* commit

* auto format by CI

* Update ir.cpp

* Update OneFlowLRJITRegistry.h

* auto format by CI

* Update AstMlirGen.h

* Update lr_jit.cpp

* auto format by CI

* Naming conventions

* auto format by CI

* auto format by CI

* deploy _ behind

Co-authored-by: leaves-zwx <kunta0932@gmail.com>
Co-authored-by: yuhao <1171760467@qq.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: yuhao <72971170+howin98@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add logspace (#8599)

* add logspace

* add global test

* restore rand

* fix doc

* rename consistent to global

* adjust import order

* add todo
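
A logspace op conventionally computes `base ** linspace(start, end, steps)`. A minimal NumPy sketch of that contract (an assumption about the semantics, not the actual kernel):

```python
import numpy as np

def logspace(start, end, steps, base=10.0):
    # exponents are evenly spaced; the outputs are evenly spaced in log scale
    return np.power(base, np.linspace(start, end, steps))

y = logspace(0.0, 3.0, 4)  # base**[0, 1, 2, 3]
```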

* Add hann_window (#8615)

* add hann_window

* rm useless include

* add check

* adjust import order
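
A Hann window is defined as `w[k] = 0.5 * (1 - cos(2*pi*k / N))`, where `N` is the window length for a periodic window and `length - 1` for a symmetric one. A NumPy sketch under that assumption (illustrative, not OneFlow's implementation):

```python
import numpy as np

def hann_window(n, periodic=True):
    # periodic windows are meant for spectral analysis (FFT); symmetric
    # ones for filter design
    if n == 1:
        return np.ones(1)
    denom = n if periodic else n - 1
    k = np.arange(n)
    return 0.5 * (1.0 - np.cos(2.0 * np.pi * k / denom))

w = hann_window(8)
```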

* add ONEFLOW_VM_PENDING_HANDLE_WINDOW_SIZE (#8730)

* add ONEFLOW_VM_PENDING_HANDLE_WINDOW_SIZE

* add environment to vm.h

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Fix as strided bool type and view bug (#8713)

* fix as_stride bug

* refine

* refine

* refine

* delete useless head file

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add functional binary cross entropy (#8708)

* add gelu2 api

* refine test

* refine docs

* refine

* restruct

* delete useless headfile

* format

* rm doc of tensor.gelu

* add functional binary cross entropy

Co-authored-by: BBuf <1182563586@qq.com>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* support map_location in flow.load (#8666)

* support map_location in flow.load

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

* fix tests

Signed-off-by: daquexian <daquexian566@gmail.com>

* fix bug when map_location is None

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
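
The `map_location` argument conventionally accepts `None` (keep the saved device), a device string, or a dict remapping saved device strings to target ones. A hypothetical resolver sketching that convention (the helper `remap_device` is illustrative, not OneFlow's API):

```python
def remap_device(saved_device, map_location):
    """Resolve where a saved tensor should be placed on load."""
    if map_location is None:
        return saved_device          # keep the device recorded at save time
    if isinstance(map_location, str):
        return map_location          # force everything onto one device
    if isinstance(map_location, dict):
        return map_location.get(saved_device, saved_device)
    raise TypeError(f"unsupported map_location: {map_location!r}")
```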

* Add addcdiv (#8581)

* add addcdiv

* fix tensor_functions

* fix inplace

* add test number

* rename consistent to global
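
`addcdiv` conventionally computes `input + value * tensor1 / tensor2` elementwise with broadcasting. A NumPy sketch of that contract (illustrative, not the actual kernel):

```python
import numpy as np

def addcdiv(input, tensor1, tensor2, value=1.0):
    # out = input + value * tensor1 / tensor2, elementwise
    return input + value * tensor1 / tensor2

out = addcdiv(np.zeros(3),
              np.array([1.0, 2.0, 3.0]),
              np.array([2.0, 2.0, 2.0]),
              value=4.0)
# -> [2., 4., 6.]
```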

* Inner most dim case for cumsum cumprod op (#8403)

* cumsum use cub scansum in some case

* prod use cub scan

* refine name

* refine

* optimize cum op

* format

* fix

* get device properties by cuda stream class

* revert useless code

* refine

* outer dim use parallel sweep algo

* refine

* fix a fraction of threads hit __syncthreads

* revert

* refine kernel define

* refine

* refine

* refine

* refine

* move comment

* fix

* fix

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Define mut output dtype and mut output is dynamic in infer ctx (#8716)

* define_mut_output_shape_and_mut_output_stride_in_infer_ctx

* fix merge master error

* fix typo

* define_mut_output_dtype_and_mut_output_is_dynamic_in_infer_ctx

* replace const DataType& with DataType

* replace const DataType& with DataType ret

* split TensorDesc4ArgNameAndIndex and MutTensorDesc4ArgNameAndIndex

* refine

* minor fix

* refine

* fix static check error

* Update op_expr.cpp

* Update op_expr.cpp

* Update stateful_opkernel.cpp

* refine

* fix static check error

* refine

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Dev refactor fuse instruction policy (#8624)

* ThreadLocalGuard

* vm::InstructionPolicy

* refactor fuse instruction policy

* fix compile error (#8623)

* fix compile error

* change MirroredObject to Dependence

* Modify DependenceVector

* add instruction policy util

* add instruction policy util

* remove include

* add include

* rm fuse instruction type

* Modifying variable properties

* add stream_sequential_dependence_ to instruction_policy

Co-authored-by: lixinqi <lixinqi0703106@163.com>
Co-authored-by: Li Xinqi <lixinqi2010@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix bug of batchnorm num_batches_tracked global error when loading state_dict (#8723)

add condition for assign num_batches_tracked

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add launch master port limit (#8563)

* add launch master port limit

* Update python/oneflow/distributed/launch.py

Co-authored-by: daquexian <daquexian566@gmail.com>

Co-authored-by: daquexian <daquexian566@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
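
A master-port limit in a distributed launcher typically just validates the user-supplied rendezvous port against the TCP port range before binding. A hypothetical sketch (the helper name is an assumption, not the launcher's actual code):

```python
def check_master_port(port):
    # a valid rendezvous port must fit in the TCP range (1..65535)
    if not 0 < port < 65536:
        raise ValueError(f"master_port must be in (0, 65536), got {port}")
    return port
```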

* Fix docs import distance (#8691)

* fix import distance

* add functional apis

* add smooth_l1_loss docs

* refine activation.py

* add deleted api

* review

* 添加oneflow, nn 等模块文档中遗漏的接口 (#8704)

* docs: add api

* docs(nn): refactor nn

* review

Co-authored-by: Guoliang Cheng <lmyybh_lazy@163.com>
Co-authored-by: ChenQiaoling <48576019+Chenqll@users.noreply.github.com>

* refactor control stream type (#8647)

* refactor control stream type

* auto format by CI

* Add method implementation

* refine

* refine

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: Li Xinqi <lixinqi2010@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Define mut output tensor desc (#8717)

* define_mut_output_shape_and_mut_output_stride_in_infer_ctx

* fix merge master error

* fix typo

* define_mut_output_dtype_and_mut_output_is_dynamic_in_infer_ctx

* define_mut_output_dtype_and_mut_output_tensor_desc

* replace const DataType& with DataType

* replace const DataType& with DataType ret

* split TensorDesc4ArgNameAndIndex and MutTensorDesc4ArgNameAndIndex

* refine

* minor fix

* fix merge error

* fix warning error

* refine

* fix static check error

* Update op_expr.cpp

* Update op_expr.cpp

* Update stateful_opkernel.cpp

* refine

* fix static check error

* refine

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Symbolic local tensor meta (#8662)

* ThreadLocalGuard

* refactor EagerBlobObjectList

* op_args_reserved_size

* remove useless comments

* rename one::EagerBlobObjectList* to vm::EagerBlobObject*

* refactor signature of InstructionsBuiler::Call

* PhysicalRun

* refactor InstructionsBuilder::Call

* remove unused StatefulOpKernel::need_check_mem_case

* remove EagerLocalTensorImpl::is_shape_synced_

* eager_local_interpreter_with_infer_cache

* remove useless code

* resolve comments

* refactor TensorMeta::TensorMeta(const TensorMeta)

* use small vector

* Symbolic LocalTensorMeta

* check shape in critical_section

* add kMaxNumDims

* fix error include

* fix split Symbol LocalTensorMeta error

* fix split cache and symbolic local tensor meta error

* refactor SoftSync

* move SmallVector from common/container_util.h to framework/instructions_builder.cpp

* move ONEFLOW_EAGER_ENABLE_LOCAL_INFER_CACHE to eager.h

* add blank line

* resolve comments

* minor fix

* refine

* explicit scalar initialization

* fix static check error

* auto format by CI

* of_format

* resolve comment

* refine

* refine

* refine

* fix error

* define MutOutputShape and MutOutputStride in InferContext

* define_mut_output_shape_and_mut_output_stride_in_infer_ctx

* fix merge master error

* fix typo

* fix static check error

* define_mut_output_dtype_and_mut_output_is_dynamic_in_infer_ctx

* define_mut_output_dtype_and_mut_output_tensor_desc

* replace const DataType& with DataType

* split const and mut func in LocalTensorMeta

* replace const DataType& with DataType ret

* split TensorDesc4ArgNameAndIndex and MutTensorDesc4ArgNameAndIndex

* refine

* minor fix

* fix merge error

* fix warning error

* refine

* fix static check error

* Update op_expr.cpp

* Update op_expr.cpp

* split MutTensorMeta and MutLocalTensorMeta

* Update stateful_opkernel.cpp

* refine

* fix static check error

* refine

* refine

* resolve comment

* refine

* fix typo

Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com>

* fix typo

* use OpArgsVector

Co-authored-by: lixinqi <lixinqi0703106@163.com>
Co-authored-by: Li Xinqi <lixinqi2010@gmail.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com>

* Feat general basic communication (#8437)

* Add a slight cost for B->S and B->P in 2d sbp

* Add penalty for P in consumer

* Fix a slight bug

* Add at most 1 middle node for general basic communication

* Add the cost for general basic communication

* Add the slight penalty for eager

* Skip initialization of boxing collector if not needed

* Fix a bug

* Dev nd nccl send recv boxing (#8467)

* nd nccl_send_recv_boxing

* rm print

* support num_axes > 2

* Add distributed optional run (#8372)

* Add

* change deps

* add install

* add skip

* autoprof supports bandwidth (#8367)

* autoprof supports bandwidth

Signed-off-by: daquexian <daquexian566@gmail.com>

* print bandwidth

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* remove tmp buffer of cumprod cpu backward kernel (#8369)

* remove tmp buffer of cumprod cpu backward kernel

* refine

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Move tensor api to cpython part3 (#8342)

* add tensor_functions

* concat py methods

* add hash, restore tensor.py

* check replacement

* refine code, remove commented tensor.py

* refine code

* move some api

* add cpu and cuda api

* add triu tril norm and etc.

* remove tensor_functions.h

* move more api

* move more api, refine size

* fix typo

* format code, remove useless include

* refine code

* refine code, fix typo

* align .cuda to python

* refine code

* split some api to part3 for review

* remove positional only arguments of argmax and argmin

* remove arguments parse

* modify arguments name in matmul and floor_divide

* rename BINARY_FUNC to DIRECT_PASS_FUNC, modify some functions

* refine code, format code

* add inplace /=, add comments

* remove name in macros

* remove python api

* remove redundant include

* remove cout

* format code

* refactor tensor.size by directly call shape.at, refactor tensor.sub_ by calling nb_sub_

* remove redundant code

* auto format by CI

* fix typo, fix wrong call

* modify idx datatype from int32 to int64 in tensor.size

* add some DIRECT_PASS_FUNC

* add cpu cuda var pow and etc.

* add masked_fill any all

* make REDUCE_FUNC macro, add reduce_* functions

* add 0dim check in ReduceSumWhole, refine yaml

* fix bug

* restore add add_ sub sub_

* add unittest for tensor.half tensor.add tensor.add_

* refine code

* refine code

* fix typo

* fix bug of tensor.std()

* refactor var std and cuda, using c++ functional api

* add beta and threshold in softplus

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add nn_functor Check (#7910)

* add bias_add_check

* add bias_add error test

* fix conv2d nhwc bias_add error

* add nhwc conv test

* add bias_add_error test

* Add bias add error check

* Rename

* add batch matmul error check

* add matmul check error msg

* remove annotation

* add fused mlp error msg check

* Add pixel shuffle check test

* add more test until normalization add relu functor

* refine error message

* finish all nnfunctor check msg

* handle type error

* remove useless symbol

* modify back to TypeError

* fix all comment

* Remove redundant code

* Remove pad ndim check

* fix bias add space

* fix check logic because the CI GPU is not always gpu:0

Co-authored-by: hjchen2 <chenhoujiangcug@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add FusedMatmulBiasAddReluDropout [OneEmbedding] (#8222)

* previous version for fused_matmul_bias_add_relu_dropout

* add op infer

* fix detail

* finish forward

* support dropout rate list

* add forward test

* fix bug for output buffer

* Configurable alpha params

* try to add bit mask logic

* Add bitmask first version!

* Add row col bitmask logic

* support not align4 reludropout

* simplify relu dropout ld logic

* Add naive relu dropout grad kernel

* add simple relu dropout grad kernel

* Rename

* support relu_dropout bitmask backward

* add vectorized optimization

* fix tmp buffer

* add to amp list

* add lazy backward logic

* Refine kernel

* add indextype dispatch

* simplify functor logic

* fix cublas fused mlp aux_ld shape bug

* Add more relu dropout kernel

* add full unittest

* fix bug in skip final activation

* refine

* Remove dump func

* fix format

* Remove cmake

* remove redundant divide

* add padded version

* fix dropout

* oneflow curand

* refine

* remove redundant kernel

* add unroll logic

* add unroll and ballot sync

* refine format

* Remove fast curand

* Refine python interface

* Add if branch for memset

* fix python logic

* just for debug

* not use matmul bias add grad

* add launch 1 block limit

* fix unittest

* Refine

* fix graph backward bug

* limit to 11060

* change to use int32_t dtype for cublas aux

* Fix jc comment

* fix comment

* fix convert

* fix static_analysis

* fix at

* fix userops td

* fix userops td

* fix const ref

* fix compile error for bfloat16

* limit to 11060

* fix bug

Co-authored-by: Juncheng <liujuncheng1022@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix gather 0-dim tensor bug (#8376)

* fix 0-dim tensor bug

* refine

* support input 0-dim tensor for gather

* refine

* refine

* refine dim_scatter_kernel check

* refine

* refine check

* fix clang_tidy error

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add api to apply external job pass (#8370)

* Add condition to find-test-cache-distributed (#8387)

* add condition to find-test-cache-distributed

* fix

* warp dim util (#8382)

* warp dim util

* format

* use more maybe_wrap_dim

* refine array functor

* add more

* refine math_functor
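
`maybe_wrap_dim` normalizes a possibly negative axis into `[0, ndim)`, raising when it is out of range. A pure-Python sketch of that utility (illustrative; the real implementation is C++):

```python
def maybe_wrap_dim(dim, ndim):
    """Normalize a possibly-negative axis: -1 means the last axis."""
    if not -ndim <= dim < ndim:
        raise IndexError(f"dim {dim} out of range for ndim {ndim}")
    return dim + ndim if dim < 0 else dim
```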

* fix_bug_in_broadcast_min_max_grad_and_broadcast_like (#8379)

* fix_bug_in_broadcast_min_max_grad_and_broadcast_like

* refine

* fix static check error

* fix bug about index (#8388)

* fix bug about index

* add test case

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* LogicalSliceAssign support full slice sbp (#8344)

* feat(SliceOp): slice ops support 2d sbp

* fix(SliceOp): fix [B, P] 2d sbp bug

* refine error message

* fix bug in parallel_num == 1

* add comment

* add warning and format

* add NOLINT for boxing check

* feat(LogicalSliceOps): support all nd_sbp

* feat(LogicalSlice): support nd_sbp

* add error message

* fix(AutoTest): fix auto_test bug in module.parameter pass

* auto format by CI

* fix(LogicalSliceAssign): skip test when 1n1d

* fix SliceParams memset error

* remove memset

* add CHECK_JUST

* fix(*): make sure split_axis >= 0 or equal to SPLIT_AXIS_FOR_NON_SPLIT

* remove memset

* fix split_info.axis bug

* feat(LogicalSliceOps): support grad

* add logical_slice gradient_funcs

* feat(LogicalSliceAssign): LogicalSliceAssign support full slice sbp

* auto format by CI

* test(LogicalSlice): fix logical_slice dims

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Houjiang Chen <chenhoujiangcug@gmail.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* fix_tensor_from_numpy_mem_leak_bug (#8391)

* fix_tensor_from_numpy_mem_leak_bug

* add note

* refine note

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Make of_pyext_obj static only to make sure only a python ext so has python symbols (#8393)

* make of_pyext_obj static only

* refine note

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Adjust tolerance setting in embedding_renorm unit test (#8394)

* support front end compile for job to iree (#8249)

* support frontend dev version

* polish name

* add tosa-to-elf.mlir

* tosa to elf by llvm

* conv2d partial

* an enhanced frontend runner

* support numpy as input

* enable using multiple nn graphs with different inputs (distinguished by job name)

* enable multiple input

* enable cpu and cuda

* change full_name to _full_name

* support exchange cuda with cpu seamlessly

* remove pip

* lit config

* polish

* trim

* auto format by CI

* modify

* auto format by CI

* last line polish

* use unittest

* auto format by CI

* use allclose

* auto format by CI

* polish

* optimize convert oneflow to tosa

* conv2d

* conv2d enhanced && conv2d examples add

* add road map

* add add_n2Op and broadcast_addOp conversion

* add matmulOp conversion

* support converting normalization op to tosa (partially)

* update roadmap

* support i64 tensor to dense elem attr

* support 100% resnet op conversion

* add test mlir

* add test iree resnet python script

* auto format by CI

* done

* enhance iree resnet test script

* auto format by CI

* rebuild code

* auto format by CI

* rebuild test script

* update

* auto format by CI

* pub

* trim test scripts

* move

* move

* input and output add block arg judgement

* emit error in variable conversion

* error handle for ci

* modify err info

* auto format by CI

* merge

* auto format by CI

* output not block

* flow ones

* rm const

* trim maybe

* trim maybe with header file

* const auto

* solve clangd error

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Feat/zero mix with mp (#8036)

* add zero limit

* add debug

* add mix zero test

* refactor zero api

* zero test with mp

* add 2d test

* add zero nd

* add nd zero

* add sbp cast

* test passed soft limit consumer

* refine size api

* zero use stage 2

* add limit consumer api

* add new api

* refine zero s select

* fix index out of range

* rm zero limit on device type

* zero test with activation checkpointing

* add identity when dp sequence len is 1

* move to base with master

* fix

* fix

* fix

* add test

* debug bad case

* refine test for eager and graph boxing

* test case ready

* simplify

* refine test

* fix buff size

* fix conflict

* refine zero nd

* refine

* add full test

* revert change

* refine split check

* fix typo

* rm log

* split long func

* restore test

* Update optimizer_placement_optimization_pass.cpp

* auto format by CI

* auto format by CI

* fix static check

* add tips for zero api change

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Revert embedding normal path and fix amp list (#8374)

* revert embedding normal path, fix amp list

* fix amp

* fix memset bug in gather cpu kernel

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* replace fixed_vector with small_vector and make Shape inherit from it (#8365)

* Replace fixed_vector with llvm::SmallVector

Signed-off-by: daquexian <daquexian566@gmail.com>

* Shape inherited from llvm::SmallVector

Signed-off-by: daquexian <daquexian566@gmail.com>

* refine cmake

Signed-off-by: daquexian <daquexian566@gmail.com>

* rename fixed_vector to small_vector

Signed-off-by: daquexian <daquexian566@gmail.com>

* fix reviews

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

* update Shape constructor

Signed-off-by: daquexian <daquexian566@gmail.com>

* add 'PUBLIC' keyword to all target_link_libraries

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

* update cmake

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

* update cmake

Signed-off-by: daquexian <daquexian566@gmail.com>

* update cmake

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

* set is_initialized_ default to true

Signed-off-by: daquexian <daquexian566@gmail.com>

* override some methods to set is_initialized_

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>

* Light plan for debug (#8396)

* Light plan for debug

* fix note

* disable terminfo to fix missing terminfo symbols (#8400)

* disable terminfo to fix missing terminfo symbols

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix bug of ZeRO MP in complex case (#8404)

* Remove redundant output_lbns in ir (#8409)

* mv case

* remove redundant info

* Dev FusedCrossInteraction[OneEmbedding] (#8335)

* add simple fused cross interaction forward

* add packed fused

* Add cross interaction grad

* simplify code

* fix bug

* support crossnet v2

* support cross interaction v2

* add lazy backward

* Rename and add test

* fix jc comment

* fix comment

* fix bug

* fix userops td elem_cnt for FUSED Group

* fix header file

* fix clang static analysis

* fix unittest

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add exe graph physical shape check msg (#8002)

* fix index select op in graph

* add exe graph physical shape check msg

* improve the debug information for the python stack trace

1. add a parameter 'max_stack_depth' to specify the max depth for the stack trace
2. refactor other debug related classes.

* remove parens

* update

* resolve PR comments

* update

* update graph debug test file.

* restore self._debug in class Graph and class ModuleBlock

* Do not shorten the stack frame string if it is in debug mode

* delete TODOs

* disable conv3d test (#7969)

Signed-off-by: daquexian <daquexian566@gmail.com>

* skip layernorm random_data_warp test (#7941)

* skip layernorm random_data_warp test

* warp/block/uncached case only test gpu

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Lock click version (#7967)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add global avgpool unittest (#7585)

* fix (#7978)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Support negative dim in scatter op (#7934)

* support negative dim in scatter op

* refine scatter test

* refine scatter test again

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
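
Supporting a negative `dim` in scatter amounts to wrapping the axis before the indexed write. A NumPy sketch of the idea (illustrative, not the OneFlow kernel):

```python
import numpy as np

def scatter(src, dim, index, updates):
    # wrap the negative axis first, then do an indexed write along it
    dim = dim + src.ndim if dim < 0 else dim
    out = src.copy()
    np.put_along_axis(out, index, updates, axis=dim)
    return out

x = np.zeros((2, 3))
idx = np.array([[0], [2]])
y = scatter(x, -1, idx, np.array([[5.0], [7.0]]))  # dim=-1 wraps to dim=1
```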

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand (#7702)

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand

* lock gil in vm Callback thread

* more comments for VirtualMachineEngine::Callback()

* the Env is never destroyed.

* export Env into python

* more unittests

* wait shared_ptr.use_count() == 0

* export unittest.TestCase in framework/unittest.py

* SwitchToShuttingDownPhase

* optional is_normal_exit

* VirtualMachine::CloseVMThreads

* Delete env_api.h

env_api.h is deleted by master

* reshape_only_one_dim_infered

* address pr comments

* fix a ref-cnt bug in TryRunBarrierInstruction.

* rollback flow.env.all_device_placement

* no distributed running test_shutting_down.py

* auto format by CI

* expand lifetime of module oneflow in test_shutting_down.py

* refine del depend on of

* capture oneflow._oneflow_internal.eager when calling sync in __del__

* add try in flaky test

Co-authored-by: Luyang <flowingsun007@163.com>
Co-authored-by: chengtbf <472491134@qq.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: Xiaoyu Xu <xiaoyulink@gmail.com>

* Fix one hot scalar tensor bug (#7975)

* fix reduce_sum scalar check bug

* fix one_hot scalar tensor bug

* fix clang tidy error

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* support ctor np array from of tensor (#7970)

* support ctor np array from of tensor

* add test case constructing np array from tensor

* refine

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* add_manual_seed_all_api (#7957)

* add_manual_seed_all_api

* Update conf.py

* refine

* add test case

* auto format by CI

* Update random_generator.cpp

* auto format by CI

Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* one_embedding add doc string (#7902)

* add doc string

* add example

* add

* fix doc

* refine

* address review

* mb to MB

* add make_table_option

* option to options

* refine

* add forward

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Support numpy scalar parameters (#7935)

* feat(functional): support numpy scalar parameters

* rename inferface

* feat(*): TensorIndex support numpy scalar

* feat(TensorIndex): support advance indexing

* add unittest and int32 support for branch feat-param_support_np_scalar (#7939)

* add unittest

* refactor unittest

* add todo for int16 advanced indexing

* add int32 supporting for advance indexing

* auto format by CI

Co-authored-by: Wang Yi <53533850+marigoold@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
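
Accepting NumPy scalars in a functional API usually reduces to coercing `np.generic` instances to plain Python scalars at the argument-parsing boundary. A sketch of that coercion (the helper name is an assumption for illustration):

```python
import numpy as np

def as_python_scalar(x):
    """Coerce numpy scalar types (np.int32, np.float64, ...) to plain
    Python ints/floats; pass anything else through unchanged."""
    if isinstance(x, np.generic):
        return x.item()
    return x
```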

* fix tensor_scatter_nd_update (#7953)

* fix tensor_scatter_nd_update

* auto backward

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix one_embedding adam (#7974)

* fix one_embedding adam

* fix tidy

* fix normal

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* speed test with score (#7990)

Signed-off-by: daquexian <daquexian566@gmail.com>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Feat/graph del by ref (#7857)

* remove IsMultiClient() and single client logic

Signed-off-by: daquexian <daquexian566@gmail.com>

* rename eager.multi_client to eager

Signed-off-by: daquexian <daquexian566@gmail.com>

* auto format by CI

* add py ref

* refine new session

* clean code

* make scope api inner use

* use session with ref cnt

* run barrier callback in BarrierPhyInstrOperand::~BarrierPhyInstrOperand

* test pass

* lock gil in vm Callback thread

* more comments for VirtualMachineEngine::Callback()

* merge

* merge rm single client

* rm initenv

* merge and fix master

* refactor env c api

* add debug code

* fix and serving test pass

* test passed

* rm useless

* rm useless code

* format

* rm useless include

* rm sync in py

* the Env is never destroyed.

* export Env into python

* more unittests

* fix and pass tests

* revert virtual_machine.cpp

* revert core/vm

* remove outdated python class oneflow.unittest.TestCase

* graph test passed

* wait shared_ptr.use_count() == 0

* export unittest.TestCase in framework/unittest.py

* SwitchToShuttingDownPhase

* optional is_normal_exit

* VirtualMachine::CloseVMThreads

* Delete env_api.h

env_api.h is deleted by master

* address pr comments

* rm is env init

* Clear empty thread when graph destroy (#7633)

* Revert "Clear empty thread when graph destroy (#7633)" (#7860)

This reverts commit 3e8585e5fa20b97229d6b0be46a7ff814dc8cd83.

* fix a ref-cnt bug in TryRunBarrierInstruction.

* rm env_api

* fix clang-tidy error

* fix clang-tidy in env_imp

* refine env api

* format

* refine graph del and sync at shuttingdown

* fix typo

* add comment

* rm useless

* rm useless

Co-authored-by: daquexian <daquexian566@gmail.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
Co-authored-by: lixinqi <lixinqi0703106@163.com>
Co-authored-by: Li Xinqi <lixinqi2010@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Luyang <flowingsun007@163.com>
Co-authored-by: cheng cheng <472491134@qq.com>

* [PersistentTable] Fix num blocks (#7986)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* Add auto benchmark for flowvision (#7806)

* update yml

* update workflow

* add resnet50

* [PersistentTable] Async write (#7946)

* [PersistentTable] Async write

* fix

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* save log in separate dir by default (#7825)

Signed-off-by: daquexian <daquexian566@gmail.com>

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix index select op in graph

* add exe graph physical shape check msg

* improve the debug inform…
Showing 2,051 changed files with 75,638 additions and 44,201 deletions.
2 changes: 1 addition & 1 deletion .github/actions/whl/action.yml
@@ -7,7 +7,7 @@ inputs:
default: "10.2"
python_version:
description: "python_version"
default: "3.6"
default: "3.8"
extra_flags:
description: "flags like --xla"
default: ""
3 changes: 1 addition & 2 deletions .github/workflows/canary.yml
@@ -55,7 +55,7 @@ jobs:
- name: Checkout Oneflow-Inc/oneflow
if: ${{ github.event.inputs.oneflow-ref == '' }}
uses: actions/checkout@v2
- uses: Oneflow-Inc/get-oneflow@single-matrix-for-efficiency
- uses: Oneflow-Inc/get-oneflow@support-iree-ci
name: Build manylinux
id: build-cuda
with:
@@ -73,7 +73,6 @@
retry-failed-build: true
clean-ccache: true
python-versions: |
3.6
3.7
3.8
- name: Upload wheelhouse
2 changes: 1 addition & 1 deletion .github/workflows/on_merge.yml
@@ -15,6 +15,6 @@ jobs:
if: github.event.pull_request.merged == true
runs-on: ubuntu-latest
steps:
- uses: Oneflow-Inc/get-oneflow/update-benchmark-history@single-matrix-for-efficiency
- uses: Oneflow-Inc/get-oneflow/update-benchmark-history@support-iree-ci
name: Update benchmark history
timeout-minutes: 10
12 changes: 5 additions & 7 deletions .github/workflows/release.yml
@@ -33,7 +33,7 @@ jobs:
with:
ref: ${{ github.event.pull_request.head.sha }}
repository: ${{github.event.pull_request.head.repo.full_name}}
- - uses: Oneflow-Inc/get-oneflow/cache-complete/matrix/build@single-matrix-for-efficiency
+ - uses: Oneflow-Inc/get-oneflow/cache-complete/matrix/build@support-iree-ci
name: find cache
id: find-cache
timeout-minutes: 5
Expand All @@ -45,7 +45,7 @@ jobs:
release
oneflow-src: ${{ env.ONEFLOW_SRC }}
entries: |
- cu115
+ cu116
cu112
cu102
cpu
@@ -71,10 +71,10 @@ jobs:
- name: Install dependencies
run: |
python3 -m pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
- python3 -m pip install -U pip setuptools wheel --user
+ python3 -m pip install -U setuptools wheel --user
python3 -m pip install oss2 --user
- uses: actions/checkout@v2
- - uses: Oneflow-Inc/get-oneflow@single-matrix-for-efficiency
+ - uses: Oneflow-Inc/get-oneflow@support-iree-ci
name: Build ${{ matrix.entry }}
if: ${{ matrix.entry !='cpu' }}
with:
@@ -93,12 +93,11 @@ jobs:
clean-ccache: true
nightly: ${{ github.event_name == 'schedule' }}
python-versions: |
- 3.6
3.7
3.8
3.9
3.10
- - uses: Oneflow-Inc/get-oneflow@single-matrix-for-efficiency
+ - uses: Oneflow-Inc/get-oneflow@support-iree-ci
name: Build ${{ matrix.entry }}
if: ${{ matrix.entry =='cpu' }}
with:
@@ -117,7 +116,6 @@ jobs:
clean-ccache: false
nightly: ${{ github.event_name == 'schedule' || github.ref == 'refs/heads/master'}}
python-versions: |
- 3.6
3.7
3.8
3.9
4 changes: 2 additions & 2 deletions .github/workflows/simple.yml
@@ -245,7 +245,7 @@ jobs:
repository: Oneflow-Inc/conda-env
ref: 30a7f00eb48ee9009d85a848e720823e5054c66b
path: conda-env
- - uses: Oneflow-Inc/get-oneflow@single-matrix-for-efficiency
+ - uses: Oneflow-Inc/get-oneflow@support-iree-ci
name: Build with gcc7
if: ${{ matrix.build-type == 'gcc7'}}
with:
@@ -254,7 +254,7 @@ jobs:
oneflow-build-env: conda
conda-env-file: conda-env/dev/gcc7/environment-v2.yml
conda-env-name: oneflow-dev-gcc7-v2
- - uses: Oneflow-Inc/get-oneflow@single-matrix-for-efficiency
+ - uses: Oneflow-Inc/get-oneflow@support-iree-ci
name: Build with clang10
if: ${{ matrix.build-type == 'clang10'}}
with:

0 comments on commit 26a143e