Add Prompt Depth Anything Model #35401

Open · wants to merge 43 commits into base: main

Commits (43)
24151d8
add prompt depth anything model by modular transformer
haotongl Dec 23, 2024
7e6dcaa
add prompt depth anything docs and imports
haotongl Dec 23, 2024
dfa7d67
update code style according transformers doc
haotongl Dec 23, 2024
8509440
update code style: import order issue is fixed by custom_init_isort
haotongl Dec 23, 2024
2fa72ef
fix depth shape from B,1,H,W to B,H,W which is as the same as Depth A…
haotongl Dec 23, 2024
d13a55f
move prompt depth anything to vision models in _toctree.yml
haotongl Dec 24, 2024
6cd1bbf
update backbone test; there is no need for resnet18 backbone test
haotongl Dec 24, 2024
76299f4
update init file & pass RUN_SLOW tests
haotongl Dec 24, 2024
2315dd1
update len(prompt_depth) to prompt_depth.shape[0]
haotongl Dec 25, 2024
c423e91
fix torch_int/model_doc
haotongl Dec 25, 2024
739c07f
fix typo
haotongl Dec 25, 2024
5c046e8
update PromptDepthAnythingImageProcessor
haotongl Dec 25, 2024
f3a8aa4
fix typo
haotongl Dec 25, 2024
c2647ca
fix typo for prompt depth anything doc
haotongl Dec 26, 2024
ea67b90
update promptda overview image link of huggingface repo
haotongl Jan 2, 2025
b2379d6
fix some typos in promptda doc
haotongl Jan 6, 2025
b9a44fb
Update image processing to include pad_image, prompt depth position, …
haotongl Jan 7, 2025
dfee43f
add copy disclaimer for prompt depth anything image processing
haotongl Jan 7, 2025
db9f301
fix some format typos in image processing and conversion scripts
haotongl Jan 7, 2025
8d0a435
fix nn.ReLU(False) to nn.ReLU()
haotongl Jan 7, 2025
89956c4
rename residual layer as it's a sequential layer
haotongl Jan 7, 2025
c713a5e
move size compute to a separate line/variable for easier debug in mod…
haotongl Jan 7, 2025
777c367
fix modular format for prompt depth anything
haotongl Jan 7, 2025
cc8f4ac
update modular prompt depth anything
haotongl Jan 7, 2025
0848054
fix scale to meter and some internal funcs warp
haotongl Jan 16, 2025
25e1144
fix code style in image_processing_prompt_depth_anything.py
haotongl Jan 16, 2025
3c8f6c0
fix issues in image_processing_prompt_depth_anything.py
haotongl Jan 16, 2025
cf24f48
fix issues in image_processing_prompt_depth_anything.py
haotongl Jan 16, 2025
fcd5107
fix issues in prompt depth anything
haotongl Jan 16, 2025
d9f6ecf
update converting script similar to mllamma
haotongl Jan 16, 2025
357cc12
update testing for modeling prompt depth anything
haotongl Jan 16, 2025
f79f912
update testing for image_processing_prompt_depth_anything
haotongl Jan 16, 2025
2aa3363
fix assertion in image_processing_prompt_depth_anything
haotongl Jan 16, 2025
17bd168
Update src/transformers/models/prompt_depth_anything/modular_prompt_d…
haotongl Jan 22, 2025
1d7a6d0
Update src/transformers/models/prompt_depth_anything/modular_prompt_d…
haotongl Jan 22, 2025
ab381ca
Update src/transformers/models/prompt_depth_anything/image_processing…
haotongl Jan 22, 2025
a509ad1
Update src/transformers/models/prompt_depth_anything/image_processing…
haotongl Jan 22, 2025
188b88d
Update src/transformers/models/prompt_depth_anything/image_processing…
haotongl Jan 22, 2025
9d48d97
Update docs/source/en/model_doc/prompt_depth_anything.md
haotongl Jan 22, 2025
c033e6c
Update docs/source/en/model_doc/prompt_depth_anything.md
haotongl Jan 22, 2025
c2693f8
update some testing
haotongl Jan 26, 2025
d957f56
fix testing
haotongl Jan 26, 2025
b34e35a
fix
haotongl Jan 26, 2025
2 changes: 2 additions & 0 deletions docs/source/en/_toctree.yml
@@ -689,6 +689,8 @@
title: NAT
- local: model_doc/poolformer
title: PoolFormer
- local: model_doc/prompt_depth_anything
title: Prompt Depth Anything
- local: model_doc/pvt
title: Pyramid Vision Transformer (PVT)
- local: model_doc/pvt_v2
1 change: 1 addition & 0 deletions docs/source/en/index.md
@@ -275,6 +275,7 @@ Flax), PyTorch, and/or TensorFlow.
| [PLBart](model_doc/plbart) | ✅ | ❌ | ❌ |
| [PoolFormer](model_doc/poolformer) | ✅ | ❌ | ❌ |
| [Pop2Piano](model_doc/pop2piano) | ✅ | ❌ | ❌ |
| [PromptDepthAnything](model_doc/prompt_depth_anything) | ✅ | ❌ | ❌ |
| [ProphetNet](model_doc/prophetnet) | ✅ | ❌ | ❌ |
| [PVT](model_doc/pvt) | ✅ | ❌ | ❌ |
| [PVTv2](model_doc/pvt_v2) | ✅ | ❌ | ❌ |
96 changes: 96 additions & 0 deletions docs/source/en/model_doc/prompt_depth_anything.md
@@ -0,0 +1,96 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Prompt Depth Anything

## Overview

The Prompt Depth Anything model was introduced in [Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation](https://arxiv.org/abs/2412.14015) by Haotong Lin, Sida Peng, Jingxiao Chen, Songyou Peng, Jiaming Sun, Minghuan Liu, Hujun Bao, Jiashi Feng, Xiaowei Zhou, Bingyi Kang.


The abstract from the paper is as follows:

*Prompts play a critical role in unleashing the power of language and vision foundation models for specific tasks. For the first time, we introduce prompting into depth foundation models, creating a new paradigm for metric depth estimation termed Prompt Depth Anything. Specifically, we use a low-cost LiDAR as the prompt to guide the Depth Anything model for accurate metric depth output, achieving up to 4K resolution. Our approach centers on a concise prompt fusion design that integrates the LiDAR at multiple scales within the depth decoder. To address training challenges posed by limited datasets containing both LiDAR depth and precise GT depth, we propose a scalable data pipeline that includes synthetic data LiDAR simulation and real data pseudo GT depth generation. Our approach sets new state-of-the-arts on the ARKitScenes and ScanNet++ datasets and benefits downstream applications, including 3D reconstruction and generalized robotic grasping.*

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/prompt_depth_anything_architecture.jpg"
alt="drawing" width="600"/>

<small> Prompt Depth Anything overview. Taken from the <a href="https://arxiv.org/pdf/2412.14015">original paper</a>.</small>
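The multi-scale prompt fusion described above can be sketched in a few lines of plain Python. This is an illustrative toy, not the PR's implementation: the low-resolution LiDAR prompt depth is resized to each decoder scale with nearest-neighbor sampling and blended into the features, with a fixed `weight` standing in for the model's learned convolutional projections.

```python
def resize_nearest(grid, out_h, out_w):
    # nearest-neighbor resize of a 2D list-of-lists (toy stand-in for
    # the bilinear interpolation a real decoder would use)
    in_h, in_w = len(grid), len(grid[0])
    return [
        [grid[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def fuse_prompt(features_per_scale, prompt_depth, weight=0.5):
    # add the resized prompt depth into the decoder features at every scale;
    # the real model uses learned projections instead of a scalar `weight`
    fused = []
    for feat in features_per_scale:
        h, w = len(feat), len(feat[0])
        depth = resize_nearest(prompt_depth, h, w)
        fused.append(
            [[feat[r][c] + weight * depth[r][c] for c in range(w)] for r in range(h)]
        )
    return fused
```

The point of the sketch is the shape handling: one prompt depth map is reused at every scale, so the fusion cost stays small relative to the decoder itself.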

## Usage example

The Transformers library allows you to use the model with just a few lines of code:

```python
>>> import torch
>>> import requests
>>> import numpy as np

>>> from PIL import Image
>>> from transformers import AutoImageProcessor, AutoModelForDepthEstimation

>>> url = "https://github.com/DepthAnything/PromptDA/blob/main/assets/example_images/image.jpg?raw=true"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image_processor = AutoImageProcessor.from_pretrained("depth-anything/prompt-depth-anything-vits-hf")
>>> model = AutoModelForDepthEstimation.from_pretrained("depth-anything/prompt-depth-anything-vits-hf")

>>> prompt_depth_url = "https://github.com/DepthAnything/PromptDA/blob/main/assets/example_images/arkit_depth.png?raw=true"
>>> prompt_depth = Image.open(requests.get(prompt_depth_url, stream=True).raw)
>>> # prompt_depth may be None; the model then outputs monocular relative depth.

>>> # prepare image for the model
>>> inputs = image_processor(images=image, return_tensors="pt", prompt_depth=prompt_depth)

>>> with torch.no_grad():
... outputs = model(**inputs)

>>> # interpolate to original size
>>> post_processed_output = image_processor.post_process_depth_estimation(
... outputs,
... target_sizes=[(image.height, image.width)],
... )

>>> # visualize the prediction
>>> predicted_depth = post_processed_output[0]["predicted_depth"]
>>> depth = predicted_depth * 1000  # convert meters to millimeters
>>> depth = depth.detach().cpu().numpy()
>>> depth = Image.fromarray(depth.astype("uint16"))  # 16-bit depth map in mm
```
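The final lines scale meters to millimeters before casting to `uint16`, and a plain cast wraps around for depths beyond ~65.5 m. A minimal stdlib sketch of the conversion with saturation added (the clipping is our addition, not part of the snippet above):

```python
def meters_to_uint16_mm(depths_m):
    # convert per-pixel depths in meters to millimeters, clipped to the
    # uint16 range so far-away points saturate instead of wrapping around
    return [min(max(int(round(d * 1000)), 0), 65535) for d in depths_m]
```

For array-sized outputs you would do the same with `numpy.clip` before the `astype("uint16")` call.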

## Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with Prompt Depth Anything.

- [Prompt Depth Anything Demo](https://huggingface.co/spaces/depth-anything/PromptDA)
- [Prompt Depth Anything Interactive Results](https://promptda.github.io/interactive.html)

If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

## PromptDepthAnythingConfig

[[autodoc]] PromptDepthAnythingConfig

## PromptDepthAnythingForDepthEstimation

[[autodoc]] PromptDepthAnythingForDepthEstimation
- forward

## PromptDepthAnythingImageProcessor

[[autodoc]] PromptDepthAnythingImageProcessor
- preprocess
- post_process_depth_estimation
14 changes: 14 additions & 0 deletions src/transformers/__init__.py
@@ -689,6 +689,7 @@
"models.plbart": ["PLBartConfig"],
"models.poolformer": ["PoolFormerConfig"],
"models.pop2piano": ["Pop2PianoConfig"],
"models.prompt_depth_anything": ["PromptDepthAnythingConfig"],
"models.prophetnet": [
"ProphetNetConfig",
"ProphetNetTokenizer",
@@ -1246,6 +1247,7 @@
_import_structure["models.pix2struct"].extend(["Pix2StructImageProcessor"])
_import_structure["models.pixtral"].append("PixtralImageProcessor")
_import_structure["models.poolformer"].extend(["PoolFormerFeatureExtractor", "PoolFormerImageProcessor"])
_import_structure["models.prompt_depth_anything"].extend(["PromptDepthAnythingImageProcessor"])
_import_structure["models.pvt"].extend(["PvtImageProcessor"])
_import_structure["models.qwen2_vl"].extend(["Qwen2VLImageProcessor"])
_import_structure["models.rt_detr"].extend(["RTDetrImageProcessor"])
@@ -3181,6 +3183,12 @@
"Pop2PianoPreTrainedModel",
]
)
_import_structure["models.prompt_depth_anything"].extend(
[
"PromptDepthAnythingForDepthEstimation",
"PromptDepthAnythingPreTrainedModel",
]
)
_import_structure["models.prophetnet"].extend(
[
"ProphetNetDecoder",
@@ -5682,6 +5690,7 @@
from .models.pop2piano import (
Pop2PianoConfig,
)
from .models.prompt_depth_anything import PromptDepthAnythingConfig
from .models.prophetnet import (
ProphetNetConfig,
ProphetNetTokenizer,
@@ -6260,6 +6269,7 @@
PoolFormerFeatureExtractor,
PoolFormerImageProcessor,
)
from .models.prompt_depth_anything import PromptDepthAnythingImageProcessor
from .models.pvt import PvtImageProcessor
from .models.qwen2_vl import Qwen2VLImageProcessor
from .models.rt_detr import RTDetrImageProcessor
@@ -7819,6 +7829,10 @@
Pop2PianoForConditionalGeneration,
Pop2PianoPreTrainedModel,
)
from .models.prompt_depth_anything import (
PromptDepthAnythingForDepthEstimation,
PromptDepthAnythingPreTrainedModel,
)
from .models.prophetnet import (
ProphetNetDecoder,
ProphetNetEncoder,
1 change: 1 addition & 0 deletions src/transformers/models/__init__.py
@@ -207,6 +207,7 @@
plbart,
poolformer,
pop2piano,
prompt_depth_anything,
prophetnet,
pvt,
pvt_v2,
2 changes: 2 additions & 0 deletions src/transformers/models/auto/configuration_auto.py
@@ -227,6 +227,7 @@
("plbart", "PLBartConfig"),
("poolformer", "PoolFormerConfig"),
("pop2piano", "Pop2PianoConfig"),
("prompt_depth_anything", "PromptDepthAnythingConfig"),
("prophetnet", "ProphetNetConfig"),
("pvt", "PvtConfig"),
("pvt_v2", "PvtV2Config"),
@@ -554,6 +555,7 @@
("plbart", "PLBart"),
("poolformer", "PoolFormer"),
("pop2piano", "Pop2Piano"),
("prompt_depth_anything", "PromptDepthAnything"),
("prophetnet", "ProphetNet"),
("pvt", "PVT"),
("pvt_v2", "PVTv2"),
1 change: 1 addition & 0 deletions src/transformers/models/auto/image_processing_auto.py
@@ -123,6 +123,7 @@
("pix2struct", ("Pix2StructImageProcessor",)),
("pixtral", ("PixtralImageProcessor", "PixtralImageProcessorFast")),
("poolformer", ("PoolFormerImageProcessor",)),
("prompt_depth_anything", ("PromptDepthAnythingImageProcessor",)),
("pvt", ("PvtImageProcessor",)),
("pvt_v2", ("PvtImageProcessor",)),
("qwen2_vl", ("Qwen2VLImageProcessor",)),
1 change: 1 addition & 0 deletions src/transformers/models/auto/modeling_auto.py
@@ -893,6 +893,7 @@
("depth_anything", "DepthAnythingForDepthEstimation"),
("dpt", "DPTForDepthEstimation"),
("glpn", "GLPNForDepthEstimation"),
("prompt_depth_anything", "PromptDepthAnythingForDepthEstimation"),
("zoedepth", "ZoeDepthForDepthEstimation"),
]
)
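The new `("prompt_depth_anything", "PromptDepthAnythingForDepthEstimation")` tuple registers the model under its `model_type` string. Conceptually the auto classes dispatch on that key, which can be sketched as follows (a simplified stand-in, not the real lazily-resolving mapping in transformers):

```python
# toy name-based registry mirroring the depth-estimation auto-mapping above
DEPTH_ESTIMATION_MAPPING = {
    "depth_anything": "DepthAnythingForDepthEstimation",
    "dpt": "DPTForDepthEstimation",
    "glpn": "GLPNForDepthEstimation",
    "prompt_depth_anything": "PromptDepthAnythingForDepthEstimation",
    "zoedepth": "ZoeDepthForDepthEstimation",
}


def resolve_depth_estimator(model_type):
    # mimic AutoModelForDepthEstimation's config-driven class lookup
    if model_type not in DEPTH_ESTIMATION_MAPPING:
        raise ValueError(f"Unrecognized model type: {model_type!r}")
    return DEPTH_ESTIMATION_MAPPING[model_type]
```

This is why the PR only needs to add one tuple here: `AutoModelForDepthEstimation.from_pretrained` reads `model_type` from the checkpoint's config and looks it up in this mapping.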
31 changes: 31 additions & 0 deletions src/transformers/models/prompt_depth_anything/__init__.py
@@ -0,0 +1,31 @@
# Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import _LazyModule
from ...utils.import_utils import define_import_structure


if TYPE_CHECKING:
from .configuration_prompt_depth_anything import PromptDepthAnythingConfig
from .image_processing_prompt_depth_anything import PromptDepthAnythingImageProcessor
from .modeling_prompt_depth_anything import (
PromptDepthAnythingForDepthEstimation,
PromptDepthAnythingPreTrainedModel,
)
else:
import sys

_file = globals()["__file__"]
sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)
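The `_LazyModule` indirection above defers heavy imports until attribute access, while the `TYPE_CHECKING` branch keeps static analyzers happy. A self-contained stdlib sketch of the same idea (`LazyModule` here is our simplified stand-in, not transformers' actual `_LazyModule`):

```python
import importlib
import types


class LazyModule(types.ModuleType):
    # map attribute names to module paths and import them on first access,
    # caching the result so later lookups bypass __getattr__ entirely
    def __init__(self, name, attr_to_module):
        super().__init__(name)
        self._attr_to_module = dict(attr_to_module)

    def __getattr__(self, attr):
        # only called when normal attribute lookup fails
        if attr in self._attr_to_module:
            module = importlib.import_module(self._attr_to_module[attr])
            setattr(self, attr, module)  # cache on the instance
            return module
        raise AttributeError(f"module {self.__name__!r} has no attribute {attr!r}")
```

Usage: `lazy = LazyModule("demo", {"js": "json"})` imports nothing up front; `lazy.js.dumps(...)` triggers the `json` import on first use.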