
Add Prompt Depth Anything Model #35401

Open · haotongl wants to merge 24 commits into main

Conversation

haotongl (Author)

What does this PR do?

This PR adds the Prompt Depth Anything Model. Prompt Depth Anything builds upon Depth Anything V2 and incorporates metric prompt depth to enable accurate and high-resolution metric depth estimation.

The implementation leverages Modular Transformers. The main file can be found here.
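
For reviewers' context, a minimal usage sketch (the checkpoint name is taken from the tests in this PR; loading goes through the Auto classes once the model is registered, and the exact output attribute names may differ):

```python
import numpy as np
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForDepthEstimation

checkpoint = "depth-anything/prompt-depth-anything-vits-hf"
image_processor = AutoImageProcessor.from_pretrained(checkpoint)
model = AutoModelForDepthEstimation.from_pretrained(checkpoint)

image = Image.open("example.jpg")
# Placeholder for a real low-resolution LiDAR depth map of shape (height, width);
# the processor scales it to meters via its prompt_scale_to_meter attribute.
prompt_depth = np.random.uniform(0.0, 3.0, size=(192, 256)).astype(np.float32)

inputs = image_processor(images=image, prompt_depth=prompt_depth, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
predicted_depth = outputs.predicted_depth  # metric depth, aligned with the prompt
```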

Before submitting

haotongl (Author) commented Dec 24, 2024

@NielsRogge @qubvel @pcuenca Could you help review this PR when you have some time? Thanks so much in advance! Let me know if you have any questions or suggestions. 😊

qubvel (Member) commented Dec 24, 2024

Hi @haotongl! Thanks for working on the model integration into transformers 🤗 I'm on holiday until Jan 3rd, and I'll do a review after that if it's still needed.

haotongl requested review from NielsRogge and xenova · January 2, 2025 16:57
haotongl (Author) commented Jan 3, 2025

Hi @xenova, @NielsRogge! All suggestions have been addressed. Could you take another look and share any further suggestions, or go ahead and merge this PR? Thanks!

NielsRogge requested a review from qubvel · January 6, 2025 08:24
haotongl requested a review from NielsRogge · January 6, 2025 09:19
qubvel (Member) left a comment


Thanks for working on the model addition 🤗 Great work, both on the model itself and on porting it to transformers! Please see the comments below.

Comment on lines +434 to +446
```python
if prompt_depth is not None:
    # prompt_depth is a list of images with shape (height, width);
    # we need to convert it to a list of images with shape (1, height, width)
    prompt_depths = make_list_of_images(prompt_depth)
    prompt_depths = [to_numpy_array(depth) for depth in prompt_depths]
    prompt_depths = [depth * self.prompt_scale_to_meter for depth in prompt_depths]
    prompt_depths = [depth[..., None].astype(np.float32) for depth in prompt_depths]
    prompt_depths = [
        to_channel_dimension_format(depth, data_format, input_channel_dim=input_data_format)
        for depth in prompt_depths
    ]
    data["prompt_depth"] = prompt_depths
return BatchFeature(data=data, tensor_type=return_tensors)
```
qubvel (Member)

Should we resize/pad the prompt depth?

qubvel (Member) commented Jan 6, 2025

Also from the paper:

> Depth normalization. The irregular range of input depth data can hinder network convergence. To address this, we normalize the LiDAR data using linear scaling to the range [0, 1], based on its minimum and maximum values. The network output is also normalized with the same scaling factor from LiDAR data, ensuring consistent scales and facilitating easier convergence during training.

Should this normalization be added to the preprocessing? We would also need the offset/scale to invert the transformation on the predicted depth in the post-processing method.
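
For illustration, a minimal sketch of what the paper's normalization plus its inverse could look like (function names here are hypothetical, not part of this PR):

```python
import numpy as np

def normalize_prompt_depth(depth: np.ndarray):
    # Linearly scale the LiDAR depth to [0, 1] based on its min/max,
    # as described in the paper; assumes a non-constant depth map.
    offset = depth.min()
    scale = depth.max() - offset
    return (depth - offset) / scale, (offset, scale)

def denormalize_depth(predicted: np.ndarray, offset: float, scale: float):
    # Backward transformation for post-processing the network output.
    return predicted * scale + offset
```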

haotongl (Author) commented Jan 7, 2025

@qubvel The prompt depth should keep its original size so that its information is preserved and the model can make full use of it. It should also not be normalized, since the original metric depth range is essential at runtime.

qubvel (Member)

OK, for the current checkpoint the size is {'height': 756, 'width': 756}, which is a multiple of 14, so the image is not padded. However, if one runs the model on an image size that is not a multiple of 14, the image will be padded while the depth map will not. Consider the following:

image size: 512x512
padded image size: 518x518 (6 empty pixels on the bottom and right borders)
prompt_depth size: 256x256 (no empty pixels on the borders to align with the padded image)

This is probably not a big deal, since we merge features rather than the image and prompt depth directly, but it might cause a slight shift of the image features relative to the prompt depth features.
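
To make the arithmetic above concrete, a standalone sketch (not code from this PR):

```python
import math

def pad_to_multiple(side: int, multiple: int = 14) -> int:
    # Pad a spatial dimension up to the next multiple of the patch size.
    return math.ceil(side / multiple) * multiple

image_side = 512
padded_side = pad_to_multiple(image_side)   # 518
padding = padded_side - image_side          # 6 empty pixels on the bottom/right
prompt_depth_side = 256                     # stays unpadded -> slight feature misalignment
```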

haotongl requested a review from qubvel · January 7, 2025 16:29
haotongl (Author) commented Jan 7, 2025

@qubvel @NielsRogge All suggestions have been addressed. Could you please take another look and provide any further suggestions, or go ahead and merge this PR? Thanks!

qubvel (Member) left a comment


Thanks for the update, great work! We are getting close to merging it 🤗 Please see the comments below.

Comment on lines +286 to +298
```python
prompt_depth: ImageInput = None,
do_resize: bool = None,
size: int = None,
keep_aspect_ratio: bool = None,
ensure_multiple_of: int = None,
resample: PILImageResampling = None,
do_rescale: bool = None,
rescale_factor: float = None,
do_normalize: bool = None,
image_mean: Optional[Union[float, List[float]]] = None,
image_std: Optional[Union[float, List[float]]] = None,
do_pad: bool = None,
size_divisor: int = None,
```
qubvel (Member)

Please add `Optional[...]` for args with a `None` default.
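
e.g., for the first few:

```diff
-    prompt_depth: ImageInput = None,
-    do_resize: bool = None,
-    size: int = None,
+    prompt_depth: Optional[ImageInput] = None,
+    do_resize: Optional[bool] = None,
+    size: Optional[int] = None,
```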

Comment on lines +67 to +76
```python
def constrain_to_multiple_of(val, multiple, min_val=0, max_val=None):
    x = round(val / multiple) * multiple

    if max_val is not None and x > max_val:
        x = math.floor(val / multiple) * multiple

    if x < min_val:
        x = math.ceil(val / multiple) * multiple

    return x
```
qubvel (Member)

Let's move this out of function scope and make it private as `_constrain_to_multiple_of`.
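
As a quick sanity check of the behavior, assuming the rename to `_constrain_to_multiple_of` (the 512 → 518 case matches the padding discussion above):

```python
assert _constrain_to_multiple_of(512, 14) == 518  # rounds to the nearest multiple of 14
assert _constrain_to_multiple_of(756, 14) == 756  # already a multiple of 14
assert _constrain_to_multiple_of(512, 14, max_val=516) == 504  # 518 > max_val, so floor: 36 * 14
```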

```python
logger = logging.get_logger(__name__)


def get_resize_output_image_size(
```
qubvel (Member) commented Jan 9, 2025

Please add a short docstring and make it private:

Suggested change:

```diff
-def get_resize_output_image_size(
+def _get_resize_output_image_size(
```


```python
def pad_image(
    self,
    image: np.array,
```
qubvel (Member)

Suggested change:

```diff
-    image: np.array,
+    image: np.ndarray,
```

```python
    return_tensors: Optional[Union[str, TensorType]] = None,
    data_format: ChannelDimension = ChannelDimension.FIRST,
    input_data_format: Optional[Union[str, ChannelDimension]] = None,
) -> PIL.Image.Image:
```
qubvel (Member)

Suggested change:

```diff
-) -> PIL.Image.Image:
+) -> BatchFeature:
```

```python
inputs = image_processor(images=image, return_tensors="pt").to(torch_device)

with torch.no_grad():
    outputs = model(pixel_values=inputs.pixel_values, prompt_depth=prompt_depth)
```
qubvel (Member)

Use `outputs = model(**inputs)` here.

```python
@require_vision
@slow
class PromptDepthAnythingModelIntegrationTest(unittest.TestCase):
    def test_inference(self):
```
qubvel (Member)

Let's add tests for both cases: with and without prompt depth.
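
For instance, something like this (a sketch only; `model`, `image_processor`, `image`, and the expected values would mirror `test_inference` above):

```python
def test_inference_without_prompt_depth(self):
    # Without a prompt the model should still run, falling back to
    # plain (relative) depth estimation as in Depth Anything V2.
    inputs = image_processor(images=image, return_tensors="pt").to(torch_device)
    with torch.no_grad():
        outputs = model(**inputs)
    self.assertEqual(outputs.predicted_depth.shape, expected_shape)
```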

Comment on lines +287 to +291
```python
exported_program = torch.export.export(
    model,
    args=(inputs["pixel_values"],),
    strict=strict,
)
```
qubvel (Member)

This should also work for `prompt_depth` if we move the `invalid_mask.any()` check to the processor.
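
i.e., once that data-dependent check lives in the processor, something like this should export too (a sketch, assuming the processor puts `prompt_depth` into `inputs`):

```python
exported_program = torch.export.export(
    model,
    args=(inputs["pixel_values"],),
    kwargs={"prompt_depth": inputs["prompt_depth"]},
    strict=strict,
)
```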

```python
    .to(torch_device)
    .eval()
)
image_processor = DPTImageProcessor.from_pretrained("depth-anything/prompt-depth-anything-vits-hf")
```
qubvel (Member)

Wrong image processor?

qubvel (Member)

We need tests for the image processor as well (in a separate file).
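
For example, following the usual transformers test layout (file, class, and attribute names here are assumptions, not code from this PR):

```python
import unittest

from transformers.testing_utils import require_vision
from transformers.utils import is_vision_available

if is_vision_available():
    from transformers import PromptDepthAnythingImageProcessor


@require_vision
class PromptDepthAnythingImageProcessingTest(unittest.TestCase):
    def test_prompt_scale_to_meter(self):
        # prompt_scale_to_meter converts raw prompt units (e.g. mm) to meters.
        image_processor = PromptDepthAnythingImageProcessor(prompt_scale_to_meter=0.001)
        self.assertEqual(image_processor.prompt_scale_to_meter, 0.001)
```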
