Merge back 2.2 (#4098)

* update for releases 2.2.0rc0
* Fix Classification explain forward issue (#3867)
  Fix bug
* Fix e2e code error (#3871)
* Update test_cli.py
* Update tests/e2e/cli/test_cli.py
  Co-authored-by: Eunwoo Shin <eunwoo.shin@intel.com>
* Update test_cli.py
* Update test_cli.py
---------
Co-authored-by: Eunwoo Shin <eunwoo.shin@intel.com>
* Add documentation about configurable input size (#3870)
* add docs about configurable input size
* update api usecase and fix bug
* Fix zero-shot e2e (#3876)
  Fix
* Fix DeiT for multi-label classification (#3881)
  Remove init_args
* Fix Semi-SL for ViT accuracy drop (#3883)
  Remove init_args
* Update docs for 2.2 (#3884)
  Update docs
* Fix mean and scale for segmentation task (#3885)
  fix mean and scale
* Update MAPI in 2.2 (#3889)
* Bump MAPI
* Update exportable code requirements
* Improve Semi-SL for LiteHRNet (small-medium case) (#3891)
* change drop pixels value
* go safe, change only tested models
* minor
* Improve h-cls for eff models (#3893)
* Update step size for eff v2
* Update effb0 recipe
* Fix maskrcnn swin nncf acc drop (#3900)
  update maskrcnn swint model type to transformer
* Add keypoint detection recipe for single object cases (#3903)
* add rtmpose_tiny for single obj
* add rtmpose_tiny for single obj
* modify test subset name
* fix unit test
* update recipe with reset
* Improve acc drop of efficientnetv2 for h-label cls (#3907)
* Add warmup_iters for effv2
* Update max_epochs
* Fix pretrained weight cached dir for timm (#3909)
* Fix pretrained_weight for timm
* Fix unit-test
* Fix keypoint detection single obj recipe (#3915)
* add rtmpose_tiny for single obj
* modify test subset name
* fix unit test
* property for pck
* Fix cached dir for timm & hugging-face (#3914)
* Fix cached dir
* Pretrained weight download unit-test
* Fix pre-commit
* Fix wrong template id mapping for anomaly (#3916)
* Update script to allow setting otx version using env. variable (#3913)
* Fix Datamodule creation for OV in AutoConfigurator (#3920)
  Fix datamodule for ov
* Update tpp file for 2.2.0 (#3921)
* Fix names for ignored scope [HOT-FIX, 2.2.0] (#3924)
  fix names for ignored scope
* Fix classification rt_info (#3922)
* Restore output_raw_scores for classification
* Add uts
* Fix linter
* Update label info (#3925)
  add label info to init
  Signed-off-by: Ashwin Vaidya <ashwinnitinvaidya@gmail.com>
* Fix binary classification metric task (#3928)
* Fix binary classification
* Add unit-tests
* Improve MaskRCNN SwinT NNCF (#3929)
* ignore heads and disable smooth quant
* add activations_range_estimator_params
* update changelog
* Fix get_item for Chained Tasks in Classification (#3931)
* Fix Task Chain
* Add multi-label case as well
* Add multi-label case as well2
* Add H-label case
* Correct KeyError for h-label cls in label_groups for dm_label_categories using label's id/key (#3932)
  Modify label_groups for dm_label_categories with id/key of label
* Remove datumaro attribute id from tiling, add subset names (#3933)
* remove datumaro attribute id from tiling
* add subset names
* Fix soft predictions for Semantic Segmentation (#3934)
  fix soft preds
* Update STFPM config (#3935)
* Add missing pretrained weights when creating a docker image (#3938)
* Fix pre-trained weight downloader
* Remove if condition for pretrained weight download
* Change default option 'full' to 'base' in otx install (#3937)
* Change option full to base for otx install
* Fix wrong code
* Fix issue
* Fix docs
* Fix auto adapt batch size in Converter (#3939)
* Enable auto adapt batch size into converter
* Fix wrong
* Fix hpo converter (#3940)
* save best hp after hpo
* add test
* Fix tiling XAI out of range (#3943)
  - Fix tile merge XAI out of range
* enable model export (#3952)
  Signed-off-by: Ashwin Vaidya <ashwinnitinvaidya@gmail.com>
* Move templates from OTX1.X to OTX2.X (#3951)
* add otx1.6 templates
* added new models
* delete entrypoints and nncf cfg
* updated some hyperparams
* fix for rtmdet_tiny
* updated converter
* Update classification templates
* Update det, r-det, vpm
* Update template.yaml
* changed warmup value in train.yaml
---------
Co-authored-by: Kang, Harim <harim.kang@intel.com>
Co-authored-by: Kim, Sungchul <sungchul.kim@intel.com>
* Add missing tile recipes and various tile recipe changes (#3942)
* add missing tile recipes
* Fix tiling XAI out of range (#3943)
  - Fix tile merge XAI out of range
* update xai tile merge
* update rtdetr
* update tile recipes
* update rtdetr tile postprocess
* update rtdetr recipes and tile recipes
* update tile recipes
* fix rtdetr unittest
* update recipes
* refactor tile unit test
* address pr reviews
* remove unnecessary files
* update color channel
* fix image channel passing
* include tiling in cli integration test
* remove transform_bbox
---------
Co-authored-by: Vladislav Sovrasov <sovrasov.vlad@gmail.com>
* Support ImageFromBytes (#3948)
* add image_from_bytes
  Signed-off-by: Ashwin Vaidya <ashwinnitinvaidya@gmail.com>
* refactor code
  Signed-off-by: Ashwin Vaidya <ashwinnitinvaidya@gmail.com>
* allow empty anomalous masks
  Signed-off-by: Ashwin Vaidya <ashwinnitinvaidya@gmail.com>
---------
Signed-off-by: Ashwin Vaidya <ashwinnitinvaidya@gmail.com>
* Change categories mapping logic (#3946)
* change pre-filtering logic
* Update src/otx/core/data/pre_filtering.py
  Co-authored-by: Eunwoo Shin <eunwoo.shin@intel.com>
---------
Co-authored-by: Eunwoo Shin <eunwoo.shin@intel.com>
* Update for 2.2.0rc1 (#3956)
* Include Geti arrow dataset subset names (#3962)
* restricted number of output masks by tiling
* add geti subset name
* update num of max pred
* Include full image with anno in case there's no tile in tile dataset (#3964)
* include full image with anno in case there's no tile in dataset
* update test
* Add type checker in converter for callable functions (optimizer, scheduler) (#3968)
  Fix converter callable functions (optimizer, scheduler)
* Update for 2.2.0rc2 (#3969)
  update for 2.2.0rc2
* Fix config converter for tiling (#3973)
  fix config converter for tiling
* Update for 2.2.0rc3 (#3975)
* Change semantic segmentation to consider bbox only annotations. (#3996)
* segmentation consider bbox only annotations
* add unit test
* add unit test
* update fixture
* use name attribute
* revert tox file
* update for 2.2.0rc4
---------
Co-authored-by: Yunchu Lee <yunchu.lee@intel.com>
* Relieve memory usage criteria on batch size 2 during adaptive_bs (#4009)
* release memory usage criteria on batch size 2 during adaptive_bs
* update unit test
* update unit test
* Remove background label from RT Info for segmentation task (#4011)
* remove background from rt_info
* provide another solution
* fix unit test
* Fix num_trials calculation on dataset length less than num_class (#4014)
  Fix balanced sampler
* Fix out_features in HierarchicalCBAMClsHead (#4016)
  Fix out_features
* Fix empty anno (#4010)
* Refactor mask_target_single function to handle unsupported ground truth mask types and provide warnings for missing ground truth masks
* Refactor bbox_overlaps function to handle unsupported ground truth mask types and provide warnings for missing ground truth masks
* Refactor export script to export multiple directories
* Refactor test_bbox_overlaps_2d to handle mismatched batch dimensions of bboxes
* Refactor bbox_overlaps function error exception
* update changelog
---------
Co-authored-by: Harim Kang <harim.kang@intel.com>
* Update for release 2.2.0rc5 (#4015)
* Prevent using too low confidence thresholds in detection (#4018)
  Prevent writing too low confidence thresholds to MAPI configuration
* Update for release 2.2.0rc6 (#4027)
* Update pre-merge workflow (#4032)
* Update HPO interface (#4035)
* update hpo interface
* update unit test
* update CHANGELOG.md
* Enable keypoint detection training through config conversion (#4034)
  enable keypoint det config converter
* Update for release 2.2.0rc7 (#4036)
  update for release 2.2.0rc7
* Fix multilabel_accuracy of MixedHLabelAccuracy (#4042)
* Fix metric for multi-label
* Fix1
* Add CHANGELOG
* Update for release 2.2.0rc8 (#4043)
* Fix wrong indices setting in HLabelInfo (#4044)
* Fix wrong indices setting in label_info
* Add unit-test & update for releases
* Add legacy template LiteHRNet_18 template (#4049)
  added legacy template
* Model templates: rename model_status value 'DISCONTINUED' to 'OBSOLETE' (#4051)
  rename 'DISCONTINUED' to 'OBSOLETE' in model templates
* Enable export of feature vectors for semantic segmentation task (#4055)
* Upgrade MAPI in 2.2 (#4052)
* Update MRCNN model export to include feature vector and saliency map (#4056)
* Fix applying model's hparams when loading model from checkpoint (#4057)
* Update anomaly transforms (#4059)
* Update transforms
  Signed-off-by: Ashwin Vaidya <ashwinnitinvaidya@gmail.com>
* Update transforms
  Signed-off-by: Ashwin Vaidya <ashwinnitinvaidya@gmail.com>
* Update changelog
  Signed-off-by: Ashwin Vaidya <ashwinnitinvaidya@gmail.com>
* Update __init__.py
---------
Signed-off-by: Ashwin Vaidya <ashwinnitinvaidya@gmail.com>
Co-authored-by: Emily Chun <emily.chun@intel.com>
* Bump onnx to 1.17.0 to omit CVE-2024-5187 (#4063)
* Fix incorrect all_groups order configuration in HLabelInfo (#4067)
* Fix all_labels
* Update CHANGELOG
* label_groups change
* Fix wrong model name in converter & template (#4082)
* Fix wrong
* Update CHANGELOG
* RTMDet Inst Seg Explain Mode for 2.2 (#4083)
* Explain mode for RTMDet Inst Seg
* Update changelog
* reformat changelog
* Fix rtdetr recipes (#4079)
* Fix recipes
* Update CHANGELOG
* Enable adaptive_bs with Efficientnet-V2-L model template (#4085)
  Enable adaptive_bs with Efficientnet-V2-L model
* Add Keypoint Detection legacy template (#4094)
  added rtmpose_template
* fix template
* Revert the old workaround for detection confidence threshold (#4096)
  Revert the old workaround
* OTX RC 2.2 version up (#4099)
* Update changelog
* OTX version up
* Fix linter
* fix linter
* Add dummy XAI to RTDETR (export mode) & disable strong aug (#4106)
* Implement warning for unsupported explain mode in DETR model and update transform probabilities to zero in RTDETR recipes
* update changelog
* Update photometric distortion probability in RTDETR recipes
* Fix task chain for Det -> Cls / Seg (#4105)
* fix linter
* return recipe back
* added roi extraction for multi class classification dataset
* fix linter
* add same logic to semantic seg
* added test for OTXDataset
* add clip and raise an error when coordinates are invalid.
* rewrite value error
* minor change to CHANGELOG
* fix linter
* fix diffusion
* fix tiling
* Disable tiling classifier toggle in configurable parameters (#4107)
* Disable tiling classifier toggle in configurable parameters
* Update changelog
* fix RTDETR
* fix test with augs
* switch off the IS for test_augs
* remove FilterAnnotations for RTMdet
* Update keypoint detection template (#4114)
* added default template
* update field
* quick fix for rtmdet
* minor update
* minor fix
---------
Signed-off-by: Ashwin Vaidya <ashwinnitinvaidya@gmail.com>
Co-authored-by: Yunchu Lee <yunchu.lee@intel.com>
Co-authored-by: Harim Kang <harim.kang@intel.com>
Co-authored-by: Emily Chun <emily.chun@intel.com>
Co-authored-by: Eunwoo Shin <eunwoo.shin@intel.com>
Co-authored-by: Kim, Sungchul <sungchul.kim@intel.com>
Co-authored-by: Vladislav Sovrasov <sovrasov.vlad@gmail.com>
Co-authored-by: Sooah Lee <sooah.lee@intel.com>
Co-authored-by: Eugene Liu <eugene.liu@intel.com>
Co-authored-by: Wonju Lee <wonju.lee@intel.com>
Co-authored-by: Ashwin Vaidya <ashwin.vaidya@intel.com>
Co-authored-by: Leonardo Lai <leonardo.lai@intel.com>
openvinotoolkit · Nov 14, 2024 · 45e79b6
1 parent ac2393f

Showing 29 changed files with 776 additions and 102 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -94,6 +94,10 @@ All notable changes to this project will be documented in this file.
   (<https://github.com/openvinotoolkit/training_extensions/pull/3788>)
 - Add diffusion task
   (<https://github.com/openvinotoolkit/training_extensions/pull/3875>)
+- Revert the old workaround for detection confidence threshold
+  (<https://github.com/openvinotoolkit/training_extensions/pull/4096>)
+- Add Keypoint Detection legacy template
+  (<https://github.com/openvinotoolkit/training_extensions/pull/4094>)
 
 ### Enhancements
 
@@ -125,6 +129,8 @@ All notable changes to this project will be documented in this file.
   (<https://github.com/openvinotoolkit/training_extensions/pull/4009>)
 - Remove background label from RT Info for segmentation task
   (<https://github.com/openvinotoolkit/training_extensions/pull/4011>)
+- Enable export of the feature vectors for semantic segmentation task
+  (<https://github.com/openvinotoolkit/training_extensions/pull/4055>)
 - Prevent using too low confidence thresholds in detection
   (<https://github.com/openvinotoolkit/training_extensions/pull/4018>)
 - Update HPO interface
@@ -162,8 +168,6 @@ All notable changes to this project will be documented in this file.
   (<https://github.com/openvinotoolkit/training_extensions/pull/4049>)
 - Model templates: rename model_status value 'DISCONTINUED' to 'OBSOLETE'
   (<https://github.com/openvinotoolkit/training_extensions/pull/4051>)
-- Enable export of feature vectors for semantic segmentation task
-  (<https://github.com/openvinotoolkit/training_extensions/pull/4055>)
 - Update MRCNN model export to include feature vector and saliency map
   (<https://github.com/openvinotoolkit/training_extensions/pull/4056>)
 - Upgrade MAPI in 2.2
@@ -172,6 +176,18 @@ All notable changes to this project will be documented in this file.
   (<https://github.com/openvinotoolkit/training_extensions/pull/4057>)
 - Fix incorrect all_groups order configuration in HLabelInfo
   (<https://github.com/openvinotoolkit/training_extensions/pull/4067>)
+- Fix RTDETR recipes
+  (<https://github.com/openvinotoolkit/training_extensions/pull/4079>)
+- Fix wrong model name in converter & template
+  (<https://github.com/openvinotoolkit/training_extensions/pull/4082>)
+- Fix RTMDet Inst Explain Mode
+  (<https://github.com/openvinotoolkit/training_extensions/pull/4083>)
+- Fix RTDETR Explain Mode
+  (<https://github.com/openvinotoolkit/training_extensions/pull/4106>)
+- Fix classification and semantic segmentation tasks when ROI is provided for images
+  (<https://github.com/openvinotoolkit/training_extensions/pull/4105>)
+- Disable tiling classifier toggle in configurable parameters
+  (<https://github.com/openvinotoolkit/training_extensions/pull/4107>)
 
 ## \[v2.1.0\]
 

diff --git a/pyproject.toml b/pyproject.toml
@@ -95,7 +95,7 @@ base = [
     "timm==1.0.3",
     "openvino==2024.4",
     "openvino-dev==2024.4",
-    "openvino-model-api==0.2.4",
+    "openvino-model-api==0.2.5",
     "onnx==1.17.0",
     "onnxconverter-common==1.14.0",
     "nncf==2.13.0",

diff --git a/src/otx/algo/detection/detectors/detection_transformer.py b/src/otx/algo/detection/detectors/detection_transformer.py
@@ -5,6 +5,7 @@
 
 from __future__ import annotations
 
+import warnings
 from typing import Any
 
 import numpy as np
@@ -95,16 +96,22 @@ def export(
         explain_mode: bool = False,
     ) -> dict[str, Any] | tuple[list[Any], list[Any], list[Any]]:
         """Exports the model."""
-        if explain_mode:
-            msg = "Explain mode is not supported for DETR models yet."
-            raise NotImplementedError(msg)
-
-        return self.postprocess(
+        results = self.postprocess(
             self._forward_features(batch_inputs),
             [meta["img_shape"] for meta in batch_img_metas],
             deploy_mode=True,
         )
 
+        if explain_mode:
+            # TODO(Eugene): Implement explain mode for DETR model.
+            warnings.warn("Explain mode is not supported for DETR model. Return dummy values.", stacklevel=2)
+            xai_output = {
+                "feature_vector": torch.zeros(1, 1),
+                "saliency_map": torch.zeros(1),
+            }
+            results.update(xai_output)  # type: ignore[union-attr]
+        return results
+
     def postprocess(
         self,
         outputs: dict[str, Tensor],
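
Note on the `export` hunk above: instead of raising `NotImplementedError`, explain mode now emits a warning and attaches placeholder XAI outputs to the regular results. A minimal sketch of that pattern follows. The helper name `export_with_dummy_xai` is hypothetical, and plain nested lists stand in for the zero tensors in the diff so the snippet runs without torch.

```python
from __future__ import annotations

import warnings
from typing import Any


def export_with_dummy_xai(results: dict[str, Any], explain_mode: bool = False) -> dict[str, Any]:
    """Return results, adding placeholder XAI entries when explain mode is requested.

    Hypothetical helper: it mirrors the control flow of the hunk, not the real model code.
    """
    out = dict(results)  # shallow copy so the caller's dict is left untouched
    if explain_mode:
        # Warn rather than raise, so callers that always request XAI keep working.
        warnings.warn(
            "Explain mode is not supported. Returning dummy values.",
            stacklevel=2,
        )
        # Nested lists of zeros stand in for torch.zeros(1, 1) and torch.zeros(1).
        out["feature_vector"] = [[0.0]]
        out["saliency_map"] = [0.0]
    return out


# Capture the warning to show that exactly one is emitted.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    exported = export_with_dummy_xai({"bboxes": [], "labels": []}, explain_mode=True)

print(sorted(exported))
print(len(caught))
```

Consumers that rely on real saliency maps would still need to treat these zeros as "not available"; the sketch only shows how the export path stays callable.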

diff --git a/src/otx/algo/utils/xai_utils.py b/src/otx/algo/utils/xai_utils.py
@@ -225,7 +225,7 @@ def _get_image_data_name(
     subset = datamodule.subsets[subset_name]
     item = subset.dm_subset[img_id]
     img = item.media_as(Image)
-    img_data, _ = subset._get_img_data_and_shape(img)  # noqa: SLF001
+    img_data, _, _ = subset._get_img_data_and_shape(img)  # noqa: SLF001
     image_save_name = "".join([char if char.isalnum() else "_" for char in item.id])
     return img_data, image_save_name
 

diff --git a/src/otx/core/data/dataset/anomaly.py b/src/otx/core/data/dataset/anomaly.py
@@ -79,7 +79,7 @@ def _get_item_impl(
         datumaro_item = self.dm_subset[index]
         img = datumaro_item.media_as(Image)
         # returns image in RGB format if self.image_color_channel is RGB
-        img_data, img_shape = self._get_img_data_and_shape(img)
+        img_data, img_shape, _ = self._get_img_data_and_shape(img)
 
         label = self._get_label(datumaro_item)
 

diff --git a/src/otx/core/data/dataset/base.py b/src/otx/core/data/dataset/base.py
@@ -8,7 +8,7 @@
 from abc import abstractmethod
 from collections.abc import Iterable
 from contextlib import contextmanager
-from typing import TYPE_CHECKING, Callable, Generic, Iterator, List, Union
+from typing import TYPE_CHECKING, Any, Callable, Generic, Iterator, List, Union
 
 import cv2
 import numpy as np
@@ -92,6 +92,7 @@ def __init__(
         self.image_color_channel = image_color_channel
         self.stack_images = stack_images
         self.to_tv_image = to_tv_image
+
         if self.dm_subset.categories():
             self.label_info = LabelInfo.from_dm_label_groups(self.dm_subset.categories()[AnnotationType.label])
         else:
@@ -141,11 +142,30 @@ def __getitem__(self, index: int) -> T_OTXDataEntity:
         msg = f"Reach the maximum refetch number ({self.max_refetch})"
         raise RuntimeError(msg)
 
-    def _get_img_data_and_shape(self, img: Image) -> tuple[np.ndarray, tuple[int, int]]:
-        key = img.path if isinstance(img, ImageFromFile) else id(img)
+    def _get_img_data_and_shape(
+        self,
+        img: Image,
+        roi: dict[str, Any] | None = None,
+    ) -> tuple[np.ndarray, tuple[int, int], dict[str, Any] | None]:
+        """Get image data and shape.
+
+        This method is used to get image data and shape from Datumaro image object.
+        If ROI is provided, the image data is extracted from the ROI.
+
+        Args:
+            img (Image): Image object from Datumaro.
+            roi (dict[str, Any] | None, Optional): Region of interest.
+                Represented by dict with coordinates and some meta information.
 
-        if (img_data := self.mem_cache_handler.get(key=key)[0]) is not None:
-            return img_data, img_data.shape[:2]
+        Returns:
+                The image data, shape, and ROI meta information
+        """
+        key = img.path if isinstance(img, ImageFromFile) else id(img)
+        roi_meta = None
+        # check if the image is already in the cache
+        img_data, roi_meta = self.mem_cache_handler.get(key=key)
+        if img_data is not None:
+            return img_data, img_data.shape[:2], roi_meta
 
         with image_decode_context():
             img_data = (
@@ -158,11 +178,28 @@ def _get_img_data_and_shape(self, img: Image) -> tuple[np.ndarray, tuple[int, in
             msg = "Cannot get image data"
             raise RuntimeError(msg)
 
-        img_data = self._cache_img(key=key, img_data=img_data.astype(np.uint8))
+        if roi and isinstance(roi, dict):
+            # extract ROI from image
+            shape = roi["shape"]
+            h, w = img_data.shape[:2]
+            x1, y1, x2, y2 = (
+                int(np.clip(np.trunc(shape["x1"] * w), 0, w)),
+                int(np.clip(np.trunc(shape["y1"] * h), 0, h)),
+                int(np.clip(np.ceil(shape["x2"] * w), 0, w)),
+                int(np.clip(np.ceil(shape["y2"] * h), 0, h)),
+            )
+            if (x2 - x1) * (y2 - y1) <= 0:
+                msg = f"ROI has zero or negative area. ROI coordinates: {x1}, {y1}, {x2}, {y2}"
+                raise ValueError(msg)
+
+            img_data = img_data[y1:y2, x1:x2]
+            roi_meta = {"x1": x1, "y1": y1, "x2": x2, "y2": y2, "orig_image_shape": (h, w)}
+
+        img_data = self._cache_img(key=key, img_data=img_data.astype(np.uint8), meta=roi_meta)
 
-        return img_data, img_data.shape[:2]
+        return img_data, img_data.shape[:2], roi_meta
 
-    def _cache_img(self, key: str | int, img_data: np.ndarray) -> np.ndarray:
+    def _cache_img(self, key: str | int, img_data: np.ndarray, meta: dict[str, Any] | None = None) -> np.ndarray:
         """Cache an image after resizing.
 
         If there is available space in the memory pool, the input image is cached.
@@ -182,14 +219,14 @@ def _cache_img(self, key: str | int, img_data: np.ndarray) -> np.ndarray:
             return img_data
 
         if self.mem_cache_img_max_size is None:
-            self.mem_cache_handler.put(key=key, data=img_data, meta=None)
+            self.mem_cache_handler.put(key=key, data=img_data, meta=meta)
             return img_data
 
         height, width = img_data.shape[:2]
         max_height, max_width = self.mem_cache_img_max_size
 
         if height <= max_height and width <= max_width:
-            self.mem_cache_handler.put(key=key, data=img_data, meta=None)
+            self.mem_cache_handler.put(key=key, data=img_data, meta=meta)
             return img_data
 
         # Preserve the image size ratio and fit to max_height or max_width
@@ -206,7 +243,7 @@ def _cache_img(self, key: str | int, img_data: np.ndarray) -> np.ndarray:
         self.mem_cache_handler.put(
             key=key,
             data=resized_img,
-            meta=None,
+            meta=meta,
         )
         return resized_img
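
Note on the ROI handling in `_get_img_data_and_shape` above: the normalized `x1, y1, x2, y2` values are scaled by the image size, rounded outward (truncate the top-left corner, ceil the bottom-right corner), clipped to the image bounds, and then used to slice the image. The sketch below reproduces that arithmetic with the standard library only. The function names and the nested-list image are illustrative assumptions, not part of OTX.

```python
from __future__ import annotations

import math


def roi_to_pixel_box(shape: dict[str, float], height: int, width: int) -> tuple[int, int, int, int]:
    """Convert a normalized ROI to a clipped pixel box (x1, y1, x2, y2)."""

    def clip(value: int, upper: int) -> int:
        return max(0, min(value, upper))

    x1 = clip(math.trunc(shape["x1"] * width), width)
    y1 = clip(math.trunc(shape["y1"] * height), height)
    x2 = clip(math.ceil(shape["x2"] * width), width)
    y2 = clip(math.ceil(shape["y2"] * height), height)
    # Assumption of this sketch: each side is checked on its own, which also
    # rejects a box that is inverted on both axes at once.
    if x2 <= x1 or y2 <= y1:
        msg = f"ROI has zero or negative area. ROI coordinates: {x1}, {y1}, {x2}, {y2}"
        raise ValueError(msg)
    return x1, y1, x2, y2


def crop(image: list[list[int]], box: tuple[int, int, int, int]) -> list[list[int]]:
    """Slice rows y1:y2 and columns x1:x2, like img_data[y1:y2, x1:x2]."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]


# A 4x8 "image" whose pixel value encodes its position (10 * row + column).
image = [[10 * r + c for c in range(8)] for r in range(4)]
box = roi_to_pixel_box({"x1": 0.25, "y1": 0.0, "x2": 0.6, "y2": 0.5}, height=4, width=8)
patch = crop(image, box)
print(box)
print(patch)
```

The sketch leaves caching out. In the hunk above the cache key is derived from the image alone, so it may be worth checking how items that share an image but carry different ROIs interact with the memory cache.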
 

diff --git a/src/otx/core/data/dataset/classification.py b/src/otx/core/data/dataset/classification.py
@@ -32,18 +32,18 @@ class OTXMulticlassClsDataset(OTXDataset[MulticlassClsDataEntity]):
     def _get_item_impl(self, index: int) -> MulticlassClsDataEntity | None:
         item = self.dm_subset[index]
         img = item.media_as(Image)
-        img_data, img_shape = self._get_img_data_and_shape(img)
+        roi = item.attributes.get("roi", None)
+        img_data, img_shape, _ = self._get_img_data_and_shape(img, roi)
+        if roi:
+            # extract labels from ROI
+            labels_ids = [
+                label["label"]["_id"] for label in roi["labels"] if label["label"]["domain"] == "CLASSIFICATION"
+            ]
+            label_anns = [self.label_info.label_names.index(label_id) for label_id in labels_ids]
+        else:
+            # extract labels from annotations
+            label_anns = [ann.label for ann in item.annotations if isinstance(ann, Label)]
 
-        label_anns = []
-        for ann in item.annotations:
-            if isinstance(ann, Label):
-                label_anns.append(ann)
-            else:
-                # If the annotation is not Label, it should be converted to Label.
-                # For Chained Task: Detection (Bbox) -> Classification (Label)
-                label = Label(label=ann.label)
-                if label not in label_anns:
-                    label_anns.append(label)
         if len(label_anns) > 1:
             msg = f"Multi-class Classification can't use the multi-label, currently len(labels) = {len(label_anns)}"
             raise ValueError(msg)
@@ -56,7 +56,7 @@ def _get_item_impl(self, index: int) -> MulticlassClsDataEntity | None:
                 ori_shape=img_shape,
                 image_color_channel=self.image_color_channel,
             ),
-            labels=torch.as_tensor([ann.label for ann in label_anns]),
+            labels=torch.as_tensor(label_anns),
         )
 
         return self._apply_transforms(entity)
@@ -78,7 +78,7 @@ def _get_item_impl(self, index: int) -> MultilabelClsDataEntity | None:
         item = self.dm_subset[index]
         img = item.media_as(Image)
         ignored_labels: list[int] = []  # This should be assigned from item
-        img_data, img_shape = self._get_img_data_and_shape(img)
+        img_data, img_shape, _ = self._get_img_data_and_shape(img)
 
         label_anns = []
         for ann in item.annotations:
@@ -195,7 +195,7 @@ def _get_item_impl(self, index: int) -> HlabelClsDataEntity | None:
         item = self.dm_subset[index]
         img = item.media_as(Image)
         ignored_labels: list[int] = []  # This should be assigned from item
-        img_data, img_shape = self._get_img_data_and_shape(img)
+        img_data, img_shape, _ = self._get_img_data_and_shape(img)
 
         label_anns = []
         for ann in item.annotations:
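
Note on the multi-class hunk above: when an item carries a `roi` attribute, labels are taken from the ROI's own label list, keeping only entries whose domain is `CLASSIFICATION`, and each label id is mapped to its index in `label_info.label_names`. A small sketch of that lookup is below. The ROI layout is inferred from the keys the diff accesses, and the sample ids and names are made up for illustration.

```python
from __future__ import annotations

from typing import Any


def labels_from_roi(roi: dict[str, Any], label_names: list[str]) -> list[int]:
    """Map the classification labels attached to an ROI to class indices."""
    label_ids = [
        entry["label"]["_id"]
        for entry in roi["labels"]
        if entry["label"]["domain"] == "CLASSIFICATION"
    ]
    # list.index raises ValueError if an id is missing from label_names.
    return [label_names.index(label_id) for label_id in label_ids]


# Hypothetical ROI from a Detection -> Classification chain: one label per domain.
roi = {
    "labels": [
        {"label": {"_id": "cat", "domain": "CLASSIFICATION"}},
        {"label": {"_id": "object", "domain": "DETECTION"}},
    ],
}
indices = labels_from_roi(roi, label_names=["dog", "cat"])
print(indices)
```

Filtering by domain is what keeps the upstream detection label out of the classification target, so the existing "more than one label" check still applies to classification labels only.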

diff --git a/src/otx/core/data/dataset/detection.py b/src/otx/core/data/dataset/detection.py
@@ -26,7 +26,7 @@ def _get_item_impl(self, index: int) -> DetDataEntity | None:
         item = self.dm_subset[index]
         img = item.media_as(Image)
         ignored_labels: list[int] = []  # This should be assigned from item
-        img_data, img_shape = self._get_img_data_and_shape(img)
+        img_data, img_shape, _ = self._get_img_data_and_shape(img)
 
         bbox_anns = [ann for ann in item.annotations if isinstance(ann, Bbox)]
 

diff --git a/src/otx/core/data/dataset/diffusion.py b/src/otx/core/data/dataset/diffusion.py
@@ -22,7 +22,7 @@ def _get_item_impl(self, idx: int) -> DiffusionDataEntity | None:
         item = self.dm_subset[idx]
         caption = item.annotations[0].caption
         img = item.media_as(Image)
-        img_data, img_shape = self._get_img_data_and_shape(img)
+        img_data, img_shape, _ = self._get_img_data_and_shape(img)
         entity = DiffusionDataEntity(
             image=img_data,
             img_info=ImageInfo(

diff --git a/src/otx/core/data/dataset/instance_segmentation.py b/src/otx/core/data/dataset/instance_segmentation.py
@@ -40,7 +40,7 @@ def _get_item_impl(self, index: int) -> InstanceSegDataEntity | None:
         item = self.dm_subset[index]
         img = item.media_as(Image)
         ignored_labels: list[int] = []
-        img_data, img_shape = self._get_img_data_and_shape(img)
+        img_data, img_shape, _ = self._get_img_data_and_shape(img)
 
         gt_bboxes, gt_labels, gt_masks, gt_polygons = [], [], [], []
 

diff --git a/src/otx/core/data/dataset/keypoint_detection.py b/src/otx/core/data/dataset/keypoint_detection.py
@@ -86,7 +86,7 @@ def _get_item_impl(self, index: int) -> KeypointDetDataEntity | None:
         item = self.dm_subset[index]
         img = item.media_as(Image)
         ignored_labels: list[int] = []  # This should be assigned from item
-        img_data, img_shape = self._get_img_data_and_shape(img)
+        img_data, img_shape, _ = self._get_img_data_and_shape(img)
 
         bbox_anns = [ann for ann in item.annotations if isinstance(ann, Bbox)]
         bboxes = (

diff --git a/src/otx/core/data/dataset/object_detection_3d.py b/src/otx/core/data/dataset/object_detection_3d.py
@@ -58,7 +58,7 @@ def __init__(
     def _get_item_impl(self, index: int) -> Det3DDataEntity | None:
         entity = self.dm_subset[index]
         image = entity.media_as(Image)
-        image, ori_img_shape = self._get_img_data_and_shape(image)
+        image, ori_img_shape, _ = self._get_img_data_and_shape(image)
         calib = self.get_calib_from_file(entity.attributes["calib_path"])
         annotations_copy = deepcopy(entity.annotations)
         datumaro_kitti_format = [obj.attributes for obj in annotations_copy]

diff --git a/src/otx/core/data/dataset/segmentation.py b/src/otx/core/data/dataset/segmentation.py
@@ -203,9 +203,14 @@ def _get_item_impl(self, index: int) -> SegDataEntity | None:
         item = self.dm_subset[index]
         img = item.media_as(Image)
         ignored_labels: list[int] = []
-        img_data, img_shape = self._get_img_data_and_shape(img)
+        roi = item.attributes.get("roi", None)
+        img_data, img_shape, roi_meta = self._get_img_data_and_shape(img, roi)
         if item.annotations:
-            extracted_mask = _extract_class_mask(item=item, img_shape=img_shape, ignore_index=self.ignore_index)
+            ori_shape = roi_meta["orig_image_shape"] if roi_meta else img_shape
+            extracted_mask = _extract_class_mask(item=item, img_shape=ori_shape, ignore_index=self.ignore_index)
+            if roi_meta:
+                extracted_mask = extracted_mask[roi_meta["y1"] : roi_meta["y2"], roi_meta["x1"] : roi_meta["x2"]]
+
             masks = tv_tensors.Mask(extracted_mask[None])
         else:
             # semi-supervised learning, unlabeled dataset
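
Note on the segmentation hunk above: the class mask is first built at the original image size (`orig_image_shape` from the ROI metadata) and only then cropped with the same pixel box that was applied to the image, so mask and image stay aligned. The sketch below shows that second step with nested lists in place of arrays. `crop_mask` is a hypothetical helper, and the metadata keys follow the `roi_meta` dict in the diff.

```python
from __future__ import annotations

from typing import Any


def crop_mask(mask: list[list[int]], roi_meta: dict[str, Any] | None) -> list[list[int]]:
    """Crop a full-size mask to the ROI box, or return it unchanged without an ROI."""
    if roi_meta is None:
        return mask
    rows = mask[roi_meta["y1"] : roi_meta["y2"]]
    return [row[roi_meta["x1"] : roi_meta["x2"]] for row in rows]


# Full-size 4x6 mask with class 1 on rows 1-2, columns 3-4; background is 0.
full_mask = [[1 if 1 <= r <= 2 and 3 <= c <= 4 else 0 for c in range(6)] for r in range(4)]
roi_meta = {"x1": 2, "y1": 1, "x2": 5, "y2": 3, "orig_image_shape": (4, 6)}

cropped = crop_mask(full_mask, roi_meta)
print(cropped)
```

Building the mask at the cropped size instead would shift every annotation by `(x1, y1)`, which is presumably why the diff passes the original shape to `_extract_class_mask`.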

diff --git a/src/otx/core/data/dataset/tile.py b/src/otx/core/data/dataset/tile.py
@@ -414,7 +414,7 @@ def _get_item_impl(self, index: int) -> TileDetDataEntity:  # type: ignore[overr
         """
         item = self.dm_subset[index]
         img = item.media_as(Image)
-        img_data, img_shape = self._get_img_data_and_shape(img)
+        img_data, img_shape, _ = self._get_img_data_and_shape(img)
 
         bbox_anns = [ann for ann in item.annotations if isinstance(ann, Bbox)]
 
@@ -505,7 +505,7 @@ def _get_item_impl(self, index: int) -> TileInstSegDataEntity:  # type: ignore[o
         """
         item = self.dm_subset[index]
         img = item.media_as(Image)
-        img_data, img_shape = self._get_img_data_and_shape(img)
+        img_data, img_shape, _ = self._get_img_data_and_shape(img)
 
         gt_bboxes, gt_labels, gt_masks, gt_polygons = [], [], [], []
 
@@ -607,7 +607,7 @@ def _get_item_impl(self, index: int) -> TileSegDataEntity:  # type: ignore[overr
         """
         item = self.dm_subset[index]
         img = item.media_as(Image)
-        img_data, img_shape = self._get_img_data_and_shape(img)
+        img_data, img_shape, _ = self._get_img_data_and_shape(img)
 
         extracted_mask = _extract_class_mask(item=item, img_shape=img_shape, ignore_index=self.ignore_index)
         masks = tv_tensors.Mask(extracted_mask[None])

diff --git a/src/otx/core/data/dataset/visual_prompting.py b/src/otx/core/data/dataset/visual_prompting.py
@@ -79,7 +79,7 @@ def __init__(
     def _get_item_impl(self, index: int) -> VisualPromptingDataEntity | None:
         item = self.dm_subset[index]
         img = item.media_as(dmImage)
-        img_data, img_shape = self._get_img_data_and_shape(img)
+        img_data, img_shape, _ = self._get_img_data_and_shape(img)
 
         gt_bboxes, gt_points = [], []
         gt_masks = defaultdict(list)
@@ -229,7 +229,7 @@ def __init__(
     def _get_item_impl(self, index: int) -> ZeroShotVisualPromptingDataEntity | None:
         item = self.dm_subset[index]
         img = item.media_as(dmImage)
-        img_data, img_shape = self._get_img_data_and_shape(img)
+        img_data, img_shape, _ = self._get_img_data_and_shape(img)
 
         prompts: list[ZeroShotPromptType] = []
         gt_masks: list[tvMask] = []

diff --git a/src/otx/core/data/transform_libs/torchvision.py b/src/otx/core/data/transform_libs/torchvision.py
@@ -2650,6 +2650,7 @@ def forward(self, *_inputs: T_OTXDataEntity) -> T_OTXDataEntity | None:
         if not keep.any() and self.keep_empty:
             return self.convert(inputs)
 
+        keep = list(keep)
         keys = ("bboxes", "labels", "masks", "polygons")
         for key in keys:
             if hasattr(inputs, key):