openvinotoolkit · vinnamkim · Oct 19, 2023 · Oct 12, 2023 · Oct 14, 2023 · Oct 14, 2023
@@ -16,6 +16,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
   (<https://github.com/openvinotoolkit/datumaro/pull/1162>)
 - Fix hyperlink errors in the document
   (<https://github.com/openvinotoolkit/datumaro/pull/1159>, <https://github.com/openvinotoolkit/datumaro/pull/1161>)
+- Fix unbounded memory usage in Arrow data format export/import
+  (<https://github.com/openvinotoolkit/datumaro/pull/1169>)
 
 ## 15/09/2023 - Release 1.5.0
 ### New features

@@ -178,13 +178,10 @@ Extra options for exporting to Arrow format:
   - `JPEG/95`: [JPEG](https://en.wikipedia.org/wiki/JPEG) with 95 quality
   - `JPEG/75`: [JPEG](https://en.wikipedia.org/wiki/JPEG) with 75 quality
   - `NONE`: skip saving image.
-- `--max-chunk-size MAX_CHUNK_SIZE` allow to specify maximum chunk size (batch size) when saving into arrow format.
+- `--max-shard-size MAX_SHARD_SIZE` specifies the maximum number of dataset items per shard when saving to Arrow format.
   (default: `1000`)
 - `--num-shards NUM_SHARDS` allow to specify the number of shards to generate.
-  `--num-shards` and `--max-shard-size` are  mutually exclusive.
-  (default: `1`)
-- `--max-shard-size MAX_SHARD_SIZE` allow to specify maximum size of each shard. (e.g. 7KB = 7 \* 2^10, 3MB = 3 \* 2^20, and 2GB = 2 \* 2^30)
-  `--num-shards` and `--max-shard-size` are  mutually exclusive.
+  `--num-shards` and `--max-shard-size` are mutually exclusive.
   (default: `None`)
 - `--num-workers NUM_WORKERS` allow to multi-processing for the export. If num_workers = 0, do not use multiprocessing (default: `0`).
 

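With this change `--max-shard-size` counts dataset items rather than bytes, and it remains mutually exclusive with `--num-shards`. The sketch below illustrates how such item-count sharding could be planned. It is not Datumaro's implementation: `plan_shards`, `DEFAULT_MAX_SHARD_SIZE`, and the even-split policy for `num_shards` are assumptions made for illustration.

```python
from typing import List, Optional

# Documented default for --max-shard-size (number of dataset items).
DEFAULT_MAX_SHARD_SIZE = 1000


def plan_shards(
    num_items: int,
    max_shard_size: Optional[int] = None,
    num_shards: Optional[int] = None,
) -> List[int]:
    """Return the number of items per shard (hypothetical helper)."""
    if max_shard_size is not None and num_shards is not None:
        raise ValueError("num_shards and max_shard_size are mutually exclusive")
    if num_items <= 0:
        return []

    if num_shards is not None:
        if num_shards < 1:
            raise ValueError("num_shards must be >= 1")
        # Assumed policy: spread items as evenly as possible over the shards.
        base, extra = divmod(num_items, num_shards)
        sizes = [base + (1 if i < extra else 0) for i in range(num_shards)]
        return [size for size in sizes if size > 0]

    if max_shard_size is None:
        max_shard_size = DEFAULT_MAX_SHARD_SIZE
    if max_shard_size < 1:
        raise ValueError("max_shard_size must be >= 1")

    # Fill each shard up to the cap; the last shard takes the remainder.
    sizes = []
    remaining = num_items
    while remaining > 0:
        size = min(max_shard_size, remaining)
        sizes.append(size)
        remaining -= size
    return sizes
```

Capping shards by item count bounds how many items a writer has to hold at once, which fits the memory fix described in the changelog entry.
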
@@ -178,13 +178,13 @@ def media_type(_):
 
         return _DatasetFilter()
 
-    def infos(self):
+    def infos(self) -> DatasetInfo:
         return {}
 
-    def categories(self):
+    def categories(self) -> CategoriesInfo:
         return {}
 
-    def get(self, id, subset=None):
+    def get(self, id, subset=None) -> Optional[DatasetItem]:
         subset = subset or DEFAULT_SUBSET_NAME
         for item in self:
             if item.id == id and item.subset == subset:

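The new `Optional[DatasetItem]` annotation on `get` tells callers that a lookup can miss. The hunk is cut off before any return statement, so the sketch below assumes the method returns the matching item and falls through to `None` otherwise. `Item`, the free function `get`, and the value of `DEFAULT_SUBSET_NAME` are hypothetical stand-ins, not the project's definitions.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

DEFAULT_SUBSET_NAME = "default"  # assumed value for this sketch


@dataclass
class Item:
    """Hypothetical stand-in for DatasetItem."""

    id: str
    subset: str = DEFAULT_SUBSET_NAME


def get(items: Iterable[Item], id: str, subset: Optional[str] = None) -> Optional[Item]:
    # A missing subset falls back to the default subset name, as in the hunk.
    subset = subset or DEFAULT_SUBSET_NAME
    for item in items:
        if item.id == id and item.subset == subset:
            return item
    # Assumed behavior: no match yields None, hence the Optional return type.
    return None


items = [Item("a"), Item("b", "train")]
found = get(items, "b", "train")
missing = get(items, "b")  # "b" only exists in the "train" subset
```

Callers are expected to check the result for `None` before using it, which a type checker can now enforce.
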
@@ -319,7 +319,7 @@ def _require_files_iter(
     @contextlib.contextmanager
     def probe_text_file(
         self, path: str, requirement_desc: str, is_binary_file: bool = False
-    ) -> Union[BufferedReader, TextIO]:
+    ) -> Iterator[Union[BufferedReader, TextIO]]:
         """
         Returns a context manager that can be used to place a requirement on
         the contents of the file referred to by `path`. To do so, you must