
clp-package: Add support for printing real-time compression statistics. #388

Merged: 16 commits merged into y-scope:main on May 15, 2024

Conversation

@wraymo (Contributor) commented May 10, 2024

References

Description

For large jobs, users often experience extended waiting periods until completion. During that time, they have no idea about the compression status. This PR addresses this issue by printing out real-time statistics (compression ratio and speed) during compression. Specifically, this PR introduces the code changes listed below:

  • CompressionTaskFailureResult and CompressionTaskSuccessResult are removed.
  • The compression scheduler no longer updates the completion status of compression tasks.
  • Instead, each task updates its own completion status and, at the same time, updates the compression job's fields (i.e., compressed_size, uncompressed_size, and num_tasks_completed); a sketch of such an update follows this list.
  • In the user compression script, completion_query is removed and compression statistics are now printed promptly whenever they are updated.
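
As a rough illustration of the per-task update described above, a task could fold its results into the job row with a single statement. This is only a sketch: the table name (compression_jobs), the helper name, and the DB-API parameter style are assumptions, not the PR's actual code.

def update_job_progress(db_cursor, job_id, uncompressed_size, compressed_size):
    # Hypothetical helper: accumulate this task's sizes into the job row and
    # bump the completed-task count in one UPDATE.
    db_cursor.execute(
        "UPDATE compression_jobs"
        " SET uncompressed_size = uncompressed_size + %s,"
        "     compressed_size = compressed_size + %s,"
        "     num_tasks_completed = num_tasks_completed + 1"
        " WHERE id = %s",
        (uncompressed_size, compressed_size, job_id),
    )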

Validation performed

  • Built and started the package
  • Compressed the hive-24hrs dataset. It printed out statistics like this:
2024-05-10 03:22:52,628 [INFO] [/opt/clp/lib/python3/site-packages/clp_package_utils/scripts/native/compress.py] Compression job 6 submitted.
2024-05-10 03:23:02,159 [INFO] [/opt/clp/lib/python3/site-packages/clp_package_utils/scripts/native/compress.py] Compressed 159.87MB into 4.14MB (38.58). Speed: 48.95MB/s.
2024-05-10 03:23:04,165 [INFO] [/opt/clp/lib/python3/site-packages/clp_package_utils/scripts/native/compress.py] Compressed 928.14MB into 23.46MB (39.56). Speed: 176.05MB/s.
2024-05-10 03:23:04,667 [INFO] [/opt/clp/lib/python3/site-packages/clp_package_utils/scripts/native/compress.py] Compressed 1.41GB into 36.32MB (39.73). Speed: 249.91MB/s.
2024-05-10 03:23:05,670 [INFO] [/opt/clp/lib/python3/site-packages/clp_package_utils/scripts/native/compress.py] Compressed 1.66GB into 40.11MB (42.36). Speed: 250.69MB/s.
2024-05-10 03:23:06,673 [INFO] [/opt/clp/lib/python3/site-packages/clp_package_utils/scripts/native/compress.py] Compression finished. Compressed 1.99GB into 44.37MB (45.89). Speed: 261.66MB/s.

@kirkrodrigues (Member) commented:

Can you summarize the approach taken? I.e., at a high level, what is the code change?

@wraymo (Contributor, Author) commented May 10, 2024:

> Can you summarize the approach taken? I.e., at a high level, what is the code change?

Updated

@wraymo wraymo requested a review from kirkrodrigues May 13, 2024 20:55
Comment on lines 90 to 100
compression_ratio = float(job_uncompressed_size) / job_compressed_size
compression_speed = (
    job_uncompressed_size
    / (current_time - job_row["start_time"]).total_seconds()
)
logger.info(
    f"Compressed {pretty_size(job_uncompressed_size)} into "
    f"{pretty_size(job_compressed_size)} ({compression_ratio:.2f}). "
    f"Speed: {pretty_size(compression_speed)}/s."
)
job_last_uncompressed_size = job_uncompressed_size
@kirkrodrigues (Member) commented:

Let's deduplicate this with the block on line 68.
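
One possible shape for that deduplication is a small helper shared by both call sites (a sketch only; the helper name and signature are not from the PR, while logger and pretty_size are the module-level names used in the snippet above):

def log_compression_progress(start_time, current_time, uncompressed_size, compressed_size):
    # Print the running compression ratio and average speed for a job.
    compression_ratio = float(uncompressed_size) / compressed_size
    compression_speed = uncompressed_size / (current_time - start_time).total_seconds()
    logger.info(
        f"Compressed {pretty_size(uncompressed_size)} into "
        f"{pretty_size(compressed_size)} ({compression_ratio:.2f}). "
        f"Speed: {pretty_size(compression_speed)}/s."
    )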

-    if not task_result["status"] == CompressionTaskStatus.SUCCEEDED:
-        task_result = CompressionTaskFailureResult.parse_obj(task_result)
+    task_result = CompressionTaskResult.parse_obj(task_result)
+    if not task_result.status == CompressionTaskStatus.SUCCEEDED:
@kirkrodrigues (Member) commented:

Suggested change
-    if not task_result.status == CompressionTaskStatus.SUCCEEDED:
+    if task_result.status != CompressionTaskStatus.SUCCEEDED:

@kirkrodrigues (Member) commented:

That said, this if-else would be simpler if we swapped the cases.

if not line:
    break
stats = json.loads(line.decode("ascii"))
if stats["id"] != last_archive_id:
@kirkrodrigues (Member) commented:

Hm, since an archive may print its stats multiple times (after every segment is created, in clp's case), we would still need the previous logic that keeps last_archive_stats, right?

@wraymo (Contributor, Author) commented May 14, 2024:

The previous logic was to check last_archive_stats['id']; if it's different from the current ID, we update everything. Isn't that the same as the current logic?

@kirkrodrigues (Member) commented:

Not exactly. In the previous logic, if last_archive_stats['id'] was different from the current ID, we would add uncompressed_size and size from last_archive_stats, but in the current code we're adding the values from stats.

To see why this is a problem, imagine clp creates two archives with two segments each, meaning it will print the archive stats 4 times, something like this:

  1. archive-1-seg-1: uncompressed_size = 10, size = 1
  2. archive-1-seg-2: uncompressed_size = 20, size = 2
  3. archive-2-seg-1: uncompressed_size = 5, size = 1
  4. archive-2-seg-2: uncompressed_size = 10, size = 2

In the current code, when we see the printout of (1), we will do total_uncompressed_size += 10, size += 1. When we see the printout of (3), we will do total_uncompressed_size += 5, size += 1. This will give us total_uncompressed_size = 15, size = 2. But it should be total_uncompressed_size = 30, size = 4.

@wraymo (Contributor, Author) commented May 15, 2024:

Oh, I see your point. uncompressed_size is cumulative. I thought we could accept the first printout and abandon the rest with the same ID.
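
For reference, the accumulation pattern described above (keep the most recent, cumulative stats per archive and fold them into the totals only when the archive ID changes, plus once at the end) could look roughly like this. It is a sketch only: proc is assumed to be the compression subprocess, and the uncompressed_size and size keys are the ones named in the discussion.

import json

last_archive_stats = None
total_uncompressed_size = 0
total_size = 0
while True:
    line = proc.stdout.readline()
    if not line:
        break
    stats = json.loads(line.decode("ascii"))
    if last_archive_stats is not None and stats["id"] != last_archive_stats["id"]:
        # The previous archive is finished; its cumulative stats are final.
        total_uncompressed_size += last_archive_stats["uncompressed_size"]
        total_size += last_archive_stats["size"]
    # Stats within an archive are cumulative, so always keep the latest printout.
    last_archive_stats = stats
if last_archive_stats is not None:
    # Fold in the last archive's stats once the stream ends.
    total_uncompressed_size += last_archive_stats["uncompressed_size"]
    total_size += last_archive_stats["size"]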

-            ).dict()
-        else:
-            return CompressionTaskFailureResult(
+    with closing(sql_adapter.create_connection(True)) as db_conn, closing(
@kirkrodrigues (Member) commented:

I still think we should try to keep our MySQL connections short, even if frequent; so I would prefer we open a connection only just before we perform a write (currently, this opens a connection before we start compression, which itself could take a long time depending on the dataset and config).

MySQL's blog says that it's capable of handling a lot of short connections, but its default concurrent connection limit is only 151. We should probably benchmark this for ourselves at some point (a long time ago, @haiqi96 had done some scalability benchmarking that showed MySQL struggled with 20 concurrent connections performing inserts), but for now, I think following their advice is the safer option.

@wraymo (Contributor, Author) commented:

Yeah, my original plan was to use short connections, but I noticed that in the compression scheduler, we maintain a long connection.

@kirkrodrigues (Member) commented:

In the scheduler's case, it's only maintaining one connection that it's using for polling (among other things), right? In theory we could make it use shorter connections, but there I'm not sure it will make much difference (we should still measure at some point though).

@wraymo (Contributor, Author) commented:

Got it. For the compression task, we read the output from the process and update the database. Do you think we should open a connection each time we get a new line?

@kirkrodrigues (Member) commented:

I think we could try opening a connection each time we need to update the archive's stats + tags (which would only be every time we finish an archive) and then once at the end of the task.
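
Along those lines, each write could open its own short-lived connection, roughly like this (a sketch: sql_adapter.create_connection(True) and closing come from the diff above, while update_job_progress is the hypothetical helper sketched in the description):

from contextlib import closing

def write_archive_stats(sql_adapter, job_id, archive_stats):
    # Open a connection only for the duration of this write, then close it.
    with closing(sql_adapter.create_connection(True)) as db_conn, closing(
        db_conn.cursor()
    ) as db_cursor:
        update_job_progress(
            db_cursor,
            job_id,
            archive_stats["uncompressed_size"],
            archive_stats["size"],
        )
        db_conn.commit()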

@wraymo wraymo requested a review from kirkrodrigues May 15, 2024 15:11
Comment on lines 30 to 31
logger.error("Must specify at least one field to update")
raise ValueError
@kirkrodrigues (Member) commented:

Suggested change
-        logger.error("Must specify at least one field to update")
-        raise ValueError
+        raise ValueError("Must specify at least one field to update")

Comment on lines 44 to 45
logger.error("Must specify at least one field to update")
raise ValueError
@kirkrodrigues (Member) commented:

Suggested change
-        logger.error("Must specify at least one field to update")
-        raise ValueError
+        raise ValueError("Must specify at least one field to update")

@wraymo wraymo requested a review from kirkrodrigues May 15, 2024 18:12
@kirkrodrigues kirkrodrigues changed the title clp-package: Add support for printing real-time compression statistics clp-package: Add support for printing real-time compression statistics. May 15, 2024
@wraymo wraymo merged commit 69b1434 into y-scope:main May 15, 2024
2 checks passed