Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Turn on WAL mode for cluster job table #3923

Merged

Conversation

wizenheimer
Copy link
Contributor

@wizenheimer wizenheimer commented Sep 7, 2024

Issues Addressed

Changes Made

  • Enable WAL mode for the given SQLite connection.
  • Add a wrapper to retry a database operation if it fails due to database lock.

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for submitting the PR @wizenheimer! Are we able to reproduce the issue mentioned in #3863? It would be great if we can include a reproduction in the PR description, which fails on master and succeed on this PR. : )

@@ -83,4 +106,5 @@ def __init__(self, db_path: str, create_table: Callable):
# errors. This is a hack, but it works.
self.conn = sqlite3.connect(db_path, timeout=10)
self.cursor = self.conn.cursor()
enable_wal_mode(self.conn) # Enable WAL mode
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of having this for all table creation, let's keep it in creat_table function as what we did in https://github.com/skypilot-org/skypilot/blob/master/sky/global_user_state.py#L41-L48

Comment on lines 340 to 349
def db_operation():
rows = _CURSOR.execute('SELECT status FROM jobs WHERE job_id=(?)',
(job_id,))
for (status,) in rows:
if status is None:
return None
return JobStatus(status)
return None

return db_utils.retry_on_database_locked(db_operation)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's leave the retry on locked out to keep this PR only about WAL.

@wizenheimer
Copy link
Contributor Author

Hey @Michaelvll,
Really appreciate the feedback. Made the changes as proposed. Adding a reproduction script below.

# attack_db.py - Simulate a multi-threaded attack on the SkyPilot SQLite database
import sqlite3
import time
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

# Path to the SkyPilot SQLite database
DB_PATH = os.path.expanduser("~/.sky/jobs.db")


def run_database_operations(thread_id):
    for i in range(
        100
    ):  # Perform 100 operations per thread. Michaelvll: This can be increased/decreased for a more realistic simulation. But tuning timeouts could yield a better signal.
        try:
            # Use a very short timeout to increase chances of locking. Michaelvll this could be increased/decreased for a more realistic simulation.
            # At 0.1 seconds = 100ms, this is a very short timeout. But master raises more errors compared to current branch.
            # At 0.5 seconds = 500ms, this is a short timeout. But master raises errors and current branch raises no errors.
            # At 1.0 seconds = 1000ms, this is a long timeout. But master raises no errors and current branch raises no errors.
            conn = sqlite3.connect(DB_PATH, timeout=0.5)
            cursor = conn.cursor()

            # Simulate a write operation (inserting a job)
            # Michaelvll: I have added a sleep to increase the chances of contention. These queries are sourced from the SkyPilot codebase.
            cursor.execute(
                """
                INSERT INTO jobs (job_name, username, submitted_at, status, run_timestamp)
                VALUES (?, ?, ?, ?, ?)
            """,
                (
                    f"test_job_{thread_id}_{i}",
                    "test_user",
                    time.time(),
                    "PENDING",
                    str(time.time()),
                ),
            )

            # Simulate a read operation (querying job status)
            cursor.execute(
                "SELECT status FROM jobs WHERE job_name = ?",
                (f"test_job_{thread_id}_{i}",),
            )

            # Simulate an update operation (updating job status)
            cursor.execute(
                "UPDATE jobs SET status = ? WHERE job_name = ?",
                ("RUNNING", f"test_job_{thread_id}_{i}"),
            )

            # Simulate a delete operation (removing the job)
            cursor.execute(
                "DELETE FROM jobs WHERE job_name = ?", (f"test_job_{thread_id}_{i}",)
            )

            conn.commit()
        except sqlite3.OperationalError as e:
            if "database is locked" in str(e):
                print(f"Thread {thread_id}: Caught database lock error: {e}")
            else:
                print(f"Thread {thread_id}: Other SQLite error: {e}")
        except Exception as e:
            print(f"Thread {thread_id}: Unexpected error: {e}")
        finally:
            if conn:
                conn.close()

        time.sleep(0.01)  # Adding a small delay to increase chances of contention


def main():
    num_threads = 20

    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        futures = [
            executor.submit(run_database_operations, i) for i in range(num_threads)
        ]

        for future in as_completed(futures):
            future.result()


if __name__ == "__main__":
    main()

Steps:

  1. On master branch
    1. Run attack_db.py script
    2. Observe the terminal output
      Thread 11: Caught database lock error: database is locked
      Thread 9: Caught database lock error: database is locked
      Thread 2: Caught database lock error: database is locked
      Thread 1: Caught database lock error: database is locked
      Thread 5: Caught database lock error: database is locked
      Thread 12: Caught database lock error: database is locked
      Thread 19: Caught database lock error: database is locked
      Thread 0: Caught database lock error: database is locked
      Thread 15: Caught database lock error: database is locked
      Thread 18: Caught database lock error: database is locked
      Thread 17: Caught database lock error: database is locked
      Thread 0: Caught database lock error: database is locked
      Thread 2: Caught database lock error: database is locked
      Thread 1: Caught database lock error: database is locked
      Thread 8: Caught database lock error: database is locked
      Thread 15: Caught database lock error: database is locked
      Thread 13: Caught database lock error: database is locked
  2. On current branch
    1. Delete contents of ~/.sky directory. This would remove the underlying DBs for sky.
    2. Now run sky status. This would recreate the underlying DB using the current branch code.
    3. Run attack_db.py script.
    4. Observe the terminal output
    # No output i.e. database didn't face locking issue.

Additional Note:

  1. The attack_db.py script has 2 tunable parameters: timeouts and operations
  2. Each of these can be increased/decreased for a more realistic simulation. Right now there are 100 ops each happening at timeout of 500 ms.

@Michaelvll Michaelvll self-requested a review September 20, 2024 08:01
Copy link
Collaborator

@Michaelvll Michaelvll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for adding this @wizenheimer and sorry for the delay! LGTM.

@Michaelvll Michaelvll added this pull request to the merge queue Oct 14, 2024
Merged via the queue into skypilot-org:master with commit 340f384 Oct 14, 2024
20 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants