2u/course optimizer #35887

rayzhou-bit · 2024-11-20T00:22:28Z

Description

This PR creates the backend for the Course Optimizer Link Checker, which scans a course for broken links. The implementation mirrors what is currently in place for export. Two APIs are created here:

link_check POST

Queues a task on the Celery queue to start the link check process.
Results of the broken-link scan are stored as a list of tuples: [block_id, broken_link]

link_check_status GET

Returns the status of the link check process.
Returns the results of link_check if the process succeeded.
The result Data Transfer Object (DTO) returns broken links along with relevant ancestor data for the block they are found in.
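As a rough illustration of the DTO shape described above, the sketch below nests broken links under their ancestor blocks, keyed by block id. The function name `build_dto` and the input format (an ancestor chain per broken link) are assumptions made for this example; this is not the PR's `_create_dto`.

```python
from typing import Dict, List, Tuple

# Hypothetical input shape: for each broken link, the chain of
# (block_id, display_name) pairs from the top-level section down to the
# block that contains the link.
Chain = List[Tuple[str, str]]


def build_dto(results: List[Tuple[Chain, str]]) -> Dict:
    """Nest broken links under their ancestors, keyed by block id."""
    root: Dict = {}
    for chain, broken_link in results:
        node = root
        for block_id, display_name in chain:
            # Reuse the node if this ancestor was already seen.
            node = node.setdefault(block_id, {"display_name": display_name})
        node.setdefault("broken_links", []).append(broken_link)
    return root
```

Two links found in the same block end up in a single `broken_links` list under that block, which matches the nesting in the sample response shown under Testing instructions.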

Technical considerations:

The results of the link check scan are currently saved as a UserTaskArtifact file. While this is the simplest to implement, arguments can be made for saving this data in database tables instead.
Benefits of using a UserTaskArtifact file: Easy implementation, as this mimics the current export functionality.
Benefits of using a database table: Good foundation for accessing thinner slices of broken-link data. While not needed for the functionality currently being developed, it could be useful for future updates. For example, authors could be notified of the broken links in a quiz a couple of days before learners take it. It would also be easier to analyze the data, such as finding the average number of broken links per course.
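To make the database-table option concrete, here is a minimal sketch using an in-memory SQLite database. The table name `broken_link`, its columns, and the sample rows are all hypothetical; the PR itself stores results in a UserTaskArtifact file and defines no such table.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory table with a few made-up broken links."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE broken_link (course_id TEXT, block_id TEXT, url TEXT)"
    )
    conn.executemany(
        "INSERT INTO broken_link VALUES (?, ?, ?)",
        [
            ("course-A", "block-1", "/assets/missing.png"),
            ("course-A", "block-2", "https://example.com/gone"),
            ("course-B", "block-9", "/assets/old.pdf"),
        ],
    )
    return conn


def average_broken_links_per_course(conn: sqlite3.Connection) -> float:
    """Average the per-course counts of broken links."""
    row = conn.execute(
        "SELECT AVG(n) FROM "
        "(SELECT COUNT(*) AS n FROM broken_link GROUP BY course_id)"
    ).fetchone()
    return row[0]
```

Note that this query only averages over courses that have at least one row in the table; courses with zero broken links would need a separate course table to be counted.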

Supporting information

https://2u-internal.atlassian.net/browse/TNL-11782

Testing instructions

The following example is for demo course course-v1:edX+DemoX+Demo_Course.

Copy the curl for an export call in your local environment.

curl 'http://localhost:18010/export/course-v1:edX+DemoX+Demo_Course' \
  -X 'POST' \
  -H 'Accept: application/json, text/javascript, */*; q=0.01' \
  -H 'Accept-Language: en-US,en;q=0.9' \
  -H 'Cache-Control: no-cache' \
  -H 'Connection: keep-alive' \
  -H 'Content-Length: 0' \
  ...

Replace export with link_check.
Make this call in the terminal.
This should return the following if successful.

{
  "LinkCheckStatus": 1
}

Access http://localhost:18010/link_check_status/course-v1:edX+DemoX+Demo_Course in your browser. You should see the results of the link check scan.

{
  "LinkCheckStatus": 3,
  "LinkCheckOutput": {
    "course": {
      "display_name": "Demonstration Course",
      "d8a6192ade314473a78242dfeedfbf5b": {
        "display_name": "Introduction",
        "edx_introduction": {
          "display_name": "Demo Course Overview",
          "vertical_0270f6de40fc": {
            "display_name": "Introduction: Video and Sequences",
            "030e35c4756a4ddc8d40b95fbbfff4d4": {
              "display_name": "Blank HTML Page",
              "url": "/course/course-v1:edX+DemoX+Demo_Course/editor/html/block-v1:edX+DemoX+Demo_Course+type@html+block@030e35c4756a4ddc8d40b95fbbfff4d4",
              "broken_links": [
                "/assets/courseware/v1/506da5d6f866e8f0be44c5df8b6e6b2a/asset-v1:edX+DemoX+Demo_Course+type@asset+block/getting-started_x250.png",
                ...
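The two manual steps above (queue the scan, then check the status) could be scripted roughly as follows. This is a sketch under assumptions: it treats 3 as success because that is the value in the sample response, and treats any negative value as failure because the review discussion suggests failures are reported as negative numbers. `fetch_status` is a hypothetical callable that performs an authenticated GET against link_check_status and returns the parsed JSON.

```python
import time
from typing import Callable, Dict, Optional


def wait_for_link_check(
    fetch_status: Callable[[], Dict],
    interval_seconds: float = 2.0,
    max_attempts: int = 30,
) -> Optional[Dict]:
    """Poll until the scan succeeds and return LinkCheckOutput.

    Returns None if the scan has not finished after max_attempts polls.
    """
    for _ in range(max_attempts):
        payload = fetch_status()
        status = payload.get("LinkCheckStatus")
        if status == 3:  # assumed to mean "succeeded"
            return payload.get("LinkCheckOutput")
        if isinstance(status, int) and status < 0:  # assumed to mean "failed"
            raise RuntimeError(f"link check failed with status {status}")
        time.sleep(interval_seconds)
    return None
```

Passing the HTTP call in as a function keeps the polling logic independent of how the session is authenticated, which the testing instructions handle by reusing the headers from a copied export request.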

Other information

This PR is to be paired with a frontend PR in frontend-app-authoring.

cms/djangoapps/contentstore/tasks.py

cms/djangoapps/contentstore/views/course_optimizer.py

bszabo · 2024-11-20T15:35:20Z

cms/djangoapps/contentstore/views/course_optimizer.py

+            json_content = json.loads(content)
+            broken_links_dto = _create_dto(json_content, request.user)
+    elif task_status.state in (UserTaskStatus.FAILED, UserTaskStatus.CANCELED):
+        status = max(-(task_status.completed_steps + 1), -2)


Using a max() function to compute status is highly suspicious! (Ditto the min function below). It seems like you're combining the number of completed steps with information as to whether the task failed or was canceled. Have you considered using independent variables or fields to capture these rather disparate concepts?

This is code I copied over from import_export.py. I believe the api will return a negative number on a fail related to the step in the process.
I agree this looks bad... but I would want to update code in both places at the same time. Maybe in a separate PR.

Separate PR but part of the same task? General stewardship rule is "leave the code a little better than you found it"

Focus:
below the line. Not important.
You can just go with whatever works for you.

I see people often just copying over code from older places since it just works, and thus creating new not-great code. I don't generally think that's a good idea. In my opinion we can extract code that's reused into helper functions if the code is good, but if not, I'd prefer writing new code that is better. It also means the code author needs to think and understand the code a little bit more in-depth than when they copy it.

If you want to change the code but it's too much trouble to extract the function from the other place, I'd say just change it in the new place and then just extract the parts that are still the same

Focus:
Part of improving task statuses
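For readers following this thread, the sketch below shows what the quoted expression evaluates to. It only restates the arithmetic of `max(-(task_status.completed_steps + 1), -2)`; the function names are made up for this example, and it assumes `completed_steps` is never negative.

```python
def failure_status(completed_steps: int) -> int:
    """The expression from the diff, isolated."""
    return max(-(completed_steps + 1), -2)


def failure_status_explicit(completed_steps: int) -> int:
    """Same result for completed_steps >= 0, written without max()."""
    return -1 if completed_steps == 0 else -2


if __name__ == "__main__":
    for steps in range(4):
        print(steps, failure_status(steps))
```

In other words, the value is -1 when the task failed before completing any step and -2 otherwise, so the step count is clamped rather than reported in full.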

bszabo · 2024-11-20T15:44:41Z

cms/djangoapps/contentstore/views/course_optimizer.py

+
+log = logging.getLogger(__name__)
+
+STATUS_FILTERS = user_tasks_settings.USER_TASKS_STATUS_FILTERS


A one line comment as to how these will be used would be helpful. The module filters user tasks on the basis of these, but I'm unfamiliar with this package's use. The one line saves me the trouble of having to go read up on the package if all I want is some notion of what it's used for.

The comment above "# Tuple containing zero or more filters for UserTaskStatus listing REST API calls." does not help my understanding at all. I understand we apply filters. What do we apply them for?

I went through code quite a bit to figure that out, which was not straightforward.

The result:
This is an adjustable Django setting. It specifies general filters that are applied to all tasks before they are returned to the user. Currently this setting is not specified in edx-app settings, so it falls back to a default filter. It is quite hard to find where that default is defined (it lives in another repo, django-user-tasks).
The code is here:
https://github.com/openedx/django-user-tasks/blob/20e4e6eb81f5e981d5bbdba390ffcf8e6c3e0d0a/user_tasks/filters.py#L9

So it will be quite hard for a user to understand what the filter actually does or what its purpose is.

The explaining comments for the default filter are:

Default filter for UserTaskArtifact listings in the REST API. Ensures that superusers can see all artifacts, but other users can only see artifacts for tasks they personally triggered.

And then more specific:

Filter out any artifacts which the requesting user does not have permission to view.

On top of that, there are two sets of these filters - one is for task artifacts, one for statuses. They function the exact same way.

This is important to capture on edx-platform in some way. Maybe something like: "These filters can be overridden in django settings of edx-platform. If they are not, the default behavior is the following: ..."

Focus:
Above the line - it's just a comment to add and it's super helpful I feel
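The default behaviour described in this thread (superusers see everything, other users see only what they triggered) can be sketched in plain Python as below. `User`, `TaskStatus`, and `default_status_filter` are stand-ins invented for this example; they are not the django-user-tasks implementation, which is linked above.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class User:
    id: int
    is_superuser: bool = False


@dataclass
class TaskStatus:
    name: str
    user_id: int  # id of the user who triggered the task


def default_status_filter(
    statuses: Iterable[TaskStatus], user: User
) -> List[TaskStatus]:
    """Superusers see all statuses; others see only their own."""
    if user.is_superuser:
        return list(statuses)
    return [status for status in statuses if status.user_id == user.id]
```

A one-line comment next to STATUS_FILTERS stating this rule would address the original request without requiring readers to open the other repository.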

rayzhou-bit · 2024-11-21T18:35:50Z

@bszabo I updated a lot of the organization in tasks.py. I agree with you on the iffy code practices (using max / min / integer for status), but this is currently how UserTaskStatus is used and I feel it's better to follow it for now.

bszabo · 2024-11-21T18:55:50Z

Thanks for the editorial changes, Ray. I'm stepping away from this review with the expectation that Jesper will give it a lookover from a functional perspective. If it's possible to attend to the funky status definition before moving on to new things, I would strongly recommend that, even if it ends up being in a different PR.

bszabo · 2024-11-21T19:13:36Z

If you take a step back and look at this search for broken links as a first installment towards course optimization, you can see that course optimization will entail a sequence of activities being carried out, each intended to potentially improve a course. Viewed that way, the natural questions to ask will be "which activity is currently being worked on?" and "what is the status for activity X?". For the latter question the natural answers will be not started, in progress, succeeded, or failed with error message Y.

It seems to me that it would make sense to organize even this first installment somewhat along those lines. The solution you borrowed from import/export is conflating concepts in a way I don't think is good.
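One way to picture the per-activity status suggested here is a small record with independent fields for the activity, its state, and an optional error message. This is only a sketch of the idea; the names `ActivityState` and `ActivityStatus` are hypothetical and do not appear in the PR.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ActivityState(Enum):
    NOT_STARTED = "not started"
    IN_PROGRESS = "in progress"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass(frozen=True)
class ActivityStatus:
    activity: str
    state: ActivityState = ActivityState.NOT_STARTED
    error_message: Optional[str] = None

    def __post_init__(self) -> None:
        # An error message accompanies a failure, and only a failure.
        has_error = self.error_message is not None
        if (self.state is ActivityState.FAILED) != has_error:
            raise ValueError(
                "error_message must be set if and only if state is FAILED"
            )
```

With this shape, "which activity" and "what state" are answered by separate fields instead of being encoded together in one signed integer.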

jesperhodge · 2024-11-22T15:54:14Z

cms/djangoapps/contentstore/views/course_optimizer.py

+    requested_format = request.GET.get('_accept', request.META.get('HTTP_ACCEPT', 'text/html'))
+
+    check_broken_links.delay(request.user.id, course_key_string, request.LANGUAGE_CODE)
+    return JsonResponse({'LinkCheckStatus': 1})


Some thoughts for readability:
I have the impression that 1 here seems like a magic number. If I look at the code later, it would be hard to figure out what it means, and it would be easy for me when editing the code to return something that is somehow wrong.

I'll need to look in another function to figure out what link check statuses there are according to a comment. And if someone changes the logic but forgets to update that comment, that can lead to bugs.

A pattern I like to use to avoid this problem is to define something similar to an enum.

It could be:

LINK_CHECK_STATUSES = { "IN_PROGRESS": 1, "SUCCESS": 3, ... }

And then here and in other places you can just use it like
return JsonResponse({'LinkCheckStatus': LINK_CHECK_STATUSES["IN_PROGRESS"]})

Focus:
Above the line, important
You're already overhauling link check statuses so when you've implemented that you can just resolve this discussion
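A variant of the dictionary suggested above is Python's `IntEnum`, sketched below. Only the two values that appear in this PR (1 and 3) are listed; the remaining statuses are elided in the comment and are deliberately left out here. The class name `LinkCheckStatus` is an assumption for this example.

```python
import json
from enum import IntEnum


class LinkCheckStatus(IntEnum):
    # Other statuses exist but are not spelled out in this discussion.
    IN_PROGRESS = 1
    SUCCESS = 3


def queued_response_body() -> str:
    """Serialize the body that the POST view returns after queueing."""
    return json.dumps({"LinkCheckStatus": LinkCheckStatus.IN_PROGRESS})
```

Because `IntEnum` members are integers, the standard `json` module serializes them as plain numbers, so the wire format stays the same while the code reads `LinkCheckStatus.IN_PROGRESS` instead of a bare 1.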

jesperhodge · 2024-11-22T16:04:41Z

cms/djangoapps/contentstore/views/course_optimizer.py

+            json_content = json.loads(content)
+            broken_links_dto = _create_dto(json_content, request.user)
+    elif task_status.state in (UserTaskStatus.FAILED, UserTaskStatus.CANCELED):
+        status = max(-(task_status.completed_steps + 1), -2)


Focus:
below the line. Not important.
You can just go with whatever works for you.

I see people often just copying over code from older places since it just works, and thus creating new not-great code. I don't generally think that's a good idea. In my opinion we can extract code that's reused into helper functions if the code is good, but if not, I'd prefer writing new code that is better. It also means the code author needs to think and understand the code a little bit more in-depth than when they copy it.

If you want to change the code but it's too much trouble to extract the function from the other place, I'd say just change it in the new place and then just extract the parts that are still the same

jesperhodge · 2024-11-22T16:24:07Z

cms/djangoapps/contentstore/views/course_optimizer.py

+    course_key = CourseKey.from_string(course_key_string)
+    if not has_course_author_access(request.user, course_key):
+        raise PermissionDenied()
+    courselike_block = modulestore().get_course(course_key)


What's a courselike_block? Naming isn't that clear to me

Focus:
nit. Not important.

We prefer the term "learning context" over "courselike" now. But in this case, I doubt this works with libraries at all, so why not just call it course_root_block and be super clear?

jesperhodge · 2024-11-22T16:25:48Z

cms/djangoapps/contentstore/views/course_optimizer.py

+    courselike_block = modulestore().get_course(course_key)
+    if courselike_block is None:
+        raise Http404
+    context = {


Is this context variable used anywhere? I may just be blind

Focus:
Below the line. Nice to have but not important.

jesperhodge · 2024-11-22T16:52:35Z

cms/djangoapps/contentstore/views/course_optimizer.py

+
+log = logging.getLogger(__name__)
+
+STATUS_FILTERS = user_tasks_settings.USER_TASKS_STATUS_FILTERS


The comment above "# Tuple containing zero or more filters for UserTaskStatus listing REST API calls." does not help my understanding at all. I understand we apply filters. What do we apply them for?

I went through code quite a bit to figure that out, which was not straightforward.

The result:
This is an adjustable Django setting. It specifies general filters that are applied to all tasks before they are returned to the user. Currently this setting is not specified in edx-app settings, so it falls back to a default filter. It is quite hard to find where that default is defined (it lives in another repo, django-user-tasks).
The code is here:
https://github.com/openedx/django-user-tasks/blob/20e4e6eb81f5e981d5bbdba390ffcf8e6c3e0d0a/user_tasks/filters.py#L9

So it will be quite hard for a user to understand what the filter actually does or what its purpose is.

The explaining comments for the default filter are:

Default filter for UserTaskArtifact listings in the REST API. Ensures that superusers can see all artifacts, but other users can only see artifacts for tasks they personally triggered.

And then more specific:

Filter out any artifacts which the requesting user does not have permission to view.

On top of that, there are two sets of these filters - one is for task artifacts, one for statuses. They function the exact same way.

This is important to capture on edx-platform in some way. Maybe something like: "These filters can be overridden in django settings of edx-platform. If they are not, the default behavior is the following: ..."

jesperhodge · 2024-11-22T20:29:22Z

cms/djangoapps/contentstore/tasks.py

+
+    # catch all exceptions so we can record useful error messages
+    except Exception as exception:  # pylint: disable=broad-except
+        LOGGER.exception('Error checking links for course %s', courselike_key, exc_info=True)


So... I see this is done this way in other tasks.
Is this how Celery works? Should no actual error be raised in the task? Because this swallows all errors.
I understand it logs the exception. But I'm a bit confused, because I don't know whether this is actually the correct way to handle it with Celery, or whether the error should be left uncaught so that it can cancel the running Celery task. Need to look into it.

I don't exactly know how this logger is configured but this way I don't see datadog showing an exception, for example. Maybe if we do swallow this error, we should add some logic to make it show in datadog celery errors? Or do you think the logger exception somehow does this?

Focus:
Out of scope. Create issue to look into this.
Quite important but better done separately.
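For the follow-up issue, one option the question points at is to log the error and then re-raise it, so that whatever runs the task still sees the failure. The sketch below shows that pattern in isolation; `run_check` and its `check` argument are hypothetical, and whether re-raising is right for this task depends on how the task status is meant to be recorded, which this thread leaves open.

```python
import logging

LOGGER = logging.getLogger(__name__)


def run_check(check, course_key):
    """Run `check`, log any failure with its traceback, then re-raise."""
    try:
        return check(course_key)
    except Exception:
        # Logger.exception() already attaches the traceback, so passing
        # exc_info=True as in the quoted diff is redundant.
        LOGGER.exception("Error checking links for course %s", course_key)
        raise
```

With the bare `raise`, the original exception and traceback propagate unchanged, so an error-tracking integration that hooks into uncaught task exceptions would still have something to report.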

rayzhou-bit marked this pull request as draft November 20, 2024 00:23

bszabo reviewed Nov 20, 2024

rayzhou-bit added 6 commits November 21, 2024 13:30

feat: init

3ab1d74

feat: apis

f616fce

feat: url processing

2fca01d

chore: cleanup

712041e

feat: tasks code readability

355533b

feat: name and description changes to course opti

e9fb603

rayzhou-bit force-pushed the 2u/course-optimizer branch from dcf94db to e9fb603 Compare November 21, 2024 18:30

rayzhou-bit requested a review from bszabo November 21, 2024 18:35

rayzhou-bit added 2 commits November 21, 2024 19:24

feat: remove GET part of link_check

84cae82

feat: reorg code around status

f53d578

jesperhodge requested changes Nov 22, 2024

jesperhodge reviewed Nov 22, 2024

feat: some code cleanup

0e41efb

feat: replace space with dash in status

233eb1f

jesperhodge mentioned this pull request Nov 26, 2024

Feat course optimizer page openedx/frontend-app-authoring#1533

Draft

rayzhou-bit and others added 6 commits December 2, 2024 21:16

feat: v0 rest_api wip

fc021ee

fix: remove code from old url code space

34ec30a

feat: messy new api wip

927b8c0

feat: make course optimizer scan only published version

d125084

Efficient logic to create DTO for link_check_status api (#35966)

6f98200

feat: locked link (#35976)

3f82c62
