Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

2u/course optimizer #35887

Draft
wants to merge 16 commits into
base: master
Choose a base branch
from
Draft

2u/course optimizer #35887

wants to merge 16 commits into from

Conversation

rayzhou-bit
Copy link
Contributor

@rayzhou-bit rayzhou-bit commented Nov 20, 2024

Description

This PR creates the backend for Course Optimizer Link Checker, which will scan through a course and check for broken links. This functionality imitates what is currently in place for export. 2 apis are created here:

link_check POST

  • Queues a task to start the link check process through the celery queue.
  • Results for the broken links scan is stored as a list of tuples: [block_id, broken_link]

link_check_status GET

  • Returns the status of the link check process.
  • Returns results of link_check if process is successful.
  • Result Data Transfer Object returns broken links along with relevant ancestor data for the block they are found in.

Technical considerations:

  • The results of link check scan is currently saved as a UserTaskArtifact file. While this is the simplest for implementation, arguments can be made to save this data in tables instead.
  • Benefits for using UserTaskArtifact file: Easy implementation as this mimics the current export functionality.
  • Benefits for using a database table: Good foundation for accessing thinner slices of data for broken links. While not needed for the current functionality being developed, it could be useful for future updates. For example, authors could be notified on the broken links of a quiz a couple of days before learners take the quiz. Another example is it would be easier to analyze data such as finding the average number of broken links per course.

Supporting information

https://2u-internal.atlassian.net/browse/TNL-11782

Testing instructions

The following example is for demo course course-v1:edX+DemoX+Demo_Course.

  1. Copy the curl for an export call in your local environment.
curl 'http://localhost:18010/export/course-v1:edX+DemoX+Demo_Course' \
  -X 'POST' \
  -H 'Accept: application/json, text/javascript, */*; q=0.01' \
  -H 'Accept-Language: en-US,en;q=0.9' \
  -H 'Cache-Control: no-cache' \
  -H 'Connection: keep-alive' \
  -H 'Content-Length: 0' \
  ...
  1. Replace export with link_check.
  2. Make this call in the terminal.
  3. This should return the following if successful.
{
  "LinkCheckStatus": 1
}
  1. Access http://localhost:18010/link_check_status/course-v1:edX+DemoX+Demo_Course in your browser. You should see the results of the link check scan.
{
  "LinkCheckStatus": 3,
  "LinkCheckOutput": {
    "course": {
      "display_name": "Demonstration Course",
      "d8a6192ade314473a78242dfeedfbf5b": {
        "display_name": "Introduction",
        "edx_introduction": {
          "display_name": "Demo Course Overview",
          "vertical_0270f6de40fc": {
            "display_name": "Introduction: Video and Sequences",
            "030e35c4756a4ddc8d40b95fbbfff4d4": {
              "display_name": "Blank HTML Page",
              "url": "/course/course-v1:edX+DemoX+Demo_Course/editor/html/block-v1:edX+DemoX+Demo_Course+type@html+block@030e35c4756a4ddc8d40b95fbbfff4d4",
              "broken_links": [
                "/assets/courseware/v1/506da5d6f866e8f0be44c5df8b6e6b2a/asset-v1:edX+DemoX+Demo_Course+type@asset+block/getting-started_x250.png",
                ...

Other information

  • This PR is to be paired with a frontend PR in frontend-app-authoring.

@rayzhou-bit rayzhou-bit marked this pull request as draft November 20, 2024 00:23
json_content = json.loads(content)
broken_links_dto = _create_dto(json_content, request.user)
elif task_status.state in (UserTaskStatus.FAILED, UserTaskStatus.CANCELED):
status = max(-(task_status.completed_steps + 1), -2)
Copy link
Contributor

@bszabo bszabo Nov 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using a max() function to compute status is highly suspicious! (Ditto the min function below). It seems like you're combining the number of completed steps with information as to whether the task failed or was canceled. Have you considered using independent variables or fields to capture these rather disparate concepts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is code I copied over from import_export.py. I believe the api will return a negative number on a fail related to the step in the process.
I agree this looks bad... but I would want to update code in both places at the same time. Maybe in a separate PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separate PR but part of the same task? General stewardship rule is "leave the code a little better than you found it"

Copy link
Member

@jesperhodge jesperhodge Nov 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Focus:
below the line. Not important.
You can just go with whatever works for you.

I see people often just copying over code from older places since it just works, and thus creating new not-great code. I don't generally think that's a good idea. In my opinion we can extract code that's reused into helper functions if the code is good, but if not, I'd prefer writing new code that is better. It also means the code author needs to think and understand the code a little bit more in-depth than when they copy it.

If you want to change the code but it's too much trouble to extract the function from the other place, I'd say just change it in the new place and then just extract the parts that are still the same

Copy link
Member

@jesperhodge jesperhodge Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Focus:
Part of improving task statuses


log = logging.getLogger(__name__)

STATUS_FILTERS = user_tasks_settings.USER_TASKS_STATUS_FILTERS
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A one line comment as to how these will be used would be helpful. The module filters user tasks on the basis of these, but I'm unfamiliar with this package's use. The one line saves me the trouble of having to go read up on the package if all I want is some notion of what it's used for.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment above "# Tuple containing zero or more filters for UserTaskStatus listing REST API calls." does not help my understanding at all. I understand we apply filters. What do we apply them for?

I went through code quite a bit to figure that out, which was not straightforward.

The result:
This is an adjustable django setting. It specifies general filters that are applied to all tasks before they are returned to the user. Currently this setting is not specified in edx-app settings. So it defaults to a default filter. Which is quite hard to find to figure out where this is defined (in some other repo called django_user_tasks.
This is the code place:
https://github.com/openedx/django-user-tasks/blob/20e4e6eb81f5e981d5bbdba390ffcf8e6c3e0d0a/user_tasks/filters.py#L9

So it will be quite hard for a user to understand what the filter actually does or what it's purpose is.

The explaining comments for the default filter are:

    Default filter for UserTaskArtifact listings in the REST API.

    Ensures that superusers can see all artifacts, but other users
    can only see artifacts for tasks they personally triggered.

And then more specific:

Filter out any artifacts which the requesting user does not have permission to view.

On top of that, there are two sets of these filters - one is for task artifacts, one for statuses. They function the exact same way.

This is important to capture on edx-platform in some way. Maybe something like: "These filters can be overridden in django settings of edx-platform. If they are not, the default behavior is the following: ..."

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Focus:
Above the line - it's just a comment to add and it's super helpful I feel

@rayzhou-bit
Copy link
Contributor Author

@bszabo I updated a lot of the organization in tasks.py. I agree with you on the iffy code practices (using max / min / integer for status), but this is currently how UserTaskStatus is used and I feel it's better to follow it for now.

@rayzhou-bit rayzhou-bit requested a review from bszabo November 21, 2024 18:35
@bszabo
Copy link
Contributor

bszabo commented Nov 21, 2024

Thanks for the editorial changes, Ray. I'm stepping away from this review with the expectation that Jesper will give it a lookover from a functional perspective. If it's possible to attend to the funky status definition before moving on to new things, I would strongly recommend that, even if it ends up being in a different PR.

@bszabo
Copy link
Contributor

bszabo commented Nov 21, 2024

If you take a step back, and look at this search for broken links as a first installment towards course optimization, you can see that course optimization will entail a sequence of activities being carried out, with each intended to potentially improvew a course. Viewed that way, the natural questions to ask will be "which activity is currently being worked on?" and "what is the status for activity X?". For the latter question the natural answers will be not started, in progress, succeeded, or failed with error message Y.

It seems to me that it would make sense to organize even this first installment somewhat in those lines. The solution you borrowed from import/export is conflating concepts in a way I don't think is good.

requested_format = request.GET.get('_accept', request.META.get('HTTP_ACCEPT', 'text/html'))

check_broken_links.delay(request.user.id, course_key_string, request.LANGUAGE_CODE)
return JsonResponse({'LinkCheckStatus': 1})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some thoughts for readability:
I have the impression that 1 here seems like a magic number. If I look at the code later, it would be hard to figure out what it means, and it would be easy for me when editing the code to return something that is somehow wrong.

I'll need to look in another function to figure out what link check statuses there are according to a comment. And if someone changes the logic but forgets to update that comment, that can lead to bugs.

A pattern I like to use to avoid this problem is to define something similar to an enum.

It could be:

LINK_CHECK_STATUSES = {
    "IN_PROGRESS": 1,
    "SUCCESS": 3,
    ...
}

And then here and in other places you can just use it like
return JsonResponse({'LinkCheckStatus': LINK_CHECK_STATUSES["IN_PROGRESS"]})

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Focus:
Above the line, important
You're already overhauling link check statuses so when you've implemented that you can just resolve this discussion

json_content = json.loads(content)
broken_links_dto = _create_dto(json_content, request.user)
elif task_status.state in (UserTaskStatus.FAILED, UserTaskStatus.CANCELED):
status = max(-(task_status.completed_steps + 1), -2)
Copy link
Member

@jesperhodge jesperhodge Nov 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Focus:
below the line. Not important.
You can just go with whatever works for you.

I see people often just copying over code from older places since it just works, and thus creating new not-great code. I don't generally think that's a good idea. In my opinion we can extract code that's reused into helper functions if the code is good, but if not, I'd prefer writing new code that is better. It also means the code author needs to think and understand the code a little bit more in-depth than when they copy it.

If you want to change the code but it's too much trouble to extract the function from the other place, I'd say just change it in the new place and then just extract the parts that are still the same

course_key = CourseKey.from_string(course_key_string)
if not has_course_author_access(request.user, course_key):
raise PermissionDenied()
courselike_block = modulestore().get_course(course_key)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's a courselike_block? Naming isn't that clear to me

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Focus:
nit. Not important.

Copy link
Contributor

@bradenmacdonald bradenmacdonald Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We prefer the term "learning context" over "courselike" now. But in this case, I doubt this works with libraries at all, so why not just call it course_root_block and be super clear?

courselike_block = modulestore().get_course(course_key)
if courselike_block is None:
raise Http404
context = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this context variable used anywhere? I may just be blind

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Focus:
Below the line. Nice to have but not important.


log = logging.getLogger(__name__)

STATUS_FILTERS = user_tasks_settings.USER_TASKS_STATUS_FILTERS
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment above "# Tuple containing zero or more filters for UserTaskStatus listing REST API calls." does not help my understanding at all. I understand we apply filters. What do we apply them for?

I went through code quite a bit to figure that out, which was not straightforward.

The result:
This is an adjustable django setting. It specifies general filters that are applied to all tasks before they are returned to the user. Currently this setting is not specified in edx-app settings. So it defaults to a default filter. Which is quite hard to find to figure out where this is defined (in some other repo called django_user_tasks.
This is the code place:
https://github.com/openedx/django-user-tasks/blob/20e4e6eb81f5e981d5bbdba390ffcf8e6c3e0d0a/user_tasks/filters.py#L9

So it will be quite hard for a user to understand what the filter actually does or what it's purpose is.

The explaining comments for the default filter are:

    Default filter for UserTaskArtifact listings in the REST API.

    Ensures that superusers can see all artifacts, but other users
    can only see artifacts for tasks they personally triggered.

And then more specific:

Filter out any artifacts which the requesting user does not have permission to view.

On top of that, there are two sets of these filters - one is for task artifacts, one for statuses. They function the exact same way.

This is important to capture on edx-platform in some way. Maybe something like: "These filters can be overridden in django settings of edx-platform. If they are not, the default behavior is the following: ..."


# catch all exceptions so we can record useful error messages
except Exception as exception: # pylint: disable=broad-except
LOGGER.exception('Error checking links for course %s', courselike_key, exc_info=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So... I see this is done this way in other tasks.
Is this how celery works? No actual error should be raised in the task? Because this swallows all errors.
I understand it does a logger exception. But I'm a bit confused because I don't know if this actually the correct way it should be handled with celery, or whether the error should not be caught so it can cancel the running celery task? Need to look into it.

I don't exactly know how this logger is configured but this way I don't see datadog showing an exception, for example. Maybe if we do swallow this error, we should add some logic to make it show in datadog celery errors? Or do you think the logger exception somehow does this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Focus:
Out of scope. Create issue to look into this.
Quite important but better done separately.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants