Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: globs & regexps for SitemapRequestList #2631

Merged
merged 4 commits into from
Aug 26, 2024

Conversation

barjin
Copy link
Contributor

@barjin barjin commented Aug 23, 2024

Adds globs, regexps and exclude options to SitemapRequestList to make it list only URLs that match those patterns.

While this looked like a single-responsibility violation to me at first, these changes are imo quite justifiable, as the possible use cases (wanting to crawl only a part of a website while utilizing the official sitemap) seem to be very sane.

@barjin barjin added the adhoc Ad-hoc unplanned task added during the sprint. label Aug 23, 2024
@barjin barjin requested a review from janbuchar August 23, 2024 12:12
@barjin barjin self-assigned this Aug 23, 2024
@github-actions github-actions bot added this to the 96th sprint - Tooling team milestone Aug 23, 2024
@github-actions github-actions bot added t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics. labels Aug 23, 2024
Comment on lines 239 to 247
!this.urlExcludePatternObjects.some((patternObject) => {
const { regexp, glob } = patternObject;
return (regexp && url.match(regexp)) || (glob && minimatch(url, glob, { nocase: true }));
}) &&
(this.urlPatternObjects.length === 0 ||
this.urlPatternObjects.some((patternObject) => {
const { regexp, glob } = patternObject;
return (regexp && url.match(regexp)) || (glob && minimatch(url, glob, { nocase: true }));
}))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bruh, extract those massive boolean expressions into variables please.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I first thought of a function that would return these arrow functions, closuring a URL (so we could only write this.urlPatternObjects.some(matchUrl(url)) - imo beautifully readable). Both of the arrow functions currently are the same thing. I just didn't want to define this matchUrl function in the middle of isUrlMatchingPatterns (as it's gonna be called many many many times) - and making it a private method in SitemapRequestList didn't really feel right either.

Idk, what do you think?

@janbuchar janbuchar self-requested a review August 23, 2024 13:54
Copy link
Contributor

@janbuchar janbuchar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, except for that one comment

Comment on lines 237 to 249
private isUrlMatchingPatterns(url: string): boolean {
return (
!this.urlExcludePatternObjects.some((patternObject) => {
const { regexp, glob } = patternObject;
return (regexp && url.match(regexp)) || (glob && minimatch(url, glob, { nocase: true }));
}) &&
(this.urlPatternObjects.length === 0 ||
this.urlPatternObjects.some((patternObject) => {
const { regexp, glob } = patternObject;
return (regexp && url.match(regexp)) || (glob && minimatch(url, glob, { nocase: true }));
}))
);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about something like this:

Suggested change
private isUrlMatchingPatterns(url: string): boolean {
return (
!this.urlExcludePatternObjects.some((patternObject) => {
const { regexp, glob } = patternObject;
return (regexp && url.match(regexp)) || (glob && minimatch(url, glob, { nocase: true }));
}) &&
(this.urlPatternObjects.length === 0 ||
this.urlPatternObjects.some((patternObject) => {
const { regexp, glob } = patternObject;
return (regexp && url.match(regexp)) || (glob && minimatch(url, glob, { nocase: true }));
}))
);
}
private isUrlMatchingPatterns(url: string): boolean {
const matchesSomeExcludePattern = this.urlExcludePatternObjects.some((patternObject) => {
const { regexp, glob } = patternObject;
return (regexp && url.match(regexp)) || (glob && minimatch(url, glob, { nocase: true }));
})
const matchesSomeIncludePattern = this.urlPatternObjects.some((patternObject) => {
const { regexp, glob } = patternObject;
return (regexp && url.match(regexp)) || (glob && minimatch(url, glob, { nocase: true }));
})
return (
!matchesSomeExcludePattern &&
(this.urlPatternObjects.length === 0 || matchesSomeIncludePattern)
);
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, right, I guess I went a bit overboard... but I'd argue that my solution makes it ever so slightly even DRYier.

Tomayto, tomahto, I think both of the solutions solve the same issue of readability in the condition, albeit each one a bit differently.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, what you did is cool. Feel free to merge.

@barjin barjin merged commit b5fd3a9 into master Aug 26, 2024
9 checks passed
@barjin barjin deleted the feat/sitemap-request-list-filtering branch August 26, 2024 09:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
adhoc Ad-hoc unplanned task added during the sprint. t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants