Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added support for spec compliant URL parsing #3056

Merged
merged 4 commits into from
Oct 30, 2020

Conversation

Aenigma
Copy link
Contributor

@Aenigma Aenigma commented Oct 6, 2020

Hello! I was looking at another project while doing hacktoberfest wondering how URL parsing works for lbry and stumbled upon this file. After studying it, I noticed some inconsistencies with the spec so I decided I may have enough context to be able to fix it.

I believe this would close #2832

I saw that in the issue that you guys wanted to preserve support for the legacy URL scheme so I made it so that the new scheme is attempted before attempting the legacy.

I didn't think it would have been wise to modify the regex to accept both as that would make : ambiguous in terms of whether or not it was a sequence or a claim.

I also updated the tests so that the correct things are asserted against each other but didn't really try to add any more.

Running the whole test suite, I was able to get a consistent number of failures, so I don't think I added a regression. Though, it did seem like test_database.TestSQLiteRace.test_unhandled_sqlite_misuse seemed to toggle a fail or pass inconsistently, I didn't think that had anything to do with my changes.

Also, I know this repo isn't participating in hacktoberfest, but it'd be super cool if I could get t-shirt points if someone added the hacktoberfest-accepted label to this PR...

@kauffj
Copy link
Member

kauffj commented Oct 7, 2020

Thank you @Aenigma! If you have a LBRY address or channel, please email it to me at jeremy@lbry.com.

@eukreign this was missed on the stand up today. Should you review this or @lyoshenka?

@eukreign
Copy link
Member

@Aenigma thank you for this PR, sorry it took so long to respond.

regarding this comment "I didn't think it would have been wise to modify the regex to accept both as that would make : ambiguous in terms of whether or not it was a sequence or a claim.", can you elaborate on what you mean?

For example, if someone passes "foo:1" how would you disambiguate if this is claim_sequence or claim_id? It seems like it's impossible to disambiguate in a useful way, maybe I'm missing something though.

Thanks!

@Aenigma
Copy link
Contributor Author

Aenigma commented Oct 20, 2020

No problem. It is impossible to distinguish between the two semantically. You would effectively need to try one after the other in some way.

Making a regex do that would have made the code much harder to read is all I meant. I was (poorly) suggesting that by hinting at the nature of the problem here.

That's why in my approach I use two patterns: one for the old regex and one for the new and then try them until I get a hit.

@eukreign
Copy link
Member

@Aenigma would you be willing to change it to just use one regex? just or together :, # to parse as the claim_id?

@Aenigma
Copy link
Contributor Author

Aenigma commented Oct 20, 2020

Yeah, but that change will have certain consequences.

For example: @foo:1/bar#2

We can tell easily that this would have been parsed before as:

channel_sequence: 1
stream_claim_id: 2

And we know it's in the legacy format because it has a # sign. But if you use a regex that naively just accepts either for the claim_id, you'll get:

channel_claim_id: 1
stream_claim_id: 2

The consequence of misinterpreting this is that when it gets converted back to a string via PathSegment, it'll convert the URL to the new format as @foo:1/bar:2 which means something differently in the new format.

At least if you convert foo:1 and it's supposed to be a sequence and it's actually a claim, the outputted URL would still be foo:1, though understood differently.


Here's another example: @foo*1/bar#2

Here, it'll be parsed as:

channel_sequence: 1  
stream_claim_id: 2

But actually, this is a mishmash of the two formats and should not be allowed in either.

If the potential problem of situations like this are acceptable, that's fine -- I don't really grasp the reach of this utility so I genuinely don't know, but I thought that it likely wouldn't be OK.

The alternative is to make the regex be aware of the whole URL. This would involve abusing regexes to use lookaheads (which I'm not really comfortable doing) or doing something like _oneof(URL_REGEX, URL_REGEX_LEGACY) which doesn't provide much value to just doing it in Python. Additionally, since regex cannot use the same named groups twice, parsing out the match results would require more code.

@eukreign
Copy link
Member

@Aenigma I think we would want @foo:1/bar#2 to parse as two claim_ids for the channel and the claim inside the channel. we almost never used the : to mean sequence... this is why the spec change.

@kauffj kauffj requested a review from lyoshenka October 26, 2020 20:28
@lbry-bot lbry-bot assigned lyoshenka and eukreign and unassigned eukreign Oct 26, 2020
@kauffj kauffj removed the request for review from eukreign October 26, 2020 20:28
@Aenigma
Copy link
Contributor Author

Aenigma commented Oct 27, 2020

Sorry, was a bit low on bandwidth the past week. I made the change and the changeset seems quite minimal now.

However, I'm noticing that the CI faiilure seems to be related to my change:

  File "/home/runner/work/lbry-sdk/lbry-sdk/tests/integration/blockchain/test_resolve_command.py", line 121, in test_advanced_resolve
    await self.assertResolvesToClaimId('foo:1', claim_id1)
  File "/home/runner/work/lbry-sdk/lbry-sdk/tests/integration/blockchain/test_resolve_command.py", line 20, in assertResolvesToClaimId
    self.assertEqual(claim_id, other['claim_id'])
KeyError: 'claim_id'

Unfortunately, I can't seem to run the integration tests to run locally so I'm at a loss to figure out what the problem is here.

Aenigma and others added 4 commits October 30, 2020 10:44
Legacy URLs are preserved by attempting to parse the new URL format and,
on failing that, it'll attempt the legacy one.

Tests had to be updated such that the correct things are asserted
against each other.
This removes the code for trying multiple patterns and the setup for it

Added a few unit tests to check that the parsed URL is as expected
Copy link
Member

@eukreign eukreign left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the PR @Aenigma !

@lbry-bot lbry-bot assigned eukreign and unassigned eukreign Oct 30, 2020
@eukreign eukreign merged commit 6826cc3 into lbryio:master Oct 30, 2020
@eukreign eukreign added area: claims type: new feature New functionality that does not exist yet labels Nov 9, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: claims type: new feature New functionality that does not exist yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update URL parsing to reflect spec updates
4 participants