Added support for spec compliant URL parsing #3056

Aenigma · 2020-10-06T23:40:14Z

Hello! I was looking at another project while doing hacktoberfest wondering how URL parsing works for lbry and stumbled upon this file. After studying it, I noticed some inconsistencies with the spec so I decided I may have enough context to be able to fix it.

I believe this would close #2832

I saw in the issue that you wanted to preserve support for the legacy URL scheme, so I made it so that the new scheme is attempted before the legacy one.

I didn't think it would have been wise to modify the regex to accept both as that would make : ambiguous in terms of whether or not it was a sequence or a claim.

I also updated the tests so that the correct things are asserted against each other but didn't really try to add any more.

Running the whole test suite, I got a consistent number of failures, so I don't think I introduced a regression. test_database.TestSQLiteRace.test_unhandled_sqlite_misuse did seem to flip between pass and fail from run to run, but I don't think that has anything to do with my changes.

Also, I know this repo isn't participating in hacktoberfest, but it'd be super cool if I could get t-shirt points if someone added the hacktoberfest-accepted label to this PR...

kauffj · 2020-10-07T17:38:50Z

Thank you @Aenigma! If you have a LBRY address or channel, please email it to me at jeremy@lbry.com.

@eukreign this was missed on the stand up today. Should you review this or @lyoshenka?

eukreign · 2020-10-20T15:57:21Z

@Aenigma thank you for this PR, sorry it took so long to respond.

regarding this comment "I didn't think it would have been wise to modify the regex to accept both as that would make : ambiguous in terms of whether or not it was a sequence or a claim.", can you elaborate on what you mean?

For example, if someone passes "foo:1" how would you disambiguate if this is claim_sequence or claim_id? It seems like it's impossible to disambiguate in a useful way, maybe I'm missing something though.

Thanks!

Aenigma · 2020-10-20T16:16:51Z

No problem. It is impossible to distinguish between the two semantically. You would effectively need to try one after the other in some way.

Making a regex do that would have made the code much harder to read is all I meant. I was (poorly) suggesting that by hinting at the nature of the problem here.

That's why in my approach I use two patterns: one for the old regex and one for the new and then try them until I get a hit.
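A minimal sketch of that try-in-order idea, for illustration only. The patterns below are hypothetical and heavily simplified (NEW_SEGMENT, LEGACY_SEGMENT and parse_segment are made-up names, not the lbry-sdk code), and they handle a single path segment rather than a whole URL:

```python
import re

# Assumption: simplified stand-ins for the real patterns.
# New spec: ':' introduces a claim_id, '*' introduces a sequence number.
NEW_SEGMENT = re.compile(
    r"^(?P<name>[^:*#/]+)(?::(?P<claim_id>[0-9a-f]+)|\*(?P<sequence>\d+))?$"
)
# Legacy: '#' introduces a claim_id, ':' introduces a sequence number.
LEGACY_SEGMENT = re.compile(
    r"^(?P<name>[^:*#/]+)(?:#(?P<claim_id>[0-9a-f]+)|:(?P<sequence>\d+))?$"
)


def parse_segment(text):
    """Try the new pattern first, then fall back to the legacy one."""
    for label, pattern in (("new", NEW_SEGMENT), ("legacy", LEGACY_SEGMENT)):
        match = pattern.match(text)
        if match:
            # Keep only the groups that actually took part in the match.
            fields = {k: v for k, v in match.groupdict().items() if v is not None}
            return label, fields
    raise ValueError(f"invalid segment: {text!r}")
```

With these stand-in patterns, foo:1 is claimed by the new pattern as a claim_id, so the legacy reading (a sequence) is only reachable for inputs the new pattern rejects, such as foo#2.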

eukreign · 2020-10-20T17:33:21Z

@Aenigma would you be willing to change it to use just one regex? Just OR together : and # so that either one parses as the claim_id?

Aenigma · 2020-10-20T18:19:42Z

Yeah, but that change will have certain consequences.

For example: @foo:1/bar#2

We can tell easily that this would have been parsed before as:

channel_sequence: 1
stream_claim_id: 2

And we know it's in the legacy format because it has a # sign. But if you use a regex that naively just accepts either for the claim_id, you'll get:

channel_claim_id: 1
stream_claim_id: 2

The consequence of misinterpreting this is that when it gets converted back to a string via PathSegment, the URL comes out in the new format as @foo:1/bar:2, which means something different in the new format.

At least if foo:1 was meant as a sequence but gets parsed as a claim, the output URL would still be foo:1, just understood differently.

Here's another example: @foo*1/bar#2

Here, it'll be parsed as:

channel_sequence: 1  
stream_claim_id: 2

But actually, this is a mishmash of the two formats and should not be allowed in either.

If the potential problems in situations like these are acceptable, that's fine. I don't really grasp the reach of this utility, so I genuinely don't know, but I thought it likely wouldn't be OK.
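To make the two examples above concrete, here is a rough sketch of a single merged pattern and a round trip through it. Everything in it is an assumption for illustration: SEGMENT, parse_url and format_url are hypothetical names, the pattern is far simpler than the real one, and format_url only stands in for what PathSegment does when converting back to a string.

```python
import re

# Assumption: one merged pattern where ':' or '#' both introduce a claim_id
# and '*' introduces a sequence number.
SEGMENT = re.compile(
    r"^(?P<name>[^:*#/]+)"
    r"(?:[:#](?P<claim_id>[0-9a-f]+)|\*(?P<sequence>\d+))?$"
)


def parse_url(url):
    """Split 'channel/stream' on '/' and parse each segment on its own."""
    segments = []
    for part in url.split("/"):
        match = SEGMENT.match(part)
        if match is None:
            raise ValueError(f"invalid segment: {part!r}")
        segments.append(match.groupdict())
    return segments


def format_url(segments):
    """Serialise back using only the new-spec separators (':' and '*')."""
    out = []
    for seg in segments:
        text = seg["name"]
        if seg["claim_id"] is not None:
            text += ":" + seg["claim_id"]
        elif seg["sequence"] is not None:
            text += "*" + seg["sequence"]
        out.append(text)
    return "/".join(out)
```

Under these assumptions, @foo:1/bar#2 parses with a claim_id on both segments and is written back as @foo:1/bar:2, while the mixed form @foo*1/bar#2 is accepted even though neither format on its own allows it, because each segment is matched independently.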

The alternative is to make the regex aware of the whole URL. This would involve abusing regexes to use lookaheads (which I'm not really comfortable doing) or doing something like _oneof(URL_REGEX, URL_REGEX_LEGACY), which doesn't provide much value over just doing it in Python. Additionally, since a regex cannot use the same named group twice, parsing out the match results would require more code.
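The named-group limitation can be seen directly in Python's re module, which rejects a pattern that defines the same group name twice. The sketch below shows that, plus the kind of extra merging code a combined pattern would need. COMBINED, claim_id_of and the group names new_claim_id / legacy_claim_id are hypothetical and only cover the claim_id case:

```python
import re

# Python's re refuses to compile a pattern that reuses a group name.
try:
    re.compile(r"(?P<claim_id>:\w+)|(?P<claim_id>#\w+)")
    duplicate_allowed = True
except re.error:
    duplicate_allowed = False

# Assumption: a combined pattern therefore needs a distinct group name per
# alternative, and the caller has to merge them after matching.
COMBINED = re.compile(
    r"^(?P<name>[^:*#/]+)"
    r"(?::(?P<new_claim_id>[0-9a-f]+)|#(?P<legacy_claim_id>[0-9a-f]+))?$"
)


def claim_id_of(text):
    """Return the claim_id from whichever alternative matched, if any."""
    match = COMBINED.match(text)
    if match is None:
        return None
    return match.group("new_claim_id") or match.group("legacy_claim_id")
```

Every field that exists in both formats would need the same rename-and-merge treatment, which is the additional code referred to above.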

eukreign · 2020-10-20T19:43:28Z

@Aenigma I think we would want @foo:1/bar#2 to parse as two claim_ids, one for the channel and one for the claim inside the channel. We almost never used the : to mean sequence... this is why the spec changed.

Aenigma · 2020-10-27T19:04:34Z

Sorry, was a bit low on bandwidth the past week. I made the change and the changeset seems quite minimal now.

However, I'm noticing that the CI failure seems to be related to my change:

  File "/home/runner/work/lbry-sdk/lbry-sdk/tests/integration/blockchain/test_resolve_command.py", line 121, in test_advanced_resolve
    await self.assertResolvesToClaimId('foo:1', claim_id1)
  File "/home/runner/work/lbry-sdk/lbry-sdk/tests/integration/blockchain/test_resolve_command.py", line 20, in assertResolvesToClaimId
    self.assertEqual(claim_id, other['claim_id'])
KeyError: 'claim_id'

Unfortunately, I can't seem to get the integration tests to run locally, so I'm at a loss to figure out what the problem is here.


eukreign

thanks for the PR @Aenigma !

kauffj requested a review from eukreign October 7, 2020 17:37

lbry-bot assigned eukreign Oct 7, 2020

kauffj requested a review from lyoshenka October 7, 2020 17:38

lbry-bot assigned lyoshenka and unassigned eukreign Oct 7, 2020

eukreign added the hacktoberfest-accepted label Oct 9, 2020

lyoshenka removed their request for review October 12, 2020 14:47

lbry-bot unassigned lyoshenka Oct 12, 2020

kauffj assigned eukreign Oct 20, 2020

kauffj requested a review from lyoshenka October 26, 2020 20:28

lbry-bot assigned lyoshenka and eukreign and unassigned eukreign Oct 26, 2020

kauffj removed the request for review from eukreign October 26, 2020 20:28

lbry-bot unassigned eukreign Oct 26, 2020

lyoshenka requested a review from eukreign October 27, 2020 21:27

lbry-bot assigned eukreign Oct 27, 2020

lyoshenka removed their request for review October 28, 2020 17:04

lbry-bot unassigned lyoshenka Oct 28, 2020

Aenigma and others added 4 commits October 30, 2020 10:44

Added support for spec compliant URL parsing

87bc186

Legacy URLs are preserved by attempting to parse the new URL format and, on failing that, it'll attempt the legacy one. Tests had to be updated such that the correct things are asserted against each other.

Allow : or # for claim_id

6437de1

This removes the code for trying multiple patterns and the setup for it. Added a few unit tests to check that the parsed URL is as expected.

minor fixup

e542d50

update test to use new url spec

657d690

eukreign force-pushed the feature/fix-parse-new-spec branch from 534b3c4 to 657d690 Compare October 30, 2020 14:44

eukreign approved these changes Oct 30, 2020

lbry-bot assigned eukreign and unassigned eukreign Oct 30, 2020

eukreign merged commit 6826cc3 into lbryio:master Oct 30, 2020

eukreign added the area: claims and type: new feature labels Nov 9, 2020
