
Add reason newly posted youtube video and medium posts #4595

Open · wants to merge 10 commits into base: master

Conversation

user12986714 (Contributor)

This PR tries to catch newly posted YouTube videos.

@user12986714 changed the title from "Add reason newly posted youtube video" to "Add reason newly posted youtube video and medium posts" on Sep 2, 2020

ghost commented Sep 3, 2020

Nice idea, have you tested the code?

user12986714 (Contributor, Author)

@Daniil-M-beep I would be happy if someone could help me test the code. The problem is that I am not able to create useful test accounts.


ghost commented Sep 3, 2020

@user12986714 OK, I'll try to organise some testing later today.


ghost commented Sep 3, 2020

@user12986714 Apologies, it might not be today, but I'll try to get it done relatively soon.


ghost commented Sep 6, 2020

I have tested this, and neither of the two rules works.

NobodyNada (Member) commented Sep 12, 2020

I really like the idea, but I'm skeptical of rolling our own YouTube scraper, just because it is very much subject to change and might end up being troublesome to maintain down the road. There may be a couple of more stable alternatives:

  • The YouTube Data API -- it looks like we can make 10k requests/day for free. (Of course, Google isn't exactly known for keeping its APIs operational long-term either.)
  • youtube-dl is a popular and frequently updated Python command-line tool/library for scraping YouTube. I haven't tried it and there doesn't seem to be much API documentation, but it looks like you should be able to import the library and call extract_info(url, download=False), which will return a dictionary with an upload_date field.
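A rough sketch of the youtube-dl route. The network call is commented out so the sketch stands alone; `is_recent_upload` and the 30-day threshold are my own illustration, not anything from this PR:

```python
from datetime import datetime, timezone

def is_recent_upload(upload_date, max_age_days=30):
    """Return True if a youtube-dl style upload_date string (YYYYMMDD) is recent."""
    posted = datetime.strptime(upload_date, "%Y%m%d").replace(tzinfo=timezone.utc)
    return (datetime.now(timezone.utc) - posted).days <= max_age_days

# Fetching the date itself needs a network call, roughly:
#   import youtube_dl
#   with youtube_dl.YoutubeDL({"quiet": True}) as ydl:
#       info = ydl.extract_info(url, download=False)
#   flag = is_recent_upload(info["upload_date"])
```

Keeping the age check separate from the fetch would also make the rule easy to unit-test without hitting YouTube.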

user12986714 (Contributor, Author)

> I really like the idea, but I'm skeptical of rolling our own YouTube scraper, just because it is very much subject to change and might end up being troublesome to maintain down the road. There may be a couple of more stable alternatives:
>
>   • The YouTube Data API -- it looks like we can make 10k requests/day for free. (Of course, Google isn't exactly known for keeping its APIs operational long-term either.)
>   • youtube-dl is a popular and frequently updated Python command-line tool/library for scraping YouTube. I haven't tried it and there doesn't seem to be much API documentation, but it looks like you should be able to import the library and call extract_info(url, download=False), which will return a dictionary with an upload_date field.

I agree it would be great if we could use the API; however, it requires coordination with respect to API keys and limits the extensibility of such detection mechanisms. For example, it would be useful if we later expanded to other blog sites like Blogspot.

NobodyNada (Member) commented Oct 4, 2020

> I agree it would be great if we could use the API; however, it requires coordination with respect to API keys and limits the extensibility of such detection mechanisms. For example, it would be useful if we later expanded to other blog sites like Blogspot.

@user12986714 Those are both true. However, we've had to do API key deployment in the past (e.g. for Perspective), and it's pretty simple:

  1. Add a new config entry for the API key, but make sure Smokey still runs fine without it (just with that rule disabled). That way, test instances and instances that haven't added the API key yet will still work.
  2. Update config.sample with a placeholder, and update the config on Keybase with the real key. That way, all future instances will include the key.
  3. Send a message in the runner Keybase chat reminding the runners to add the key to their existing instances.
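A minimal sketch of step 1 under assumed names — `youtube_api_key` is a hypothetical config entry, not SmokeDetector's actual one:

```python
def youtube_rule_enabled(config):
    """Enable the YouTube-date rule only when an API key is present in the config.

    Instances without the key keep running; this one rule is simply skipped,
    so test instances and not-yet-updated runners are unaffected.
    """
    return bool(config.get("youtube_api_key", "").strip())

# At startup, something like:
#   if not youtube_rule_enabled(config):
#       log("youtube_api_key missing; newly-posted-video rule disabled")
```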

As far as extensibility goes: yes, we're writing special code to use the YouTube API, but that doesn't stop us from using regexes on Medium or Blogspot. I'm worried about using regex specifically on YouTube because YT is not scraper-friendly and I would prefer to stay out of that cat-and-mouse game.

NobodyNada (Member)

I've just taken a closer look at the Medium one, too. I'm concerned by the class="bh bi at au av aw ax ay az ba fu bd bl bm", as obfuscated classes like that are usually anti-adblock measures and are therefore periodically randomized. (Also, I'm just a little bit concerned by the regex; what if someone requests the page from Europe and gets a date of 4 Oct instead of Oct 4?)

However, there's a MUCH easier way to get the date out of a Medium post. Every Medium post has the following meta tag:

<meta data-rh="true" property="article:published_time" content="[an ISO 8601 timestamp]">

Finally, since we already have BeautifulSoup, I'd suggest using that instead of regex to parse HTML, as it will be easier and more reliable.
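A sketch of that BeautifulSoup approach — the function name is mine, and the tag shape is taken from the comment above:

```python
from datetime import datetime
from bs4 import BeautifulSoup

def medium_published_time(html):
    """Pull article:published_time out of a Medium post's HTML, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("meta", attrs={"property": "article:published_time"})
    if tag is None or not tag.get("content"):
        return None
    # Medium uses an ISO 8601 timestamp such as 2020-09-02T12:00:00.000Z;
    # normalise the trailing Z so datetime.fromisoformat can parse it.
    return datetime.fromisoformat(tag["content"].replace("Z", "+00:00"))
```

Because this keys off a stable Open Graph-style property rather than obfuscated class names, it should survive Medium's periodic markup churn.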

@stale stale bot added the status: stale label Nov 7, 2020

stale bot commented Nov 8, 2020

This issue has been closed because it has had no recent activity. If this is still important, please add another comment and find someone with write permissions to reopen the issue. Thank you for your contributions.

@stale stale bot closed this Nov 8, 2020
@double-beep double-beep added the status: confirmed Confirmed as something that needs working on. label Nov 8, 2020
@double-beep double-beep reopened this Nov 8, 2020
@stale stale bot removed the status: stale label Nov 8, 2020