
Add reason newly posted youtube video and medium posts #4595

Open · wants to merge 10 commits into base: master

Conversation

user12986714 (Contributor)

This PR tries to catch newly posted YouTube videos.

@user12986714 changed the title from "Add reason newly posted youtube video" to "Add reason newly posted youtube video and medium posts" on Sep 2, 2020

ghost commented Sep 3, 2020

Nice idea, have you tested the code?

user12986714 (Contributor, Author)

@Daniil-M-beep I would be happy if someone could help me test the code. The problem is that I am not able to create useful test accounts.


ghost commented Sep 3, 2020

@user12986714 OK, I'll try to organise some testing later today.


ghost commented Sep 3, 2020

@user12986714 Apologies, it might not be today, but I'll try to get it done relatively soon.


ghost commented Sep 6, 2020

I have tested this, and neither of the two rules works.

NobodyNada (Member) commented Sep 12, 2020

I really like the idea, but I'm skeptical of rolling our own YouTube scraper, just because it is very much subject to change and might end up being troublesome to maintain down the road. There may be a couple of more stable alternatives:

  • The YouTube Data API -- it looks like we can make 10k requests/day for free. (Of course, Google isn't exactly known for keeping its APIs operational long-term either.)
  • youtube-dl is a popular and frequently updated Python command-line tool/library for scraping YouTube. I haven't tried it and there doesn't seem to be much API documentation, but it looks like you should be able to import the library and call extract_info(url, download=False), which will return a dictionary with an upload_date field.
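A rough sketch of the youtube-dl route. The network call is commented out so the sketch stands alone; `is_recent_upload` and the 30-day threshold are my own illustration, not anything from this PR:

```python
from datetime import datetime, timezone

def is_recent_upload(upload_date, max_age_days=30):
    """Return True if a youtube-dl style upload_date string (YYYYMMDD) is recent."""
    posted = datetime.strptime(upload_date, "%Y%m%d").replace(tzinfo=timezone.utc)
    return (datetime.now(timezone.utc) - posted).days <= max_age_days

# Fetching the date itself needs a network call, roughly:
#   import youtube_dl
#   with youtube_dl.YoutubeDL({"quiet": True}) as ydl:
#       info = ydl.extract_info(url, download=False)
#   flag = is_recent_upload(info["upload_date"])
```

Keeping the age check separate from the fetch would also make the rule easy to unit-test without hitting YouTube.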

user12986714 (Contributor, Author)

> I really like the idea, but I'm skeptical of rolling our own YouTube scraper, just because it is very much subject to change and might end up being troublesome to maintain down the road. There may be a couple of more stable alternatives:
>
>   • The YouTube Data API -- it looks like we can make 10k requests/day for free. (Of course, Google isn't exactly known for keeping its APIs operational long-term either.)
>   • youtube-dl is a popular and frequently updated Python command-line tool/library for scraping YouTube. I haven't tried it and there doesn't seem to be much API documentation, but it looks like you should be able to import the library and call extract_info(url, download=False), which will return a dictionary with an upload_date field.

I agree it would be great if we could use the API; however, it requires coordination with respect to API keys and limits the extensibility of such detection mechanisms. For example, it would be useful if we later expanded to other blog sites like Blogspot.

NobodyNada (Member) commented Oct 4, 2020

> I agree it would be great if we could use the API; however, it requires coordination with respect to API keys and limits the extensibility of such detection mechanisms. For example, it would be useful if we later expanded to other blog sites like Blogspot.

@user12986714 Those are both true. However, we've had to do API key deployment in the past (e.g. for Perspective), and it's pretty simple:

  1. Add a new config entry for the API key, but make sure Smokey still runs fine without it (just with that rule disabled). That way, test instances and instances that haven't added the API key yet will still work.
  2. Update config.sample with a placeholder, and update the config on Keybase with the real key. That way, all future instances will include the key.
  3. Send a message in the runner Keybase chat reminding the runners to add the key to their existing instances.
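A minimal sketch of step 1 under assumed names — `youtube_api_key` is a hypothetical config entry, not SmokeDetector's actual one:

```python
def youtube_rule_enabled(config):
    """Enable the YouTube-date rule only when an API key is present in the config.

    Instances without the key keep running; this one rule is simply skipped,
    so test instances and not-yet-updated runners are unaffected.
    """
    return bool(config.get("youtube_api_key", "").strip())

# At startup, something like:
#   if not youtube_rule_enabled(config):
#       log("youtube_api_key missing; newly-posted-video rule disabled")
```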

As far as extensibility goes: yes, we're writing special code to use the YouTube API, but that doesn't stop us from using regexes on Medium or Blogspot. I'm worried about using regex specifically on YouTube because YT is not scraper-friendly and I would prefer to stay out of that cat-and-mouse game.

NobodyNada (Member)

I've just taken a closer look at the Medium one, too. I'm concerned by the class="bh bi at au av aw ax ay az ba fu bd bl bm", as obfuscated classes like that are usually anti-adblock measures and are therefore periodically randomized. (Also, I'm just a little bit concerned by the regex; what if someone requests the page from Europe and gets a date of 4 Oct instead of Oct 4?)

However, there's a MUCH easier way to get the date out of a Medium post. Every Medium post has the following meta tag:

<meta data-rh="true" property="article:published_time" content="[an ISO 8601 timestamp]">

Finally, since we already have BeautifulSoup, I'd suggest using that instead of regex to parse HTML, as it will be easier and more reliable.
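A sketch of that BeautifulSoup approach — the function name is mine, and the tag shape is taken from the comment above:

```python
from datetime import datetime
from bs4 import BeautifulSoup

def medium_published_time(html):
    """Pull article:published_time out of a Medium post's HTML, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("meta", attrs={"property": "article:published_time"})
    if tag is None or not tag.get("content"):
        return None
    # Medium uses an ISO 8601 timestamp such as 2020-09-02T12:00:00.000Z;
    # normalise the trailing Z so datetime.fromisoformat can parse it.
    return datetime.fromisoformat(tag["content"].replace("Z", "+00:00"))
```

Because this keys off a stable Open Graph-style property rather than obfuscated class names, it should survive Medium's periodic markup churn.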

@stale stale bot added the status: stale label Nov 7, 2020

stale bot commented Nov 8, 2020

This issue has been closed because it has had no recent activity. If this is still important, please add another comment and find someone with write permissions to reopen the issue. Thank you for your contributions.

@stale stale bot closed this Nov 8, 2020
@double-beep double-beep added the status: confirmed Confirmed as something that needs working on. label Nov 8, 2020
@double-beep double-beep reopened this Nov 8, 2020
@stale stale bot removed the status: stale label Nov 8, 2020