receiving contacts/social media accounts for a given url #2
Taking a look.
Here are some specific thoughts on the approach. Please keep in mind these are more of a knee-jerk reaction than anything, and contain some bias: I like to build simple, easy-to-maintain apps which require little maintenance (so I can build other cool stuff).

First off, I'm not aware of an existing API to pull this kind of data, but I'd be surprised if some don't already exist, made available as microservices which could be ingested for aggregation. That said, I don't see any harm in rolling your own, as it'll be easier to maintain that way and you won't have to rely on a third party which could fail and/or require maintenance.
For a first pass I'd skip pre-rendering unless you've already got an easy way to scrape (headless Chromium?) and focus on getting the Structured Data parsing logic right. Some initial questions that come to mind are which of the Structured Data types takes precedence when multiple are present, and which of those should win in the case of a tie, incomplete data, or data with a later associated date, if applicable.
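The precedence question above could be sketched as a small merge step. This is a hypothetical illustration, not an existing library: the source names, field layout, and precedence order are all assumptions for the sake of the example.

```python
# Hypothetical precedence resolver for structured-data blocks found on a page.
# Each candidate is assumed to be a dict with a "source" key ("json-ld",
# "microdata", "opengraph") plus the parsed fields; all names are illustrative.

# Higher number wins when multiple blocks describe the same entity.
PRECEDENCE = {"json-ld": 3, "microdata": 2, "opengraph": 1}

def merge_candidates(candidates):
    """Merge parsed blocks, letting higher-precedence sources fill each field."""
    merged = {}
    # Sort ascending so higher-precedence sources overwrite lower ones.
    for block in sorted(candidates, key=lambda b: PRECEDENCE.get(b["source"], 0)):
        for key, value in block.items():
            if key != "source" and value:  # skip empty values (incomplete data)
                merged[key] = value
    return merged
```

With this shape, an incomplete Open Graph block can still contribute fields that a higher-precedence JSON-LD block left empty, which is one answer to the "tie / incomplete data" case.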
I'm not familiar with this, but Google knows a lot. Though it may be better to pull data from multiple sources to help ensure data independence and richness.
If you do this, the Jekyll SEO Tag gem has unit tests you could look at to see what things it looks for when it produces its metadata. WordPress could be another place to look, since I believe most of the sites on the Web today are actually WordPress and not anything else. If building, I'd try and lean into the specs as much as possible.

While scraping you may find some value in Portia to help define the implementation logic visually, so you don't end up pulling your hair out trying to get the scraping nailed down: https://github.com/scrapinghub/portia

EDIT: Sorry, since you're pulling from meta tags it's probably best to skip Portia and build the tests starting with https://github.com/scrapinghub/scrapy or similar, if it makes sense in the environment and toolset currently being used.

EDIT 2: Probably better not to use a fork of Scrapy. 😝 https://github.com/scrapy/scrapy

Not sure if that's helpful. Just some thoughts.
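As a rough sketch of the "pull from meta/structured data" idea above, here is one stdlib-only way to collect schema.org `sameAs` URLs (commonly social profiles) out of a page's JSON-LD blocks. This is an assumption-laden minimal example, not code from any of the projects mentioned; a real implementation would likely use a proper HTML parser rather than a regex.

```python
import json
import re

# Naive extraction of <script type="application/ld+json"> bodies from raw HTML.
LD_JSON_RE = re.compile(
    r'<script[^>]+type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)

def social_links(html):
    """Collect schema.org sameAs URLs (often social profiles) from a page."""
    links = []
    for raw in LD_JSON_RE.findall(html):
        try:
            data = json.loads(raw)
        except ValueError:
            continue  # skip malformed blocks rather than failing the whole page
        same_as = data.get("sameAs", [])
        # sameAs may be a single string or a list of URLs per the spec.
        links.extend([same_as] if isinstance(same_as, str) else same_as)
    return links
```

Leaning into the spec here means trusting `sameAs` as the canonical "this entity's other profiles" signal, then falling back to Open Graph / `rel="me"` links when it's absent.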
One more thing. IIRC https://scrapinghub.com has a list of existing services (somewhere) where people have already defined their own scrapers which collect data. You might be able to take the blue pill and just combine a few of these to build out some relatively simple heuristics to merge them for the API output, with a level of fault tolerance not possible using a single third party.

EDIT: Scratch that. Terrible idea. But the existing scrapers may be extremely insightful to help build out the algo for the API.
Wow, thanks a lot for your advice and for taking the time. I'm using headless Chrome only for capturing screenshots; for JS rendering I'm considering https://github.com/scrapinghub/splash for its light weight. How about the idea itself? Would anyone be interested?
I built a quick version, though the list isn't implemented yet. (Example screenshots were attached as images.)

The endpoint:

https://api.letsvalidate.com/v1/contacts?url=docker.com&prettify=true

Result: (screenshot omitted)
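For anyone wanting to try the endpoint above programmatically, here is a minimal sketch that only builds the request URL (no network call is made, and the parameter set is assumed from the single example URL in the comment).

```python
from urllib.parse import urlencode

# Base URL taken from the example above; the helper itself is a sketch.
BASE = "https://api.letsvalidate.com/v1/contacts"

def contacts_url(site, prettify=True):
    """Build the query URL for the contacts endpoint shown in the comment."""
    params = {"url": site}
    if prettify:
        params["prettify"] = "true"
    return BASE + "?" + urlencode(params)
```

The resulting string can then be fetched with any HTTP client (e.g. `urllib.request.urlopen` or `requests.get`).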
@JHabdas What do you think: is it worth implementing, or is such an API already available?