receiving contacts/social media accounts for a given url #2

Open
hbakhtiyor opened this issue May 30, 2017 · 4 comments

Comments

@hbakhtiyor
Member

I built a quick version; the items in this list are not yet implemented:

  • Extract data from parsed structured data.
    e.g.
<script type="application/ld+json">
{
    "@context": "http://schema.org",
    "@type": "Organization",
    "name": "Let's Validate",
    "url": "https://www.letsvalidate.com/",
    "logo": "https://www.letsvalidate.com/img/logo.png",
    "email": "info@letsvalidate.com",
    "description": "Site launch checklist checker",
    "sameAs": [
        "https://www.facebook.com/letsvalidate",
        "https://m.me/letsvalidate",
        "https://twitter.com/letsvalidate",
        "https://plus.google.com/+letsvalidate",
        "https://vk.com/letsvalidate"
    ]
}
</script>
  • Extract data from parsed meta data of twitter card.
    e.g.
<meta name="twitter:site" content="@letsvalidate">
  • Prerender web apps before extracting data.
  • Deeply crawl, e.g. contact page
  • Maybe use the Google Knowledge Graph too?
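
A minimal sketch of the first two list items (JSON-LD and the Twitter card meta tag), using only the Python standard library. The names `ContactHintParser` and `extract_contact_hints` and the shape of the returned dict are assumptions made for illustration, not part of the existing service:

```python
import json
from html.parser import HTMLParser


class ContactHintParser(HTMLParser):
    """Collects JSON-LD blocks and the twitter:site meta tag from one HTML page."""

    def __init__(self):
        super().__init__()
        self.json_ld = []         # successfully parsed JSON-LD objects
        self.twitter_site = None  # content of <meta name="twitter:site">
        self._in_json_ld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []
        elif tag == "meta" and attrs.get("name") == "twitter:site":
            self.twitter_site = attrs.get("content")

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.json_ld.append(json.loads("".join(self._buffer)))
            except ValueError:
                pass  # skip malformed JSON-LD instead of failing the whole page


def extract_contact_hints(html):
    """Hypothetical helper: returns email, sameAs links and the Twitter handle."""
    parser = ContactHintParser()
    parser.feed(html)
    parser.close()
    email, same_as = None, []
    for block in parser.json_ld:
        if isinstance(block, dict):
            email = email or block.get("email")
            links = block.get("sameAs", [])
            # sameAs may be a single string or a list of strings
            same_as.extend([links] if isinstance(links, str) else links)
    return {"email": email, "sameAs": same_as, "twitterSite": parser.twitter_site}
```

Pages that wrap several entities in a top-level JSON array or `@graph` are not handled here; that would need an extra unwrapping step.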

the endpoint:

https://api.letsvalidate.com/v1/contacts?url=docker.com&prettify=true

result:

{
  "url": "https://www.docker.com/",
  "originalUrl": "http://docker.com",
  "contacts": {
    "email": null,
    "fax": null,
    "tel": null,
    "socials": [
      {
        "domain": "twitter.com",
        "id": null,
        "name": "docker",
        "confidence": 100,
        "url": "http://twitter.com/docker"
      },
      {
        "domain": "youtube.com",
        "id": null,
        "name": "dockerrun",
        "confidence": 100,
        "url": "http://www.youtube.com/user/dockerrun"
      },
      {
        "domain": "facebook.com",
        "id": null,
        "name": "docker.run",
        "confidence": 100,
        "url": "https://www.facebook.com/docker.run"
      }
    ]
  }
}
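
For anyone consuming this response, a small sketch of picking out the social links from the JSON above. The function name `confident_socials` and the `min_confidence` threshold are hypothetical; the sketch assumes `confidence` is a 0 to 100 score, which the sample suggests but does not confirm:

```python
import json


def confident_socials(response_text, min_confidence=80):
    """Map social domain -> profile URL for entries at or above the threshold."""
    payload = json.loads(response_text)
    # "contacts" or "socials" may be null, so fall back to empty containers.
    socials = (payload.get("contacts") or {}).get("socials") or []
    return {
        entry["domain"]: entry["url"]
        for entry in socials
        if (entry.get("confidence") or 0) >= min_confidence
    }
```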

@JHabdas What do you think: is it worth implementing, or is such an API already available?

@ghost

ghost commented May 30, 2017

Taking a look

@ghost

ghost commented May 30, 2017

Here are some specific thoughts on the approach. Please keep in mind these are more of a knee-jerk reaction than anything, and they carry some bias: I like to build simple, easy-to-maintain apps (so I can build other cool stuff).

First off, I'm not aware of an existing API that pulls this kind of data, but I'd be surprised if none exist as a microservice that could be ingested for aggregation. That said, I don't see any harm in rolling your own: it'll be easier to maintain, and you won't have to rely on a third party that could fail or require maintenance.

Prerender web apps before extracting data.

For a first pass I'd skip pre-rendering unless you've already got an easy way to scrape (headless Chromium?) and focus on getting the Structured Data parsing logic right. Some initial questions that come to mind: which of the Structured Data types takes precedence when multiple are present, and which should win in the case of a tie, incomplete data, or data with a later associated date (if applicable)?
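
One possible way to settle the precedence question is a fixed source ordering with per-field fallback, so incomplete data from a higher-priority source gets filled in from a lower one. This is only a sketch: the ordering in `SOURCE_PRECEDENCE`, the source labels and the function name are all assumptions, and it ignores the "later associated date" case:

```python
# Hypothetical precedence order; earlier sources win when both have a value.
SOURCE_PRECEDENCE = ["json-ld", "microdata", "rdfa", "twitter-card", "open-graph"]


def merge_by_precedence(candidates):
    """Merge {source: {field: value}} into one dict, field by field.

    The first non-empty value in precedence order wins; unknown sources
    are considered last.
    """
    def rank(source):
        if source in SOURCE_PRECEDENCE:
            return SOURCE_PRECEDENCE.index(source)
        return len(SOURCE_PRECEDENCE)

    merged = {}
    for source in sorted(candidates, key=rank):
        for field, value in candidates[source].items():
            if value and field not in merged:
                merged[field] = value
    return merged
```

For example, if JSON-LD provides an email but no Twitter handle and the Twitter card provides only the handle, the merged result contains both.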

Maybe to use Google knowledge graph too?

I'm not familiar with this. But Google knows a lot. Though it may be better to pull data from multiple sources to help ensure data independence and richness.

Deeply crawl, e.g. contact page

If you do this, probably just look at /about and /contact, or build a small list of candidate paths. I'm not sure there's a semantic way to identify the location of this page. I'm also not sure how Web Feeds (RSS/Atom) would help here, but they may be useful in making determinations about site structure.
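
The "small list" idea could look roughly like this. Only /about and /contact come from the suggestion above; the other paths, the `http://` default for bare domains and the function name are assumptions:

```python
from urllib.parse import urljoin, urlparse

# /contact and /about are from the discussion; the "-us" variants are guesses.
CANDIDATE_PATHS = ["/contact", "/contact-us", "/about", "/about-us"]


def candidate_contact_urls(site_url):
    """Build absolute URLs for likely contact/about pages on the same host."""
    # Accept bare domains such as "docker.com" by assuming http://.
    if "://" not in site_url:
        site_url = "http://" + site_url
    parsed = urlparse(site_url)
    root = f"{parsed.scheme}://{parsed.netloc}/"
    return [urljoin(root, path) for path in CANDIDATE_PATHS]
```

A crawler would then fetch each candidate, skip the ones that do not return a page, and run the same extraction on the rest.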


The Jekyll SEO Tag gem has unit tests you could look at to see what it looks for when it produces its metadata. WordPress could be another place to look, since I believe a large share of the sites on the Web today run WordPress.

If building, I'd try to lean into specs as much as possible and return null for anything that doesn't conform to a chosen specification. For structured and social data those specs basically boil down to schema.org (three encodings: JSON-LD, Microdata and RDFa), the Twitter developer docs and http://ogp.me/.
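
As an illustration of "return null on anything that doesn't conform", here is a sketch for the `twitter:site` value. The handle pattern (1 to 15 letters, digits or underscores, optional leading `@`) reflects my understanding of Twitter's username rules and should be treated as an assumption; the function name and output shape are hypothetical, loosely modelled on the `socials` entries in the sample result:

```python
import re

# Assumed username rule: 1-15 characters from [A-Za-z0-9_], optional leading "@".
TWITTER_HANDLE = re.compile(r"^@?([A-Za-z0-9_]{1,15})$")


def normalize_twitter_site(value):
    """Return a socials-style entry for a valid handle, otherwise None."""
    if not isinstance(value, str):
        return None
    match = TWITTER_HANDLE.match(value.strip())
    if not match:
        return None  # does not conform: report nothing rather than guess
    name = match.group(1)
    return {"domain": "twitter.com", "name": name, "url": "https://twitter.com/" + name}
```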

While scraping you may find some value in Portia to help define the implementation logic visually so you don't end up pulling your hair out trying to get the scraping nailed down: https://github.com/scrapinghub/portia

EDIT: Sorry, since you're pulling from meta tags it's probably best to skip Portia and build the tests starting with https://github.com/scrapinghub/scrapy or similar, if it makes sense in the environment and toolset currently in use.

EDIT 2: Probably better not to use a fork of Scrapy. 😝 https://github.com/scrapy/scrapy

Not sure if that's helpful. Just some thoughts.

@ghost

ghost commented May 30, 2017

One more thing. IIRC https://scrapinghub.com has a list of existing services (somewhere) where people have already defined their own scrapers to collect data. You might be able to take the blue pill and combine a few of these with some relatively simple heuristics to produce the API output, with a level of fault tolerance not possible using a single third party.

EDIT: Scratch that. Terrible idea. But the existing scrapes may be extremely insightful to help build out the algo for the API.

@hbakhtiyor
Member Author

Wow, thanks a lot for your advice and for taking the time.

I'm using headless Chrome only for capturing screenshots; for JS rendering I'm considering https://github.com/scrapinghub/splash because it is lightweight.

How about the idea itself? Would anyone be interested in it?
