receiving contacts/social media accounts for a given url #2

hbakhtiyor · 2017-05-30T11:26:11Z

I built a quick version; the items listed below are not yet implemented.

Extract data from parsed structured data.
e.g.

<script type="application/ld+json">
{
    "@context": "http://schema.org",
    "@type": "Organization",
    "name": "Let's Validate",
    "url": "https://www.letsvalidate.com/",
    "logo": "https://www.letsvalidate.com/img/logo.png",
    "email": "info@letsvalidate.com",
    "description": "Site launch checklist checker",
    "sameAs": [
        "https://www.facebook.com/letsvalidate",
        "https://m.me/letsvalidate",
        "https://twitter.com/letsvalidate",
        "https://plus.google.com/+letsvalidate",
        "https://vk.com/letsvalidate"
    ]
}
</script>
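A minimal sketch of how the `email` and `sameAs` values could be pulled out of a block like the one above, using only the Python standard library. The names `JsonLdParser` and `extract_contacts` are made up for illustration, and the sketch assumes each JSON-LD block is a single top-level object (arrays and `@graph` wrappers are skipped).

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collects the decoded body of every <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip a malformed block instead of failing the whole page


def extract_contacts(html):
    """Return the first email and all sameAs URLs found in JSON-LD blocks."""
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    contacts = {"email": None, "socials": []}
    for item in parser.items:
        if not isinstance(item, dict):
            continue  # assumption: only top-level objects are handled here
        if contacts["email"] is None and isinstance(item.get("email"), str):
            contacts["email"] = item["email"]
        same_as = item.get("sameAs", [])
        if isinstance(same_as, str):
            same_as = [same_as]  # sameAs may be a single URL or a list
        contacts["socials"].extend(u for u in same_as if isinstance(u, str))
    return contacts
```

Calling `extract_contacts(page_html)` on a page containing the example above would give the `info@letsvalidate.com` address and the five `sameAs` URLs.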

Extract data from parsed Twitter Card metadata.
e.g.

<meta name="twitter:site" content="@letsvalidate">
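A rough sketch of reading that tag and turning the handle into a profile URL, again with the standard library only. `TwitterSiteParser` and `twitter_profile_url` are hypothetical names, and the sketch does not validate the handle beyond stripping the leading `@`.

```python
from html.parser import HTMLParser


class TwitterSiteParser(HTMLParser):
    """Remembers the content of the first <meta name="twitter:site"> tag."""

    def __init__(self):
        super().__init__()
        self.handle = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.handle is not None:
            return
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        content = (attrs.get("content") or "").strip()
        if name == "twitter:site" and content:
            self.handle = content


def twitter_profile_url(html):
    """Return a twitter.com profile URL for the page's twitter:site, or None."""
    parser = TwitterSiteParser()
    parser.feed(html)
    parser.close()
    if not parser.handle:
        return None
    return "https://twitter.com/" + parser.handle.lstrip("@")
```

For the tag above this would return `https://twitter.com/letsvalidate`.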

Prerender web apps before extracting data.
Deeply crawl, e.g. contact page
Maybe use the Google Knowledge Graph too?

the endpoint:

https://api.letsvalidate.com/v1/contacts?url=docker.com&prettify=true

result:

{
  "url": "https://www.docker.com/",
  "originalUrl": "http://docker.com",
  "contacts": {
    "email": null,
    "fax": null,
    "tel": null,
    "socials": [
      {
        "domain": "twitter.com",
        "id": null,
        "name": "docker",
        "confidence": 100,
        "url": "http://twitter.com/docker"
      },
      {
        "domain": "youtube.com",
        "id": null,
        "name": "dockerrun",
        "confidence": 100,
        "url": "http://www.youtube.com/user/dockerrun"
      },
      {
        "domain": "facebook.com",
        "id": null,
        "name": "docker.run",
        "confidence": 100,
        "url": "https://www.facebook.com/docker.run"
      }
    ]
  }
}
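For illustration, here is one way the `domain` and `name` fields of a `socials` entry could be derived from a profile URL. This is only a guess at the shape of the logic, not necessarily how the endpoint does it; `social_entry` is a made-up name, and `confidence` is left out because the thread does not say how it is computed.

```python
from urllib.parse import urlparse


def social_entry(profile_url):
    """Build a socials-style entry from a profile URL (illustrative heuristic)."""
    parts = urlparse(profile_url)
    host = (parts.hostname or "").lower()
    # Drop a leading "www." so www.youtube.com and youtube.com match.
    domain = host[4:] if host.startswith("www.") else host
    # Assumption: the account name is the last non-empty path segment.
    segments = [s for s in parts.path.split("/") if s]
    name = segments[-1] if segments else None
    return {"domain": domain, "id": None, "name": name, "url": profile_url}
```

With the URLs from the result above, `http://www.youtube.com/user/dockerrun` maps to the name `dockerrun` and `https://www.facebook.com/docker.run` to `docker.run`.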

@JHabdas What do you think: is it worth implementing, or is such an API already available?

ghost · 2017-05-30T12:26:40Z

Taking a look

ghost · 2017-05-30T12:53:32Z

Here are some specific thoughts on the approach. Please keep in mind these are more of a knee-jerk reaction than anything, and contain some bias, as I like to build simple apps that need little maintenance (so I can build other cool stuff).

First off, I'm not aware of an existing API that pulls this kind of data. I'd be surprised if one doesn't already exist as a microservice that could be ingested for aggregation. That said, I don't see any harm in rolling your own, as it'll be easier to maintain that way and you won't have to rely on a 3rd party which could fail and/or require maintenance.

Prerender web apps before extracting data.

For a first pass I'd skip pre-rendering unless you've already got an easy way to scrape (Headless Chromium?) and focus on getting the Structured Data parsing logic right. One initial question that pops to mind is which of the Structured Data types takes precedence when multiple are present, and which should win in the case of a tie, incomplete data, or data with a later associated date (if applicable).
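One possible answer to the precedence question, sketched as a fixed priority order in which the first source with a non-empty value wins. The order in `SOURCE_PRIORITY` and the name `merge_contacts` are assumptions made for the example, not something decided in this thread.

```python
# Hypothetical precedence: sources earlier in the list win over later ones.
SOURCE_PRIORITY = ["json-ld", "microdata", "rdfa", "twitter-card", "open-graph"]


def merge_contacts(found):
    """Merge per-source results into one record.

    `found` maps a source name to a dict of field -> value (or None).
    Returns (merged, origin), where origin records which source supplied
    each field so ties and conflicts can be inspected later.
    """
    merged = {}
    origin = {}
    for source in SOURCE_PRIORITY:
        for field, value in found.get(source, {}).items():
            if value is None or value == "":
                continue  # incomplete data never overrides anything
            if field not in merged:
                merged[field] = value
                origin[field] = source
    return merged, origin
```

Keeping `origin` alongside the merged record makes it easy to change the rule later, for example to prefer the entry with the most recent date when one is available.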

Maybe use the Google Knowledge Graph too?

I'm not familiar with this. But Google knows a lot. Though it may be better to pull data from multiple sources to help ensure data independence and richness.

Deeply crawl, e.g. contact page

If you do this, probably just look at /about and /contact, or build a small list of likely paths. I'm not sure there's a semantic way to identify the location of this page. I'm also not sure how Web Feeds (RSS/Atom) would help here, but they may be useful in making determinations about site structure.
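The "small list" idea could look something like this. The paths in `CANDIDATE_PATHS` are guesses at common locations, and `candidate_pages` is a hypothetical helper that only builds the URLs; fetching them is left to whatever crawler is in use.

```python
from urllib.parse import urljoin

# Assumed list of common locations; extend as real sites turn up others.
CANDIDATE_PATHS = ["/contact", "/contact-us", "/about", "/about-us"]


def candidate_pages(base_url, paths=CANDIDATE_PATHS):
    """Return absolute URLs worth checking for contact details, without duplicates."""
    pages = []
    for path in paths:
        url = urljoin(base_url, path)  # root-relative paths replace the base path
        if url not in pages:
            pages.append(url)
    return pages
```

For `https://www.docker.com/` this yields `https://www.docker.com/contact`, `https://www.docker.com/contact-us`, and so on.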

The Jekyll SEO Tag gem has unit tests you could look at to see what it looks for when it produces its metadata. WordPress could be another place to look, since I believe a large share of the sites on the Web today run WordPress.

If building, I'd try to lean into specs as much as possible and return null for anything that doesn't conform to a chosen specification. For structured and social data, those specs basically boil down to schema.org (three markup formats: JSON-LD, Microdata and RDFa), Twitter's developer docs and http://ogp.me/.
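A small sketch of the "return null on anything which doesn't conform" rule. The checks here are deliberately loose shape checks (an `@` with a dotted domain, an absolute http/https URL), not full validators for any specification, and both function names are invented for the example.

```python
import re
from urllib.parse import urlparse

# Loose shape check only; assumption: good enough to reject obvious junk.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def conforming_email(value):
    """Return the trimmed email if it has a plausible shape, else None."""
    if isinstance(value, str) and EMAIL_RE.match(value.strip()):
        return value.strip()
    return None


def conforming_url(value):
    """Return the trimmed URL if it is an absolute http(s) URL, else None."""
    if not isinstance(value, str):
        return None
    parts = urlparse(value.strip())
    if parts.scheme in ("http", "https") and parts.netloc:
        return value.strip()
    return None
```

Anything that fails a check becomes `None`, which serializes to `null` in the JSON response.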

While scraping you may find some value in Portia to help define the implementation logic visually so you don't end up pulling your hair out trying to get the scraping nailed down: https://github.com/scrapinghub/portia

EDIT HERE: Sorry, since you're pulling from meta tags it's probably best to skip Portia and build the tests starting with https://github.com/scrapinghub/scrapy or similar, if it makes sense in the environment and toolset being used currently.

EDIT 2: Probably better not to use a fork of Scrapy. 😝 https://github.com/scrapy/scrapy

Not sure if that's helpful. Just some thoughts.

ghost · 2017-05-30T13:03:06Z

One more thing. IIRC https://scrapinghub.com has a list of existing services (somewhere) where people have already defined their own scrapers which collect data. You might be able to take the blue pill and just combine a few of these to build out some relatively simple heuristics logic to combine them for the API output with a level of fault tolerance not possible using a single 3rd party.

EDIT: Scratch that. Terrible idea. But the existing scrapes may be extremely insightful to help build out the algo for the API.

hbakhtiyor · 2017-05-30T16:36:32Z

Wow, thanks a lot for your advice and for taking the time.

I'm using headless Chrome only for capturing screenshots; for JS rendering, I'm considering https://github.com/scrapinghub/splash because it's lightweight.

How about the idea itself? Would anyone be interested in it?

hbakhtiyor added feature help wanted question labels May 30, 2017

hbakhtiyor self-assigned this May 30, 2017
