Strengthen research with google API #26

Open
marcoscaceres opened this issue Nov 28, 2013 · 3 comments

Comments

@marcoscaceres
Contributor

We can probably draw on the following to strengthen the findings. It's not as accurate, but it covers a much larger data set, so it could be used to corroborate our results.

http://git.macropus.org/meta-tag-usage/

@marcoscaceres
Contributor Author

(it's also not verifiable)

@ernesto-jimenez
Member

Wouldn't it be better to limit ourselves to more accurate and verifiable sources?

I would rather write a quick crawler that downloads a website, extracts the key information we want, and discards the HTML. That would save a lot of space and could be done easily.
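Roughly what I have in mind, as a sketch in Python using only the standard library (the URL, output filename, and tag list are placeholders, not part of any existing tooling):

import csv
import urllib.request
from html.parser import HTMLParser

class TagExtractor(HTMLParser):
    # Collect the attributes of just the tags we care about.
    KEEP = {"meta", "link", "script"}

    def __init__(self):
        super().__init__()
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag in self.KEEP:
            self.rows.append((tag, dict(attrs)))

def extract(url):
    # Download the page...
    with urllib.request.urlopen(url, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    # ...pull out the key information...
    parser = TagExtractor()
    parser.feed(html)
    # ...and discard the HTML, keeping only compact rows.
    return parser.rows

with open("tags.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for tag, attrs in extract("http://example.com/"):
        writer.writerow([tag, attrs])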

The uncompressed October dataset is 5.9GB, while all the CSVs I've generated for webdevdata-reports are 567MB:

~/webmob-reports% du -sh webdevdata-latest/
5.9G        webdevdata-latest/
~/webmob-reports% du -sh csv_out
567M    csv_out
~/webmob-reports% wc -l csv_out/*
 1933179 csv_out/all_tags.csv
  527432 csv_out/link_tags.csv
  275799 csv_out/link_tags_stylesheet.csv
  125825 csv_out/link_tags_stylesheet_media.csv
  326287 csv_out/meta_tags.csv
    1816 csv_out/meta_tags_application_names.csv
   15926 csv_out/meta_tags_viewport.csv
  641462 csv_out/script_tags.csv
 3847726 total

@marcoscaceres
Contributor Author

On Thursday, November 28, 2013 at 11:44 PM, Ernesto Jiménez wrote:

Wouldn't it be better to limit ourselves to more accurate and verifiable sources?

I don’t think we should limit ourselves. We are able to provide verifiable results, which is great - but as a secondary source that can show use at “web scale”, it certainly helps strengthen our argument. It gives an indication of the reach of a given feature beyond our dataset (even if unverifiable). Having said that, I strongly agree that we should not use it as a primary source, as we don’t know what each search result from Google actually means (we could look that up).

I would rather write a quick crawler that downloads a website, extracts the key information we want, and discards the HTML. That would save a lot of space and could be done easily.
The uncompressed October dataset is 5.9GB, while all the CSVs I've generated for webdevdata-reports (https://github.com/ernesto-jimenez/webdevdata-reports) are 567MB.

That could be quite an efficient way of doing this. If we know exactly what we are looking for, then we could broaden our search, especially if we split the task amongst a cluster of computers. We could then easily cover the top 1,000,000 sites if each machine downloaded 100,000 home pages in a very targeted way.
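As a hypothetical sketch of the split (the machine count and the source of the top-sites list are assumptions, nothing we have set up):

# Shard a top-1,000,000 URL list so each of 10 machines
# crawls a disjoint slice of 100,000 home pages.
def shard(urls, num_machines=10):
    per_machine = len(urls) // num_machines
    return [urls[i * per_machine:(i + 1) * per_machine]
            for i in range(num_machines)]

# Worker k would then run the tag extractor sketched above
# over shard(top_sites)[k].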
