scraper - possible non-identical guest/host pages from various shows? #72

Open
gerbrent opened this issue Jul 10, 2022 · 7 comments · May be fixed by #75

@gerbrent
Collaborator

Curious how we're dealing with merging potentially inconsistent guest or host data (description, links, photos, etc) when scraping across shows, since we are attempting to merge guest/host profiles into a single source-of-truth/entry here in hugo.

I ask since I'm in the process of creating an inconsistency on Fireside ; )

@gerbrent added the "question" (Further information is requested) label Jul 10, 2022
@kbondarev
Collaborator

So there's no merging of any kind. I tried to implement it as simply as possible (and now, examining it, I understand it wasn't ideal and is actually a little messy).

Basically each host/guest is uniquely identified by their username, which is extracted from the URL of their page, e.g. https://coder.show/hosts/michael means the username is michael. The avatar photo comes from the first occurrence of that username, i.e. the first show it appears on (ordered as in the config.yml). The rest of the data (here comes the messy part 😜) comes from the last show.
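
To illustrate the username rule, here is a minimal sketch, not the scraper's actual code. The function name `username_from_url` is hypothetical, and it assumes the username is always the last path segment of the page URL.

```python
from urllib.parse import urlparse


def username_from_url(page_url: str) -> str:
    """Return the last non-empty path segment of a host/guest page URL.

    Hypothetical helper: assumes the username is always the final segment.
    """
    path = urlparse(page_url).path
    segments = [segment for segment in path.split("/") if segment]
    if not segments:
        raise ValueError(f"no username found in URL: {page_url!r}")
    return segments[-1]


print(username_from_url("https://coder.show/hosts/michael"))  # michael
```

A trailing slash is tolerated because empty segments are dropped before the last one is picked.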

The intent was to grab everything from the first occurrence, but I realized that I'm overwriting the data every time (not the img tho) by mistake, so it ends up being the last.
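
The difference between the two behaviors can be sketched roughly like this. It is an illustration only: the function names and the in-memory `shows` structure are assumptions, and the real scraper works on scraped pages rather than dicts.

```python
def collect_last_wins(shows):
    """Accidental behavior: every later show replaces the stored data."""
    people = {}
    for show in shows:
        for username, data in show.items():
            people[username] = data  # unconditional assignment: last show wins
    return people


def collect_first_wins(shows):
    """Intended behavior: keep the data from the first occurrence."""
    people = {}
    for show in shows:
        for username, data in show.items():
            people.setdefault(username, data)  # only set if not seen before
    return people


# Hypothetical data: the same username appears on two shows with different bios.
shows = [
    {"michael": {"bio": "bio from the first show"}},
    {"michael": {"bio": "bio from the last show"}},
]

print(collect_last_wins(shows)["michael"]["bio"])   # bio from the last show
print(collect_first_wins(shows)["michael"]["bio"])  # bio from the first show
```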

@kbondarev
Collaborator

I'm going to do something to maybe save all the different versions of the data and images, and then we will have to merge them manually.

@kbondarev linked a pull request that will close this issue Jul 10, 2022
@kbondarev
Collaborator

kbondarev commented Jul 10, 2022

See PR draft #75

The scraper grabbed the host/guest from the first occurrence it saw and saved it as username.json. Then any time it found the same username in a different show with different data it saved another json file as __username_SHOW.json

As for the avatar, I just left it saving the image from the first show where it found that person.

Here are all the additional files that got saved:

__alex_JE.json
__brent_LUP.json
__brent_OH.json
__chris_CR.json
__chris_LAN.json
__chris_LUP.json
__chris_OH.json
__christianschaller_LUP.json
__chzbacon_LUP.json
__daltondurst_LUP.json
__nealgompa_JE.json
__wes_LAN.json
__wes_LUP.json
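
For anyone reading along, the naming scheme above could be sketched as follows. This is a simplified illustration under assumptions, not the code in the PR: `save_person` is a hypothetical name, and it only compares a new record against the first saved `username.json`.

```python
import json
from pathlib import Path
from typing import Optional


def save_person(out_dir: Path, username: str, show: str, data: dict) -> Optional[Path]:
    """Save username.json on first sight, or __username_SHOW.json for a differing variant.

    Returns the path written, or None when the data matches the first record.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    main_file = out_dir / f"{username}.json"

    if not main_file.exists():
        # First occurrence of this username: this becomes the main record.
        main_file.write_text(json.dumps(data, indent=4))
        return main_file

    if json.loads(main_file.read_text()) == data:
        # Same data as the first occurrence: nothing extra to save.
        return None

    # Different data on another show: keep it as a variant for manual merging.
    variant_file = out_dir / f"__{username}_{show}.json"
    variant_file.write_text(json.dumps(data, indent=4))
    return variant_file
```

The leading double underscore keeps the variant files visually separate from the main records, so they are easy to find when merging by hand.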

@gerbrent I'll leave it up to you to merge them, because you gotta think about the content of the bio. I suggest using meld to see all the diffs and merge it together. Or at least just the diff command to quickly see what's up:

diff brent.json __brent_OH.json 
5c5
<     "bio": "Commercial, Editorial & Documentary Photography -- local food, sustainability, technology, photography on Linux + more!",
---
>     "bio": "Photography in Sudbury, ON. -- Commercial, Editorial & Documentary Photography -- local food, sustainability, technology, photography on Linux + more!",
8c8
<     "homepage": "http://brentgervais.com",
---
>     "homepage": "https://www.brentgervais.com/",
11c11
<     "instagram": null,
---
>     "instagram": "https://instagram.com/brentgervais",
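
If a field-by-field view is easier to read than line diffs, something like the sketch below might help. It is not part of the scraper: `diff_profiles` is a hypothetical helper, and the sample values are the homepage and instagram fields from the diff output above.

```python
def diff_profiles(base: dict, variant: dict) -> dict:
    """Map each differing key to a (base_value, variant_value) pair."""
    keys = sorted(set(base) | set(variant))
    return {
        key: (base.get(key), variant.get(key))
        for key in keys
        if base.get(key) != variant.get(key)
    }


# Two of the differing fields from brent.json and __brent_OH.json.
base = {"homepage": "http://brentgervais.com", "instagram": None}
variant = {
    "homepage": "https://www.brentgervais.com/",
    "instagram": "https://instagram.com/brentgervais",
}

for key, (old, new) in diff_profiles(base, variant).items():
    print(f"{key}: {old!r} -> {new!r}")
```

To compare the actual files, load each one with `json.load` first and pass the resulting dicts in.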

@gerbrent self-assigned this Jul 11, 2022
@gerbrent added the "in progress" (currently being worked on) label Jul 11, 2022
@gerbrent
Collaborator Author

nice work, added to my list.

@gerbrent
Collaborator Author

Hmm.... also looks like my bio is outdated ; )

@gerbrent added this to the Hugo Website 1.0 milestone Jul 18, 2022
@gerbrent removed the "question" (Further information is requested) label Jul 22, 2022
@kbondarev
Collaborator

Important:

Any file that gets manually edited (basically merged/consolidated from multiple variant files) should be added to the data_dont_override list in the config.yml of the show-scraper, so the scraper doesn't overwrite the manual edits on its next run.
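
Roughly, the idea is that the scraper skips anything on that list. The sketch below is only an illustration under assumptions: the exact shape of data_dont_override in config.yml isn't shown in this thread, so it is treated here as a plain list of file names that has already been loaded, and `filter_protected` is a hypothetical name.

```python
def filter_protected(scraped: dict, dont_override: list) -> dict:
    """Drop scraped files whose names are on the do-not-override list.

    Assumption: dont_override is a flat list of file names taken from config.yml.
    """
    protected = set(dont_override)
    return {name: data for name, data in scraped.items() if name not in protected}


# Hypothetical example: brent.json was merged by hand, so it must be left alone.
scraped = {"brent.json": {"bio": "scraped"}, "wes.json": {"bio": "scraped"}}
to_write = filter_protected(scraped, ["brent.json"])
print(sorted(to_write))  # ['wes.json']
```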

@kbondarev
Collaborator

kbondarev commented Aug 12, 2022

After #110 was completed the files aren't JSON anymore.

I added some commits with the new hugo md files for the variants in #75.

This comment still applies, but now they are all .md files in content/people
