Scraper: project contact name/email #4

Open
2 tasks
benlk opened this issue May 26, 2020 · 4 comments

benlk commented May 26, 2020

// project contact name (postmeta project-contact-name)
// project contact email (postmeta project-contact-email)
try {
	var raw_contact = $( value ).find( 'p.contact' ).text();
	// @todo remove preface text, split this into email and name
	row.project_contact_email = raw_contact;
	row.project_contact_name = raw_contact;
} catch ( error ) {
	console.error( 'error processing project contact', error, value );
}

  • Remove label prefix from this text
  • split into name and email fields
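A minimal sketch of the first task, assuming the label prefix is the literal text "Contact:" seen in the examples further down this thread. `stripContactLabel` is a hypothetical helper name, not something that exists in the scraper.

```javascript
// Hypothetical helper: removes a leading "Contact:" label from the raw text.
// Assumes the label is always the word "Contact" followed by a colon.
function stripContactLabel( raw ) {
	// Case-insensitive, and tolerant of whitespace around the label.
	return raw.replace( /^\s*contact\s*:\s*/i, '' ).trim();
}

console.log( stripContactLabel( 'Contact: Jordan Escobar, jescobar@valleypbs.org' ) );
// prints "Jordan Escobar, jescobar@valleypbs.org"
```

Inside the try block above, this would be applied to `raw_contact` before it is split into fields.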
@benlk benlk self-assigned this May 26, 2020

benlk commented May 26, 2020

Formats seen:

Name(s), email
Name(s), email, email
Name(s), email, phone
Name(s), email, phone, phone extension
Name(s), email, phone phone extension
Name(s), job title, location, email, phone
lastname, firstname email, phone
a lone "," with no name or email, for the entry titled "10708"

Contact: Laura, Frank laurafrank@rmpbs.org

Contact: Jordan Escobar, jescobar@valleypbs.org, 559-266-1800 ext 360
Contact: Jordan Escobar, jescobar@valleypbs.org, 559-266-1800, x360

Because there is not a consistent scheme here, they will have to edit contact info by hand for imports.


benlk commented May 26, 2020

We could fix some of that on a per-case basis in the import, but that will take extra time.


benlk commented May 27, 2020

2h tentatively approved to do:

  • clean up of data using sed
  • scrape names and emails into separate entries, using the code from the term scraper that looks backwards through the array
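The term scraper's code is not shown in this issue, so the following is only a sketch of what "looking backwards through the array" might look like for contacts: split on commas, then walk from the last piece toward the first, peeling off phone numbers and emails until something that is neither is reached. Everything left over is treated as the name. `splitContact`, `EMAIL_RE`, and `PHONE_RE` are hypothetical names, and the patterns are loose guesses based on the formats listed above.

```javascript
// Hypothetical sketch, not the term scraper's actual code.
// Loose patterns: an email has no spaces and contains "@" and a dot;
// a phone piece starts with 7+ digit/punctuation characters or an extension marker.
var EMAIL_RE = /^\S+@\S+\.\S+$/;
var PHONE_RE = /^(\+?[\d\s().-]{7,}|(x|ext\.?)\s*\d+)/i;

function splitContact( raw ) {
	var parts = raw
		.replace( /^\s*contact\s*:\s*/i, '' ) // assumes a "Contact:" label prefix
		.split( ',' )
		.map( function ( part ) { return part.trim(); } )
		.filter( Boolean );
	var emails = [];
	var phones = [];
	var i = parts.length - 1;
	// Walk backwards: trailing pieces are phones or emails; stop at the first that is neither.
	for ( ; i >= 0; i-- ) {
		if ( EMAIL_RE.test( parts[ i ] ) ) {
			emails.unshift( parts[ i ] );
		} else if ( PHONE_RE.test( parts[ i ] ) ) {
			phones.unshift( parts[ i ] );
		} else {
			break;
		}
	}
	return {
		name: parts.slice( 0, i + 1 ).join( ', ' ), // names, job titles, locations
		email: emails.join( ', ' ),
		phone: phones.join( ', ' )
	};
}

console.log( splitContact( 'Contact: Jordan Escobar, jescobar@valleypbs.org, 559-266-1800, x360' ) );
// { name: 'Jordan Escobar', email: 'jescobar@valleypbs.org', phone: '559-266-1800, x360' }
```

This approach does not handle the "lastname, firstname email" format: for "Contact: Laura, Frank laurafrank@rmpbs.org" the whole string ends up in `name`, so entries like that would still need hand editing or a per-case fix.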

We're waiting for feedback on:

  • should we make the phone number its own field, or just append it to the name field?

@benlk benlk reopened this May 27, 2020

benlk commented May 27, 2020

  • Phone number is now its own field in the CSV: project_contact_phone
  • The replace in sed was too broad; redid it in injection.js for just the contact field.

output.csv.txt

This file should:

  • not contain the string XCOMMA
  • contain the following columns: project_org | project_org_type | project_contact_name | project_contact_email | post_title | post_excerpt | post_content | project_link | project_category | post_thumbnail_image | post_thumbnail_caption | post_thumbnail_credit | post_embed | project_status | post_date | project_contact_phone
  • lack contact info for some projects, because it wasn't supplied
  • contain job titles in the project_contact_name field where those were given
  • contain phone number extensions in the project_contact_phone field where those were given
  • contain the description with HTML stripped as post_excerpt
  • contain the description without HTML stripped as post_content
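The first two checks in the list above can be automated. This is a rough sketch, assuming the file is comma-delimited with a header row whose column names contain no embedded commas; `checkOutputCsv` and `EXPECTED_COLUMNS` are hypothetical names.

```javascript
// Hypothetical checker for the attached CSV; column names are taken from the list above.
var EXPECTED_COLUMNS = [
	'project_org', 'project_org_type', 'project_contact_name', 'project_contact_email',
	'post_title', 'post_excerpt', 'post_content', 'project_link', 'project_category',
	'post_thumbnail_image', 'post_thumbnail_caption', 'post_thumbnail_credit',
	'post_embed', 'project_status', 'post_date', 'project_contact_phone'
];

// Returns an array of problem descriptions; an empty array means both checks passed.
function checkOutputCsv( csvText ) {
	var problems = [];
	if ( csvText.indexOf( 'XCOMMA' ) !== -1 ) {
		problems.push( 'contains the string XCOMMA' );
	}
	// Assumption: the first line is the header, with optional double quotes around names.
	var header = csvText.split( /\r?\n/ )[ 0 ].split( ',' ).map( function ( name ) {
		return name.trim().replace( /^"|"$/g, '' );
	} );
	EXPECTED_COLUMNS.forEach( function ( column ) {
		if ( header.indexOf( column ) === -1 ) {
			problems.push( 'missing column: ' + column );
		}
	} );
	return problems;
}

console.log( checkOutputCsv( EXPECTED_COLUMNS.join( ',' ) + '\n' ) ); // prints []
```

The remaining items in the list describe the content of individual cells and still need to be checked by reading the file.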
