
Ability to skip a problematic page and move on #75

Closed
2 of 7 tasks
rivernews opened this issue May 18, 2020 · 3 comments
Labels: bug (Something isn't working), Priority

Comments

rivernews (Owner) commented May 18, 2020

Some webpages fail when locating the review panel because of a bug in the page itself. No matter how many times we retry, the result is the same. This is probably because Glassdoor uses server-side rendering to generate static pages, so the page doesn't change until the next update cycle. But we keep hitting the same problem even across updates, so the update cycle doesn't seem to help.

Closer look into the cause (increase survival rate)

We currently don't have a good way to tackle this and still scrape the content.

  • The webpage does seem to contain the review data at first, but after about a second it "flashes" and the review panel suddenly vanishes.
  • We may try to "freeze" the webpage at the point when the review panel first loads, preventing it from being modified afterwards.
  • New findings: even when the review set has more data and the page (or the next page) should contain review data, we still hit this "review panel vanished" problem. We have also discovered another case where the entire page stays blank. The outcome is the same (the review panel is missing), and it may be because we access a particular page directly by URL.
    • If we go one page back in the URL and access that page, the webpage and review panel show up. Then we click on the next-page button. Bingo! Now the page shows up! (See the sketch after this list.)
    • This gives us an insight: we should prioritize the "find the next-page link and click on it" approach when trying to proceed to the next page.
    • There's a blind spot: when 1) direct URL access fails and 2) we cannot find the next-page link, we'll still get the "cannot locate review panel" error.
      • Be aware that the click() method may lead to the "page state has changed so cannot findElement on tainted DOM" error.
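
A minimal sketch of that fallback, assuming a Selenium WebDriver session in Java; the class name, selectors, and URL parameters are hypothetical and only illustrate the idea, not the project's actual code:

```java
import java.util.List;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;

public class ReviewPageRecovery {
    // Hypothetical selectors; the real ones depend on Glassdoor's current markup.
    private static final By REVIEW_PANEL = By.cssSelector("#ReviewsFeed");
    private static final By NEXT_PAGE_LINK = By.cssSelector("li.next a");

    private final WebDriver driver;

    public ReviewPageRecovery(WebDriver driver) {
        this.driver = driver;
    }

    /**
     * If directly accessing pageUrl gives a blank page (no review panel),
     * fall back to loading the previous page and clicking the next-page link.
     */
    public boolean loadPageWithFallback(String pageUrl, String previousPageUrl) {
        driver.get(pageUrl);
        if (reviewPanelPresent()) {
            return true;
        }
        // Fallback: go one page back by URL, then click the next-page button.
        driver.get(previousPageUrl);
        if (!reviewPanelPresent()) {
            return false; // previous page is broken too
        }
        // Re-locate the link right before clicking to reduce stale-reference risk.
        List<WebElement> nextLinks = driver.findElements(NEXT_PAGE_LINK);
        if (nextLinks.isEmpty()) {
            return false; // blind spot: no next-page link to click either
        }
        nextLinks.get(0).click();
        return reviewPanelPresent();
    }

    private boolean reviewPanelPresent() {
        return !driver.findElements(REVIEW_PANEL).isEmpty();
    }
}
```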

Overall approach to increase throughput

We may focus on another approach: the ability to skip a problematic page.

  • The entire session targets scraping 500 reviews. Because of the error, all pages after the failing one are missed.
  • Ideally we can
    • Identify this situation (and make sure it's not some other, unknown situation)
    • Skip the current page
    • Resume all existing logic, but continue from the next page
    • Have a way to surface this, either immediately or when the S3 job finalizes. This may take the form of a report, perhaps stored in Redis, read by some backend system, and presented on a frontend page. Emails are not preferred. There's a ticket, Centralized log for k8 scraper job #74, to centralize logs and perhaps store them in Redis, but this one should be done separately: we need to separate the missing-page issue from the regular log and highlight it. A rough sketch of the reporting piece follows this list.
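
A rough sketch of that reporting piece, assuming the Jedis client for Redis; the key name and entry format are assumptions, not an actual schema:

```java
import redis.clients.jedis.Jedis;

public class SkippedPageReporter {
    // Hypothetical Redis key; a backend/frontend could read this list later.
    private static final String SKIPPED_PAGES_KEY = "scraper:skippedPages";

    private final Jedis jedis;

    public SkippedPageReporter(String redisHost, int redisPort) {
        this.jedis = new Jedis(redisHost, redisPort);
    }

    /** Record a page we decided to skip so the session can continue. */
    public void reportSkippedPage(String orgName, int pageNumber, String reason) {
        String entry = String.format("%s|page=%d|reason=%s|at=%d",
                orgName, pageNumber, reason, System.currentTimeMillis());
        jedis.rpush(SKIPPED_PAGES_KEY, entry);
    }

    public void close() {
        jedis.close();
    }
}
```

The scraper would call reportSkippedPage(...) right after deciding it cannot locate the review panel, then continue with the next page; a backend can later read the list and highlight it separately from the regular log.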
rivernews (Owner, Author) commented May 22, 2020

Click approach

There are issues when enabling the click-element approach - it causes "stale element reference: element is not attached to the page document". The longer we hold on to a WebElement, the more risk we have of hitting this error. To mitigate this:

  • Use the driver to find the element fresh whenever possible (see the sketch below).
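
A small sketch of that mitigation, assuming Selenium's Java bindings: re-run driver.findElement right before each use instead of holding a WebElement across page mutations, and retry once on a stale reference.

```java
import org.openqa.selenium.By;
import org.openqa.selenium.StaleElementReferenceException;
import org.openqa.selenium.WebDriver;

public final class StaleSafeActions {
    private StaleSafeActions() {}

    /** Click an element located by `locator`, re-finding it if the DOM changed. */
    public static void click(WebDriver driver, By locator) {
        try {
            driver.findElement(locator).click();
        } catch (StaleElementReferenceException e) {
            // The page mutated between findElement and click; locate it again.
            driver.findElement(locator).click();
        }
    }
}
```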

rivernews (Owner, Author) commented

Infinite Job

  • After adding the click approach, some progress counters overflowed. We need to check our termination logic in next-page detection (see the sketch below).
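
One possible shape for a tighter termination check; the class, field names, and the fixed page-size assumption are hypothetical, not the project's actual logic:

```java
public class PaginationGuard {
    private final int targetReviewCount;   // e.g. 500 for the whole session
    private final int reviewsPerPage;      // assume a fixed number of reviews per page
    private int scrapedReviewCount = 0;
    private int visitedPageCount = 0;

    public PaginationGuard(int targetReviewCount, int reviewsPerPage) {
        this.targetReviewCount = targetReviewCount;
        this.reviewsPerPage = reviewsPerPage;
    }

    public void recordPage(int reviewsOnPage) {
        scrapedReviewCount += reviewsOnPage;
        visitedPageCount++;
    }

    /** Stop when we have enough reviews or have gone past the expected last page. */
    public boolean shouldContinue() {
        int maxPages = (targetReviewCount + reviewsPerPage - 1) / reviewsPerPage;
        return scrapedReviewCount < targetReviewCount && visitedPageCount < maxPages;
    }
}
```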

@rivernews rivernews added the bug Something isn't working label May 24, 2020
rivernews added a commit to rivernews/review-scraper-java-development-environment that referenced this issue May 27, 2020
rivernews (Owner, Author) commented

We approached this by bringing in the click approach again and putting it after the guess-URL approach. No errors have been reported since.
