
Ability to skip a problematic page and move on #75

Closed
2 of 7 tasks
rivernews opened this issue May 18, 2020 · 3 comments
Labels: bug (Something isn't working), Priority

Comments

rivernews (Owner) commented May 18, 2020

Some webpages fail when locating the review panel because of a bug in the page itself. No matter how many times we retry, the result is the same. This is probably because Glassdoor uses server-side rendering to generate static pages, so the page doesn't change until the next update cycle. But we keep hitting the same problem even across updates, so the update cycle doesn't seem to help.

Closer look into the cause (increase survival rate)

We currently don't have a good way to tackle this and still scrape the content.

  • The webpage does seem to contain the review data at first, but after about a second it "flashes" and the review panel suddenly vanishes.
  • We may try to "freeze" the webpage at the point when the review panel first loads, preventing it from being modified afterwards.
  • New findings: even when the review set has more data and the page (or the next page) should contain review data, we still hit this "review panel vanished" problem. We have also discovered another case where the entire page stays blank. The outcome is the same (the review panel is missing), and it may be because we access a particular page directly by URL.
    • If we go one page back in the URL and access that page, the webpage and review panel show up. Then we click on the next-page button. Bingo! Now the page shows up! (See the sketch after this list.)
    • This gives us an insight: we should prioritize the "find the next-page link and click on it" approach when trying to proceed to the next page.
    • There's a blind spot: when 1) direct URL access fails and 2) we cannot find the next-page link, we'll still get the "cannot locate review panel" error.
      • Be aware that the click() method may lead to the "page state has changed so cannot findElement on tainted DOM" error.
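
A minimal sketch of that fallback, assuming a Selenium WebDriver session in Java; the class name, selectors, and URL parameters are hypothetical and only illustrate the idea, not the project's actual code:

```java
import java.util.List;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;

public class ReviewPageRecovery {
    // Hypothetical selectors; the real ones depend on Glassdoor's current markup.
    private static final By REVIEW_PANEL = By.cssSelector("#ReviewsFeed");
    private static final By NEXT_PAGE_LINK = By.cssSelector("li.next a");

    private final WebDriver driver;

    public ReviewPageRecovery(WebDriver driver) {
        this.driver = driver;
    }

    /**
     * If directly accessing pageUrl gives a blank page (no review panel),
     * fall back to loading the previous page and clicking the next-page link.
     */
    public boolean loadPageWithFallback(String pageUrl, String previousPageUrl) {
        driver.get(pageUrl);
        if (reviewPanelPresent()) {
            return true;
        }
        // Fallback: go one page back by URL, then click the next-page button.
        driver.get(previousPageUrl);
        if (!reviewPanelPresent()) {
            return false; // previous page is broken too
        }
        // Re-locate the link right before clicking to reduce stale-reference risk.
        List<WebElement> nextLinks = driver.findElements(NEXT_PAGE_LINK);
        if (nextLinks.isEmpty()) {
            return false; // blind spot: no next-page link to click either
        }
        nextLinks.get(0).click();
        return reviewPanelPresent();
    }

    private boolean reviewPanelPresent() {
        return !driver.findElements(REVIEW_PANEL).isEmpty();
    }
}
```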

Overall approach to increase throughput

We may focus on another approach: the ability to skip a problematic page.

  • The entire session targets scraping 500 reviews. Because of the error, all pages after the failing one are missed.
  • Ideally we can
    • Identify this situation (and make sure it's not some other, unknown situation)
    • Skip the current page
    • Resume all existing logic, but continue from the next page
    • Have a way to surface this, either immediately or when the S3 job finalizes. This may take the form of a report, perhaps stored in Redis, read by some backend system, and presented on a frontend page. Emails are not preferred. There's a ticket, Centralized log for k8 scraper job #74, to centralize logs and perhaps store them in Redis, but this one should be done separately: we need to separate the missing-page issue from the regular log and highlight it. A rough sketch of the reporting piece follows this list.
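
A rough sketch of that reporting piece, assuming the Jedis client for Redis; the key name and entry format are assumptions, not an actual schema:

```java
import redis.clients.jedis.Jedis;

public class SkippedPageReporter {
    // Hypothetical Redis key; a backend/frontend could read this list later.
    private static final String SKIPPED_PAGES_KEY = "scraper:skippedPages";

    private final Jedis jedis;

    public SkippedPageReporter(String redisHost, int redisPort) {
        this.jedis = new Jedis(redisHost, redisPort);
    }

    /** Record a page we decided to skip so the session can continue. */
    public void reportSkippedPage(String orgName, int pageNumber, String reason) {
        String entry = String.format("%s|page=%d|reason=%s|at=%d",
                orgName, pageNumber, reason, System.currentTimeMillis());
        jedis.rpush(SKIPPED_PAGES_KEY, entry);
    }

    public void close() {
        jedis.close();
    }
}
```

The scraper would call reportSkippedPage(...) right after deciding it cannot locate the review panel, then continue with the next page; a backend can later read the list and highlight it separately from the regular log.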
rivernews (Owner, Author) commented May 22, 2020

Click approach

There are issues when enabling the click-element approach - it causes "stale element reference: element is not attached to the page document". The longer we hold on to a WebElement, the more risk we have of hitting this error. To mitigate this:

  • Use the driver to find the element fresh whenever possible (see the sketch below).
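
A small sketch of that mitigation, assuming Selenium's Java bindings: re-run driver.findElement right before each use instead of holding a WebElement across page mutations, and retry once on a stale reference.

```java
import org.openqa.selenium.By;
import org.openqa.selenium.StaleElementReferenceException;
import org.openqa.selenium.WebDriver;

public final class StaleSafeActions {
    private StaleSafeActions() {}

    /** Click an element located by `locator`, re-finding it if the DOM changed. */
    public static void click(WebDriver driver, By locator) {
        try {
            driver.findElement(locator).click();
        } catch (StaleElementReferenceException e) {
            // The page mutated between findElement and click; locate it again.
            driver.findElement(locator).click();
        }
    }
}
```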

rivernews (Owner, Author) commented

Infinite Job

  • After adding the click approach, some progress counters overflowed. We need to check our termination logic in next-page detection (see the sketch below).
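
One possible shape for a tighter termination check; the class, field names, and the fixed page-size assumption are hypothetical, not the project's actual logic:

```java
public class PaginationGuard {
    private final int targetReviewCount;   // e.g. 500 for the whole session
    private final int reviewsPerPage;      // assume a fixed number of reviews per page
    private int scrapedReviewCount = 0;
    private int visitedPageCount = 0;

    public PaginationGuard(int targetReviewCount, int reviewsPerPage) {
        this.targetReviewCount = targetReviewCount;
        this.reviewsPerPage = reviewsPerPage;
    }

    public void recordPage(int reviewsOnPage) {
        scrapedReviewCount += reviewsOnPage;
        visitedPageCount++;
    }

    /** Stop when we have enough reviews or have gone past the expected last page. */
    public boolean shouldContinue() {
        int maxPages = (targetReviewCount + reviewsPerPage - 1) / reviewsPerPage;
        return scrapedReviewCount < targetReviewCount && visitedPageCount < maxPages;
    }
}
```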

@rivernews rivernews added the bug Something isn't working label May 24, 2020
rivernews added a commit to rivernews/review-scraper-java-development-environment that referenced this issue May 27, 2020
rivernews (Owner, Author) commented

We approached this by bringing in the click approach again and putting it after the guess-URL approach. No errors have been reported since.
