batting_stats_range breaks when parsing through the date '2021-06-25' #218

Closed
dk-sa1 opened this issue Jun 28, 2021 · 2 comments · Fixed by #223

dk-sa1 commented Jun 28, 2021

When using the batting_stats_range function, there is an issue when parsing through 6/25/2021.

When it breaks, you receive corrupt data. For example, the player José Abreu appears as "JosÃ© Abreu", and you only receive a couple of rows of data (as opposed to several hundred for a typical day of data).

Below are some code blocks that work and do not work.

Works:
data = batting_stats_range("2021-06-25", "2021-06-27")
data = batting_stats_range("2021-06-25", "2021-06-25")

Does NOT Work:
data = batting_stats_range("2021-06-24", "2021-06-27")
data = batting_stats_range("2021-06-24", "2021-06-25")

For some reason, you can start on 6/25 with no issues, but you cannot parse over or end on 6/25 without receiving corrupt data.
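
A quick way to see which date ranges come back truncated is to compare row counts against a rough floor. The sketch below is only illustrative: `find_truncated` and the `min_rows` threshold of 100 are hypothetical names and values, and `fetch` stands in for `batting_stats_range` so the check can run without network access.

```python
from typing import Callable, Iterable, List, Sized, Tuple

DateRange = Tuple[str, str]


def find_truncated(
    fetch: Callable[[str, str], Sized],
    ranges: Iterable[DateRange],
    min_rows: int = 100,  # assumed floor; a typical day returns several hundred rows
) -> List[DateRange]:
    """Return the (start, end) ranges whose result has fewer than min_rows rows."""
    suspicious = []
    for start, end in ranges:
        data = fetch(start, end)
        if len(data) < min_rows:
            suspicious.append((start, end))
    return suspicious
```

With pybaseball installed, passing `batting_stats_range` as `fetch` along with the four ranges above should list the ones that look truncated.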

schorrm added the bug label Jun 28, 2021

dk-sa1 commented Jun 28, 2021

The issue arises for the same date in 2019.

bdilday commented Jul 17, 2021

It looks like, in the cases where the data gets truncated, Beautiful Soup can't decode the page as UTF-8 and falls back on a different encoding, e.g.:

>>> from pybaseball.league_batting_stats import batting_stats_range, get_soup
>>> start_dt = end_dt = "2021-05-01"
>>> data = batting_stats_range(start_dt, end_dt)
>>> len(data)
334
>>> soup = get_soup(start_dt, end_dt)
>>> soup.original_encoding
'utf-8'
>>> 
>>> start_dt = end_dt = "2021-05-02"
>>> data = batting_stats_range(start_dt, end_dt)
>>> len(data)
10
>>> soup = get_soup(start_dt, end_dt)
>>> soup.original_encoding
'Windows-1252'
>>> 

This seems to happen when the page header or footer includes a link to https://fbref.com/es or https://fbref.de, because those links include the characters ú and ß (in Fútbol and Fußball). Long story short, this looks like inconsistent encoding between the header/footer and the main part of the page.
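
To illustrate the mechanism, here is a small standard-library sketch. It shows what happens to a UTF-8 name when its bytes are read as Windows-1252, and why a page mixing UTF-8 text with a single-byte "ú" cannot be decoded as strict UTF-8. The `page` bytes are made up for illustration (the real page content may differ), and `is_valid_utf8` is a hypothetical helper.

```python
# UTF-8 bytes misread as Windows-1252: "é" is 0xC3 0xA9 in UTF-8,
# and those two bytes are "Ã" and "©" in Windows-1252.
name = "José Abreu"
garbled = name.encode("utf-8").decode("windows-1252")
print(garbled)

# Made-up page: UTF-8 body plus a footer where "ú" is the single byte 0xFA.
body = "José Abreu".encode("utf-8")
footer = "Fútbol".encode("latin-1")
page = body + b" | " + footer


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as strict UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8(body))  # body alone is valid UTF-8
print(is_valid_utf8(page))  # the stray 0xFA byte makes the whole page invalid
```

An encoding detector that sees the whole page fail as UTF-8 may then settle on a single-byte encoding such as Windows-1252, which garbles every multi-byte character in the body.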

Because it doesn't depend on the data, the date on which it happens isn't reproducible either.
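
One possible workaround, which is not necessarily what #223 does, is to decode the response bytes as UTF-8 explicitly and replace any undecodable bytes, so the parser receives a `str` and has no encoding left to guess. This is a minimal sketch: `decode_page` is a hypothetical helper and the sample bytes are made up.

```python
def decode_page(raw: bytes) -> str:
    """Decode page bytes as UTF-8, replacing undecodable bytes with U+FFFD."""
    return raw.decode("utf-8", errors="replace")


# Made-up page: UTF-8 body plus a footer containing the stray byte 0xFA.
page = "José Abreu".encode("utf-8") + b" | F\xfatbol"
text = decode_page(page)
print(text)
```

The trade-off is that the stray header/footer characters become replacement characters, while the UTF-8 body, which holds the actual stats, stays intact.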

Closed by #223

dk-sa1 added a commit to dk-sa1/pybaseball that referenced this issue Mar 2, 2023
dk-sa1 added a commit to dk-sa1/pybaseball that referenced this issue Mar 31, 2023
dk-sa1 added a commit to dk-sa1/pybaseball that referenced this issue Mar 31, 2023