data scraper, collected data #4

Open
wants to merge 4 commits into
base: master

Conversation

Noxxxxious
No description provided.

scrapers/declension.py
if entry['partOfSpeech'] != "NOUN":
    return

for meaning in meanings:
Member

Please extract helper functions; Python code shouldn't be nested this deeply.
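
A rough sketch of what that could look like, not the actual code from this PR. The `"meanings"` and `"translations"` keys and all three function names are assumptions made up for illustration; only the `partOfSpeech` check comes from the diff.

```python
from typing import Any, Iterable, Iterator


def is_noun(entry: dict[str, Any]) -> bool:
    # Mirrors the guard in the diff: only NOUN entries are processed.
    return entry.get("partOfSpeech") == "NOUN"


def iter_translations(meanings: Iterable[dict[str, Any]]) -> Iterator[str]:
    # "translations" is a hypothetical key, used only for illustration.
    for meaning in meanings:
        yield from meaning.get("translations", [])


def collect_noun_translations(entry: dict[str, Any]) -> list[str]:
    # Guard clause first, so the main path stays at one indentation level.
    if not is_noun(entry):
        return []
    # "meanings" is likewise an assumed key name.
    return list(iter_translations(entry.get("meanings", [])))
```

Each helper holds one loop or one condition, so no function needs more than two levels of indentation.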

return {}

html_content = response.content
soup = BeautifulSoup(html_content, 'html.parser')
Member

Suggested change
- soup = BeautifulSoup(html_content, 'html.parser')
+ parsed_html = BeautifulSoup(html_content, 'html.parser')

soup = BeautifulSoup(html_content, 'html.parser')
declension_table = soup.find('table')

if not declension_table:
Member

print("ERROR: Failed to find declension table in parsed html")
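
For context, one way the message could sit in a guard clause. This is only a sketch: `find_declension_table` is a hypothetical helper name, and `parsed_html` is assumed to be the parsed page object, or anything else with a `find` method.

```python
from typing import Any, Optional


def find_declension_table(parsed_html: Any) -> Optional[Any]:
    """Return the first <table> of the parsed page, or None if there is none."""
    declension_table = parsed_html.find("table")
    if declension_table is None:
        # Report the failure instead of returning silently.
        print("ERROR: Failed to find declension table in parsed html")
        return None
    return declension_table
```

The caller can then return its empty result when it gets `None` back, and the log shows why.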

url = f"{POLISH_NOUN_API_URL}{quoted_noun}"
response = requests.get(url)

if response.status_code != 200:
Member

print(f"ERROR: Received status code {response.status_code}")
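
Roughly where that line would go, as a sketch. `response_is_ok` is a hypothetical helper name, and `response` is assumed to be the object returned by `requests.get`, although any object with a `status_code` attribute works here.

```python
from typing import Any


def response_is_ok(response: Any) -> bool:
    """Return True for HTTP 200; otherwise log the status code and return False."""
    if response.status_code != 200:
        # Surface the unexpected status before the caller bails out.
        print(f"ERROR: Received status code {response.status_code}")
        return False
    return True
```

The caller can keep its early `return {}` and simply test `if not response_is_ok(response):` first.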

    'Wołacz': 'vocative'
}

rows = declension_table.find_all('tr')[1:] # skip header row
Member

from bs4.element import Tag

...

def get_table_rows_excluding_header(table: Tag) -> list[Tag]:
    # find_all returns every <tr>; the slice drops the header row.
    return table.find_all("tr")[1:]
Suggested change
- rows = declension_table.find_all('tr')[1:] # skip header row
+ rows = get_table_rows_excluding_header(declension_table)

}

case_mapping = {
    'Mianownik': 'nominative',
Member

Replace all single quotes (') with double quotes (").

if entry['partOfSpeech'] != "NOUN":
    return

for word_meaning in word_meanings:
Member

Create a KashubianEntryParser class with the following methods:
parse_word_meanings
parse_word_polish_translations
parse_noun_variations
parse_word_variations
parse_word_variation_variations
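
A possible skeleton, to show the shape only. The class and method names are the ones listed above; the constructor argument, the signatures and the docstrings are assumptions, and the bodies are deliberately left unimplemented.

```python
from typing import Any


class KashubianEntryParser:
    """Skeleton only: signatures and docstrings are guesses, not a spec."""

    def __init__(self, entry: dict[str, Any]) -> None:
        # Assumed: one dictionary entry, as iterated over in the scraper.
        self.entry = entry

    def parse_word_meanings(self) -> Any:
        """Extract the meanings of the entry."""
        raise NotImplementedError

    def parse_word_polish_translations(self) -> Any:
        """Extract the Polish translations of the entry."""
        raise NotImplementedError

    def parse_noun_variations(self) -> Any:
        """Extract the variations of a noun entry."""
        raise NotImplementedError

    def parse_word_variations(self) -> Any:
        """Extract the variations of the entry."""
        raise NotImplementedError

    def parse_word_variation_variations(self) -> Any:
        """Extract the nested variations of each variation."""
        raise NotImplementedError
```

Moving each nested loop from the scraper into its own method would also address the nesting comment on the earlier revision.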


3 participants