
Updated scraper_bot to handle scraping in chunks #2

Merged
merged 26 commits into main from update-scraper-to-handle-chunks on Oct 20, 2023

Commits (26)
c9cb368
Updated scraper_bot to handle scraping in chunks
TwoAbove Oct 12, 2023
04dffff
Added filter_messages
TwoAbove Oct 14, 2023
5c340e7
Added optional toggle to not download images
TwoAbove Oct 15, 2023
56b72ea
Fixed type issue and added optional 'scrape all' env
TwoAbove Oct 15, 2023
ad55e82
Merge branch 'main' into update-scraper-to-handle-chunks
TwoAbove Oct 15, 2023
d398e6b
Merge branch 'main' into update-scraper-to-handle-chunks
TwoAbove Oct 15, 2023
ec71538
Fixed chunk update logic
TwoAbove Oct 16, 2023
a8485d6
Fixed update_chunk when empty repo
TwoAbove Oct 16, 2023
0fd28e4
Pinned huggingface_hub to at least 0.18
TwoAbove Oct 16, 2023
f40edae
Fixed race condition
TwoAbove Oct 16, 2023
bbb8ba4
Merge branch 'main' into update-scraper-to-handle-chunks
ZachNagengast Oct 19, 2023
209b6a4
Merge branch 'update-scraper-to-handle-chunks' of github.com:LAION-AI…
ZachNagengast Oct 19, 2023
3600241
Update append logic
ZachNagengast Oct 20, 2023
6959598
Fix config
ZachNagengast Oct 20, 2023
0d5a608
Cleanup
ZachNagengast Oct 20, 2023
50b2478
Update scraper/scraper_bot.py
ZachNagengast Oct 20, 2023
593ca10
Update scraper/scraper_bot.py
ZachNagengast Oct 20, 2023
e9b111b
Update scraper/scraper_bot.py
ZachNagengast Oct 20, 2023
f8fe5ea
Update scraper/scraper_bot.py
ZachNagengast Oct 20, 2023
dfe957a
Update fingerprint handling if missing
ZachNagengast Oct 20, 2023
2b29fdd
Freeze dataset version until fixed
ZachNagengast Oct 20, 2023
e17ca5e
Update dataset for test
ZachNagengast Oct 20, 2023
ff8542b
Freeze dataset version until fixed
ZachNagengast Oct 20, 2023
df94f2f
Add missing upload_file import
ZachNagengast Oct 20, 2023
54ce773
Test for dalle ci
ZachNagengast Oct 20, 2023
5ef842d
Revert test dataset names
ZachNagengast Oct 20, 2023
4 changes: 3 additions & 1 deletion .env.example
@@ -1,2 +1,4 @@
HF_TOKEN=
DISCORD_TOKEN=
DATASET_CHUNK_SIZE=300
@TwoAbove (Collaborator, Author) commented on Oct 12, 2023:

This is derived empirically by looking at chunks from https://huggingface.co/datasets/laion/dalle-3-dataset/tree/main/data
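
For context, a minimal sketch of what chunked writing with that value could look like — the `chunked` helper and the file-naming comment are illustrative, not the PR's actual code; only `DATASET_CHUNK_SIZE=300` comes from the diff:

```python
import os
from typing import Any, Dict, Iterator, List

# Chunk size chosen empirically; see the comment above.
CHUNK_SIZE = int(os.environ.get("DATASET_CHUNK_SIZE", "300"))


def chunked(rows: List[Dict[str, Any]], size: int = CHUNK_SIZE) -> Iterator[List[Dict[str, Any]]]:
    """Yield consecutive slices of at most `size` rows, one per dataset file."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]
```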

@ZachNagengast (Member) commented on Oct 12, 2023:

Does this change as the dataset gets bigger? Wondering what effect changing it has.

Also I'd probably move this to config.json files because it doesn't need to be a secret

@TwoAbove (Collaborator, Author) replied:

You're right, for some reason I thought that we would aggregate all of these datasets into one - that's not the case - so I'll change this. Thanks!

FETCH_ALL=
5 changes: 3 additions & 2 deletions .gitignore
@@ -99,7 +99,8 @@ ipython_config.py
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock
poetry.lock
@TwoAbove (Collaborator, Author) commented:

I use poetry as my python package manager and I didn't want to include that in this repo

pyproject.toml

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
@@ -163,4 +164,4 @@ cython_debug/
.vscode

# macOS
.DS_Store
1 change: 1 addition & 0 deletions helpers/__init__.py
@@ -0,0 +1 @@
from helpers.helpers import *
40 changes: 40 additions & 0 deletions helpers/helpers.py
@@ -0,0 +1,40 @@
from typing import Tuple


start_quotes = [
    '"',
    '“',
    "'",
    '«',
    '„',
]

end_quotes = [
    '"',
    '”',
    "'",
    '»',
    '“',
]


def starts_with_quotes(string: str) -> bool:
    if len(string) == 0:
        return False
    return string[0] in start_quotes


def get_start_end_quotes(string: str) -> Tuple[int, int]:
    """Return the index of the first start quote and of the last end quote
    in `string`, or -1 for whichever is absent (mirrors str.find / str.rfind)."""
    first_quote_index = -1
    last_quote_index = -1

    for i, char in enumerate(string):
        # Record only the first opening quote encountered.
        if first_quote_index == -1 and char in start_quotes:
            first_quote_index = i
            continue
        # Keep updating so the last closing quote wins.
        if char in end_quotes:
            last_quote_index = i

    return (first_quote_index, last_quote_index)
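
A quick usage sketch for the two helpers above (the message string is made up for illustration):

```python
from helpers import starts_with_quotes, get_start_end_quotes

content = '“A watercolor fox in the snow” posted with two attachments'
if starts_with_quotes(content):
    start, end = get_start_end_quotes(content)
    # Slice between the opening and closing quote characters to get the prompt.
    prompt = content[start + 1:end].strip()
    print(prompt)  # -> A watercolor fox in the snow
```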
4 changes: 3 additions & 1 deletion requirements.txt
@@ -1,3 +1,5 @@
requests==2.31.0
datasets==2.14.5
git+https://github.com/huggingface/datasets.git@a6bd7b4a268dbda6b86d4ca59f5d2a78848b0199
Pillow==10.0.1
huggingface_hub>=0.18
numpy
3 changes: 2 additions & 1 deletion scrape_dalle/config.json
@@ -2,6 +2,7 @@
"base_url": "https://discord.com/api/v9",
"channel_id": "1158354590463447092",
"limit": 100,
"max_chunk_size": 300,
A contributor commented:

Is this the same "300" defined as DATASET_CHUNK_SIZE above? If yes, let's reuse it maybe?

@TwoAbove (Collaborator, Author) replied on Oct 20, 2023:

The DATASET_CHUNK_SIZE env was my first iteration of the feature, but it makes more sense for it to be repo-dependent, so I think it should be deleted.
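
A minimal sketch of the repo-dependent approach — the dataclass fields mirror the config.json keys shown in this diff, but the repo's real `ScraperBotConfig` may differ:

```python
import json
from dataclasses import dataclass


@dataclass
class ScraperBotConfig:
    base_url: str
    channel_id: str
    limit: int
    max_chunk_size: int  # per-repo replacement for the DATASET_CHUNK_SIZE env var
    embed_images: bool
    hf_dataset_name: str

    @classmethod
    def from_json(cls, path: str) -> "ScraperBotConfig":
        # Each scraper directory ships its own config.json, so the
        # chunk size can be tuned per dataset repo.
        with open(path, "r") as f:
            return cls(**json.load(f))
```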

@ZachNagengast (Member) replied:

Thanks for the in-depth review! These comments are very helpful.

  • I'm not using pre-upload anymore since I moved the upload step into the append function, which uses the HfFileSystem to upload. If this is recommended I can take a look.
  • We store raw images only for some datasets, based on the config.

Here is one with images: https://huggingface.co/datasets/laion/dalle-3-dataset. I still need to get the README updated as well for the dataset viewer there.
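
For reference, a rough sketch of appending a chunk through HfFileSystem as described above — the chunk-file naming and the DataFrame input are assumptions, not the PR's exact implementation:

```python
import pandas as pd
from huggingface_hub import HfFileSystem  # requires huggingface_hub>=0.18, as pinned


def append_chunk(df: pd.DataFrame, repo_id: str, chunk_index: int, token: str) -> None:
    """Write one chunk of scraped rows as a new parquet file in the dataset repo."""
    fs = HfFileSystem(token=token)
    # Hypothetical naming scheme matching the data/train-* layout in the repo.
    path = f"datasets/{repo_id}/data/train-{chunk_index:05d}.parquet"
    with fs.open(path, "wb") as f:
        df.to_parquet(f)
```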

"embed_images": true,
"hf_dataset_name": "laion/dalle-3-dataset"
}
20 changes: 10 additions & 10 deletions scrape_dalle/scrape.py
@@ -5,6 +5,8 @@
sys.path.append("..")

from scraper import ScraperBot, ScraperBotConfig, HFDatasetScheme
from helpers import starts_with_quotes, get_start_end_quotes


def parse_fn(message: Dict[str, Any]) -> List[HFDatasetScheme]:
"""Parses a message into a list of Hugging Face Dataset Schemes.
@@ -20,22 +22,19 @@ def parse_fn(message: Dict[str, Any]) -> List[HFDatasetScheme]:
A list of Hugging Face Dataset Schemes.
"""
content = message["content"]

# Find the index of the first quote in the content
first_quote_index = content.find('"')

# Find the index of the last quote in the content
last_quote_index = content.rfind('"')


(first_quote_index, last_quote_index) = get_start_end_quotes(content)

# Extract the text between the first and last quotes to get the complete prompt
prompt = content[first_quote_index + 1:last_quote_index].strip()
image_urls = [attachment["url"] for attachment in message["attachments"]]
timestamp = message["timestamp"]
message_id = message["id"]

return [HFDatasetScheme(caption=prompt, image=None, link=image_url, message_id=message_id, timestamp=timestamp)
for image_url in image_urls]


def condition_fn(message: Dict[str, Any]) -> bool:
"""Checks if a message meets the condition to be parsed.

@@ -49,11 +48,12 @@ def condition_fn(message: Dict[str, Any]) -> bool:
bool
True if the message meets the condition, False otherwise.
"""
return len(message["attachments"]) > 0 and message["content"].startswith('"')
return len(message["attachments"]) > 0 and starts_with_quotes(message["content"])


if __name__ == "__main__":
config_path = os.path.join(os.path.dirname(__file__), "config.json")
config = ScraperBotConfig.from_json(config_path)

bot = ScraperBot(config=config, parse_fn=parse_fn, condition_fn=condition_fn)
bot.scrape(fetch_all=True)
bot.scrape(fetch_all=os.environ.get("FETCH_ALL", "false").lower() == "true")
3 changes: 2 additions & 1 deletion scrape_gpt4v/config.json
@@ -2,6 +2,7 @@
"base_url": "https://discord.com/api/v9",
"channel_id": "1159217496390389801",
"limit": 100,
"max_chunk_size": 300,
"embed_images": false,
"hf_dataset_name": "laion/gpt4v-dataset"
}
17 changes: 8 additions & 9 deletions scrape_gpt4v/scrape.py
@@ -3,6 +3,7 @@
from typing import Any, Dict, List

from scraper import ScraperBot, ScraperBotConfig, HFDatasetScheme
from helpers import starts_with_quotes, get_start_end_quotes

url_pattern = re.compile(r'https?://\S+')

@@ -20,13 +21,9 @@ def parse_fn(message: Dict[str, Any]) -> List[HFDatasetScheme]:
A list of Hugging Face Dataset Schemes.
"""
content = message["content"]

# Find the index of the first quote in the content
first_quote_index = content.find('"')

# Find the index of the last quote in the content
last_quote_index = content.rfind('"')


(first_quote_index, last_quote_index) = get_start_end_quotes(content)

# Extract the text between the first and last quotes to get the complete prompt
prompt = content[first_quote_index + 1:last_quote_index].strip()
image_urls = url_pattern.findall(content)
@@ -36,6 +33,7 @@ def parse_fn(message: Dict[str, Any]) -> List[HFDatasetScheme]:
return [HFDatasetScheme(caption=prompt, image=None, link=image_url, message_id=message_id, timestamp=timestamp)
for image_url in image_urls]


def condition_fn(message: Dict[str, Any]) -> bool:
"""Checks if a message meets the condition to be parsed.

@@ -49,11 +47,12 @@ def condition_fn(message: Dict[str, Any]) -> bool:
bool
True if the message meets the condition, False otherwise.
"""
return url_pattern.search(message["content"]) and message["content"].startswith('"')
return url_pattern.search(message["content"]) and starts_with_quotes(message["content"])


if __name__ == "__main__":
config_path = os.path.join(os.path.dirname(__file__), "config.json")
config = ScraperBotConfig.from_json(config_path)

bot = ScraperBot(config=config, parse_fn=parse_fn, condition_fn=condition_fn)
bot.scrape(fetch_all=True)
bot.scrape(fetch_all=os.environ.get("FETCH_ALL", "false").lower() == "true")
1 change: 1 addition & 0 deletions scrape_gpt4v_emotion/config.json
@@ -2,6 +2,7 @@
"base_url": "https://discord.com/api/v9",
"channel_id": "1162094554472788029",
"limit": 100,
"max_chunk_size": 300,
"embed_images": false,
"hf_dataset_name": "laion/gpt4v-emotion-dataset"
}
11 changes: 4 additions & 7 deletions scrape_gpt4v_emotion/scrape.py
@@ -3,6 +3,7 @@
from typing import Any, Dict, List

from scraper import ScraperBot, ScraperBotConfig, HFDatasetScheme
from helpers import starts_with_quotes, get_start_end_quotes

url_pattern = re.compile(r'https?://\S+')

@@ -22,11 +23,7 @@ def parse_fn(message: Dict[str, Any]) -> List[HFDatasetScheme]:
"""
content = message["content"]

# Find the index of the first quote in the content
first_quote_index = content.find('"')

# Find the index of the last quote in the content
last_quote_index = content.rfind('"')
(first_quote_index, last_quote_index) = get_start_end_quotes(content)

# Extract the text between the first and last quotes to get the complete prompt
prompt = content[first_quote_index + 1:last_quote_index].strip()
@@ -51,12 +48,12 @@ def condition_fn(message: Dict[str, Any]) -> bool:
bool
True if the message meets the condition, False otherwise.
"""
return url_pattern.search(message["content"]) and message["content"].startswith('"')
return url_pattern.search(message["content"]) and starts_with_quotes(message["content"])


if __name__ == "__main__":
config_path = os.path.join(os.path.dirname(__file__), "config.json")
config = ScraperBotConfig.from_json(config_path)

bot = ScraperBot(config=config, parse_fn=parse_fn, condition_fn=condition_fn)
bot.scrape(fetch_all=True)
bot.scrape(fetch_all=os.environ.get("FETCH_ALL", "false").lower() == "true")
3 changes: 2 additions & 1 deletion scrape_wuerstchen/config.json
@@ -2,6 +2,7 @@
"base_url": "https://discord.com/api/v9",
"channel_id": "1161398740595265626",
"limit": 100,
"max_chunk_size": 300,
"embed_images": true,
"hf_dataset_name": "laion/wuerstchen-dataset"
}
9 changes: 6 additions & 3 deletions scrape_wuerstchen/scrape.py
@@ -3,6 +3,7 @@

from scraper import ScraperBot, ScraperBotConfig, HFDatasetScheme


def parse_fn(message: Dict[str, Any]) -> List[HFDatasetScheme]:
"""Parses a message into a list of Hugging Face Dataset Schemes.

@@ -21,9 +22,10 @@ def parse_fn(message: Dict[str, Any]) -> List[HFDatasetScheme]:
timestamp = message["timestamp"]
message_id = message["id"]

return [HFDatasetScheme(caption=prompt, image=None, link=image_url, message_id=message_id, timestamp=timestamp)
for image_url in image_urls]


def condition_fn(message: Dict[str, Any]) -> bool:
"""Checks if a message meets the condition to be parsed.

@@ -37,11 +39,12 @@ def condition_fn(message: Dict[str, Any]) -> bool:
bool
True if the message meets the condition, False otherwise.
"""
return len(message["attachments"]) > 0


if __name__ == "__main__":
config_path = os.path.join(os.path.dirname(__file__), "config.json")
config = ScraperBotConfig.from_json(config_path)

bot = ScraperBot(config=config, parse_fn=parse_fn, condition_fn=condition_fn)
bot.scrape(fetch_all=True)
bot.scrape(fetch_all=os.environ.get("FETCH_ALL", "false").lower() == "true")
27 changes: 27 additions & 0 deletions scraper/dataset_readme_template.md
@@ -0,0 +1,27 @@
---
dataset_info:
features:
- name: caption
dtype: string
- name: image
dtype: image
- name: link
dtype: string
- name: message_id
dtype: string
- name: timestamp
dtype: string
splits:
- name: train
num_bytes: 0
num_examples: 0
download_size: 0
dataset_size: 0
configs:
- config_name: default
data_files:
- split: train
path: data/train-*
---

Use the Edit dataset card button to edit.