Updated scraper_bot to handle scraping in chunks #2
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1,4 @@ | ||
HF_TOKEN= | ||
DISCORD_TOKEN= | ||
DISCORD_TOKEN= | ||
DATASET_CHUNK_SIZE=300 | ||
FETCH_ALL= |
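The diff above only shows the two new variable names, not how the bot reads them. A minimal sketch of one way to parse them, assuming `FETCH_ALL` is a boolean-style flag where an empty value means "off" and assuming 300 is the fallback chunk size; `read_scraper_env` is a hypothetical helper, not a function from this PR:

```python
import os
from typing import Mapping, Optional, Tuple

DEFAULT_CHUNK_SIZE = 300  # assumption: mirrors the value in the example .env


def read_scraper_env(env: Optional[Mapping[str, str]] = None) -> Tuple[int, bool]:
    """Return (chunk_size, fetch_all) parsed from environment-style settings.

    Hypothetical helper for illustration; the PR's real parsing is not shown.
    """
    env = os.environ if env is None else env

    # Empty or missing DATASET_CHUNK_SIZE falls back to the default.
    raw_chunk = env.get("DATASET_CHUNK_SIZE", "").strip()
    chunk_size = int(raw_chunk) if raw_chunk else DEFAULT_CHUNK_SIZE
    if chunk_size <= 0:
        raise ValueError("DATASET_CHUNK_SIZE must be a positive integer")

    # Assumption: FETCH_ALL is truthy only for a few explicit spellings.
    fetch_all = env.get("FETCH_ALL", "").strip().lower() in {"1", "true", "yes"}
    return chunk_size, fetch_all
```

Passing a plain dict instead of relying on `os.environ` keeps the helper easy to exercise in tests.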
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -99,7 +99,8 @@ ipython_config.py | |
# This is especially recommended for binary packages to ensure reproducibility, and is more | ||
# commonly ignored for libraries. | ||
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control | ||
#poetry.lock | ||
poetry.lock | ||
I use poetry as my Python package manager and I didn't want to include that in this repo |
||
pyproject.toml | ||
|
||
# pdm | ||
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control. | ||
|
@@ -163,4 +164,4 @@ cython_debug/ | |
.vscode | ||
|
||
# macOS | ||
.DS_Store | ||
.DS_Store |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
from helpers.helpers import * |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,40 @@ | ||
from typing import Tuple | ||
|
||
|
||
start_quotes = [ | ||
'"', | ||
'“', | ||
"'", | ||
'«', | ||
'„', | ||
] | ||
|
||
end_quotes = [ | ||
'"', | ||
'”', | ||
"'", | ||
'»', | ||
'“', | ||
] | ||
|
||
|
||
def starts_with_quotes(string: str) -> bool: | ||
    if len(string) == 0: | ||
        return False | ||
    return string[0] in start_quotes | ||
|
||
|
||
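def get_start_end_quotes(string: str) -> Tuple[int, int]: | ||
    # Indices of the first opening quote and the first closing quote after it (-1 if absent). | ||
    first_quote_index = -1 | ||
    last_quote_index = -1 | ||
|
||
    for i, char in enumerate(string): | ||
        if first_quote_index != -1 and last_quote_index != -1: | ||
            break | ||
        # Record the opening quote only while none has been found yet. | ||
        if first_quote_index == -1 and char in start_quotes: | ||
            first_quote_index = i | ||
            continue | ||
        # Record the closing quote only after an opening quote has been seen. | ||
        if first_quote_index != -1 and last_quote_index == -1 and char in end_quotes: | ||
            last_quote_index = i | ||
|
||
    return (first_quote_index, last_quote_index) |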
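How the scraper uses this index pair is not shown in the diff. As a rough, self-contained sketch, the pair could be used to slice a quoted caption out of a message. The version of `get_start_end_quotes` below checks for `-1` before recording each index, and `extract_quoted` is a hypothetical helper introduced only for illustration:

```python
from typing import Tuple

start_quotes = ['"', '“', "'", '«', '„']
end_quotes = ['"', '”', "'", '»', '“']


def get_start_end_quotes(string: str) -> Tuple[int, int]:
    """Return indices of the first opening quote and the next closing quote."""
    first = -1
    last = -1
    for i, char in enumerate(string):
        if first == -1:
            # Still looking for the opening quote.
            if char in start_quotes:
                first = i
            continue
        if char in end_quotes:
            # First closing quote after the opening one; stop scanning.
            last = i
            break
    return first, last


def extract_quoted(string: str) -> str:
    """Hypothetical helper: text between the quotes, or the stripped input."""
    first, last = get_start_end_quotes(string)
    if first == -1 or last == -1:
        return string.strip()
    return string[first + 1:last]
```

Because `'` appears in both lists, an apostrophe inside a quoted caption would end the match early; a real implementation may want to pair each opening quote with its matching closing character.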
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,5 @@ | ||
requests==2.31.0 | ||
datasets==2.14.5 | ||
git+https://github.com/huggingface/datasets.git@a6bd7b4a268dbda6b86d4ca59f5d2a78848b0199 | ||
Pillow==10.0.1 | ||
huggingface_hub>=0.18 | ||
numpy |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2,6 +2,7 @@ | |
"base_url": "https://discord.com/api/v9", | ||
"channel_id": "1158354590463447092", | ||
"limit": 100, | ||
"max_chunk_size": 300, | ||
Is this the same "300" defined as
the
Thanks for the in-depth review! These comments are very helpful.
Here is one with images: https://huggingface.co/datasets/laion/dalle-3-dataset. I still need to get the README updated as well for the dataset viewer here |
||
"embed_images": true, | ||
"hf_dataset_name": "laion/dalle-3-dataset" | ||
} | ||
} |
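The splitting logic itself is not part of the visible diff. A minimal sketch of what chunking by `max_chunk_size` could look like, assuming the scraped rows are accumulated in an in-memory list before upload; `iter_chunks` is a hypothetical name, not a function from this PR:

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def iter_chunks(items: Sequence[T], max_chunk_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each at most `max_chunk_size` long."""
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be a positive integer")
    for start in range(0, len(items), max_chunk_size):
        # The final chunk may be shorter than max_chunk_size.
        yield list(items[start:start + max_chunk_size])


# Usage: 700 rows with max_chunk_size=300 give chunks of 300, 300 and 100 rows.
chunk_lengths = [len(chunk) for chunk in iter_chunks(list(range(700)), 300)]
```

Each chunk could then be uploaded as its own data file, which would bound memory use per upload regardless of how large the channel history grows.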
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,27 @@ | ||
--- | ||
dataset_info: | ||
features: | ||
- name: caption | ||
dtype: string | ||
- name: image | ||
dtype: image | ||
- name: link | ||
dtype: string | ||
- name: message_id | ||
dtype: string | ||
- name: timestamp | ||
dtype: string | ||
splits: | ||
- name: train | ||
num_bytes: 0 | ||
num_examples: 0 | ||
download_size: 0 | ||
dataset_size: 0 | ||
configs: | ||
- config_name: default | ||
data_files: | ||
- split: train | ||
path: data/train-* | ||
--- | ||
|
||
Use the Edit dataset card button to edit. |
This is derived empirically by looking at chunks from https://huggingface.co/datasets/laion/dalle-3-dataset/tree/main/data
Does this change as the dataset gets bigger? Wondering what effect changing it has.
Also I'd probably move this to config.json files because it doesn't need to be a secret
You're right, for some reason I thought that we would aggregate all of these datasets into one - that's not the case - so I'll change this. Thanks!