This program crawls fashion items from the four sites below.
First, fetch each site's item list page.
Second, visit each item page and crawl the title, image, reviews, and price using Celery.
Third, categorize the image style with the fashion recommendation mall's deep learning server.
Finally, save all the data to MongoDB Atlas.
| 키작녀 | 키작남 | 소녀나라 | 고고싱 |
|---|---|---|---|
{
"site" : crawling site,
"category" : item category (top, pants, ...),
"title" : item title,
"image_link" : item image link,
"price" : item price,
"reviews" : item review keywords (list[str]),
"style" : item style (first, second)
}
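For illustration, a single stored document might look like this (all field values below are hypothetical, not taken from a real crawl):

```python
# Hypothetical example of one crawled item document.
# Site name, title, price, keywords, and styles are made up for illustration.
item = {
    "site": "example-site",
    "category": "top",
    "title": "basic cotton t-shirt",
    "image_link": "https://example.com/images/12345.jpg",
    "price": 19900,
    "reviews": ["soft", "fits well", "good value"],  # review keywords, list[str]
    "style": ["casual", "street"],                   # first and second style
}
```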
Docker : 20.10.16
# First, set project_setting.py as below
url_setting = {
"mongo_db_url" : "mongo db atlas url",
"selenium_url" : "driver path",
"deep_learning_server_url" : "deep learning server url"
}
tool_setting = {
"database_driver" : MongodbContextManager,
"web_driver" : SeleniumContextManager
}
celery_broker_url = {
"celery_broker_url" : "celery broker url"
}
# set up the Celery worker
$ sudo git clone {this repo}
$ sudo docker build -t crawler .
$ sudo docker run -d crawler
# run app.py
$ python3 app.py
Celery is an asynchronous task queue that processes a series of queued tasks across multiple workers. It is often used to move heavy work, such as file conversion, storage, or uploads, out of a synchronous web request.
- Asynchronous task queue; supports scheduling but is focused on real-time processing.
- Both synchronous and asynchronous processing are possible.
- The unit of work is called a Task, and the process that executes it is called a Worker.
- Uses a message broker such as RabbitMQ or Redis.
pip3 install celery
celery -A tasks worker -l INFO
"""
This code shows basic usage of Celery
"""
import time
import random
import celery
# Configure Celery as below
app = celery.Celery(
    'tasks',
    broker='pyamqp://broker-url',
    backend='pyamqp://backend-url'
)
# Decorate each task with @app.task
@app.task
def build_server():
    print('wait 10 sec')
    time.sleep(10)
    server_id = random.randint(1, 10)
    return server_id
"""
A Group runs a set of Celery tasks in parallel
"""
@app.task
def build_servers():
    # call celery.group and pass the task signatures
    g = celery.group(
        build_server.s() for _ in range(10))
    return g()
"""
A Chord runs a callback task after all group tasks are done
"""
@app.task
def callback(result):
    for server_id in result:
        print(server_id)
    print("done")
    return "done"
@app.task
def build_server_with_callback():
    c = celery.chord(
        # run the group tasks
        (build_server.s() for _ in range(10)),
        # after the group finishes, run the callback
        callback.s()
    )
    return c()
"""
A Chain connects tasks so that each result feeds the next.
"""
@app.task
def setup_dns(server_id):
    print(server_id)
    return "done"
@app.task
def deploy_customer_server():
    chain = build_server.s() | setup_dns.s()
    return chain()
A Chord connects a Group to a callback, while a Chain connects individual tasks. A Chord must synchronize on every task in the Group before the callback can run, and this synchronization uses a lot of resources.
So prefer Chains over Chords where possible.