# TODO workplan
## Good Data
Currently the `overseas` and `redo` scrape data is good.
Older scraped data may be okay now that overseas companies were scraped again recently.
This is why it is important to rebuild the database and prepare for continuous scraping, so the database can be updated at any time.
- [x] Handle upgrading an entity stub to a fully detailed entity
This can be done with an upsert at the beginning of the script that fills the database.
A later scraping session can then locate an entity and replace its stub with complete scraped details.
Over time, new scraping sessions will also replace old versions of known entities.
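A minimal sketch of that upsert, assuming SQLite with an `entities` table keyed on `company_number`; the table and column names are assumptions, not the actual schema:
```python
import sqlite3

def upsert_entity(conn: sqlite3.Connection, entity: dict) -> None:
    # Insert a newly scraped entity, or upgrade an existing stub in place.
    # Column names (company_number, name, entity_type, details, last_seen)
    # are illustrative assumptions.
    conn.execute(
        """
        INSERT INTO entities (company_number, name, entity_type, details, last_seen)
        VALUES (:company_number, :name, :entity_type, :details, :last_seen)
        ON CONFLICT(company_number) DO UPDATE SET
            name = excluded.name,
            entity_type = excluded.entity_type,
            details = excluded.details,   -- a stub row gets full details here
            last_seen = excluded.last_seen
        """,
        entity,
    )
    conn.commit()
```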
## BAD Data
- [x] Scrape local companies again (in progress; with failures)
It appears the early scrapes have too much bad data.
I can scrape again and get all good results.
For now I should focus on tools to help process and prepare the database, then on tools to get the data into a format for presentation.
- [x] Write a last_seen value to the DB to know how long it has been since a company was scraped (see the sketch after this list).
This identifies failed scrapes and old data, and lets us make scraping less costly.
- [ ] Read the list view of company names so that listings and pages can be skipped.
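A rough sketch of writing and querying last_seen, reusing the same hypothetical `entities` table as in the upsert sketch above:
```python
import sqlite3
from datetime import datetime, timezone

def touch_last_seen(conn: sqlite3.Connection, company_number: str) -> None:
    # Stamp a company as seen by the current scrape session.
    conn.execute(
        "UPDATE entities SET last_seen = ? WHERE company_number = ?",
        (datetime.now(timezone.utc).isoformat(), company_number),
    )

def stale_companies(conn: sqlite3.Connection, cutoff_iso: str) -> list[str]:
    # Companies never seen, or not seen since the cutoff, are candidates
    # for a cheaper targeted re-scrape.
    rows = conn.execute(
        "SELECT company_number FROM entities WHERE last_seen IS NULL OR last_seen < ?",
        (cutoff_iso,),
    )
    return [row[0] for row in rows]
```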
## 2024 08 03 WIP
vfsc spider >>> scraper -> JSONL
process.py >>> JSONL -> distinct -> pkl
test-sqlite.py >>> pkl -> db
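One possible shape for the `distinct` step in process.py, assuming one JSON object per JSONL line and deduplication keyed on a `company_number` field (the field name is an assumption):
```python
import json
import pickle

def jsonl_to_distinct_pkl(jsonl_path: str, pkl_path: str) -> None:
    # Keep only the latest record per company_number, so repeated scrapes
    # of the same company collapse to one entry before loading the DB.
    distinct: dict[str, dict] = {}
    with open(jsonl_path, encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            distinct[record["company_number"]] = record  # later records win
    with open(pkl_path, "wb") as fh:
        pickle.dump(list(distinct.values()), fh)
```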
- [ ] Need periodic scrape
- target company
- specific query
- [x] company meta-data (see the schema sketch after this list)
- last_seen
- created_at
- updated_at - set when a change is detected; different from last_seen
- [x] home page sections
- FAQ
- how it's made
- random node
- [ ] additional list pages
- [x] newly registered companies
- [ ] recently removed/dissolved/liquidation/etc.
- [x] latest updated_at
- [x] oldest companies (still registered)
- [x] popular nodes
- [x] recently visited nodes
- [ ] significant nodes
- large companies (requires outside data)
- gov't MPs (requires outside data)
- [x] Setup DB
- In pipeline to spider
- Store history & current state of world
- [x] Track usage and queries
- separate web db?
- [x] Cache
- [x] home page
- [x] random node
- [x] list pages
- [x] more attributes on nodes
- company number
- Registration date
- entity_type
- addresses
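A possible SQLite schema pulling together the meta-data columns and node attributes above; names and types are assumptions, not the actual test-sqlite.py schema:
```python
import sqlite3

# Hypothetical entities schema; column names and types are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS entities (
    company_number    TEXT PRIMARY KEY,
    name              TEXT NOT NULL,
    entity_type       TEXT,
    registration_date TEXT,
    addresses         TEXT,  -- e.g. JSON-encoded list of addresses
    created_at        TEXT,  -- first time the entity was scraped
    updated_at        TEXT,  -- last time a change was detected
    last_seen         TEXT   -- last time the entity appeared in a scrape
)
"""

def init_db(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn
```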
---
Deploy to production
- uwsgi does not work for our use case
- we require a new base docker image, perhaps with gunicorn, to support threads
Careful: with gunicorn each worker process has its own local memory, and I am using a local queue, so I can only support one process.
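One way this could look as a gunicorn.conf.py, keeping a single worker so the in-process queue stays authoritative and using threads for concurrency; the bind address and timeout are placeholders:
```python
# gunicorn.conf.py -- sketch for the single-process, multi-threaded constraint.
workers = 1              # one process: the in-memory queue must stay authoritative
threads = 4              # concurrency via threads instead of extra processes
worker_class = "gthread" # threaded worker
bind = "0.0.0.0:8000"
timeout = 120
```
Started with something like `gunicorn -c gunicorn.conf.py app:app`, where `app:app` is a placeholder for the actual WSGI entry point.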