Merge pull request #36 from miha42-github/V3.0.0--dev
V3.0.0  dev
miha42-github authored Apr 16, 2024
2 parents 5a1a591 + 32d643d commit 317d6fe
Showing 37 changed files with 1,417 additions and 1,444 deletions.
34 changes: 34 additions & 0 deletions .github/workflows/main.yml
@@ -0,0 +1,34 @@
name: Monthly company_dns build

on:
  push:
    branches:
      - main

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v2
        with:
          repository: 'miha42-github/company_dns'
          ref: 'main'

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v1

      - name: Login to GitHub Container Registry
        uses: docker/login-action@v1
        with:
          registry: ghcr.io
          username: ${{ github.repository_owner }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Build and push
        uses: docker/build-push-action@v2
        with:
          context: .
          push: true
          tags: ghcr.io/${{ github.repository }}/company_dns:latest
7 changes: 4 additions & 3 deletions .gitignore
@@ -1,8 +1,9 @@
*.gz
*.db
.db_exists
db.exists
__pycache__
.DS_Store
test.py
.cache_exists
form_*
cache.exists
form_*
edgar_data
24 changes: 24 additions & 0 deletions Dockerfile
@@ -0,0 +1,24 @@
# Use an official Python runtime as a parent image
FROM python:3.12-slim

# Set the working directory in the container to /app
WORKDIR /app

# Add the current directory contents into the container at /app
ADD . /app

# Install curl and create directory
RUN apt-get update && apt-get install -y curl && mkdir -p /app/edgar_data
# RUN apk --no-cache add curl && mkdir -p /app/edgar_data

# Install any needed packages specified in requirements.txt
RUN pip install --no-cache-dir -r requirements.txt

# Run makedb.py to create the database cache
RUN python makedb.py

# Make port 8000 available to the world outside this container
EXPOSE 8000

# Run the command to start the application
CMD ["python", "company_dns.py"]
36 changes: 15 additions & 21 deletions README.md
@@ -1,9 +1,18 @@
# Introduction to the company_dns
To enable a more automated approach to gathering information about companies `company_dns` was created. This release enables the synthesis of data from the [SEC EDGAR repository](https://www.sec.gov/edgar/searchedgar/companysearch.html) and [Wikipedia](https://wikipedia.org). A [Medium](https://medium.com) article entitled "[A case for API based open company firmographics](https://medium.com/@michaelhay_90395/a-case-for-api-based-open-company-firmographics-145e4baf121b)" is available discussing the process and motivation behind the creation of this service.

# Introducing V3.0.0
The V3.0.0 release of the `company_dns` is a significant update to the service. The primary changes are:
1. Shift from Flask to Starlette with Uvicorn
2. Automated monthly container builds, from the main branch of the repository, using GitHub Actions
3. Simplification of all aspects of the service including code structure, shift towards simpler Docker, and a more streamlined service control script
4. Vastly improved embedded help with a query console to test queries
We made these changes to make the service easier to improve, maintain, and use.


# Installation & Setup
The following basic steps are provided for the purposes of getting the tool running.
## Get the code
## For developers Get the code
Assuming you have set up access to GitHub, you'll need to clone the repository. The steps below assume you're on a Linux machine of some kind.

1. If you're performing development create a directory that will contain the code: `mkdir ~/dev`
@@ -15,9 +24,6 @@ Before you get started it is important to install all prequisites and then creat

1. Enter the directory with the service bits (assuming you're using ~/dev): `cd ~/dev/company_dns/company_dns`
2. Install all prerequisites: `pip3 install -r ./requirements.txt`
3. Change the `USER_AGENT` setting in `~/dev/company_dns/company_dns/app/pyedgar.conf` to your own user agent definition. If you don't the SEC downloads will fail.

The utility `dbcontrol.py` will download EDGAR data, process it, and then create a database for the `company_dns`. Note that you do not need to directly run this utility as the service control script will handle it for you. For more information on the database control utility please checkout the [readme](company_dns/app/README.md) for it.

## Service Control Script
As of V2.3.0 a service control script, `svc_ctl.sh`, is provided to wrap the build, run, and log-tailing functions. Compared to past versions, this script significantly simplifies working with the `company_dns` by removing many manual steps. As a result only one step is needed to get the service running: `cd ~/dev/company_dns;svc_ctl.sh up`. This script will:
@@ -36,15 +42,11 @@ DESCRIPTION:
Control functions to run the company_dns
COMMANDS:
help up down start stop create_db build delete_db foreground tail
help start stop build foreground tail
help - call up this help text
up - bring up the service including building and pulling the docker image
down - bring down the service and remove the docker image
start - start the service using docker-compose
stop - stop the docker service
create_db - create a new database cache for the company_dns
delete_db - delete the database cache for the company_dns
build - build the docker images for the server
foreground - run the server in the foreground to watch for output
tail - tail the logs for a server running in the background
@@ -53,7 +55,6 @@ COMMANDS:
## Verify that the service is working
Regardless of how you run the `company_dns`, it is important to check that it is operating. To do so, point a browser at the server running the service. If you're running on localhost then the following link should work: [http://localhost:6868/V2.0/help](http://localhost:6868/V2.0/help). If you're on another server, change the server name to the one you're using. If this is successful you will see the embedded help, which describes the available set of endpoints and provides an example query to the service. A screenshot of the help screen can be found below.
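
The same check can be scripted instead of done in a browser. The following is a minimal sketch and not part of the repository: `help_url` and `service_is_up` are hypothetical helper names, and the default host, port, and `/V2.0/help` path are simply the values quoted in this README, so adjust them to match your deployment.

```python
from urllib.error import URLError
from urllib.parse import urlunsplit
from urllib.request import urlopen


def help_url(host="localhost", port=6868, path="/V2.0/help"):
    """Build the URL of the embedded help page (defaults are assumptions from the README)."""
    return urlunsplit(("http", f"{host}:{port}", path, "", ""))


def service_is_up(url, timeout=5.0):
    """Return True if the URL answers with HTTP 200, False on any HTTP or connection error."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (URLError, OSError):
        # Covers refused connections, timeouts, and non-2xx responses
        return False
```

Calling `service_is_up(help_url())` on the machine running the service should return `True` once the container is up; pass a different `host` when checking a remote server.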

![Screen Shot 2022-10-16 at 8 18 57 PM](https://user-images.githubusercontent.com/10818650/196084425-6fd9d724-1f59-4eed-9548-c553168bf387.png)

## Checkout a live system
We're hosting an instance of the `company_dns` on our website for our usage and your exploration. Below are several example queries and access to embedded help to get you a better view of the system.
@@ -71,20 +72,14 @@ We try to keep high level Todos and Improvements in a list contained in a sectio

### Future work/Todos
Here are the things that are likely to be worked but without any strict deadline:
1. ~~Create a simple wrapping script to operationalize service behaviors~~ [see issue #4](https://github.com/miha42-github/company_dns/issues/4)
2. ~~Incrementally refactor the repository and the code~~
3. ~~Enable TLS on nginx or provide instructions to do so~~, [see issue #10](https://github.com/miha42-github/company_dns/issues/10)
4. Determine if feasible to talk to the companies house API for gathering data from the UK
5. Research other pools of public data which can serve to enrich
6. Evaluate if financial data can be added from EDGAR, Wikipedia and Companies House
7. ~~Clean up stale EDGAR URLs~~
8. Provide instructions/details for running on a Pi or Arm based system, see Lagniappe below
9. ~~Update README.md with the appropriate language, etc.~~, [see issue #9](https://github.com/miha42-github/company_dns/issues/9)
10. ~~Add additional URLs for news, stock, patents, etc. to the merged response~~, [see issue #11](https://github.com/miha42-github/company_dns/issues/11)
11. ~~Add ticker information from Wikipedia into the response~~, [see issue #7](https://github.com/miha42-github/company_dns/issues/7)


### The Lagniappe
If you would like to run this on a Raspberry Pi I'll be adding a couple of configuration files and appropriate instructions later, but until then I suggest you check out [Matt's](https://www.raspberrypi-spy.co.uk/author/matt/) guide to [getting Nginx, UWSGI and Flask running on a Pi](https://www.raspberrypi-spy.co.uk/2018/12/running-flask-under-nginx-raspberry-pi/). At some point if someone would like to create a docker image for these elements running on the Pi that would be great.
Run on a Raspberry Pi: To be reauthored


# License
@@ -93,8 +88,7 @@ Since this code falls under a liberal Apache-V2 license it is provided as is, wi
# Key Dependencies
- [PyEdgar](https://github.com/gaulinmp/pyedgar) - used to interface with the SEC's EDGAR repository
- [SQLite](https://www.sqlite.org/index.html) - helps all utilities and the RESTful service quickly and expressively respond to interactions with the other elements to find appropriate company data
- [Flask](https://www.palletsprojects.com/p/flask/) and associated utilities - used to realize the RESTful service
- [nginx](http://nginx.org) - enables hosting of the RESTful service
- Docker & Docker Compose - Container and server framework
- [Starlette](https://www.starlette.io) - used to create the RESTful service
- [Uvicorn](https://www.uvicorn.org) - used to run the RESTful service
- [GeoPy with ArcGIS](https://github.com/geopy/geopy) - Enables proper address formatting and reporting of lat-long pairs for companies
- [wptools](https://github.com/siznax/wptools/) - provides access to MediaWiki data for company search
12 changes: 7 additions & 5 deletions company_dns/app/company_dns.conf → company_dns.conf
@@ -1,14 +1,16 @@
[db_control]
DB_NAME = company_dns.db
CACHE_EXISTS = .cache_exists
DB_EXISTS = .db_exists

[edgar_data]
EDGAR_DATA_DIR = ./edgar_data
ALL_FORMS = form_all.tab
CACHE_EXISTS = cache.exists

[sic_data]
SIC_DATA_DIR = ./sic_data
DIVISIONS = divisions.csv
MAJOR_GROUPS = major-groups.csv
INDUSTRY_GROUPS = industry-groups.csv
SIC_CODES = sic-codes.csv

[db_control]
DB_NAME = company_dns.db
DB_EXISTS = db.exists
DB_PATH = ./
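
Because `company_dns.conf` uses INI syntax, it can be read with Python's standard `configparser`. The sketch below is an assumption about how a consumer might load it, not the repository's actual loader: the `load_settings` name is hypothetical, and `SAMPLE_CONF` reproduces only the `[edgar_data]` and `[db_control]` sections of the new file.

```python
import configparser
import os

# Excerpt of the new company_dns.conf, inlined so the sketch is self-contained
SAMPLE_CONF = """
[edgar_data]
EDGAR_DATA_DIR = ./edgar_data
ALL_FORMS = form_all.tab
CACHE_EXISTS = cache.exists

[db_control]
DB_NAME = company_dns.db
DB_EXISTS = db.exists
DB_PATH = ./
"""


def load_settings(conf_text):
    """Parse INI text and return a few derived paths (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    db = parser["db_control"]
    edgar = parser["edgar_data"]
    return {
        "db_file": os.path.join(db["DB_PATH"], db["DB_NAME"]),
        "db_marker": os.path.join(db["DB_PATH"], db["DB_EXISTS"]),
        "forms_file": os.path.join(edgar["EDGAR_DATA_DIR"], edgar["ALL_FORMS"]),
    }


settings = load_settings(SAMPLE_CONF)
print(settings["db_file"])
```

Note that `configparser` lowercases option names internally, so `db["DB_PATH"]` and `db["db_path"]` resolve to the same value.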
200 changes: 200 additions & 0 deletions company_dns.py
@@ -0,0 +1,200 @@
from starlette.applications import Starlette
from starlette.responses import JSONResponse, RedirectResponse
from starlette.routing import Route
from starlette.routing import Mount
from starlette.staticfiles import StaticFiles
from starlette.middleware import Middleware
from starlette.middleware.cors import CORSMiddleware
from starlette.middleware.base import BaseHTTPMiddleware
import uvicorn
import logging

from lib.sic import SICQueries
from lib.edgar import EdgarQueries
from lib.wikipedia import WikipediaQueries
from lib.firmographics import GeneralQueries

# -------------------------------------------------------------- #
# BEGIN: Standard Industry Classification (SIC) database cache functions
async def sic_description(request):
    return _handle_request(request, sq, sq.get_all_sic_by_name, 'sic_desc')

async def sic_code(request):
    return _handle_request(request, sq, sq.get_all_sic_by_no, 'sic_code')

async def division_code(request):
    return _handle_request(request, sq, sq.get_division_desc_by_id, 'division_code')

async def industry_code(request):
    return _handle_request(request, sq, sq.get_all_industry_group_by_no, 'industry_code')

async def major_code(request):
    return _handle_request(request, sq, sq.get_all_major_group_by_no, 'major_code')
# END: Standard Industry Classification (SIC) database cache functions
# -------------------------------------------------------------- #

# -------------------------------------------------------------- #
# BEGIN: EDGAR database cache functions
async def edgar_detail(request):
    return _handle_request(request, eq, eq.get_all_details, 'company_name')

async def edgar_summary(request):
    return _handle_request(request, eq, eq.get_all_details, 'company_name', firmographics=False)

async def edgar_ciks(request):
    return _handle_request(request, eq, eq.get_all_ciks, 'company_name')

async def edgar_firmographics(request):
    return _handle_request(request, eq, eq.get_firmographics, 'cik_no')
# END: EDGAR database cache functions
# -------------------------------------------------------------- #

# -------------------------------------------------------------- #
# BEGIN: Wikipedia functions
async def wikipedia_firmographics(request):
    return _handle_request(request, wq, wq.get_firmographics, 'company_name')
# END: Wikipedia functions
# -------------------------------------------------------------- #

# -------------------------------------------------------------- #
# BEGIN: General query functions
async def general_query(request):
    try:
        gq.query = request.path_params['company_name']
        # Log the query request as a debug message
        logger.debug(f'Querying for general data for {request.path_params["company_name"]}')
        company_wiki_data = gq.get_firmographics_wikipedia()
        general_company_data = gq.merge_data(company_wiki_data['data'], company_wiki_data['data']['cik'])
        # Call _check_status_and_return to check the status of the data and return the data or an error message
        checked_data = _check_status_and_return(general_company_data, request.path_params['company_name'])
        if 'error' in checked_data:
            return JSONResponse(checked_data, status_code=checked_data['code'])
        return JSONResponse(checked_data)
    except Exception as e:
        logger.error(f'Error: {e}')
        general_company_data = {'error': 'A general or code error has occurred', 'code': 500}
        return JSONResponse(general_company_data, status_code=general_company_data['code'])
# END: General query functions
# -------------------------------------------------------------- #

# -------------------------------------------------------------- #
# BEGIN: Helper functions
def _check_status_and_return(data, resource_name):
    if data.get('code') != 200:
        # Log the error message
        logger.error(f'Data for resource {resource_name} not found')
        # Return an error message that the data was not found using the resource name
        return {'error': f'Data for resource {resource_name} not found', 'code': 404}
    return data

def _prepare_logging(log_level=logging.DEBUG):
    logging.basicConfig(format='%(levelname)s:\t%(asctime)s [module: %(name)s] %(message)s', level=log_level)
    return logging.getLogger(__file__)

def _handle_request(request, handler, func, path_param, *args, **kwargs):
    handler.query = request.path_params.get(path_param)
    data = func(*args, **kwargs)
    # Report the requested value, not the parameter name, in any error message
    checked_data = _check_status_and_return(data, handler.query)
    if 'error' in checked_data:
        return JSONResponse(checked_data, status_code=checked_data['code'])
    return JSONResponse(data)
# END: Helper functions
# -------------------------------------------------------------- #
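
The helpers above assume that every query object returns a dictionary carrying a `code` key, and that anything other than 200 is reported to the client as a 404. A minimal, framework-free sketch of that convention follows. `check_status` and the sample payloads are hypothetical stand-ins for illustration; the real payload shape produced by the `lib` modules is not shown in this commit.

```python
def check_status(data, resource_name):
    """Return data unchanged when its code is 200, otherwise a 404 error payload."""
    if data.get("code") != 200:
        return {"error": f"Data for resource {resource_name} not found", "code": 404}
    return data


# Hypothetical payloads standing in for what a query object might return
found = {"code": 200, "data": {"name": "Example Corp"}}
missing = {"code": 404, "data": {}}

print(check_status(found, "Example Corp"))
print(check_status(missing, "Nonexistent Inc"))
```

One consequence of this design is that a payload with no `code` key at all is also treated as not found, since `data.get("code")` returns `None`.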

# -------------------------------------------------------------- #
# BEGIN: Define query objects
global sq
sq = SICQueries()

global eq
eq = EdgarQueries()

global wq
wq = WikipediaQueries()

global gq
gq = GeneralQueries()
# END: Define query objects
# -------------------------------------------------------------- #


# -------------------------------------------------------------- #
# BEGIN: Define the Starlette app
class CatchAllMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        response = await call_next(request)
        if response.status_code == 404:
            return RedirectResponse(url='/help')
        return response

middleware = [
    Middleware(CatchAllMiddleware),
    Middleware(CORSMiddleware, allow_origins=['*'])
]

global logger
logger = _prepare_logging()

app = Starlette(debug=True, middleware=middleware, routes=[
    # -------------------------------------------------------------- #
    # SIC endpoints for V2.0
    Route('/V2.0/sic/description/{sic_desc}', sic_description),
    Route('/V2.0/sic/code/{sic_code}', sic_code),
    Route('/V2.0/sic/division/{division_code}', division_code),
    Route('/V2.0/sic/industry/{industry_code}', industry_code),
    Route('/V2.0/sic/major/{major_code}', major_code),

    # SIC endpoints for V3.0
    Route('/V3.0/na/sic/description/{sic_desc}', sic_description),
    Route('/V3.0/na/sic/code/{sic_code}', sic_code),
    Route('/V3.0/na/sic/division/{division_code}', division_code),
    Route('/V3.0/na/sic/industry/{industry_code}', industry_code),
    Route('/V3.0/na/sic/major/{major_code}', major_code),
    # -------------------------------------------------------------- #

    # -------------------------------------------------------------- #
    # EDGAR endpoints for V2.0
    Route('/V2.0/companies/edgar/detail/{company_name}', edgar_detail),
    Route('/V2.0/companies/edgar/summary/{company_name}', edgar_summary),
    Route('/V2.0/companies/edgar/ciks/{company_name}', edgar_ciks),
    Route('/V2.0/company/edgar/firmographics/{cik_no}', edgar_firmographics),

    # EDGAR endpoints for V3.0
    Route('/V3.0/na/companies/edgar/detail/{company_name}', edgar_detail),
    Route('/V3.0/na/companies/edgar/summary/{company_name}', edgar_summary),
    Route('/V3.0/na/companies/edgar/ciks/{company_name}', edgar_ciks),
    Route('/V3.0/na/company/edgar/firmographics/{cik_no}', edgar_firmographics),
    # -------------------------------------------------------------- #

    # -------------------------------------------------------------- #
    # Wikipedia endpoints for V2.0
    Route('/V2.0/company/wikipedia/firmographics/{company_name}', wikipedia_firmographics),

    # Wikipedia endpoints for V3.0
    Route('/V3.0/global/company/wikipedia/firmographics/{company_name}', wikipedia_firmographics),
    # -------------------------------------------------------------- #

    # -------------------------------------------------------------- #
    # General query endpoint for V2.0
    Route('/V2.0/company/merged/firmographics/{company_name}', general_query),

    # General query endpoint for V3.0
    Route('/V3.0/global/company/merged/firmographics/{company_name}', general_query),
    # -------------------------------------------------------------- #

    # Serve the local directory ./html at the /help
    Mount('/help', app=StaticFiles(directory='html', html=True)),

    # Catch-all route which redirects to /help
    # Route("/{path:path}", endpoint=lambda _: RedirectResponse(url='/help'), methods=["GET"]),

])
# END: Define the Starlette app
# -------------------------------------------------------------- #

if __name__ == "__main__":
    try:
        uvicorn.run(app, host='0.0.0.0', port=8000, log_level="debug", lifespan='off')
    except KeyboardInterrupt:
        logger.info("Server was shut down by the user.")
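
The route list in this file registers each handler twice: once under `/V2.0` and once under `/V3.0` with an added `na` or `global` scope segment. One way to keep the two sets in step is to generate both from a single table. This is only a sketch of that idea, using handler names as plain strings and a subset of the endpoints; `ENDPOINTS` and `versioned_paths` are hypothetical and not part of the commit.

```python
# (scope segment used by V3.0, path shared by both versions, handler name)
ENDPOINTS = [
    ("na", "/sic/code/{sic_code}", "sic_code"),
    ("na", "/companies/edgar/ciks/{company_name}", "edgar_ciks"),
    ("global", "/company/wikipedia/firmographics/{company_name}", "wikipedia_firmographics"),
    ("global", "/company/merged/firmographics/{company_name}", "general_query"),
]


def versioned_paths(endpoints):
    """Yield (path, handler_name) pairs for both the V2.0 and V3.0 layouts."""
    for scope, path, handler in endpoints:
        yield f"/V2.0{path}", handler
        yield f"/V3.0/{scope}{path}", handler


for path, handler in versioned_paths(ENDPOINTS):
    print(f"{path} -> {handler}")
```

In the real application each pair would be turned into a `Route(path, endpoint)` after mapping the handler name to its function, which keeps the V2.0 and V3.0 lists from drifting apart.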
4 changes: 0 additions & 4 deletions company_dns/Dockerfile

This file was deleted.
