Skip to content

Web crawler to get relevant information about Census website parent and child topics.

Notifications You must be signed in to change notification settings

CommerceDataService/census_topic_crawler

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

36 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

census_topic_crawler

U.S. Census Bureau logo

This project attempts to get relevant information about Census website topics and enhance a elasticsearch index that contains a set of similar topic names.

Specifically, there are two main areas of this repository. The first is:

Topic Crawler

A spider for crawling Census website topic pages in order to grab relevant information and output that into a formatted JSON document.

There are two spiders (crawlers) that for crawling the Census topic pages.

  • parent_topics (filename: census_topic_crawler.py)
  • child_topics (filename: census_child_topic_crawler.py)

In order to execute each of these spiders, run the following command in the base folder of the repository:

scrapy crawl [name of spider] -o [name of output file] -t json

When the above command is executed, the output will be written to the specified filename.

Note: the format of the site may change. If the layout changes, it is not guaranteed that this bot completely execute without errors.

Topic Information Loader

Some scripts to load that information into a new field in an existing topic index and query that index to test the results of the changes. These files are contained in the elastic_scripts directory of the repository (along with an additional README).

About

Web crawler to get relevant information about Census website parent and child topics.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages