- Implemented a web crawler that fetches data about different cars from a car information website
- Stored the structured car data into `ElasticSearch`, which runs as a web service in a Docker container
- Implemented a simple webpage that reads data from `ElasticSearch` and displays it in a data grid
- Used RPC for data communication among the engine instance, the `ItemSaver` instance, and the distributed `Worker` instances
- Official website: https://golang.org/. Download the installation package from the website and install it.
- Confirm `go` has been installed successfully by typing in `go version`; the expected output is something like `go version go1.13.7 darwin/amd64`.
- Turn on `GO111MODULE` and install the `goimports` dependency by typing in `go env -w GO111MODULE=on` and then `go get -v golang.org/x/tools/cmd/goimports`.
- Install all `go`-related extensions in Visual Studio Code.
- Create a `go.mod` file under `./` by typing in `go mod init FundamentalGrammer`. Suppose we have a simple hello-world Go file called `basic.go` under the `FundamentalGrammer` directory; we can run it from `./` (the directory containing `go.mod`) by typing `go run FundamentalGrammer/basic.go`, which prints `Hello World`.
- Set up a proxy in Mainland China. Go to https://github.com/goproxy/goproxy.cn/blob/master/README.zh-CN.md. The default proxy is `GOPROXY="https://proxy.golang.org,direct"`. Type in `go env -w GOPROXY=https://goproxy.cn,direct`.
- Install the gin framework by entering the command `go get -u github.com/gin-gonic/gin`.
- Install the zap logging library by entering the command `go get -u go.uber.org/zap`.
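A minimal sketch to verify both installs, assuming a throwaway `main.go`; the `/ping` route, port, and log messages are illustrative only.

```go
package main

import (
	"github.com/gin-gonic/gin"
	"go.uber.org/zap"
)

func main() {
	// Structured logger from zap.
	logger, err := zap.NewProduction()
	if err != nil {
		panic(err)
	}
	defer logger.Sync()

	// Minimal gin router with a single health-check route.
	r := gin.Default()
	r.GET("/ping", func(c *gin.Context) {
		c.JSON(200, gin.H{"message": "pong"})
	})

	logger.Info("starting server", zap.String("addr", ":8080"))
	if err := r.Run(":8080"); err != nil {
		logger.Fatal("server exited", zap.Error(err))
	}
}
```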
- Docker is written in the `Go` programming language. Go directly to https://hub.docker.com/editions/community/docker-ce-desktop-mac to download the stable binary of the Docker software.
- In mainland China, we need to use a new registry mirror to pull `Docker` images. In Docker Desktop -> Preferences, go to the Docker Engine tab and add `"registry-mirrors": ["http://f1361db2.m.daocloud.io"]` as a new item in the JSON object. When running `docker info`, the new registry mirror should be listed. Please go to https://www.daocloud.io/mirror#accelarator-doc for more details.
for more details. -
Install Elastic Search in
Docker
by typing in the following command in Terminal.docker pull docker.elastic.co/elasticsearch/elasticsearch:7.6.1
-
Run Elastic Search in Single Node mode by typing in the following command in Terminal.
docker run -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.6.1
- Delete an index from the ElasticSearch Docker container by executing `curl -XDELETE 'localhost:9200/car_profile'`.
. -
In Elastic Search,
index
acts as a DB Name and a table andid
acts like an entry into the table. Idea of type was removed in Elastic Search 7.
- Create a `/SingleThreadCrawler` directory to store the source code.
- Diagrams of the system
- Fetcher implementation details
- Install the Go Text library by entering `go get -u golang.org/x/text` in the command line.
- The Chinese characters come back in the wrong encoding, so we need to do a conversion. Call `transform.NewReader` on the original response body to convert it from `GBK` to `UTF-8`.
- For code scalability, we also need to install the Go Net library by entering `go get -u golang.org/x/net`. This library offers functionality to detect the encoding of an HTML text.
- Create a new `determineEncoding` function that takes in a response body reader and returns an `encoding.Encoding` that includes the decoder format. A sketch of the fetcher is shown below.
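A minimal sketch of how the fetcher could combine the two libraries; the `Fetch` function name, the 1024-byte peek, and the package layout are assumptions, not necessarily what the repository uses.

```go
package fetcher

import (
	"bufio"
	"fmt"
	"io/ioutil"
	"net/http"

	"golang.org/x/net/html/charset"
	"golang.org/x/text/encoding"
	"golang.org/x/text/encoding/unicode"
	"golang.org/x/text/transform"
)

// determineEncoding peeks at the first bytes of the body and lets
// x/net/html/charset guess the page encoding (e.g. GBK).
func determineEncoding(r *bufio.Reader) encoding.Encoding {
	bytes, err := r.Peek(1024)
	if err != nil {
		// Fall back to UTF-8 if we cannot peek at the body.
		return unicode.UTF8
	}
	e, _, _ := charset.DetermineEncoding(bytes, "")
	return e
}

// Fetch downloads the url and converts the body to UTF-8.
func Fetch(url string) ([]byte, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("wrong status code: %d", resp.StatusCode)
	}

	bodyReader := bufio.NewReader(resp.Body)
	e := determineEncoding(bodyReader)
	utf8Reader := transform.NewReader(bodyReader, e.NewDecoder())
	return ioutil.ReadAll(utf8Reader)
}
```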
- Parser Implementation Diagram
- Wrap the `parser` functionality as a struct in `engine/types`. Create the struct with a parse function object and the name of that function. Expose a method that runs the parse function and returns the parsed Items and further URLs. A sketch of these types is shown below.
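A minimal sketch of what these types could look like in `engine/types.go`; the field and method names are assumptions.

```go
package engine

// Request pairs a URL with the parser that knows how to handle its page.
type Request struct {
	Url    string
	Parser Parser
}

// ParseResult carries the parsed items plus any follow-up requests.
type ParseResult struct {
	Items    []interface{}
	Requests []Request
}

// Parser wraps a parse function together with its name, so the struct
// can later be referenced by name when the worker becomes an RPC service.
type Parser struct {
	ParseFunc func(contents []byte, url string) ParseResult
	Name      string
}

// Parse exposes the wrapped function as a method.
func (p Parser) Parse(contents []byte, url string) ParseResult {
	return p.ParseFunc(contents, url)
}
```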
- Merge the functionality of the Fetcher and the Parser into a worker function in `engine`, as sketched below.
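A sketch of the merged worker function, reusing the hypothetical `fetcher.Fetch` and `Parser` types above; the `crawler/...` module path is an assumption.

```go
package engine

import (
	"log"

	"crawler/fetcher" // assumed module path for the fetcher sketch above
)

// worker fetches the page for one Request and runs its parser on the body.
func worker(r Request) (ParseResult, error) {
	body, err := fetcher.Fetch(r.Url)
	if err != nil {
		log.Printf("fetcher error for url %s: %v", r.Url, err)
		return ParseResult{}, err
	}
	return r.Parser.Parse(body, r.Url), nil
}
```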
- For the concurrent web crawler, we will implement a scheduler that schedules `worker`s in the `engine`'s `Run` function.
- The `engine` will send `Request`s to the scheduler, and the scheduler will coordinate `worker`s to send requests and parse information. Please refer to the diagram below.
- A simple Scheduler will create a `goroutine` for each Request and have a single worker channel that all the `goroutine`s act on. Please refer to the diagram below.
- A Queued Scheduler sets up two queues, one for workers and the other for requests. When a new worker or request comes in, it is added to the back of the corresponding queue. When a worker should work on a request, we pop the front items of both the Request queue and the Worker queue and feed the request item into the worker item, which is a channel of requests; in `engine/worker`, the worker function then fetches and parses the request. Please refer to the diagram below, and to the sketch after it.
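A sketch of the queued scheduler's event loop under the same assumptions (the `crawler/engine` import path and method names are illustrative): requests and idle worker channels queue up, and one of each is matched whenever both queues are non-empty.

```go
package scheduler

import "crawler/engine" // assumed path for the engine types sketched above

// QueuedScheduler keeps a queue of pending requests and a queue of idle
// workers; each worker is represented by its own channel of Requests.
type QueuedScheduler struct {
	requestChan chan engine.Request
	workerChan  chan chan engine.Request
}

// Run must be called before Submit or WorkerReady.
func (s *QueuedScheduler) Submit(r engine.Request)           { s.requestChan <- r }
func (s *QueuedScheduler) WorkerReady(w chan engine.Request) { s.workerChan <- w }

func (s *QueuedScheduler) Run() {
	s.requestChan = make(chan engine.Request)
	s.workerChan = make(chan chan engine.Request)
	go func() {
		var requestQ []engine.Request
		var workerQ []chan engine.Request
		for {
			// Only enable the send case when both queues have a front item;
			// a send on a nil channel blocks, so it is skipped otherwise.
			var activeRequest engine.Request
			var activeWorker chan engine.Request
			if len(requestQ) > 0 && len(workerQ) > 0 {
				activeRequest = requestQ[0]
				activeWorker = workerQ[0]
			}
			select {
			case r := <-s.requestChan:
				requestQ = append(requestQ, r)
			case w := <-s.workerChan:
				workerQ = append(workerQ, w)
			case activeWorker <- activeRequest:
				// Front of both queues matched: pop them.
				requestQ = requestQ[1:]
				workerQ = workerQ[1:]
			}
		}
	}()
}
```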
- The following diagram illustrates how `ItemSaver` works in the project architecture.
- To install the Elastic Search client library, go to https://github.com/olivere/elastic. Type in `go get github.com/olivere/elastic/v7` to install Elastic Search 7's client library.
- Create a `save()` function in `persist/itemsaver.go` to save crawled items into the Elastic Search system, as sketched below.
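A minimal sketch of `save()` using the olivere/elastic v7 client; the index name `car_profile` comes from the curl command above, and the signature is an assumption.

```go
package persist

import (
	"context"

	"github.com/olivere/elastic/v7"
)

// save indexes one crawled item into Elastic Search and returns its id.
func save(client *elastic.Client, item interface{}) (string, error) {
	resp, err := client.Index().
		Index("car_profile").
		BodyJson(item).
		Do(context.Background())
	if err != nil {
		return "", err
	}
	return resp.Id, nil
}
```

When Elastic Search runs inside Docker, the client is typically created with sniffing disabled, e.g. `elastic.NewClient(elastic.SetSniff(false))`.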
- Architecture for UI Display
- The following diagram shows the architecture of a distributed crawler.
- Currently, data flows through channels within a single crawler instance. Next, we will use an RPC client and an RPC server to split the `ItemSaver` logic into distributed services. Please refer to the architecture below.
- We need to put `ItemSaver` into a separate service and expose an RPC call (`ItemSaverService.Save`) to the main engine, as sketched below.
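A sketch of what the `ItemSaverService` could look like using Go's standard `net/rpc` package; the `Item` payload type, the `ServeRpc` helper, and the reply format are assumptions, and `save()` refers to the earlier sketch.

```go
package persist

import (
	"log"
	"net"
	"net/rpc"

	"github.com/olivere/elastic/v7"
)

// Item is a hypothetical payload type; the real project passes CarDetail data.
type Item struct {
	Url     string
	Id      string
	Payload map[string]string
}

// ItemSaverService exposes Save over RPC so the engine can call
// "ItemSaverService.Save" on a separate process.
type ItemSaverService struct {
	Client *elastic.Client
}

// Save follows net/rpc conventions: an exported method with one argument,
// a pointer reply, and an error return.
func (s *ItemSaverService) Save(item Item, result *string) error {
	id, err := save(s.Client, item) // save() from the sketch above
	if err != nil {
		return err
	}
	*result = "OK, id: " + id
	return nil
}

// ServeRpc registers the service and serves RPC connections on addr (e.g. ":1234").
func ServeRpc(addr string, service *ItemSaverService) error {
	if err := rpc.Register(service); err != nil {
		return err
	}
	listener, err := net.Listen("tcp", addr)
	if err != nil {
		return err
	}
	for {
		conn, err := listener.Accept()
		if err != nil {
			log.Printf("accept error: %v", err)
			continue
		}
		go rpc.ServeConn(conn)
	}
}
```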
- We also need to put the `worker` into a separate service and expose an RPC call (`CrawlService.Process`) to the main engine. However, the data communicated between `CrawlService` and the engine needs to be serialized and deserialized. Please refer to the diagram below; a sketch of serializable request types follows it.
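Since a Go function value cannot cross an RPC boundary, one common approach (an assumption here, not necessarily the repository's exact types) is to refer to the parser by name and re-create the function on the worker side.

```go
package worker

// SerializedParser names a parser instead of carrying a function value,
// so a Request can be gob-encoded and sent over net/rpc.
type SerializedParser struct {
	Name string
	Args string // hypothetical extra input some parsers might need
}

// Request is an RPC-safe version of the engine's Request: the parser is
// referenced by name and mapped back to a concrete function by the worker.
type Request struct {
	Url    string
	Parser SerializedParser
}

// ParseResult mirrors the engine's ParseResult with serializable fields only.
type ParseResult struct {
	Items    []map[string]string
	Requests []Request
}
```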
- To make the `worker` distributed, we implement a `createClientPool()` function that makes an array of `rpc.Client`s for the workers (each client holds a host identical to an existing worker instance). We feed these `rpc.Client`s one by one into a channel of `*rpc.Client` in a goroutine. On the worker client side, the `CreateProcessor` method listens to this channel, picks up a `*rpc.Client` whenever one is available, and passes data to a worker server instance through RPC. A sketch of the client pool follows below.
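A sketch of the client-pool idea; `createClientPool` is named in the text, but its signature, package placement, and round-robin strategy here are assumptions.

```go
package client

import (
	"log"
	"net/rpc"
)

// createClientPool dials every worker host and exposes the connected
// clients through a channel, so callers can grab whichever is free.
func createClientPool(hosts []string) chan *rpc.Client {
	var clients []*rpc.Client
	for _, h := range hosts {
		c, err := rpc.Dial("tcp", h)
		if err != nil {
			log.Printf("error connecting to %s: %v", h, err)
			continue
		}
		clients = append(clients, c)
		log.Printf("connected to %s", h)
	}

	out := make(chan *rpc.Client)
	go func() {
		// Keep handing out the same clients in round-robin order.
		for {
			for _, c := range clients {
				out <- c
			}
		}
	}()
	return out
}
```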
- To run the distributed web crawler, first go to the `DistributedCrawler` directory by running `cd DistributedCrawler`.
- Start an `ItemSaver` server by running `go run persist/server/itemsaver.go --port=1234`.
- Start two `Worker` server instances by running `go run worker/server/worker.go --port=9000` and `go run worker/server/worker.go --port=9001`.
- Start the engine instance by running `go run main.go --itemsaver_host=":1234" --worker_hosts=":9000,:9001"`.
- Both `Worker` instances should be able to fetch data from the xcar website, and the engine can receive the data and pass `CarDetail` information to `ItemSaver` to store into `ElasticSearch`.