- Implemented a web crawler that fetches data about different cars from a car information website
- Stored the structured car data into `ElasticSearch`, which runs as a web service in a Docker container
- Implemented a simple webpage that reads data from `ElasticSearch` and displays it in a data grid
- Used RPC for data communication among the engine instance, the `ItemSaver` instance, and the distributed `Worker` instances
- Official website: https://golang.org/. Download the installation package from the website and install it.
- Confirm `go` has been installed successfully by typing in `go version`; the expected output is something like `go version go1.13.7 darwin/amd64`.
- Turn on `GO111MODULE` and install the `goimports` dependency by typing in `go env -w GO111MODULE=on` and then `go get -v golang.org/x/tools/cmd/goimports`.
- Install all `go`-related extensions in Visual Studio Code.
- Create a `go.mod` file under `./` by typing in `go mod init FundamentalGrammer`. Suppose we have a simple hello-world Go file called `basic.go` under the `FundamentalGrammer` directory; we can run it from `./` (the directory containing `go.mod`) by typing `go run FundamentalGrammer/basic.go`, which prints `Hello World`.
- Set up a proxy in Mainland China. Go to https://github.com/goproxy/goproxy.cn/blob/master/README.zh-CN.md. The default proxy is `GOPROXY="https://proxy.golang.org,direct"`. Type in `go env -w GOPROXY=https://goproxy.cn,direct`.
- Install the gin framework by entering the command `go get -u github.com/gin-gonic/gin`.
- Install the zap logging library by entering the command `go get -u go.uber.org/zap`.
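A minimal sketch to verify both installs, assuming a throwaway `main.go`; the `/ping` route, port, and log messages are illustrative only.

```go
package main

import (
	"github.com/gin-gonic/gin"
	"go.uber.org/zap"
)

func main() {
	// Structured logger from zap.
	logger, err := zap.NewProduction()
	if err != nil {
		panic(err)
	}
	defer logger.Sync()

	// Minimal gin router with a single health-check route.
	r := gin.Default()
	r.GET("/ping", func(c *gin.Context) {
		c.JSON(200, gin.H{"message": "pong"})
	})

	logger.Info("starting server", zap.String("addr", ":8080"))
	if err := r.Run(":8080"); err != nil {
		logger.Fatal("server exited", zap.Error(err))
	}
}
```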
- Docker is written in the `Go` programming language. Go directly to https://hub.docker.com/editions/community/docker-ce-desktop-mac to download the stable binary of the Docker software.
- In mainland China, we need to use a new registry mirror to pull `Docker` images. In Docker Desktop -> Preferences, go to the Docker Engine tab and add `"registry-mirrors": ["http://f1361db2.m.daocloud.io"]` as a new item in the JSON object. When running `docker info`, the new registry mirror should be listed. Please go to https://www.daocloud.io/mirror#accelarator-doc for more details.
for more details. -
Install Elastic Search in
Docker
by typing in the following command in Terminal.docker pull docker.elastic.co/elasticsearch/elasticsearch:7.6.1
-
Run Elastic Search in Single Node mode by typing in the following command in Terminal.
docker run -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.6.1
- Delete an index from the ElasticSearch Docker container by executing `curl -XDELETE 'localhost:9200/car_profile'`.
. -
In Elastic Search,
index
acts as a DB Name and a table andid
acts like an entry into the table. Idea of type was removed in Elastic Search 7.
- Create a `/SingleThreadCrawler` directory to store the source code.
- Diagrams of the system
- Fetcher implementation details
- Install the Go Text library by entering `go get -u golang.org/x/text` in the command line.
- The Chinese characters come back in the wrong encoding, so we need to do a conversion. Call `transform.NewReader` on the original response body to convert it from `GBK` to `UTF-8`.
- For code scalability, we also need to install the Go Net library by entering `go get -u golang.org/x/net`. This library offers functionality to detect the encoding of an HTML text.
- Create a new `determineEncoding` function that takes in a response body reader and returns an `encoding.Encoding` that includes the decoder format. A sketch of the fetcher is shown below.
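A minimal sketch of how the fetcher could combine the two libraries; the `Fetch` function name, the 1024-byte peek, and the package layout are assumptions, not necessarily what the repository uses.

```go
package fetcher

import (
	"bufio"
	"fmt"
	"io/ioutil"
	"net/http"

	"golang.org/x/net/html/charset"
	"golang.org/x/text/encoding"
	"golang.org/x/text/encoding/unicode"
	"golang.org/x/text/transform"
)

// determineEncoding peeks at the first bytes of the body and lets
// x/net/html/charset guess the page encoding (e.g. GBK).
func determineEncoding(r *bufio.Reader) encoding.Encoding {
	bytes, err := r.Peek(1024)
	if err != nil {
		// Fall back to UTF-8 if we cannot peek at the body.
		return unicode.UTF8
	}
	e, _, _ := charset.DetermineEncoding(bytes, "")
	return e
}

// Fetch downloads the url and converts the body to UTF-8.
func Fetch(url string) ([]byte, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("wrong status code: %d", resp.StatusCode)
	}

	bodyReader := bufio.NewReader(resp.Body)
	e := determineEncoding(bodyReader)
	utf8Reader := transform.NewReader(bodyReader, e.NewDecoder())
	return ioutil.ReadAll(utf8Reader)
}
```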
- Parser Implementation Diagram
- Wrap the `parser` functionality as a struct in `engine/types`. Create the struct with a parse function object and the name of that function. Expose a method that runs the parse function and returns the parsed Items and further URLs. A sketch of these types is shown below.
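A minimal sketch of what these types could look like in `engine/types.go`; the field and method names are assumptions.

```go
package engine

// Request pairs a URL with the parser that knows how to handle its page.
type Request struct {
	Url    string
	Parser Parser
}

// ParseResult carries the parsed items plus any follow-up requests.
type ParseResult struct {
	Items    []interface{}
	Requests []Request
}

// Parser wraps a parse function together with its name, so the struct
// can later be referenced by name when the worker becomes an RPC service.
type Parser struct {
	ParseFunc func(contents []byte, url string) ParseResult
	Name      string
}

// Parse exposes the wrapped function as a method.
func (p Parser) Parse(contents []byte, url string) ParseResult {
	return p.ParseFunc(contents, url)
}
```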
- Merge the functionality of the Fetcher and the Parser into a worker function in `engine`, as sketched below.
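A sketch of the merged worker function, reusing the hypothetical `fetcher.Fetch` and `Parser` types above; the `crawler/...` module path is an assumption.

```go
package engine

import (
	"log"

	"crawler/fetcher" // assumed module path for the fetcher sketch above
)

// worker fetches the page for one Request and runs its parser on the body.
func worker(r Request) (ParseResult, error) {
	body, err := fetcher.Fetch(r.Url)
	if err != nil {
		log.Printf("fetcher error for url %s: %v", r.Url, err)
		return ParseResult{}, err
	}
	return r.Parser.Parse(body, r.Url), nil
}
```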
- For the concurrent web crawler, we will implement a scheduler that schedules `worker`s in the `engine`'s `Run` function.
- The `engine` will send `Request`s to the scheduler, and the scheduler will coordinate `worker`s to send requests and parse information. Please refer to the diagram below.
- A simple Scheduler will create a `goroutine` for each Request and have a single worker channel that all the `goroutine`s act on. Please refer to the diagram below.
- A Queued Scheduler sets up two queues, one for workers and the other for requests. When a new worker or request comes in, it is added to the back of the corresponding queue. When a worker should work on a request, we pop the front items of both the Request queue and the Worker queue and feed the request item into the worker item, which is a channel of requests; in `engine/worker`, the worker function then fetches and parses the request. Please refer to the diagram below, and to the sketch after it.
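A sketch of the queued scheduler's event loop under the same assumptions (the `crawler/engine` import path and method names are illustrative): requests and idle worker channels queue up, and one of each is matched whenever both queues are non-empty.

```go
package scheduler

import "crawler/engine" // assumed path for the engine types sketched above

// QueuedScheduler keeps a queue of pending requests and a queue of idle
// workers; each worker is represented by its own channel of Requests.
type QueuedScheduler struct {
	requestChan chan engine.Request
	workerChan  chan chan engine.Request
}

// Run must be called before Submit or WorkerReady.
func (s *QueuedScheduler) Submit(r engine.Request)           { s.requestChan <- r }
func (s *QueuedScheduler) WorkerReady(w chan engine.Request) { s.workerChan <- w }

func (s *QueuedScheduler) Run() {
	s.requestChan = make(chan engine.Request)
	s.workerChan = make(chan chan engine.Request)
	go func() {
		var requestQ []engine.Request
		var workerQ []chan engine.Request
		for {
			// Only enable the send case when both queues have a front item;
			// a send on a nil channel blocks, so it is skipped otherwise.
			var activeRequest engine.Request
			var activeWorker chan engine.Request
			if len(requestQ) > 0 && len(workerQ) > 0 {
				activeRequest = requestQ[0]
				activeWorker = workerQ[0]
			}
			select {
			case r := <-s.requestChan:
				requestQ = append(requestQ, r)
			case w := <-s.workerChan:
				workerQ = append(workerQ, w)
			case activeWorker <- activeRequest:
				// Front of both queues matched: pop them.
				requestQ = requestQ[1:]
				workerQ = workerQ[1:]
			}
		}
	}()
}
```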
- The following diagram illustrates how `ItemSaver` works in the project architecture.
- To install the Elastic Search client library, go to https://github.com/olivere/elastic. Type in `go get github.com/olivere/elastic/v7` to install Elastic Search 7's client library.
- Create a `save()` function in `persist/itemsaver.go` to save crawled items into the Elastic Search system, as sketched below.
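A minimal sketch of `save()` using the olivere/elastic v7 client; the index name `car_profile` comes from the curl command above, and the signature is an assumption.

```go
package persist

import (
	"context"

	"github.com/olivere/elastic/v7"
)

// save indexes one crawled item into Elastic Search and returns its id.
func save(client *elastic.Client, item interface{}) (string, error) {
	resp, err := client.Index().
		Index("car_profile").
		BodyJson(item).
		Do(context.Background())
	if err != nil {
		return "", err
	}
	return resp.Id, nil
}
```

When Elastic Search runs inside Docker, the client is typically created with sniffing disabled, e.g. `elastic.NewClient(elastic.SetSniff(false))`.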
- Architecture for UI Display
- The following diagram shows the architecture of a distributed crawler.
- Currently, data flows through channels within a single crawler instance. Next, we will use an RPC client and an RPC server to split the `ItemSaver` logic into distributed services. Please refer to the architecture below.
- We need to put `ItemSaver` into a separate service and expose an RPC call (`ItemSaverService.Save`) to the main engine, as sketched below.
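A sketch of what the `ItemSaverService` could look like using Go's standard `net/rpc` package; the `Item` payload type, the `ServeRpc` helper, and the reply format are assumptions, and `save()` refers to the earlier sketch.

```go
package persist

import (
	"log"
	"net"
	"net/rpc"

	"github.com/olivere/elastic/v7"
)

// Item is a hypothetical payload type; the real project passes CarDetail data.
type Item struct {
	Url     string
	Id      string
	Payload map[string]string
}

// ItemSaverService exposes Save over RPC so the engine can call
// "ItemSaverService.Save" on a separate process.
type ItemSaverService struct {
	Client *elastic.Client
}

// Save follows net/rpc conventions: an exported method with one argument,
// a pointer reply, and an error return.
func (s *ItemSaverService) Save(item Item, result *string) error {
	id, err := save(s.Client, item) // save() from the sketch above
	if err != nil {
		return err
	}
	*result = "OK, id: " + id
	return nil
}

// ServeRpc registers the service and serves RPC connections on addr (e.g. ":1234").
func ServeRpc(addr string, service *ItemSaverService) error {
	if err := rpc.Register(service); err != nil {
		return err
	}
	listener, err := net.Listen("tcp", addr)
	if err != nil {
		return err
	}
	for {
		conn, err := listener.Accept()
		if err != nil {
			log.Printf("accept error: %v", err)
			continue
		}
		go rpc.ServeConn(conn)
	}
}
```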
- We also need to put the `worker` into a separate service and expose an RPC call (`CrawlService.Process`) to the main engine. However, the data communicated between `CrawlService` and the engine needs to be serialized and deserialized. Please refer to the diagram below; a sketch of serializable request types follows it.
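Since a Go function value cannot cross an RPC boundary, one common approach (an assumption here, not necessarily the repository's exact types) is to refer to the parser by name and re-create the function on the worker side.

```go
package worker

// SerializedParser names a parser instead of carrying a function value,
// so a Request can be gob-encoded and sent over net/rpc.
type SerializedParser struct {
	Name string
	Args string // hypothetical extra input some parsers might need
}

// Request is an RPC-safe version of the engine's Request: the parser is
// referenced by name and mapped back to a concrete function by the worker.
type Request struct {
	Url    string
	Parser SerializedParser
}

// ParseResult mirrors the engine's ParseResult with serializable fields only.
type ParseResult struct {
	Items    []map[string]string
	Requests []Request
}
```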
- To make the `worker` distributed, we implement a `createClientPool()` function that makes an array of `rpc.Client`s for the workers (each client holds a host identical to an existing worker instance). We feed these `rpc.Client`s one by one into a channel of `*rpc.Client` in a goroutine. On the worker client side, the `CreateProcessor` method listens to this channel, picks up a `*rpc.Client` whenever one is available, and passes data to a worker server instance through RPC. A sketch of the client pool follows below.
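A sketch of the client-pool idea; `createClientPool` is named in the text, but its signature, package placement, and round-robin strategy here are assumptions.

```go
package client

import (
	"log"
	"net/rpc"
)

// createClientPool dials every worker host and exposes the connected
// clients through a channel, so callers can grab whichever is free.
func createClientPool(hosts []string) chan *rpc.Client {
	var clients []*rpc.Client
	for _, h := range hosts {
		c, err := rpc.Dial("tcp", h)
		if err != nil {
			log.Printf("error connecting to %s: %v", h, err)
			continue
		}
		clients = append(clients, c)
		log.Printf("connected to %s", h)
	}

	out := make(chan *rpc.Client)
	go func() {
		// Keep handing out the same clients in round-robin order.
		for {
			for _, c := range clients {
				out <- c
			}
		}
	}()
	return out
}
```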
- To run the distributed web crawler, first go to the `DistributedCrawler` directory by running `cd DistributedCrawler`.
- Start an `ItemSaver` server by running `go run persist/server/itemsaver.go --port=1234`.
- Start two `Worker` server instances by running `go run worker/server/worker.go --port=9000` and `go run worker/server/worker.go --port=9001`.
- Start the engine instance by running `go run main.go --itemsaver_host=":1234" --worker_hosts=":9000,:9001"`.
- Both `Worker` instances should be able to fetch data from the xcar website, and the engine can receive the data and pass `CarDetail` information to `ItemSaver` to store into `ElasticSearch`.