Pagser is inspired by "page parser".
Pagser is a simple, extensible, and configurable library that parses an HTML page and deserializes it into a struct, based on goquery and struct tags, for Go crawlers.
go get -u github.com/foolin/pagser
Or get the specified version:
go get github.com/foolin/pagser@{version}
The {version} release list: https://github.com/foolin/pagser/releases
- Simple - Uses Go struct tag syntax.
- Easy - Easy to use in your spider/crawler/colly application.
- Extensible - Supports extension functions.
- Struct tag grammar - The grammar is simple, like `pagser:"a->attr(href)"`.
- Nested structures - Supports nested structs for nodes.
- Configurable - Supports configuration.
- Implicit type conversion - Output strings are automatically converted to int, int64, float64...
- GoQuery/Colly - Works with any goquery-based project, such as go-colly.
See Pagser
package main
import (
"encoding/json"
"github.com/foolin/pagser"
"log"
)
const rawPageHtml = `
<!doctype html>
<html>
<head>
<meta charset="utf-8">
<title>Pagser Title</title>
<meta name="keywords" content="golang,pagser,goquery,html,page,parser,colly">
</head>
<body>
<h1>H1 Pagser Example</h1>
<div class="navlink">
<div class="container">
<ul class="clearfix">
<li id=''><a href="/">Index</a></li>
<li id='2'><a href="/list/web" title="web site">Web page</a></li>
<li id='3'><a href="/list/pc" title="pc page">Pc Page</a></li>
<li id='4'><a href="/list/mobile" title="mobile page">Mobile Page</a></li>
</ul>
</div>
</div>
</body>
</html>
`
type PageData struct {
Title string `pagser:"title"`
Keywords []string `pagser:"meta[name='keywords']->attrSplit(content)"`
H1 string `pagser:"h1"`
Navs []struct {
ID int `pagser:"->attrEmpty(id, -1)"`
Name string `pagser:"a->text()"`
Url string `pagser:"a->attr(href)"`
} `pagser:".navlink li"`
}
func main() {
//New default config
p := pagser.New()
//data parser model
var data PageData
//parse html data
err := p.Parse(&data, rawPageHtml)
//check error
if err != nil {
log.Fatal(err)
}
//print data
log.Printf("Page data json: \n-------------\n%v\n-------------\n", toJson(data))
}
func toJson(v interface{}) string {
data, _ := json.MarshalIndent(v, "", "\t")
return string(data)
}
Run output:
Page data json:
-------------
{
"Title": "Pagser Title",
"Keywords": [
"golang",
"pagser",
"goquery",
"html",
"page",
"parser",
"colly"
],
"H1": "H1 Pagser Example",
"Navs": [
{
"ID": -1,
"Name": "Index",
"Url": "/"
},
{
"ID": 2,
"Name": "Web page",
"Url": "/list/web"
},
{
"ID": 3,
"Name": "Pc Page",
"Url": "/list/pc"
},
{
"ID": 4,
"Name": "Mobile Page",
"Url": "/list/mobile"
}
]
}
-------------
type Config struct {
TagName string //struct tag name, default is `pagser`
FuncSymbol string //Function symbol, default is `->`
Debug bool //Debug mode, debug will print some log, default is `false`
}
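To illustrate what TagName and FuncSymbol control, here is a minimal sketch that uses only the standard library. It is not pagser's implementation: the `fieldRules` helper, the `Page` model, and the `query` / `=>` values are hypothetical, and the local `Config` type only mirrors the two string fields shown above.

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// Config mirrors the TagName and FuncSymbol fields of the Config shown above.
type Config struct {
	TagName    string
	FuncSymbol string
}

// Page is a hypothetical model that uses a custom tag name and symbol.
type Page struct {
	Title string `query:"title"`
	Link  string `query:"a=>attr(href)"`
}

// fieldRules returns, per field name, the selector and the function call
// found in the struct tag named cfg.TagName, split at cfg.FuncSymbol.
func fieldRules(v interface{}, cfg Config) map[string][2]string {
	rules := make(map[string][2]string)
	t := reflect.TypeOf(v)
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		tag, ok := f.Tag.Lookup(cfg.TagName)
		if !ok {
			continue // no tag under the configured name
		}
		selector, call := tag, ""
		if idx := strings.Index(tag, cfg.FuncSymbol); idx >= 0 {
			selector, call = tag[:idx], tag[idx+len(cfg.FuncSymbol):]
		}
		rules[f.Name] = [2]string{strings.TrimSpace(selector), strings.TrimSpace(call)}
	}
	return rules
}

func main() {
	rules := fieldRules(Page{}, Config{TagName: "query", FuncSymbol: "=>"})
	fmt.Printf("Title: %q\n", rules["Title"])
	fmt.Printf("Link:  %q\n", rules["Link"])
}
```

Changing either setting only changes how the tags are written; the selector and function parts keep the same meaning.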
[goquery selector]->[function]
Example:
type ExamData struct {
	Href string `pagser:".navLink li a->attr(href)"`
}
1. Struct tag name: pagser
2. goquery selector: .navLink li a
3. Function symbol: ->
4. Function name: attr
5. Function arguments: href
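The breakdown above can be sketched as a small parser. This is an illustration under assumptions, not pagser's actual parsing code: the `Rule` type and `parseRule` function are hypothetical names, and accepting a bare function name without parentheses is a choice made only for this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// Rule is a hypothetical holder for the parts of one struct tag value.
type Rule struct {
	Selector string // goquery selector, may be empty
	FuncName string // function name; "text" when the tag names none
	RawArgs  string // text between the parentheses, not yet split
}

// parseRule splits "selector->name(args)" at the first function symbol.
func parseRule(tag, symbol string) Rule {
	r := Rule{Selector: strings.TrimSpace(tag), FuncName: "text"}
	idx := strings.Index(tag, symbol)
	if idx < 0 {
		return r // selector only: keep the default function
	}
	r.Selector = strings.TrimSpace(tag[:idx])
	call := strings.TrimSpace(tag[idx+len(symbol):])
	if call == "" {
		return r
	}
	open := strings.Index(call, "(")
	if open < 0 || !strings.HasSuffix(call, ")") {
		r.FuncName = call // sketch-only choice: accept a bare name
		return r
	}
	r.FuncName = strings.TrimSpace(call[:open])
	r.RawArgs = call[open+1 : len(call)-1]
	return r
}

func main() {
	fmt.Printf("%+v\n", parseRule(".navLink li a->attr(href)", "->"))
	fmt.Printf("%+v\n", parseRule("->attrEmpty(id, -1)", "->"))
	fmt.Printf("%+v\n", parseRule("title", "->"))
}
```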
- text() gets the element text and returns a string. This is the default function when the struct tag does not name one.
- eachText() gets the text of each element and returns []string.
- html() gets the element's inner HTML and returns a string.
- eachHtml() gets the inner HTML of each element and returns []string.
- outerHtml() gets the element's outer HTML and returns a string.
- eachOutHtml() gets the outer HTML of each element and returns []string.
- attr(name) gets the value of the named attribute and returns a string.
- eachAttr() gets the attribute value of each element and returns []string.
- attrSplit(name, sep) gets the attribute value, splits it by the separator, and returns []string.
- attr('value') gets the value of the attribute named value and returns a string, e.g. <input value='xxx' /> returns "xxx".
- textSplit(sep) gets the element text, splits it by the separator, and returns []string.
- eachTextJoin(sep) gets the text of each element, joins it with the separator, and returns a string.
- eq(index) reduces the set of matched elements to the one at the specified index and returns a Selection for a nested struct.
- ...
For more built-in functions, see the docs: https://pkg.go.dev/github.com/foolin/pagser?tab=doc#BuiltinFunctions
- Markdown() //convert html to markdown format.
- UgcHtml() //sanitize html
Extension functions must be registered before use, for example:
import "github.com/foolin/pagser/extensions/markdown"
p := pagser.New()
//Register Markdown
markdown.Register(p)
type CallFunc func(node *goquery.Selection, args ...string) (out interface{}, err error)
// A global function must be registered with p.RegisterFunc("MyGlob", MyGlobalFunc) before it is used.
func MyGlobalFunc(node *goquery.Selection, args ...string) (out interface{}, err error) {
return "Global-" + node.Text(), nil
}
type PageData struct{
MyGlobalValue string `pagser:"->MyGlob()"`
}
func main(){
p := pagser.New()
//Register global function `MyGlob`
p.RegisterFunc("MyGlob", MyGlobalFunc)
//Todo
//data parser model
var data PageData
//parse html data
err := p.Parse(&data, rawPageHtml)
//...
}
type PageData struct {
	// MyFunc returns a string, so the field is a string as well.
	MyFuncValue string `pagser:"->MyFunc()"`
}

// This method is called automatically; it does not need to be registered.
func (d PageData) MyFunc(node *goquery.Selection, args ...string) (out interface{}, err error) {
	return "Struct-" + node.Text(), nil
}
func main(){
p := pagser.New()
//Todo
//data parser model
var data PageData
//parse html data
err := p.Parse(&data, rawPageHtml)
//...
}
Note: all function arguments are strings; single quotes are optional.
- Function call with no arguments
->fn()
- Function calls with one argument, and single quotes are optional
->fn(one)
->fn('one')
- Function calls with many arguments
->fn(one, two, three, ...)
->fn('one', 'two', 'three', ...)
- Function calls with single quotes and escape character
->fn('it\'s ok', 'two,xxx', 'three', ...)
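The quoting rules listed above can be sketched as a small tokenizer. This is a standard-library illustration written for this README, not pagser's own argument parser; `splitArgs` is a hypothetical name, and dropping anything between a closing quote and the next comma is a simplification made here.

```go
package main

import (
	"fmt"
	"strings"
)

// splitArgs splits a raw argument list on commas, honoring single quotes
// and the backslash escape inside them.
func splitArgs(raw string) []string {
	if strings.TrimSpace(raw) == "" {
		return nil // fn() has no arguments
	}
	var args []string
	var cur strings.Builder
	inQuote, quoted, escaped := false, false, false
	flush := func() {
		arg := cur.String()
		if !quoted {
			arg = strings.TrimSpace(arg) // unquoted arguments are trimmed
		}
		args = append(args, arg)
		cur.Reset()
		quoted = false
	}
	for _, ch := range raw {
		switch {
		case escaped:
			cur.WriteRune(ch) // character after a backslash is literal
			escaped = false
		case inQuote && ch == '\\':
			escaped = true
		case ch == '\'':
			if !inQuote {
				cur.Reset() // drop whitespace before the opening quote
			}
			inQuote = !inQuote
			quoted = true
		case ch == ',' && !inQuote:
			flush()
		case quoted && !inQuote:
			// ignore anything between a closing quote and the next comma
		default:
			cur.WriteRune(ch)
		}
	}
	flush()
	return args
}

func main() {
	fmt.Printf("%q\n", splitArgs("one, two, three"))
	fmt.Printf("%q\n", splitArgs(`'it\'s ok', 'two,xxx', 'three'`))
}
```

Quoting matters mainly when an argument contains a comma or leading and trailing spaces, since unquoted arguments are split and trimmed in this sketch.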
Function lookup priority order:
struct method -> parent method -> ... -> global
See the advanced example: https://github.com/foolin/pagser/tree/master/_examples/advance
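The priority order can be pictured with the following sketch. It is an assumption about how such a lookup might be written with reflection, not pagser's code: `lookup`, `Parent`, and `Child` are hypothetical, and the node is a plain string here so that the example does not depend on goquery.

```go
package main

import (
	"fmt"
	"reflect"
)

// CallFunc is a simplified stand-in for pagser's CallFunc: the node is a
// string here instead of *goquery.Selection.
type CallFunc func(node string, args ...string) (interface{}, error)

// Parent and Child are hypothetical nested models.
type Parent struct{}
type Child struct{}

// Shared exists on both levels; the innermost struct should win.
func (Child) Shared(node string, args ...string) (interface{}, error) {
	return "child:" + node, nil
}

func (Parent) Shared(node string, args ...string) (interface{}, error) {
	return "parent:" + node, nil
}

func (Parent) OnlyParent(node string, args ...string) (interface{}, error) {
	return "parent:" + node, nil
}

// lookup walks the struct values from innermost to outermost, then falls
// back to the global registry.
func lookup(name string, chain []interface{}, globals map[string]CallFunc) (CallFunc, bool) {
	for _, owner := range chain {
		m := reflect.ValueOf(owner).MethodByName(name)
		if !m.IsValid() {
			continue
		}
		if fn, ok := m.Interface().(func(string, ...string) (interface{}, error)); ok {
			return fn, true
		}
	}
	fn, ok := globals[name]
	return fn, ok
}

func main() {
	globals := map[string]CallFunc{
		"Glob": func(node string, args ...string) (interface{}, error) {
			return "global:" + node, nil
		},
	}
	chain := []interface{}{Child{}, Parent{}}
	for _, name := range []string{"Shared", "OnlyParent", "Glob"} {
		if fn, ok := lookup(name, chain, globals); ok {
			out, _ := fn("x")
			fmt.Println(name, "=>", out)
		}
	}
}
```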
Automatic implicit type conversion: the output string is converted to int, int64, float64...
Supported types:
- bool
- float32
- float64
- int
- int32
- int64
- string
- []bool
- []float32
- []float64
- []int
- []int32
- []int64
- []string
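As a rough picture of what converting an output string to a field type involves, here is a sketch built on `reflect` and `strconv`. It is not how pagser performs the conversion internally; `setFromString` and `setSliceFromStrings` are hypothetical helpers that cover only the types listed above.

```go
package main

import (
	"fmt"
	"reflect"
	"strconv"
)

// setFromString converts s to the type dst points to and stores it there.
func setFromString(dst interface{}, s string) error {
	v := reflect.ValueOf(dst)
	if v.Kind() != reflect.Ptr || v.IsNil() {
		return fmt.Errorf("dst must be a non-nil pointer")
	}
	v = v.Elem()
	switch v.Kind() {
	case reflect.String:
		v.SetString(s)
	case reflect.Bool:
		b, err := strconv.ParseBool(s)
		if err != nil {
			return err
		}
		v.SetBool(b)
	case reflect.Int, reflect.Int32, reflect.Int64:
		n, err := strconv.ParseInt(s, 10, v.Type().Bits())
		if err != nil {
			return err
		}
		v.SetInt(n)
	case reflect.Float32, reflect.Float64:
		f, err := strconv.ParseFloat(s, v.Type().Bits())
		if err != nil {
			return err
		}
		v.SetFloat(f)
	default:
		return fmt.Errorf("unsupported kind %s", v.Kind())
	}
	return nil
}

// setSliceFromStrings fills the slice dst points to, converting each item.
func setSliceFromStrings(dst interface{}, items []string) error {
	v := reflect.ValueOf(dst)
	if v.Kind() != reflect.Ptr || v.IsNil() || v.Elem().Kind() != reflect.Slice {
		return fmt.Errorf("dst must be a non-nil pointer to a slice")
	}
	out := reflect.MakeSlice(v.Elem().Type(), len(items), len(items))
	for i, s := range items {
		if err := setFromString(out.Index(i).Addr().Interface(), s); err != nil {
			return err
		}
	}
	v.Elem().Set(out)
	return nil
}

func main() {
	var id int
	var ids []int64
	fmt.Println(setFromString(&id, "42"), id)
	fmt.Println(setSliceFromStrings(&ids, []string{"1", "2"}), ids)
}
```

A string that cannot be parsed as the target type, such as "abc" for an int field, produces an error in this sketch rather than a zero value.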
package main
import (
"encoding/json"
"github.com/foolin/pagser"
"log"
"net/http"
)
type PageData struct {
Title string `pagser:"title"`
RepoList []struct {
Names []string `pagser:"h1->textSplit('/', true)"`
Description string `pagser:"h1 + p"`
Stars string `pagser:"a.muted-link->eqAndText(0)"`
Repo string `pagser:"h1 a->attrConcat('href', 'https://github.com', $value, '?from=pagser')"`
} `pagser:"article.Box-row"`
}
func main() {
resp, err := http.Get("https://github.com/trending")
if err != nil {
log.Fatal(err)
}
defer resp.Body.Close()
//New default config
p := pagser.New()
//data parser model
var data PageData
//parse html data
err = p.ParseReader(&data, resp.Body)
//check error
if err != nil {
log.Fatal(err)
}
//print data
log.Printf("Page data json: \n-------------\n%v\n-------------\n", toJson(data))
}
func toJson(v interface{}) string {
data, _ := json.MarshalIndent(v, "", "\t")
return string(data)
}
Run output:
2020/04/25 12:26:04 Page data json:
-------------
{
"Title": "Trending repositories on GitHub today · GitHub",
"RepoList": [
{
"Names": [
"pcottle",
"learnGitBranching"
],
"Description": "An interactive git visualization to challenge and educate!",
"Stars": "16,010",
"Repo": "https://github.com/pcottle/learnGitBranching?from=pagser"
},
{
"Names": [
"jackfrued",
"Python-100-Days"
],
"Description": "Python - 100天从新手到大师",
"Stars": "83,484",
"Repo": "https://github.com/jackfrued/Python-100-Days?from=pagser"
},
{
"Names": [
"brave",
"brave-browser"
],
"Description": "Next generation Brave browser for macOS, Windows, Linux, Android.",
"Stars": "5,963",
"Repo": "https://github.com/brave/brave-browser?from=pagser"
},
{
"Names": [
"MicrosoftDocs",
"azure-docs"
],
"Description": "Open source documentation of Microsoft Azure",
"Stars": "3,798",
"Repo": "https://github.com/MicrosoftDocs/azure-docs?from=pagser"
},
{
"Names": [
"ahmetb",
"kubectx"
],
"Description": "Faster way to switch between clusters and namespaces in kubectl",
"Stars": "6,979",
"Repo": "https://github.com/ahmetb/kubectx?from=pagser"
},
//...
{
"Names": [
"serverless",
"serverless"
],
"Description": "Serverless Framework – Build web, mobile and IoT applications with serverless architectures using AWS Lambda, Azure Functions, Google CloudFunctions \u0026 more! –",
"Stars": "35,502",
"Repo": "https://github.com/serverless/serverless?from=pagser"
},
{
"Names": [
"vuejs",
"vite"
],
"Description": "Experimental no-bundle dev server for Vue SFCs",
"Stars": "1,573",
"Repo": "https://github.com/vuejs/vite?from=pagser"
}
]
}
-------------
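Two of the calls in the example above, `textSplit('/', true)` and `attrConcat('href', 'https://github.com', $value, '?from=pagser')`, can be approximated with plain string handling. The sketch below is inferred from the output shown, not taken from pagser's source: `splitText` and `concatValue` are hypothetical names, the second `textSplit` argument is assumed to mean "trim each piece", and the relative href is an assumed input.

```go
package main

import (
	"fmt"
	"strings"
)

// splitText approximates textSplit(sep, trim): split the text, then
// optionally trim whitespace from each piece.
func splitText(text, sep string, trim bool) []string {
	parts := strings.Split(text, sep)
	if trim {
		for i, p := range parts {
			parts[i] = strings.TrimSpace(p)
		}
	}
	return parts
}

// concatValue approximates attrConcat: join the parts in order and put the
// attribute value wherever the "$value" placeholder appears.
func concatValue(value string, parts ...string) string {
	var b strings.Builder
	for _, p := range parts {
		if p == "$value" {
			b.WriteString(value)
			continue
		}
		b.WriteString(p)
	}
	return b.String()
}

func main() {
	// Assumed inputs: heading text with extra whitespace and a relative href.
	fmt.Printf("%q\n", splitText("pcottle /\n learnGitBranching", "/", true))
	fmt.Println(concatValue("/pcottle/learnGitBranching", "https://github.com", "$value", "?from=pagser"))
}
```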
Work with colly:
p := pagser.New()
// Call the callback for every element that matches "body"
collector.OnHTML("body", func(e *colly.HTMLElement) {
	//data parser model
	var data PageData
	//parse html data
	err := p.ParseSelection(&data, e.DOM)
	//check error
	if err != nil {
		log.Println(err)
		return
	}
	//use data here
})
- github.com/PuerkitoBio/goquery
- github.com/spf13/cast
Extensions:
- github.com/mattn/godown
- github.com/microcosm-cc/bluemonday