TidierVest.jl

Simple web scraping with Julia

This library combines HTTP, Gumbo, and Cascadia to provide a simpler way to scrape data.
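
To illustrate what is being combined, here is a minimal sketch of the select-and-extract step that Gumbo and Cascadia provide on their own. It is not TidierVest's implementation. The inline HTML string and the variable names are made up for illustration, and the HTTP fetch is left out so the example runs offline.

```julia
using Gumbo      # HTML parser
using Cascadia   # CSS selectors over Gumbo trees

# Hypothetical inline page, standing in for a downloaded document.
page = """
<html><body>
  <section><h2>The Phantom Menace</h2></section>
  <section><h2>Attack of the Clones</h2></section>
</body></html>
"""

doc = parsehtml(page)                                 # Gumbo: String -> HTMLDocument
nodes = eachmatch(Selector("section h2"), doc.root)   # Cascadia: nodes matching the selector
titles = [Cascadia.nodeText(n) for n in nodes]        # text content of each matched node
```

TidierVest wraps steps like these behind `read_html`, `html_elements`, and `html_text`, as shown under Usage.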

Usage

using TidierVest

starwars = read_html("https://rvest.tidyverse.org/articles/starwars.html")

titles = html_elements(starwars, ["section", "h2"]) |> html_text3
titles
# 7-element Vector{String}:
#  "The Phantom Menace"
#  "Attack of the Clones"
#  "Revenge of the Sith"
#  ⋮
#  "Return of the Jedi"
#  "The Force Awakens"

html = read_html("https://en.wikipedia.org/w/index.php?title=The_Lego_Movie&oldid=998422565")
table = html_elements(html, ".tracklist") |> html_table
table
# 28×4 DataFrame
#  Row │ No.     Title                              Performer(s)                       Length 
#      │ String  String                             String                             String 
# ─────┼──────────────────────────────────────────────────────────────────────────────────────
#    1 │ 1.      "Everything Is Awesome"            Tegan and Sara featuring The Lon…  2:43   
#    2 │ 2.      "Prologue"                                                            2:28   
#    3 │ 3.      "Emmett's Morning"                                                    2:00   
#    4 │ 4.      "Emmett Falls in Love"                                                1:11   
#    5 │ 5.      "Escape"                                                              3:26
#   ⋮  │   ⋮                     ⋮                                  ⋮                    ⋮
#   25 │ 25.     "Everything Is Awesome"            Jo Li (Joshua Bartholomew and Li…  1:26
#   26 │ 26.     "Everything Is Awesome (unplugge…  Shawn Patterson and Sammy Allen    1:24
#   27 │ 27.     "Untitled Self Portrait"           Will Arnett                        1:08
#   28 │ 28.     "Everything Is Awesome (instrume…                                     2:41
#                                                                              19 rows omitted

Functions

`read_html`

Reads the HTML document at a URL.

`parse_html`

Parses a string into an HTML document.

`html_elements`

Gets the elements you want from an HTML document, for example by selector.

`html_text`

Gets the text of the selected elements. You can also use `html_text2` or `html_text3` for cleaner text.

`html_attrs`

Gets the content of an attribute. If no attribute name is provided, it tries to return an attribute anyway.
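
As a rough picture of what reading an attribute involves, the sketch below uses Gumbo and Cascadia directly instead of `html_attrs`, whose exact signature is not documented here. The inline HTML and variable names are hypothetical.

```julia
using Gumbo      # provides parsehtml, getattr, attrs
using Cascadia   # provides Selector and eachmatch on Gumbo trees

# Hypothetical inline page with two links.
page = """<html><body><a href="/a" class="x">A</a><a href="/b">B</a></body></html>"""

doc = parsehtml(page)
links = eachmatch(Selector("a"), doc.root)

# Read one named attribute from every matched element.
hrefs = [getattr(link, "href") for link in links]

# Without a name, Gumbo can return all attributes of an element as a Dict.
first_attrs = attrs(links[1])
```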

`html_table`

Creates a DataFrame from an HTML table node.

`html_children`

Returns the children of an HTML node.

`minimal_html`

Creates an HTML document from inline HTML.

Notes

I'm actively accepting suggestions.
