GitHub - MohammadRaziei/liburlparser: Fastest domain extractor library written in C++ with python binding.

The first complete library for parsing URLs from C++, Python, and the command line.

About The Project

liburlparser is a powerful domain extractor library written in C++ with Python bindings. It provides efficient URL parsing capabilities for both C++ and Python, making it a valuable tool for projects that involve working with web addresses.

Features

Here are some key features of liburlparser:

Multiple Language Support:
- liburlparser can be used in multiple programming languages, including Python, C++, and Shell.
- It offers an intuitive interface that remains consistent across both C++ and Python.
Clean Code Design:
- The library provides two separate classes: Url and Host.
- This separation allows for cleaner and more organized code when dealing with URLs.
Public Suffix List Support:
- liburlparser supports known multi-label suffixes (e.g., "ac.ir") using the public_suffix_list.
- It can also handle unknown suffixes (e.g., "comm" in "google.comm").
Automatic Public Suffix List Updates:
- Before each build and deployment, liburlparser updates the public_suffix_list automatically.
Host Properties:
- The Host class includes properties such as subdomain, domain, domain name, and suffix.
URL Properties:
- The Url class provides properties like protocol, userinfo, host (and all host properties), port, path, query parameters, and fragment.
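
To make the suffix handling above concrete, here is a minimal pure-Python sketch of longest-suffix matching with a fallback for unknown suffixes. It is not liburlparser's implementation: `KNOWN_SUFFIXES` and `split_host` are hypothetical names, the suffix set is a toy stand-in for the real public suffix list (whose wildcard and exception rules are ignored here), and the meaning given to "domain" versus "domain name" is an assumption based on the feature list.

```python
# Toy suffix set for illustration only; the real public suffix list is far larger.
KNOWN_SUFFIXES = {"com", "ir", "ac.ir", "uk", "co.uk"}


def split_host(host):
    """Split a host into subdomain, domain, domain name and suffix (hypothetical helper)."""
    labels = host.lower().strip(".").split(".")
    if len(labels) == 1:
        # A single label such as "localhost" has no suffix at all.
        return {"subdomain": "", "domain": labels[0], "domain_name": labels[0], "suffix": ""}
    # Try candidate suffixes from longest to shortest, always leaving
    # at least one label in front of the suffix for the domain name.
    for start in range(1, len(labels)):
        if ".".join(labels[start:]) in KNOWN_SUFFIXES:
            break
    else:
        # Unknown suffix (e.g. "comm" in "google.comm"): use the last label.
        start = len(labels) - 1
    suffix = ".".join(labels[start:])
    domain_name = labels[start - 1]
    return {
        "subdomain": ".".join(labels[: start - 1]),
        "domain": f"{domain_name}.{suffix}",
        "domain_name": domain_name,
        "suffix": suffix,
    }


print(split_host("ee.aut.ac.ir"))
print(split_host("google.comm"))
```

Trying the longest candidate first is what lets "ac.ir" win over the shorter "ir" for the same host.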

Usage

Command Line

python -m liburlparser --help # show help section
python -m liburlparser --version # show version
python -m liburlparser --url "https://mail.google.com/about" | jq # return as JSON
python -m liburlparser --host "mail.google.com" | jq # return as JSON

Python

liburlparser is intuitive to use.

Every class has a help section:

import liburlparser
help(liburlparser)
print(liburlparser.__version__)

from liburlparser import Url, Host
help(Url)
help(Host)

Parse a URL and a host:

from liburlparser import Url, Host
# Parse a URL:
url = Url("https://ee.aut.ac.ir/#id")  # parses every part of the URL
print(url, url.suffix, url.domain, url.fragment, url.host, url.to_dict(), url.to_json())
# Parse a host:
host = url.host  # ee.aut.ac.ir
# or
host = Host("ee.aut.ac.ir")
# or
host = Host.from_url("https://ee.aut.ac.ir/#id")  # the fastest way to parse the host from a URL
# Each of these returns a Host object with the host part already parsed.
print(host, host.domain, host.suffix, host.to_dict(), host.to_json())

There are also some helper APIs that give better performance for small tasks:

# If you only need the host of a URL as a string, without any parsing:
host_str = Url.extract_host("https://ee.aut.ac.ir/about")  # very fast
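
For cross-checking results, the same "host as a plain string" idea can be sketched with Python's standard library alone. This is only a reference point, not how liburlparser works internally, and `host_string` is a hypothetical helper name. Note that `urllib.parse.urlsplit` only recognises a host when the URL has a scheme (or starts with `//`), which may differ from liburlparser's behaviour.

```python
from urllib.parse import urlsplit


def host_string(url):
    """Return the host of a URL as a lowercase string, or "" if none is found."""
    # .hostname drops any userinfo and port, and is None when there is no netloc.
    return urlsplit(url).hostname or ""


print(host_string("https://ee.aut.ac.ir/about"))
print(host_string("https://user:secret@mail.google.com:8443/about?x=1#top"))
```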

If you are a fan of pydomainextractor, liburlparser offers a similar interface:

import pydomainextractor
extractor = pydomainextractor.DomainExtractor()
extractor.extract("ee.aut.ac.ir") # from host
extractor.extract_from_url("https://ee.aut.ac.ir/about") # from url

# Alternatively, you can use:
from liburlparser import Host
Host.extract("ee.aut.ac.ir") # from host
Host.extract_from_url("https://ee.aut.ac.ir/about") # from url
# As you can see, the API is the same.

C++

There are some examples in the examples folder.

#include <string>
#include "urlparser.h"
...
/// Parse a URL
TLD::Url url("https://ee.aut.ac.ir/about");
std::string domain = url.domain(); // likewise for subdomain, port, params, ...
/// Parse a host (three alternatives, named differently so they can coexist)
TLD::Host host("ee.aut.ac.ir");
// or
TLD::Host hostFromUrlObject = url.host();
// or
TLD::Host hostFromUrlString = TLD::Host::fromUrl("https://ee.aut.ac.ir/about");

As you can see, the methods available in Python can be used just as easily in C++.

Installation

C++:

Build steps:

git clone https://github.com/mohammadraziei/liburlparser
cd liburlparser
mkdir -p build; cd build
cmake ..
# Build the project:
make
# [Optional] Run tests:
make test
# [Optional] Build the documentation:
make docs
# [Optional] Run examples:
./example
# Install:
sudo make install

Python and Command Line:

Note that Python >= 3.8 is required.

Installation

pip by pypi

pip install liburlparser

If you want to use psl.update to update the public suffix list, you must install the online version:

pip install "liburlparser[online]"

Or

pip by git

pip install git+https://github.com/mohammadraziei/liburlparser

Or

manually

git clone https://github.com/mohammadraziei/liburlparser
pip install ./liburlparser
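
After any of the installation routes above, a quick sanity check can confirm that the interpreter meets the Python >= 3.8 requirement and that the package is importable. The sketch below uses only the standard library; `environment_report` is a hypothetical helper, not part of liburlparser.

```python
import importlib.util
import sys


def environment_report(package="liburlparser", minimum=(3, 8)):
    """Report whether the interpreter is new enough and the package can be found."""
    return {
        "python_ok": sys.version_info >= minimum,
        # find_spec returns None when a top-level package is not installed.
        "installed": importlib.util.find_spec(package) is not None,
    }


print(environment_report())
```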

Performance

Extract From Host

The test was run on a file containing 10 million random domains from various top-level domains (Mar. 13th, 2022).

Library	Function	Time
liburlparser	liburlparser.Host	1.12s
PyDomainExtractor	pydomainextractor.extract	1.50s
publicsuffix2	publicsuffix2.get_sld	9.92s
tldextract	__call__	29.23s
tld	tld.parse_tld	34.48s

Extract From URL

The test was run on a file containing 1 million random URLs (Mar. 13th, 2022).

Library	Function	Time
liburlparser	liburlparser.Host.from_url	2.10s
PyDomainExtractor	pydomainextractor.extract_from_url	2.24s
publicsuffix2	publicsuffix2.get_sld	10.84s
tldextract	__call__	36.04s
tld	tld.parse_tld	57.87s
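
The exact benchmark script is not shown here, so the following is only a sketch of how timings like those in the tables above could be collected. `time_extractor` and the stand-in extractor `last_two_labels` are hypothetical; to compare real libraries, pass their extraction callables instead, and load the inputs from your own file.

```python
import time


def time_extractor(extract, inputs, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` full passes."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for item in inputs:
            extract(item)
        best = min(best, time.perf_counter() - start)
    return best


def last_two_labels(host):
    """Stand-in extractor for the demo; it ignores the public suffix list."""
    return ".".join(host.split(".")[-2:])


hosts = ["ee.aut.ac.ir", "mail.google.com"] * 1000
elapsed = time_extractor(last_two_labels, hosts)
print(f"{len(hosts)} hosts in {elapsed:.4f}s")
```

Taking the best of several passes reduces noise from other processes; absolute numbers will still vary with hardware and input data.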
