pagodo automates Google searching for potentially vulnerable web pages and applications on the Internet. It replaces manually performing Google dork searches with a web GUI browser.
There are 2 parts. The first is ghdb_scraper.py, which retrieves the latest Google dorks, and the second is pagodo.py, which leverages the information gathered by ghdb_scraper.py.
The core Google search library now uses the more flexible yagooglesearch instead of googlesearch. Check out the yagooglesearch README for a more in-depth explanation of the library differences and capabilities.
This version of pagodo also has native HTTP(S) and SOCKS5 proxy support, so there is no more wrapping it in a tool like proxychains4 if you need proxies. You can specify multiple proxies to use in a round-robin fashion by providing a comma separated string of proxies using the -p switch.
Offensive Security maintains the Google Hacking Database (GHDB) found here: https://www.exploit-db.com/google-hacking-database. It is a collection of Google searches, called dorks, that can be used to find potentially vulnerable boxes or other juicy info that is picked up by Google's search bots.
The terms and conditions for pagodo are the same terms and conditions found in yagooglesearch.
This code is supplied as-is and you are fully responsible for how it is used. Scraping Google Search results may violate their Terms of Service. Another Python Google search library had some interesting information/discussion on it:
- Original issue
- A response
- Author created a separate Terms and Conditions
- ...that contained link to this blog
Google's preferred method is to use their API.
Scripts are written for Python 3.6+. Clone the git repository and install the requirements.
git clone https://github.com/opsdisk/pagodo.git
cd pagodo
python3 -m venv .venv # If using a virtual environment.
source .venv/bin/activate # If using a virtual environment.
pip install -r requirements.txt
To start off, pagodo.py needs a list of all the current Google dorks. The repo contains a dorks/ directory with the dorks that were current when ghdb_scraper.py was last run. It's advised to run ghdb_scraper.py to get the freshest data before running pagodo.py. The dorks/ directory contains:
- the all_google_dorks.txt file which contains all the Google dorks, one per line
- the all_google_dorks.json file which is the JSON response from GHDB
- Individual category dorks
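Because all_google_dorks.txt holds one dork per line, it can be read with a few lines of Python. The sketch below is illustrative only: load_dorks is a hypothetical helper name, not part of pagodo, and dropping blank lines is an assumption.

```python
from pathlib import Path


def load_dorks(path):
    """Return the non-empty lines of a dork file, one dork per list entry."""
    text = Path(path).read_text(encoding="utf-8")
    # Strip surrounding whitespace and drop blank lines (an assumption).
    return [line.strip() for line in text.splitlines() if line.strip()]
```

For example, load_dorks("dorks/all_google_dorks.txt") would return the dorks as a list of strings.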
Dork categories:
categories = {
    1: "Footholds",
    2: "File Containing Usernames",
    3: "Sensitive Directories",
    4: "Web Server Detection",
    5: "Vulnerable Files",
    6: "Vulnerable Servers",
    7: "Error Messages",
    8: "File Containing Juicy Info",
    9: "File Containing Passwords",
    10: "Sensitive Online Shopping Info",
    11: "Network or Vulnerability Data",
    12: "Pages Containing Login Portals",
    13: "Various Online devices",
    14: "Advisories and Vulnerabilities",
}
Write all dorks to all_google_dorks.txt, all_google_dorks.json, and individual categories if you want more contextual data about each dork.
python ghdb_scraper.py -s -j -i
The ghdb_scraper.retrieve_google_dorks() function returns a dictionary with the following data structure:
ghdb_dict = {
    "total_dorks": total_dorks,
    "extracted_dorks": extracted_dorks,
    "category_dict": category_dict,
}
Using a Python shell (like python or ipython) to explore the data:
import ghdb_scraper
dorks = ghdb_scraper.retrieve_google_dorks(save_all_dorks_to_file=True)
dorks.keys()
dorks["total_dorks"]
dorks["extracted_dorks"]
dorks["category_dict"].keys()
dorks["category_dict"][1]["category_name"]
python pagodo.py -d example.com -g dorks.txt
The pagodo.Pagodo.go() function returns a dictionary with the data structure below (the dorks used are made-up examples):
{
    "dorks": {
        "inurl:admin": {
            "urls_size": 3,
            "urls": [
                "https://github.com/marmelab/ng-admin",
                "https://github.com/settings/admin",
                "https://github.com/akveo/ngx-admin",
            ],
        },
        "inurl:gist": {
            "urls_size": 3,
            "urls": [
                "https://gist.github.com/",
                "https://gist.github.com/index",
                "https://github.com/defunkt/gist",
            ],
        },
    },
    "initiation_timestamp": "2021-08-27T11:35:30.638705",
    "completion_timestamp": "2021-08-27T11:36:42.349035",
}
Using a Python shell (like python or ipython) to explore the data:
import pagodo

pg = pagodo.Pagodo(
    google_dorks_file="dorks.txt",
    domain="github.com",
    max_search_result_urls_to_return_per_dork=3,
    save_pagodo_results_to_json_file=None,  # None = Auto-generate file name, otherwise pass a string for path and filename.
    save_urls_to_file=None,  # None = Auto-generate file name, otherwise pass a string for path and filename.
    verbosity=5,
)
pagodo_results_dict = pg.go()

pagodo_results_dict.keys()
pagodo_results_dict["initiation_timestamp"]
pagodo_results_dict["completion_timestamp"]

# Print every URL found, grouped by the dork that returned it.
for key, value in pagodo_results_dict["dorks"].items():
    print(f"dork: {key}")
    for url in value["urls"]:
        print(url)
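The same URL can be returned by more than one dork, so it can help to flatten the results into a de-duplicated list. The sketch below assumes only the dictionary shape documented above; collect_unique_urls is a hypothetical helper, not part of pagodo, and the sample data is made up.

```python
def collect_unique_urls(results):
    """Return a sorted list of the unique URLs found across all dorks."""
    urls = set()
    for dork_data in results["dorks"].values():
        urls.update(dork_data["urls"])
    return sorted(urls)


# Made-up sample in the same shape as the dictionary returned by go().
sample = {
    "dorks": {
        "inurl:admin": {
            "urls_size": 2,
            "urls": [
                "https://github.com/settings/admin",
                "https://github.com/akveo/ngx-admin",
            ],
        },
        "inurl:gist": {
            "urls_size": 2,
            "urls": [
                "https://gist.github.com/",
                "https://github.com/settings/admin",
            ],
        },
    },
}

for url in collect_unique_urls(sample):
    print(url)
```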
The -d switch can be used to scope the results to a specific domain and functions as the Google search operator:
site:github.com
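The effect of -d on a dork can be pictured as appending a site: operator to the query. This is a sketch of the resulting query string only; scope_dork is a hypothetical name, and how pagodo assembles the query internally is not shown here.

```python
def scope_dork(dork, domain=None):
    """Append a site: operator to a dork when a domain is given."""
    if domain:
        return f"{dork} site:{domain}"
    return dork


print(scope_dork("inurl:admin", "github.com"))
```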
-i - Specify the minimum delay between dork searches, in seconds. Don't make this too small, or your IP will get HTTP 429'd quickly.
-x - Specify the maximum delay between dork searches, in seconds. Don't make this too big or the searches will take a long time.
The values provided by -i and -x are used to generate a list of 20 random wait times. One of them is randomly selected as the delay between each Google dork search.
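The delay scheme described above can be sketched as follows. This is an illustration under assumptions, not pagodo's actual code: build_delay_pool is a hypothetical name, the one-decimal rounding is a choice made here, and the 37 and 60 second bounds are arbitrary example values.

```python
import random


def build_delay_pool(minimum, maximum, size=20):
    """Return `size` random delays, in seconds, between minimum and maximum."""
    if minimum > maximum:
        raise ValueError("minimum delay must not exceed maximum delay")
    return [round(random.uniform(minimum, maximum), 1) for _ in range(size)]


delays = build_delay_pool(37.0, 60.0)
pause = random.choice(delays)  # Delay to wait before the next dork search.
print(f"sleeping {pause} seconds before the next dork")
```

Drawing from a fixed pool keeps every delay inside the requested bounds while still varying the gap between searches.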
-m - The total max search results to return per Google dork. Each Google search request can pull back at most 100 results at a time, so if you pick -m 500, 5 separate search queries will have to be made for each Google dork search, which will increase the amount of time to complete.
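The arithmetic behind -m is a ceiling division by the 100-results-per-request limit mentioned above. A minimal sketch, with requests_needed as a hypothetical helper name:

```python
import math

RESULTS_PER_REQUEST = 100  # Per-request ceiling stated in the text above.


def requests_needed(max_results):
    """Return how many search requests are needed per dork for -m max_results."""
    return math.ceil(max_results / RESULTS_PER_REQUEST)


print(requests_needed(500))
```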
-o [optional/path/to/results.json] - Save output to a JSON file. If you do not specify a filename, a datetimestamped one will be generated.
-s [optional/path/to/results.txt] - Save URLs to a text file. If you do not specify a filename, a datetimestamped one will be generated.
--log [optional/path/to/file.log] - Save logs to the specified file. If you do not specify a filename, the default file pagodo.py.log at the root of the pagodo directory will be used.
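A datetimestamped filename of the kind -o and -s fall back to could be built as shown below. The prefix and timestamp format are assumptions made for illustration; the names pagodo actually generates may differ.

```python
from datetime import datetime


def timestamped_name(prefix, extension, now=None):
    """Build a filename such as <prefix>_<YYYYMMDD_HHMMSS>.<extension>."""
    now = now or datetime.now()
    return f"{prefix}_{now.strftime('%Y%m%d_%H%M%S')}.{extension}"


print(timestamped_name("pagodo_results", "json"))
```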
Performing 7300+ search requests to Google as fast as possible will simply not work. Google will rightfully detect it as a bot and block your IP for a set period of time. One solution is to use a bank of HTTP(S)/SOCKS proxies and pass them to pagodo as a comma separated string using the -p switch.
python pagodo.py -g dorks.txt -p http://myproxy:8080,socks5h://127.0.0.1:9050,socks5h://127.0.0.1:9051
You could even decrease the -i and -x values because you will be leveraging different proxy IPs. The proxies passed to pagodo are selected by round robin.
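Round-robin selection over a comma separated proxy string can be sketched with itertools.cycle. This illustrates the idea only; proxy_rotation is a hypothetical helper and is not taken from pagodo's source.

```python
from itertools import cycle


def proxy_rotation(proxy_string):
    """Return an endless round-robin iterator over a comma separated proxy string."""
    proxies = [p.strip() for p in proxy_string.split(",") if p.strip()]
    if not proxies:
        raise ValueError("no proxies supplied")
    return cycle(proxies)


rotation = proxy_rotation(
    "http://myproxy:8080,socks5h://127.0.0.1:9050,socks5h://127.0.0.1:9051"
)
for _ in range(4):
    print(next(rotation))  # The fourth value wraps back to the first proxy.
```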
Another solution is to use proxychains4 to round robin the lookups.
Install proxychains4
apt install proxychains4 -y
Edit the /etc/proxychains4.conf configuration file to round robin the lookups through different proxy servers. In the example below, 2 different dynamic SOCKS proxies have been set up with different local listening ports (9050 and 9051).
vim /etc/proxychains4.conf
round_robin
chain_len = 1
proxy_dns
remote_dns_subnet 224
tcp_read_time_out 15000
tcp_connect_time_out 8000
[ProxyList]
socks4 127.0.0.1 9050
socks4 127.0.0.1 9051
Throw proxychains4 in front of the pagodo.py script and each request lookup will go through a different proxy (and thus source from a different IP).
proxychains4 python pagodo.py -g dorks/all_google_dorks.txt -o [optional/path/to/results.json] -s [optional/path/to/results.txt]
Note that this may not appear natural to Google if you:
- Simulate "browsing" to google.com from IP #1
- Make the first search query from IP #2
- Simulate clicking "Next" to make the second search query from IP #3
- Simulate clicking "Next" to make the third search query from IP #1
For that reason, using the built-in -p proxy support is preferred because, as stated in the yagooglesearch documentation, the "provided proxy is used for the entire life cycle of the search to make it look more human, instead of rotating through various proxies for different portions of the search."
Distributed under the GNU General Public License v3.0. See LICENSE for more information.
Project Link: https://github.com/opsdisk/pagodo