Date: September 9th, 2020
In this project, I created a simple search engine for the website https://www.lawfareblog.com. The search engine emulates Google's former search engine algorithm, PageRank.
Here I implement the WebGraph.power_method
function in pagerank.py
for computing the pagerank vector.
Part 1: Check if the power method is correctly implemented.
$ python3 pagerank.py --data=./small.csv.gz --verbose
DEBUG:root:computing indices
DEBUG:root:computing values
INFO:root:rank=0 pagerank=2.1634e+00 url=4
INFO:root:rank=1 pagerank=1.6664e+00 url=6
INFO:root:rank=2 pagerank=1.2402e+00 url=5
INFO:root:rank=3 pagerank=4.5712e-01 url=2
INFO:root:rank=4 pagerank=3.5620e-01 url=3
INFO:root:rank=5 pagerank=3.2078e-01 url=1
Part 2:
The pagerank.py
file has an option --search_query
, which takes a string as a parameter.
If this argument is used, then program returns all urls that match the query string sorted according to their pagerank.
Essentially, this gives us the most important pages on the blog related to our query. Recently, I updated
the search engine so that it is more robust. The package gensim, offers in searches by giving similar
words to the search query.
$ python3 pagerank.py --data=./lawfareblog.csv.gz --search_query='corona'
INFO:gensim.models.utils_any2vec:loading projection weights from /home/oliver/gensim-data/glove-twitter-25/glove-twitter-25.gz
INFO:gensim.models.utils_any2vec:loaded (1193514, 25) matrix from /home/oliver/gensim-data/glove-twitter-25/glove-twitter-25.gz
INFO:gensim.models.keyedvectors:precomputing L2-norms of word weight vectors
INFO:root:rank=0 pagerank=0.001003776676952839 url=www.lawfareblog.com/lawfare-podcast-united-nations-and-coronavirus-crisis
INFO:root:rank=1 pagerank=0.0008922395063564181 url=www.lawfareblog.com/house-oversight-committee-holds-day-two-hearing-government-coronavirus-response
INFO:root:rank=2 pagerank=0.0007039029151201248 url=www.lawfareblog.com/britains-coronavirus-response
INFO:root:rank=3 pagerank=0.0006915341946296394 url=www.lawfareblog.com/prosecuting-purposeful-coronavirus-exposure-terrorism
INFO:root:rank=4 pagerank=0.000670412031468004 url=www.lawfareblog.com/israeli-emergency-regulations-location-tracking-coronavirus-carriers
INFO:root:rank=5 pagerank=0.0006625585374422371 url=www.lawfareblog.com/why-congress-conducting-business-usual-face-coronavirus
INFO:root:rank=6 pagerank=0.0006504578050225973 url=www.lawfareblog.com/congressional-homeland-security-committees-seek-ways-support-state-federal-responses-coronavirus
INFO:root:rank=7 pagerank=0.0006361958803609014 url=www.lawfareblog.com/paper-hearing-experts-debate-digital-contact-tracing-and-coronavirus-privacy-concerns
INFO:root:rank=8 pagerank=0.000612482544966042 url=www.lawfareblog.com/house-subcommittee-voices-concerns-over-us-management-coronavirus
INFO:root:rank=9 pagerank=0.0006018723943270743 url=www.lawfareblog.com/livestream-house-oversight-committee-holds-hearing-government-coronavirus-response
$ python3 pagerank.py --data=./lawfareblog.csv.gz --search_query='trump'
INFO:gensim.models.utils_any2vec:loading projection weights from /home/oliver/gensim-data/glove-twitter-25/glove-twitter-25.gz
INFO:gensim.models.utils_any2vec:loaded (1193514, 25) matrix from /home/oliver/gensim-data/glove-twitter-25/glove-twitter-25.gz
INFO:gensim.models.keyedvectors:precomputing L2-norms of word weight vectors
INFO:root:rank=0 pagerank=0.005782557651400566 url=www.lawfareblog.com/trump-asks-supreme-court-stay-congressional-subpeona-tax-returns
INFO:root:rank=1 pagerank=0.005233839154243469 url=www.lawfareblog.com/document-trump-revokes-obama-executive-order-counterterrorism-strike-casualty-reporting
INFO:root:rank=2 pagerank=0.005129670724272728 url=www.lawfareblog.com/trump-administrations-worrying-new-policy-israeli-settlements
INFO:root:rank=3 pagerank=0.004659898113459349 url=www.lawfareblog.com/dc-circuit-overrules-district-courts-due-process-ruling-qasim-v-trump
INFO:root:rank=4 pagerank=0.004593398422002792 url=www.lawfareblog.com/donald-trump-and-politically-weaponized-executive-branch
INFO:root:rank=5 pagerank=0.004307133611291647 url=www.lawfareblog.com/how-trumps-approach-middle-east-ignores-past-future-and-human-condition
INFO:root:rank=6 pagerank=0.0040934765711426735 url=www.lawfareblog.com/why-trump-cant-buy-greenland
INFO:root:rank=7 pagerank=0.0037590833380818367 url=www.lawfareblog.com/oral-argument-summary-qassim-v-trump
INFO:root:rank=8 pagerank=0.003450872143730521 url=www.lawfareblog.com/dc-circuit-court-denies-trump-rehearing-mazars-case
INFO:root:rank=9 pagerank=0.0034484383650124073 url=www.lawfareblog.com/second-circuit-rules-mazars-must-hand-over-trump-tax-returns-new-york-prosecutors
$ python3 pagerank.py --data=./lawfareblog.csv.gz --search_query='iran'
INFO:gensim.models.utils_any2vec:loading projection weights from /home/oliver/gensim-data/glove-twitter-25/glove-twitter-25.gz
INFO:gensim.models.utils_any2vec:loaded (1193514, 25) matrix from /home/oliver/gensim-data/glove-twitter-25/glove-twitter-25.gz
INFO:gensim.models.keyedvectors:precomputing L2-norms of word weight vectors
INFO:root:rank=0 pagerank=0.005129670724272728 url=www.lawfareblog.com/trump-administrations-worrying-new-policy-israeli-settlements
INFO:root:rank=1 pagerank=0.004574609454721212 url=www.lawfareblog.com/praise-presidents-iran-tweets
INFO:root:rank=2 pagerank=0.004417411983013153 url=www.lawfareblog.com/how-us-iran-tensions-could-disrupt-iraqs-fragile-peace
INFO:root:rank=3 pagerank=0.004365929868072271 url=www.lawfareblog.com/haftar-attacking-tripoli-us-needs-re-engage-libya
INFO:root:rank=4 pagerank=0.0026927897706627846 url=www.lawfareblog.com/cyber-command-operational-update-clarifying-june-2019-iran-operation
INFO:root:rank=5 pagerank=0.0019391420064494014 url=www.lawfareblog.com/aborted-iran-strike-fine-line-between-necessity-and-revenge
INFO:root:rank=6 pagerank=0.0015452124644070864 url=www.lawfareblog.com/parsing-state-departments-letter-use-force-against-iran
INFO:root:rank=7 pagerank=0.0015357456868514419 url=www.lawfareblog.com/iranian-hostage-crisis-and-its-effect-american-politics
INFO:root:rank=8 pagerank=0.001525763189420104 url=www.lawfareblog.com/announcing-united-states-and-use-force-against-iran-new-lawfare-e-book
INFO:root:rank=9 pagerank=0.0014220948796719313 url=www.lawfareblog.com/us-names-iranian-revolutionary-guard-terrorist-organization-and-sanctions-international-criminal
Part 3: The webgraph of lawfareblog.com (the P matrix) naturally contains a lot of structure. For example, essentially all pages on the domain have links to the root page https://lawfareblog.com/ and similarly broad pages like https://www.lawfareblog.com/topics and https://www.lawfareblog.com/subscribe-lawfare. These pages therefore have a large pagerank. We can get a list of the pages with the largest pagerank by running
$ python3 pagerank.py --data=./lawfareblog.csv.gz
INFO:gensim.models.utils_any2vec:loading projection weights from /home/oliver/gensim-data/glove-twitter-25/glove-twitter-25.gz
INFO:gensim.models.utils_any2vec:loaded (1193514, 25) matrix from /home/oliver/gensim-data/glove-twitter-25/glove-twitter-25.gz
INFO:root:rank=0 pagerank=0.2874051630496979 url=www.lawfareblog.com/lawfare-job-board
INFO:root:rank=1 pagerank=0.2874051630496979 url=www.lawfareblog.com/masthead
INFO:root:rank=2 pagerank=0.2874051630496979 url=www.lawfareblog.com/litigation-documents-related-appointment-matthew-whitaker-acting-attorney-general
INFO:root:rank=3 pagerank=0.2874051630496979 url=www.lawfareblog.com/documents-related-mueller-investigation
INFO:root:rank=4 pagerank=0.2874051630496979 url=www.lawfareblog.com/topics
INFO:root:rank=5 pagerank=0.2874051630496979 url=www.lawfareblog.com/about-lawfare-brief-history-term-and-site
INFO:root:rank=6 pagerank=0.2874051630496979 url=www.lawfareblog.com/snowden-revelations
INFO:root:rank=7 pagerank=0.2874051630496979 url=www.lawfareblog.com/support-lawfare
INFO:root:rank=8 pagerank=0.2874051630496979 url=www.lawfareblog.com/upcoming-events
INFO:root:rank=9 pagerank=0.2874051630496979 url=www.lawfareblog.com/our-comments-policy
These pages are not very interesting, however, because they are not articles. How can we find the most important articles? The answer is to modify the P matrix by removing all links to non-article pages.
How do we know if a link is a non-article page?
This is a hard question to answer with 100% accuracy,
but there are many methods that get us most of the way there.
An easy method is to remove all pages that have "too many" links.
The --filter_ratio
argument removes all pages that have more links than the specified fraction.
(If that short explanation doesn't make sense, that's okay.
You can either read the source for details or just accept that the option does magic.)
Using this option, we can estimate the most important articles on the domain with the following command:
$ python3 pagerank.py --data=./lawfareblog.csv.gz --filter_ratio=0.2
INFO:gensim.models.utils_any2vec:loading projection weights from /home/oliver/gensim-data/glove-twitter-25/glove-twitter-25.gz
INFO:gensim.models.utils_any2vec:loaded (1193514, 25) matrix from /home/oliver/gensim-data/glove-twitter-25/glove-twitter-25.gz
INFO:root:rank=0 pagerank=0.3469614088535309 url=www.lawfareblog.com/trump-asks-supreme-court-stay-congressional-subpeona-tax-returns
INFO:root:rank=1 pagerank=0.29521217942237854 url=www.lawfareblog.com/livestream-nov-21-impeachment-hearings-0
INFO:root:rank=2 pagerank=0.29039672017097473 url=www.lawfareblog.com/opening-statement-david-holmes
INFO:root:rank=3 pagerank=0.15178653597831726 url=www.lawfareblog.com/lawfare-podcast-ben-nimmo-whack-mole-game-disinformation
INFO:root:rank=4 pagerank=0.15098513662815094 url=www.lawfareblog.com/todays-headlines-and-commentary-1963
INFO:root:rank=5 pagerank=0.15098513662815094 url=www.lawfareblog.com/todays-headlines-and-commentary-1964
INFO:root:rank=6 pagerank=0.15071173012256622 url=www.lawfareblog.com/lawfare-podcast-week-was-impeachment
INFO:root:rank=7 pagerank=0.14956681430339813 url=www.lawfareblog.com/todays-headlines-and-commentary-1962
INFO:root:rank=8 pagerank=0.14366623759269714 url=www.lawfareblog.com/cyberlaw-podcast-mistrusting-google
INFO:root:rank=9 pagerank=0.14239734411239624 url=www.lawfareblog.com/lawfare-podcast-bonus-edition-gordon-sondland-vs-committee-no-bull
When Google calculates their P matrix for the web, they use a similar process to modify the P matrix in order to reduce spam results. The exact formula they use, however, is a jealously guarded secret.
In the case above, notice that we have removed the blog's most popular article (www.lawfareblog.com/snowden-revelations).
The blog editors believe that Snowden's revelations about NSA spying are so important that they link to the article on every single webpage in the domain,
and out "anti-spam" --filter-ratio
argument removed this article from the list.
In general, it is a challenging open problem to remove spam from pagerank results,
and all current solutions rely on careful human tuning and still have lots of false positives and false negatives.
Part 4:
Recall from the reading that the runtime of pagerank depends heavily on the eigengap of the
$ python3 pagerank.py --data=./lawfareblog.csv.gz --verbose
$ python3 pagerank.py --data=./lawfareblog.csv.gz --verbose --alpha=0.99999
$ python3 pagerank.py --data=./lawfareblog.csv.gz --verbose --filter_ratio=0.2
$ python3 pagerank.py --data=./lawfareblog.csv.gz --verbose --filter_ratio=0.2 --alpha=0.99999
The first and third commands took considerabely fewer iterations than the second and fourth. (My code takes 685 iterations for this call, and about 10 iterations for all the others.)
This begs the question does the second command (with the --alpha
option but without the --filter_ratio
) option not take a long time to run?
The answer is that the P graph for www.lawfareblog.com naturally has a large eigengap and so is fast to compute for all alpha values,
but the modified graph does not have a large eigengap and so requires a small alpha for fast convergence.
Changing the value of alpha also gives us very different pagerank rankings. For example,
$ python3 pagerank_solution.py --data=./lawfareblog.csv.gz --verbose --filter_ratio=0.2
INFO:root:rank=0 pagerank=0.3469614088535309 url=www.lawfareblog.com/trump-asks-supreme-court-stay-congressional-subpeona-tax-returns
INFO:root:rank=1 pagerank=0.29521217942237854 url=www.lawfareblog.com/livestream-nov-21-impeachment-hearings-0
INFO:root:rank=2 pagerank=0.29039672017097473 url=www.lawfareblog.com/opening-statement-david-holmes
INFO:root:rank=3 pagerank=0.15178653597831726 url=www.lawfareblog.com/lawfare-podcast-ben-nimmo-whack-mole-game-disinformation
INFO:root:rank=4 pagerank=0.15098513662815094 url=www.lawfareblog.com/todays-headlines-and-commentary-1963
INFO:root:rank=5 pagerank=0.15098513662815094 url=www.lawfareblog.com/todays-headlines-and-commentary-1964
INFO:root:rank=6 pagerank=0.15071173012256622 url=www.lawfareblog.com/lawfare-podcast-week-was-impeachment
INFO:root:rank=7 pagerank=0.14956681430339813 url=www.lawfareblog.com/todays-headlines-and-commentary-1962
INFO:root:rank=8 pagerank=0.14366623759269714 url=www.lawfareblog.com/cyberlaw-podcast-mistrusting-google
INFO:root:rank=9 pagerank=0.14239734411239624 url=www.lawfareblog.com/lawfare-podcast-bonus-edition-gordon-sondland-vs-committee-no-bull
$ python3 pagerank_solution.py --data=./lawfareblog.csv.gz --verbose --filter_ratio=0.2 --alpha=0.99999
INFO:root:rank=0 pagerank=0.7014895081520081 url=www.lawfareblog.com/covid-19-speech-and-surveillance-response
INFO:root:rank=1 pagerank=0.7014873027801514 url=www.lawfareblog.com/lawfare-live-covid-19-speech-and-surveillance
INFO:root:rank=2 pagerank=0.10551577061414719 url=www.lawfareblog.com/cost-using-zero-days
INFO:root:rank=3 pagerank=0.03175730258226395 url=www.lawfareblog.com/lawfare-podcast-former-congressman-brian-baird-and-daniel-schuman-how-congress-can-continue-function
INFO:root:rank=4 pagerank=0.02203981764614582 url=www.lawfareblog.com/events
INFO:root:rank=5 pagerank=0.016026929020881653 url=www.lawfareblog.com/water-wars-increased-us-focus-indo-pacific
INFO:root:rank=6 pagerank=0.01602618768811226 url=www.lawfareblog.com/water-wars-drill-maybe-drill
INFO:root:rank=7 pagerank=0.016023021191358566 url=www.lawfareblog.com/water-wars-disjointed-operations-south-china-sea
INFO:root:rank=8 pagerank=0.016020189970731735 url=www.lawfareblog.com/water-wars-song-oil-and-fire
INFO:root:rank=9 pagerank=0.01602018065750599 url=www.lawfareblog.com/water-wars-sinking-feeling-philippine-china-relations
Which of these rankings is better is entirely subjective, and the only way to know if you have the "best" alpha for your application is to try several variations and see what is best. If large alphas are good for your application, you can see that there is a tradeoff between quality answers and algorithmic runtime. We'll be exploring this tradeoff more formally in class when we incorporate statistics into our discussion.
The most interesting applications of pagerank involve the personalization vector.
Part 1:
Here I implemented the WebGraph.make_personalization_vector
function.
This function enables the --personalization_vector_query
command line argument,
which provides an alternative method for searching by doing the filtering on the personalization vector.
Observe the following commands.
$ python3 pagerank.py --data=./lawfareblog.csv.gz --filter_ratio=0.2 --personalization_vector_query='corona'
INFO:root:rank=0 pagerank=0.6321446895599365 url=www.lawfareblog.com/covid-19-speech-and-surveillance-response
INFO:root:rank=1 pagerank=0.6321207880973816 url=www.lawfareblog.com/lawfare-live-covid-19-speech-and-surveillance
INFO:root:rank=2 pagerank=0.1512296497821808 url=www.lawfareblog.com/chinatalk-how-party-takes-its-propaganda-global
INFO:root:rank=3 pagerank=0.11724846810102463 url=www.lawfareblog.com/rational-security-my-corona-edition
INFO:root:rank=4 pagerank=0.11724846810102463 url=www.lawfareblog.com/brexit-not-immune-coronavirus
INFO:root:rank=5 pagerank=0.08947845548391342 url=www.lawfareblog.com/trump-cant-reopen-country-over-state-objections
INFO:root:rank=6 pagerank=0.08648215234279633 url=www.lawfareblog.com/prosecuting-purposeful-coronavirus-exposure-terrorism
INFO:root:rank=7 pagerank=0.08648215234279633 url=www.lawfareblog.com/britains-coronavirus-response
INFO:root:rank=8 pagerank=0.07284898310899734 url=www.lawfareblog.com/lawfare-podcast-united-nations-and-coronavirus-crisis
INFO:root:rank=9 pagerank=0.06962569057941437 url=www.lawfareblog.com/house-oversight-committee-holds-day-two-hearing-government-coronavirus-response
Notice that these results are significantly different than when using the --search_query
option:
$ python3 pagerank.py --data=./lawfareblog.csv.gz --filter_ratio=0.2 --search_query='corona'
INFO:root:rank=0 pagerank=0.008131963200867176 url=www.lawfareblog.com/house-oversight-committee-holds-day-two-hearing-government-coronavirus-response
INFO:root:rank=1 pagerank=0.0077908276580274105 url=www.lawfareblog.com/lawfare-podcast-united-nations-and-coronavirus-crisis
INFO:root:rank=2 pagerank=0.005226221401244402 url=www.lawfareblog.com/livestream-house-oversight-committee-holds-hearing-government-coronavirus-response
INFO:root:rank=3 pagerank=0.003958384972065687 url=www.lawfareblog.com/britains-coronavirus-response
INFO:root:rank=4 pagerank=0.003811446251347661 url=www.lawfareblog.com/prosecuting-purposeful-coronavirus-exposure-terrorism
INFO:root:rank=5 pagerank=0.0033972852397710085 url=www.lawfareblog.com/paper-hearing-experts-debate-digital-contact-tracing-and-coronavirus-privacy-concerns
INFO:root:rank=6 pagerank=0.0033633282873779535 url=www.lawfareblog.com/cyberlaw-podcast-how-israel-fighting-coronavirus
INFO:root:rank=7 pagerank=0.0033556802663952112 url=www.lawfareblog.com/israeli-emergency-regulations-location-tracking-coronavirus-carriers
INFO:root:rank=8 pagerank=0.0032159897964447737 url=www.lawfareblog.com/congress-needs-coronavirus-failsafe-its-too-late
INFO:root:rank=9 pagerank=0.003103649476543069 url=www.lawfareblog.com/why-congress-conducting-business-usual-face-coronavirus
Which results are better?
That depends on what you mean by "better."
With the --personalization_vector_query
option,
a webpage is important only if other coronavirus webpages also think it's important;
with the --search_query
option,
a webpage is important if any other webpage thinks it's important.
You'll notice that in the later example, many of the webpages are about Congressional proceedings related to the coronavirus.
From a strictly coronavirus perspective, these are not very important webpages.
But in the broader context of national security, these are very important webpages.
Google engineers spend TONs of time fine-tuning their pagerank personalization vectors to remove spam webpages. Exactly how they do this is another one of their secrets that they don't publicly talk about.
Part 2:
Another use of the --personalization_vector_query
option is that we can find out what webpages are related to the coronavirus but don't directly mention the coronavirus.
This can be used to map out what types of topics are similar to the coronavirus.
For example, the following query ranks all webpages by their corona
importance,
but removes webpages mentioning corona
from the results.
$ python3 pagerank.py --data=./lawfareblog.csv.gz --filter_ratio=0.2 --personalization_vector_query='corona' --search_query='-corona'
INFO:root:rank=0 pagerank=0.6321446895599365 url=www.lawfareblog.com/covid-19-speech-and-surveillance-response
INFO:root:rank=1 pagerank=0.6321207880973816 url=www.lawfareblog.com/lawfare-live-covid-19-speech-and-surveillance
INFO:root:rank=2 pagerank=0.1512296497821808 url=www.lawfareblog.com/chinatalk-how-party-takes-its-propaganda-global
INFO:root:rank=3 pagerank=0.11724846810102463 url=www.lawfareblog.com/rational-security-my-corona-edition
INFO:root:rank=4 pagerank=0.11724846810102463 url=www.lawfareblog.com/brexit-not-immune-coronavirus
INFO:root:rank=5 pagerank=0.08947845548391342 url=www.lawfareblog.com/trump-cant-reopen-country-over-state-objections
INFO:root:rank=6 pagerank=0.08648215234279633 url=www.lawfareblog.com/prosecuting-purposeful-coronavirus-exposure-terrorism
INFO:root:rank=7 pagerank=0.08648215234279633 url=www.lawfareblog.com/britains-coronavirus-response
INFO:root:rank=8 pagerank=0.07284898310899734 url=www.lawfareblog.com/lawfare-podcast-united-nations-and-coronavirus-crisis
INFO:root:rank=9 pagerank=0.06962569057941437 url=www.lawfareblog.com/house-oversight-committee-holds-day-two-hearing-government-coronavirus-response
You can see that there are many urls about concepts that are obviously related like "covid", "clinical trials", and "quarantine", but this algorithm also finds articles about Chinese propaganda and Trump's policy decisions. Both of these articles are highly relevant to coronavirus discussions, but a simple keyword search for corona or related terms would not find these articles.
Part 3:
let's experiment with a national security topic other than the coronavirus.
For example, we'll find out what articles are important to the cybersecurity
topic but do not contain the word cybersecurity
.
$ python3 pagerank.py --data=./lawfareblog.csv.gz --filter_ratio=0.2 --personalization_vector_query='cybersecurity' --search_query='-cybersecurity'
INFO:root:rank=0 pagerank=0.43661659955978394 url=www.lawfareblog.com/doj-charges-two-former-twitter-employees-spying-saudis
INFO:root:rank=1 pagerank=0.3557959198951721 url=www.lawfareblog.com/cyberlaw-podcast-plumbing-depths-artificial-stupidity
INFO:root:rank=2 pagerank=0.3557959198951721 url=www.lawfareblog.com/cyberlaw-podcast-sandworm-and-grus-global-intifada
INFO:root:rank=3 pagerank=0.3546285033226013 url=www.lawfareblog.com/summary-whatsapp-suit-against-nso-group
INFO:root:rank=4 pagerank=0.3546285033226013 url=www.lawfareblog.com/lessons-so-far-whatsapp-v-nso
INFO:root:rank=5 pagerank=0.26024776697158813 url=www.lawfareblog.com/trump-asks-supreme-court-stay-congressional-subpeona-tax-returns
INFO:root:rank=6 pagerank=0.2235962450504303 url=www.lawfareblog.com/whatsapp-nso-group-lawsuit-and-limits-lawful-hacking
INFO:root:rank=7 pagerank=0.1855047196149826 url=www.lawfareblog.com/livestream-nov-21-impeachment-hearings-0
INFO:root:rank=8 pagerank=0.1855047196149826 url=www.lawfareblog.com/opening-statement-david-holmes
INFO:root:rank=9 pagerank=0.09454161673784256 url=www.lawfareblog.com/senate-examines-threats-homeland