[Proposal] Use `organization` to get a cleaner affiliation name #8

Thanks so much for your excellent work! I wanted to suggest a potential way to obtain a cleaner affiliation name. This approach involves using the `organization` field returned by `scholarly.search_author_id`, which represents the organization ID of the author. This ID is typically set only if the author has listed the organization in their affiliation and has a verified email under that organization. As a result, this method should yield a cleaner author affiliation with higher precision, though possibly lower recall.

Once you have this ID, you can scrape the organization name from Google Scholar. Here's a sample code snippet to do this (in this example, I use ScraperAPI to bypass the anti-bot checks, but there might be other options available as well):
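A sketch of that snippet, matching the `get_organization_name` function from the full implementation below:

```python
# Fetch the Google Scholar organization page and read the normalized name
# from its header, optionally routing the request through ScraperAPI to
# bypass anti-bot checks.
from typing import Optional
from urllib.parse import urlencode

import requests
from bs4 import BeautifulSoup


def get_organization_name(oid: str, scraper_api_key: Optional[str] = None) -> str:
    url = f"https://scholar.google.com/citations?view_op=view_org&org={oid}&hl=en"
    if scraper_api_key is not None:
        params = urlencode({"api_key": scraper_api_key, "url": url})
        response = requests.get("http://api.scraperapi.com/", params=params)
    else:
        response = requests.get(url)
    if response.status_code != 200:
        raise Exception(f"Failed to fetch {url}: {response.text}")
    soup = BeautifulSoup(response.text, "html.parser")
    # The organization name sits in the page header, followed by a
    # "Learn more" link that we strip off.
    tag = soup.find("h2", {"class": "gsc_authors_header"})
    if not tag:
        raise Exception(f"Failed to parse {url}")
    return tag.text.replace("Learn more", "").strip()
```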
---

Here is an end-to-end implementation of the idea mentioned (where I'm not using multiprocessing):

```python
import os
import pickle
from argparse import ArgumentParser
from collections import defaultdict
from typing import Any, Optional
from urllib.parse import urlencode

import folium
import requests
from bs4 import BeautifulSoup
from geopy.geocoders import Nominatim
from scholarly import ProxyGenerator, scholarly
from tqdm import tqdm


def save(data: Any, fpath: str) -> None:
    os.makedirs(os.path.dirname(fpath), exist_ok=True)
    with open(fpath, "wb") as fd:
        pickle.dump(data, fd)


def load(fpath: str) -> Any:
    with open(fpath, "rb") as fd:
        return pickle.load(fd)


def get_organization_name(oid: str, scraper_api_key: Optional[str]) -> str:
    url = f"https://scholar.google.com/citations?view_op=view_org&org={oid}&hl=en"
    if scraper_api_key is not None:
        params = urlencode({"api_key": scraper_api_key, "url": url})
        response = requests.get("http://api.scraperapi.com/", params=params)
    else:
        response = requests.get(url)
    if response.status_code != 200:
        raise Exception(f"Failed to fetch {url}: {response.text}")
    soup = BeautifulSoup(response.text, "html.parser")
    tag = soup.find("h2", {"class": "gsc_authors_header"})
    if not tag:
        raise Exception(f"Failed to parse {url}")
    return tag.text.replace("Learn more", "").strip()


def main() -> None:
    parser = ArgumentParser()
    parser.add_argument("--scholar-id", type=str, default="mwzYYPgAAAAJ")
    parser.add_argument("--scraper-api-key", type=str, default=None)
    parser.add_argument("--cache-dir", type=str, default="cache")
    args = parser.parse_args()

    # Set up proxy
    if args.scraper_api_key is not None:
        pg = ProxyGenerator()
        pg.ScraperAPI(args.scraper_api_key)
        scholarly.use_proxy(pg)

    # Fetch the author's publications
    cache_path = os.path.join(args.cache_dir, "publications.pkl")
    if not os.path.exists(cache_path):
        author = scholarly.search_author_id(args.scholar_id)
        author = scholarly.fill(author, sections=["publications"])
        save(author["publications"], cache_path)
    publications = load(cache_path)

    # Fetch citations for each publication
    citations = defaultdict(list)
    for publication in tqdm(publications[::-1]):
        pid = publication["author_pub_id"].split(":")[1]
        cache_path = os.path.join(args.cache_dir, "citations", f"{pid}.pkl")
        if not os.path.exists(cache_path):
            for cid in publication.get("cites_id", []):
                citations[pid].extend(scholarly.search_citedby(cid))
            save(citations[pid], cache_path)
        citations[pid] = load(cache_path)

    # De-duplicate author IDs
    aids = []
    for pid in citations:
        for citation in citations[pid]:
            aids.extend(citation["author_id"])
    aids = sorted([aid for aid in set(aids) if aid])
    print("Total number of unique authors:", len(aids))

    # Fetch author profiles
    authors = {}
    for aid in tqdm(aids):
        cache_path = os.path.join(args.cache_dir, "authors", f"{aid}.pkl")
        if not os.path.exists(cache_path):
            authors[aid] = scholarly.search_author_id(aid)
            save(authors[aid], cache_path)
        authors[aid] = load(cache_path)

    # De-duplicate organization IDs
    oids = []
    for aid in authors:
        if "organization" in authors[aid]:
            oids.append(authors[aid]["organization"])
    oids = sorted(set(oids))
    print("Total number of unique organizations:", len(oids))

    # Set up geolocator
    locator = Nominatim(user_agent="citation_mapper")

    # Fetch organization details
    organizations = defaultdict(dict)
    for oid in tqdm(oids):
        cache_path = os.path.join(args.cache_dir, "organizations", f"{oid}.pkl")
        if not os.path.exists(cache_path):
            organizations[oid]["name"] = name = get_organization_name(
                oid, scraper_api_key=args.scraper_api_key
            )
            location = locator.geocode(name)
            if location is not None:
                organizations[oid]["coordinate"] = coordinate = (
                    location.latitude,
                    location.longitude,
                )
                organizations[oid]["address"] = locator.reverse(
                    coordinate, language="en"
                ).raw["address"]
            save(organizations[oid], cache_path)
        organizations[oid] = load(cache_path)

    # Generate the citation map
    cmap = folium.Map(location=[20, 0], zoom_start=3)
    for organization in organizations.values():
        name = organization["name"]
        if "coordinate" in organization:
            folium.Marker(organization["coordinate"], popup=name).add_to(cmap)
    cmap.save("citations.html")


if __name__ == "__main__":
    main()
```
---

Thanks for the great input. I haven't looked thoroughly into this, but overall it looks like a valid (and possibly better) implementation. I will try to investigate and incorporate the changes at some point soon-ish. If this turns out to be a better solution, I will add you as a contributor to the repository and give you credit for it.
---

I was reading your code. Just a minor (perhaps quite irrelevant) thing: based on my personal experience, …
---

I have tried your method. I realized I had previously investigated the `organization` field as well. I appreciate that you wrote the function to extract the organization string from the organization ID; however, it does not give us a reliable and consistent name for the same organization. Another idea is to use the verified email domain to infer the affiliation, as sketched below.
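A rough sketch of that email-domain idea, assuming author records expose the verified domain in an `email_domain` field (the field name is an assumption about scholarly's author model):

```python
from typing import Optional


def affiliation_key_from_email(author: dict) -> Optional[str]:
    """Derive a crude organization key from a verified email domain."""
    # Assumption: `email_domain` looks like "@cs.stanford.edu" on scholarly
    # author records; authors without a verified email yield None.
    domain = (author.get("email_domain") or "").lstrip("@")
    if not domain:
        return None
    # Collapse subdomains to a registrable key,
    # e.g. "cs.stanford.edu" -> "stanford.edu".
    parts = domain.split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else domain
```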
Cons:
---

Thank you for your response. Based on my understanding, the `organization` field in Google Scholar is set very strictly. This field is only populated if you both have an email address verified with the organization and have listed it in your affiliation. For example, I have listed two organizations in my affiliation, but my email address is verified with only one of them; therefore, the organization ID is set to that one. On the website, this is visualized by an underline for the affiliation with a verified email address. Based on the organization ID, you can navigate to the organization page. For instance, on the NVIDIA page, you can see people list their affiliations in various ways (e.g., NVIDIA, NVIDIA Research, nVidia), but their organization IDs are all the same. From this page, we can obtain the normalized organization name, "NVIDIA Corporation" (found at the top of the page before "Learn more"). This is essentially what my function does.
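As a small illustration, a sketch that inspects this field on a single profile (reusing the default Scholar ID from the script above):

```python
from scholarly import scholarly

# Look up one profile and inspect its strict `organization` field.
author = scholarly.search_author_id("mwzYYPgAAAAJ")  # ID reused from the script above
oid = author.get("organization")
if oid:
    # The ID is shared by everyone verified with the organization, no matter
    # how they spell their affiliation (NVIDIA, NVIDIA Research, nVidia, ...).
    print("Organization ID:", oid)
else:
    print("No verified organization set on this profile")
```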
I believe this is stricter than your proposal -- someone might have a verified email address at the organization but not list it in the affiliation field, although this is likely rare. However, I agree that not all authors have this field set. This approach will provide a list with very high precision but potentially low recall.
---

Thanks a lot for the detailed explanation. In that case, I will probably take this approach. I will investigate and incorporate the changes in the next few days. Would you like to be added to the contributor list? Otherwise, what would be the best way to acknowledge your credit, besides mentioning you in the comments and the README?
---

That sounds great! Please let me know if you encounter any issues or come up with new ideas; I'd love to discuss them. Thank you for generously offering to add me to the contributor list, but this is primarily your work -- a small acknowledgment would be more than enough.
---

Thank you for your valuable idea and implementation. I have integrated your proposal into my local codebase (haven't pushed yet), and it's working fine. I am still debating whether to make this update.

Pros:

Cons:

One other possibility is to maintain two separate versions, one more conservative than the other, and let users choose between them.
---

Thank you for the update. This seems to be a trade-off between precision and recall. Personally, I prioritize precision, but I agree that recall is also important at times. Could you provide some statistics on the captured affiliations using both approaches? How significant is the difference between them? Additionally, what percentage of the missed affiliations are actually valid? Thank you!
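For reference, one way such statistics could be computed from the profiles the script already caches (a sketch; assumes `authors` is the ID-to-profile dict built in the implementation above):

```python
# Rough coverage statistics: how many citing authors have the strict
# `organization` ID set vs. any free-text affiliation string at all.
def affiliation_stats(authors: dict) -> None:
    n_total = max(len(authors), 1)  # guard against an empty dict
    n_org = sum(1 for a in authors.values() if a.get("organization"))
    n_affil = sum(1 for a in authors.values() if a.get("affiliation"))
    print(f"organization ID set:  {n_org}/{len(authors)} ({100 * n_org / n_total:.1f}%)")
    print(f"affiliation text set: {n_affil}/{len(authors)} ({100 * n_affil / n_total:.1f}%)")
```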
---

I believe I will make it an option for users, as an input argument, so they can choose between the conservative approach and the inclusive approach. I can try to get some statistics later and report them.
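One way such an option could look (the flag name and choices here are an assumption for illustration, not the shipped interface):

```python
from argparse import ArgumentParser

# Hypothetical flag: let users pick between the conservative (organization ID)
# and inclusive (free-text affiliation) approaches.
parser = ArgumentParser()
parser.add_argument(
    "--affiliation-source",
    choices=["organization", "affiliation"],
    default="organization",
    help="'organization' uses the verified organization ID (high precision, "
         "lower recall); 'affiliation' uses the free-text affiliation string "
         "(high recall, lower precision).",
)
args = parser.parse_args()
```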
---

Incorporated in Version 4.0.