
[Proposal] Use organization to get a cleaner affiliation name #8

Closed
zhijian-liu opened this issue Jul 31, 2024 · 13 comments

Comments

@zhijian-liu

zhijian-liu commented Jul 31, 2024

Thanks so much for your excellent work! I wanted to suggest a potential way to obtain a cleaner affiliation name. This approach involves using the organization field returned by scholarly.search_author_id, which represents the organization ID of the author. This ID is typically set only if the author has listed the organization in their affiliation and has a verified email under that organization. As a result, this method should yield a cleaner author affiliation with higher precision, though possibly lower recall.
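
A minimal sketch of reading that field, assuming the scholarly package (the scholar ID below is just the example used later in this thread):

from scholarly import scholarly

author = scholarly.search_author_id("mwzYYPgAAAAJ")
oid = author.get("organization")  # absent unless a verified, listed affiliation exists
if oid is not None:
    print("Organization ID:", oid)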

Once you have this ID, you can scrape the organization name from Google Scholar. Here’s a sample code snippet to do this:

from typing import Optional
from urllib.parse import urlencode

import requests
from bs4 import BeautifulSoup


def get_organization_name(oid: str, scraper_api_key: Optional[str]) -> str:
    url = f"https://scholar.google.com/citations?view_op=view_org&org={oid}&hl=en"

    if scraper_api_key is not None:
        params = urlencode({"api_key": scraper_api_key, "url": url})
        response = requests.get("http://api.scraperapi.com/", params=params)
    else:
        response = requests.get(url)

    if response.status_code != 200:
        raise Exception(f"Failed to fetch {url}: {response.text}")

    soup = BeautifulSoup(response.text, "html.parser")
    tag = soup.find("h2", {"class": "gsc_authors_header"})
    if not tag:
        raise Exception(f"Failed to parse {url}")
    return tag.text.replace("Learn more", "").strip()

In this example, I use ScraperAPI to bypass the anti-bot checks, but there might be other options available as well.
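
For example, chaining it with scholarly (a rough sketch; passing scraper_api_key=None means no proxying, which may get blocked):

from scholarly import scholarly

author = scholarly.search_author_id("mwzYYPgAAAAJ")
if "organization" in author:
    print(get_organization_name(str(author["organization"]), scraper_api_key=None))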

@ChenLiu-1996
Owner

  1. Regarding the organization field: Good idea. Maybe we can augment the string parsing with the organization field (a rough sketch follows below this list).
  2. Regarding ScraperAPI: I was aware of it but moved away from it because it is a paid service after the trial period.
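
A rough sketch of such a fallback, reusing get_organization_name from the snippet above (here the string-parsing path is represented by simply returning the raw affiliation field, which would still need the existing cleaning logic):

from typing import Optional

def get_affiliation(author: dict, scraper_api_key: Optional[str] = None) -> str:
    # Prefer the verified organization ID when present; otherwise fall back to
    # the free-text affiliation string.
    if "organization" in author:
        return get_organization_name(str(author["organization"]), scraper_api_key)
    return author.get("affiliation", "")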

@zhijian-liu
Author

Here is an end-to-end implementation of the idea mentioned above (without using multiprocessing):

import os
from typing import Optional, Any
from collections import defaultdict

import folium
from geopy.geocoders import Nominatim
from scholarly import ProxyGenerator, scholarly
from tqdm import tqdm
from urllib.parse import urlencode
import pickle
import requests
from bs4 import BeautifulSoup
from argparse import ArgumentParser


def save(data: Any, fpath: str) -> None:
    os.makedirs(os.path.dirname(fpath), exist_ok=True)
    with open(fpath, "wb") as fd:
        pickle.dump(data, fd)


def load(fpath: str) -> Any:
    with open(fpath, "rb") as fd:
        return pickle.load(fd)


def get_organization_name(oid: str, scraper_api_key: Optional[str]) -> str:
    url = f"https://scholar.google.com/citations?view_op=view_org&org={oid}&hl=en"

    if scraper_api_key is not None:
        params = urlencode({"api_key": scraper_api_key, "url": url})
        response = requests.get("http://api.scraperapi.com/", params=params)
    else:
        response = requests.get(url)

    if response.status_code != 200:
        raise Exception(f"Failed to fetch {url}: {response.text}")

    soup = BeautifulSoup(response.text, "html.parser")
    tag = soup.find("h2", {"class": "gsc_authors_header"})
    if not tag:
        raise Exception(f"Failed to parse {url}")
    return tag.text.replace("Learn more", "").strip()


def main() -> None:
    parser = ArgumentParser()
    parser.add_argument("--scholar-id", type=str, default="mwzYYPgAAAAJ")
    parser.add_argument("--scraper-api-key", type=str, default=None)
    parser.add_argument("--cache-dir", type=str, default="cache")
    args = parser.parse_args()

    # Set up proxy
    if args.scraper_api_key is not None:
        pg = ProxyGenerator()
        pg.ScraperAPI(args.scraper_api_key)
        scholarly.use_proxy(pg)

    # Fetch author's publications
    cache_path = os.path.join(args.cache_dir, "publications.pkl")
    if not os.path.exists(cache_path):
        author = scholarly.search_author_id(args.scholar_id)
        author = scholarly.fill(author, sections=["publications"])
        save(author["publications"], cache_path)
    publications = load(cache_path)

    # Fetch citations for each publication
    citations = defaultdict(list)
    for publication in tqdm(publications[::-1]):
        pid = publication["author_pub_id"].split(":")[1]
        cache_path = os.path.join(args.cache_dir, "citations", f"{pid}.pkl")
        if not os.path.exists(cache_path):
            for cid in publication.get("cites_id", []):
                citations[pid].extend(scholarly.search_citedby(cid))
            save(citations[pid], cache_path)
        citations[pid] = load(cache_path)

    # De-duplicate author IDs
    aids = []
    for pid in citations:
        for citation in citations[pid]:
            aids.extend(citation["author_id"])
    aids = sorted([aid for aid in set(aids) if aid])
    print("Total number of unique authors:", len(aids))

    # Fetch author profiles
    authors = {}
    for aid in tqdm(aids):
        cache_path = os.path.join(args.cache_dir, "authors", f"{aid}.pkl")
        if not os.path.exists(cache_path):
            authors[aid] = scholarly.search_author_id(aid)
            save(authors[aid], cache_path)
        authors[aid] = load(cache_path)

    # De-duplicate organization IDs
    oids = []
    for aid in authors:
        if "organization" in authors[aid]:
            oids.append(authors[aid]["organization"])
    oids = sorted(set(oids))
    print("Total number of unique organizations:", len(oids))

    # Set up geolocator
    locator = Nominatim(user_agent="citation_mapper")

    # Fetch organization details
    organizations = defaultdict(dict)
    for oid in tqdm(oids):
        cache_path = os.path.join(args.cache_dir, "organizations", f"{oid}.pkl")
        if not os.path.exists(cache_path):
            organizations[oid]["name"] = name = get_organization_name(
                oid, scraper_api_key=args.scraper_api_key
            )
            location = locator.geocode(name)
            if location is not None:
                organizations[oid]["coordinate"] = coordinate = (
                    location.latitude,
                    location.longitude,
                )
                organizations[oid]["address"] = locator.reverse(
                    coordinate, language="en"
                ).raw["address"]
            save(organizations[oid], cache_path)
        organizations[oid] = load(cache_path)

    # Generate citation map
    cmap = folium.Map(location=[20, 0], zoom_start=3)
    for organization in organizations.values():
        name = organization["name"]
        if "coordinate" in organization:
            folium.Marker(organization["coordinate"], popup=name).add_to(cmap)
    cmap.save("citations.html")


if __name__ == "__main__":
    main()
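
Assuming the script above is saved as, say, citation_map.py (the file name is arbitrary), it can be run with:

python citation_map.py --scholar-id mwzYYPgAAAAJ --scraper-api-key YOUR_KEY --cache-dir cache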

@ChenLiu-1996
Owner

Thanks for the great input.

I haven't looked thoroughly into this yet, but overall it looks like a valid (and possibly better) implementation. I will try to investigate and incorporate the changes at some point soon-ish.

If this turns out to be a better solution, I will add you as a contributor to the repository and give you credit for it.

@ChenLiu-1996
Owner

ChenLiu-1996 commented Aug 2, 2024

I was reading your code. Just a minor (and perhaps tangential) point:

Based on my personal experience, scholarly.search_citedby is a problematic function that caused me a lot of Google Scholar blocking and slow execution. I was able to run much faster and avoid much of the blacklisting by moving away from it.
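
For anyone who does keep scholarly.search_citedby, one mitigation (not what this repository does) is to throttle the calls; a rough sketch:

import time
from scholarly import scholarly

def cited_by_throttled(cites_id: str, delay: float = 5.0) -> list:
    # Iterate slowly over the citing publications to reduce the chance of
    # being rate-limited or blacklisted by Google Scholar.
    results = []
    for entry in scholarly.search_citedby(cites_id):
        results.append(entry)
        time.sleep(delay)  # crude throttle; tune or add jitter as needed
    return results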

@ChenLiu-1996
Owner

ChenLiu-1996 commented Aug 2, 2024

I have tried your method. I realized I had previously investigated the organization field, which is an organization ID unique to Google Scholar.

I appreciate that you wrote the function to extract the organization string from the organization ID; however, it does not give us a reliable and consistent name for the same organization.

Another idea is to use the verified email domain to infer the affiliation (a rough sketch follows after the pros and cons below).

Pros

  1. [Important] The most reliable method to get consistent affiliation names.
  2. Easily accessible and does not require any additional querying or web scraping beyond the existing code.

Cons

  1. Requires a mapping from organization email domains to organization names.
  2. [Important] May underestimate the citing institutions by a lot, since many authors do not verify via email.
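
A rough sketch of what the email-domain approach might look like (the email_domain field name follows what scholarly exposes on author records, and the domain-to-name mapping below is purely illustrative and hand-maintained):

from typing import Optional

DOMAIN_TO_ORGANIZATION = {
    "nvidia.com": "NVIDIA Corporation",
    "mit.edu": "Massachusetts Institute of Technology",
}

def affiliation_from_email(author: dict) -> Optional[str]:
    # e.g. "@mit.edu" -> "mit.edu"; returns None if the domain is unknown or unset
    domain = author.get("email_domain", "").lstrip("@")
    return DOMAIN_TO_ORGANIZATION.get(domain)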

@zhijian-liu
Author

zhijian-liu commented Aug 2, 2024

Thank you for your response. Based on my understanding, the organization field in Google Scholar is set very strictly: it is populated only if you both have an email address verified with the organization and have listed that organization in your affiliation. For example, I have listed two organizations in my affiliation, but my email address is verified with only one of them, so the organization ID is set to that one. On the website, this is visualized by an underline under the affiliation with a verified email address.

Based on the organization ID, you can navigate to the organization page. For instance, on the NVIDIA page, you can see people list their affiliations in various ways (e.g., NVIDIA, NVIDIA Research, nVidia), but their organization IDs are all the same. From this page, we can obtain the normalized organization name, "NVIDIA Corporation" (found at the top of the page before "Learn more"). This is essentially what my function does.

@zhijian-liu
Author

I believe this is stricter than your proposal -- someone might have a verified email address at the organization but not list it in the affiliation field, although this is likely rare. However, I agree that not all authors have this field set. This approach will provide a list with very high precision but potentially low recall.

@ChenLiu-1996
Owner

Thanks a lot for the detailed explanation. In this case I will probably take this approach. I will investigate and incorporate the changes in the next few days.

Do you want to be added to the contributor list? Otherwise, what would be the best way to acknowledge your contribution, besides mentioning you in the comments and the README?

@zhijian-liu
Author

zhijian-liu commented Aug 2, 2024

That sounds great! Please let me know if you encounter any issues or come up with new ideas; I'd love to discuss them.

Thank you for generously offering to add me to the contributor list, but this is primarily your work -- a small acknowledgment would be more than enough.

@ChenLiu-1996
Owner

ChenLiu-1996 commented Aug 2, 2024

Thank you for your valuable idea and implementation. I have integrated your proposal into my local codebase (haven't pushed yet) and it's working fine. I am still debating whether to make this update.

Pros:

  1. Very high precision on affiliation, and circumvents the need for affiliation name cleaning.

Cons:

  1. Recall is a concern: many true affiliations are overlooked (for example, Meta is not recognized as an organization), and multiple affiliations for the same author are not supported.

One other possibility is to maintain two separate versions, one more conservative than the other, and let users choose between them.

@zhijian-liu
Author

Thank you for the update. This seems to be a trade-off between precision and recall. Personally, I prioritize precision, but I agree that recall is also important at times. Could you provide some statistics on the captured affiliations using both approaches? How significant is the difference between them? Additionally, what percentage of the missed affiliations are actually valid? Thank you!

@ChenLiu-1996
Owner

I believe I will make it an option for the users, as an input argument. They can choose between the conservative approach and the inclusive approach.

I can try to get some statistics later and report it.
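
A rough sketch of how that option might be exposed (the flag name and choices are hypothetical):

from argparse import ArgumentParser

parser = ArgumentParser()
parser.add_argument(
    "--affiliation-source",
    choices=["organization-id", "affiliation-string"],
    default="affiliation-string",
    help="conservative (verified organization ID) vs. inclusive (parsed affiliation string)",
)
args = parser.parse_args()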

@ChenLiu-1996
Owner

ChenLiu-1996 commented Aug 2, 2024

Incorporated in Version 4.0.
