Switch Python /simple and /<project>/ APIs to using the JSON-based format (PEP-691) #7680

Open
jeffwidman opened this issue Aug 1, 2023 · 3 comments
Labels
F: package-metadata The metadata that Dependabot fetched for the package L: python:pip Python packages via pip L:python:pip-compile Python packages via pip-compile L: python:pipenv Python packages via pipenv L: python:poetry Python packages via poetry python Dependabot pull requests that update Python code T: tech-debt ⚙️

Comments

@jeffwidman
Member

jeffwidman commented Aug 1, 2023

Code improvement description

This is mostly a brain dump of research I did this evening into the current state of the PyPI APIs as part of #5723, and whether any changes can or should be made in Dependabot:

Warehouse / PyPI exposes several JSON-based APIs:

/simple

  • Provides an index of all packages.
  • Defaults to HTML
  • Can be requested in JSON format, as codified in PEP-691.
  • The HTML version is currently used by Dependabot for fetching available versions:
    # See https://www.python.org/dev/peps/pep-0503/ for details of the
    # Simple Repository API we use here.
    def available_versions
      @available_versions ||=
        index_urls.flat_map do |index_url|
          sanitized_url = index_url.gsub(%r{(?<=//).*(?=@)}, "redacted")
          index_response = registry_response_for_dependency(index_url)
          if index_response.status == 401 || index_response.status == 403
            registry_index_response = registry_index_response(index_url)
            if registry_index_response.status == 401 || registry_index_response.status == 403
              raise PrivateSourceAuthenticationFailure, sanitized_url
            end
          end
          version_links = []
          index_response.body.scan(%r{<a\s.*?>.*?</a>}m) do
            details = version_details_from_link(Regexp.last_match.to_s)
            version_links << details if details
          end
          version_links.compact
        rescue Excon::Error::Timeout, Excon::Error::Socket
          raise if MAIN_PYPI_INDEXES.include?(index_url)
          raise PrivateSourceTimedOut, sanitized_url
        rescue URI::InvalidURIError
          raise DependencyFileNotResolvable, "Invalid URL: #{sanitized_url}"
        end
    end

    # rubocop:disable Metrics/PerceivedComplexity
    def version_details_from_link(link)
      doc = Nokogiri::XML(link)
      filename = doc.at_css("a")&.content
      url = doc.at_css("a")&.attributes&.fetch("href", nil)&.value
      return unless filename&.match?(name_regex) || url&.match?(name_regex)
      version = get_version_from_filename(filename)
      return unless version_class.correct?(version)
      {
        version: version_class.new(version),
        python_requirement: build_python_requirement_from_link(link),
        yanked: link&.include?("data-yanked")
      }
    end
    # rubocop:enable Metrics/PerceivedComplexity

    def get_version_from_filename(filename)
      filename.
        gsub(/#{name_regex}-/i, "").
        split(/-|\.tar\.|\.zip|\.whl/).
        first
    end

    def build_python_requirement_from_link(link)
      req_string = Nokogiri::XML(link).
                   at_css("a")&.
                   attribute("data-requires-python")&.
                   content
      return unless req_string
      requirement_class.new(CGI.unescapeHTML(req_string))
    rescue Gem::Requirement::BadRequirementError
      nil
    end

    def index_urls
      @index_urls ||=
        IndexFinder.new(
          dependency_files: dependency_files,
          credentials: credentials
        ).index_urls
    end

    def registry_response_for_dependency(index_url)
      Dependabot::RegistryClient.get(
        url: index_url + normalised_name + "/",
        headers: { "Accept" => "text/html" }
      )
    end

    def registry_index_response(index_url)
      Dependabot::RegistryClient.get(
        url: index_url,
        headers: { "Accept" => "text/html" }
      )
    end
  • It would be nice to migrate to the JSON variant, since parsing static HTML with regexes is never fun.
  • Probably blocked by lack of support in private registry implementations:

/<package-name>/

  • Provides some version details about the package
  • Can be requested in JSON format, as codified in PEP-691.
  • Used by Dependabot here:
    def registry_response_for_dependency(index_url)
      Dependabot::RegistryClient.get(
        url: index_url + normalised_name + "/",
        headers: { "Accept" => "text/html" }
      )
    end
  • As with /simple, we probably can't migrate this to the JSON API until private registries support it.
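
For comparison with the HTML scraping above, here is a minimal sketch of reading the same per-file details from a PEP-691 JSON response. The helper name `file_details_from_json` and the sample body are made up for illustration and are not Dependabot code; only the `files`, `filename`, `requires-python`, and `yanked` keys come from PEP-691.

```ruby
require "json"

# Illustrative PEP-691 project-detail body (made-up package and files).
SAMPLE_BODY = <<~BODY
  {
    "meta": { "api-version": "1.0" },
    "name": "example-pkg",
    "files": [
      { "filename": "example_pkg-2.0.0-py3-none-any.whl",
        "url": "https://files.example.test/example_pkg-2.0.0-py3-none-any.whl",
        "hashes": {},
        "requires-python": ">=3.8" },
      { "filename": "example_pkg-1.0.0.tar.gz",
        "url": "https://files.example.test/example_pkg-1.0.0.tar.gz",
        "hashes": {},
        "yanked": "broken release" }
    ]
  }
BODY

# Hypothetical helper: returns one hash per file, loosely mirroring the
# shape produced by version_details_from_link (minus version parsing).
def file_details_from_json(body)
  JSON.parse(body).fetch("files", []).map do |file|
    {
      filename: file.fetch("filename"),
      # "requires-python" is optional, so this is nil when absent.
      python_requirement: file["requires-python"],
      # "yanked" is either a boolean or a non-empty reason string.
      yanked: file["yanked"] ? true : false
    }
  end
end

file_details_from_json(SAMPLE_BODY).each { |details| puts details.inspect }
```

The version would still have to be derived from `filename`, as `get_version_from_filename` does today, because the JSON format in PEP-691 lists files rather than versions.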

/pypi/<package-name>/json

Conclusion:

  1. The one JSON API that's non-standard is the one we use in Dependabot, because that's the only way to retrieve the metadata information.
  2. The other APIs for fetching available versions now have a PEP standardizing how they should expose their data via JSON in addition to static HTML.
  3. Today Dependabot fetches these via static HTML.
  4. We are probably blocked for the foreseeable future from migrating those APIs to JSON, because it would break private registries that only serve HTML.
  5. Running both JSON and HTML parsing paths doesn't make much sense, at least right now, because it adds complexity with no real benefit.
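
To make the cost in point 5 concrete, this is roughly what the dual-path approach would involve: send an `Accept` header listing both formats, then branch on the response's `Content-Type`. It is a sketch only. The media types are the ones defined in PEP-691, while the q-values, the constant name, and the `simple_response_format` helper are assumptions for illustration.

```ruby
# Accept header offering JSON first, then the HTML forms as fallbacks.
# The q-values here are illustrative, not a recommendation.
PEP691_ACCEPT = [
  "application/vnd.pypi.simple.v1+json",
  "application/vnd.pypi.simple.v1+html;q=0.2",
  "text/html;q=0.01"
].join(", ")

# Hypothetical helper: decides which parser a response needs based on its
# Content-Type header, ignoring parameters such as charset.
def simple_response_format(content_type)
  media_type = content_type.to_s.split(";").first.to_s.strip.downcase
  case media_type
  when "application/vnd.pypi.simple.v1+json" then :json
  when "application/vnd.pypi.simple.v1+html", "text/html" then :html
  else :unknown
  end
end

puts simple_response_format("text/html; charset=utf-8")
```

In the existing code this header would replace the `{ "Accept" => "text/html" }` passed to `Dependabot::RegistryClient.get`, and every caller would then need both a JSON and an HTML parser, which is the extra complexity point 5 refers to.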
@jeffwidman
Member Author

jeffwidman commented Aug 1, 2023

This will likely have to sit on the backlog for several years, until PEP-691 (adopted last year) sees more widespread adoption.

@jeffwidman
Member Author

Related ticket with more technical info:

@bewinsnw

bewinsnw commented May 9, 2024

FWIW, dependabot is already broken on simple-only indexes. We're currently trying to use dependabot with an internal artifactory (with a simple index, no json) and dependabot tries to use json. I think the request is coming from here:

https://github.com/dependabot/dependabot-core/blob/e5ec7e979/python/lib/dependabot/python/update_checker.rb#L270

But with config like this:

version: 2
registries:
  python-artifactory:
    type: python-index
    url: https://redacted-internal-server-name/artifactory/api/pypi/pypi/simple/
    replaces-base: true
updates:
  - package-ecosystem: "pip"
    # dependabot will not run without this
    # https://docs.github.com/en/code-security/dependabot/dependabot-version-updates/configuration-options-for-the-dependabot.yml-file#insecure-external-code-execution
    insecure-external-code-execution: allow
    directory: "/"
    registries:
      - python-artifactory
    schedule:
      interval: "daily"

We get errors like this in the log for Dependabot on internal runners:

  proxy | 2024/05/09 11***17***05 [021] GET https***//redacted-internal-server-name***443/artifactory/api/pypi/pypi/simple/smart-open/
2024/05/09 11***17***05 [021] 200 https***//redacted-internal-server-name***443/artifactory/api/pypi/pypi/simple/smart-open/
updater | 2024/05/09 11***17***05 INFO <job_826040134> Filtered out 2 pre-release versions
updater | 2024/05/09 11***17***05 INFO <job_826040134> Requirements to unlock own
2024/05/09 11***17***05 INFO <job_826040134> Requirements update strategy bump_versions
updater | 2024/05/09 11***17***05 INFO <job_826040134> Updating smart-open from 5.2.1 to 7.0.4
  proxy | 2024/05/09 11***17***06 [023] GET https***//redacted-internal-server-name***443/pypi/smart-open/json
  proxy | 2024/05/09 11***17***06 [023] 404 https***//redacted-internal-server-name***443/pypi/smart-open/json

Note those last two URLs. Dependabot is not respecting the config where we've told it which base URL to use; instead it invents its own path prefix (/pypi/<package-name>/json at the host root) to look for the package metadata. And since this index is a simple-only index, the request would 404 even if Dependabot were using the correct path prefix.
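
To spell out the difference between the two requests in the log, here is a small sketch that derives both URLs from a configured index URL. It is purely illustrative and not taken from Dependabot's source: the helper names are hypothetical, and `registry.example.test` stands in for the redacted host.

```ruby
require "uri"

# Name normalisation as described in PEP-503: runs of "-", "_" and "."
# collapse to a single "-", and the result is lowercased.
def normalise_name(name)
  name.gsub(/[-_.]+/, "-").downcase
end

# Hypothetical helper: the project page under the configured simple index.
# This matches the first (successful) request in the log.
def simple_project_url(index_url, name)
  base = index_url.end_with?("/") ? index_url : "#{index_url}/"
  URI.join(base, "#{normalise_name(name)}/").to_s
end

# Hypothetical helper: a metadata URL built from the host root, discarding
# the configured path. This matches the shape of the request that 404s.
def host_root_json_url(index_url, name)
  URI.join(index_url, "/pypi/#{normalise_name(name)}/json").to_s
end

index = "https://registry.example.test/artifactory/api/pypi/pypi/simple/"
puts simple_project_url(index, "smart_open")
puts host_root_json_url(index, "smart_open")
```

The second helper only reproduces the URL shape seen in the log; why Dependabot ends up with that URL is not established here.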
