Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pulp doesn't handle conflicting package filenames properly #2678

Open
daviddavis opened this issue Jul 29, 2022 · 8 comments
Open

Pulp doesn't handle conflicting package filenames properly #2678

daviddavis opened this issue Jul 29, 2022 · 8 comments
Assignees

Comments

@daviddavis
Copy link
Contributor

daviddavis commented Jul 29, 2022

Version
pulpcore 3.21.0.dev
pulp-rpm 3.18.0.dev

Describe the bug
If you add two packages with the same filename to a repo, then only one gets published/distributed (good). However, both show up in primary.xml (not so good) and the checksum can potentially not match (bad).

To Reproduce
This is a bit of a contrived example but it demonstrates the problem. Basically you add two packages with the same filename to a repo and publish it.

API Commands
    wget https://fixtures.pulpproject.org/rpm-unsigned/bear-4.1-1.noarch.rpm
    wget https://fixtures.pulpproject.org/rpm-unsigned/camel-0.1-1.noarch.rpm

    http --form :24817/pulp/api/v3/content/rpm/packages/ file@bear-4.1-1.noarch.rpm relative_path=camel-0.1-1.noarch.rpm
    http --form :24817/pulp/api/v3/content/rpm/packages/ file@camel-0.1-1.noarch.rpm

    http :24817/pulp/api/v3/repositories/rpm/rpm/ name=test
    http :24817/pulp/api/v3/distributions/rpm/rpm/ name=test base_path=test repository=/pulp/api/v3/repositories/rpm/rpm/4d390079-60ff-4916-b80f-51b6756f2757/

    http :24817/pulp/api/v3/repositories/rpm/rpm/4d390079-60ff-4916-b80f-51b6756f2757/modify/ add_content_units:='["/pulp/api/v3/content/rpm/packages/290113b4-0ca2-4bb4-8b5a-725d471d865a/", "/pulp/api/v3/content/rpm/packages/8df08649-2272-4006-8270-eee1d453edd2/"]'

    http :24817/pulp/api/v3/publications/rpm/rpm/ repository=/pulp/api/v3/repositories/rpm/rpm/4d390079-60ff-4916-b80f-51b6756f2757/

Expected behavior
When I view the packages I see only one (camel) which is expected:

http :24816/pulp/content/test/Packages/c/

When I get its checksum, I get the checksum for the package named formerly known as bear-4.1-1.noarch.rpm which is also acceptable:

http :24816/pulp/content/test/Packages/c/camel-0.1-1.noarch.rpm | sha256sum
ceb0f0bb58be244393cc565e8ee5ef0ad36884d8ba8eec74542ff47d299a34c1

However, when I download the primary.xml, I see two entries: one for bear at Packages/b/bear-4.1-1.noarch.rpm (which doesn't exist) with a checksum of ceb0f0 and one for camel at Packages/c/camel-0.1-1.noarch.rpm which does exist but it has a checksum of c5c34 (which doesn't match the checksum of the actual package at /pulp/content/test/Packages/c/camel-0.1-1.noarch.rpm).

@daviddavis
Copy link
Contributor Author

daviddavis commented Jul 29, 2022

Here's a copy of the primary.xml:

<?xml version="1.0" encoding="UTF-8"?>
<metadata xmlns="http://linux.duke.edu/metadata/common" xmlns:rpm="http://linux.duke.edu/metadata/rpm" packages="2">
<package type="rpm">
  <name>bear</name>
  <arch>noarch</arch>
  <version epoch="0" ver="4.1" rel="1"/>
  <checksum type="sha256" pkgid="YES">ceb0f0bb58be244393cc565e8ee5ef0ad36884d8ba8eec74542ff47d299a34c1</checksum>
  <summary>A dummy package of bear</summary>
  <description>A dummy package of bear</description>
  <packager></packager>
  <url>http://tstrachota.fedorapeople.org</url>
  <time file="1659114662" build="1331831374"/>
  <size package="1846" installed="42" archive="296"/>
  <location href="Packages/b/bear-4.1-1.noarch.rpm"/>
  <format>
    <rpm:license>GPLv2</rpm:license>
    <rpm:vendor></rpm:vendor>
    <rpm:group>Internet/Applications</rpm:group>
    <rpm:buildhost>smqe-ws15</rpm:buildhost>
    <rpm:sourcerpm>bear-4.1-1.src.rpm</rpm:sourcerpm>
    <rpm:header-range start="280" end="1697"/>
    <rpm:provides>
      <rpm:entry name="bear" flags="EQ" epoch="0" ver="4.1" rel="1"/>
    </rpm:provides>
  </format>
</package>
<package type="rpm">
  <name>camel</name>
  <arch>noarch</arch>
  <version epoch="0" ver="0.1" rel="1"/>
  <checksum type="sha256" pkgid="YES">c5c34e1843847990d2c79b936309d6257279e26f907e20f9f58073a14525e1ef</checksum>
  <summary>A dummy package of camel</summary>
  <description>A dummy package of camel</description>
  <packager></packager>
  <url>http://tstrachota.fedorapeople.org</url>
  <time file="1659114663" build="1331831360"/>
  <size package="1847" installed="42" archive="296"/>
  <location href="Packages/c/camel-0.1-1.noarch.rpm"/>
  <format>
    <rpm:license>GPLv2</rpm:license>
    <rpm:vendor></rpm:vendor>
    <rpm:group>Internet/Applications</rpm:group>
    <rpm:buildhost>smqe-ws15</rpm:buildhost>
    <rpm:sourcerpm>camel-0.1-1.src.rpm</rpm:sourcerpm>
    <rpm:header-range start="280" end="1697"/>
    <rpm:provides>
      <rpm:entry name="camel" flags="EQ" epoch="0" ver="0.1" rel="1"/>
    </rpm:provides>
  </format>
</package>
</metadata>

@daviddavis
Copy link
Contributor Author

FYI, it sounds like we'll not be allowing users to set relative_path on upload so the package should be named using nevra and thus, this isn't going to be a high priority for us.

@dralley dralley self-assigned this May 23, 2023
@pulpbot pulpbot moved this to In Progress in RH Pulp Kanban board May 23, 2023
@pulpbot
Copy link
Member

pulpbot commented May 23, 2023

@dralley
Copy link
Contributor

dralley commented Sep 11, 2023

K so I'm finally looking at this one. Here's the state of things.

Pulp has one repository-level uniqueness filter (not constraint in the database sense, it's enforced by logic). That filter as it currently exists is N + E + V + R + A + "location_href".

location_href is filled in with NEVRA if you don't manually override it using the "relative_path" parameter. But you can set that parameter to anything, which is not great, especially because it isn't part of the global package unique constraint (so whatever value you use during an upload will be used everywhere).

Scenarios where it behaves CORRECTLY:

  • If you add two packages with the same NEVRA and different checksums (one signed one unsigned) to one repository without setting the "relative_path", then the repo-unique-filter will match, and the repo will end up with only one, as the newer one will replace the older one.

  • If you add two packages with the same NEVRA and different checksums to one repository while using the same "relative_path" for each

Scenarios where it behaves KIND OF CORRECTLY:

  • If you add two packages with the same NEVRA (one signed one unsigned) to one repository with different "relative_path"s set, then both will be allowed into the same repository, but when publishing only one will appear in the metadata. The metadata will not match the metadata pulp describes for the repository version, though.
  • As per the fix we applied in this patch, if you provide a "relative_path" of bear-1.0-1.noarch.rpm when uploading the package camel-1.0-1.noarch.rpm, the package will be published under the name bear-1.0-1.noarch.rpm
    • This totally sucks and is stupid, but some people have done something comparable.

Scenarios where it behaves INCORRECTLY:

  • If you take two different-NEVRA packages and give them overlapping "relative_path" on purpose, then they will both be allowed in the same repository, and they will all appear in the metadata, but only one will "win" at content-distribution time.
    • So take the last example - if bear-1.0-1.noarch.rpm exists in the same repository, then the two will silently conflict and you might have issues w/ DNF.
    • The symptoms are the same as a previous bug we encountered, described here except that was worse because it happened whenever you used relative_path with a value that didn't exactly match the one expected based on the NEVRA

The general TL;DR is that using the "relative_path" parameter on the upload endpoint generally leads to suboptimal outcomes.

How to Fix:

  • We could remove the relative_path param from our API entirely or make it no-op. Technically against SEMVER, but if we consider it broken, there's wiggle room.
  • We could provide two repo uniqueness filters - one for NEVRA and one for location_href. But A) Pulp doesn't allow that currently and B) it doesn't resolve the problem that that package + checksum is already immutably tied to a BS relative_path. i.e. pulp_rpm doesn't allow packages with different filenames but same NEVRA+checksum #2931

We could do one or both but either is painful and not a complete fix. Fixing data that already exists is a can of worms I'm going to discuss on the BZ not here. That's a separate conversation.

Neither of these feel like great options? But I would probably prefer the former in the short-term.

@daviddavis
Copy link
Contributor Author

We don't allow regular users to set the relative_path for packages but I think we want to allow admins to do so--let me double check on this. Would having a config variable that defaults to False (e.g. ALLOW_UNSAFE_SETTING_OF_RPM_NAMES) be an option?

@ipanova
Copy link
Member

ipanova commented Sep 15, 2023

Given Daniels comment and analyses, I agree that providing relative_path on the upload creates more problems than benefits, where the benefits are still unclear.

I would be in favor of removing in the upload api the relative_path or making it no-op. We'd also then attempt to fix already existing data by properly setting the package.location_href and content_artifact.relative_path stored in DB.

@daviddavis everything that has in it's name unsafe implies no guarantee and ask for trouble for us to maintain somehow the corner case as for the user who will get non-deterministic results. Unless there is a justifiable strong usecase to keep relative_path on the upload I'd remove it/make it no-op.

@daviddavis
Copy link
Contributor Author

daviddavis commented Sep 15, 2023

I checked and we don't the ability to set filenames when uploading rpms.

We'd also then attempt to fix already existing data by properly setting the package.location_href and content_artifact.relative_path stored in DB.

This makes me a bit nervous. Would it be possible to let users opt and/or have a dry run option that would list packages that have bad location_hrefs/relative_paths?

@ipanova
Copy link
Member

ipanova commented Sep 18, 2023

@daviddavis That probably won't be a problem for the django management command

dralley added a commit to dralley/pulp_rpm that referenced this issue Jan 31, 2024
This will avoid situations where the same package can potentially
co-exist in a repo under different filenames.

closes pulp#2678
dralley added a commit to dralley/pulp_rpm that referenced this issue Feb 8, 2024
This will avoid situations where the same package can potentially
co-exist in a repo under different filenames.

closes pulp#2678
dralley added a commit to dralley/pulp_rpm that referenced this issue Feb 8, 2024
This will avoid situations where the same package can potentially
co-exist in a repo under different filenames.

closes pulp#2678
dralley added a commit to dralley/pulp_rpm that referenced this issue Feb 8, 2024
This will avoid situations where the same package can potentially
co-exist in a repo under different filenames.

closes pulp#2678
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Status: In Progress
Development

No branches or pull requests

4 participants