
Regression - Text registry migration broken on master #3876

Closed
DP19 opened this issue Aug 15, 2023 · 12 comments
Labels
kind/bug: Categorizes issue or PR as related to a bug.
lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@DP19
Contributor

DP19 commented Aug 15, 2023

What happened:

Updating from pre-v0.12.0 to latest master causes external-dns to attempt to delete records that don't exist yet.

msg="Del records: a-some-record.some-domain.com. TXT [\"heritage=external-dns,external-dns/owner=cluster-1,external-dns/resource=virtualservice/ns/vs\"] 300"
...
"googleapi: Error 404: The 'entity.change.deletions[a-some-record.some-domain.com.][TXT]' resource named 'a-some-record.some-domain.com. (TXT)' does not exist...

What you expected to happen:
The migration to the new txt records in v0.12.0 should apply without issue

How to reproduce it (as minimally and precisely as possible):
Run external-dns in an environment pre-v0.12.0 with --registry=txt
Run latest master with --registry=txt and observe attempts to delete records that don't exist yet
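
For reference, a hypothetical reproduction sketch; the flags are the ones from the Environment section below, and provider credentials/cluster setup are assumed:

```shell
# Step 1: run a pre-v0.12.0 build of external-dns with the TXT registry.
external-dns --policy=sync --registry=txt --source=istio-virtualservice \
  --domain-filter=some-domain.com --provider=google --txt-owner-id=cluster-1

# Step 2: run latest master with the same flags and watch the logs for
# "Del records" entries targeting TXT records that were never created.
```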

Anything else we need to know?:

Admittedly I don't know whether this affects all providers, but it is happening with the Google provider in my environment. The names in the log output have been changed, but it conveys the gist of what's happening.

I stepped through the code, and it appears to be related to handling this migration via a ForceUpdate strategy in #3635.

These two lines end up with the same records to change, and in the txt registry code they both generate the new recordType-prefixed TXT record:
https://github.com/kubernetes-sigs/external-dns/blob/master/registry/txt.go#L277
https://github.com/kubernetes-sigs/external-dns/blob/master/registry/txt.go#L286

This ends up submitting a delete and recreate of the new TXT record when the recordType-record TXT entry doesn't exist yet.
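
Simplified, the failure mode looks like this (a standalone sketch, not the actual registry code; the names and helper are illustrative):

```go
package main

import (
	"fmt"
	"strings"
)

// newFormatTXTName mimics how the registry derives the new-style ownership
// record name, e.g. "cname-oldformat.test-zone.example.org".
func newFormatTXTName(recordType, name string) string {
	return strings.ToLower(recordType) + "-" + name
}

func main() {
	// Before the migration, only the old-format ownership TXT exists.
	existing := map[string]bool{
		"oldformat.test-zone.example.org": true,
	}

	// UpdateOld and UpdateNew both resolve to the same new-format TXT name,
	// so the plan emits a delete+create pair for that name.
	toDelete := newFormatTXTName("CNAME", "oldformat.test-zone.example.org")
	toCreate := toDelete

	if !existing[toDelete] {
		// This is where the Google provider returns a 404: the delete
		// targets a record that was never created.
		fmt.Printf("delete of %q would fail: record does not exist\n", toDelete)
	}
	fmt.Printf("create of %q would follow\n", toCreate)
}
```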

Environment:

  • External-DNS version (use external-dns --version): master
  • DNS provider: google
  • Others: flags - --policy=sync --registry=txt --source=istio-virtualservice --domain-filter=some-domain.com --provider=google --txt-owner-id=cluster-1
@DP19 added the kind/bug label Aug 15, 2023
@DP19
Contributor Author

DP19 commented Aug 16, 2023

Wrote a simple test to show the issue

func TestTXTMigration(t *testing.T) {
	ctx := context.Background()
	p := inmemory.NewInMemoryProvider()
	p.CreateZone(testZone)
	// Seed the provider with old-format ownership TXT records (pre v0.12.0).
	p.ApplyChanges(ctx, &plan.Changes{
		Create: []*endpoint.Endpoint{
			newEndpointWithOwner("oldformat.test-zone.example.org", "foo.loadbalancer.com", endpoint.RecordTypeCNAME, ""),
			newEndpointWithOwner("oldformat.test-zone.example.org", "\"heritage=external-dns,external-dns/owner=owner\"", endpoint.RecordTypeTXT, ""),
			newEndpointWithOwner("oldformat2.test-zone.example.org", "bar.loadbalancer.com", endpoint.RecordTypeA, ""),
			newEndpointWithOwner("oldformat2.test-zone.example.org", "\"heritage=external-dns,external-dns/owner=owner\"", endpoint.RecordTypeTXT, ""),
		},
	})
	r, _ := NewTXTRegistry(p, "", "", "owner", time.Hour, "wc", []string{endpoint.RecordTypeCNAME, endpoint.RecordTypeA, endpoint.RecordTypeNS}, false, nil)
	records, err := r.Records(ctx)
	assert.NoError(t, err)

	// The desired state contains only the service records; the plan should
	// migrate the ownership TXT records without deleting anything that does
	// not exist yet.
	testPlan := plan.Plan{
		Current: records,
		Desired: []*endpoint.Endpoint{
			{
				DNSName:    "oldformat.test-zone.example.org",
				Targets:    endpoint.Targets{"foo.loadbalancer.com"},
				RecordType: endpoint.RecordTypeCNAME,
			},
			{
				DNSName:    "oldformat2.test-zone.example.org",
				Targets:    endpoint.Targets{"bar.loadbalancer.com"},
				RecordType: endpoint.RecordTypeA,
			},
		},
		Policies:       []plan.Policy{&plan.SyncPolicy{}},
		ManagedRecords: []string{"A", "AAAA", "CNAME"},
	}

	txtMigrationChanges := testPlan.Calculate()

	err = r.ApplyChanges(ctx, txtMigrationChanges.Changes)
	assert.NoError(t, err)
}

This breaks in two places where the plan tries to apply UpdateOld and UpdateNew, but the new-format TXT records don't exist yet.

@DP19
Contributor Author

DP19 commented Sep 29, 2023

This should no longer be an issue once #3774 is released

@danil-smirnov

danil-smirnov commented Nov 3, 2023

@DP19 We experience this issue when upgrading from 0.10.2 to 0.13.6:

{"level":"fatal","msg":"googleapi: Error 404: The 'entity.change.deletions[a-wg.infra.domain.tld.][TXT]' resource named 'a-wg.infra.domain.tld. (TXT)' does not exist.\nMore details:\nReason: notFound, Message: The 'entity.change.deletions[a-wg.infra.domain.tld.][TXT]' resource named 'a-wg.infra.domain.tld. (TXT)' does not exist...

Any workaround?

@DP19
Contributor Author

DP19 commented Nov 3, 2023

@DP19 We experience this issue when upgrading from 0.10.2 to 0.13.6:

{"level":"fatal","msg":"googleapi: Error 404: The 'entity.change.deletions[a-wg.infra.domain.tld.][TXT]' resource named 'a-wg.infra.domain.tld. (TXT)' does not exist.\nMore details:\nReason: notFound, Message: The 'entity.change.deletions[a-wg.infra.domain.tld.][TXT]' resource named 'a-wg.infra.domain.tld. (TXT)' does not exist...

Any workaround?

@danil-smirnov yes, upgrade to 0.13.5 first and let it create the new TXT records. You should be able to upgrade to the latest version after that.
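
As a sketch, assuming a Helm-managed deployment (release name, chart, and image tags below are illustrative, not from this thread):

```shell
# Step 1: upgrade to v0.13.5 and let it run at least one full sync so the
# new-format TXT ownership records get created.
helm upgrade external-dns external-dns/external-dns --set image.tag=v0.13.5

# Step 2: once the new TXT records exist, upgrade to the latest release.
helm upgrade external-dns external-dns/external-dns --set image.tag=v0.14.0  # illustrative tag
```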

@danil-smirnov

danil-smirnov commented Nov 3, 2023

Thank you @DP19 , this helped!

However experiencing some errors in the logs now: #4018 (maybe unrelated)

@DP19
Contributor Author

DP19 commented Nov 3, 2023

@danil-smirnov That might just be due to v0.13.5 not updating all of the txt records.

Can you run that version with the --once flag and see if there are any errors when updating all of the records? This runs only a single sync, which makes it easy to tell whether something is failing there.
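
Something like this, reusing the flags from the original report (adjust to your environment; --once exits after a single reconciliation loop):

```shell
# One-shot sync; inspect the output for update errors.
external-dns --once --policy=sync --registry=txt --source=istio-virtualservice \
  --domain-filter=some-domain.com --provider=google --txt-owner-id=cluster-1
```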

@danil-smirnov

You are right @DP19 - I hadn't applied the two-step upgrade to the Cloudflare instance. After going through 0.13.5, the error was fixed.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Feb 1, 2024
@DP19 closed this as not planned (won't fix, can't repro, duplicate, stale) Feb 4, 2024
@alebedev87
Contributor

@DP19 : why does the upgrade need to go specifically through version 0.13.5? Are 0.13.1, 0.13.2, and 0.13.4 not good?

@DP19
Contributor Author

DP19 commented Jul 25, 2024

@DP19 : why does the upgrade need to go specifically through version 0.13.5? Are 0.13.1, 0.13.2, and 0.13.4 not good?

I think they are; the migration you put in place for those versions should work without issue. I think we picked 0.13.5 because it was the last release before the change that caused this issue went in, which also lets other fixes and changes run between 0.12.0 and 0.13.5.

@alebedev87
Contributor

@DP19 : Thanks for the reply!

I think we just picked this version because it was the last release before the change that went in and caused this issue

Yeah, that's what I thought too.

However, I'm still puzzled about the reasoning for not fixing the forced update and instead waiting for the "new new" format:

  1. The issue filed here exists and requires a two-step migration
  2. The new format is still there (created by the forced update)
  3. The "new new" format is not merged

We could use your change until the "new new" format is available.

@DP19
Contributor Author

DP19 commented Jul 26, 2024

@DP19 : Thanks for the reply!

I think we just picked this version because it was the last release before the change that went in and caused this issue

Yeah, that's what I thought too.

However, I'm still puzzled about the reasoning for not fixing the forced update and instead waiting for the "new new" format:

  1. The issue filed here exists and requires a two-step migration
  2. The new format is still there (created by the forced update)
  3. The "new new" format is not merged

We could use your change until the "new new" format is available.

@alebedev87 I believe he thought at the time it would be merged and in by the next release. I haven't looked at the code to see whether it will still work with all of the other changes, but if someone wants to take a shot at that I'd be all for it 🤷🏻
