Creating a Repository Item w/ an Alias Does Weird Things #1095

Closed
seth-shaw-unlv opened this issue Apr 17, 2019 · 19 comments

Comments

@seth-shaw-unlv
Contributor

Funny what pops up during testing sprints that you didn't expect:

If I create a Repository Item without an alias, the item gets indexed in Gemini, the triplestore, and Fedora just fine. I can update the item and all updates are persisted to the triplestore and Fedora. If I then add an alias, Gemini keeps the old node URL and the new alias replaces the node URL in Fedora's schema:sameAs. The triplestore, however, starts using the new alias as the URI for the node, and all the old metadata for the original node URL just sits there.

If I create a Repository Item with an alias from the start, the alias is used in Gemini, Fedora, and the triplestore. If I ever change the alias, the Gemini lookups start failing and Fedora will fail to update. The triplestore, as in the other case, simply starts using the new alias and leaves the old data sitting around.

What do we do about this, if anything? 🤷‍♂️ Thoughts, @Islandora-CLAW/committers?

@rosiel
Member

rosiel commented Apr 18, 2019

My tepid take: ignore aliases and use /node/foo as the node's URI. Aliases are for presentation, not keeping track of stuff.

@dannylamb
Contributor

We should consistently use /node/1 unless we want trouble every time a user changes an alias. Trying to understand the situation fully, here. So everything stays the same except schema:sameAs, which changes to use the new alias?

For the most part we're using $entity->toUrl() everywhere, but there are definitely inconsistencies in the parameters in different situations. See https://github.com/Islandora-CLAW/islandora/blob/8.x-1.x/src/EventGenerator/EventGenerator.php#L29 vs https://github.com/Islandora-CLAW/islandora/blob/8.x-1.x/src/Plugin/ContextReaction/MappingUriPredicateReaction.php#L62. What gets put into the queue is not precisely what gets altered into the jsonld using context. I'm sure it's a subtle distinction like that that's causing this (and potentially other) issues.

@seth-shaw-unlv
Contributor Author

@dannylamb, the only practical difference between those two examples is the addition of the ->setRouteParameter('_format', 'jsonld') which simply adds ?_format=jsonld to the end. 'canonical' is the default mode and the ['absolute' => TRUE] parameter of toUrl is equivalent to adding the setAbsolute() call (adds the domain to the URL). So, yeah, we could be more consistent, but it doesn't have a practical effect in this case.

The URL object is very insistent that it use the alias in the URL. The only function that gives us the internal path (/node/1) is Url::getInternalPath, but it doesn't consider setAbsolute (to include the domain) or setRouteParameter (to include _format=jsonld). We would have to write our own helper function to get consistent external URLs built from the internal path. As far as Drupal is concerned, the internal path should stay internal.
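
To make the shape of such a helper concrete, here is a minimal sketch in Python for illustration only (the real helper would live in Drupal PHP). The function name `stable_entity_url` and its parameters are hypothetical; it simply assembles the domain, the internal path, and the optional `_format` query parameter without ever consulting an alias.

```python
from urllib.parse import urlencode, urlunsplit


def stable_entity_url(host, entity_type, entity_id, fmt=None, scheme="http"):
    """Build an absolute URL from the internal path (e.g. /node/1).

    Hypothetical helper: the alias is never consulted, so the result
    does not change when a user edits the alias.
    """
    path = f"/{entity_type}/{entity_id}"
    # Only add the query string when a serialization format is requested.
    query = urlencode({"_format": fmt}) if fmt else ""
    return urlunsplit((scheme, host, path, query, ""))


plain = stable_entity_url("localhost:8000", "node", 1)
jsonld = stable_entity_url("localhost:8000", "node", 1, fmt="jsonld")
```

Keeping this logic in one place means every caller (queue messages, headers, JSON-LD) would produce the same identifier for the same entity.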

@dannylamb
Contributor

@seth-shaw-unlv++ I was not aware those were the defaults. Good example, eh? ^_^

To confirm my fears, I tried curl

$ curl -I localhost:8000/node/1

HTTP/1.1 200 OK
Date: Thu, 18 Apr 2019 15:45:41 GMT
Server: Apache/2.4.18 (Ubuntu)
X-Powered-By: PHP/7.1.28-1+ubuntu16.04.1+deb.sury.org+3
Cache-Control: must-revalidate, no-cache, private
Link: <http://purl.org/coar/resource_type/c_c513>; rel="tag"; title="Image"
Link: <http://localhost:8000/media/1>; rel="related"; title="Original File"
Link: <http://localhost:8000/media/2>; rel="related"; title="Service File"
Link: <http://localhost:8000/media/3>; rel="related"; title="Thumbnail Image"
Link: <http://localhost:8000/node/1?_format=jsonld>; rel="alternate"; type="application/ld+json"
Link: <http://localhost:8000/node/1?_format=json>; rel="alternate"; type="application/json"
Link: <http://localhost:8000/node/1>; rel="alternate"; hreflang="en"
Link: </node/1>; rel="canonical"
Link: </node/1>; rel="shortlink"
Link: </node/1>; rel="revision"
X-Drupal-Dynamic-Cache: MISS
X-UA-Compatible: IE=edge
Content-language: en
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
Expires: Sun, 19 Nov 1978 05:00:00 GMT
Vary: 
X-Generator: Drupal 8 (https://www.drupal.org)
X-Drupal-Cache: MISS
Content-Type: text/html; charset=UTF-8

give it an alias and...

$ curl -I localhost:8000/node/1

HTTP/1.1 200 OK
Date: Thu, 18 Apr 2019 15:50:13 GMT
Server: Apache/2.4.18 (Ubuntu)
X-Powered-By: PHP/7.1.28-1+ubuntu16.04.1+deb.sury.org+3
Cache-Control: must-revalidate, no-cache, private
Link: <http://purl.org/coar/resource_type/c_c513>; rel="tag"; title="Image"
Link: <http://localhost:8000/media/1>; rel="related"; title="Original File"
Link: <http://localhost:8000/media/2>; rel="related"; title="Service File"
Link: <http://localhost:8000/media/3>; rel="related"; title="Thumbnail Image"
Link: <http://localhost:8000/are-you-in-the-triplestore?_format=jsonld>; rel="alternate"; type="application/ld+json"
Link: <http://localhost:8000/are-you-in-the-triplestore?_format=json>; rel="alternate"; type="application/json"
Link: <http://localhost:8000/are-you-in-the-triplestore>; rel="alternate"; hreflang="en"
Link: </are-you-in-the-triplestore>; rel="canonical"
Link: </node/1>; rel="shortlink"
Link: </are-you-in-the-triplestore>; rel="revision"
X-Drupal-Dynamic-Cache: MISS
X-UA-Compatible: IE=edge
Content-language: en
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
Expires: Sun, 19 Nov 1978 05:00:00 GMT
Vary: 
X-Generator: Drupal 8 (https://www.drupal.org)
X-Drupal-Cache: MISS
Content-Type: text/html; charset=UTF-8

canonical is both relative and mutable. Which is definitely not in line with what I assumed canonical meant. I guess it's more like "this is my preferred url" or "this is my published url", but not "this is the url that can never change" 💩

So yeah, shared utility function for sure, if Drupal's not gonna give it to us willingly. Those urls are essentially ids, so we can't have them changing on us. It does raise the question of whether we want to capture the alias in RDF, and how that alias plays along with the fedora vs drupal url thing we've got going on.
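
For anyone wanting to reproduce the comparison, here is a small Python sketch that groups `Link` header targets by their `rel` value. The parser is deliberately simplistic (one link per header line, quoted `rel`), and the names `LINK_RE` and `links_by_rel` are made up for this example. The sample input is taken from the second curl response above, where `shortlink` still points at /node/1 while `canonical` has switched to the alias.

```python
import re

# Matches: <target>; rel="value"  (other parameters such as title are ignored)
LINK_RE = re.compile(r'<([^>]*)>\s*;\s*rel="([^"]*)"')


def links_by_rel(header_lines):
    """Group Link header targets by rel, ignoring non-Link headers."""
    rels = {}
    for line in header_lines:
        if not line.lower().startswith("link:"):
            continue
        match = LINK_RE.search(line)
        if match:
            target, rel = match.groups()
            rels.setdefault(rel, []).append(target)
    return rels


aliased_response = [
    "HTTP/1.1 200 OK",
    'Link: <http://localhost:8000/media/1>; rel="related"; title="Original File"',
    'Link: </are-you-in-the-triplestore>; rel="canonical"',
    'Link: </node/1>; rel="shortlink"',
    'Link: </are-you-in-the-triplestore>; rel="revision"',
]
rels = links_by_rel(aliased_response)
```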

@seth-shaw-unlv
Contributor Author

Yeah, rel=canonical is a flag to search engines saying "use this URL" in your index for this page/content (instead of other URLs for the same page/content due to GET parameters that don't impact content, redirects, mobile versions, etc.). Drupal assumes that if you give a node an alias, THAT is the URL search engines should direct users to. See also Yoast's "the ultimate guide."

I think we need to keep the alias in Fedora and the triplestore. For one, if we ever need to rebuild Drupal from Fedora, we will want the alias stored there. For another, if we do let people query the triplestore, the alias is the URI they will expect to use (unless we start displaying the internal path URLs prominently); so we at least need to include a relationship with our internal path URI so people will know which URI we are actually using.

@rosiel
Member

rosiel commented Apr 18, 2019

As far as I can tell, Drupal 8 (like 7 before it) allows you to access /node/x at /node/x as well as at /my-fancy-alias. It's just internally, when creating a link to /node/x, it'll replace it with /my-fancy-alias. Like in 7, I'd bet there's a module to force /node/x to switch over to its alias.

I'm really ... scared ... of treating Drupal aliases as "URIs". They're changeable and human-readable and follow no set schema.

This is off topic but kinda related... if we want to expose a triplestore based on Drupal, should we maybe pop off the ?_format=jsonld? Because like seth-shaw-unlv pointed out, it's not exactly what someone expects to query for or see. The URI of an object can be different from the URL of the document that describes it *puts down can opener and admires the pretty worms* ;)

@DiegoPino
Contributor

DiegoPino commented Apr 18, 2019

You could always put the alias in a local identifier property, right? Since aliases can pretty much change in Drupal without notifying the actual node (path-auto), they are dangerous creatures to assume constant or persistent. Also, you would have to dump their data (they are not configuration entities, right?) when moving to a new site or restoring, and I'm pretty sure you won't be allowed to pre-create aliases before having the actual nodes ingested, so it's an egg-dinosaur-egg-extinction dilemma.

@seth-shaw-unlv
Contributor Author

@DiegoPino, actually, aliases are the only thing you could keep when migrating from one Drupal to another. You can give Drupal content, including an alias, at creation, but then Drupal decides what your internal path is. So, if you theoretically lost your Drupal site and had to rebuild from Fedora, your Gemini database and triplestore of Drupal URIs would be useless unless you repopulate the new Drupal giving it your content in the exact same order again (including dummy content for anything that was deleted, to account for node IDs no longer being used).

If you really want a URI to persist from one Drupal to another AND throughout time... create an alias and then somehow lock it down so no one can change it.

@DiegoPino
Contributor

@seth-shaw-unlv true. My experience varies there, but maybe it's my interpretation of your answer and the fact that you/I could have path-auto, which is in fact a different way of aliasing.

When you say

including an alias, at creation

Yes, not saying you should not restore them or not keep them around. My fault about the precreation. No precreation, but creation at the same time is possible.
But then again, how do you let Islandora know that, if you changed an alias, your resources need to be pushed to Fedora, the triplestore, etc. again? Is that already automatic? Maybe I'm not aware of that and am still thinking about path-auto. If you had them in another field, you could reuse them on reingest via REST (no jsonapi support for now); HAL allows you to request an alias for a new resource. Just checked, so that is OK.

But still, probably all your references/links inside the ecosystem happen via uuid and, sadly, many times via the uid, so you need to keep your uids anyway. Even Views building depends on the uid/uuid, not your path alias. So re-ingesting a set of nodes related to each other requires a lot of mangling, waiting for the node id to come back for the first request (files!), if you are thinking about restoring from scratch. There are ways, yes, but still partial to this. I feel you need uid, uuid and alias, the whole package.

On the other side JSONAPI allows to define your entity UUID and that also allows you to avoid overwriting nodes but not your alias.

True too, the node id (uid) cannot be ensured (unless you migrate your full table, or in your case here many tables) and jsonapi disallows setting it. But the uuid can be persisted on export, re-import, via jsonapi, etc. So how you build your alias is also an issue. And then you also have language-based aliases. How do you handle that? Also, you can have many aliases, and when requesting one you will always get the most recent one. (I even remember on 7.x killing paths because a shortcut became the used alias and then the title of an object got stuck forever...)

I think I mentioned this some time ago (like years ago), but it seems a better approach (and there I agree with you) that, if using an alias is the only thing you can control and persist and it's important for you all, every Object/node gets an alias automatically and you can all agree on that one being the real thing (remember islandora/pid). On my side, I do that by generating one automatically (that becomes my "purl") based on the uuid. Maybe even simpler: you can agree that API interaction always happens on that single PURL and that the response includes all the aliases (many, many), so your Fedora, triplestore, etc. always contain that one. And then, well, UI etc. can do whatever it wants.
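
As a rough illustration of the uuid-based "purl" idea, here is a Python sketch. The `/purl` prefix and the function name `purl_alias` are assumptions made for the example, not anything Drupal or Islandora defines; the point is only that the alias is derived deterministically from the uuid, so it can be regenerated after a re-import.

```python
import uuid


def purl_alias(entity_uuid, prefix="/purl"):
    """Derive a stable alias from an entity UUID.

    Hypothetical scheme: normalizing through uuid.UUID rejects malformed
    input and yields the lowercase, hyphenated form regardless of how the
    caller formatted it.
    """
    normalized = str(uuid.UUID(str(entity_uuid)))
    return f"{prefix}/{normalized}"


alias = purl_alias("12345678-1234-5678-1234-567812345678")
same_alias = purl_alias("12345678123456781234567812345678".upper())
```

Because the alias depends only on the uuid, two sites that preserve uuids on export and import would end up with the same "purl" even if their node ids differ.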

This is a good code read (pointing to 8.7 just in case) https://git.drupalcode.org/project/drupal/blob/8.7.x/core/lib/Drupal/Core/Path/AliasStorage.php

and kinda needed to understand how aliases are handled on storage = crud.

@DiegoPino
Contributor

Oh, also, Aliases are going to change in the future
https://www.drupal.org/project/drupal/issues/2336597
Not sure if that affects anything there

@rosiel
Member

rosiel commented Apr 18, 2019

Haha @DiegoPino you convinced me a UUID-based "purl"-like alias was the thing to do before you said that was your solution. If only there were fields on aliases so that we could mark these ones as "special". (I didn't read the entire Drupal issue you pointed to, but if the "5 months ago" stuff is anything like the "5 years ago" stuff, that is a serious possibility).

Restoring from scratch, with lots of stuff related to each other, is going to require a chain of migrations and migration lookups... if it's true that you can't set the UID (i.e. unique id, i.e. nid, tid, or uid?) during migrate, then you're gonna do lookups and I don't think the aliases (or lack thereof) make it any harder or easier.

seth-shaw-unlv added this to the 1.0.0 milestone May 10, 2019

@dannylamb
Contributor

Looking like we won't land this before release, but we can hide the alias fields in the form and document in the meantime.

@seth-shaw-unlv
Contributor Author

@dannylamb, yeah, I suppose that will work for now.

When we do get to it, as it looks like we are going to rely on the internal path, I would like the JSON-LD serialization to use the internal path for the @id but also include a schema:sameAs so search engines know to relate the two URIs together.
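
Something like the following Python sketch is what I have in mind for the resulting document. It is an illustration only: the function name `node_jsonld` is hypothetical, the output is hand-built expanded JSON-LD rather than anything produced by the actual serializer, and the exact @id form (with or without ?_format=jsonld) is still an open question.

```python
import json


def node_jsonld(base_url, node_id, alias=None):
    """Build a minimal expanded JSON-LD node keyed on the internal path.

    The alias, when present, is recorded as schema:sameAs so the two
    URIs stay linked without the alias ever becoming the identifier.
    """
    doc = {"@id": f"{base_url}/node/{node_id}"}
    if alias:
        doc["http://schema.org/sameAs"] = [{"@id": f"{base_url}{alias}"}]
    return doc


doc = node_jsonld("http://localhost:8000", 1, "/are-you-in-the-triplestore")
print(json.dumps(doc, indent=2))
```

Changing the alias would then only change the schema:sameAs value; the @id that Gemini, Fedora, and the triplestore key on stays put.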

@whikloj
Member

whikloj commented May 23, 2019

We might want to raise this with the Drupal community to see if there is a consistent way to get the actual node id instead of the canonical one.

@dannylamb
Contributor

I refactored every instance where we generate urls into some basic utility functions in IslandoraUtils and just call those instead. Now that we're consistent, turns out this sorts itself out! PR pending once I touch up tests.

@dannylamb
Contributor

@whikloj Calling Url::fromRoute on entity.node.canonical will always give you the /node/* route. That gets mangled later on down the line to use the alias if you call url() on the entity.

I'm relying on that behaviour for now, but at least when it changes we only have to update the code in one spot.

@dannylamb
Contributor

@seth-shaw-unlv I don't think this is a thing anymore since we've 'standardized' how we generate URLs that go into headers and the events and such. OK to close?

@seth-shaw-unlv
Contributor Author

@dannylamb, my only hesitation to closing this is that we don't push the alias to Fedora in any form anymore. If someone wanted to rebuild an Islandora site just from their Fedora repo, they would lose any aliases they have. But, yeah, the root issue here (aliases breaking things) is resolved.

@dannylamb
Contributor

@seth-shaw-unlv I made a ticket for publishing url aliases in RDF and am closing this one.


5 participants