Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Indexing wasnt forced anymore since middleware order has changed #729

Closed
rengaw83 opened this issue Mar 18, 2021 · 23 comments · Fixed by #1018
Closed

Indexing wasnt forced anymore since middleware order has changed #729

rengaw83 opened this issue Mar 18, 2021 · 23 comments · Fixed by #1018

Comments

@rengaw83
Copy link
Contributor

rengaw83 commented Mar 18, 2021

Indexing was'nt forced anymore.

TYPO3 9.5.25
Crawler: 9.2.2

The urls are crawled, but the indexing wasn't forced.

In have debugged a lot, and in my case the problem is the current middleware registration order.
In #642 the order was changed, to fix issues with StaticFileCache extension.
See AOEpeople@0f7cb6a#diff-38337bad08776b0fe94f73f1cf471f8d184ec3f9d2e9254f7c2a275b83501e34

The middleware order (without staticfilecache installed) now looks like this:

    typo3/cms-frontend/timetracker
    typo3/cms-core/normalized-params-attribute
    typo3/cms-frontend/preprocessing
    typo3/cms-frontend/eid
    typo3/cms-frontend/tsfe
    typo3/cms-frontend/authentication
    typo3/cms-frontend/site
    typo3/cms-frontend/preview-simulator
    aoe/crawler/authentication
    typo3/cms-frontend/base-redirect-resolver
    typo3/cms-frontend/static-route-resolver
    typo3/cms-frontend/page-resolver
    typo3/cms-frontend/maintenance-mode
    typo3/cms-frontend/page-argument-validator
    typo3/cms-frontend/prepare-tsfe-rendering
    typo3/cms-frontend/shortcut-and-mountpoint-redirect
    typo3/cms-frontend/content-length-headers
    aoe/crawler/initialization
    typo3/cms-frontend/output-compression

The aoe/crawler/initialization middleware is loaded after the typo3/cms-frontend/prepare-tsfe-rendering now.
The crawler can't fill $GLOBALS['TSFE']->applicationData['tx_crawler'] before the TSFE checks the headerNoCache and so indexed_search cant disable cache output. so no page is generated and no content can be indexed.

Here typo3 9 checks if the page should be generated or the cache will be used
https://github.com/TYPO3/TYPO3.CMS/blob/9.5/typo3/sysext/frontend/Classes/Controller/TypoScriptFrontendController.php#L2498-L2519

Here indexed search checks if the crawler is enabled and a indexing is in progress
https://github.com/TYPO3/TYPO3.CMS/blob/9.5/typo3/sysext/indexed_search/Classes/Hook/TypoScriptFrontendHook.php#L25-L40

I have reverted AOEpeople@0f7cb6a and the indexing works like expected again.

The crawler/initialization can't be the last in middleware chain, it has to be before the tsfe rendering!
I do not use the staticfilecache, so i can't fix issues with the staticfilecache and create a pr.

Maybe this is related to #679, but i have created a new issue, becouse i do not have staticfilecache like bh-teufels has.

@tomasnorre
Copy link
Owner

Thanks for you detailed information, I'll see if I can find a solution for this.
It sounds like it's related to the issue as you mentioned, and the information about the revert is also very helpful.

Will get back to you when I have more information on this.

@tomasnorre tomasnorre added 3rd party ext Issue related to 3rd party extension e.g. News Bug Priority 2 TYPO3v10 TYPO3v9 labels Mar 18, 2021
@rengaw83
Copy link
Contributor Author

Have tested the same crawler setup in a TYPO3 10.4.14 Installation.
Same problem there, no indexing is forced.
After reverting the changes at the middleware order, indexing works like expected in TYPO3 10 too.

@tomasnorre
Copy link
Owner

Super. That's really valuable feedback.

I'll need to dig more into Middleware, it's unfortunately something that I don't know well enough yet.

@rengaw83
Copy link
Contributor Author

Thank you, no problem, do not rush.
I have corrected the crawler middleware order in my own extension, so for me there is no hurry.

@tomasnorre
Copy link
Owner

I'm afraid we have a chicken/egg problem here.

The AOEpeople@0f7cb6a moved the aoe/crawler/initialization to after the typo3/cms-frontend/tsfe (close to the very end), and this was needed to have the Static File Cache Working.

So moving this back in that position will be a regression, that introduces the bug from #642 again.

Any hints on this would be welcome.

Do you @CDRO or @bmack have an idea on how to solve this?

@CDRO
Copy link
Contributor

CDRO commented Apr 12, 2021

Could we split up the middlewares into one called initialization (as we have it now) and a pre-compression-initialization and extract the parts that are needed to where they need to be to both accomodate indexing (of course, this is the primary need) and static file cache (which in my opinion is secondary to the crawler and the indexing business but very important, nontheless)?

@tomasnorre
Copy link
Owner

I honestly don't know. I'm still missing some basics knowledge on the middleware concept, but in general I don't see a problem in splitting it if it would solve the problem. But still not sure if it will solve the problem or introduce a new one.

Can be because of conceptual missing information from my site.

@CDRO
Copy link
Contributor

CDRO commented Apr 12, 2021

We could discuss this over slack tomorrow afternoon, if you have some time, we can still summarize our talking points here.

@bmack
Copy link
Contributor

bmack commented Apr 12, 2021

Btw... it is possible to define restrictions in EXT:crawler for middlewares that do not exist in an installation (as far as I can remember): "Load after TSFE initialization, but before EXT:staticfilecache middleware XYZ") staticfilecache has very "lax" middleware orderings so we can push it in the right order (don't know from just reading this issue where EXT: staticfilecache should be loaded and where the crawler middleware should be loaded).

@tomasnorre
Copy link
Owner

Do you have a link to the documentation @bmack ?

@CDRO
Copy link
Contributor

CDRO commented Apr 13, 2021

So, we discussed it with @tomasnorre and found what we need to change.

Tomas prepares an according PR.

What we did is split the initialization to have a part that ensures that the content is always rendered, before we return the result status to the crawler runner.

tomasnorre added a commit that referenced this issue Apr 13, 2021
This middleware splits the content from the Crawler Initialization Middleware
to ensure that the content is written at the end.

This ensures that middleware that expects response-object gets the content
for the renders and the crawler queue gets the correct request status.

Resolves: #729
@tomasnorre
Copy link
Owner

@rengaw83 Could you please check if this PR fixes the problem for you?
https://github.com/AOEpeople/crawler/pull/735

tomasnorre added a commit that referenced this issue Apr 14, 2021
This middleware splits the content from the Crawler Initialization Middleware
to ensure that the content is written at the end.

This ensures that middleware that expects response-object gets the content
for the renders and the crawler queue gets the correct request status.

Resolves: #729
@rengaw83
Copy link
Contributor Author

hi @tomasnorre,
i have testet the branch but it does not work.

The position of the initialization have not changed and is still in the wrong position.
The initialization middleware has to be placed before the typo3/cms-frontend/prepare-tsfe-rendering.

The $GLOBALS['TSFE']->applicationData['tx_crawler'] stuff, which is set by the initialization middleware, has to be set before the prepare-tsfe-rendering middleware. Otherwise TYPO3 will provide cached content and indexed_search does not index cached content.

At the typo3/cms-frontend/prepare-tsfe-rendering middleware the nocache checks are performed and the indexed_search TSFE hook to disable the caching kicks in.

indexed_search needs the crawler initialisation to disable the cache for the request at this point.

Its the same problem like #610.

@tomasnorre
Copy link
Owner

@rengaw83 Thanks for getting back to me. We thought splitting it into the ContentFinisher the problem would be solved.
I'll go back to the drawing board.

Don't want to break anything in the StaticFileCache with a fix in the Crawler.

@rengaw83
Copy link
Contributor Author

I understand, it's a bit tricky with the different dependencies, i guess.
I don't know what's the problem with staticfilecache, have never used it.
But the thing is, currently the crawler does not work if the staticfilecache extension is not installed.

I have added a RequestMiddlewares.php in my project extension and take care of the correct middleware position by myself.
With these restrictions, everything works as expected:

return [
    'frontend' => [
        'aoe/crawler/initialization' => [
            'target' => \AOE\Crawler\Middleware\CrawlerInitialization::class,
            'after' => [
                'typo3/cms-frontend/tsfe',
            ],
            'before' => [
                'typo3/cms-frontend/prepare-tsfe-rendering',
            ],
        ],
    ],
];

@tomasnorre
Copy link
Owner

Thanks will have a look at it.

@stale
Copy link

stale bot commented Dec 21, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the staled label Dec 21, 2022
@tomasnorre tomasnorre removed the staled label Dec 21, 2022
@stale
Copy link

stale bot commented Feb 20, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the staled label Feb 20, 2023
@tomasnorre tomasnorre removed the staled label Feb 21, 2023
@stale
Copy link

stale bot commented Apr 22, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the staled label Apr 22, 2023
@tomasnorre tomasnorre removed the staled label Apr 24, 2023
@cweiske
Copy link
Contributor

cweiske commented May 25, 2023

Since commit 8a9b896 (#837) the "real" response is returned by the CrawlerInitialization middleware. The information about the crawling/indexing process is transferred via a HTTP header now.

The reason for the patch 0f7cb6a (#642) that broke the middleware order is gone, and we can restore the proper loading order now.

cweiske added a commit to mogic-le/t3x-crawler that referenced this issue May 25, 2023
History:
--------
Because of a problem with lochmueller/staticfilecache,
crawler issue tomasnorre#642
changed the middleware loading order to execute crawler after static file cache.
(commit 0f7cb6a)

The source of the problem was that the crawler CrawlerInitialization middleware
overwrote the HTTP response that was generated by TYPO3.

Since commit 8a9b896
(issue tomasnorre#837)
the HTTP response is not destroyed/overwritten by crawler anymore
but moved into a HTTP header "X-T3Crawler-Meta".
The loading order does not influence compatibility with
static file cache anymore.

Bug
---
The changed loading order in the bug fix led to the problem that
> indexed_search:TypoScriptFrontendHook
was executed before
> crawler:CrawlerInitialization

But CrawlerInitialization must be run before TypoScriptFrontendHook
because it loads request data that are needed by indexed_search.

This led to bug tomasnorre#729
- forced reindexing by the crawler did not work anymore if the
page was already in cache.

Solution
--------
Restore the HTTP middleware loading order as it was before
the fix for tomasnorre#642, so that the code path is again:

1. crawler:FrontendUserAuthenticator
   (aoe/crawler/authentication)

2. crawler:CrawlerInitialization
   (aoe/crawler/initialization)

3. indexed_search:TypoScriptFrontendHook
   (called by typo3/cms-frontend/prepare-tsfe-rendering)

Resolves: tomasnorre#729
cweiske added a commit to mogic-le/t3x-crawler that referenced this issue May 25, 2023
History:
--------
Because of a problem with lochmueller/staticfilecache,
crawler issue tomasnorre#642
changed the middleware loading order to execute crawler after static file cache.
(commit 0f7cb6a)

The source of the problem was that the crawler CrawlerInitialization middleware
overwrote the HTTP response that was generated by TYPO3.

Since commit 8a9b896
(issue tomasnorre#837)
the HTTP response is not destroyed/overwritten by crawler anymore
but moved into a HTTP header "X-T3Crawler-Meta".
The loading order does not influence compatibility with
static file cache anymore.

Bug
---
The changed loading order in the bug fix led to the problem that
> indexed_search:TypoScriptFrontendHook
was executed before
> crawler:CrawlerInitialization

But CrawlerInitialization must be run before TypoScriptFrontendHook
because it loads request data that are needed by indexed_search.

This led to bug tomasnorre#729
- forced reindexing by the crawler did not work anymore if the
page was already in cache.

Solution
--------
Restore the HTTP middleware loading order as it was before
the fix for tomasnorre#642, so that the code path is again:

1. crawler:FrontendUserAuthenticator
   (aoe/crawler/authentication)

2. crawler:CrawlerInitialization
   (aoe/crawler/initialization)

3. indexed_search:TypoScriptFrontendHook
   (called by typo3/cms-frontend/prepare-tsfe-rendering)

Resolves: tomasnorre#729
cweiske added a commit to mogic-le/t3x-crawler that referenced this issue May 26, 2023
History:
--------
Because of a problem with lochmueller/staticfilecache,
crawler issue tomasnorre#642
changed the middleware loading order to execute crawler after static file cache.
(commit 0f7cb6a)

The source of the problem was that the crawler CrawlerInitialization middleware
overwrote the HTTP response that was generated by TYPO3.

Since commit 8a9b896
(issue tomasnorre#837)
the HTTP response is not destroyed/overwritten by crawler anymore
but moved into a HTTP header "X-T3Crawler-Meta".
The loading order does not influence compatibility with
static file cache anymore.

Bug
---
The changed loading order in the bug fix led to the problem that
> indexed_search:TypoScriptFrontendHook
was executed before
> crawler:CrawlerInitialization

But CrawlerInitialization must be run before TypoScriptFrontendHook
because it loads request data that are needed by indexed_search.

This led to bug tomasnorre#729
- forced reindexing by the crawler did not work anymore if the
page was already in cache.

Solution
--------
Restore the HTTP middleware loading order as it was before
the fix for tomasnorre#642, so that the code path is again:

1. crawler:FrontendUserAuthenticator
   (aoe/crawler/authentication)

2. crawler:CrawlerInitialization
   (aoe/crawler/initialization)

3. indexed_search:TypoScriptFrontendHook
   (called by typo3/cms-frontend/prepare-tsfe-rendering)

Resolves: tomasnorre#729
@stale
Copy link

stale bot commented Jul 25, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the staled label Jul 25, 2023
@tomasnorre tomasnorre removed the staled label Jul 25, 2023
@cweiske
Copy link
Contributor

cweiske commented Jul 26, 2023 via email

@tomasnorre
Copy link
Owner

I'm considering disabling it.

tomasnorre pushed a commit that referenced this issue Feb 13, 2024
History:
--------
Because of a problem with lochmueller/staticfilecache,
crawler issue #642
changed the middleware loading order to execute crawler after static file cache.
(commit 0f7cb6a)

The source of the problem was that the crawler CrawlerInitialization middleware
overwrote the HTTP response that was generated by TYPO3.

Since commit 8a9b896
(issue #837)
the HTTP response is not destroyed/overwritten by crawler anymore
but moved into a HTTP header "X-T3Crawler-Meta".
The loading order does not influence compatibility with
static file cache anymore.

Bug
---
The changed loading order in the bug fix led to the problem that
> indexed_search:TypoScriptFrontendHook
was executed before
> crawler:CrawlerInitialization

But CrawlerInitialization must be run before TypoScriptFrontendHook
because it loads request data that are needed by indexed_search.

This led to bug #729
- forced reindexing by the crawler did not work anymore if the
page was already in cache.

Solution
--------
Restore the HTTP middleware loading order as it was before
the fix for #642, so that the code path is again:

1. crawler:FrontendUserAuthenticator
   (aoe/crawler/authentication)

2. crawler:CrawlerInitialization
   (aoe/crawler/initialization)

3. indexed_search:TypoScriptFrontendHook
   (called by typo3/cms-frontend/prepare-tsfe-rendering)

Resolves: #729
cweiske added a commit to mogic-le/t3x-crawler that referenced this issue Feb 13, 2024
History:
--------
Because of a problem with lochmueller/staticfilecache,
crawler issue tomasnorre#642
changed the middleware loading order to execute crawler after static file cache.
(commit 0f7cb6a)

The source of the problem was that the crawler CrawlerInitialization middleware
overwrote the HTTP response that was generated by TYPO3.

Since commit 8a9b896
(issue tomasnorre#837)
the HTTP response is not destroyed/overwritten by crawler anymore
but moved into a HTTP header "X-T3Crawler-Meta".
The loading order does not influence compatibility with
static file cache anymore.

Bug
---
The changed loading order in the bug fix led to the problem that
> indexed_search:TypoScriptFrontendHook
was executed before
> crawler:CrawlerInitialization

But CrawlerInitialization must be run before TypoScriptFrontendHook
because it loads request data that are needed by indexed_search.

This led to bug tomasnorre#729
- forced reindexing by the crawler did not work anymore if the
page was already in cache.

Solution
--------
Restore the HTTP middleware loading order as it was before
the fix for tomasnorre#642, so that the code path is again:

1. crawler:FrontendUserAuthenticator
   (aoe/crawler/authentication)

2. crawler:CrawlerInitialization
   (aoe/crawler/initialization)

3. indexed_search:TypoScriptFrontendHook
   (called by typo3/cms-frontend/prepare-tsfe-rendering)

Resolves: tomasnorre#729
tomasnorre pushed a commit that referenced this issue Feb 13, 2024
History:
--------
Because of a problem with lochmueller/staticfilecache,
crawler issue #642
changed the middleware loading order to execute crawler after static file cache.
(commit 0f7cb6a)

The source of the problem was that the crawler CrawlerInitialization middleware
overwrote the HTTP response that was generated by TYPO3.

Since commit 8a9b896
(issue #837)
the HTTP response is not destroyed/overwritten by crawler anymore
but moved into a HTTP header "X-T3Crawler-Meta".
The loading order does not influence compatibility with
static file cache anymore.

Bug
---
The changed loading order in the bug fix led to the problem that
> indexed_search:TypoScriptFrontendHook
was executed before
> crawler:CrawlerInitialization

But CrawlerInitialization must be run before TypoScriptFrontendHook
because it loads request data that are needed by indexed_search.

This led to bug #729
- forced reindexing by the crawler did not work anymore if the
page was already in cache.

Solution
--------
Restore the HTTP middleware loading order as it was before
the fix for #642, so that the code path is again:

1. crawler:FrontendUserAuthenticator
   (aoe/crawler/authentication)

2. crawler:CrawlerInitialization
   (aoe/crawler/initialization)

3. indexed_search:TypoScriptFrontendHook
   (called by typo3/cms-frontend/prepare-tsfe-rendering)

Resolves: #729
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants