Extends headless and sitemap transformers to pass app URL through transformer parameter
uarlouski committed Jan 12, 2024
1 parent 5296feb commit e9b0a63
Showing 10 changed files with 165 additions and 41 deletions.
40 changes: 34 additions & 6 deletions docs/modules/plugins/pages/plugin-web-app-to-rest-api.adoc
Original file line number Diff line number Diff line change
Expand Up @@ -13,11 +13,20 @@ include::partial$plugin-installation.adoc[]

`FROM_SITEMAP` transformer generates table based on the website sitemap.

IMPORTANT: The use of the `web-application.main-page-url` property to set the main page for crawling is deprecated and will be removed in VIVIDUS 0.7.0. Please use either the `mainPageUrl` transformer parameter or the `transformer.from-sitemap.main-page-url` property instead.

[cols="1,3", options="header"]

|===

|Parameter
|Description

|`mainPageUrl`
a|main application page URL, used as initial seed URL that is fetched by the crawler to extract new URLs in it and follow them for crawling.

IMPORTANT: The main page URL defined by this parameter overrides the value defined by the `transformer.from-sitemap.main-page-url` property.

|`siteMapRelativeUrl`
|relative URL of `sitemap.xml`

Expand All @@ -26,15 +35,25 @@ include::partial$plugin-installation.adoc[]

|`column`
|the column name in the generated table

|===

[cols="3,1,1,3", options="header"]

|===

|Property Name
|Acceptable values
|Default
|Description

|`transformer.from-sitemap.main-page-url`
|`URL`
|
a|main application page URL, used as initial seed URL that is fetched by the crawler to extract new URLs in it and follow them for crawling.

IMPORTANT: The main page URL defined by this property is overridden by the value of the `mainPageUrl` transformer parameter.

|`transformer.from-sitemap.ignore-errors`
a|`true`
`false`
Expand All @@ -46,9 +65,8 @@ a|`true`
`false`
|`false`
|defines whether URLs that redirect to a URL already included in the table are excluded from the table

|===
-==== Required properties
-* `web-application.main-page-url` - defines main application page URL

.Usage example
----
Expand All @@ -60,12 +78,19 @@ Examples:

`FROM_HEADLESS_CRAWLING` transformer generates table based on the results of headless crawling.

IMPORTANT: The use of the `web-application.main-page-url` property to set the main page for crawling is deprecated and will be removed in VIVIDUS 0.7.0. Please use either the `mainPageUrl` transformer parameter or the `transformer.from-headless-crawling.main-page-url` property instead.

[cols="1,3", options="header"]
|===

|Parameter Name
|Description

|`mainPageUrl`
a|main application page URL, used as initial seed URL that is fetched by the crawler to extract new URLs in it and follow them for crawling.

IMPORTANT: The main page URL defined by this parameter overrides the value defined by the `transformer.from-headless-crawling.main-page-url` property.

|`column`
|The column name in the generated table.

Expand All @@ -81,6 +106,13 @@ Examples:

4+^.^|_General_

|`transformer.from-headless-crawling.main-page-url`
|`URL`
|
a|main application page URL, used as initial seed URL that is fetched by the crawler to extract new URLs in it and follow them for crawling.

IMPORTANT: The main page URL defined by this property is overridden by the value of the `mainPageUrl` transformer parameter.

|`transformer.from-headless-crawling.seed-relative-urls`
|Comma-separated list of values
|
Expand Down Expand Up @@ -229,10 +261,6 @@ transformer.from-headless-crawling.http.headers.x-vercel-protection-bypass=1fac2

|===

-==== Required properties
-
-* `{main-page-url}` - defines main application page URL

.Usage example
----
Examples:
Expand Down
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
/*
-* Copyright 2019-2023 the original author or authors.
+* Copyright 2019-2024 the original author or authors.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
Expand Down Expand Up @@ -33,6 +33,7 @@
import org.vividus.transformer.ExtendedTableTransformer;
import org.vividus.ui.web.configuration.WebApplicationConfiguration;
import org.vividus.util.ExamplesTableProcessor;
import org.vividus.util.UriUtils;

public abstract class AbstractFetchingUrlsTableTransformer implements ExtendedTableTransformer
{
Expand All @@ -43,6 +44,7 @@ public abstract class AbstractFetchingUrlsTableTransformer implements ExtendedTa
private WebApplicationConfiguration webApplicationConfiguration;
private HttpRedirectsProvider httpRedirectsProvider;
private boolean filterRedirects;
private URI mainPageUrl;

@Override
public String transform(String tableAsString, TableParsers tableParsers, TableProperties properties)
Expand Down Expand Up @@ -101,6 +103,24 @@ private String build(Set<String> urls, TableProperties properties)
.buildExamplesTableFromColumns(List.of(columnName), List.of(urlsList), properties);
}

protected URI getMainApplicationPageUri(TableProperties properties)
{
URI uri = Optional.ofNullable(properties.getProperties().getProperty("mainPageUrl"))
.map(UriUtils::createUri)
.orElse(mainPageUrl);

if (uri == null)
{
uri = webApplicationConfiguration.getMainApplicationPageUrl();
logger.atWarn().addArgument("web-application.main-page-url")
.log("The use of {} property for setting of main page for crawling is deprecated and will "
+ "be removed in VIVIDUS 0.7.0, please see official documentation for corresponding"
+ " replacements.");
}

return uri;
}

public void setWebApplicationConfiguration(WebApplicationConfiguration webApplicationConfiguration)
{
this.webApplicationConfiguration = webApplicationConfiguration;
Expand All @@ -111,13 +131,13 @@ public void setHttpRedirectsProvider(HttpRedirectsProvider httpRedirectsProvider
this.httpRedirectsProvider = httpRedirectsProvider;
}

-protected URI getMainApplicationPageUri()
+public void setFilterRedirects(boolean filterRedirects)
{
-return webApplicationConfiguration.getMainApplicationPageUrl();
+this.filterRedirects = filterRedirects;
}

-public void setFilterRedirects(boolean filterRedirects)
+public void setMainPageUrl(URI mainPageUrl)
{
-this.filterRedirects = filterRedirects;
+this.mainPageUrl = mainPageUrl;
}
}
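
The lookup order introduced by `getMainApplicationPageUri(TableProperties)` can be sketched in isolation. The following is a minimal, JDK-only illustration and not the VIVIDUS class itself: `MainPageUrlResolution`, `resolve` and its three arguments are hypothetical names, `java.util.Properties` stands in for JBehave's `TableProperties`, and `URI.create` stands in for `UriUtils.createUri`. Unlike the real fallback, which fails on a blank legacy URL, the sketch simply returns `null` when nothing is configured.

```java
import java.net.URI;
import java.util.Optional;
import java.util.Properties;

// Hypothetical stand-in for the transformer's lookup; names are illustrative only.
final class MainPageUrlResolution
{
    private MainPageUrlResolution()
    {
    }

    // tableProperties: parameters written next to the transformer in the table
    // propertyValue:   value bound from transformer.<name>.main-page-url (may be null)
    // legacyValue:     value of the deprecated web-application.main-page-url (may be null)
    static URI resolve(Properties tableProperties, URI propertyValue, URI legacyValue)
    {
        // 1. The mainPageUrl transformer parameter wins when present.
        // 2. Otherwise the transformer-specific property is used.
        URI uri = Optional.ofNullable(tableProperties.getProperty("mainPageUrl"))
                .map(URI::create)
                .orElse(propertyValue);
        if (uri == null)
        {
            // 3. Last resort: the deprecated property. The method in the commit
            //    also logs a deprecation warning at this point.
            uri = legacyValue;
        }
        return uri;
    }
}
```

Calling `resolve` with a `mainPageUrl` entry in the table properties returns that entry regardless of the other two arguments, which mirrors the three new unit tests in this commit.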
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
/*
-* Copyright 2019-2023 the original author or authors.
+* Copyright 2019-2024 the original author or authors.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
Expand All @@ -18,9 +18,10 @@

import java.net.URI;
import java.util.Set;
-import java.util.function.Supplier;

-import com.google.common.base.Suppliers;
+import com.google.common.cache.CacheBuilder;
+import com.google.common.cache.CacheLoader;
+import com.google.common.cache.LoadingCache;

import org.apache.commons.lang3.StringUtils;
import org.jbehave.core.model.ExamplesTable.TableProperties;
Expand Down Expand Up @@ -52,23 +53,22 @@ public class HeadlessCrawlerTableTransformer extends AbstractFetchingUrlsTableTr
private String excludeUrlsRegex;
private String excludeExtensionsRegex;

-private final Supplier<Set<String>> urlsProvider = Suppliers.memoize(() ->
-{
-if (!StringUtils.isEmpty(excludeExtensionsRegex))
-{
-LOGGER.warn(DEPRECATION_MESSAGE);
-}
-
-URI mainApplicationPage = getMainApplicationPageUri();
-CrawlController controller = crawlControllerFactory.createCrawlController(mainApplicationPage);
+private final LoadingCache<URI, Set<String>> crawledUrlsCache = CacheBuilder.newBuilder()
+.build(new CacheLoader<>()
+{
+@Override
+public Set<String> load(URI mainApplicationPage)
+{
+CrawlController controller = crawlControllerFactory.createCrawlController(mainApplicationPage);

-addSeeds(mainApplicationPage, controller);
+addSeeds(mainApplicationPage, controller);

-LinkCrawlerData linkCrawlerData = new LinkCrawlerData();
-controller.start(new LinkCrawlerFactory(linkCrawlerData, excludeUrlsRegex), NUMBER_OF_CRAWLERS);
-Set<String> absoluteUrls = linkCrawlerData.getAbsoluteUrls();
-return filterResults(absoluteUrls.stream());
-});
+LinkCrawlerData linkCrawlerData = new LinkCrawlerData();
+controller.start(new LinkCrawlerFactory(linkCrawlerData, excludeUrlsRegex), NUMBER_OF_CRAWLERS);
+Set<String> absoluteUrls = linkCrawlerData.getAbsoluteUrls();
+return filterResults(absoluteUrls.stream());
+}
+});

private void addSeeds(URI mainApplicationPage, CrawlController controller)
{
Expand Down Expand Up @@ -102,7 +102,13 @@ private void addSeed(CrawlController controller, String pageUrl)
@Override
protected Set<String> fetchUrls(TableProperties properties)
{
-return urlsProvider.get();
+if (!StringUtils.isEmpty(excludeExtensionsRegex))
+{
+LOGGER.warn(DEPRECATION_MESSAGE);
+}
+
+URI mainApplicationPage = getMainApplicationPageUri(properties);
+return crawledUrlsCache.getUnchecked(mainApplicationPage);
}

public void setCrawlControllerFactory(ICrawlControllerFactory crawlControllerFactory)
Expand Down
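
Replacing the single memoized supplier with a cache keyed by URI follows from the new parameter: one memoized result can hold the crawl of only one main page, whereas different tables may now name different pages. A minimal sketch of the same idea using only the JDK is shown here, with `ConcurrentHashMap.computeIfAbsent` standing in for Guava's `LoadingCache`. `PerUrlCrawlCache`, its `crawler` function and `crawlCount` are hypothetical names introduced for illustration, and the crawler is assumed never to return `null`.

```java
import java.net.URI;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical illustration of per-URL caching; not the VIVIDUS implementation.
final class PerUrlCrawlCache
{
    private final Map<URI, Set<String>> cache = new ConcurrentHashMap<>();
    private final Function<URI, Set<String>> crawler;
    private final AtomicInteger crawlCount = new AtomicInteger();

    // crawler: the expensive operation that collects URLs starting from a main page
    PerUrlCrawlCache(Function<URI, Set<String>> crawler)
    {
        this.crawler = crawler;
    }

    // Runs the crawler at most once per distinct main page and reuses the result afterwards.
    Set<String> fetchUrls(URI mainPage)
    {
        return cache.computeIfAbsent(mainPage, key ->
        {
            crawlCount.incrementAndGet();
            return crawler.apply(key);
        });
    }

    // Number of crawls actually executed, exposed only to make the caching observable.
    int crawlCount()
    {
        return crawlCount.get();
    }
}
```

Requesting the same main page twice triggers one crawl, while a second, different main page triggers its own crawl instead of receiving the first page's results.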
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
/*
-* Copyright 2019-2023 the original author or authors.
+* Copyright 2019-2024 the original author or authors.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
Expand Down Expand Up @@ -48,7 +48,7 @@ public Set<String> fetchUrls(TableProperties properties)
.orElse(ignoreErrors);
String siteMapRelativeUrl = properties.getMandatoryNonBlankProperty("siteMapRelativeUrl", String.class);

-URI mainApplicationPage = getMainApplicationPageUri();
+URI mainApplicationPage = getMainApplicationPageUri(properties);
Set<String> urls = siteMaps.stream().filter(
s -> s.mainAppPage().equals(mainApplicationPage) && s.siteMapRelativeUrl().equals(siteMapRelativeUrl))
.findFirst().orElseGet(() ->
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -4,9 +4,11 @@ sitemap.parser.site-url=
sitemap.parser.base-url=
sitemap.parser.follow-redirects=true

transformer.from-sitemap.main-page-url=
transformer.from-sitemap.ignore-errors=false
transformer.from-sitemap.filter-redirects=false

transformer.from-headless-crawling.main-page-url=
transformer.from-headless-crawling.filter-redirects=false
transformer.from-headless-crawling.seed-relative-urls=
transformer.from-headless-crawling.exclude-extensions-regex=
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -73,6 +73,7 @@
<property name="siteMapParser" ref="siteMapParser" />
<property name="ignoreErrors" value="${transformer.from-sitemap.ignore-errors}" />
<property name="filterRedirects" value="${transformer.from-sitemap.filter-redirects}" />
<property name="mainPageUrl" value="${transformer.from-sitemap.main-page-url}" />
</bean>

<bean name="FROM_HEADLESS_CRAWLING" class="org.vividus.crawler.transformer.HeadlessCrawlerTableTransformer"
Expand All @@ -90,6 +91,7 @@
<property name="seedRelativeUrls" value="${transformer.from-headless-crawling.seed-relative-urls}" />
<property name="excludeUrlsRegex" value="${transformer.from-headless-crawling.exclude-urls-regex}" />
<property name="excludeExtensionsRegex" value="${transformer.from-headless-crawling.exclude-extensions-regex}" />
<property name="mainPageUrl" value="${transformer.from-headless-crawling.main-page-url}" />
</bean>

<bean id="FROM_HTML" class="org.vividus.crawler.transformer.HtmlDocumentTableTransformer">
Expand Down
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
/*
-* Copyright 2019-2023 the original author or authors.
+* Copyright 2019-2024 the original author or authors.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
Expand All @@ -18,11 +18,17 @@

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertThrows;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.verify;
import static org.mockito.Mockito.verifyNoInteractions;
import static org.mockito.Mockito.when;

import java.net.URI;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.commons.lang3.StringUtils;
import org.jbehave.core.configuration.Keywords;
import org.jbehave.core.model.ExamplesTable.TableProperties;
import org.jbehave.core.steps.ParameterConverters;
Expand All @@ -31,9 +37,11 @@
import org.junit.jupiter.params.provider.CsvSource;
import org.vividus.ui.web.configuration.AuthenticationMode;
import org.vividus.ui.web.configuration.WebApplicationConfiguration;
import org.vividus.util.UriUtils;

class FetchingUrlsTableTransformerTests
{
private static final URI PAGE_URI = UriUtils.createUri("https://example.com");
private static final String URLS = "urls";
private static final String COLUMN = "column";
private final TestTransformer transformer = new TestTransformer();
Expand Down Expand Up @@ -70,11 +78,48 @@ void shouldHandleInvalidInputs(String propertiesAsString, String tableAsString,
@Test
void testTransformWithoutMainApplicationPageUrl()
{
+TableProperties props = new TableProperties(StringUtils.EMPTY, new Keywords(), new ParameterConverters());
transformer.setWebApplicationConfiguration(new WebApplicationConfiguration(null, AuthenticationMode.URL));
-var exception = assertThrows(IllegalArgumentException.class, transformer::getMainApplicationPageUri);
+var exception = assertThrows(IllegalArgumentException.class,
+() -> transformer.getMainApplicationPageUri(props));
assertEquals("URL of the main application page should be non-blank", exception.getMessage());
}

@Test
void shouldReturnMainAppUrlFromTransformerParameter()
{
WebApplicationConfiguration webAppCfg = mock();
transformer.setWebApplicationConfiguration(webAppCfg);
transformer.setMainPageUrl(null);
TableProperties props = new TableProperties("mainPageUrl=%s".formatted(PAGE_URI), new Keywords(),
new ParameterConverters());
assertEquals(PAGE_URI, transformer.getMainApplicationPageUri(props));
verifyNoInteractions(webAppCfg);
}

@Test
void shouldReturnMainAppUrlFromMainPageUrlProperty()
{
WebApplicationConfiguration webAppCfg = mock();
transformer.setWebApplicationConfiguration(webAppCfg);
transformer.setMainPageUrl(PAGE_URI);
TableProperties props = new TableProperties(StringUtils.EMPTY, new Keywords(), new ParameterConverters());
assertEquals(PAGE_URI, transformer.getMainApplicationPageUri(props));
verifyNoInteractions(webAppCfg);
}

@Test
void shouldReturnMainAppUrlFromWebConfiguration()
{
WebApplicationConfiguration webAppCfg = mock();
when(webAppCfg.getMainApplicationPageUrl()).thenReturn(PAGE_URI);
transformer.setWebApplicationConfiguration(webAppCfg);
transformer.setMainPageUrl(null);
TableProperties props = new TableProperties(StringUtils.EMPTY, new Keywords(), new ParameterConverters());
assertEquals(PAGE_URI, transformer.getMainApplicationPageUri(props));
verify(webAppCfg).getMainApplicationPageUrl();
}

private static final class TestTransformer extends AbstractFetchingUrlsTableTransformer
{
@Override