This project explains step by step how to create a webcrawler with Kotlin and run it as a Job in IBM Cloud Code Engine.
IBM Cloud Code Engine offers the ability to run jobs, in other words, software that is meant to run and finish
relatively quickly. Applications, on the other hand, are meant to accept HTTP requests. For more information visit
https://cloud.ibm.com/docs/codeengine?topic=codeengine-getting-started
The job must comply with the following:
- Linting rules
- Static code analysis
- Automatic deployment
- Cloud readiness
Each step will be described in detail in this README.
Clone the repo and build it using Gradle.
You need a JVM installed, and we recommend IntelliJ IDEA as the development environment.
No special requirements are needed for now.
Just run the application using ./gradlew run
Select new project and configure it with the following parameters:
- Language: Kotlin
- Build system: Gradle
- JDK: version 17
- Gradle DSL: Kotlin
- Add sample code
This will create a file called "Main.kt" with the following content:
fun main(args: Array<String>) {
println("Hello World!")
// Try adding program arguments via Run/Debug configuration.
// Learn more about running applications: https://www.jetbrains.com/help/idea/running-applications.html.
println("Program arguments: ${args.joinToString()}")
}
If the execution of ./gradlew run finishes without problems, you can proceed to the next step. Otherwise, you can check the following:
- JVM version
- Latest Gradle version
- SDK in IntelliJ project
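The first item on this checklist can be verified from a terminal. The following is a minimal sketch, assuming a POSIX shell; `java_major_from_banner` is a hypothetical helper name that is not part of the project. It extracts the major version from the banner printed by `java -version`.

```sh
#!/bin/sh
# Hypothetical helper: extract the major Java version from a `java -version` banner,
# e.g. 'openjdk version "17.0.6" 2023-01-17' -> 17
java_major_from_banner() {
    # Take the quoted version string from the first line that contains one.
    version=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
    # Keep everything before the first dot.
    major=${version%%.*}
    if [ "$major" = "1" ]; then
        # Legacy numbering used up to Java 8: 1.8.0_362 -> 8
        rest=${version#1.}
        major=${rest%%.*}
    fi
    printf '%s\n' "$major"
}
```

Since `java -version` writes to standard error, a call would look like `java_major_from_banner "$(java -version 2>&1)"`. With the JDK selected during project creation, the result should be 17.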
Please follow this link. Once the dependencies have been downloaded, a new task ./gradlew shadowJar should exist. Once executed, two jars can be found in build/libs. We are interested in the one that ends with all.jar, since that is the one that has all its dependencies integrated.
In order to test if it works, run the command java -jar filename-all.jar inside build/libs.
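Because the jar name depends on the project name and version, it can be convenient to look it up instead of typing it. The following is a minimal sketch, assuming a POSIX shell and the default build/libs output directory; `find_fat_jar` is a hypothetical helper, not something the build provides.

```sh
#!/bin/sh
# Hypothetical helper: print the path of the first "*-all.jar" in the given
# directory (default: build/libs); return non-zero when there is none.
find_fat_jar() {
    dir="${1:-build/libs}"
    for jar in "$dir"/*-all.jar; do
        # If the glob matches nothing, the pattern is left unexpanded
        # and this file test fails.
        if [ -f "$jar" ]; then
            printf '%s\n' "$jar"
            return 0
        fi
    done
    return 1
}
```

From the project root, `java -jar "$(find_fat_jar)"` would then start the application without spelling out the file name.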
The recommended option for setting up CI quickly is GitHub Actions. Based on a configuration in YAML format, we can define actions that will be executed, for example, after any push. For more information, please visit the following link.
We can include a file with the following content at .github/workflows/push.yml:
name: Push Workflow
on:
  push:
    branches-ignore:
      - main
jobs:
  Push-Actions:
    runs-on: ubuntu-latest
    steps:
      - name: Check out repository code
        uses: actions/checkout@v3
      - name: Build jar
        uses: gradle/gradle-build-action@v2
        with:
          arguments: shadowJar
This ensures that every time a developer pushes changes, this check will be executed.
To save time and headaches later on, it is recommended that the application be cloud-ready from the beginning of development. One of the most effective ways to achieve this is to constantly maintain and test a Dockerfile. With this we achieve the following:
- Control over the environment in which the application is executed.
- Being able to reproduce the problems on your local machine.
- Possibility to upload that image to a container registry through a CI/CD pipeline.
In our case we use the ubi9/openjdk-17 image provided by Red Hat. This decision is based on two criteria:
- Constant maintenance by Red Hat.
- Simple usage: we need to copy the jar file to /deployments and we do not need additional parameters for our program to run.
The Dockerfile description:
FROM registry.access.redhat.com/ubi9/openjdk-17:1.14-2
RUN mkdir app
WORKDIR app
COPY --chown=default . .
RUN ./gradlew shadowJar
RUN cp build/libs/webcrawler*-all.jar /deployments
Build command: docker build . -t webcrawler:latest
Run command: docker run --rm webcrawler:latest
With this, we can extend our pipeline so that building the image and running the container become part of it:
name: Push Workflow
on:
  push:
    branches-ignore:
      - main
jobs:
  Push-Actions:
    runs-on: ubuntu-latest
    steps:
      - name: Check out repository code
        uses: actions/checkout@v3
      - name: Build jar
        uses: gradle/gradle-build-action@v2
        with:
          arguments: shadowJar
      - name: Build docker image
        uses: docker/build-push-action@v4
        with:
          tags: webcrawler:latest
      - name: Run the software
        run: docker run webcrawler:latest
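The same sequence can be reproduced on a development machine before pushing. The following is a minimal sketch, assuming a POSIX shell; `run_steps` is a hypothetical helper that mirrors how a workflow job stops at its first failing step.

```sh
#!/bin/sh
# Hypothetical helper: run each argument as one shell command, in order,
# and stop at the first command that fails.
run_steps() {
    for step in "$@"; do
        echo "==> $step"
        if ! sh -c "$step"; then
            echo "FAILED: $step" >&2
            return 1
        fi
    done
}
```

A local equivalent of the workflow would be `run_steps "./gradlew shadowJar" "docker build . -t webcrawler:latest" "docker run --rm webcrawler:latest"`, which requires Docker to be installed locally.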
If you do not have an IBM Cloud account, now is the time to create one. Through the following link you can see that there are two options: specify either an image or a repository. In our case the preferred option is an image that we can upload to IBM Cloud whenever we want. This way we can reproduce the same state locally as in the cloud.
By clicking on Start Creating we can select between creating an application or a job. The difference between both is basically that the application is intended to serve HTTP requests and the job is intended to execute a task.
As the name, we can enter webcrawler. By clicking on "create project" we can define the location, the name of the project, resources and tags. For the moment we will focus only on the first two. As the location, it is advisable to choose the nearest one, and as the name we can enter "webcrawler".
After clicking on "create project" we can configure the image to be executed. For the moment we can leave the HelloWorld example.
The rest of the options can be left at their defaults before clicking on "create". On the next page we can leave the default settings and click on "submit job". In the next menu we can also leave the default settings.
If everything worked correctly, the job will appear as completed.
For more detailed information, please visit the official documentation.
The next step is to create a container registry, in which we can save our container images. To do this we start by searching for Container Registry in the IBM Cloud search bar.
After clicking, we enter a product information page. In it we see the limitations of the lite version. At the time of writing, the limit is 0.5 GB of storage, which is sufficient for our purposes.
After clicking on "Get Started", we find a page that tells us how we can upload our images to the registry. To download IBM Cloud CLI with the necessary plugins, these two commands are sufficient:
curl -fsSL https://clis.cloud.ibm.com/install/linux | sh
ibmcloud plugin install -f container-registry
Once we have the software installed, we can try to upload the image generated in step 4 using the following commands (shown here for the Central Europe region):
ibmcloud login
ibmcloud cr region-set eu-central
ibmcloud cr namespace-add webcrawler
docker tag webcrawler:latest de.icr.io/webcrawler/webcrawler:latest
ibmcloud cr login
docker push de.icr.io/webcrawler/webcrawler:latest
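The image reference used in the tag and push commands has the form registry/namespace/name:tag. The following is a minimal sketch, assuming a POSIX shell, that assembles it in one place so the two commands cannot drift apart; `image_ref` is a hypothetical helper, and de.icr.io is the registry host used above for the Central Europe region.

```sh
#!/bin/sh
# Hypothetical helper: build "<registry>/<namespace>/<name>:<tag>".
# The tag defaults to "latest" when the fourth argument is omitted.
image_ref() {
    registry="$1"
    namespace="$2"
    name="$3"
    tag="${4:-latest}"
    printf '%s/%s/%s:%s\n' "$registry" "$namespace" "$name" "$tag"
}
```

With `ref=$(image_ref de.icr.io webcrawler webcrawler)`, the two Docker commands become `docker tag webcrawler:latest "$ref"` and `docker push "$ref"`.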
Now if you search inside the namespace in the container registry you can find the image.
Due to the 512 MB space limitation, it is important to retain only the most recent image. This is achieved by going into settings and selecting the "Retain only the most recent images in each repository" option, as well as disabling "Retain untagged images". Finally, select "Set recurring policy".
For more information, you can consult the official documentation.
To keep the image size as small as possible, we can use multi-stage builds. An example would be:
FROM registry.access.redhat.com/ubi9/openjdk-17:1.14-2 as builder
RUN mkdir app
WORKDIR app
COPY --chown=default . .
RUN ./gradlew shadowJar
FROM registry.access.redhat.com/ubi9/openjdk-17-runtime:1.14-2
COPY --from=builder --chown=default /home/default/app/build/libs/*-all.jar /deployments
To build and upload the image to IBM Cloud, we can reuse the steps above.
For more information, you can visit the official Docker documentation.
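To see whether the smaller image fits within the lite plan's storage limit, its local size can be compared with the quota. The following is a minimal sketch, assuming a POSIX shell and treating the 0.5 GB limit as 512 MiB; `fits_quota` and `bytes_to_mib` are hypothetical helpers. The registry stores compressed layers, so the local size is only a conservative estimate.

```sh
#!/bin/sh
# Assumption: the 0.5 GB lite quota expressed in bytes (512 * 1024 * 1024).
QUOTA_BYTES=536870912

# Succeeds when the given size in bytes does not exceed the quota.
fits_quota() {
    [ "$1" -le "$QUOTA_BYTES" ]
}

# Convert bytes to whole MiB (integer division, rounds down).
bytes_to_mib() {
    echo $(( $1 / 1048576 ))
}
```

The size in bytes can be read with `docker image inspect --format '{{.Size}}' webcrawler:latest` and passed to either function.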
First, we need to create an API key to authorize a connection from GitHub Actions. In this link you will find the button to create it.
After creation, you have the opportunity to copy it and enter it as a secret in the repository settings on GitHub. In Settings: Secrets and Variables->Actions, by clicking on the New repository secret button, we must add two secrets: IBM_CLOUD_API_KEY and IBM_CLOUD_REGION.
In the tab next to Secrets we can also define variables. In our case, we define CONTAINER_NAME.
After that, we are ready to add a new workflow for deployment. Here is an example:
name: Deploy Workflow
on:
  push:
    branches:
      - main
jobs:
  Deploy-Actions:
    runs-on: ubuntu-latest
    steps:
      - name: Install IBM Cloud CLI
        run: |
          curl -fsSL https://clis.cloud.ibm.com/install/linux | sh
          ibmcloud --version
          ibmcloud config --check-version=false
          ibmcloud plugin install -f container-registry
      - name: Authenticate with IBM Cloud CLI
        run: |
          ibmcloud login --apikey ${{ secrets.IBM_CLOUD_API_KEY }} -r ${{ secrets.IBM_CLOUD_REGION }}
          ibmcloud cr login --client docker
      - name: Build and push docker image
        uses: docker/build-push-action@v4
        with:
          tags: ${{ vars.CONTAINER_NAME }}
          push: true
      - name: Delete untagged images
        run: |
          ibmcloud cr image-prune-untagged -f
The code shows that we first download the IBM Cloud CLI, then install the dependencies, log in using the previously defined secrets, build and upload the image to the registry, and finally delete the untagged images left over from previous builds. This action will be executed whenever there is a change in the main branch.
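Before relying on the workflow, the registry steps can be rehearsed locally. The following is a minimal sketch, assuming a POSIX shell and an `ibmcloud login` already performed in the same terminal; `run` and `deploy` are hypothetical helpers, and the DRY_RUN variable is an assumption of this sketch that prints each command instead of executing it.

```sh
#!/bin/sh
# With DRY_RUN=1, print the command instead of executing it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper: repeat the registry steps of the workflow for one
# image reference. Logging in to IBM Cloud is deliberately left out so that
# the API key never appears in the printed commands.
deploy() {
    image="$1"
    run ibmcloud cr login --client docker &&
        run docker build . -t "$image" &&
        run docker push "$image" &&
        run ibmcloud cr image-prune-untagged -f
}
```

Setting `DRY_RUN=1` and calling `deploy de.icr.io/webcrawler/webcrawler:latest` lists the four commands. Without the variable they are executed in order and the sequence stops at the first failure.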
For more information you can visit the following links:
On GitHub we can configure Dependabot to keep our packages up to date. In our case we need to keep the packages for Gradle, Docker and GitHub Actions up to date. The configuration is very simple: define the file dependabot.yml in the .github directory.
version: 2
updates:
  # Maintain dependencies for GitHub Actions
  - package-ecosystem: "github-actions"
    directory: "/"
    schedule:
      interval: "weekly"
  # Maintain dependencies for gradle
  - package-ecosystem: "gradle"
    directory: "/"
    schedule:
      interval: "weekly"
  # Maintain dependencies for docker
  - package-ecosystem: "docker"
    directory: "/"
    schedule:
      interval: "weekly"
More information and further options can be found by following this link.
First we have to install chromium and chromedriver in the Dockerfile. Some of the dependencies are in the EPEL and CentOS Stream repositories. Here is the code of the second part of the Dockerfile:
ENV HEADLESS=TRUE
ARG packages="chromium chromedriver"
# Installs the os dependencies (chromium and chromedriver)
USER root
RUN rpm -ivh https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm \
    && rpm -ivh https://mirror.stream.centos.org/9-stream/BaseOS/x86_64/os/Packages/centos-gpg-keys-9.0-20.el9.noarch.rpm \
    && rpm -ivh https://mirror.stream.centos.org/9-stream/BaseOS/x86_64/os/Packages/centos-stream-repos-9.0-20.el9.noarch.rpm \
    && microdnf --setopt=install_weak_deps=0 --setopt=tsflags=nodocs install -y $packages \
    && microdnf clean all \
    && rpm -q $packages
# Copies the jar from the build container
USER default
COPY --from=builder --chown=default /home/default/app/build/libs/*-all.jar /deployments
Next we need to make sure that on our development machine we also have those dependencies installed, along with the libraries specified in our gradle file:
dependencies {
    implementation("org.seleniumhq.selenium:selenium-java:4.8.3")
    implementation("com.github.ajalt.clikt:clikt:3.5.2")
    testImplementation(kotlin("test"))
}
The code to execute a simple Selenium hello world would be the following:
val chromeOptions = ChromeOptions()
if (headless) {
    chromeOptions.addArguments(listOf("--headless", "--no-sandbox", "--disable-dev-shm-usage"))
}
val driver = ChromeDriver(chromeOptions)
driver.get("https://www.learn-html.org/en/Hello,_World!")
val element = driver.findElement(By.cssSelector("div#inner-text h1"))
println(element.text)
// quit() ends the whole session and stops chromedriver; close() only closes the current window
driver.quit()
This code is based on the information in this link. Basically, in order to run Selenium inside a container, we need to start Chrome with the following arguments:
- no-sandbox
- headless
- disable-dev-shm-usage
To distinguish between execution in a Docker container and on the development machine, we define a "headless" flag, which can be set through an environment variable. For this purpose we use the Clikt library. A complete code example (without imports) would be the following:
class WebCrawler : CliktCommand() {
    private val headless: Boolean
        by option("--headless", help = "This flag sets the headless mode on", envvar = "HEADLESS")
            .flag()

    override fun run() {
        val chromeOptions = ChromeOptions()
        if (headless) {
            chromeOptions.addArguments(listOf("--headless", "--no-sandbox", "--disable-dev-shm-usage"))
        }
        val driver = ChromeDriver(chromeOptions)
        driver.get("https://www.learn-html.org/en/Hello,_World!")
        val element = driver.findElement(By.cssSelector("div#inner-text h1"))
        println(element.text)
        // quit() ends the whole session and stops chromedriver; close() only closes the current window
        driver.quit()
    }
}

fun main(args: Array<String>) = WebCrawler().main(args)
Distributed under the MIT License. See LICENSE.txt for more information.
Nestor Acuña Blanco - nacuna85@gmail.com
Project Link: https://github.com/nestoracunablanco/webcrawler