This repository has been archived by the owner on Jun 28, 2024. It is now read-only.

Increasing ram usage and tool never finishes. #42

Open
vincentcox opened this issue Nov 19, 2020 · 14 comments
Comments

@vincentcox

Steps to reproduce:

Spawn a fresh Ubuntu 20.04 server (no GUI) VPS, install all the tools:

sudo apt update
sudo apt install nodejs -y 
sudo apt install npm -y
sudo apt install jq -y
sudo apt install chromium-browser -y
export PUPPETEER_EXECUTABLE_PATH="/usr/bin/chromium-browser" # Fix the "browser not installed" bug, "stolen" from the Dockerfile
npm install --global https://github.com/EU-EDPS/website-evidence-collector/tarball/master 
mkdir output_dir
website-evidence-collector --output output_dir/vincentcox.com --json --max 3 https://vincentcox.com --overwrite -- --no-sandbox # Fix the chrome sandbox issue, found somewhere in the issue tracker

It keeps running and it keeps eating resources:
[Screenshot: rip memory]

Note that I am using the latest version from GitHub, so something in that version might have broken it. But as explained in issue #41, I cannot access the official download link of the stable version.
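
To keep a runaway scan like the one described above from exhausting a small VPS while debugging, the command can be run under a wall-clock deadline. This is a minimal sketch, assuming `timeout` from GNU coreutils is available; the function name `run_with_deadline` and the 5-minute limit in the example are illustrative and not part of website-evidence-collector.

```shell
# Hypothetical wrapper: run any command under a wall-clock deadline.
# Usage: run_with_deadline SECONDS_OR_DURATION COMMAND [ARGS...]
run_with_deadline() {
  limit="$1"
  shift
  # timeout(1) sends SIGTERM once the deadline passes and follows up with
  # SIGKILL 10 seconds later if the command is still alive.
  timeout -k 10 "$limit" "$@"
}

# Example (commented out, needs the tool installed):
#   run_with_deadline 5m website-evidence-collector \
#     --output output_dir/vincentcox.com --json --max 1 \
#     https://vincentcox.com --overwrite -- --no-sandbox
#   echo "exit status: $?"   # 124: deadline hit; 137: SIGKILL was needed
```

A deadline does not fix the underlying problem, but it turns "never finishes" into a clear exit status that can be reported.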

@ghost

ghost commented Nov 19, 2020

Do you have the same behavior when using chromium bundled with the puppeteer node package?

@vincentcox
Author

How can I use the puppeteer node package? (Sorry, I have little experience with Node.js.)

I installed the latest stable version (mentioned in your reply in my previous issue), it's the same issue.
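
One way to try the Chromium that Puppeteer downloads at install time is to run the tool without `PUPPETEER_EXECUTABLE_PATH` set, since that variable is what points Puppeteer at the system browser. This is a sketch under assumptions: the Chromium download was not skipped during `npm install`, and the shared libraries the bundled browser needs are present on the server. The helper name `with_bundled_chromium` is hypothetical.

```shell
# Hypothetical helper: run a command with PUPPETEER_EXECUTABLE_PATH removed
# from its environment. The subshell keeps the calling shell's value intact.
with_bundled_chromium() {
  ( unset PUPPETEER_EXECUTABLE_PATH; "$@" )
}

# Example (commented out, needs the tool installed):
#   with_bundled_chromium website-evidence-collector \
#     --output output_dir/vincentcox.com --json --max 1 \
#     https://vincentcox.com --overwrite -- --no-sandbox
```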

@vincentcox
Author

I have the same in docker, which is using the puppeteer node package.

I removed the versions in the Dockerfile to get it working:

RUN apk add --no-cache \
      chromium \
      nss \
      freetype \
      freetype-dev \
      harfbuzz \
      ca-certificates \
      ttf-freefont \
      nodejs \
      yarn \
      bash procps drill coreutils libidn curl \
      parallel jq grep aha

@ghost

ghost commented Nov 19, 2020

It works for you now? Could you prepare a pull request then to help others?

Could you find out which versions you are using instead? I think I decided to pin the version numbers to have a more reproducible setup, which is important for auditing.

@vincentcox
Author

Sorry, my previous answer was not clear. I am trying things out, but they all break when I test them on my website (the same happens with a client's site, but I don't want to share that one, and my website is a good "test" example). What I meant was that I tried Docker (using a modified version to get it working), but got the same bug.

@ghost

ghost commented Nov 19, 2020

In your example you have included --max 3, so you also scan some other random pages of the same website. Can you please check whether you still get the same behaviour with only one page? I would then try to reproduce your problem.

@vincentcox
Author

It's unfortunately the same (when using the installed version in my initial post but with --max 1). I gave up on docker because I get this error:

error An unexpected error occurred: "EACCES: permission denied, scandir '/opt/website-evidence-collector/output/browser-profile'".

@vincentcox
Author

@rriemann-eu if you need more info to debug let me know!

@ghost

ghost commented Nov 24, 2020

So when I execute the following two commands, I do not get any error.

website-evidence-collector --output output_dir/vincentcox.com --json --max 1 https://vincentcox.com

website-evidence-collector --output output_dir/vincentcox.com2 --json --max 1 https://vincentcox.com -- --no-sandbox

I am using the latest version from master on openSUSE. From the inspection.yml:

script:
  host: mars.fritz.box
  version:
    npm: 0.4.0
    commit: v0.4.0-70-ga956e2d
  cmd_args: '--output output_dir/vincentcox.com --json --max 1 https://vincentcox.com'
  environment: {}
  node_version: v10.22.1
browser:
  name: Chromium
  version: HeadlessChrome/80.0.3987.0
  user_agent: >-
    Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko)
    Chrome/72.0.3617.0 Safari/537.36
  platform:
    name: Linux
    version: 5.8.14-1-default
  extra_headers: {}
  preset_cookies: {}
start_time: 2020-11-24T11:30:47.957Z
end_time: 2020-11-24T11:31:00.650Z

Does your problem occur with all websites?

@vincentcox
Author

Hmmm, might be something with my installation then. I'll go with Docker to avoid further mistakes and debugging time on your side. However, the Dockerfile in the repo doesn't build anymore.

When I try to build it, I get this error:

root@client-testvm:~/test/website-evidence-collector#  docker build -t website-evidence-collector .
Sending build context to Docker daemon  3.995MB
Step 1/16 : FROM alpine:edge
 ---> 003bcf045729
Step 2/16 : LABEL maintainer="Robert Riemann <robert.riemann@edps.europa.eu>"
 ---> Using cache
 ---> f5d20c7a4860
Step 3/16 : LABEL org.label-schema.description="Website Evidence Collector running in a tiny Alpine Docker container"       org.label-schema.name="website-evidence-collector"       org.label-schema.usage="https://github.com/EU-EDPS/website-evidence-collector/blob/master/README.md"       org.label-schema.vcs-url="https://github.com/EU-EDPS/website-evidence-collector"       org.label-schema.vendor="European Data Protection Supervisor (EDPS)"       org.label-schema.license="EUPL-1.2"
 ---> Using cache
 ---> 16ece18d66c6
Step 4/16 : RUN apk add --no-cache       chromium~=80.0.3987       nss       freetype       freetype-dev       harfbuzz       ca-certificates       ttf-freefont       nodejs       yarn~=1.22.4       bash procps drill coreutils libidn curl       parallel jq grep aha
 ---> Running in 5ca2fe0d3cde
fetch https://dl-cdn.alpinelinux.org/alpine/edge/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/edge/community/x86_64/APKINDEX.tar.gz
ERROR: unsatisfiable constraints:
  chromium-86.0.4240.111-r0:
    breaks: world[chromium~80.0.3987]
  yarn-1.22.10-r0:
    breaks: world[yarn~1.22.4]
The command '/bin/sh -c apk add --no-cache       chromium~=80.0.3987       nss       freetype       freetype-dev       harfbuzz       ca-certificates       ttf-freefont       nodejs       yarn~=1.22.4       bash procps drill coreutils libidn curl       parallel jq grep aha' returned a non-zero code: 2

I think this error is caused by the behaviour described in https://superuser.com/a/1486407/1039133:

Unfortunately, Alpine-Linux Package Management drops older packages when there are newer versions available. This makes it hard to use Alpine Linux with docker since you want a reproducible image with exact versions.
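
To see at a glance which pins in the Dockerfile have gone stale, the `unsatisfiable constraints` section of the build log above can be condensed into one line per package. This is a minimal sketch that assumes apk's error output keeps the shape shown in the log (a package line followed by a `breaks: world[...]` line); the function name `stale_pins` is hypothetical.

```shell
# Hypothetical helper: read apk error output on stdin and print
# "<pinned constraint> -> <version the repository now offers>".
stale_pins() {
  awk '
    # An indented line ending in ":" names the package apk found, e.g.
    # "  chromium-86.0.4240.111-r0:"
    /^[ \t]+[^ \t]+:$/ { pkg = $1; sub(/:$/, "", pkg) }
    # The following "breaks: world[...]" line carries the stale pin.
    /breaks: world\[/ {
      pin = $0
      sub(/.*world\[/, "", pin)
      sub(/\].*/, "", pin)
      print pin " -> " pkg
    }
  '
}

# Example (commented out, needs Docker and the repository checkout):
#   docker build -t website-evidence-collector . 2>&1 | stale_pins
```

For the log above this reports the `chromium` and `yarn` pins, which are the two constraints that have to be relaxed or updated before the image builds again.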

@ghost

ghost commented Nov 26, 2020

OK, so I will close this one until we know how to reproduce your problem on other systems. I will open a new issue on the docker problem, which deserves a solution.

@ghost ghost closed this as completed Nov 26, 2020
@vincentcox
Author

Good idea, feel free to tag me in this!

@ghost

ghost commented Nov 26, 2020

I can confirm this on docker:

It takes a lot of time and keeps using more and more RAM.

docker run --rm -it --cap-add=SYS_ADMIN -v $(pwd)/output:/output website-evidence-collector https://vincentcox.com --overwrite

top:

top - 14:06:18 up 24 days,  1:59,  2 users,  load average: 2.61, 1.76, 0.79
Tasks: 121 total,   1 running, 119 sleeping,   0 stopped,   1 zombie
%Cpu(s): 60.1 us, 30.6 sy,  0.0 ni,  7.7 id,  0.0 wa,  0.0 hi,  0.0 si,  1.7 st
MiB Mem :   1994.0 total,    109.3 free,   1602.3 used,    282.4 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.    169.5 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                        
 4006 ubuntu    20   0  445648  43844  28176 S  95.3   2.1   4:29.16 chrome                                                                                         
 4052 ubuntu    20   0 5504516 892576  50024 S  78.7  43.7   4:05.24 chrome                                                                                         
 4047 ubuntu    20   0  358264  52200  26516 S   8.3   2.6   0:27.17 chrome                                                                                         
  316 root      20   0   14804   4364   1408 S   0.3   0.2 114:20.20 docker-gen                                                                                     
  411 root      20   0   10988   3396   2880 R   0.3   0.2   0:00.37 top                                                                                            
    1 root      20   0  169324  10212   5544 S   0.0   0.5   1:44.17 systemd     

As I do not have this problem on my local computer without docker, I can imagine that it somehow depends on the Chromium version that is used. Maybe newer Chromium versions behave differently than the version HeadlessChrome/80.0.3987.0 I use on my local system.
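
To check the Chromium-version hypothesis across setups, the browser version recorded in each run's `inspection.yml` can be pulled out and compared. This is a rough sketch that assumes the file layout shown earlier in this thread (a top-level `browser:` key with an indented `version:` entry); the function name `browser_version` and the example paths are hypothetical.

```shell
# Hypothetical helper: print the `version:` value nested directly under
# the top-level `browser:` key of an inspection.yml file.
browser_version() {
  awk '
    /^browser:/ { in_browser = 1; next }  # entering the browser section
    /^[^ ]/     { in_browser = 0 }        # any other top-level key ends it
    in_browser && /^  version:/ { print $2; exit }
  ' "$1"
}

# Example (commented out, paths depend on where each run wrote its output):
#   browser_version output_dir/vincentcox.com/inspection.yml
#   browser_version output/inspection.yml
```

Matching on exactly two spaces of indentation skips the `version:` entry under `platform:`, which holds the kernel version rather than the browser version.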

@ghost ghost reopened this Nov 26, 2020
@vincentcox
Author

Yeah, the thing is: if it happened only on my machine and not in Docker, it would be something on my side. But even Docker gives me the same issue.

With chromium 77.0.3865 (as used in this working dockerfile), it works for me.

Maybe this issue is not even in the scope of this project, but a Chromium issue itself. For me it's okay if you guys close it, but keep in mind that other people might face the same issue (in Docker or with a local installation). Maybe my website is quite heavy to parse, but it's a standard WordPress website, so I think chances are high that other people will run into the same situation.
