Search for better cost-efficient cloud provider #62

Closed
rivernews opened this issue Mar 31, 2020 · 13 comments · Fixed by #68
rivernews commented Mar 31, 2020

This ticket also deals with the vision of this project. If we want to scale while paying a reasonable cloud bill, we need a more flexible way to run a Kubernetes cluster, and using managed Kubernetes as a service definitely limits us on that path.

Ideally something like AWS Fargate would work best - if we can lower the cost while our cluster is idle, then we can afford more concurrency when scraper jobs fire up.

AWS route: more RAM w/ cost efficiency

  • Insights into bare metal k8, plus ingress w/o load balancer.
    • Use Kubespray to install on an AWS EC2 instance
    • Kops is another option.
    • Interesting idea: use k8 autoscaling to reduce cost. We could keep a cheap node (1 vCPU, 2G RAM) to host the supervisor server (SLK), and run the scraper jobs and selenium server on a larger node (4 vCPU, 8G RAM). When no scraper job is running, we just shut down the larger node. We should see how to use labels in k8 to specify which node a scraper job / deployment should be scheduled on.
    • Read this article from k8 external-dns to help setup ingress w/o load balancer.

Several requirements for provisioning on AWS

  • No external load balancer, that is, no use of any AWS ELB or ALB
  • Needs to be done in terraform
  • Needs to be installed on EC2, not a managed Kubernetes service.

Elastic scale route: save cost

Approach 1: AWS Fargate, or any other container service

  • Fargate is fully container-based, so it's not suitable for long-living admin servers like SLK. But it is suitable for running scraper jobs, or even the selenium server.
  • A scraper job needs access to the selenium server and the public internet. Other than that, it doesn't need much networking, so it may not even need an ingress.

Approach 2: K8 auto-scaling, K8 API

  • This should be faster because we just use the existing terraform and digitalocean setup: create another node (a droplet, in DigitalOcean terms) and assign selenium & the scraper jobs to it. Then we should be able to programmatically kill the node or force it to scale down.

Approach 3: manual slack command in K8s

This approach is supposed to be the most feasible and the fastest to start; no need to look for another platform. The idea is to run SLK on a low-cost node - SLK has to be up all the time in order to receive manual scale-up / scale-down commands. In other words, this approach uses SLK as a platform to manually scale up and down. This should save cost and avoid keeping an expensive node running without any scraper jobs present.

  • Launch a low-profile node in K8 and use it for all the persistent stuff - redis, SLK, and other essential k8s components like the ingress controller.
  • Input a command in slack like up; SLK then triggers a travis build, which runs a terraform script to provision the resources for the selenium server on a dedicated node (see the Travis trigger sketch after this list).
    • SLK should report back in the slack channel once travis finishes running and the dedicated node and selenium are ready.
    • Let SLK assign k8 jobs to the new node.
    • In slack we are then able to start jobs with rrr or ccc.
  • Input a command in slack like down, which triggers a travis build that runs the same terraform script with destroy to tear down the node, along with the selenium server and scraper jobs.
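
A minimal sketch of the Travis trigger step in TypeScript, assuming the Travis API v3 request endpoint; the token, repo slug, and the SCALE_DIRECTION env var are placeholders, and the triggered build is assumed to run terraform apply or destroy:

```typescript
// Hedged sketch: SLK handles a slack "up"/"down" command by asking Travis to run a build.
import fetch from 'node-fetch';

const TRAVIS_TOKEN = process.env.TRAVIS_TOKEN!;                 // placeholder
const REPO_SLUG = encodeURIComponent('rivernews/provision-repo'); // hypothetical repo slug

async function triggerScaleBuild(direction: 'up' | 'down'): Promise<void> {
  const res = await fetch(`https://api.travis-ci.com/repo/${REPO_SLUG}/requests`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Travis-API-Version': '3',
      Authorization: `token ${TRAVIS_TOKEN}`,
    },
    body: JSON.stringify({
      request: {
        branch: 'master',
        // The build script is assumed to read SCALE_DIRECTION and run terraform accordingly.
        config: { env: { global: [`SCALE_DIRECTION=${direction}`] } },
      },
    }),
  });
  if (!res.ok) {
    throw new Error(`Travis build request failed: ${res.status} ${await res.text()}`);
  }
}
```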

rivernews commented Apr 3, 2020

Plan for elastic scaling

  • Digitalocean: only hosts SLK + redis. Use a smaller machine, perhaps 1 vCPU, 2G RAM
  • Run the selenium server and scraper jobs via some API.
  • Network:
    • SLK -> API -> container platform
    • SLK <-> Redis <-> scraper job <-> selenium server (high RAM)

If SLK can programmatically spin up containers for both the selenium server and the scraper job, that would be awesome. See the GitHub AWS SDK for JavaScript and the AWS CDK for EC2 npm page. Spec for each container (a hedged sketch follows the list below):

  • Selenium server: high RAM, needs to be accessible by the scraper job, exposes port 4444
  • Scraper job: env vars, needs to access the selenium server on port 4444
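
A hedged sketch of the programmatic spin-up idea with the AWS SDK for JavaScript (v3, @aws-sdk/client-ec2); the AMI, instance type, subnet, security group, and user-data script are all placeholders, not values from this project:

```typescript
import { EC2Client, RunInstancesCommand, TerminateInstancesCommand } from '@aws-sdk/client-ec2';

const ec2 = new EC2Client({ region: 'us-west-2' });

// Spin up one EC2 instance that runs the selenium server in a docker container via user data.
async function launchSeleniumInstance(): Promise<string | undefined> {
  const userData = [
    '#!/bin/bash',
    'yum install -y docker && service docker start',
    'docker run -d -p 4444:4444 selenium/standalone-chrome', // expose 4444 for scraper jobs
  ].join('\n');

  const res = await ec2.send(new RunInstancesCommand({
    ImageId: 'ami-xxxxxxxx',            // placeholder AMI
    InstanceType: 'r5.large',           // high-RAM instance for selenium
    MinCount: 1,
    MaxCount: 1,
    SubnetId: 'subnet-xxxxxxxx',        // public subnet with an internet gateway route
    SecurityGroupIds: ['sg-xxxxxxxx'],  // must allow inbound 4444 from the scraper
    UserData: Buffer.from(userData).toString('base64'),
  }));
  return res.Instances?.[0]?.InstanceId;
}

// Tear the instance down when scraper jobs finish, to stop paying for it.
async function terminateSeleniumInstance(instanceId: string): Promise<void> {
  await ec2.send(new TerminateInstancesCommand({ InstanceIds: [instanceId] }));
}
```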

One thing we want to ask: do a VPC and subnet cost anything? If not, we can use tf to create them beforehand; otherwise, we may want to include them in SLK's dynamic resource-creation logic.

  • A NAT gateway is for EC2 to access the public internet, which is essential - our selenium server needs it to connect to gd, and our scraper also needs it to communicate w/ SLK & redis. A NAT gateway is $0.045/h in Oregon. Monthly ~$32, yikes!
  • A Gateway VPC endpoint allows EC2 to connect to AWS S3 and DynamoDB. No charge for gateway VPC endpoints. (But we need to access gd! So we cannot rely on this)
  • Interface VPC endpoint - covers more AWS services that "support interface VPC endpoints"
  • Internet gateway - seems like the best fit for us. No charge for this. The difference between a VPC endpoint and this looks like private vs. public subnet. We can just use a public subnet and a public IP. This should be fine for us. But do the public subnet / public IP / elastic IP cost anything? The IP for the instance will be globally unique.
    • Looks like a public IP does not cost anything, but an elastic IP may because it's persistent - a public IP is not persistent. But a DNS name is also provided by default when you create the vpc / ec2, so that's a fixed value you can use outside the VPC.

Plan for more RAM

  • Create VPC, subnets, gateways, etc on aws, via Terraform
  • Create EC2 instance
    • Test if we can http to reach it
  • Install K8 on it, via Terraform?
  • Install ingress on it, via Terraform?
    • Test if we can http to reach K8 ingress
  • Install app on K8, via Terraform?


rivernews commented Apr 23, 2020

Elastic Approach

Looks like the elastic approach is probably the most cost-efficient one. The idea is basically:

  • A master node where our SLK system runs
    • This can be K8 on digitalocean; after all, k8 is still a good way to run multiple long-running microservices like redis, and other apps in the future.
  • A platform where we can programmatically run containers, with network connectivity to communicate back w/ SLK. Pick one of the following:
    • Same k8 cluster on digitalocean, but a new node. Will need the digitalocean or k8 API to scale up / down the following: the node & the deployments for selenium and the job (for the scraper, more like a simple container).
      • Better to use the DigitalOcean API because this concerns VM size, which only DigitalOcean can control, and it determines pricing.
    • EC2 on AWS. Will need the AWS SDK to scale up / down:
      • EC2 + Internet Gateway setup
      • Install and run the selenium server on EC2, possibly in docker container fashion
      • Run containers on EC2 ... maybe Fargate is a better idea, but Fargate is also pricey. We need more research into this, to see if there are any drawbacks.
      • Delete EC2 and related resources
    • Azure has so-called Container Instances, worth taking a look
    • Any container service platform that meets our API control & network requirements


rivernews commented Apr 25, 2020

K8S Elastic Approach

Looks like with the nodejs DO client, the node pool and its nodes are quite troublesome. A node inside gets stuck at "provisioning". Why is that?

  • Maybe DO was busy at that time?
  • Perhaps we want to ask: how long does it take DO to 1) create a node pool 2) scale up a node in a node pool?

Another method is to use tf to create a separate node pool with a node count of 0. Then SLK just updates "count" when it needs to scale up, and then polls to check node readiness (a hedged polling sketch follows).
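
A hedged sketch of that polling step with the @kubernetes/client-node package; the DOKS node-pool label and the older `res.body` response shape are assumptions:

```typescript
import * as k8s from '@kubernetes/client-node';

const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const coreApi = kc.makeApiClient(k8s.CoreV1Api);

// Poll until at least one node of the given pool reports the Ready condition.
async function waitForReadyNode(poolName: string, timeoutMs = 10 * 60 * 1000): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const res = await coreApi.listNode();
    const poolNodes = res.body.items.filter(
      (n) => n.metadata?.labels?.['doks.digitalocean.com/node-pool'] === poolName,
    );
    const ready = poolNodes.some((n) =>
      (n.status?.conditions ?? []).some((c) => c.type === 'Ready' && c.status === 'True'),
    );
    if (ready) return;
    await new Promise((r) => setTimeout(r, 10_000)); // check again in 10s
  }
  throw new Error(`timed out waiting for a ready node in pool ${poolName}`);
}
```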


rivernews commented Apr 26, 2020

Practical approach on k8s elastic

  • Prepare a node pool with count 1
  • Let the selenium server run on an additional node pool - it's the largest RAM consumer; it takes almost 4G when running 4 scraper jobs. SLK seems to be a relatively small consumer.
  • Also create the selenium deployment along with node pool creation. This needs to be an async job.
    • We currently use TF to provision the selenium server. Now we want to run it programmatically.
    • See if we can avoid unnecessary resources. Are a deployment / service required? If not, try to spin up a selenium server with just a job.
      • Service account - do we need this? It seems to be required by the deployment, but every namespace has a default service account in place, so I guess this is not needed, at least in this case.
      • Namespace - might be needed. You can let selenium share the same ns as the scraper jobs, which makes cleanup easier. But of course, you can dedicate a namespace to selenium - since you may want to keep the scraper in the SLK ns to access logs from scraper jobs.
      • Deployment is needed - to spin up the container
      • Service is needed - to let jobs / SLK access the container
  • The rest is just using nodeSelector when provisioning the K8s job (see the sketch below).
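
A hedged sketch of provisioning a scraper job with a nodeSelector via @kubernetes/client-node; the namespace, image, env var, and the DOKS pool label are placeholders, not this project's actual values:

```typescript
import * as k8s from '@kubernetes/client-node';

const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const batchApi = kc.makeApiClient(k8s.BatchV1Api);

async function createScraperJob(orgName: string): Promise<void> {
  const namespace = 'selenium-service'; // placeholder namespace
  const job: k8s.V1Job = {
    apiVersion: 'batch/v1',
    kind: 'Job',
    metadata: { name: `scraper-job-${Date.now()}`, namespace },
    spec: {
      backoffLimit: 0,
      template: {
        spec: {
          restartPolicy: 'Never',
          // Pin the job onto the dedicated scraper worker pool (DOKS labels nodes by pool name).
          nodeSelector: { 'doks.digitalocean.com/node-pool': 'scraper-worker-pool' },
          containers: [
            {
              name: 'scraper',
              image: 'example/java-scraper:latest',            // placeholder image
              env: [{ name: 'TARGET_COMPANY', value: orgName }], // placeholder env var
              resources: { requests: { memory: '200Mi' }, limits: { memory: '800Mi' } },
            },
          ],
        },
      },
    },
  };
  await batchApi.createNamespacedJob(namespace, job);
}
```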


rivernews commented Apr 29, 2020

Tweaking for appropriate node size

Configuration: 2 vCPU, 4G RAM for selenium; SLK & the rest of the k8s infra on 1 vCPU, 2G RAM. Running 1 concurrent scraper job.

Primary node

Scraper job CPU is negligible.
(screenshot)

A scraper job consumes around 177 MB. SLK uses 29 MB. We're not testing SLK here because we're using the local dev SLK.
(screenshot)

Worker node

Only running selenium.
1 scraper uses around 0.8 CPU.
(screenshot)

1 scraper uses up to 900MB.
(screenshot)


Services on the primary node like grafana are getting a bit slow to respond, so perhaps create scraper jobs on the worker node as well.


Problems

  • We observe our scraper worker nodes being used by irrelevant deployments, e.g., SLK.
    While we limit scraper jobs and the selenium deployment to run only on scraper worker nodes, we did not stop other deployments, including SLK, from deploying onto those scraper worker nodes. Is there a way to exclude other workloads from a node pool?
    • One approach - assign SLK manually to the default k8s node pool (a hedged sketch follows this list)
    • Anything else we'd like to limit to the default node pool? Probably redis too.
      • Let's extend our TF microservice module to support assigning a node selector in the deployment.
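
A hedged sketch of the first approach - patching a nodeSelector into the SLK deployment so it stays on the default pool (names are placeholders; the Content-Type header is the usual workaround for strategic-merge patches with @kubernetes/client-node). The stricter way to fully exclude other workloads from a pool would be to taint the worker pool's nodes and give tolerations only to the selenium / scraper pods.

```typescript
import * as k8s from '@kubernetes/client-node';

const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const appsApi = kc.makeApiClient(k8s.AppsV1Api);

// Keep SLK (and, by the same pattern, redis) on the default node pool so it never lands on
// the scraper worker pool.
async function pinDeploymentToDefaultPool(name: string, namespace: string): Promise<void> {
  const patch = {
    spec: {
      template: {
        spec: {
          nodeSelector: { 'doks.digitalocean.com/node-pool': 'default-pool' }, // placeholder pool name
        },
      },
    },
  };
  await appsApi.patchNamespacedDeployment(
    name, namespace, patch,
    undefined, undefined, undefined, undefined,
    { headers: { 'Content-Type': 'application/strategic-merge-patch+json' } },
  );
}
```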

@rivernews rivernews mentioned this issue Apr 29, 2020

rivernews commented Apr 30, 2020

Tuning Performance & Throughput

Several things we want to try:

  • Is travis working now? Let's try to provision some workload on travis to maximize concurrency
    • Resolve the redis timeout locally
    • Test travis locally - does it work? Then decide the next step; remember travis is optional
    • Yes, travis can do it. If we just add it to the existing prod setup, it will not add selenium workload & will not add scraper jobs, so win-win. But it will add SLK sandbox workload, which should be fine and handled by the primary node (2CPU/4G).
  • Wait for or terminate the current workload
    • You can try just killing the selenium namespace
    • Deploy SLK so that it destroys / terminates all current workload
    • Then kill the worker node
  • Let's try vertical scaling for selenium
    • Give it a large enough droplet as the worker node. Needs balanced CPU & memory: 4G / >15% of 4vCPU for 4 scraper jobs.
    • Provision both java scraper jobs and selenium on the worker node
    • Test for higher concurrency - at least accommodating 8 scraper jobs. If the node size can't do it, change to a larger size. -- After including travis, we can allocate up to 13 - 7 k8s jobs at a time.
  • Last: research better ways to scale selenium standalone / grid / Zalenium / ... on kubernetes

Benchmarking

All k8s scraper jobs & selenium running on the worker node.

Primary Node:

SLK:
Initial: 60MB
4 sandbox processes: 175 MB (+115MB, 29MB/process)
10 sandbox processes: spike: 740MB (+680MB, 68MB/process); steady: 600MB (+540MB, 54MB/process)

Node memory usage: 2.1G/4G, around 50%.
Estimated remaining capacity: safely +1G workload == at least 10 more sandboxes == 20 total sandboxes
(screenshot)

Worker Node:
Selenium, 4 sessions: 1G-3G (250-750MB/session), average 2.3G (575MB/session).
Java scraper container, 4 jobs: 180-200MB per job, total 720MB-800MB; actual (incl. overlapping time): 8 k8s jobs concurrently, totaling 1.7G.

Node memory usage: spike: 5G/8G, steady 4.5G/8G.
Estimated remaining capacity: safely +2G workload == 2-3 more k8s jobs == total 6-7 k8s jobs.
(screenshot)


rivernews commented May 3, 2020

Problem

  • Travis: connecting to redis keeps timing out (see the redis client sketch after this list).
  • K8s is not very responsive with multiple nodes and jobs > 2.
    • Is it caused by high CPU? ... but the primary node's CPU utilization is low. Redis resides on the primary node.
    • Does it have sth to do with us assigning redis to the primary node? Or should we let k8 decide where redis should reside?
    • Or does it have sth to do with our S3 dispatching too many jobs (small job split size) + no dispatch delay, so that it occupies redis command transactions too much and redis cannot respond?
    • Is it because our droplet's network firewall does not allow 6378? ... well, our primary node, where redis resides, does have firewall 6378 allowed.
  • Warning: too many k8s jobs may cause glassdoor blocking - since the k8 node IP is fixed. We'll get "sign in link is not found" in this case.
    • May also be throttling - when there is no s3 dispatch delay or it's too short, it results in burst requests against glassdoor.
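
A hedged sketch of hardening the redis client against these flaky connections, assuming ioredis is (or could be) the client in use; the env vars and timeout values are placeholders:

```typescript
import Redis from 'ioredis';

// Configure explicit timeouts and a capped backoff so a transient network hiccup
// (e.g. from a Travis worker) reconnects instead of hanging or failing the whole job.
const redis = new Redis({
  host: process.env.REDIS_HOST,                            // placeholder
  port: Number(process.env.REDIS_PORT ?? 6379),            // placeholder
  connectTimeout: 10_000,                                  // give up on the initial connect after 10s
  retryStrategy: (times) => Math.min(times * 500, 5_000),  // back off, max 5s between attempts
  maxRetriesPerRequest: 3,                                 // don't let one command retry forever
});

redis.on('error', (err) => console.warn(`redis error: ${err.message}`));
```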

Tuning Profile

  • Safe approach:
    • No travis. 4 k8s jobs. Node size 4CPU-8G RAM. Misc: everything on the worker node: scraper + selenium. Split size: 1000, total job count: 399. S3 dispatch speed: 300ms/job.
      • Natural CPU usage: > 3, natural memory usage: 4.5G (after 7 min of starting the s3 job)
    • If the previous run is good, try adding ONE travis job. See if there's no timeout error after we set up the firewall.
    • Let's allow more travis jobs, and wait 2-3 cycles to make sure there aren't any redis timeout issues.
  • Using the dedicated node pools, let's see if we can push more concurrency out of them. If not, we may want to fall back to standard shared droplets.
    • Looks like 1 CPU for the primary node may be too small, since when s3 dispatches, the SLK sandbox leaks some memory.

    • We'll need to wait a few days for the Let's Encrypt throttling to end.

    • Don't let External DNS delete the domain name ... will this help? This may not help, since k8s credentials may need to be retrieved again every time k8s provisions.

    • 🛑 We decided to move on until new info surfaces. More issues emerged. The Travis job cannot locate the review panel at first, then the redis connection gets unstable but manages to reconnect. Then things seem to stop there, with no new progress reported. SLK then waits until the 10-minute timeout, then cleans up, and the travis job gets canceled. This happens again and again, quite reliably reproduced. It only appears in Travis jobs. We lack log levels, so info is also limited.
      (screenshot)

      • Locate-review-panel delay ... did we set this too high? Note that the total delay is delay + cannot-locate timeout. -> We only have sleep 10 + 20 + 30 w/ each timeout 25, which is at most up to 3 minutes.
        • -> Well, remember this happens in the locate() block, not the parse() block, so there isn't much progress publishing there. -> We're trying a few things like adding more chrome options and doing additional redis progress publishing, but no guarantees. Let's do a benchmark again (4 k8s / 4 travis) and see what the result is. It definitely has something to do with concurrency, but we don't know why higher concurrency causes the selenium timeout and redis reconnection.
      • Does redis have something to do with it? The flow is exactly the same and has happened 4 times so far. Should we periodically use redis to keep the connection alive, perhaps adding publish-progress in the retry block as well?
      • Add a timestamp to each log
  • 🛑 We decided to move on until new info surfaces. After we solve the Travis issue - and see what the root cause is - we may look at how to scale k8s jobs
    • Limitation of vertical scaling: glassdoor will throttle the same IP - which means a node can only run up to 4-5 jobs. (7-8 jobs will cause glassdoor to block)
    • We need a way to scale horizontally.
      • Option 1: A selenium-only node pool + a scraper-job-only node pool. Only selenium is memory intensive - actually, both scraper jobs and selenium can be memory intensive. How do we scale nodes for these?
      • Option 2: Try to just use one node pool w/ autoscaling. Perhaps via the selenium deployment's replicas.
        • But we need to see how replicas make sense for selenium, since sessions are persistent.
        • Only selenium is CPU intensive. So if we use autoscaling, probably just one node utilizes CPU while the other nodes' CPUs sit idle. Memory utilization should even out and be fine though.
      • A better approach is probably to look at scaling selenium on kubernetes, and it looks like helm also has a selenium release available.
      • Another thing is to limit CPU for selenium, so that it won't overwhelm the API server and leave the k8s cluster unable to be monitored.
        • Deploy hub + chrome node
        • Set chrome node replicas = 2
        • Set scraper worker node pool autoscale, and count/max = 2
        • Set CPU limits for chrome nodes
        • Some issue specifying the port on the liveness / readiness probe. We may switch to the GoDaddy K8 client, dig into the issue Possible typing mistake in class V1HTTPGetAction kubernetes-client/javascript#444, or just disable the probes.
        • Now test replicas=2, nodepool/node=2, and see if k8 spreads the chrome nodes evenly across k8 nodes (see the anti-affinity sketch after this list);
          • then run scraper jobs and see if the hub spreads session workload evenly across chrome nodes; eventually see if we can run scraper jobs on different nodes so that we own different IPs to access glassdoor.
  • A couple of new things to wrap stuff up
    • If S3 finishes successfully, destroy selenium + the node as well
    • Let k8s be used first; set Travis as a secondary source unless we still want to test the timeout issue for travis.
  • Frontend: add mistake prevention - when the selenium deployment is still there, don't allow deleting the node.
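
A hedged sketch of one way to encourage the chrome-node replicas to spread across worker nodes - preferred pod anti-affinity on the chrome-node deployment's pod template. This is plain Kubernetes scheduling rather than anything specific to this repo, and the labels are placeholders:

```typescript
import * as k8s from '@kubernetes/client-node';

// Prefer not to schedule two chrome-node pods onto the same worker node, so each node gets
// its own selenium capacity (and its own outbound IP toward glassdoor).
const chromeNodeAntiAffinity: k8s.V1Affinity = {
  podAntiAffinity: {
    preferredDuringSchedulingIgnoredDuringExecution: [
      {
        weight: 100,
        podAffinityTerm: {
          labelSelector: { matchLabels: { app: 'selenium-chrome-node' } }, // placeholder label
          topologyKey: 'kubernetes.io/hostname',
        },
      },
    ],
  },
};

// Attach it to the chrome-node deployment when provisioning it programmatically:
//   deployment.spec.template.spec.affinity = chromeNodeAntiAffinity;
```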


rivernews commented May 10, 2020

Final Milestone

While there's a lot of room for improvement, we could set a final milestone here just to achieve two things:

  • Scalable selenium that can scale up substantially
  • Auto scale down to save cost

Steps

  • Test Hub-Node paradigm in production, also for benchmarking workload

  • Configurable scale - currently we have 4 scrapers per k8 node, a total of 8. We want to be able to set this number arbitrarily.

    • Set a fixed amount of scalability on each node; only scale the number of nodes. Each node runs 4 scrapers. We hope that 2vCPU/4G would be enough, but we need to see how much selenium consumes in the long term. Further test on the hub-node arch; we can only test conservative mode, which is 2 nodes / 4 jobs = 8 total. See if the timeout still occurs (cannot-reach-hub error <-- possibly a k8s network issue; in fact the hub is mostly available; chrome node initial registration did take some time, but it's only needed once)
      • We keep getting a remote driver creation error - unreachable browser. When we use curl to access the hub from a different namespace, 1) it can resolve the IP (there was one time initially it couldn't, but after that it always resolves) 2) sometimes we can get a response from /wd/hub/status, but a lot of times we get port 4444: Connection refused, which must be the main cause of our unreachable error - so this is not a big issue with Cilium (perhaps just the initial could-not-resolve-host error), but more of something on the hub side. If you use port forwarding, you don't see this issue.
    • If we still can't get the k8s network issue right, or can't figure out the root cause of the unreachable remote driver, then we might want to try the pod architecture - pack 1 scraper + 1 standalone selenium into a job.
      • Test conservative mode -- see resource consumption! ... 4vCPU-8G RAM * 5 / 4 scrapers per node / total 20 -> 5xx succeeded / 5 failed; failures: review panel locating issue
      • Test aggressive mode -- 2CPU * 8 / 3 scrapers each / total 24 / splitSize = 500, smaller to avoid losses on error ->
      • If both conservative and aggressive modes pass, we can consider adopting the pod architecture as the final solution.
  • We may want to lock down to a working selenium hub / chrome node version.

  • Wrap up, merge PR, close this ticket

Side notes

  • We got a lot of unreachable browser exceptions when the scraper tries to create a remote driver session against the hub. This issue comment provides a retry-again approach to tackle the problem - simple, a bit brute-force, but very effective, especially since we're pretty sure the hub capacity is there and ready (sketched below).
  • The issue comment later also pointed out that the author changed the kubernetes network plugin and it improved the network condition in K8.
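
The scraper itself is Java, but the retry-again idea is simple enough to sketch in TypeScript with the selenium-webdriver package; the hub URL and attempt counts are placeholders:

```typescript
import { Builder, WebDriver } from 'selenium-webdriver';

// Keep retrying remote session creation against the hub; an "unreachable browser" style
// failure is treated as transient as long as we know the hub capacity is there.
async function createRemoteDriverWithRetry(
  hubUrl = 'http://selenium-hub.selenium-service:4444/wd/hub', // placeholder in-cluster URL
  maxAttempts = 10,
): Promise<WebDriver> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await new Builder().forBrowser('chrome').usingServer(hubUrl).build();
    } catch (err) {
      lastError = err;
      console.warn(`remote driver creation failed, attempt ${attempt}/${maxAttempts}`);
      await new Promise((r) => setTimeout(r, 5_000)); // brief pause before the next try
    }
  }
  throw lastError;
}
```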


rivernews commented May 15, 2020

Standalone Approach

Things were going well until we faced the challenges here:
(screenshot)

As you can see, there's a problem with the k8s node assignment algorithm. Before the whole cluster went nuts, you can see that besides the 2+2 job-switching overlap - which is dangerous too, and we can lower the memory request for it - there's an additional job assigned to this node, making it run 5 jobs concurrently at that moment.

Looks like we can't trust k8s's node assignment. We can lower the memory request so that a job doesn't claim so much memory at the beginning and only claims more when it needs it - we can do that. But we don't have control over node assignment.

Unless k8s has some additional parameter to configure this, we will need to implement this node distribution algorithm on our own.

Two ways:

  • K8's built-in Pod Topology Spread Constraints. -> But we can't use it, because in v1.16 it's disabled by default behind a feature gate, and DO doesn't allow you to change feature gate settings. Their DO doc said something but didn't give a complete list of their feature gates; I guess they're just using the k8s upstream defaults.
  • Use a redis semaphore to realize this
    • At the point a new job requests the semaphore, the old job should already have released it, so we may not have to worry too much about distinguishing a semaphore-acquiring timeout vs. not available, but we can still try to look at it.
    • We may want to look into the code base and see how we can contribute. We need a reset() method.


rivernews commented May 17, 2020

Challenges implementing anti-affinity by redis semaphore

The semaphore objects are possibly created across different node processes. We got two issues:

  • release() doesn't work - it says it has no identifier.
    • Error: semaphore k8NodeResourceLock-0b63934f-b75c-4110-9489-ef220d6050cf has no identifier
    • This then causes the next job to error with nodeId is empty, did not acquire semaphore successfully.
  • Refresh
    • The refresh interval inside the semaphore object does not stop even after the semaphore key is deleted in redis, which causes: Error: Lost semaphore for key k8NodeResourceLock-8d44fefc-fcec-4822-a145-5a55f1907a14

Some ideas on why previous runs succeeded:

  • As long as acquire() and release() are executed in the same (child) process, it's ok.
  • Probably only one acquire() session should be executed at a time per (child) process.
  • If this is true, we can fix this while preserving all the refresh / identifier state functionality (sketched after this list) by:
    • Each time we acquire(), also storing the local semaphore object. It will contain the (single) identifier.
    • Expecting the corresponding release() to be executed in the same (child) process. Then it will release with the correct identifier!
    • This will work as long as we 1) acquire and release a specific resource (i.e. against an identifier) within the same (child) process and 2) only acquire once at a time, that is, never do another acquire before the previous one is released.
    • The refresh issue, which causes "Error: lost semaphore for key...", should now be handled by each release().
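
A hedged sketch of that fix, assuming the redis-semaphore npm package on top of ioredis; the key prefix mirrors the errors above, and the slot count and timeout are placeholders:

```typescript
import Redis from 'ioredis';
import { Semaphore } from 'redis-semaphore';

const redis = new Redis();
// Keep the Semaphore instance that performed acquire() so release() runs against the same
// identifier, within the same (child) process.
const heldSemaphores = new Map<string, Semaphore>();

async function acquireNodeSlot(nodeId: string, slotsPerNode = 4): Promise<void> {
  const semaphore = new Semaphore(redis, `k8NodeResourceLock-${nodeId}`, slotsPerNode, {
    acquireTimeout: 2 * 60 * 1000, // placeholder: give up if no slot frees within 2 minutes
  });
  await semaphore.acquire();
  heldSemaphores.set(nodeId, semaphore); // assumes one acquire at a time per nodeId per process
}

async function releaseNodeSlot(nodeId: string): Promise<void> {
  const semaphore = heldSemaphores.get(nodeId);
  if (!semaphore) {
    throw new Error(`no acquired semaphore for node ${nodeId} in this process`);
  }
  await semaphore.release(); // releases the identifier acquired above and stops its refresh
  heldSemaphores.delete(nodeId);
}
```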


rivernews commented May 17, 2020

Overlapping memory usage issue

During job switching, it looks like the k8s job does not immediately release memory.

  • Did the job actually complete when the scraper finished? Is it related to the java scraper's pubsub adapter preventing the main thread from shutting down?
    • Yes, the job does complete, but the memory utilization does not go down. At the time the green line (next scraper job) starts, the previous job scraper-job-1589753321112 has already completed with scraper-job-1589753321112 1/1 2m2s 2m3s, but its memory claim (pink) does not go down, even after a minute.
      (screenshot)
      OK, it finally goes down, after almost 5 minutes (pink).
      (screenshot)
      Another example: at this time point, 748 (green) and 452 (yellow) have already completed, but they didn't release their memory immediately, while new jobs were already incoming (orange and blue on top). Eventually they went down after 4-5 minutes, but that already caused a memory spike approaching the node memory capacity of 4G.
      (screenshot)
      Related SO question to this. Keywords: gc, garbage collector, completed job pod resource

Some ideas to tackle this situation

  • Delete the job when the scraper job finalizes (a hedged sketch follows)
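
A hedged sketch of that cleanup with @kubernetes/client-node; the propagation policy is what makes the job's pod go away too. The names and the older positional client signature are assumptions:

```typescript
import * as k8s from '@kubernetes/client-node';

const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const batchApi = kc.makeApiClient(k8s.BatchV1Api);

// Delete the finished job and cascade the delete to its pod, so the pod's memory claim is
// released right away instead of lingering for minutes.
async function deleteFinishedScraperJob(jobName: string, namespace = 'selenium-service'): Promise<void> {
  await batchApi.deleteNamespacedJob(
    jobName,
    namespace,
    undefined,    // pretty
    undefined,    // dryRun
    undefined,    // gracePeriodSeconds
    undefined,    // orphanDependents
    'Foreground', // propagationPolicy: also remove the job's pods
  );
}
```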

Shape of one node:
(screenshot)
Another:
(screenshot)
It doesn't look like the overlapping issue is solved, but it's slightly better than before. One guess is that the job "object" is deleted, but its pod remains there occupying resources.

Looks like it works perfectly now!
(screenshot)

And another node:
(screenshot)

Final check

  • We may also try a "strict resource limit" test and see if scraper jobs can survive. If so, GC will probably work harder to recycle, and then we may consider raising per-node capacity.
  • 17 nodes / 34 jobs crashed the entire cluster.
  • The rest is a final check with a full-suite s3 job on production. If it runs w/o error, we can merge the pull request! 12 nodes / 24 jobs looks good - it succeeded!
    • One error on groupon page 110: cannot locate review panel; verified by accessing the current webpage, and indeed the review panel vanished. Looks like some bug on their end. No need to fix on our end.
    • Nothing wrong with scaling; we got 4 errors overall, all "cannot locate review panel", but this shouldn't have to do with our scaling here, so we created another ticket to deal with it: Ability to skip a problematic page and move on #75.
    • Efficiency benchmark: 20:45 -> 5:15, 8h 30min, 487 success / 4 fail / 450 split size. Impressive! 🎉
    • At a few time points the max memory utilization slightly exceeded 3G; otherwise not much concern. However, this may prevent us from scaling up further, because it looks like the SLK node, which is also the k8 core node, is reaching 3.6G/4G.
      (screenshot)
    • We may need to find a way to let SLK scale. Perhaps SLK should have its own node. Keep in mind that the current base cost is $20 at idle. If we can dynamically scale and keep the idle cost lower than $20, we can afford more when a scraper job is in progress.
      • Some bull github issue comments say that you can just have multiple node main processes running the same javascript code, pointing to the same redis server, and the sandbox processes will automatically be spread across them. So what we can try: k8 deployment replicas=2, and see in grafana if the two SLK instances take workloads evenly.
      • We do want to think about how to scale SLK down. Perhaps a starting point: the primary node pool uses 1v2G, smaller. A new node pool dedicated to SLK uses 1v3G, autoscale 1-3. SLK deployment initial replicas=1. Upon an s3 job request, scale the SLK deployment to 2~3, wait for SLK to scale and reach minimum availability, then start the s3 job. When s3 finalizes, scale the SLK deployment back to 1 (see the scale sketch after this list). For node down-scaling, we are hoping k8 will do that automatically for us, though perhaps it takes a bit longer than 5-10 minutes.
  • Can we increase per-node capacity to 3 while having 12 nodes (a total of 36 concurrency)? Let's look at the peak memory utilization on each node. Make sure there's at least 1.5G of free memory at any given time point.
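
A hedged sketch of the scale-up/scale-down step for the SLK deployment via its scale subresource, with @kubernetes/client-node; the deployment name and namespace are placeholders:

```typescript
import * as k8s from '@kubernetes/client-node';

const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const appsApi = kc.makeApiClient(k8s.AppsV1Api);

// Scale SLK up when an s3 job starts and back down when it finalizes.
async function scaleSlkDeployment(replicas: number, name = 'slk', namespace = 'slk'): Promise<void> {
  const { body: scale } = await appsApi.readNamespacedDeploymentScale(name, namespace);
  scale.spec = { replicas };
  await appsApi.replaceNamespacedDeploymentScale(name, namespace, scale);
}

// e.g. await scaleSlkDeployment(3);  // before dispatching the s3 job
//      await scaleSlkDeployment(1);  // after the s3 job finalizes
```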

@rivernews rivernews changed the title Search for better cost-effficiency cloud provider Search for better cost-effficienct cloud provider May 18, 2020
@rivernews rivernews changed the title Search for better cost-effficienct cloud provider Search for better cost-effficient cloud provider May 18, 2020

rivernews commented May 19, 2020

Scaling SLK

Recap the workflow

  • Test Bull cross-process concurrency on the same node first
    • SLK scale = 2
    • Start the s3 job and observe (15 worker nodes, 2 per node)
    • Confirm s3 finishes w/o error
  • K8 core set to smaller droplets, 1v2G ... $10
  • K8 new node pool for SLK, 1v3G ... $15 (so the idle-stage cost increases from $20 -> $25)
    • Change the microservice tf module to make replicas and node anti-affinity configurable
    • Autoscale 1~3
  • SLK deployment: nodepool=SLK, replicas=1, node anti-affinity enabled
  • (Start s3 job)
  • Scale up SLK, scale=3, poll to wait for minimum availability
  • (Start jobs)
  • (...)
  • (S3 finalizing)
  • (Scale down selenium stack. Scale down scraper worker nodes)
  • Scale down SLK, scale=1
  • (Finish s3 job)

The core node pool on 1v2G is too small and unstable. We got network and nginx crashes, even though the SLK nodes were fine. We then had to upgrade the core node pool droplet size. In the end it defeats the point - we could just simply use a 4v8G droplet for both core and SLK, and that's a firm $40 per month. You can scale the entire k8s down by terraform - right, it's not perfect in that sense though. I don't know, maybe we need a separate k8 cluster to "manage" the k8 used for the scraper.

Also, we would like to populate many companies currently missing in our database, including:

  • Intel
  • ...


The scaling up and down is quite stable right now. Further automation would find it really hard to maintain the same cost level of around $20-40 monthly, and it can't really save us money while keeping the ability to scale up.

Summary

The current max capacity is 60 jobs, using a core droplet size of 8G RAM - either 4v8G or memory-optimized 1v8G. The basic monthly bill is $40 at this memory size. But we figured out a way to bypass letsencrypt's duplicate-certificate limit, so we can always scale down the entire k8s cluster. Of course, we still have to do this in the terminal. It would be ideal to have a meta-service running, at least to trigger a travis job that triggers k8s provisioning / deletion. Perhaps heroku could be a good place to do this due to its free plan.

The scaling-up cost is additional, using 2v4G machines.
