Search for better cost-efficient cloud provider #62

Closed
rivernews opened this issue Mar 31, 2020 · 13 comments · Fixed by #68
rivernews commented Mar 31, 2020

This ticket also deals with the vision of this project. If we want to scale while paying a reasonable cloud bill, we need a more flexible way to run a Kubernetes cluster, and using managed Kubernetes as a service definitely limits us on that path.

Ideally something like AWS Fargate would work best - if we can lower the cost while our cluster is idle, then we can afford more concurrency when scraper jobs fire up.

AWS route: more RAM w/ cost efficiency

  • Insights into bare metal k8, plus ingress w/o load balancer.
    • Use Kubespray to install on an AWS EC2 instance
    • Kops is another option.
    • Interesting idea: use k8 autoscaling to reduce cost. We could keep a cheap node (1 vCPU, 2G RAM) to host the supervisor server (SLK), and run the scraper jobs and selenium server on a larger node (4 vCPU, 8G RAM). When no scraper job is running, we just shut down the larger node. We should see how to use labels in k8 to specify which node a scraper job / deployment should be scheduled on.
    • Read this article from k8 external-dns to help setup ingress w/o load balancer.

Several requirements for provisioning on AWS

  • No external load balancer, that is, no use of any AWS ELB or ALB
  • Needs to be done in terraform
  • Needs to be installed on EC2, not a managed Kubernetes service.

Elastic scale route: save cost

Approach 1: AWS Fargate, or any other container service

  • Fargate is fully container-based, so it's not suitable for long-living admin servers like SLK. But it is suitable for running scraper jobs, or even the selenium server.
  • A scraper job needs access to the selenium server and the public internet. Other than that, it doesn't need much networking, so it may not even need an ingress.

Approach 2: K8 auto-scaling, K8 API

  • This should be faster because we just use the existing terraform and digitalocean setup: create another node (a droplet, in DigitalOcean terms) and assign selenium & the scraper jobs to it. Then we should be able to programmatically kill the node or force it to scale down.

Approach 3: manual slack command in K8s

This approach is supposed to be the most feasible and the fastest to start; no need to look for another platform. The idea is to run SLK on a low-cost node - SLK has to be up all the time in order to receive manual scale-up / scale-down commands. In other words, this approach uses SLK as a platform to manually scale up and down. This should save cost and avoid keeping an expensive node running without any scraper jobs present.

  • Launch a low-profile node in K8 and use it for all the persistent stuff - redis, SLK, and other essential k8s components like the ingress controller.
  • Input a command in slack like up; SLK then triggers a travis build, which runs a terraform script to provision the resources for the selenium server on a dedicated node (see the Travis trigger sketch after this list).
    • SLK should report back in the slack channel once travis finishes running and the dedicated node and selenium are ready.
    • Let SLK assign k8 jobs to the new node.
    • In slack we are then able to start jobs with rrr or ccc.
  • Input a command in slack like down, which triggers a travis build that runs the same terraform script with destroy to tear down the node, along with the selenium server and scraper jobs.
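
A minimal sketch of the Travis trigger step in TypeScript, assuming the Travis API v3 request endpoint; the token, repo slug, and the SCALE_DIRECTION env var are placeholders, and the triggered build is assumed to run terraform apply or destroy:

```typescript
// Hedged sketch: SLK handles a slack "up"/"down" command by asking Travis to run a build.
import fetch from 'node-fetch';

const TRAVIS_TOKEN = process.env.TRAVIS_TOKEN!;                 // placeholder
const REPO_SLUG = encodeURIComponent('rivernews/provision-repo'); // hypothetical repo slug

async function triggerScaleBuild(direction: 'up' | 'down'): Promise<void> {
  const res = await fetch(`https://api.travis-ci.com/repo/${REPO_SLUG}/requests`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Travis-API-Version': '3',
      Authorization: `token ${TRAVIS_TOKEN}`,
    },
    body: JSON.stringify({
      request: {
        branch: 'master',
        // The build script is assumed to read SCALE_DIRECTION and run terraform accordingly.
        config: { env: { global: [`SCALE_DIRECTION=${direction}`] } },
      },
    }),
  });
  if (!res.ok) {
    throw new Error(`Travis build request failed: ${res.status} ${await res.text()}`);
  }
}
```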

rivernews commented Apr 3, 2020

Plan for elastic scaling

  • Digitalocean: only hosts SLK + redis. Use a smaller machine, perhaps 1 vCPU, 2G RAM
  • Run the selenium server and scraper jobs via some API.
  • Network:
    • SLK -> API -> container platform
    • SLK <-> Redis <-> scraper job <-> selenium server (high RAM)

If SLK can programmatically spin up containers for both the selenium server and the scraper job, that would be awesome. See the GitHub AWS SDK for JavaScript and the AWS CDK for EC2 npm page. Spec for each container (a hedged sketch follows the list below):

  • Selenium server: high RAM, needs to be accessible by the scraper job, exposes port 4444
  • Scraper job: env vars, needs to access the selenium server on port 4444
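
A hedged sketch of the programmatic spin-up idea with the AWS SDK for JavaScript (v3, @aws-sdk/client-ec2); the AMI, instance type, subnet, security group, and user-data script are all placeholders, not values from this project:

```typescript
import { EC2Client, RunInstancesCommand, TerminateInstancesCommand } from '@aws-sdk/client-ec2';

const ec2 = new EC2Client({ region: 'us-west-2' });

// Spin up one EC2 instance that runs the selenium server in a docker container via user data.
async function launchSeleniumInstance(): Promise<string | undefined> {
  const userData = [
    '#!/bin/bash',
    'yum install -y docker && service docker start',
    'docker run -d -p 4444:4444 selenium/standalone-chrome', // expose 4444 for scraper jobs
  ].join('\n');

  const res = await ec2.send(new RunInstancesCommand({
    ImageId: 'ami-xxxxxxxx',            // placeholder AMI
    InstanceType: 'r5.large',           // high-RAM instance for selenium
    MinCount: 1,
    MaxCount: 1,
    SubnetId: 'subnet-xxxxxxxx',        // public subnet with an internet gateway route
    SecurityGroupIds: ['sg-xxxxxxxx'],  // must allow inbound 4444 from the scraper
    UserData: Buffer.from(userData).toString('base64'),
  }));
  return res.Instances?.[0]?.InstanceId;
}

// Tear the instance down when scraper jobs finish, to stop paying for it.
async function terminateSeleniumInstance(instanceId: string): Promise<void> {
  await ec2.send(new TerminateInstancesCommand({ InstanceIds: [instanceId] }));
}
```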

One thing we want to ask: do a VPC and subnet cost anything? If not, we can use tf to create them beforehand; otherwise, we may want to include them in SLK's dynamic resource-creation logic.

  • A NAT gateway is for EC2 to access the public internet, which is essential - our selenium server needs it to connect to gd, and our scraper also needs it to communicate w/ SLK & redis. A NAT gateway is $0.045/h in Oregon. Monthly ~$32, yikes!
  • A Gateway VPC endpoint allows EC2 to connect to AWS S3 and DynamoDB. No charge for gateway VPC endpoints. (But we need to access gd! So we cannot rely on this)
  • Interface VPC endpoint - covers more AWS services that "support interface VPC endpoints"
  • Internet gateway - seems like the best fit for us. No charge for this. The difference between a VPC endpoint and this looks like private vs. public subnet. We can just use a public subnet and a public IP. This should be fine for us. But do the public subnet / public IP / elastic IP cost anything? The IP for the instance will be globally unique.
    • Looks like a public IP does not cost anything, but an elastic IP may because it's persistent - a public IP is not persistent. But a DNS name is also provided by default when you create the vpc / ec2, so that's a fixed value you can use outside the VPC.

Plan for more RAM

  • Create VPC, subnets, gateways, etc on aws, via Terraform
  • Create EC2 instance
    • Test if we can http to reach it
  • Install K8 on it, via Terraform?
  • Install ingress on it, via Terraform?
    • Test if we can http to reach K8 ingress
  • Install app on K8, via Terraform?


rivernews commented Apr 23, 2020

Elastic Approach

Looks like the elastic approach is probably the most cost-efficient one. The idea is basically:

  • A master node where our SLK system runs
    • This can be K8 on digitalocean; after all, k8 is still a good way to run multiple long-running microservices like redis, and other apps in the future.
  • A platform where we can programmatically run containers, with network connectivity to communicate back w/ SLK. Pick one of the following:
    • Same k8 cluster on digitalocean, but a new node. Will need the digitalocean or k8 API to scale up / down the following: the node & the deployments for selenium and the job (for the scraper, more like a simple container).
      • Better to use the DigitalOcean API because this concerns VM size, which only DigitalOcean can control, and it determines pricing.
    • EC2 on AWS. Will need the AWS SDK to scale up / down:
      • EC2 + Internet Gateway setup
      • Install and run the selenium server on EC2, possibly in docker container fashion
      • Run containers on EC2 ... maybe Fargate is a better idea, but Fargate is also pricey. We need more research into this, to see if there are any drawbacks.
      • Delete EC2 and related resources
    • Azure has so-called Container Instances, worth taking a look
    • Any container service platform that meets our API control & network requirements


rivernews commented Apr 25, 2020

K8S Elastic Approach

Looks like with the nodejs DO client, the node pool and its nodes are quite troublesome. A node inside gets stuck at "provisioning". Why is that?

  • Maybe DO was busy at that time?
  • Perhaps we want to ask: how long does it take DO to 1) create a node pool 2) scale up a node in a node pool?

Another method is to use tf to create a separate node pool with a node count of 0. Then SLK just updates "count" when it needs to scale up, and then polls to check node readiness (a hedged polling sketch follows).
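
A hedged sketch of that polling step with the @kubernetes/client-node package; the DOKS node-pool label and the older `res.body` response shape are assumptions:

```typescript
import * as k8s from '@kubernetes/client-node';

const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const coreApi = kc.makeApiClient(k8s.CoreV1Api);

// Poll until at least one node of the given pool reports the Ready condition.
async function waitForReadyNode(poolName: string, timeoutMs = 10 * 60 * 1000): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const res = await coreApi.listNode();
    const poolNodes = res.body.items.filter(
      (n) => n.metadata?.labels?.['doks.digitalocean.com/node-pool'] === poolName,
    );
    const ready = poolNodes.some((n) =>
      (n.status?.conditions ?? []).some((c) => c.type === 'Ready' && c.status === 'True'),
    );
    if (ready) return;
    await new Promise((r) => setTimeout(r, 10_000)); // check again in 10s
  }
  throw new Error(`timed out waiting for a ready node in pool ${poolName}`);
}
```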


rivernews commented Apr 26, 2020

Practical approach on k8s elastic

  • Prepare a node pool with count 1
  • Let the selenium server run on an additional node pool - it's the largest RAM consumer; it takes almost 4G when running 4 scraper jobs. SLK seems to be a relatively small consumer.
  • Also create the selenium deployment along with node pool creation. This needs to be an async job.
    • We currently use TF to provision the selenium server. Now we want to run it programmatically.
    • See if we can avoid unnecessary resources. Are a deployment / service required? If not, try to spin up a selenium server with just a job.
      • Service account - do we need this? It seems to be required by the deployment, but every namespace has a default service account in place, so I guess this is not needed, at least in this case.
      • Namespace - might be needed. You can let selenium share the same ns as the scraper jobs, which makes cleanup easier. But of course, you can dedicate a namespace to selenium - since you may want to keep the scraper in the SLK ns to access logs from scraper jobs.
      • Deployment is needed - to spin up the container
      • Service is needed - to let jobs / SLK access the container
  • The rest is just using nodeSelector when provisioning the K8s job (see the sketch below).
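
A hedged sketch of provisioning a scraper job with a nodeSelector via @kubernetes/client-node; the namespace, image, env var, and the DOKS pool label are placeholders, not this project's actual values:

```typescript
import * as k8s from '@kubernetes/client-node';

const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const batchApi = kc.makeApiClient(k8s.BatchV1Api);

async function createScraperJob(orgName: string): Promise<void> {
  const namespace = 'selenium-service'; // placeholder namespace
  const job: k8s.V1Job = {
    apiVersion: 'batch/v1',
    kind: 'Job',
    metadata: { name: `scraper-job-${Date.now()}`, namespace },
    spec: {
      backoffLimit: 0,
      template: {
        spec: {
          restartPolicy: 'Never',
          // Pin the job onto the dedicated scraper worker pool (DOKS labels nodes by pool name).
          nodeSelector: { 'doks.digitalocean.com/node-pool': 'scraper-worker-pool' },
          containers: [
            {
              name: 'scraper',
              image: 'example/java-scraper:latest',            // placeholder image
              env: [{ name: 'TARGET_COMPANY', value: orgName }], // placeholder env var
              resources: { requests: { memory: '200Mi' }, limits: { memory: '800Mi' } },
            },
          ],
        },
      },
    },
  };
  await batchApi.createNamespacedJob(namespace, job);
}
```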


rivernews commented Apr 29, 2020

Tweaking for appropriate node size

Configuration: 2 vCPU, 4G RAM for selenium; SLK & the rest of the k8s infra on 1 vCPU, 2G RAM. Running 1 concurrent scraper job.

Primary node

Scraper job CPU is negligible.
(screenshot)

A scraper job consumes around 177 MB. SLK uses 29 MB. We're not testing SLK here because we're using the local dev SLK.
(screenshot)

Worker node

Only running selenium.
1 scraper uses around 0.8 CPU.
(screenshot)

1 scraper uses up to 900MB.
(screenshot)


Services on the primary node like grafana are getting a bit slow to respond, so perhaps create scraper jobs on the worker node as well.


Problems

  • We observe our scraper worker nodes being used by irrelevant deployments, e.g., SLK.
    While we limit scraper jobs and the selenium deployment to run only on scraper worker nodes, we did not stop other deployments, including SLK, from deploying onto those scraper worker nodes. Is there a way to exclude other workloads from a node pool?
    • One approach - assign SLK manually to the default k8s node pool (a hedged sketch follows this list)
    • Anything else we'd like to limit to the default node pool? Probably redis too.
      • Let's extend our TF microservice module to support assigning a node selector in the deployment.
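
A hedged sketch of the first approach - patching a nodeSelector into the SLK deployment so it stays on the default pool (names are placeholders; the Content-Type header is the usual workaround for strategic-merge patches with @kubernetes/client-node). The stricter way to fully exclude other workloads from a pool would be to taint the worker pool's nodes and give tolerations only to the selenium / scraper pods.

```typescript
import * as k8s from '@kubernetes/client-node';

const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const appsApi = kc.makeApiClient(k8s.AppsV1Api);

// Keep SLK (and, by the same pattern, redis) on the default node pool so it never lands on
// the scraper worker pool.
async function pinDeploymentToDefaultPool(name: string, namespace: string): Promise<void> {
  const patch = {
    spec: {
      template: {
        spec: {
          nodeSelector: { 'doks.digitalocean.com/node-pool': 'default-pool' }, // placeholder pool name
        },
      },
    },
  };
  await appsApi.patchNamespacedDeployment(
    name, namespace, patch,
    undefined, undefined, undefined, undefined,
    { headers: { 'Content-Type': 'application/strategic-merge-patch+json' } },
  );
}
```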

@rivernews rivernews mentioned this issue Apr 29, 2020

rivernews commented Apr 30, 2020

Tuning Performance & Throughput

Several things we want to try:

  • Is travis working now? Let's try to provision some workload on travis to maximize concurrency
    • Resolve the redis timeout locally
    • Test travis locally - does it work? Then decide the next step; remember travis is optional
    • Yes, travis can do it. If we just add it to the existing prod setup, it will not add selenium workload & will not add scraper jobs, so win-win. But it will add SLK sandbox workload, which should be fine and handled by the primary node (2CPU/4G).
  • Wait for or terminate the current workload
    • You can try just killing the selenium namespace
    • Deploy SLK so that it destroys / terminates all current workload
    • Then kill the worker node
  • Let's try vertical scaling for selenium
    • Give it a large enough droplet as the worker node. Needs balanced CPU & memory: 4G / >15% of 4vCPU for 4 scraper jobs.
    • Provision both java scraper jobs and selenium on the worker node
    • Test for higher concurrency - at least accommodating 8 scraper jobs. If the node size can't do it, change to a larger size. -- After including travis, we can allocate up to 13 - 7 k8s jobs at a time.
  • Last: research better ways to scale selenium standalone / grid / Zalenium / ... on kubernetes

Benchmarking

All k8s scraper jobs & selenium running on the worker node.

Primary Node:

SLK:
Initial: 60MB
4 sandbox processes: 175 MB (+115MB, 29MB/process)
10 sandbox processes: spike: 740MB (+680MB, 68MB/process); steady: 600MB (+540MB, 54MB/process)

Node memory usage: 2.1G/4G, around 50%.
Estimated remaining capacity: safely +1G workload == at least 10 more sandboxes == 20 total sandboxes
(screenshot)

Worker Node:
Selenium, 4 sessions: 1G-3G (250-750MB/session), average 2.3G (575MB/session).
Java scraper container, 4 jobs: 180-200MB per job, total 720MB-800MB; actual (incl. overlapping time): 8 k8s jobs concurrently, totaling 1.7G.

Node memory usage: spike: 5G/8G, steady 4.5G/8G.
Estimated remaining capacity: safely +2G workload == 2-3 more k8s jobs == total 6-7 k8s jobs.
(screenshot)


rivernews commented May 3, 2020

Problem

  • Travis: connecting to redis keeps timing out (see the redis client sketch after this list).
  • K8s is not very responsive with multiple nodes and jobs > 2.
    • Is it caused by high CPU? ... but the primary node's CPU utilization is low. Redis resides on the primary node.
    • Does it have sth to do with us assigning redis to the primary node? Or should we let k8 decide where redis should reside?
    • Or does it have sth to do with our S3 dispatching too many jobs (small job split size) + no dispatch delay, so that it occupies redis command transactions too much and redis cannot respond?
    • Is it because our droplet's network firewall does not allow 6378? ... well, our primary node, where redis resides, does have firewall 6378 allowed.
  • Warning: too many k8s jobs may cause glassdoor blocking - since the k8 node IP is fixed. We'll get "sign in link is not found" in this case.
    • May also be throttling - when there is no s3 dispatch delay or it's too short, it results in burst requests against glassdoor.
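
A hedged sketch of hardening the redis client against these flaky connections, assuming ioredis is (or could be) the client in use; the env vars and timeout values are placeholders:

```typescript
import Redis from 'ioredis';

// Configure explicit timeouts and a capped backoff so a transient network hiccup
// (e.g. from a Travis worker) reconnects instead of hanging or failing the whole job.
const redis = new Redis({
  host: process.env.REDIS_HOST,                            // placeholder
  port: Number(process.env.REDIS_PORT ?? 6379),            // placeholder
  connectTimeout: 10_000,                                  // give up on the initial connect after 10s
  retryStrategy: (times) => Math.min(times * 500, 5_000),  // back off, max 5s between attempts
  maxRetriesPerRequest: 3,                                 // don't let one command retry forever
});

redis.on('error', (err) => console.warn(`redis error: ${err.message}`));
```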

Tuning Profile

  • Safe approach:
    • No travis. 4 k8s jobs. Node size 4CPU-8G RAM. Misc: everything on the worker node: scraper + selenium. Split size: 1000, total job count: 399. S3 dispatch speed: 300ms/job.
      • Natural CPU usage: > 3, natural memory usage: 4.5G (after 7 min of starting the s3 job)
    • If the previous run is good, try adding ONE travis job. See if there's no timeout error after we set up the firewall.
    • Let's allow more travis jobs, and wait 2-3 cycles to make sure there aren't any redis timeout issues.
  • Using the dedicated node pools, let's see if we can push more concurrency out of them. If not, we may want to fall back to standard shared droplets.
    • Looks like 1 CPU for the primary node may be too small, since when s3 dispatches, the SLK sandbox leaks some memory.

    • We'll need to wait a few days for the Let's Encrypt throttling to end.

    • Don't let External DNS delete the domain name ... will this help? This may not help, since k8s credentials may need to be retrieved again every time k8s provisions.

    • 🛑 We decided to move on until new info surfaces. More issues emerged. The Travis job cannot locate the review panel at first, then the redis connection gets unstable but manages to reconnect. Then things seem to stop there, with no new progress reported. SLK then waits until the 10-minute timeout, then cleans up, and the travis job gets canceled. This happens again and again, quite reliably reproduced. It only appears in Travis jobs. We lack log levels, so info is also limited.
      (screenshot)

      • Locate-review-panel delay ... did we set this too high? Note that the total delay is delay + cannot-locate timeout. -> We only have sleep 10 + 20 + 30 w/ each timeout 25, which is at most up to 3 minutes.
        • -> Well, remember this happens in the locate() block, not the parse() block, so there isn't much progress publishing there. -> We're trying a few things like adding more chrome options and doing additional redis progress publishing, but no guarantees. Let's do a benchmark again (4 k8s / 4 travis) and see what the result is. It definitely has something to do with concurrency, but we don't know why higher concurrency causes the selenium timeout and redis reconnection.
      • Does redis have something to do with it? The flow is exactly the same and has happened 4 times so far. Should we periodically use redis to keep the connection alive, perhaps adding publish-progress in the retry block as well?
      • Add a timestamp to each log
  • 🛑 We decided to move on until new info surfaces. After we solve the Travis issue - and see what the root cause is - we may look at how to scale k8s jobs
    • Limitation of vertical scaling: glassdoor will throttle the same IP - which means a node can only run up to 4-5 jobs. (7-8 jobs will cause glassdoor to block)
    • We need a way to scale horizontally.
      • Option 1: A selenium-only node pool + a scraper-job-only node pool. Only selenium is memory intensive - actually, both scraper jobs and selenium can be memory intensive. How do we scale nodes for these?
      • Option 2: Try to just use one node pool w/ autoscaling. Perhaps via the selenium deployment's replicas.
        • But we need to see how replicas make sense for selenium, since sessions are persistent.
        • Only selenium is CPU intensive. So if we use autoscaling, probably just one node utilizes CPU while the other nodes' CPUs sit idle. Memory utilization should even out and be fine though.
      • A better approach is probably to look at scaling selenium on kubernetes, and it looks like helm also has a selenium release available.
      • Another thing is to limit CPU for selenium, so that it won't overwhelm the API server and leave the k8s cluster unable to be monitored.
        • Deploy hub + chrome node
        • Set chrome node replicas = 2
        • Set scraper worker node pool autoscale, and count/max = 2
        • Set CPU limits for chrome nodes
        • Some issue specifying the port on the liveness / readiness probe. We may switch to the GoDaddy K8 client, dig into the issue Possible typing mistake in class V1HTTPGetAction kubernetes-client/javascript#444, or just disable the probes.
        • Now test replicas=2, nodepool/node=2, and see if k8 spreads the chrome nodes evenly across k8 nodes (see the anti-affinity sketch after this list);
          • then run scraper jobs and see if the hub spreads session workload evenly across chrome nodes; eventually see if we can run scraper jobs on different nodes so that we own different IPs to access glassdoor.
  • A couple of new things to wrap stuff up
    • If S3 finishes successfully, destroy selenium + the node as well
    • Let k8s be used first; set Travis as a secondary source unless we still want to test the timeout issue for travis.
  • Frontend: add mistake prevention - when the selenium deployment is still there, don't allow deleting the node.
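
A hedged sketch of one way to encourage the chrome-node replicas to spread across worker nodes - preferred pod anti-affinity on the chrome-node deployment's pod template. This is plain Kubernetes scheduling rather than anything specific to this repo, and the labels are placeholders:

```typescript
import * as k8s from '@kubernetes/client-node';

// Prefer not to schedule two chrome-node pods onto the same worker node, so each node gets
// its own selenium capacity (and its own outbound IP toward glassdoor).
const chromeNodeAntiAffinity: k8s.V1Affinity = {
  podAntiAffinity: {
    preferredDuringSchedulingIgnoredDuringExecution: [
      {
        weight: 100,
        podAffinityTerm: {
          labelSelector: { matchLabels: { app: 'selenium-chrome-node' } }, // placeholder label
          topologyKey: 'kubernetes.io/hostname',
        },
      },
    ],
  },
};

// Attach it to the chrome-node deployment when provisioning it programmatically:
//   deployment.spec.template.spec.affinity = chromeNodeAntiAffinity;
```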


rivernews commented May 10, 2020

Final Milestone

While there's a lot of room for improvement, we could set a final milestone here just to achieve two things:

  • Scalable selenium that can scale up substantially
  • Auto scale down to save cost

Steps

  • Test Hub-Node paradigm in production, also for benchmarking workload

  • Configurable scale - currently we have 4 scrapers per k8 node, a total of 8. We want to be able to set this number arbitrarily.

    • Set a fixed amount of scalability on each node; only scale the number of nodes. Each node runs 4 scrapers. We hope that 2vCPU/4G would be enough, but we need to see how much selenium consumes in the long term. Further test on the hub-node arch; we can only test conservative mode, which is 2 nodes / 4 jobs = 8 total. See if the timeout still occurs (cannot-reach-hub error <-- possibly a k8s network issue; in fact the hub is mostly available; chrome node initial registration did take some time, but it's only needed once)
      • We keep getting a remote driver creation error - unreachable browser. When we use curl to access the hub from a different namespace, 1) it can resolve the IP (there was one time initially it couldn't, but after that it always resolves) 2) sometimes we can get a response from /wd/hub/status, but a lot of times we get port 4444: Connection refused, which must be the main cause of our unreachable error - so this is not a big issue with Cilium (perhaps just the initial could-not-resolve-host error), but more of something on the hub side. If you use port forwarding, you don't see this issue.
    • If we still can't get the k8s network issue right, or can't figure out the root cause of the unreachable remote driver, then we might want to try the pod architecture - pack 1 scraper + 1 standalone selenium into a job.
      • Test conservative mode -- see resource consumption! ... 4vCPU-8G RAM * 5 / 4 scrapers per node / total 20 -> 5xx succeeded / 5 failed; failures: review panel locating issue
      • Test aggressive mode -- 2CPU * 8 / 3 scrapers each / total 24 / splitSize = 500, smaller to avoid losses on error ->
      • If both conservative and aggressive modes pass, we can consider adopting the pod architecture as the final solution.
  • We may want to lock down to a working selenium hub / chrome node version.

  • Wrap up, merge PR, close this ticket

Side notes

  • We got a lot of unreachable browser exceptions when the scraper tries to create a remote driver session against the hub. This issue comment provides a retry-again approach to tackle the problem - simple, a bit brute-force, but very effective, especially since we're pretty sure the hub capacity is there and ready (sketched below).
  • The issue comment later also pointed out that the author changed the kubernetes network plugin and it improved the network condition in K8.
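
The scraper itself is Java, but the retry-again idea is simple enough to sketch in TypeScript with the selenium-webdriver package; the hub URL and attempt counts are placeholders:

```typescript
import { Builder, WebDriver } from 'selenium-webdriver';

// Keep retrying remote session creation against the hub; an "unreachable browser" style
// failure is treated as transient as long as we know the hub capacity is there.
async function createRemoteDriverWithRetry(
  hubUrl = 'http://selenium-hub.selenium-service:4444/wd/hub', // placeholder in-cluster URL
  maxAttempts = 10,
): Promise<WebDriver> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await new Builder().forBrowser('chrome').usingServer(hubUrl).build();
    } catch (err) {
      lastError = err;
      console.warn(`remote driver creation failed, attempt ${attempt}/${maxAttempts}`);
      await new Promise((r) => setTimeout(r, 5_000)); // brief pause before the next try
    }
  }
  throw lastError;
}
```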


rivernews commented May 15, 2020

Standalone Approach

Things were going well until we faced the challenges here:
(screenshot)

As you can see, there's a problem with the k8s node assignment algorithm. Before the whole cluster went nuts, you can see that besides the 2+2 job-switching overlap - which is dangerous too, and we can lower the memory request for it - there's an additional job assigned to this node, making it run 5 jobs concurrently at that moment.

Looks like we can't trust k8s's node assignment. We can lower the memory request so that a job doesn't claim so much memory at the beginning and only claims more when it needs it - we can do that. But we don't have control over node assignment.

Unless k8s has some additional parameter to configure this, we will need to implement this node distribution algorithm on our own.

Two ways:

  • K8's built-in Pod Topology Spread Constraints. -> But we can't use it, because in v1.16 it's disabled by default behind a feature gate, and DO doesn't allow you to change feature gate settings. Their DO doc said something but didn't give a complete list of their feature gates; I guess they're just using the k8s upstream defaults.
  • Use a redis semaphore to realize this
    • At the point a new job requests the semaphore, the old job should already have released it, so we may not have to worry too much about distinguishing a semaphore-acquiring timeout vs. not available, but we can still try to look at it.
    • We may want to look into the code base and see how we can contribute. We need a reset() method.


rivernews commented May 17, 2020

Challenges implementing anti-affinity by redis semaphore

The semaphore objects are possibly created across different node processes. We got two issues:

  • release() doesn't work - it says it has no identifier.
    • Error: semaphore k8NodeResourceLock-0b63934f-b75c-4110-9489-ef220d6050cf has no identifier
    • This then causes the next job to error with nodeId is empty, did not acquire semaphore successfully.
  • Refresh
    • The refresh interval inside the semaphore object does not stop even after the semaphore key is deleted in redis, which causes: Error: Lost semaphore for key k8NodeResourceLock-8d44fefc-fcec-4822-a145-5a55f1907a14

Some ideas on why previous runs succeeded:

  • As long as acquire() and release() are executed in the same (child) process, it's ok.
  • Probably only one acquire() session should be executed at a time per (child) process.
  • If this is true, we can fix this while preserving all the refresh / identifier state functionality (sketched after this list) by:
    • Each time we acquire(), also storing the local semaphore object. It will contain the (single) identifier.
    • Expecting the corresponding release() to be executed in the same (child) process. Then it will release with the correct identifier!
    • This will work as long as we 1) acquire and release a specific resource (i.e. against an identifier) within the same (child) process and 2) only acquire once at a time, that is, never do another acquire before the previous one is released.
    • The refresh issue, which causes "Error: lost semaphore for key...", should now be handled by each release().
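
A hedged sketch of that fix, assuming the redis-semaphore npm package on top of ioredis; the key prefix mirrors the errors above, and the slot count and timeout are placeholders:

```typescript
import Redis from 'ioredis';
import { Semaphore } from 'redis-semaphore';

const redis = new Redis();
// Keep the Semaphore instance that performed acquire() so release() runs against the same
// identifier, within the same (child) process.
const heldSemaphores = new Map<string, Semaphore>();

async function acquireNodeSlot(nodeId: string, slotsPerNode = 4): Promise<void> {
  const semaphore = new Semaphore(redis, `k8NodeResourceLock-${nodeId}`, slotsPerNode, {
    acquireTimeout: 2 * 60 * 1000, // placeholder: give up if no slot frees within 2 minutes
  });
  await semaphore.acquire();
  heldSemaphores.set(nodeId, semaphore); // assumes one acquire at a time per nodeId per process
}

async function releaseNodeSlot(nodeId: string): Promise<void> {
  const semaphore = heldSemaphores.get(nodeId);
  if (!semaphore) {
    throw new Error(`no acquired semaphore for node ${nodeId} in this process`);
  }
  await semaphore.release(); // releases the identifier acquired above and stops its refresh
  heldSemaphores.delete(nodeId);
}
```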


rivernews commented May 17, 2020

Overlapping memory usage issue

During job switching, it looks like the k8s job does not immediately release memory.

  • Did the job actually complete when the scraper finished? Is it related to the java scraper's pubsub adapter preventing the main thread from shutting down?
    • Yes, the job does complete, but the memory utilization does not go down. At the time the green line (next scraper job) starts, the previous job scraper-job-1589753321112 has already completed with scraper-job-1589753321112 1/1 2m2s 2m3s, but its memory claim (pink) does not go down, even after a minute.
      (screenshot)
      OK, it finally goes down, after almost 5 minutes (pink).
      (screenshot)
      Another example: at this time point, 748 (green) and 452 (yellow) have already completed, but they didn't release their memory immediately, while new jobs were already incoming (orange and blue on top). Eventually they went down after 4-5 minutes, but that already caused a memory spike approaching the node memory capacity of 4G.
      (screenshot)
      Related SO question to this. Keywords: gc, garbage collector, completed job pod resource

Some ideas to tackle this situation

  • Delete the job when the scraper job finalizes (a hedged sketch follows)
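
A hedged sketch of that cleanup with @kubernetes/client-node; the propagation policy is what makes the job's pod go away too. The names and the older positional client signature are assumptions:

```typescript
import * as k8s from '@kubernetes/client-node';

const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const batchApi = kc.makeApiClient(k8s.BatchV1Api);

// Delete the finished job and cascade the delete to its pod, so the pod's memory claim is
// released right away instead of lingering for minutes.
async function deleteFinishedScraperJob(jobName: string, namespace = 'selenium-service'): Promise<void> {
  await batchApi.deleteNamespacedJob(
    jobName,
    namespace,
    undefined,    // pretty
    undefined,    // dryRun
    undefined,    // gracePeriodSeconds
    undefined,    // orphanDependents
    'Foreground', // propagationPolicy: also remove the job's pods
  );
}
```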

Shape of one node:
(screenshot)
Another:
(screenshot)
It doesn't look like the overlapping issue is solved, but it's slightly better than before. One guess is that the job "object" is deleted, but its pod remains there occupying resources.

Looks like it works perfectly now!
(screenshot)

And another node:
(screenshot)

Final check

  • We may also try a "strict resource limit" test and see if scraper jobs can survive. If so, GC will probably work harder to recycle, and then we may consider raising per-node capacity.
  • 17 nodes / 34 jobs crashed the entire cluster.
  • The rest is a final check with a full-suite s3 job on production. If it runs w/o error, we can merge the pull request! 12 nodes / 24 jobs looks good - it succeeded!
    • One error on groupon page 110: cannot locate review panel; verified by accessing the current webpage, and indeed the review panel vanished. Looks like some bug on their end. No need to fix on our end.
    • Nothing wrong with scaling; we got 4 errors overall, all "cannot locate review panel", but this shouldn't have to do with our scaling here, so we created another ticket to deal with it: Ability to skip a problematic page and move on #75.
    • Efficiency benchmark: 20:45 -> 5:15, 8h 30min, 487 success / 4 fail / 450 split size. Impressive! 🎉
    • At a few time points the max memory utilization slightly exceeded 3G; otherwise not much concern. However, this may prevent us from scaling up further, because it looks like the SLK node, which is also the k8 core node, is reaching 3.6G/4G.
      (screenshot)
    • We may need to find a way to let SLK scale. Perhaps SLK should have its own node. Keep in mind that the current base cost is $20 at idle. If we can dynamically scale and keep the idle cost lower than $20, we can afford more when a scraper job is in progress.
      • Some bull github issue comments say that you can just have multiple node main processes running the same javascript code, pointing to the same redis server, and the sandbox processes will automatically be spread across them. So what we can try: k8 deployment replicas=2, and see in grafana if the two SLK instances take workloads evenly.
      • We do want to think about how to scale SLK down. Perhaps a starting point: the primary node pool uses 1v2G, smaller. A new node pool dedicated to SLK uses 1v3G, autoscale 1-3. SLK deployment initial replicas=1. Upon an s3 job request, scale the SLK deployment to 2~3, wait for SLK to scale and reach minimum availability, then start the s3 job. When s3 finalizes, scale the SLK deployment back to 1 (see the scale sketch after this list). For node down-scaling, we are hoping k8 will do that automatically for us, though perhaps it takes a bit longer than 5-10 minutes.
  • Can we increase per-node capacity to 3 while having 12 nodes (a total of 36 concurrency)? Let's look at the peak memory utilization on each node. Make sure there's at least 1.5G of free memory at any given time point.
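
A hedged sketch of the scale-up/scale-down step for the SLK deployment via its scale subresource, with @kubernetes/client-node; the deployment name and namespace are placeholders:

```typescript
import * as k8s from '@kubernetes/client-node';

const kc = new k8s.KubeConfig();
kc.loadFromDefault();
const appsApi = kc.makeApiClient(k8s.AppsV1Api);

// Scale SLK up when an s3 job starts and back down when it finalizes.
async function scaleSlkDeployment(replicas: number, name = 'slk', namespace = 'slk'): Promise<void> {
  const { body: scale } = await appsApi.readNamespacedDeploymentScale(name, namespace);
  scale.spec = { replicas };
  await appsApi.replaceNamespacedDeploymentScale(name, namespace, scale);
}

// e.g. await scaleSlkDeployment(3);  // before dispatching the s3 job
//      await scaleSlkDeployment(1);  // after the s3 job finalizes
```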

@rivernews rivernews changed the title Search for better cost-effficiency cloud provider Search for better cost-effficienct cloud provider May 18, 2020
@rivernews rivernews changed the title Search for better cost-effficienct cloud provider Search for better cost-effficient cloud provider May 18, 2020

rivernews commented May 19, 2020

Scaling SLK

Recap the workflow

  • Test Bull cross-process concurrency on the same node first
    • SLK scale = 2
    • Start the s3 job and observe (15 worker nodes, 2 per node)
    • Confirm s3 finishes w/o error
  • K8 core set to smaller droplets, 1v2G ... $10
  • K8 new node pool for SLK, 1v3G ... $15 (so the idle-stage cost increases from $20 -> $25)
    • Change the microservice tf module to make replicas and node anti-affinity configurable
    • Autoscale 1~3
  • SLK deployment: nodepool=SLK, replicas=1, node anti-affinity enabled
  • (Start s3 job)
  • Scale up SLK, scale=3, poll to wait for minimum availability
  • (Start jobs)
  • (...)
  • (S3 finalizing)
  • (Scale down selenium stack. Scale down scraper worker nodes)
  • Scale down SLK, scale=1
  • (Finish s3 job)

The core node pool on 1v2G is too small and unstable. We got network and nginx crashes, even though the SLK nodes were fine. We then had to upgrade the core node pool droplet size. In the end it defeats the point - we could just simply use a 4v8G droplet for both core and SLK, and that's a firm $40 per month. You can scale the entire k8s down by terraform - right, it's not perfect in that sense though. I don't know, maybe we need a separate k8 cluster to "manage" the k8 used for the scraper.

Also, we would like to populate many companies currently missing in our database, including:

  • Intel
  • ...


The scaling up and down is quite stable right now. Further automation would find it really hard to maintain the same cost level of around $20-40 monthly, and it can't really save us money while keeping the ability to scale up.

Summary

The current max capacity is 60 jobs, using a core droplet size of 8G RAM - either 4v8G or memory-optimized 1v8G. The basic monthly bill is $40 at this memory size. But we figured out a way to bypass letsencrypt's duplicate-certificate limit, so we can always scale down the entire k8s cluster. Of course, we still have to do this in the terminal. It would be ideal to have a meta-service running, at least to trigger a travis job that triggers k8s provisioning / deletion. Perhaps heroku could be a good place to do this due to its free plan.

The scaling-up cost is additional, using 2v4G machines.
