Releases: run-house/runhouse
v0.0.29
Highlights
This release improves autostop stability and robustness considerably, and introduces the ability to send an env or module to a specific node in a multinode cluster.
Improvements
- Simplify and improve Autostop by @rohinb2 and @dongreenberg in #895, #894
- Send env to a specific `node_idx` by @rohinb2 in #835 (see the sketch after this list)
- Update secrets login flow to be more opt-in by @carolineechen in #880
- Show information about active function calls in cluster.status() by @rohinb2 in #871 and #896
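A minimal sketch of the new node-targeting capability (assumptions: `num_instances` as the multinode cluster argument, and the `node_idx` argument introduced in #835; all names are illustrative):

```python
import runhouse as rh

# A 2-node on-demand cluster; node 0 is the head node.
cluster = rh.cluster(
    name="my-multinode",
    instance_type="CPU:2+",
    num_instances=2,
).up_if_not()

# Install and serve this env on the second node rather than the head node.
worker_env = rh.env(name="worker_env", reqs=["numpy"])
worker_env.to(cluster, node_idx=1)
```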
Bugfixes
- [bug] Make `disable_den_auth` actually sync by @rohinb2 in #865
- Move config.yaml creation to restart server() by @BelSasha in #868
- Bump SkyPilot Version to 0.6.0 and fix remote SkyPilot dependencies on Start by @dongreenberg in #855
- Consolidate periodic loops into one function updating Den and updating autostop. by @rohinb2 in #873
- Fix cluster factory bug with den_auth clusters not being saved. by @rohinb2 in #878
- Remove resource conversion check for secrets by @carolineechen in #881
Docs
- Clarify setup in docs and den quick start by @mkandler in #876
- Update status docs by @BelSasha in #889
- Llama 3 vLLM GCP example by @mkandler in #893
- Fix bug in starting example code block by @mkandler in #884
- Adds quotes to pip install in examples by @mkandler in #886
- Update secrets login in api tutorial by @carolineechen in #882
Testing
- Update multinode cluster fixtures. by @rohinb2 in #856
- minor changes to cluster status tests by @BelSasha in #891
- Group status tests together by @dongreenberg in #899
- Reorganize default env tests and consolidate fixture into GCP fixture by @dongreenberg in #900
- Stop overwriting local dotenv in tests. by @dongreenberg in #901
- Consolidate static cluster fixtures into one by @dongreenberg in #902
- Change AutostopServlet into AutostopHelper, and test properly by @dongreenberg in #897
- cluster status scheduler tests by @BelSasha in #869
Full Changelog: v0.0.28...v0.0.29
v0.0.28
Highlights
`runhouse status`: Improving visibility into cluster utilization and memory consumption
Improved Cluster Status
Runhouse now provides a more comprehensive view of the cluster's utilization and memory consumption, with coverage of the true utilization numbers across each worker and the head node of the cluster.
Information surfaced includes: PID, CPU utilization, memory consumption, and GPU utilization (where relevant).
This data can be viewed as part of the `runhouse status` CLI command:
GPU Cluster

```
>> runhouse status
/sashab/rh-basic-gpu

😈 Runhouse Daemon is running 🏃
Runhouse v0.0.28
• server pid: 29486
• server port: 32300
• den auth: True
• server connection type: ssh
• backend config:
  • resource subtype: OnDemandCluster
  • use local telemetry: False
  • domain: None
  • server host: 0.0.0.0
  • ips: ['35.171.157.49']
  • resource subtype: OnDemandCluster
  • autostop mins: autostop disabled

Serving 🍦 :
• _cluster_default_env (runhouse.Env)
  This environment has only python packages installed, if such provided. No resources were found.
• np_pd_env (runhouse.Env) | pid: 29672 | node: head (35.171.157.49)
  CPU: 0.0% | Memory: 0.13 / 16 Gb (0.85%)
  • /sashab/summer (runhouse.Function)
  • mult (runhouse.Function)
• sd_env (runhouse.Env) | pid: 29812 | node: head (35.171.157.49)
  CPU: 1.0% | Memory: 4.47 / 16 Gb (28.95%)
  GPU: 0.0% | Memory: 6.89 / 23 Gb (29.96%)
  • sd_generate (runhouse.Function)
```
CPU Cluster

```
>> runhouse status
/sashab/rh-basic-cpu

😈 Runhouse Daemon is running 🏃
Runhouse v0.0.28
• server pid: 29395
• server port: 32300
• den auth: True
• server connection type: ssh
• backend config:
  • resource subtype: OnDemandCluster
  • use local telemetry: False
  • domain: None
  • server host: 0.0.0.0
  • ips: ['52.207.212.159']
  • resource subtype: OnDemandCluster
  • autostop mins: autostop disabled

Serving 🍦 :
• _cluster_default_env (runhouse.Env)
  This environment has only python packages installed, if such provided. No resources were found.
• sd_env (runhouse.Env) | pid: 29716 | node: head (52.207.212.159)
  CPU: 0.0% | Memory: 0.13 / 8 Gb (1.65%)
  This environment has only python packages installed, if such provided. No resources were found.
• np_pd_env (runhouse.Env) | pid: 29578 | node: head (52.207.212.159)
  CPU: 0.0% | Memory: 0.13 / 8 Gb (1.71%)
  • /sashab/summer (runhouse.Function)
  • mult (runhouse.Function)
```
Improvements
- Cluster status displays additional information. (#653)
- Polling den with cluster status data (#806)
- Prevent exposing user Runhouse API tokens on the cluster by saving a modified hashed API token (#797) (see the sketch after this list)
- Use env vars in default env creation (#798)
- Login flow improvements (#796)
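As a rough illustration of the hashed-token idea (not Runhouse's exact scheme), the cluster stores only a one-way digest and compares digests on each request, so the plaintext token never lives on the cluster:

```python
import hashlib

def hash_token(token: str) -> str:
    # Only this digest would be saved on the cluster; the raw token stays client-side.
    return hashlib.sha256(token.encode()).hexdigest()

def is_authorized(presented_token: str, stored_digest: str) -> bool:
    # Compare the digest of the presented token against the stored digest.
    return hash_token(presented_token) == stored_digest
```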
Bug Fixes
- Fix undefined path when pip installing a folder (#826)
- Don't pass basic auth to password cluster HTTP calls (#823)
- Fix env installations that contain a provider secret (#822)
- Refresh sys.path upon loading a new module (#818)
Docs & Examples
v0.0.27
Highlights
Custom cluster default env support and lots of new examples!
Cluster Default Env
Runhouse clusters now support a `default_env` argument to provide more flexibility and isolation for your Runhouse needs. When you set up a cluster with a default env, Runhouse first installs the env on the cluster (any package installations and setup commands), then starts the Runhouse server inside that env, whether it is bare metal or even a conda env. Future Runhouse calls on or to the cluster, such as `cluster.run(cmd)`, `rh.function(local_fn).to(cluster)`, and so on, will default to running in this default env. Simply pass any Runhouse Env object, including its package requirements, setup commands, working dir, etc., to the cluster factory.
```python
import runhouse as rh

my_default_env = rh.env(
    name="my_default_env",
    reqs=["pytest", "numpy"],
    working_dir="./",
)
my_conda_env = rh.conda_env(name="conda_env1", env_vars={...})  # conda env

cluster = rh.cluster(
    name="my_cluster",
    instance_type="CPU:2+",
    default_env=my_default_env,  # or my_conda_env
)
cluster.run(["pip freeze | grep pytest"])  # runs on default_env
```
Improvements
- Introduce support for custom cluster default env (#678, #746, #760)
- Start our own Ray cluster instead of using SkyPilot's (#742)
- Exception handling for Module (#747)
- Disable timeout in AsyncClient (#773)
- Only sync rh config to ondemand cluster (#782)
Bug Fixes
- Set CPUs for ClusterServlet to 0 (#772)
- Previously, the cluster servlet was taking up 1 CPU resource on the cluster; this is now set to zero.
- Set den_auth default to None in cluster factory (#784)
- A non-None default argument causes the cluster to be reconstructed from scratch (rather than reloaded from RNS) if there's a non-None argument mismatch.
Docs & Examples
See also docs and examples webpages.
New Examples
- Llama3 (#741, #743, #744)
- Parallel embedding (#759, #779, #783, #792)
- Hyperparameter optimization (#770)
- Llama2 fine-tuning with LoRA (#771)
New Tutorials
- Async tutorial in docs (#768)
v0.0.26
v0.0.25
Improved parallelism, clearer exceptions, and saving resources within Den orgs
Improvements
- Improve the thread, reference, and fault tolerance model for EnvServlet ray actors (#735, #733, #736, #734, #737)
- Catch all non-deserializable exceptions client-side (#730)
- Support for saving resources on behalf of an org (#676, #732)
Bugfixes
- Dynamically set `API_SERVER_URL` (#708)
- Move OMP_NUM_THREADS setting into servlet to avoid setting it on import (#731)
Full Changelog: v0.0.24...v0.0.25
v0.0.24
Fast-follow bugfixes for CPU parallelism and log streaming
Bug fixes
- Fix ray persistently setting OMP_NUM_THREADS=1 (#723)
- Fix method call log streaming by unbuffering stdout/err in call threadpool (#724)
Full Changelog: v0.0.23...v0.0.24
v0.0.23
Richer async support, performance improvements, and bugfixes
Improvements
- Client-side Async support (#690, #696, #689) - We've improved the way we handle async calls to remote modules. Now, you can properly unblock the event loop and await any remote call by passing `run_async` as an argument into the method call (see the sketch after this list). If your method is already defined as async, this will be applied automatically without specifying `run_async`, so your code can `await` the remote method just as it did the original. You can still explicitly set `run_async=False` in that case to make the local call sync.
- Improve Mapper ergonomics and docs (#700, #709) - Now you can simply pass a function to the mapper and it will send over the module and create replicas on its own. We'll publish new mapper tutorials shortly.
- Cache rich signature for Module to improve method call performance (#699)
- Don't serialize tracebacks in OutputType.EXCEPTION (#721) - Sometimes exceptions can't be deserialized locally because they depend on remote libraries. In those cases, we now still print the traceback for better visibility.
- Unset OMP_NUM_THREADS when Ray automatically set it because it may break user parallelism expectations (#719)
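A minimal sketch of the async pattern described above (cluster setup and the `slow_sum` function are illustrative assumptions):

```python
import asyncio
import runhouse as rh

def slow_sum(a, b):
    return a + b

async def main():
    # Assumes a cluster is already configured or saved under this name.
    cluster = rh.cluster(name="my-cluster", instance_type="CPU:2+").up_if_not()
    remote_sum = rh.function(slow_sum).to(cluster)

    # Passing run_async=True makes the remote call awaitable
    # instead of blocking the event loop.
    result = await remote_sum(2, 3, run_async=True)
    print(result)  # 5

asyncio.run(main())
```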
Bugfixes
- Fix stdout and logs streaming in various scenarios (#716, #717)
- Remove unused `requests.Session` created in HTTPClient (#694)
- Change Caddy installation to download from GitHub (#702) (Sorry Caddy!)
- Inherit Cluster READ access for resources on the cluster (#706)
- Set the cluster name in the HTTPClient upon rename (#704)
- Fix some `runhouse login` bugs (#717)
- Make errors from Den include status code and be more verbose (#707)
- Fix SkySSHRunner tunnels and processes to be correctly cleaned up (#718)
Full Changelog: v0.0.22...v0.0.23
v0.0.22
Performance improvements + bug fixes
Improvements
- Add to open_ports when creating new on demand cluster (#651)
- Updates to Sagemaker Cluster (#654)
- Change `AuthCache` logic to request per keypair (#684)
Performance Improvements
- Cache various module/function computations (#661, #665, #662)
- Async daemon side components (#656, #664, #673, #674, #670)
- Use ThreadPoolExecutor for synchronous function calls on the server side (#663)
- Decrease log wait time (#685)
Bug Fixes
- Fix bug with json serialization for exceptions (#655)
- Update returned exceptions to be JSON serializable.
- Use shell for running cmd in env servlet (#667)
- Previously shell commands would not consistently work.
- Fix cluster autostop (#672, #681, #683)
- Change to correctly set and update last activity time and do it in a background thread
- Fix multinode cluster ips (#681)
- Cluster ips previously computed from cached ips and would incorporate stale ones. Update to use only current ips.
Examples
v0.0.21
Some performance and feature improvements, bug fixes, and new examples.
Improvements
- OpenAPI pages for cluster (#579, #586, #587, #589, #590)
- Properly raise exceptions in Module's `load_config` when a dependency is missing (#595)
- Kill Ray actors by default during `runhouse stop` (#596)
- `module.to(rh.here)` throws an error if the local server is not initialized (#597)
- Send exceptions in `data` field (#602)
- Run commands inside env servlet (#603)
- Return exceptions instead of `None` in failed mapper replicas (#605)
- Remove `sshtunnel` library dependency (#625, #634, #640)
- Don't save cluster secret during cluster init (#633)
- Remove creds from cluster's config file (#637)
Performance
- Use `check_server` instead of `is_up` with refresh for ondemand cluster endpoint (#614)
- Remove `register_activity` calls within env servlet (#629)
Bug Fixes
- Install aws dependencies properly for `runhouse[aws]` (#613)
- Fix env servlet name in `put_resource` (#626)
  - Env servlet was using the conda env name instead of the env resource name.
- Fix SkySSHRunner local and remote port ordering (#630)
BC-Breaking
- Remove previously deprecated items (#624)
  - `reqs` and `setup_cmds` in `rh.function.to()` removed. Pass them into the `env` instead.
  - `access_type` removed in `Resource` and `share`. Use `access_level` instead.
  - Global pinning methods removed. Use `rh.here.put/get/delete/keys/clear` instead.
- Deprecate and raise an exception for passing system into function/module factories (#625)
  - Passing `system` to `rh.function/module` does not send code to the system and can be misleading. Use `.to` or `get_or_to` to sync code to the cluster (see the sketch below).
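In code, the migration looks roughly like this (a sketch; the cluster name and the `name` kwarg on `get_or_to` are illustrative assumptions):

```python
import runhouse as rh

def local_fn(x):
    return x * 2

my_cluster = rh.cluster(name="my_cluster")

# Before (now raises an exception): passing system did not actually sync code.
# remote_fn = rh.function(local_fn, system=my_cluster)

# After: explicitly sync the function's code to the cluster.
remote_fn = rh.function(local_fn).to(my_cluster)
# Or reuse a previously synced copy if one exists (hypothetical name kwarg):
# remote_fn = rh.function(local_fn).get_or_to(my_cluster, name="local_fn")
```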
Examples
See rendered examples on https://www.run.house/examples
New Examples
- Mistral 7B Inference with TGI on AWS EC2 (#585, #604)
- Mistral 7B Inference on AWS Inferentia (#609)
- Langchain RAG App on AWS EC2, with Custom Domain (#607, #621)
- Llama2 on EC2 A10G (#608)
- Llama2 Inference with TGI on AWS EC2 A10G (#610)
Updates
v0.0.20
Highlights
Cluster Sharing
We’ve made it easier to share clusters across different environments and with other users. You can now share and load a cluster just as you would any other resource.
```python
import runhouse as rh

my_cluster = rh.cluster("rh-cluster", ips=[...], ...)
my_cluster.share(["user1@email.com", "username2"])

# load the box with
shared_cluster = rh.cluster("owner_username/rh-cluster")
```
Shared users will be able to seamlessly run shared apps on that cluster, or SSH directly onto the remote box. To enable this, we persist the SSH credentials for the cluster as a Runhouse Secret object, which can easily be reloaded when another user tries to connect.
Improved rh.Mapper
`rh.Mapper` was first introduced in Runhouse v0.0.15 as an extension of functions/modules to handle mapping, replicating, and load balancing. This release includes further improvements and some bug fixes, plus a BC-breaking variable renaming (see the section below).
```python
def local_sum(arg1, arg2, arg3):
    return arg1 + arg2 + arg3

remote_fn = rh.function(local_sum).to(my_cluster)
mapper = rh.mapper(remote_fn, replicas=2)
mapper.map([1, 2], [1, 4], [2, 3])
# output: [4, 9]
```
Improvements
- Use hashed subtoken for cluster requests (#270)
- Simplify storage of SSH creds for more reliable cluster access across environments and users (#479)
- Remove sky storage dependency (#415)
- Replace subprocess check_call with run (#503)
- Serialize exceptions properly (#516)
- Improved Logging
Bug Fixes
- Mapper bug fixes (#539)
Deprecation
BC-Breaking
- `rh.mapper` factory function args renaming (sketched below):
  - `num_replicas` -> `replicas`
  - `replicas` -> `concurrency`
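Roughly, under the new argument names (a sketch; the function and cluster names are illustrative):

```python
import runhouse as rh

def local_sum(arg1, arg2, arg3):
    return arg1 + arg2 + arg3

remote_fn = rh.function(local_sum).to(rh.cluster(name="my_cluster"))

# Before v0.0.20:
# mapper = rh.mapper(remote_fn, num_replicas=4)

# v0.0.20+: `num_replicas` is now `replicas`, and the argument previously
# called `replicas` is now `concurrency`.
mapper = rh.mapper(remote_fn, replicas=4, concurrency=2)
```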
Docs
See updated tutorials on Runhouse docs
- New quick start guides -- local, cloud, and Den versions
- Updated API tutorials -- clusters, functions & modules, envs, folders
Examples
See new Runhouse examples on GitHub or webpage
- Llama2 inference on AWS EC2
- Stable Diffusion XL 1.0 on AWS EC2
- Stable Diffusion XL 1.0 on AWS Inferentia
Other
- Remove paramiko as server connection type