Toward Decentralized HPC by the Minutes: DeepSquare x Akash Network #623
-
Hey Domi! Thanks for putting this up for discussion. Have you considered doing a test round with 2-3 providers to see how it goes? The only reason I suggest this is that I noticed some details are missing from your text, such as the number of team members, pay amounts, and providers. How much funding and time would you need to do a test round and share initial results and findings? How soon will these providers start to earn extra for integrating with your app? What are the risks involved with this?
-
Hi Domi, thanks for taking the lead in driving the engagement with Akash. My name is Luca and I have been involved in the DeepSquare project since the very beginning (two and a half years). I have seen the project develop and have worked very closely with Florin and Charly, as they can testify. I am also an early investor and have been an executive advisor to the project, contributing to countless activities, so I have been exposed to a number of the pros and cons of this project and learnt some hard lessons about what worked and what didn't. I mention this because it would make no sense to duplicate efforts and errors; sharing what has been learned so far is key if we really want this project to take off. I know we truly have something special in our hands. We just need to bring together the right people with the right commitment and put our heads down to make it happen. I know we can, so let's do it! It goes without saying that you can include me as part of the DeepSquare team to drive and contribute to future activities. In the meantime, below are my inputs on the priorities, based on the learnings gathered over the past 30 months:
Akash:
DeepSquare:
In the past years we have had multiple exchanges (10+) with HPC infra providers, all very interested in what we do, all very supportive, all very excited, BUT when it came down to actually doing things, progress stalled because we did not fit with their core priorities. So we need to be up there in terms of technology interest and business visibility within Akash, otherwise we will just waste time. Happy to dive deeper into the above points and more. Luca
-
I can get a Discord group chat going to help better organize things. What's your Discord, @StartmeupLuca?
-
Ok, I just sent you a message on Telegram. I am having trouble logging in
to Discord. I'll fix the issue and join the group to start the
discussion.
Luca Esposito
DeepSquare
…On Sat, 3 Aug 2024 at 08:24, Domi ***@***.***> wrote:
ideally Telegram. @Namaskaram_Luca
mine is @dominikusbrian <https://github.com/dominikusbrian> I tried to
send message to your account. See you around in Telegram then. We also have
a channel inside the DeepSquare Discord with Charly and Florin, and some
others.
-
I just fixed my Discord account. Can you please add me to the group? My
username is StartmeupLuca.
-
Hey @dominikusbrian, this proposal sounds great and I believe it could really benefit Akash as well as DeepSquare. As you mention, it can help some Akash providers and DeepSquare build their infrastructure to onboard retail users and smaller enterprises/organizations. As @instafinanzas mentions, I think you need to come up with a funding/budget figure that you think will help you get started, maybe run a pilot with a few providers, and join some of the SIG calls to talk to Akash core team members about this.
-
I really appreciate this proposal and am excited about its potential to significantly enhance the capabilities of the Akash Network. The integration with DeepSquare to offer "Rent HPC by the Minutes" is a game-changer, making high-performance computing more accessible to researchers, entrepreneurs, and engineers worldwide. The approach outlined here could open new doors for innovation and scientific discovery. I'm particularly interested in getting involved with this project, especially in a community management role or any other way I can contribute. I've been thinking about similar ideas, and I believe that Akash should definitely be exploring these kinds of initiatives to broaden its impact and utility. |
-
Updated Q&A related to the proposal above:
On point no. 1: The cluster in Sion has the following specs
They have a list of other hardware (22 GPUs, worth 600-700k) that is currently idle too. It is not part of the proposal at present, but could be included in some way when necessary. A full manifest for these 22 GPUs will be made available as part of the Integrated Whitepaper.pdf in a few days.
On point no. 2: Not yet. The cluster in Sion is currently off the grid. Bringing it up is part of Phase 1, which sets the cluster up as a full provider and analyzes compatibility between ClusterFactory (the system that runs DeepSquare) and all the settings needed for Akash providers.
On point no. 3: The key difference is the task-based orientation of DeepSquare compared to the deployment-based workload of the current Akash system. Currently, if this is to be done within the Akash ecosystem, the user must be able to bulk-deploy each parallel task as an independent deployment lease, figure out each hardware spec by themselves, and maintain all the parallel jobs running across the ecosystem. This can be done, but it requires a high degree of technical prowess. For instance, this snapshot with a spike in April is a testament that it can be done (albeit not easily; image attached).
Using the DeepSquare Portal (the client for DeepSquare), the user simply needs to submit the overall SLURM script task manifest, and the rest is taken care of by DeepSquare Metascheduling and ClusterFactory, which handle the setup of each deployment lease and fit as many jobs as possible within each of them. The system also maintains the health of each deployment and the status of each task, and automatically re-runs or notifies if some fail. In this way DeepSquare makes deploying massively parallel tasks manageable for end users and developers.
To give a specific example, one of my personal use cases is to train AI agents, each with a different personalized knowledge dataset (I call them PhDBot). I need to submit bulk training instances where I loop through different base models, then loop through different datasets, and finally through a different hyperparameter space for each model. If I were to do this via console.akash.network, as an end user it is almost not an option, because I have to do so much beyond actually submitting the job manifest, which is usually the only thing needed for an HPC task on a university's or company's own HPC cluster. Currently with Akash, I can't just submit a single script with 3 loops doing just that. I would need to take care of all the automated deployment through the CLI, handle the complex bidding scheme, devise strategies on hardware requirements for bidding, take care of billing oversight manually, oversee task health myself, and much more.
With DeepSquare connected to Akash, this is taken care of: the user simply submits a single Workflow file specifying the job manifest using SLURM scripting, which is well known in the HPC context. DeepSquare will then scan through all the compute nodes within the grid (either in their own cluster or across the expandable cloud they connect to) and perform the cross-hardware HPC management, followed by all the tasks described above. The user then monitors how all the jobs are performing from a single dashboard. I hope this helps in understanding the kind of added value the existing DeepSquare tech stack can bring to the Akash Network ecosystem. I would welcome more on-point questions like this, either here or on the GitHub discussion.
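As a rough illustration of the "single script with 3 loops" idea, here is a minimal Python sketch that flattens the three loops (base models, datasets, hyperparameters) into one list of tasks and renders a single plain SLURM job-array script from it. The model names, dataset names, learning rates, and the `python train.py` command are placeholders invented for the example, and the output is a generic SLURM batch script, not the actual DeepSquare Workflow file format.

```python
from itertools import product

# Hypothetical sweep axes: placeholder names, not real models or datasets.
base_models = ["model-a", "model-b"]
datasets = ["dataset-1", "dataset-2", "dataset-3"]
learning_rates = [1e-4, 3e-4]

# The three nested loops, flattened into one list of task configurations.
tasks = [
    {"model": m, "dataset": d, "lr": lr}
    for m, d, lr in product(base_models, datasets, learning_rates)
]


def render_sbatch(tasks, train_cmd="python train.py"):
    """Render one SLURM job-array script with one array index per task."""
    lines = [
        "#!/bin/bash",
        "#SBATCH --job-name=phdbot-sweep",
        f"#SBATCH --array=0-{len(tasks) - 1}",
        "#SBATCH --gres=gpu:1",
        "",
        "case $SLURM_ARRAY_TASK_ID in",
    ]
    for i, t in enumerate(tasks):
        # train_cmd and its flags are assumptions made for this sketch.
        lines.append(
            f"  {i}) {train_cmd} --model {t['model']} "
            f"--dataset {t['dataset']} --lr {t['lr']} ;;"
        )
    lines.append("esac")
    return "\n".join(lines)


script = render_sbatch(tasks)
print(f"{len(tasks)} tasks in one submission")
print(script)
```

On a conventional cluster this text would be saved and submitted once with `sbatch`, and the scheduler would fan the array indices out over the available nodes. That single submission is the experience the comment above contrasts with managing one deployment lease per task.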
-
Thanks for submitting this proposal. I read through it and am finding it really hard to understand how this fits with the goal and strategy of driving user adoption on Akash. While the main benefit (for Akash) mentioned here is "Rent HPC by the Minutes", the reality is that this is already possible on Akash. It may not be the easiest to do today for ALL users (it IS for some), but it is possible, and it is being worked on from many angles to be improved (as noted below). I think if DeepSquare has a bunch of researchers using it (and running SLURM clusters on it) but struggling to get GPUs, then this integration may make sense, and maybe it does have such users? In the absence of that, to me this seems like a way to try to bootstrap DeepSquare rather than to drive adoption for Akash. To expand a bit on what I mean, here are the main options users have today when considering using Akash for their workloads (AI or other):
From what I can tell, DeepSquare is trying to build an "Akash-like" service but using SLURM to do so (instead of Akash's approach of using K8s + Akash bidding software). This seems unnecessary if you are intent on driving adoption of Akash. Unless you are intent on building another decentralized compute marketplace, which is fine, but then it isn't a collaboration with Akash and takes away from the primary goal of driving adoption on Akash.
One of the things Akash struggles with (I think) when it comes to user adoption is that users have to know how to dockerize their app and write an SDL (YAML) file. This has been a barrier for many (aside from other things like crypto dependency, etc.) and is something the core team and community are working hard to solve with various solutions, ranging from building a CI/CD workflow (like Vercel) to providing access to models via an API or a "VM-like" experience with containers. The solution proposed here seems to go in the opposite direction by adding another YAML file that users need to be able to understand and code in:
If this was proposing building a client that doesn't worry about orchestration and scheduling (like Brev or Prime Intellect), it may have made sense to support it. But even with that, historically, the community has supported two paths when it comes to such clients:
Those things aside:
At the end of the day I think I'll go back to what I said at the top of this comment. This would make sense to pursue if:
-
"DeepSquare HPC Cluster based in Sion Switzerland (12GPUs multi node) after maintenance work is completed" Is that the entire cluster DeepSquare has today? Is that 12 GPUs or 12 nodes?
-
The above paragraph oscillates between suggesting that the DeepSquare cluster will become a provider on Akash Network and suggesting that Akash providers will somehow integrate with it (which makes it sound like it is a client?). If it is the first (becoming a provider), then I'm unclear on why there is "integration" work involved. Isn't it essentially setting up a provider (K8s + Akash software) like the 60-70 other clusters on the network today? Or are you suggesting that the Akash provider software will somehow run on DeepSquare's existing provider software (which I presume is a SLURM cluster?), which seems like a really hard if not non-viable solution (given that Akash runs on K8s)?
-
Honestly I do not see the point of this proposal.
I would suggest that the DeepSquare team start slowly, demonstrate in practice the benefits of whatever this proposal is about, and start contributing to Akash before asking for funds out of the blue. In this state Europlots will 100% vote no if this ever hits the chain. Sorry for my bluntness, but let's get it out. Shimpa
-
Hi shimpa, thank you for the frank and detailed comments on this.
Here's the hardware inventory part related to network performance
On point no. 2: The goal is to integrate the DeepSquare technology that enables a simple, straightforward UX for HPC deployment, and the middleware that makes that possible, on Akash Network. The goal is not simply to become a mere provider, though that will be a natural part of it. The goal is also to enable DeepSquare technology to run alongside the existing Akash ecosystem, and in the end to merge into it once all the tests, validation, and issues have been resolved.
On point no. 3: The offering here covers all three aspects of hardware, middleware, and deployment client. It spans a whole year of development, comprising three essential phases and various milestones. The most important of all is the DeepSquare HPC software infrastructure that makes it possible to run typical HPC workloads and distributed systems.
On point no. 4: Sure, we would be open to more feedback to make the proposal more understandable in terms of the goals and purposes being pursued here. The ultimate goal is to capture retail HPC users who currently do not have access to HPC infrastructure provided by universities, enterprises, or national research agencies, and who at the same time don't have the huge funding and expertise needed to build their own HPC system.
-
After thorough discussions, this proposal is now refocused on first delivering a feasibility demonstration system built on top of Akash Network.
-
Akash x DeepSquare
Project Lead: Domi, Luca, Florin
Objective:
Today, as never before, the demand for computing resources is rapidly increasing to support the latest technologies such as Artificial Intelligence, Digital Twins, Smart Cities and more. Many research labs, startups, and companies around the world regularly perform scientific computing and high-resolution simulation use cases that often require a network of tens to hundreds of GPUs and hundreds of CPUs or more working together to solve challenging tasks.
Traditionally, these users have had to secure a research fund or budget of a million dollars or more from a funding agency or a corporation in order to build their own HPC (High Performance Computing) system or rent a full-time facility. The cost is prohibitive for many and inhibits economic progress and scientific discovery.
Through the DeepSquare and Akash integration, we aim to offer the capability to “Rent HPC by the Minutes”, making high-end computing tasks accessible and affordable to more Researchers, Entrepreneurs, and Engineers across the world. This allows Akash not only to capture a share of this rapidly growing high-end computing market, but also to optimize infrastructure utilization by running multiple jobs spanning many CPUs/GPUs.
We propose to integrate Akash Network's ever-growing supply of high-quality decentralized computing infrastructure with DeepSquare's battle-tested Decentralized HPC system, which will accelerate Akash's time to market for capturing the HPC segment of the market.
Project Timeline [September 2024 - September 2025, 3 Phases]
Pre-Project Preparation [July-August 2024] : Discussion, Proposal Draft, Pitch, and Submission
Phase 1 (09/2024 - 12/2024) : DeepSquare Cluster to become Akash Provider and do RnD on the Integrated System
Phase 2 (01/2025 - 04/2025) : Development of DeepSquare Deployment Module for existing Akash Providers
Phase 3 (05/2025 - 09/2025) : DeepSquare x Akash Blockchain Functionality Integration
Budget / Resource request summary
Total budget request of $160,000, composed of $50,000 per phase (with a different composition of dev and incentive allocation in each) plus a $10,000 extra reserve as a volatility and liquidity buffer.
The engineering development budget is to be allocated to various tasks (detailed in man-hours) over the whole project duration, while the rest of the budget will be for technical RnD purposes and community incentives. Details are given in the budget allocation table of the Integration Whitepaper. The project budget will be disbursed and maintained through a publicly visible multisig AKT wallet with the address : …TBD...
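For readers who want to check the arithmetic of the request, the following short Python sketch recomputes the total from the per-phase figure and the reserve stated above. The variable names are illustrative only, and the dev versus incentive split inside each phase is not modeled because it is given only in the whitepaper table.

```python
# Figures taken from the budget summary above.
PHASES = 3
PER_PHASE_USD = 50_000
RESERVE_USD = 10_000

# Three equal phase budgets plus the volatility/liquidity reserve.
total_usd = PHASES * PER_PHASE_USD + RESERVE_USD
reserve_share = RESERVE_USD / total_usd

print(f"total request: ${total_usd:,}")
print(f"reserve share: {reserve_share:.2%}")
```

The reserve therefore works out to one sixteenth of the total request.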
Integration Roadmap
The proposed initial steps to the integration of DeepSquare into Akash will be run in 3 phases.
Phase 1 [DeepSquare to Integrate as Akash Provider]:
Akash Network providers will integrate with DeepSquare technology by accessing the DeepSquare HPC Cluster based in Sion Switzerland (12GPUs multi node) after maintenance work is completed. The cluster uses a highly efficient immersion cooling system capable of reusing the heat it generates to heat the building hosting it, significantly reducing the CO2 footprint, but the oil pump needs to be fixed. This first step, funded by the Akash Network provider budget, will allow Akash providers to integrate with DeepSquare technology and will bring a DeepSquare demo environment back online, ready to be accessed by the providers and used for test-dev/resource and training purposes. This phase will last no more than 4 months (August-September 2024), securing a speedy kick-off of the integration project. In this way DeepSquare will become a provider on the Akash network and an active member of the network. In addition, we also have reserve GPUs and cluster hardware that can be useful for cost-efficient scale expansion when appropriate.
Phase 2 [Development of DeepSquare Integration Module for existing Akash Providers]:
With the Akash providers now having access to the demo and test environment, it will be easier for both teams to work on the DeepSquare integration on the Akash Network. This step will require more time and iterations between teams, with a project timeline expected to be 5 months (October 2024-February 2025). This phase will be the full integration of the 2 technologies, bringing the “Rent HPC by the Minutes” functionality to the whole Akash Network. This phase will also focus on metadata collection to build the foundation models for decentralized HPC infrastructure powered by AI capabilities.
Phase 3 [DeepSquare x Akash Blockchain Functionality Integration]:
This phase will see the integration of DeepSquare Blockchain functionality used for pricing transparency and workload traceability for Compliance and Audit. This step will be an add-on to the existing Akash billing capabilities, specifically for HPC bids and for enhancing the HPC user experience. The DeepSquare HPC blockchain-powered solution will be integrated into the Akash Console interface as part of the unified user experience we want to provide to users. This phase will be rolled out between Q2 and Q3 2025.
By leveraging DeepSquare Cluster Factory, Meta Scheduling and Grid capabilities, we believe the Akash Network will be able to benefit from a true Rent HPC by the Minutes and overall improve: Resource Management and Provider Stability, Provider Bidding Criteria, Hardware and Resource Verification, Direct Bidding Queries, and SSL/TLS Certificate Deployment.
Team that makes this possible:
Luca, Domi, Florin, DeepSquare x Akash Integration Dev Team.
Luca Esposito: Extensive experience in the IT sector, having worked for large corporations like Oracle and Microsoft; specialized in managing pan-EMEA Sales, Pre-sales, Consulting and Channel teams. Startup mentor/coach and angel investor. Luca has been an early investor in the DeepSquare project and advisor to the DeepSquare board since September 2021.
Florin: Dr. Florin Dzeladini is a passionate engineer with expertise in intelligent systems, robotics, AI, and blockchain. With a lifelong love for building and a keen interest in science and technology, he earned a PhD in robotics and intelligent systems. Florin was a founding member of Alpine Intuition and Cohesive Computing, and co-founded DeepSquare.
Dominikus Brian: 5+ years of experience in HPC infrastructure and application development for scientific computing, Full-Stack (Theoretical, Computational, Experimental) Material Scientist, ResearchPreneur in the DePIN, AI, and DeSci space.
DeepSquare Tech Stack GitHub Repository : https://github.com/deepsquare-io
More details can be found in the Integration Whitepaper.
AkashxDeepSquare_IntegrationWhitepaper_v1.pdf
Key Highlight from the Integration Whitepaper
Below are the identified Akash Network challenges that the DeepSquare solution can improve on:
1. Resource Management and Provider Stability:
Issue: Ensuring deployments continue running during ISP (Internet Service Providers) dropouts and accurate API reflection of available resources.
DeepSquare Solution: DeepSquare’s architecture supports resilient deployment and robust resource tracking through its decentralized grid and reliable infrastructure monitoring, ensuring stability and accurate resource reporting.
2. Provider Bidding Criteria and Issues:
Issue: Akash providers are not bidding on certain SDLs, specifically related to GPU requirements, while other providers can.
DeepSquare Solution: DeepSquare’s decentralized and transparent resource management can help in accurately matching provider capabilities with SDL requirements. By leveraging its open-source infrastructure and real-time resource tracking, DeepSquare can ensure more precise bidding and resource allocation.
3. Hardware and Resource Verification:
Issue: Providers may have additional requirements like IP leases and persistent storage that prevent bidding.
DeepSquare Solution: DeepSquare’s infrastructure can integrate robust hardware and resource verification mechanisms, ensuring all providers meet the necessary criteria for SDLs before bidding. This can minimize mismatches and enhance bidding accuracy.
4. Direct Bidding Queries:
Issue: Direct bidding using dseq is questioned due to the necessity of post-bid hardware checks.
DeepSquare Solution: DeepSquare’s comprehensive meta-scheduler optimizes job routing and resource checks pre-bid, enabling efficient direct bidding without compromising on resource verification.
5. SSL/TLS Certificate Deployment:
Issue: Complexity in deploying SSL/TLS certificates and managing DNS, with a preference for Cloudflare over self-hosting.
DeepSquare Solution: DeepSquare’s open-source platform can include streamlined guides and integrated solutions for SSL/TLS deployment, reducing the complexity and providing clear DNS management processes.
DeepSquare’s decentralized, transparent, and efficient HPC platform, coupled with its resource management solutions, can address many of the issues currently faced by the Akash Network community. By integrating these solutions, DeepSquare can provide enhanced bidding accuracy, resource verification, stability, and financial management, ultimately improving the overall user experience.
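The pre-bid matching idea in items 2 to 4 above (checking a provider's hardware and features against a job's requirements before any bid is placed) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Provider`, `JobRequirements`, `can_bid`, and `eligible_providers` are hypothetical names, the inventory data is made up, and none of it reflects the actual Akash bid engine or the DeepSquare meta-scheduler.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    """Hypothetical snapshot of what a provider currently offers."""
    name: str
    gpu_model: str
    free_gpus: int
    persistent_storage: bool
    ip_lease: bool


@dataclass(frozen=True)
class JobRequirements:
    """Hypothetical requirements extracted from a job or SDL."""
    gpu_model: str
    gpus: int
    needs_persistent_storage: bool = False
    needs_ip_lease: bool = False


def can_bid(p: Provider, req: JobRequirements) -> bool:
    """Return True only if the provider satisfies every requirement."""
    return (
        p.gpu_model == req.gpu_model
        and p.free_gpus >= req.gpus
        and (p.persistent_storage or not req.needs_persistent_storage)
        and (p.ip_lease or not req.needs_ip_lease)
    )


def eligible_providers(providers, req):
    """Filter the inventory before bidding instead of checking afterwards."""
    return [p.name for p in providers if can_bid(p, req)]


# Made-up inventory for the example.
providers = [
    Provider("provider-a", "a100", 4, True, False),
    Provider("provider-b", "a100", 1, True, True),
    Provider("provider-c", "rtx4090", 8, False, True),
]
req = JobRequirements(gpu_model="a100", gpus=2, needs_persistent_storage=True)
print(eligible_providers(providers, req))
```

In this toy inventory only `provider-a` passes: `provider-b` has too few free GPUs and `provider-c` has a different GPU model and no persistent storage. A real scheduler would also need fresh, trustworthy inventory data, which is the harder part of the problem the items above describe.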
Contact Info:
Luca, luca@deepsquare.io (General DeepSquare Akash Project lead coordinator)
Domi, domi@dreambrooklabs.com (Akash x DeepSquare Integration) [Akash Insider, DreamBrook Labs]
Florin, florin@deepsquare.io (PoC project coordinator)
Charly, charly@deepsquare.io (Collateral and documentation coordinator)