-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consolidate Amundsen Microservices #30
Conversation
Signed-off-by: verdan <verdan.mahmood@gmail.com>
f72314a
to
f7f0555
Compare
Signed-off-by: verdan <verdan.mahmood@gmail.com>
6710f5e
to
064f3af
Compare
Would we still be able to access the code as packages? I am concerned that for anyone running a custom implementation of Amundsen this would be impossible to transition into. |
#### Phase II: | ||
In this phase, we’ll deprecate the `amundsenfrontendlibrary` and `amundsencommon` repositories/packages. | ||
We’ll move the react application as-is in the `amundsen` repository. | ||
Next will be to deprecate the frontend codebase completely and call the metadata and search endpoints directly from the frontend react application. This will completely remove the frontend Flask codebase. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will change the security model, in that the metadata and search services have mutation endpoints that aren't OK to be exposed unprotected to web. I think people typically secure it by firewalling it off except for approved services, so by virtue of limited functionality from the front-end, it's not a problem.
Should this be figured out now, or in a future RFC?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is also sometimes nice to be able to define "backend-for-frontend style" APIs; that is, API endpoints that are somewhat specific to the UI in question, rather than being generally-useful unopinionated UIs. I suppose that could be supported with a separate blueprint?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a few scenarios for deployment:
- different private services use different OSS libraries as dependency. Those deploy service directly EC2 instead of using docker image.
- different private services build different packages based on existing docker image, oss repo with overlap files. They deploy on k8s.
- besides the two cases, quite a few users build its own proxy classes with private overrided method to support different heuristic.
- private API. the metadata / search service support service to service authentication which is not accessible before by users.
Could you clarify how to upgrade / migrate without interruption for the above cases?
- amundsenfrontendlibrary | ||
- amundsenmetadatalibrary | ||
- amundsensearchlibrary | ||
- amundsencommon |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So common repo is not just for sharing service common code but is shared by databuilder as well. Its end goal is to use common as schema repo for across all repos including databuilder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can't use amundsen for the service repo as we have other repos like gremlin or rds to support different proxies classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
amundsencommon is currently not being actively used by the databuilder, and the only usage I found was the index_mapping, which is very specific to ES, hence databuilder.
But I get your point with the end goal of the amundsencommon. My proposal, in that case, would be, to rename amundsencommon to amundsenmodels
in the new architecture, move that within the monorepo, and release it as a PyPi package that can be used in other places like databuilder and external incubator proxies. thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one of the pain point is to have databuilder drifted from common. I think the common repo should stay as a separate package as it is and make databuilder fully compatible / depends on it.
from that point, I don't quite get why we put it in single mono-repo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO, the value of having the common config bundled with the rest of the microservices (in terms of development workflow and synchronizing changes) is great enough to outweigh the advantages of having the common repo stand on its own and have databuilder depend on it - especially when databuilder can just depend on the monorepo instead.
- amundsensearchlibrary | ||
- amundsencommon | ||
|
||
This architecture creates the following issues and is making it hard for companies to adopt Amundsen, and to customize it accordingly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
having multi-repos have its pros/cons: e.g another lyft LFAI project flyte (https://github.com/flyteorg) has done similar .
Having multiple repos allow if we make metdata service is the true metadata platform/engine and build other frontend on top of it (data X UI).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that each model has its own pros and cons. We will still be able to use the metadata endpoints (even search and custom endpoints) to build other frontends, or use them in other services.
In fact, by using and securing just one service, people will have the access to the complete suite of endpoints, and will still be able to add layers on top of those endpoints to build other applications.
Most of the code is already there and is spread across multiple repositories. I can see this happen in multiple phases. | ||
|
||
#### Phase I: | ||
Merge `amundsenmetadatalibrary` and `amundsensearchlibrary`. This should not take too long and will be easy to maintain the same URL patterns for each under separate directories within one repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you provide more details?
This is not really just about copy code to different repos right?
Each existing service has its own config, even own private API, own private proxy, own docker baked image. what is the path for non disrupted service migration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure thing,
There will not be a separate release of metadata and search after this phase. We will maintain only one config file and a Docker file.
Since, we are not deleting the old services i.e., amundsenmetadatalibray, and amundsensearchlibrary, users can still migrate to the new architecture in parallel with the old system. Because the data is not changing at all, this is possible.
That would allow you to set up the new monolith, and leave the other system running in parallel. then you would change the front-end to use the monolith as the metadata service and shut down the metadata microservice. then same for search service. This would give a path for people to test their deployment in a way that is de-risked. (I will also add this in the RFC doc :) )
Most of the configurations variable in our services are the same i.e., statsd, logging, swagger, proxy credentials
. So combining those configs will not break anything. A single entry point and requirements will make it easier to have one docker image and PyPI release.
config file will look something like this.
METADATA_PROXY_CLIENTS = {
'NEO4J': 'metadata.proxies.neo4j.proxy.Neo4jProxy',
'ATLAS': 'metadata.proxies.atlas.proxy.AtlasProxy',
'NEPTUNE': 'metadata.proxies.neptune.proxy.NeptuneGremlinProxy'
}
class MetadataConfig:
METADATA_PROXY_HOST = os.environ.get('METADATA__PROXY_HOST', 'localhost')
METADATA_PROXY_PORT = os.environ.get('METADATA__PROXY_PORT', 7687)
METADATA_PROXY = os.environ.get('METADATA__PROXY', 'NEO4J')
# For the users who would like to have custom proxy classes. This also enable us to install and use the custom proxies outside this repository
METADATA_PROXY_CLIENT = os.environ.get('METADATA__PROXY_CLIENT', PROXY_CLIENTS[METADATA_PROXY])
METADATA_PROXY_USER = os.environ.get('METADATA__PROXY_USER', 'neo4j')
METADATA_PROXY_PASSWORD = os.environ.get('METADATA__PROXY_PASSWORD', 'test')
class SearchConfig:
# SAME AS METADATA ABOVE
class Config(MetadataConfig, SearchConfig):
LOG_FORMAT = '%(asctime)s.%(msecs)03d [%(levelname)s] %(module)s.%(funcName)s:%(lineno)d (%(process)d:' \
'%(threadName)s) - %(message)s'
LOG_DATE_FORMAT = '%Y-%m-%dT%H:%M:%S%z'
LOG_LEVEL = 'INFO'
PROXY_ENCRYPTED = True
"""Whether the connection to the proxy should use SSL/TLS encryption."""
# Prior to enable PROXY_VALIDATE_SSL, you need to configure SSL.
# https://neo4j.com/docs/operations-manual/current/security/ssl-framework/
PROXY_VALIDATE_SSL = False
"""Whether the SSL/TLS certificate presented by the user should be validated against the system's trusted CAs."""
IS_STATSD_ON = False
# Configurable dictionary to influence format of column statistics displayed in UI
STATISTICS_FORMAT_SPEC: Dict[str, Dict] = {}
SWAGGER_ENABLED = os.environ.get('SWAGGER_ENABLED', False)
SWAGGER_TEMPLATE_PATH = os.path.join('api', 'swagger_doc', 'template.yml')
SWAGGER = {
'openapi': '3.0.2',
'title': 'Metadata Service',
'uiversion': 3
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how could others contributing to source code without forcing to migrate first?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think fundamentally we are thinking it quite different. This RFC just makes Amundsen into a standalone Flask App with external plugin that could swap storage while I think in long term the metadata service should evolve into a complex platform which could support different primitive with event triggering and the metadata could support different Apps and not just the data catalog/discovery one .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in long term the metadata service should evolve into a complex platform which could support different primitive with event triggering and the metadata could support different Apps and not just the data catalog/discovery one .
This sounds like an awesome goal to me! But the question is how to get this project to a place where that is a possibility? To me that means it needs to be a mature tool, with a relatively stable api and data model, so that managing changes from version to version is not a nightmare. A structure of the repo that turns the metadata service into a standalone project, once the metadata service is more consistent and well structured, sounds like a great goal for years from now - but in the meantime, using one repo for multiple pip packages maintains most of the benefits of treating the metadata services as a standalone project, while maintaining an easier workflow to get developers involved in the short term.
|
||
#### Phase I: | ||
Merge `amundsenmetadatalibrary` and `amundsensearchlibrary`. This should not take too long and will be easy to maintain the same URL patterns for each under separate directories within one repository. | ||
The new repository will be named `amundsen`. During this period the `amundsenfrontend` repository will be unchanged and will work as it is. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, but amundsen is already used for central repo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are multiple ways we can handle this.
- Make use of the existing
amundsen
repository, and will start working in the same repo for the time being. We can have this in a separate directory, but this will break things for users once we move the code out of that directory to the root level. For now there is no setup.py or config or any such files that intersect with our implementation, therefore we can use the repo as is. This will certainly add the noise in the central repo. - Have a
v2
branch for the time being, and have all the implementation in there, and after 2-3 months (this is the time we are expecting for this whole migration) replace the master branch with this v2 branch, and move today's master in alegacy
branch. - Make use of the
main
branch, and have all of our development in there. After we are done with the implementation, makemain
(which is not the default master on Github) the primary branch of the project.
|
||
For each phase mentioned above, we may need to freeze development to the respective repository. Although it should not take too long to merge multiple repositories, there still can be unforeseen issues that delay the process. | ||
|
||
## Alternatives |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not really monorepos, another approach would be single repo with gremlin, rds, databuilder, services but release package separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah...in my opinion it would be better to go full monorepo and include gremlin, rds, databuilder in this consolidation.
Leaving them out preserves all of the development pain of the existing polyrepo approach when making significant changes; I think we could preserve databuilder as its own package with a separate version if that's the main concern.
|
||
This change can pave our ways to move Amundsen towards the next generation event-based system. More contributors and adopters will result in more integrations and diversity of features. | ||
|
||
Ref: https://engineering.linkedin.com/blog/2020/datahub-popular-metadata-architectures-explained |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
even you take the other project for example, I know at Li, the GMS is a separate deployment (https://github.com/linkedin/datahub-gma) which dh is only one of the app on top of the backend.
@allisonsuarez I believe this is a pretty common use case, and a few companies (including ING) does some kind of customizations this way. IMO the way to do such customizations will be the same i.e., you clone one repo as a submodule, and overwrite the files/classes the way you are doing today. (via a bash script etc.). Instead of maintaining three different submodules and packages, you will be dealing with just one package to have your desired configuration. Of course, the whole migration is backward compatible in a way that there is no change in the data, you can still inject your custom endpoints, you will still be able to replace/overwrite the files and configs. But yes, there will be some manual work to do to move your code from multiple places to a single place, and rework the deployment. I also believe that this will provide us the opportunity to think of better ways of doing such customizations. I can think of a few ways to achieve a high level of customizations:
|
Sorry, 1 and 2 are not clear to me. Do you mean a company's internal services are using amundsenmetada as dependency and deploying amundsenmetadata directly to EC2 For 3: #30 (comment) |
For 4, any reason it does not change? One of the typical use case is that the metadata service is only accessible from frontend or a whitelist service but not by the user (e.g write badge API, call gdpr asset). Now the same api will be accessible the same user if the could access the frontend. E.g currently our FE service allows VPN to access while metadata service allows FE subnet + ETL cluster subnet but not VPN to access. With single package, it will have VPN, ETL cluster subnet to access which any users who know the endpoint could now direct override the metadata instead of using FE UI. |
I don't want to keep repeating the same discussion on one package vs multiple packages. I am talking about no downtime service migration. There are bunches of users running in prod which update the git package weekly but without touching much other code. I won't accept if we go from multiple packages to one package that will require messy / huge manual hand-crafted service migrations that has downtime. This basically put others stop upgrading. For 1, 2 I know companies do these type of deployment. e.g it uses the pypi version as dependency which runs the service on native ec2. Or it uses the docker image as base image but overlap its own private files. |
To me, putting FE/metadata/search into single package deployment (one pypi, or one docker image) does not follow the long term vision (data discovery + Metadata engine/ platform) that requires the same networking firewall rule. In the end, you will want the same metadata service which could be enhanced with push model (e.g amundsen-io/amundsendatabuilder#431) to serve different Apps with different network firewall rules , different resource requirements. |
|
||
and many more... | ||
|
||
*This change will put all the packages in just one place, making it easier to manage dependencies and no multiple variants of the same package.* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got confused by this, you mean we keep the separate packages or not? You meant "merge" all the packages in just one place
?
Thanks @verdan for taking a swing at our development experience and architecture issues in this RFC! In my opinion, although I like the idea of simplifying the architecture, I think this RFC proposes too many changes all at once. Specifically, I think that just moving us into a monorepo with the same package configuration we have right now will fix the issues #1, #2 and #4 as you describe them, while making the migration much less daunting and disruptive. I am not saying that there is no benefits on evolving our architecture and simplifying deploys, but I think all this will be easier to do step by step after we fix the issues with the development experience we have right now. |
Just read this, somewhat of a newb engineer 😅 , but here's some comments w/ our implementation hopefully of use. On dependencies, we are mainly affected w/ databuilder (we use w/ Airflow so we have it in our reqs.txt). The other services are mainly done w/ Docker outside our CICD and monorepo. On development, a single service might make it easier (dependencies, share code-base), but admittedly not so much an issue for us, as most of the things in the immediate future don't require tandem implementation across all services (yet though). We're mostly focused on front-end (if Flask is removed, will there be an OIDC implementation)? Although we run the services behind VPN, we do like having Okta to get usernames, we're looking to expand a single RBAC story across tooling (amundsen/snowflake/looker/airflow). Probably the next thing we might look at would be developing the metadata service (but nothing heavy, maybe just some endpoints to update metadata directly vs. databuilder jobs, i.e. from Airflow, possibly having metadata service trigger Airflow DAGs via API, etc.). I realize I probably don't have a super strong opinion either way. One of the reqs we fulfill though is keeping optionality (i.e. we have kind of hacky thing w/ databuilder to use neo4j 4.x, this was done so we could have the option on switch to SAAS provide like Aura if our Infra team can't manage it in the future. We use AWS for ES because I think it's pretty cheap :) vs eng overhead). |
This is how we're doing customizations today as well, so going from three packages to one sounds like it would make life easier. 👍 |
I have found this general approach to be the most flexible and powerful, and use it extensively in other apps like Airflow. Having a configuration system that lets me override classes with my own custom versions without patching or making a copy of the repo just to overwrite some files is really nice. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Speaking on behalf of a team that has done a bit of work contributing our changes to Amundsen, this change would be very welcome - we spent more time trying to figure out how to adapt all the moving parts across multiple repos than we did actually writing code, both for our in-house stuff and the stuff we've been able to contribute back to the community.
I think some of the alternatives that are being discussed have some merit, and I think adopting them could be worth revisiting in the future. Right now, with the graph data model changing rapidly, a mono-repo seems extremely valuable to me. I do have a bit of experience to back this idea up - at Asana, our Work Graph data model is one of the defining features of our product, and coordinating the complexity of our evolving data model would be a serious challenge if we weren't tightly coupled in a monorepo for almost everything touching that data model. There are certainly places where the monorepo causes us pain, and there are pieces of the repo that we may carve out on their own eventually, but the monorepo approach has scaled well with hundreds of engineers contributing to a tool that is based on a graph data model.
*This change will put all the packages in just one place, making it easier to manage dependencies and no multiple variants of the same package.* | ||
|
||
#### 2: Development Efforts | ||
Having multiple repositories makes it really hard to implement a feature. Implementation and testing require efforts to synchronize and then code reviews, and finally, all the PRs across multiple repositories need to land in master at a certain time or at the same time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huge +1 to this - even understanding relatively simple pieces of the data model when developing often involves working across several repos, and then trying to understand which versions of which repos are compatible with one another. Consolidating all of this logic in one place feels like a really obvious win to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++ rolling out a feature across several repositories has been a significant pain point for us as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that it should definitely speed up dev time as well as make pull requests an easier process given that all of the context will be there.
Although a separate thread has brought up a good point, perhaps databuilder should also be moved into the same repo? But keep separate packages? Otherwise, there will still be some development friction when doing end2end changes.
- amundsenfrontendlibrary | ||
- amundsenmetadatalibrary | ||
- amundsensearchlibrary | ||
- amundsencommon |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO, the value of having the common config bundled with the rest of the microservices (in terms of development workflow and synchronizing changes) is great enough to outweigh the advantages of having the common repo stand on its own and have databuilder depend on it - especially when databuilder can just depend on the monorepo instead.
Most of the code is already there and is spread across multiple repositories. I can see this happen in multiple phases. | ||
|
||
#### Phase I: | ||
Merge `amundsenmetadatalibrary` and `amundsensearchlibrary`. This should not take too long and will be easy to maintain the same URL patterns for each under separate directories within one repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in long term the metadata service should evolve into a complex platform which could support different primitive with event triggering and the metadata could support different Apps and not just the data catalog/discovery one .
This sounds like an awesome goal to me! But the question is how to get this project to a place where that is a possibility? To me that means it needs to be a mature tool, with a relatively stable api and data model, so that managing changes from version to version is not a nightmare. A structure of the repo that turns the metadata service into a standalone project, once the metadata service is more consistent and well structured, sounds like a great goal for years from now - but in the meantime, using one repo for multiple pip packages maintains most of the benefits of treating the metadata services as a standalone project, while maintaining an easier workflow to get developers involved in the short term.
Since this will change the way we deploy and install the Amundsen project, we will not remove any of the existing repositories i.e., amundsenmetadatalibrary, amundsensearchlibrary, amundsenfrontendlibrary. | ||
We will add a deprecation warning on each readme and packages page where applicable, and introduce the new way i.e., a mono-repo, along the way. All the new work will be done in the new repository. | ||
|
||
## Drawbacks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aside from brief development freezes, this migration will probably be at least somewhat painful for folks already using amundsen in the short term (having to migrate custom code over etc). Not to say it isn't worth it, but definitely a tradeoff.
- amundsen (frontend, metadata, search, and common repositories) | ||
- amundsendatabuilder (Same as today's databuilder repository) | ||
|
||
There will not be a separate `amundsenmetadatalibrary`, `amundsenfrontendlibrary`, `amundsensearchlibrary` or `amundsencommon` packages. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think another reviewer has mentioned that maybe this goal should be done as a second stage? The thought here being that we can still get the benefits to productivity from the mono-repo but not take on all of the migration challenges quite yet.
PTAL at #31 -- I've extracted the monorepo portions of this RFC into a single RFC. |
Closing this one for now in favor of #31, and will have a separate RFC for the refactoring. |
The idea is to (eventually) deprecate the split of
amundsenmetadatalibrary
,amundsenfrontendlibrary
,amundsensearchlibrary
andamundsencommon
repositories, and merge them all into one mono repository. The split will instead be based on the backend and frontend.Following will be the repositories (and packages) of Amundsen after the change.
There will not be a separate
amundsenmetadatalibrary
,amundsenfrontendlibrary
,amundsensearchlibrary
oramundsencommon
packages.Signed-off-by: verdan verdan.mahmood@gmail.com