Consolidate Amundsen Microservices #30
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,154 @@ | ||
- Feature Name: consolidate_amundsen_microservices | ||
- Start Date: 2021-03-23 | ||
- RFC PR: [amundsen-io/rfcs#30](https://github.com/amundsen-io/rfcs/pull/30) | ||
- Amundsen Issue: [amundsen-io/amundsen#0000](https://github.com/amundsen-io/amundsen/issues/0000) (leave this empty for now) | ||
|
||
# Consolidate Amundsen Microservices | ||
|
||
## Summary | ||
|
||
The idea is to (eventually) deprecate the split of `amundsenmetadatalibrary`, `amundsenfrontendlibrary`, `amundsensearchlibrary` and `amundsencommon` repositories, and merge them all into one mono repository. The split will instead be based on the backend and frontend. | ||
|
||
The following will be the repositories (and packages) of Amundsen after the change. | ||
- amundsen (frontend, metadata, search, and common repositories) | ||
- amundsendatabuilder (Same as today's databuilder repository) | ||
|
||
There will no longer be separate `amundsenmetadatalibrary`, `amundsenfrontendlibrary`, `amundsensearchlibrary` or `amundsencommon` packages. | ||
|
||
A few reasons why I am keeping the `amundsendatabuilder` out of the monorepo: | ||
- databuilder is a completely separate layer not directly related to the Amundsen project. If I am using Apache Atlas or GCP or any other such proxies, I'd not want to have databuilder code in my repo. | ||
- databuilder is the only piece that will have a lot of integrations, so keeping it separate reduces the noise in the actual amundsen repo/package. This will enable us to move really fast when introducing new integrations in databuilder. | ||
|
||
**No data will be lost from the database or search engine, as we will not be changing any models or data structure.** | ||
|
||
## Motivation | ||
|
||
At the moment, the Amundsen project is divided into multiple very thin backends or proxies, which are backed by other projects/databases/search services like Neo4j, Atlas, Neptune, etc. | ||
|
||
The following Python/Flask-based proxies power the Amundsen project: | ||
- amundsenfrontendlibrary | ||
- amundsenmetadatalibrary | ||
- amundsensearchlibrary | ||
- amundsencommon | ||
Review comment: So common repo is not just for sharing service common code but is shared by databuilder as well. Its end goal is to use common as schema repo for across all repos including databuilder.

Review comment: we can't use amundsen for the service repo as we have other repos like gremlin or rds to support different proxies classes.

Review comment: amundsencommon is currently not being actively used by the databuilder, and the only usage I found was the index_mapping, which is very specific to ES, hence databuilder. But I get your point with the end goal of the amundsencommon. My proposal, in that case, would be, to rename amundsencommon to

Review comment: one of the pain point is to have databuilder drifted from common. I think the common repo should stay as a separate package as it is and make databuilder fully compatible / depends on it. from that point, I don't quite get why we put it in single mono-repo

Review comment: IMHO, the value of having the common config bundled with the rest of the microservices (in terms of development workflow and synchronizing changes) is great enough to outweigh the advantages of having the common repo stand on its own and have databuilder depend on it - especially when databuilder can just depend on the monorepo instead.
||
|
||
This architecture creates the following issues and is making it hard for companies to adopt Amundsen, and to customize it accordingly. | ||
Review comment: having multi-repos have its pros/cons: e.g another lyft LFAI project flyte (https://github.com/flyteorg) has done similar . Having multiple repos allow if we make metadata service is the true metadata platform/engine and build other frontend on top of it (data X UI).

Review comment: I agree that each model has its own pros and cons. We will still be able to use the metadata endpoints (even search and custom endpoints) to build other frontends, or use them in other services. In fact, by using and securing just one service, people will have the access to the complete suite of endpoints, and will still be able to add layers on top of those endpoints to build other applications.
||
I've also added a note under each item explaining how this change will solve this issue. | ||
|
||
#### 1: Dependency Management | ||
Each proxy depends on pretty much the same Python packages, but the version pins are maintained separately in each repository, which makes them hard to keep in sync. As a result, most of the time each proxy runs its own version of a given dependency. | ||
|
||
A few examples: | ||
|
||
| Package | amundsenmetadata | amundsenfrontend | | ||
| ------------------ |:------------------:| -----------------:| | ||
| amundsen-common | >=0.8.1 | ==0.6.0 | | ||
| flake8 | ==3.5.0 | ==3.8.4 | | ||
| flake8-tidy-imports | ==1.1.0 | ==4.2.1 | | ||
|
||
and many more... | ||
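
To make the drift in the table above concrete, here is a minimal sketch that flags packages pinned differently across services. The `PINS` dictionary simply restates the table; `find_drift` is a hypothetical helper written for illustration and is not part of any Amundsen repository.

```python
from typing import Dict

# Pins restated from the table above; keys are the table's column headers.
PINS: Dict[str, Dict[str, str]] = {
    "amundsenmetadata": {
        "amundsen-common": ">=0.8.1",
        "flake8": "==3.5.0",
        "flake8-tidy-imports": "==1.1.0",
    },
    "amundsenfrontend": {
        "amundsen-common": "==0.6.0",
        "flake8": "==3.8.4",
        "flake8-tidy-imports": "==4.2.1",
    },
}


def find_drift(pins: Dict[str, Dict[str, str]]) -> Dict[str, Dict[str, str]]:
    """Return {package: {service: specifier}} for packages whose specifiers differ."""
    by_package: Dict[str, Dict[str, str]] = {}
    for service, requirements in pins.items():
        for package, specifier in requirements.items():
            by_package.setdefault(package, {})[service] = specifier
    # A package has drifted when services do not all declare the same specifier.
    return {
        package: specifiers
        for package, specifiers in by_package.items()
        if len(set(specifiers.values())) > 1
    }


if __name__ == "__main__":
    for package, specifiers in find_drift(PINS).items():
        print(package, specifiers)
```

With a single requirements file in one repository, a check like this would have nothing to compare, which is the point of the proposal.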
|
||
*This change will put all the packages in just one place, making it easier to manage dependencies and avoiding multiple variants of the same package.* | ||
Review comment: Got confused by this, you mean we keep the separate packages or not? You meant
||
|
||
#### 2: Development Efforts | ||
Having multiple repositories makes it really hard to implement a feature. Implementation and testing have to be synchronized across repositories, each change needs its own code review, and finally all the PRs across multiple repositories need to land in master at a coordinated time or at the same time. | ||
Review comment: Huge +1 to this - even understanding relatively simple pieces of the data model when developing often involves working across several repos, and then trying to understand which versions of which repos are compatible with one another. Consolidating all of this logic in one place feels like a really obvious win to me.

Review comment: ++ rolling out a feature across several repositories has been a significant pain point for us as well.

Review comment: I agree that it should definitely speed up dev time as well as make pull requests an easier process given that all of the context will be there. Although a separate thread has brought up a good point, perhaps databuilder should also be moved into the same repo? But keep separate packages? Otherwise, there will still be some development friction when doing end2end changes.
||
|
||
*This change results in faster local development, one PR to fix a bug or add a feature, and none of the dependency hell we are facing today, which should attract more contributors.* | ||
|
||
#### 3: Customization & Deployments | ||
One of the frequently mentioned topics about Amundsen is the complexity of the architecture and the effort required to customize it. Having multiple repositories results in multiple components to install, customize, manage and fix. | ||
|
||
*This change will make the customization much easier and the deployment will be pretty straightforward, with only 2 components. No multilevel docker files, easy maintenance of the infrastructure.* | ||
|
||
Custom endpoints will still be injectable the same way we are doing today. We will write a blog/tutorial around each customization. | ||
|
||
#### 4: Code Duplication | ||
Requirements, config files, models, helper functions, exception handling, CI/CD pipelines, Docker files, and other repository management such as licenses and PR/issue templates are pretty much the same in all these microservices. At the moment, a change in one repository makes it drift from the others. | ||
|
||
*This change will move all the above into a single repository making it easy to change anything without keeping track of multiple repositories, and redundant code.* | ||
|
||
Having one config file does NOT mean we cannot have the incubator integration. It will be the same process as today, i.e., a separate incubation repo implements the required methods, is installed as a requirement, and is set as the proxy within the config file. | ||
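
As an illustration of "implements the required methods", the sketch below shows one way such a contract could be expressed. Both class names and the single `get_table` method are assumptions made for this example; the RFC does not specify the interface an incubator proxy must satisfy.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class BaseMetadataProxy(ABC):
    """Hypothetical contract that an external (incubator) proxy implements."""

    @abstractmethod
    def get_table(self, table_uri: str) -> Dict[str, Any]:
        """Return the metadata stored for the given table URI."""


class InMemoryProxy(BaseMetadataProxy):
    """Stand-in for a proxy shipped from a separate incubation repository."""

    def __init__(self, tables: Dict[str, Dict[str, Any]]) -> None:
        self._tables = tables

    def get_table(self, table_uri: str) -> Dict[str, Any]:
        return self._tables[table_uri]


if __name__ == "__main__":
    proxy = InMemoryProxy({"db://cluster.schema/table": {"name": "table"}})
    print(proxy.get_table("db://cluster.schema/table"))
```

A package that provides a class like `InMemoryProxy` could live outside the main repository and still be selected through the config file, since the service would only rely on the abstract interface.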
|
||
|
||
## Transition Path | ||
Most of the code is already there and is spread across multiple repositories. I can see this happening in multiple phases. | ||
|
||
#### Phase I: | ||
Merge `amundsenmetadatalibrary` and `amundsensearchlibrary`. This should not take too long, and it will be easy to maintain the same URL patterns for each service under separate directories within one repository. | ||
Review comment: could you provide more details? This is not really just about copy code to different repos right? Each existing service has its own config, even own private API, own private proxy, own docker baked image. what is the path for non disrupted service migration.

Review comment: Sure thing. Since we are not deleting the old services, i.e., amundsenmetadatalibrary and amundsensearchlibrary, users can still migrate to the new architecture in parallel with the old system. Because the data is not changing at all, this is possible. Most of the configuration variables in our services are the same, i.e., the config file will look something like this.

    import os
    from typing import Dict

    # Maps a short proxy name to the dotted path of its proxy class.
    METADATA_PROXY_CLIENTS = {
        'NEO4J': 'metadata.proxies.neo4j.proxy.Neo4jProxy',
        'ATLAS': 'metadata.proxies.atlas.proxy.AtlasProxy',
        'NEPTUNE': 'metadata.proxies.neptune.proxy.NeptuneGremlinProxy'
    }


    class MetadataConfig:
        METADATA_PROXY_HOST = os.environ.get('METADATA__PROXY_HOST', 'localhost')
        # Environment variables are strings, so convert the port explicitly.
        METADATA_PROXY_PORT = int(os.environ.get('METADATA__PROXY_PORT', 7687))
        METADATA_PROXY = os.environ.get('METADATA__PROXY', 'NEO4J')
        # For the users who would like to have custom proxy classes. This also
        # enables us to install and use custom proxies outside this repository.
        METADATA_PROXY_CLIENT = os.environ.get('METADATA__PROXY_CLIENT',
                                               METADATA_PROXY_CLIENTS[METADATA_PROXY])
        METADATA_PROXY_USER = os.environ.get('METADATA__PROXY_USER', 'neo4j')
        METADATA_PROXY_PASSWORD = os.environ.get('METADATA__PROXY_PASSWORD', 'test')


    class SearchConfig:
        # SAME AS METADATA ABOVE
        pass


    class Config(MetadataConfig, SearchConfig):
        LOG_FORMAT = '%(asctime)s.%(msecs)03d [%(levelname)s] %(module)s.%(funcName)s:%(lineno)d (%(process)d:' \
                     '%(threadName)s) - %(message)s'
        LOG_DATE_FORMAT = '%Y-%m-%dT%H:%M:%S%z'
        LOG_LEVEL = 'INFO'
        PROXY_ENCRYPTED = True
        """Whether the connection to the proxy should use SSL/TLS encryption."""
        # Prior to enable PROXY_VALIDATE_SSL, you need to configure SSL.
        # https://neo4j.com/docs/operations-manual/current/security/ssl-framework/
        PROXY_VALIDATE_SSL = False
        """Whether the SSL/TLS certificate presented by the user should be validated against the system's trusted CAs."""
        IS_STATSD_ON = False
        # Configurable dictionary to influence format of column statistics displayed in UI
        STATISTICS_FORMAT_SPEC: Dict[str, Dict] = {}
        SWAGGER_ENABLED = os.environ.get('SWAGGER_ENABLED', False)
        SWAGGER_TEMPLATE_PATH = os.path.join('api', 'swagger_doc', 'template.yml')
        SWAGGER = {
            'openapi': '3.0.2',
            'title': 'Metadata Service',
            'uiversion': 3
        }

Review comment: how could others contributing to source code without forcing to migrate first?

Review comment: I think fundamentally we are thinking it quite different. This RFC just makes Amundsen into a standalone Flask App with external plugin that could swap storage while I think in long term the metadata service should evolve into a complex platform which could support different primitive with event triggering and the metadata could support different Apps and not just the data catalog/discovery one .

Review comment: This sounds like an awesome goal to me! But the question is how to get this project to a place where that is a possibility? To me that means it needs to be a mature tool, with a relatively stable api and data model, so that managing changes from version to version is not a nightmare. A structure of the repo that turns the metadata service into a standalone project, once the metadata service is more consistent and well structured, sounds like a great goal for years from now - but in the meantime, using one repo for multiple pip packages maintains most of the benefits of treating the metadata services as a standalone project, while maintaining an easier workflow to get developers involved in the short term.
||
The new repository will be named `amundsen`. During this period the `amundsenfrontend` repository will be unchanged and will work as it is. | ||
Review comment: sorry, but amundsen is already used for central repo.

Review comment: There are multiple ways we can handle this.
|
||
|
||
Metadata and Search will have their own Flask blueprints. | ||
- /api/metadata/ | ||
- /api/search/ | ||
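
A minimal sketch of how the two blueprints above could be mounted in one Flask application follows. The blueprint names, the `/healthcheck` routes, and `create_app` are assumptions for illustration; only the `/api/metadata/` and `/api/search/` prefixes come from this RFC.

```python
from flask import Blueprint, Flask, jsonify

# Hypothetical blueprints; in the merged repository each would live in its own directory.
metadata_bp = Blueprint("metadata", __name__)
search_bp = Blueprint("search", __name__)


@metadata_bp.route("/healthcheck")
def metadata_healthcheck():
    return jsonify(service="metadata", status="ok")


@search_bp.route("/healthcheck")
def search_healthcheck():
    return jsonify(service="search", status="ok")


def create_app() -> Flask:
    """Build a single app that serves both services under separate URL prefixes."""
    app = Flask(__name__)
    app.register_blueprint(metadata_bp, url_prefix="/api/metadata")
    app.register_blueprint(search_bp, url_prefix="/api/search")
    return app


if __name__ == "__main__":
    client = create_app().test_client()
    print(client.get("/api/metadata/healthcheck").get_json())
    print(client.get("/api/search/healthcheck").get_json())
```

Because each blueprint keeps its own prefix, existing clients of either service would only need to change the host they call, assuming the route paths under each prefix stay the same.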
|
||
During this phase, we will freeze the codebase for `amundsenmetadatalibrary` and `amundsensearchlibrary` repositories for any new features. Security patches will still be approved and merged. | ||
|
||
A very high-level view of how the directories will look after this phase is below: | ||
|
||
<img src="../assets/30/phase1.png" alt="drawing" width="300"/> | ||
|
||
#### Phase II: | ||
In this phase, we’ll deprecate the `amundsenfrontendlibrary` and `amundsencommon` repositories/packages. | ||
We’ll move the React application as-is into the `amundsen` repository. | ||
Next, we will deprecate the frontend Flask codebase completely and call the metadata and search endpoints directly from the frontend React application. This will completely remove the frontend Flask codebase. | ||
Review comment: This will change the security model, in that the metadata and search services have mutation endpoints that aren't OK to be exposed unprotected to web. I think people typically secure it by firewalling it off except for approved services, so by virtue of limited functionality from the front-end, it's not a problem. Should this be figured out now, or in a future RFC?

Review comment: It is also sometimes nice to be able to define "backend-for-frontend style" APIs; that is, API endpoints that are somewhat specific to the UI in question, rather than being generally-useful unopinionated UIs. I suppose that could be supported with a separate blueprint?
||
All the custom endpoints will be moved to the `amundsen` repository. | ||
|
||
Below is how the folder structure will look, along with the setup.py file showing how we set up the dependencies. | ||
|
||
<img src="../assets/30/phase2.png" alt="drawing" width="400"/> | ||
|
||
<img src="../assets/30/setup.png" alt="drawing" width="500"/> | ||
|
||
## How We Communicate This | ||
|
||
- We'll promote this RFC frequently on Amundsen Slack and other social media channels to get feedback from the community. | ||
- We'll give updates on progress and phases during the community meetings. | ||
- We'll write a blog post/tutorial explaining how to migrate an existing Amundsen deployment to this new architecture. | ||
|
||
Since this will change the way we deploy and install the Amundsen project, we will not remove any of the existing repositories i.e., amundsenmetadatalibrary, amundsensearchlibrary, amundsenfrontendlibrary. | ||
We will add a deprecation warning on each readme and packages page where applicable, and introduce the new way i.e., a mono-repo, along the way. All the new work will be done in the new repository. | ||
|
||
## Drawbacks | ||
Review comment: Aside from brief development freezes, this migration will probably be at least somewhat painful for folks already using amundsen in the short term (having to migrate custom code over etc). Not to say it isn't worth it, but definitely a tradeoff.
||
|
||
For each phase mentioned above, we may need to freeze development in the respective repository. Although it should not take too long to merge multiple repositories, there can still be unforeseen issues that delay the process. | ||
|
||
## Alternatives | ||
Review comment: this is not really monorepos, another approach would be single repo with gremlin, rds, databuilder, services but release package separately.

Review comment: Yeah...in my opinion it would be better to go full monorepo and include gremlin, rds, databuilder in this consolidation. Leaving them out preserves all of the development pain of the existing polyrepo approach when making significant changes; I think we could preserve databuilder as its own package with a separate version if that's the main concern.
||
|
||
1. We do not change anything and live with the existing architecture. This will eventually result in dependency hell, redundant code, and a complex network of repositories. | ||
|
||
1. An alternative could be to merge all 4 microservices into a single repository, i.e., frontend, metadata, search, and common would simply become the `amundsen-io/amundsen` repository. | ||
|
||
For proxies we'll have separate packages, i.e., `amundsen-io/metadataproxy` and `amundsen-io/searchproxy`. | ||
- metadataproxy: will hold neo4j, atlas, rdbms, neptune, etc. | ||
- searchproxy: will hold the implementations for elasticsearch, atlas, etc. | ||
|
||
We'll install amundsen as `pip install amundsen`, and this will install the amundsen project without any proxy for metadata or search, making it a core project without any custom dependencies for anyone. | ||
|
||
The config options in the amundsen repository to select the proxies will be: | ||
|
||
- METADATA_PROXY = "amundsen_metadataproxy.proxy.neo4j" | ||
- SEARCH_PROXY = "amundsen_searchproxy.proxy.elasticsearch" | ||
|
||
and we'll install proxies like this. | ||
|
||
- `pip install amundsen-metadataproxy[neo4j]` | ||
- `pip install amundsen-searchproxy[elasticsearch]` | ||
|
||
This will make the whole codebase much lighter and not dependent on any specific proxies or packages like neo4j, apache-atlas, neptune, etc. | ||
|
||
1. Deprecate the split of metadata, search, and amundsencommon repositories, and merge them all into one mono repository, and call this `amundsenbackend`. Remove the Flask based code from the frontend repository and call the metadata/search endpoints directly via the React app. | ||
Summary: The split will instead be based on the backend and frontend. | ||
Following will be the repositories (and packages) of Amundsen after the change. | ||
- amundsenfrontend (Frontend React Application) | ||
- amundsenbackend (frontend flask application, metadata, search, and common repositories consolidated in this repository) | ||
- amundsendatabuilder (Same as today's databuilder repository) | ||
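
For the second alternative above, where `METADATA_PROXY` and `SEARCH_PROXY` hold dotted paths, the sketch below shows one way a service could resolve such a string at startup. `load_object` is a hypothetical helper, and it assumes the last path component names an attribute (for example a class) of the module before it. The packages `amundsen_metadataproxy` and `amundsen_searchproxy` are proposals in this RFC and do not exist, so the demonstration uses standard-library paths instead.

```python
import importlib
from typing import Any


def load_object(dotted_path: str) -> Any:
    """Import the module part of 'pkg.module.attr' and return the named attribute."""
    module_path, _, attribute = dotted_path.rpartition(".")
    if not module_path:
        raise ValueError(f"Expected a dotted path like 'pkg.module.attr', got {dotted_path!r}")
    module = importlib.import_module(module_path)
    # Raises AttributeError if the module does not define the attribute.
    return getattr(module, attribute)


if __name__ == "__main__":
    # Stand-in for something like METADATA_PROXY = "<package>.<module>.<ProxyClass>".
    ordered_dict_cls = load_object("collections.OrderedDict")
    print(ordered_dict_cls.__name__)
```

Resolving the proxy lazily from a string is what would let the core package install without neo4j, Atlas, or Neptune dependencies: the import only happens for the proxy that the config actually names.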
|
||
|
||
## Unresolved questions | ||
TBD | ||
> What parts of this RFC are TBD? | ||
|
||
## Future possibilities | ||
|
||
This change can pave the way for moving Amundsen towards a next-generation event-based system. More contributors and adopters will result in more integrations and a greater diversity of features. | ||
|
||
Ref: https://engineering.linkedin.com/blog/2020/datahub-popular-metadata-architectures-explained | ||
Review comment: even you take the other project for example, I know at Li, the GMS is a separate deployment (https://github.com/linkedin/datahub-gma) which dh is only one of the app on top of the backend.

Review comment: I think another reviewer has mentioned that maybe this goal should be done as a second stage? The thought here being that we can still get the benefits to productivity from the mono-repo but not take on all of the migration challenges quite yet.