
[Integration][Gitlab] Added support for gitlab member ingestion #767

Open
wants to merge 39 commits into
base: main

mk-armah
Member

@mk-armah mk-armah commented Jul 3, 2024

Description

Added support for gitlab member ingestion

  • The GitLab group members endpoint does not include user emails by default or on the free plan; to get emails, the account must be either enterprise or self-hosted.
  • To let free-plan users view user emails as well, we add a flag, publicEmailVisibility, to the member kind. If set to true, members are enriched with public_email from the /users endpoint, but this depends on whether the user being queried has allowed public email visibility on their GitLab account. learn more
    Also note that we need to call the /users endpoint because members do not contain public_email. The default value of publicEmailVisibility is false.
  • Lastly, GitLab returns both actual members and tokens created as members. To filter bots out from actual members, we add a flag, filterBots, at the top level of the port-app-config; bots are filtered out when it is set to true, and the default is true. Bot filtering is needed when syncing both groups and group members, which is why filterBots sits at the top level of the port-app-config.

Type of change

Please leave one option from the following and delete the rest:

  • New feature (non-breaking change which adds functionality)

Screenshots

[Three screenshots attached, taken 2024-07-25 at 7:01:49 PM, 7:34:48 PM, and 7:38:20 PM]

@github-actions github-actions bot added the size/L label Jul 3, 2024
@mk-armah mk-armah requested a review from a team July 3, 2024 12:14
@Tankilevitch Tankilevitch changed the title from "PORT-7708 | Added support for gitlab member ingestion" to "[Integration][Gitlab] Added support for gitlab member ingestion" Jul 7, 2024
Comment on lines 95 to 100
"publicEmail": {
"type": "string",
"title": "Public Email",
"description": "User's GitLab public email.",
"icon": "User",
"format": "user"
Contributor

In what case will the user have a public email? I think we might want to remove it by default.

Comment on lines 33 to 35
relations:
  gitlabGroup: '[.__groups[].full_path]'
  createdBy: .created_by.username
Contributor

In the blueprint relations you only have gitlabGroup, while in your mapping you have both createdBy and gitlabGroup. I am not sure users care who created the user, so let's remove it. Let me know if you think otherwise.

Comment on lines 559 to 568
async def check_group_membership(group: Group) -> Group | None:
    "check if the user is a member of the group"
    async with semaphore:
        try:
            await AsyncFetcher.fetch_single(group.members.get, member.get_id())
            return group
        except GitlabError as err:
            if err.response_code != 404:
                raise err
            return None
Contributor

This feels very inefficient: if I have 1 member and 1000 groups, I'll have to query this API 1000 times to find out which groups they are related to? Isn't there any other way to get that data?
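
One way to avoid a membership check per (member, group) pair is to invert the lookup: list each group's members once, then build a member-to-groups index from those listings. The sketch below uses plain dictionaries in place of GitLab responses; `build_member_index` and the sample data are hypothetical and are not part of this PR.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def build_member_index(
    group_members: Dict[str, Iterable[str]],
) -> Dict[str, List[str]]:
    """Invert {group_path: [usernames]} into {username: [group_paths]}.

    One members listing per group is enough, so the cost is one request
    per group instead of one request per (member, group) pair.
    """
    index: Dict[str, List[str]] = defaultdict(list)
    for group_path, usernames in group_members.items():
        for username in usernames:
            index[username].append(group_path)
    return dict(index)


# Hypothetical listings, as if each came from a single group members call.
listings = {
    "acme": ["alice", "bob"],
    "acme/backend": ["alice"],
    "acme/frontend": ["bob", "carol"],
}
member_index = build_member_index(listings)
```

With 1000 groups this still needs 1000 listings, but that number no longer grows with the number of members.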

Comment on lines 607 to 611
user_groups: List[dict[str, Any]] = [
    {"id": group.id, "full_path": group.full_path}
    async for groups in self.get_member_groups(user)
    for group in groups
]
Contributor

I am not sure this should be the default behavior when querying members.

Contributor

Maybe this should be part of a group members kind that queries the list of members of each group?
Then, instead of having a user -> groups relation, we would have a group -> users relation.

Comment on lines 581 to 601
async def get_all_group_members(
    self, group: Group
) -> typing.AsyncIterator[List[GroupMemberAll]]:

    logger.info(f"Fetching all members of group {group.name}")

    async for users_batch in AsyncFetcher.fetch_batch(
        fetch_func=group.members_all.list,
        validation_func=self.should_run_for_member,
        pagination="offset",
        order_by="id",
        sort="asc",
    ):
        members: List[GroupMemberAll] = typing.cast(
            List[GroupMemberAll], users_batch
        )
        logger.info(
            f"Queried {len(members)} members {[user.username for user in members]} from {group.name}"
        )
        yield members

Contributor

Suggested change
async def get_all_group_members(
    self, group: Group
) -> typing.AsyncIterator[List[GroupMemberAll]]:
    logger.info(f"Fetching all members of group {group.name}")
    async for users_batch in AsyncFetcher.fetch_batch(
        fetch_func=group.members_all.list,
        validation_func=self.should_run_for_member,
        pagination="offset",
        order_by="id",
        sort="asc",
    ):
        members: List[GroupMemberAll] = typing.cast(
            List[GroupMemberAll], users_batch
        )
        logger.info(
            f"Queried {len(members)} members {[user.username for user in members]} from {group.name}"
        )
        members_enriched_with_group = [{**member.asdict(), "group": group} for member in members]
        yield members

async def enrich_member_with_groups_and_public_email(
    self, member
) -> dict[str, Any]:
    user: User = await self.get_user(member.id)
Contributor

Put this behind a feature flag in the selector mapping, and document it.

Comment on lines 212 to 222
@ocean.on_resync(ObjectKind.MEMBER)
async def resync_members(kind: str) -> ASYNC_GENERATOR_RESYNC_TYPE:
    for service in get_cached_all_services():
        for group in service.get_root_groups():
            async for members_batch in service.get_all_group_members(group):
                tasks = [
                    service.enrich_member_with_groups_and_public_email(member)
                    for member in members_batch
                ]
                members = await asyncio.gather(*tasks)
                yield members
Contributor

Suggested change
@ocean.on_resync(ObjectKind.GROUP_MEMBERS)
async def resync_members(kind: str) -> ASYNC_GENERATOR_RESYNC_TYPE:
    for service in get_cached_all_services():
        for group in service.get_root_groups():
            group_members = []
            async for members_batch in service.get_all_group_members(group):
                group_members.append(members_batch)
            yield {"group": group, "group_members": group_members}

stream_async_iterators_tasks

Comment on lines 109 to 114
"gitlabGroup": {
"title": "Group",
"target": "gitlabGroup",
"required": false,
"many": true
}
Contributor

missing group

Comment on lines 69 to 74
"locked": {
"type": "string",
"title": "Locked",
"icon": "GitLab",
"description": "Indicates if the GitLab item is locked."
},
Contributor

how is this locked?

Comment on lines 62 to 68
"properties": {
"state": {
"title": "State",
"type": "string",
"icon": "GitLab",
"description": "The current state of the GitLab item (e.g., open, closed)."
},
Contributor

We are talking about a member, not a GitLab item; please make sure it's readable and straightforward for users.

Comment on lines 110 to 123
"visibility": {
"icon": "Lock",
"title": "Visibility",
"type": "string",
"enum": [
"public",
"internal",
"private"
],
"enumColors": {
"public": "red",
"internal": "yellow",
"private": "green"
}
Contributor

add description

    publicEmail: .__public_email
  relations:
    gitlabGroup: '[.__groups[].full_path]'
    createdBy: .created_by.username
Contributor

What is createdBy? We don't have that kind of relation; please remove it.

integrations/gitlab/gitlab_integration/ocean.py (outdated)
locked: .locked
link: .web_url
email: .email
publicEmail: .__public_email
Contributor

Let's remove it by default.

Comment on lines 142 to 148
"relations": {
"members": {
"title": "Members",
"target": "member",
"required": false,
"many": true
}
Contributor

shouldn't be here

Member Author

relationship is group -> member

Comment on lines +16 to +17
User,
GroupMember,
Contributor

which one?

Member Author

both, depends on the context

Comment on lines 126 to 132
class MembersSelector(Selector):
    public_email_visibility: bool | None = Field(
        alias="publicEmailVisibility",
        default=False,
        description="If set to true, the integration will enrich members with public email field. Default value is false",
    )

Contributor

Define the class outside of GitlabMembersResourceConfig; that way we will be able to reuse it.

Comment on lines 332 to 334
cached_groups = event.attributes.setdefault(GROUPS_CACHE_KEY, {}).setdefault(
    self.gitlab_client.private_token, {}
)
Contributor

Let's not use this cache; we already have a prebuilt one in Ocean core.

Comment on lines 127 to 128
public_email_visibility: bool | None = Field(
    alias="publicEmailVisibility",
Contributor

enrich_with_public_email

Comment on lines 146 to 150
filter_bots: bool | None = Field(
    alias="filterBots",
    default=False,
    description="If set to true, bots will be filtered out from the members list. Default value is false",
)
Contributor

this should be part of the group selector and not of all resources

Member Author

Both Members and Groups depend on this parameter. Removing from top level means I have to include it in integration for groups and members. Please confirm this

Member Author

Also, I placed it at the top level to keep a consistent behavior in this system since groups are related to user. I strictly expect the value of filterBots to be consistent for groups and members. What could go wrong is that a user might specify this parameter as false for groups kind and true for member kind. Due to the relationship between members and groups, the catalog will be populated with extra inconsistent data.
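
The consistency argument can be sketched as follows: when both kinds read one shared flag and go through the same filter, a group's member relation can never point at a bot that the member kind skipped. Everything here is hypothetical and for illustration only: `AppConfig`, `drop_bots`, `sync_members`, `sync_group`, and in particular the username pattern used to recognise bots, which is an assumption and not the check this PR implements.

```python
import re
from dataclasses import dataclass
from typing import Dict, List

# Assumption: token-backed bot accounts use usernames such as
# "project_42_bot" or "group_7_bot_ab12cd". Adjust the pattern as needed.
BOT_USERNAME = re.compile(r"^(?:project|group)_\d+_bot\w*$")


@dataclass(frozen=True)
class AppConfig:
    """Hypothetical stand-in for the top level of the port-app-config."""

    filter_bots: bool = True


def drop_bots(members: List[Dict[str, str]], config: AppConfig) -> List[Dict[str, str]]:
    """Apply the single shared flag; both kinds go through this function."""
    if not config.filter_bots:
        return list(members)
    return [m for m in members if not BOT_USERNAME.match(m["username"])]


def sync_members(members: List[Dict[str, str]], config: AppConfig) -> List[Dict[str, str]]:
    """Member kind: one entity per member that survives the filter."""
    return drop_bots(members, config)


def sync_group(full_path: str, members: List[Dict[str, str]], config: AppConfig) -> Dict[str, object]:
    """Group kind: relation targets come from the same filtered list."""
    usernames = [m["username"] for m in drop_bots(members, config)]
    return {"full_path": full_path, "members": usernames}


raw = [
    {"username": "alice"},
    {"username": "project_42_bot"},
    {"username": "group_7_bot_ab12cd"},
]
config = AppConfig()
member_entities = sync_members(raw, config)
group_entity = sync_group("acme", raw, config)
```

If the flag lived on each kind separately, the two calls could receive different values and the group relation could reference members that were never ingested.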

Contributor

I would add a comment above the filter_bots so other developers will understand your motivation.
Also I would rename it to include_member_bots and default should be false

GROUPS_CACHE_KEY = "__cache_all_groups"
MEMBERS_CACHE_KEY = "__cache_all_members"

MAX_CONCURRENT_TASKS = 30
Contributor

Why? How can we actually validate and handle it? We want to be able to handle rate limits. Most third parties return rate-limit headers, but we don't use the GitLab API directly; we go through the client.

Here are some notes on the rate limits

Contributor

Actually just found this one - https://python-gitlab.readthedocs.io/en/stable/api-usage-advanced.html#rate-limits which means that gitlab client handles this one for us, so we are good 👍
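
For context, a constant like MAX_CONCURRENT_TASKS is typically paired with an `asyncio.Semaphore`, as in the `async with semaphore:` block earlier in this review. A semaphore caps how many coroutines are in flight at once; it does not react to rate-limit responses, which is the part left to the client. The sketch below is self-contained and uses a sleep in place of an API call; `run_bounded` and `demo` are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, List

MAX_CONCURRENT_TASKS = 30  # same cap as in the diff above


async def run_bounded(
    jobs: List[Callable[[], Awaitable[int]]], limit: int
) -> List[int]:
    """Run all jobs, allowing at most `limit` of them in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(job: Callable[[], Awaitable[int]]) -> int:
        async with semaphore:
            return await job()

    return await asyncio.gather(*(guarded(job) for job in jobs))


async def demo(limit: int = 3, total: int = 10) -> int:
    """Return the highest number of jobs observed running together."""
    in_flight = 0
    peak = 0

    async def fake_request() -> int:
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # stands in for an API call
        in_flight -= 1
        return 1

    results = await run_bounded([fake_request] * total, limit)
    assert sum(results) == total
    return peak


peak_concurrency = asyncio.run(demo())
```

In the integration the limit would be MAX_CONCURRENT_TASKS; the demo uses a smaller limit so the cap is easy to observe.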

return

async def enrich_group_with_members(self, group: Group) -> dict[str, Any]:

Contributor

redundant line

Comment on lines +573 to +574
async for members_batch in AsyncFetcher.fetch_batch(
    fetch_func=group.members.list,
Contributor

Why group.members.list and not group.members_all.list?
https://python-gitlab.readthedocs.io/en/stable/gl_objects/groups.html#id10
What do you think about adding this as an option to query all members rather than only one level of the hierarchy?

Comment on lines 226 to 232
if selector.public_email_visibility:
    yield [
        await service.enrich_member_with_public_email(member)
        for member in members
    ]
else:
    yield [member.asdict() for member in members]
Contributor

What happens when I have the same user in multiple groups? How would that behave? Will I have to perform repeated upserts?

Maybe in this method ^ we should use members = group.members_all.list(get_all=True), which will return everything and reduce the number of extra requests we have to perform?

Member Author

My reason for using members as opposed to members_all was that the members_all request returns not only the users in that group but also all inherited and invited members.

This results in all groups having the same members, since members most commonly belong to the parent group - details

Aside from this, the behavior and the way we retrieve members do not differ between members and members_all.

Contributor

OK, I understand what you are saying. So if we use members_all only for the members kind, wouldn't it reduce the number of requests by a lot? We would only have to fetch the members for the root groups rather than for the subgroups as well.

Also, just making sure that you have tested subgroups as well. Please confirm.

Member Author

@mk-armah mk-armah Jul 29, 2024

Calling /members on root groups returns the same results as /members/all. Calling /members on subgroups returns less data than /members/all because invited and inherited members are excluded. For root groups, the concept of inherited members does not apply, so all members of the root group are returned regardless of how we call it.

I believe the optimization here was getting members from root groups instead of all groups (including subgroups), which would have taken more time.
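
The difference described above can be pictured with a toy model. This is not a call to the GitLab API: `DIRECT`, `members`, and `members_all` are hypothetical stand-ins, and invited members (from shared groups) are left out to keep the sketch short.

```python
from typing import Dict, List, Set

# Hypothetical hierarchy: direct members keyed by group full path.
DIRECT: Dict[str, Set[str]] = {
    "acme": {"alice", "bob"},
    "acme/backend": {"carol"},
}


def ancestors(path: str) -> List[str]:
    """Return parent group paths, nearest first ("a/b/c" -> ["a/b", "a"])."""
    parts = path.split("/")
    return ["/".join(parts[:i]) for i in range(len(parts) - 1, 0, -1)]


def members(path: str) -> Set[str]:
    """Model of /members: direct members only."""
    return set(DIRECT.get(path, set()))


def members_all(path: str) -> Set[str]:
    """Model of /members/all: direct members plus those inherited from parents."""
    result = members(path)
    for parent in ancestors(path):
        result |= DIRECT.get(parent, set())
    return result


root_direct = members("acme")
root_all = members_all("acme")
sub_direct = members("acme/backend")
sub_all = members_all("acme/backend")
```

In this model a root group has no ancestors, so both calls agree there, while the subgroup's full list also picks up the root group's members.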

Contributor

@Tankilevitch Tankilevitch left a comment

Also missing webhook handling for new members.

Comment on lines 31 to 32
relations:
  members: '[.__members[].username]'
Contributor

I think adding members should be optional for groups, so customers will be able to decide whether they want to have it or not.

Contributor

We also need to check that creating a new group syncs the members correctly.

Member Author

Fair. How about project -> group, same?

Comment on lines 225 to 239
if selector.enrich_with_public_email:
    enriched_member_tasks = [
        service.enrich_member_with_public_email(member)
        async for members in service.get_all_group_members(group)
        for member in members
    ]
    enriched_members = await asyncio.gather(*enriched_member_tasks)
    return enriched_members
else:
    member_dicts = [
        member.asdict()
        async for members in service.get_all_group_members(group)
        for member in members
    ]
    return member_dicts
Contributor

Why not call get_all_group_members once and then, if enrich_with_public_email is set, spawn requests for the members?

Member Author

Great, the code will look like this then

`members = [
    member
    async for members_batch in service.get_all_group_members(group)
    for member in members_batch
]

if selector.enrich_with_public_email:
    enriched_member_tasks = [
        service.enrich_member_with_public_email(member)
        for member in members
    ]
    enriched_members = await asyncio.gather(*enriched_member_tasks)
    result = enriched_members
else:
    result = [member.asdict() for member in members]

return result`

Member Author

I'll refactor

Member Author

This is better for readability, thanks

Comment on lines 224 to 246
async def fetch_group_members(service, group):
    if selector.enrich_with_public_email:
        enriched_member_tasks = [
            service.enrich_member_with_public_email(member)
            async for members in service.get_all_group_members(group)
            for member in members
        ]
        enriched_members = await asyncio.gather(*enriched_member_tasks)
        return enriched_members
    else:
        member_dicts = [
            member.asdict()
            async for members in service.get_all_group_members(group)
            for member in members
        ]
        return member_dicts

for service in get_cached_all_services():
    async for groups in service.get_all_groups(skip_validation=True):
        group_tasks = [fetch_group_members(service, group) for group in groups]
        for group_task in asyncio.as_completed(group_tasks):
            group_members = await group_task
            yield group_members
Contributor

I am confused: which one is the right one? You return both member_dicts and group_members in the same resync method? How should it work?

Member Author

The members dict is returned by the fetch_group_members function, which decides whether or not to enrich a group member.

Member Author

At group_tasks = [fetch_group_members(service, group) for group in groups], we use fetch_group_members to fetch all members of each group and process the results as they complete.

Comment on lines +114 to +115
tasks = [service.enrich_group_with_members(group) for group in groups_batch]
enriched_groups = await asyncio.gather(*tasks)
Contributor

when we do such things, we should add it under a feature flag, as we wouldn't want to just do all those extra requests to get the members if it wasn't actually intended by the user.

Therefore I suggest leaving the group kind as it was.
And add the following kinds.

groupMembers - which will do this exact logic that you have implemented in resync_groups - this kind will allow users to decide whether they want to export the group with or without members.

member / user - which will bring the information about a user, such as the email.

Comment on lines 219 to 245
gitlab_resource_config: GitlabMembersResourceConfig = typing.cast(
    GitlabMembersResourceConfig, event.resource_config
)
selector = gitlab_resource_config.selector

async def process_group_members(service, group):
    members = [
        member
        async for members_batch in service.get_all_group_members(group)
        for member in members_batch
    ]

    if selector.enrich_with_public_email:
        enriched_member_tasks = [
            service.enrich_member_with_public_email(member) for member in members
        ]
        enriched_members = await asyncio.gather(*enriched_member_tasks)
        return enriched_members

    return [member.asdict() for member in members]

for service in get_cached_all_services():
    async for groups in service.get_all_groups(skip_validation=True):
        group_tasks = [process_group_members(service, group) for group in groups]
        for group_task in asyncio.as_completed(group_tasks):
            group_members = await group_task
            yield group_members
Contributor

What will happen when a member is in multiple groups? Will it get overwritten in Port with the latest group that was fetched?

Member Author

Yes, it would. A function that caches members and returns only unsynced / uncached members can solve this, so we don't have to hit Port with members that have already been synced.
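
A minimal sketch of that idea, assuming each member is a dict with a stable `id` field: wrap the batch stream in an async generator that remembers which ids it has already yielded. `unique_members` and `fake_batches` are hypothetical names, and the in-memory set stands in for whatever cache the integration would actually use.

```python
import asyncio
from typing import Any, AsyncIterator, Dict, List, Set

Member = Dict[str, Any]


async def unique_members(
    batches: AsyncIterator[List[Member]],
) -> AsyncIterator[List[Member]]:
    """Yield each batch with already-seen members removed."""
    seen: Set[int] = set()
    async for batch in batches:
        fresh: List[Member] = []
        for member in batch:
            if member["id"] not in seen:
                seen.add(member["id"])
                fresh.append(member)
        if fresh:  # skip batches that contain nothing new
            yield fresh


async def fake_batches() -> AsyncIterator[List[Member]]:
    """Stand-in for per-group member batches with overlapping users."""
    yield [{"id": 1, "username": "alice"}, {"id": 2, "username": "bob"}]
    yield [{"id": 1, "username": "alice"}, {"id": 3, "username": "carol"}]
    yield [{"id": 2, "username": "bob"}]


async def collect() -> List[List[Member]]:
    return [batch async for batch in unique_members(fake_batches())]


deduped = asyncio.run(collect())
```

Each member then reaches Port once per resync, at the cost of keeping the seen ids in memory for the duration of the run.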

@mk-armah mk-armah requested a review from a team as a code owner August 21, 2024 10:47

2 participants