Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make the iBridges search more easy to use #239

Merged
merged 32 commits into from
Aug 19, 2024
Merged

Make the iBridges search more easy to use #239

merged 32 commits into from
Aug 19, 2024

Conversation

qubixes
Copy link
Collaborator

@qubixes qubixes commented Aug 7, 2024

Significant API change for the search function.

TODO:

  • Documentation
  • Tutorials
  • CLI
  • Update/improve tests

Fixes #233
Fixes #229
Fixes #230

@qubixes qubixes marked this pull request as draft August 7, 2024 13:54
@qubixes qubixes requested a review from chStaiger August 7, 2024 13:54
Copy link
Member

@chStaiger chStaiger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the new structure of the code. It makes it much more readable.
I think we need to add some good test examples to see whether the search retrieves all or not too many results.

  • path_pattern = %/my_coll/%books, where books is a collection
  • path_pattern = %/my_coll/%book.txt
  • Testing the % in the metadata and in the checksum.
  • ...

ibridges/search.py Outdated Show resolved Hide resolved
ibridges/search.py Outdated Show resolved Hide resolved
ibridges/search.py Outdated Show resolved Hide resolved
ibridges/search.py Outdated Show resolved Hide resolved
Copy link
Member

@chStaiger chStaiger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the whole I must say, the code and the usage of the function is really nice.
I just find the creation of the MetaSearch object less intuitiv than creating a dictionary and passing that to the search function to pass the metadata criteria.

@qubixes
Copy link
Collaborator Author

qubixes commented Aug 12, 2024

On the whole I must say, the code and the usage of the function is really nice. I just find the creation of the MetaSearch object less intuitiv than creating a dictionary and passing that to the search function to pass the metadata criteria.

You can also provide a tuple, however for the validation you are on your own then.

@chStaiger
Copy link
Member

On the whole I must say, the code and the usage of the function is really nice. I just find the creation of the MetaSearch object less intuitiv than creating a dictionary and passing that to the search function to pass the metadata criteria.

You can also provide a tuple, however for the validation you are on your own then.

Can we add the validation and transformation into a MetaSearch object to the search function if a list of tuples for metadata is provided? I think that would make it easier for our users who are not so familiar with programming but still write little python scripts.

@qubixes
Copy link
Collaborator Author

qubixes commented Aug 13, 2024

On the whole I must say, the code and the usage of the function is really nice. I just find the creation of the MetaSearch object less intuitiv than creating a dictionary and passing that to the search function to pass the metadata criteria.

You can also provide a tuple, however for the validation you are on your own then.

Can we add the validation and transformation into a MetaSearch object to the search function if a list of tuples for metadata is provided? I think that would make it easier for our users who are not so familiar with programming but still write little python scripts.

I feel like those users will simply copy paste whatever is put in the tutorial. I don't understand how it is more difficult though. Compare:

from ibridges import MetaSearch

search(path, metadata=MetaSearch(value="some_value")

or

search(path, metadata=[(..., "some_value", ...)])

Yes, it's shorter, but it looks much more like magic to me.

@qubixes qubixes marked this pull request as ready for review August 14, 2024 12:40
Copy link
Member

@chStaiger chStaiger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some small changes in wording, small bugfixes in the tutorials. After that is fixed: Approved.

Nice work!!!

docker/irods_client/tests/test_search.py Show resolved Hide resolved
([MetaSearch("Author"), MetaSearch("Random")], False),
])
@mark.parametrize("item_name", ["collection", "dataobject"])
def test_find_meta(session, item_name, request, metadata, is_found):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where exactly is the metadata defined that we would like to search by?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inside the test function, see a few lines below it.

docs/source/irods_search.rst Outdated Show resolved Hide resolved
@@ -40,18 +32,17 @@ We need to create a python dictionary which contains the metadata keys ad their

.. code-block:: python

key_vals = {'key': 'value'}
search_data(session, key_vals = key_vals)
search_data(session, metadata=MetaSearch(key="key", value="value"))
key_vals = {'key': ''}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
key_vals = {'key': ''}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -40,18 +32,17 @@ We need to create a python dictionary which contains the metadata keys ad their

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is still some reference to the dictionaries which we used previously. Replace with:

To search by metadata we need to create a MetaSearch for each key, value, units triple:

MetaSearch(key="my_key", value="my_value", units="my_units")

The above statement means: find all data objects and collections which is annotated with a key "my_key" where the value is "my_value" and the units is "my_units".

You can also omit any of the three items which will be interpreted as "anything". E.g.

MetaSearch(value="my_value")

means: find all data which is labeled with any key where the value is "my_value" and the units can also be anything. Here again you can also use wild cards.

A query with metadata will look like (code block below)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added.

docs/source/irods_search.rst Outdated Show resolved Hide resolved
key_vals = {'key': ''}
search_data(session, key_vals = key_vals)
search_data(session, metadata=MetaSearch(key="key", value=""))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Give an example with metadata=[MetaSearch(key="my_key", value="my_value"), MetaSearch(key="my_key2")]
to show how to combine several metadata requirements.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

checksum: Optional[str] = None,
key_vals: Optional[dict] = None,
) -> list[dict]:
metadata: Union[None, MetaSearch, list[MetaSearch], list[tuple]] = None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now when I see the documentation and how we can explain the MetaSearch, I think we could also drop the tuple for the metadata. Don't be mad, I am aware that I was the reason for adding it to the function. But now I think it is more clear when we only use the MetaSearch.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MetaSearch is a (named) tuple, so actually list[tuple] will work regardless. I think we should keep it like this, but use MetaSearch in all our documentation.

ibridges/search.py Outdated Show resolved Hide resolved
@jjkoehorst
Copy link

I like the new structure of the code. It makes it much more readable. I think we need to add some good test examples to see whether the search retrieves all or not too many results.

  • path_pattern = %/my_coll/%books, where books is a collection
  • path_pattern = %/my_coll/%book.txt
  • Testing the % in the metadata and in the checksum.
  • ...

Will this also work?

path_pattern = %/my_coll/%books, where books is a data object?

In response to #243. For me personally it took me a while to figure out why can't it find %fancy%path%bla.txt? as I was missing a magic / ? Would it be possible to have a collection= and data= optional search so you can more clearly specify what you want or is that making things more difficult for users?

@chStaiger
Copy link
Member

I like the new structure of the code. It makes it much more readable. I think we need to add some good test examples to see whether the search retrieves all or not too many results.

  • path_pattern = %/my_coll/%books, where books is a collection
  • path_pattern = %/my_coll/%book.txt
  • Testing the % in the metadata and in the checksum.
  • ...

Will this also work?

path_pattern = %/my_coll/%books, where books is a data object?

In response to #243. For me personally it took me a while to figure out why can't it find %fancy%path%bla.txt? as I was missing a magic / ? Would it be possible to have a collection= and data= optional search so you can more clearly specify what you want or is that making things more difficult for users?

Yes that is true. To denote that you are looking for subcollections, you will need to add the "/". I also do not particularly like it, but it is in line with the icommands.
We will see what we can do about it, as the wildcard % does not seem to include the /.

@jjkoehorst
Copy link

How is it inline with iCommands? Just curious as iquest:

iquest --no-page "SELECT COLL_NAME, DATA_NAME WHERE COLL_NAME LIKE '%humann3%' AND DATA_NAME LIKE '%'" makes the distinction?

@chStaiger
Copy link
Member

How is it inline with iCommands? Just curious as iquest:

iquest --no-page "SELECT COLL_NAME, DATA_NAME WHERE COLL_NAME LIKE '%humann3%' AND DATA_NAME LIKE '%'" makes the distinction?

I just did some testing. You do not need to the "/". You will get all collections, subcollection and data objects with the substring denoted in path_pattern. Here the output of our CLI:

christine@sibelius iBridges % ibridges search --path-pattern "%demo%"
/nluu12p/home/research-test-christine/demo/demofile.txt
/nluu12p/home/research-test-christine/new_coll/demofile.txt
/nluu12p/home/research-test-christine/demo
/nluu12p/home/research-test-christine/demofile.txt

@trel
Copy link
Contributor

trel commented Aug 15, 2024

Does it also pick up this file?

/nluu12p/home/research-test-christine/demo/another.txt

@trel
Copy link
Contributor

trel commented Aug 15, 2024

and this collection?

/nluu12p/home/research-test-christine/demon/

@chStaiger
Copy link
Member

chStaiger commented Aug 15, 2024

@trel The first one not, but the demon yes.
The problem we are thinking about is, that the pattern can be split out over the COLL_NAME and the DATA_NAME.
So how to formulate this with the querying module from the PRC?!

@trel
Copy link
Contributor

trel commented Aug 15, 2024

I think you may have to perform two queries? One for collections and one for data objects? @d-w-moore?

@chStaiger
Copy link
Member

chStaiger commented Aug 15, 2024

I already identified three:

#Get all data objects on a path -pattern
"select COLL_NAME, DATA_NAME where COLL_NAME like '%demo%'"
#Get all data objects adhering to the path-pattern"
iquest "select COLL_NAME, DATA_NAME where DATA_NAME like '%demo%'"
#Get all collections adhering to the path-pattern"
iquest "select COLL_NAME where COLL_NAME like '%demo%'"

But if we assume that the path-pattern can be distributed over collection paths and object names, we end up with a lot of queries ;)

@qubixes
Copy link
Collaborator Author

qubixes commented Aug 15, 2024

I think you may have to perform two queries? One for collections and one for data objects? @d-w-moore?

I think more depending on the number of wildcards in the pattern. The problem is splitting up the pattern into one part for the collection and another part for the data object. Let's say we have the pattern "%a%b%c":

  • %a%b%c collection
  • %a%b% collection, c data object
  • %a%b collection, %c data object
  • etc

I think we will have to require users to put a slash in the pattern to tell us where the separation is between the collection part of the pattern and where the data object part is of the pattern. We could allow multiple patterns, but I'm not sure how much that would actually solve in this case. Ah @chStaiger is still awake as well..

@trel
Copy link
Contributor

trel commented Aug 16, 2024

Is it limiting to disallow forward slashes in the search?

Can you just have them insert alphanumeric characters and then you prepend and append the % before searching? And then you're back to just two queries...

Scenario... looking for counterexamples...

User enters foo and you search for:

  1. select COLL_NAME, DATA_NAME where DATA_NAME like '%foo%'
  2. select COLL_NAME where COLL_NAME like '%foo%'

Is that sufficient? Until you give them an 'advanced search' where they can actually specify more clearly what they are looking for with their own column names and wildcards?

@chStaiger
Copy link
Member

chStaiger commented Aug 16, 2024

That does not cover all scenarios. Assume someone is looking only for '.txt' files in a collection. Those two searches would not be able to pick that up. So we would at least need the third query where we compare the DATA_NAME to the pattern or the last segment of the pattern when the '/' is used.

What we could do is to introduce another parameter: name_pattern
Then we would only need two queries:

select COLL_NAME, DATA_NAME where COLL_NAME like path_pattern and DATA_NAME like name_pattern
select COLL_NAME where COLL_NAME like path_pattern +'/'+ name_pattern

@qubixes What do you think about that? It would make the code much clearer than having three queries of which two are data queries.

@qubixes
Copy link
Collaborator Author

qubixes commented Aug 16, 2024

@chStaiger I would prefer not to have more parameters, especially if it isn't immediately clear to our (novice) users what it means.

I think we should keep it the way it is for now (the behavior hasn't changed from the current release version). All queries that a user might want to do (and are possible) are currently covered as far as I can see. I do think it will be useful to think about how this can be documented. Perhaps we can have a page with wildcard examples showing what will match and what will not match.

@chStaiger chStaiger self-requested a review August 19, 2024 14:04
@chStaiger chStaiger merged commit 5308f2a into develop Aug 19, 2024
9 checks passed
@chStaiger chStaiger deleted the update-search branch August 19, 2024 14:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Refactor Search Add search functionality to the CLI Improve search_data results
4 participants