WIP glob feature #936
base: main
…ame with 'glob_filename' and 'glob_count' columns.
…lob_count from a log .darshan file
Let me start with a bit of code review/design considerations and then I'll make a checklist of some things to try, especially since I'll be away next week. If I get carried away the Argonne folks can suggest directions they think are more promising than others since they work with filepaths a lot.
The first thing I tried was, of course, to run your code on a sample log file in the logs repo, `e3sm_io_heatmap_only.darshan`. Here is what the small table for that fairly simple log looks like:
There are some immediately obvious things that seem possible to improve here:
- The filepath with a glob count of `1`, `/projects/radix-io/E3SM-IO-inputs/.*`, doesn't make sense--why would you regex-condense a single file? There's no value in removing that information here; may as well tell the user what the file is.
- The filepath with a glob count of `2`, `/projects/radix-io/snyder/e3sm/.*`, doesn't really tell me much about what comes after, since it uses the broadest regular expression metacharacter combination possible, `.*`.
If we print out the names of the 2 grouped files in the original data structure:
/projects/radix-io/snyder/e3sm/can_I_out_h0.nc
/projects/radix-io/snyder/e3sm/can_I_out_h1.nc
we can start to design tests (yay!) to make our output closer to something that gives the user more information about the filenames in that group. (This is only rough pseudo-code meant to inspire you; it is not meant to be copy-pasted without consideration for what is going on--you will need to carefully modify the code and think about where to place modules/imports, etc.):
```python
import os

import darshan
from darshan.log_utils import get_log_path
# incorporate your module into darshan properly
# instead of this:
import glob_feature

import pandas as pd
from pandas.testing import assert_frame_equal
import pytest


@pytest.mark.parametrize("log_name, expected_df", [
    # grow this with more logs...
    ("e3sm_io_heatmap_only.darshan",
     pd.DataFrame({"filename_glob":
                   # NOTE: usage of \d digit metacharacter in regex pattern
                   # and preservation of all text that is identical in the glob
                   # group instead of using `.*` everywhere, to convey
                   # the filepath information with sufficient granularity
                   ["/projects/radix-io/snyder/e3sm/can_I_out_h\d.nc",
                    # NOTE: for the single file group, match the entire path,
                    # no point in hiding the file name...
                    "/projects/radix-io/E3SM-IO-inputs/i_case_1344p.nc"],
                   "glob_count": [2, 1]})),
])
def test_glob_tables(tmpdir, log_name, expected_df):
    # test the glob table HTML outputs for various
    # log files in the logs repo (and new log files
    # that you creatively design yourself)
    log_path = get_log_path(log_name)
    with tmpdir.as_cwd():
        cwd = os.getcwd()
        # TODO: you shouldn't have a hardcoded HTML filename
        # like this...
        outfile = os.path.join(cwd, "name_record_glob_hd5f.html")
        glob_feature.main(log_path, outfile)
        actual_table = pd.read_html(outfile)[0]
        assert_frame_equal(actual_table, expected_df)
```
You'll want to study that test structure until you understand how I'm enforcing the table output that I want from your code. Of course, if we run it now, it will fail as expected because of the regex issues described above:
pytest test_globber.py
E AssertionError: DataFrame.iloc[:, 0] (column name="filename_glob") are different
E
E DataFrame.iloc[:, 0] (column name="filename_glob") values are different (100.0 %)
E [index]: [0, 1]
E [left]: [/projects/radix-io/snyder/e3sm/.*, /projects/radix-io/E3SM-IO-inputs/.*]
E [right]: [/projects/radix-io/snyder/e3sm/can_I_out_h\d.nc, /projects/radix-io/E3SM-IO-inputs/i_case_1344p.nc]
E At positional index 0, first diff: /projects/radix-io/snyder/e3sm/.* != /projects/radix-io/snyder/e3sm/can_I_out_h\d.nc
pandas/_libs/testing.pyx:172: AssertionError
You can see it complaining about the mismatch between your regexes/glob patterns and my desired output--great! Now you have a programmatic way to enforce correctness, and you can expand it to more log files as you improve the code so that you don't regress, which helps prevent you from "overfitting" to the properties of a single log file.
So, to summarize a step-by-step procedure to move the effort forward:
- write a simple regression test for a single/simple log file, and then adjust your code to make it pass, and try not to cheat--i.e., use `difflib` or other creative/interesting ideas to compare the strings and programmatically generate the specific regex where the group members differ
- when that initial test passes, commit your code so you don't lose your progress, then expand the test (look into the `pytest.mark.parametrize` docs, use `git grep` on the codebase, etc.) to include another log file in the `darshan-logs` repo
- continue with this cycle until you have tests passing for the desired output for each of the log files in the logs repo
Some broader advice:
- read the `difflib` docs carefully, there's a lot of power there
- read the original algorithm paper published in the late 1980s by Ratcliff and Obershelp, on which `difflib` is built, or at least the Wikipedia article: https://en.wikipedia.org/wiki/Gestalt_pattern_matching
- look for other papers that cite that technique on, e.g., Google Scholar--a bit like citation-map searching--for other recent ideas/modern techniques to consider as well
- `git_project/glob_feature/glob_feature.py` is a bit of a silly place to place your module long-term; it is not even in a Python project at the moment, so you'll likely have trouble following some of my testing designs until you `git grep` around the code base a little and find a nice place to move it to (along with the testing module(s) you write)--the early parts of this doc and the links therein may help: https://packaging.python.org/en/latest/tutorials/packaging-projects/
- start exploring other clustering algorithms with internet searches: https://stats.stackexchange.com/questions/123060/clustering-a-long-list-of-strings-words-into-similarity-groups, https://stackoverflow.com/questions/69006583/how-to-kmeans-cluster-strings, and look into some of the Python machine learning libraries like https://scikit-learn.org/stable/modules/clustering.html to see if they help give a "second opinion" in the complex cases that `difflib` can't handle (I don't think the example I initially provided is complex enough for that, but other cases may be)
- if you manage to get tests passing for all logs in the logs repo, with expected values having nice/reasonably specific regexes (not `.*` everywhere), then you can really start with the crazy tests, like writing your own `mpi4py` scripts that write random filepaths/unrelated filepaths/synthetically related filepaths (re: our initial exercises writing and monitoring those scripts with `darshan-runtime`)
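On the clustering angle, even before reaching for scikit-learn, a stdlib-only baseline is possible; the `group_paths` helper below and its `0.8` threshold are illustrative assumptions of mine, not part of the PR:

```python
import difflib


def group_paths(paths, threshold=0.8):
    # naive greedy grouping by SequenceMatcher similarity ratio --
    # a stdlib stand-in for the fancier clustering libraries linked
    # above; the threshold is an arbitrary knob, not a tuned value
    groups = []
    for path in paths:
        for group in groups:
            if difflib.SequenceMatcher(a=group[0], b=path).ratio() >= threshold:
                group.append(path)
                break
        else:
            groups.append([path])
    return groups


paths = [
    "/projects/radix-io/snyder/e3sm/can_I_out_h0.nc",
    "/projects/radix-io/snyder/e3sm/can_I_out_h1.nc",
    "/projects/radix-io/E3SM-IO-inputs/i_case_1344p.nc",
]
print([len(g) for g in group_paths(paths)])  # → [2, 1]
```

A "second opinion" from a proper clustering algorithm would replace this greedy single pass, but the pairwise-similarity matrix it needs can be built from the same `ratio()` calls.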
That should be enough to keep you busy while I'm away next week (I certainly couldn't do that in a week!), and perhaps the Argonne folks can help with the occasional simple question on, e.g., Slack, or if pinged here.
```python
html = style.to_html()

# can change name of the output html report here
with open("name_record_glob_hd5f.html", "w") as html_file:
```
You actually can't change the name on the command line with `--output-path` anymore because you've hardcoded this, so you should decide if you want to keep that feature or automatically name the output HTML table, then write a test for it using `pytest` so it doesn't break in the future.
```python
df = df.sort_values(by="glob_count", ascending=False)


def find_common_prefix(paths):
```
There's no reason to nest this function inside of `regex_df_condenser`--it makes it harder to test because it isn't in the main namespace anymore, and it is also just harder to read when nested like this.
```python
def main(log_path, output_path):
```
the `output_path` argument isn't used in this function anymore
```python
def regex_df_condenser(df, paths):
    path_grouper_func = make_path_grouper()
```
This function seems to discard the filenames, so you may need to preserve them in another data structure, i.e., `df_orig = df.copy()`, and then use `difflib` to also compare the filenames later below (`difflib` really is great, but it requires quite some effort to get it to do exactly what you want).
This function is also pretty confusing--I think you should perhaps write it out on paper or the whiteboard first and then add comments to each section explaining how it will get you to this (or similar) final output for `e3sm_io_heatmap_only.darshan`:
Also, now that I look at it, it is perhaps a bit hard to see the regex metacharacter `\d` in the path, so it may make sense to underline/highlight it somehow, though that's more of an aesthetic concern for a bit later.
…r e3sm_io_heatmap_only.darshan log file. Also, relocated the script to a more suitable spot among other Python files in the project.
Agreed on Tyler's feedback, specifically:

So, in the 2nd case there's probably some additional work needed to find out exactly where the "matching" globs differ, with us clearly marking where this difference occurs. In my opinion, I don't think we need to be super elaborate in terms of how we indicate what the non-matching parts of a glob are. E.g., above I don't think we need to mark that the `[.*]` portion of the glob is a number (`\d`). My thought process there is just that it seems like it could produce some complicated regular expressions that might be confusing for users. I think most helpful is just knowing exactly where the globs differ, not necessarily how.

You could also maybe offer a verbose option so users could have all of the filenames associated with a glob printed out to see exactly what is included, or something like that.

I'll try to share some other logs I have that have much more data (i.e., filenames to glob) that we could use to see how we handle more complicated use cases--I have them, just need to make them available at the darshan-logs repo.
I think the main additions I'd want to see near-term are:
```python
    {"selector": "th", "props": [("border", "1px solid grey")]}
])

style = style.hide_index()
```
As you pointed out on Slack, this can fail sometimes, probably depending on the `pandas` version in use. It looks like it was deprecated and removed--for more information on how you might solve this, see pandas-dev/pandas#43771. You may need to check the `pandas` version string and make a decision based on that.
Sample traceback:
Traceback (most recent call last):
File "/Users/treddy/github_projects/darshan/darshan-util/pydarshan/darshan/glob_feature/glob_feature.py", line 89, in <module>
main(log_path=args.log_path, output_path=args.output_path)
File "/Users/treddy/github_projects/darshan/darshan-util/pydarshan/darshan/glob_feature/glob_feature.py", line 77, in main
style = style.hide_index()
^^^^^^^^^^^^^^^^
AttributeError: 'Styler' object has no attribute 'hide_index'. Did you mean: 'hide_index_'?
Also, `to_html()` already has an argument to suppress the index, so I'm a bit confused why we'd even want to mess around with the styler at all here.
Studying the code base is often helpful, i.e., `git grep -E -i "to_html"` in the root of the repo will show some helpful examples.
```python
])

style = style.hide_index()
html = style.render()
```
why is `render` being used here? Isn't that a Jupyter notebook thing? I'm pretty sure `to_html()` is what we want. You may need to study the docs here a bit: https://pandas.pydata.org/docs/user_guide/style.html and https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_html.html
For example, if I fix your `hide_index` issue above locally, I then get this traceback, which is another reason we'd want to see the tests running/passing before we start to iterate ourselves (to make sure it works on a basic level):
Traceback (most recent call last):
File "/Users/treddy/github_projects/darshan/darshan-util/pydarshan/darshan/glob_feature/glob_feature.py", line 89, in <module>
main(log_path=args.log_path, output_path=args.output_path)
File "/Users/treddy/github_projects/darshan/darshan-util/pydarshan/darshan/glob_feature/glob_feature.py", line 78, in main
html = style.render()
^^^^^^^^^^^^
AttributeError: 'Styler' object has no attribute 'render'. Did you mean: '_render'?
```python
html = style.render()

with open(output_path, "w") as html_file:
    html_file.write(html)
```
If I fix both of the above issues and then try to run your code via the Python command line route, I still get an error:
python glob_feature.py -p ~/github_projects/darshan-logs/darshan_logs/e3sm_io_heatmaps_and_dxt/e3sm_io_heatmap_only.darshan
Traceback (most recent call last):
File "/Users/treddy/github_projects/darshan/darshan-util/pydarshan/darshan/glob_feature/glob_feature.py", line 90, in <module>
main(log_path=args.log_path, output_path=args.output_path)
File "/Users/treddy/github_projects/darshan/darshan-util/pydarshan/darshan/glob_feature/glob_feature.py", line 81, in main
with open(output_path, "w") as html_file:
^^^^^^^^^^^^^^^^^^^^^^
TypeError: expected str, bytes or os.PathLike object, not NoneType
Some of the testing I mentioned a few weeks ago about handling the various `output_path` modalities seems to be missing? You'll want tests for the command-line and module-based incantations to make sure they work as you iterate on your code.
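A sketch of how those two modalities might be handled in one place; the `parse_args` helper and the derived `_glob_table.html` name are hypothetical, not the PR's actual interface:

```python
import argparse
import os


def parse_args(argv=None):
    # an explicit --output-path wins; otherwise derive a default
    # name from the log file instead of hardcoding one
    parser = argparse.ArgumentParser()
    parser.add_argument("-p", "--log-path", required=True)
    parser.add_argument("-o", "--output-path", default=None)
    args = parser.parse_args(argv)
    if args.output_path is None:
        stem = os.path.splitext(os.path.basename(args.log_path))[0]
        args.output_path = f"{stem}_glob_table.html"
    return args


args = parse_args(["-p", "e3sm_io_heatmap_only.darshan"])
print(args.output_path)  # → e3sm_io_heatmap_only_glob_table.html
```

Because `parse_args` takes an `argv` list, both modalities become trivially testable with `pytest.mark.parametrize` over argument lists.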
If I hack around some of the bugs noted above, I can finally get your script to produce a table. That looks a bit closer to what we want, though I'm not sure why you decided to remove the heatmap-style coloring from the table? The process for moving forward might then be to ask the Argonne folks for feedback on that table given the raw/original filenames:
From Shane's feedback above, it sounds like we might want to switch
I'll add a few review comments now that Yaris has pushed up a first attempt at working with
There are a few things to address here:
So, the module at path

```
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        deleted:    glob_feature.py
```
Now we get the same failure that you can see in the CI after you pushed up. When you see failures in the CI like that, it usually means there's a problem with your pull request/code:
Ok, so let's fix that error and try again:

```diff
(py_311_darshan_dev) treddy@pn2305118 pydarshan % git diff
diff --git a/darshan-util/pydarshan/darshan/tests/test_glob_feature.py b/darshan-util/pydarshan/darshan/tests/test_glob_feature.py
index 8a779857..b26cf309 100644
--- a/darshan-util/pydarshan/darshan/tests/test_glob_feature.py
+++ b/darshan-util/pydarshan/darshan/tests/test_glob_feature.py
@@ -6,10 +6,8 @@ print(pd.__version__)
 from pandas.testing import assert_frame_equal
 import pytest
 import re
-print(sys.path) # Print sys.path again
 import glob_feature
-print("hello")
 @pytest.mark.parametrize("log_name, expected_df", [
     # grow this with more logs...
     ("e3sm_io_heatmap_only.darshan",
```

Now, we run the test again and get another failure--it looks like you're still having issues with Python packaging because the import doesn't work:
Well, the import works for every single other module under test in the project, so why not just study one of them as an example?
It looks like those
Ok, let's add that in, so:
Try the testing again, still get
So, for us, maybe something like:

```diff
--- a/darshan-util/pydarshan/darshan/tests/test_glob_feature.py
+++ b/darshan-util/pydarshan/darshan/tests/test_glob_feature.py
@@ -6,7 +6,7 @@ print(pd.__version__)
 from pandas.testing import assert_frame_equal
 import pytest
 import re
-import glob_feature
+from darshan.glob_feature import glob_feature
 @pytest.mark.parametrize("log_name, expected_df", [
     # grow this with more logs...
```

And now the test will actually run--of course it fails, and we've already discussed this failure before:
Part of the confusion may be that you had two copies of the module that weren't synced (that's always a terrible idea!), or maybe you copy-pasted something without genuinely understanding it. In any case, this should help you move forward quite a bit, since if your local env is set up properly, the collection of changes above should allow you to run the test with:
There are quite a few other issues in the code, and one is so major that I'll add an inline comment right away, but the others I'm hoping you'll study carefully and clean up a bit.
```python
outfile = os.path.join(cwd, "name_record_glob_hd5f.html")
glob_feature.main(log_path, outfile)
actual_table = pd.read_html(outfile)[0]
actual_table.drop("Unnamed: 0", axis=1, inplace=True)  # Drop the "Unnamed: 0" column
```
Why are you doing this? If there's a problem in the actual table, there's a problem in your actual code, and you should fix the code, not conceal the problem by mutating the test?
My guess would be the index column perhaps, I'm not convinced you got a full understanding of the html conversion process just yet vs. pandas versions, so maybe look into that a bit more.
I'll wait for another round of cleanups before commenting again--there are a lot of changes I suggested, and having two separate copies of the new code module in the same branch is basically insanity to review, so I won't try to guess what you're doing there. But hopefully I've given you a boost here at least...
Possible checklist for updating the PR might be something like:
…ed __init__.py to glob_feature. Fixed errors in test_glob_feature.py.
Looks like a step in the right direction now.
Yes, you should only have one
As for algorithms, I only sent you the
Also, this is still broken on your branch, so you should try to fix it and add a regression test using
The incantation with
Finally, the CI is failing for this PR, so you should try to add in the appropriate dependency for now, so that we can see your tests passing or failing appropriately.
In case it wasn't clear from the previous meeting, here is some guidance on the approach to testing/vision for moving forward:
At the end of the project, even if the final product from the summer is basically just a bunch of agreed-upon but very well crafted test cases, that's already considerable value delivered, since the task of identifying exactly what folks want is a big part of the challenge of a software project, and encoding that in tests will make it easier to move forward.
… uses agglomerative hierarchal clustering for grouping. The test_glob_feature.py expanded to more log files. The dependencies for these scripts were added to main_ci.yml.
Description of Changes:

Additionally, I have expanded the test_glob_feature.py to include more log files from the darshan log repository. This expansion ensures a more comprehensive testing process, covering a broader range of scenarios and data. I've added the necessary dependencies to the main_ci.yml.

Next Steps:

Moreover, I am currently exploring ways to display more commonalities within the representing file path. While the current implementation effectively finds differences and replaces them with `[.*]`, there is room for improvement in handling commonalities at the end of file paths. For example, if multiple file paths have common elements at their endings but different positions, these commonalities are currently not accounted for in the representing file path. To address this, I am considering indexing from the back to the front and concatenating the common elements to the merged_path, which would create a more informative and accurate representing file path.

Here is an example of the issue: the representing filepath would be
Even though the file paths all contain 000 in the last directory. Any insight on how to approach this problem?

Here are the lines of code that deal with switching differing characters within the group with `[.*]` and finding whether the paths within a group have a common file extension:

# these lines create the representing file path for a group
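One possible shape for the back-to-front idea sketched above; the `common_suffix_parts` helper name and the example paths are made up for illustration (the real paths in question were not captured here):

```python
import os


def common_suffix_parts(paths):
    # hypothetical helper: walk each path's components back to front
    # and collect the trailing pieces shared by every path in the group
    split_paths = [p.split(os.sep) for p in paths]
    shared = []
    for parts in zip(*(reversed(s) for s in split_paths)):
        if len(set(parts)) == 1:
            shared.append(parts[0])
        else:
            break
    return list(reversed(shared))


# made-up paths that differ mid-path but share a "000" directory
# and a filename at the end, like the case described above
paths = [
    "/data/run_a/000/output.nc",
    "/data/run_b/000/output.nc",
]
print(common_suffix_parts(paths))  # → ['000', 'output.nc']
```

The shared trailing components could then be appended verbatim to the merged path instead of being swallowed by the `[.*]` placeholder.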
```python
print(expected_df)

# Compare the two DataFrames
diff = actual_table['filename_glob'].compare(expected_df['filename_glob'])
```
This line is causing problems--I see this locally:
# Compare the two DataFrames
> diff = actual_table['filename_glob'].compare(expected_df['filename_glob'])
/Users/treddy/github_projects/darshan/darshan-util/pydarshan/darshan/tests/test_glob_feature.py:301:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/Users/treddy/python_venvs/py_311_darshan_dev/lib/python3.11/site-packages/pandas/core/series.py:3185: in compare
return super().compare(
/Users/treddy/python_venvs/py_311_darshan_dev/lib/python3.11/site-packages/pandas/core/generic.py:9212: in compare
mask = ~((self == other) | (self.isna() & other.isna()))
/Users/treddy/python_venvs/py_311_darshan_dev/lib/python3.11/site-packages/pandas/core/ops/common.py:81: in new_method
return method(self, other)
/Users/treddy/python_venvs/py_311_darshan_dev/lib/python3.11/site-packages/pandas/core/arraylike.py:40: in __eq__
return self._cmp_method(other, operator.eq)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = 0 /home/shane/software/ior/build/testFile
1 /home/shane/software/ior/build/testFile:/Datas...
Name: filename_glob, dtype: object
other = 0 /home/shane/software/ior/build/testFile[.*]
Name: filename_glob, dtype: object, op = <built-in function eq>
def _cmp_method(self, other, op):
res_name = ops.get_op_result_name(self, other)
if isinstance(other, Series) and not self._indexed_same(other):
> raise ValueError("Can only compare identically-labeled Series objects")
E ValueError: Can only compare identically-labeled Series objects
/Users/treddy/python_venvs/py_311_darshan_dev/lib/python3.11/site-packages/pandas/core/series.py:6090: ValueError
There's really no reason to have this line present anymore; the `assert_frame_equal` should suffice.
If I delete that stuff, then I see a bunch of test failures like this:
FAILED tests/test_glob_feature.py::test_glob_tables[hdf5_diagonal_write_bytes_range_dxt.darshan-expected_df6] - AssertionError: DataFrame are different
FAILED tests/test_glob_feature.py::test_glob_tables[runtime_and_dxt_heatmaps_diagonal_write_only.darshan-expected_df20] - AssertionError: DataFrame are different
FAILED tests/test_glob_feature.py::test_glob_tables[hdf5_diagonal_write_half_flush_dxt.darshan-expected_df7] - AssertionError: DataFrame are different
FAILED tests/test_glob_feature.py::test_glob_tables[treddy_runtime_heatmap_inactive_ranks.darshan-expected_df21] - ValueError: empty vocabulary; perhaps the documents only contain stop words
FAILED tests/test_glob_feature.py::test_glob_tables[hdf5_diagonal_write_half_ranks_dxt.darshan-expected_df8] - AssertionError: DataFrame are different
FAILED tests/test_glob_feature.py::test_glob_tables[hdf5_file_opens_only.darshan-expected_df9] - AssertionError: DataFrame are different
FAILED tests/test_glob_feature.py::test_glob_tables[darshan-apmpi-2nodes-64mpi.darshan-expected_df2] - AssertionError: DataFrame are different
FAILED tests/test_glob_feature.py::test_glob_tables[hdf5_diagonal_write_1_byte_dxt.darshan-expected_df5] - AssertionError: DataFrame are different
FAILED tests/test_glob_feature.py::test_glob_tables[imbalanced-io.darshan-expected_df11] - AssertionError: DataFrame are different
FAILED tests/test_glob_feature.py::test_glob_tables[shane_ior-HDF5_id438090-438090_11-9-41522-17417065676046418211_1.darshan-expected_df12] - AssertionError: DataFrame are different
FAILED tests/test_glob_feature.py::test_glob_tables[shane_ior-PNETCDF_id438100-438100_11-9-41525-10280033558448664385_1.darshan-expected_df13] - AssertionError: DataFrame are different
```python
     "glob_count": [2, 1]})),


("hdf5_diagonal_write_1_byte_dxt.darshan",
```
Taking this log file as an example of something you might want to get feedback from the Argonne team on.
The original DataFrame seems to be this:
6329959920961583796 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/__init__.py
12081676647536809530 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/__pycache__/__init__.cpython-38.pyc
9019270385924777942 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_errors.cpython-38-x86_64-linux-gnu.so
17233738746347876168 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/version.py
10563276375451711199 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/__pycache__/version.cpython-38.pyc
16409530009729583527 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5.cpython-38-x86_64-linux-gnu.so
13248362266925776299 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/defs.cpython-38-x86_64-linux-gnu.so
4638787767486382508 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_objects.cpython-38-x86_64-linux-gnu.so
10170641380995794425 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5py_warnings.py
8697898997669214892 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/__pycache__/h5py_warnings.cpython-38.pyc
7051742911845937566 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_conv.cpython-38-x86_64-linux-gnu.so
9236461081719577833 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5r.cpython-38-x86_64-linux-gnu.so
2181399625010544346 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5t.cpython-38-x86_64-linux-gnu.so
6528078802042034082 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5p.cpython-38-x86_64-linux-gnu.so
2575676175211135484 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5s.cpython-38-x86_64-linux-gnu.so
11727641573666397676 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/utils.cpython-38-x86_64-linux-gnu.so
17901672647343149144 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5ac.cpython-38-x86_64-linux-gnu.so
9518436409801048664 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/logging/__init__.py
2863652681302921111 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/logging/__pycache__/__init__.cpython-38.pyc
2204154745069321427 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/traceback.py
555215874445445411 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/__pycache__/traceback.cpython-38.pyc
3096762644015249429 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/string.py
3433981731425580030 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/__pycache__/string.cpython-38.pyc
6622959456894657244 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5z.cpython-38-x86_64-linux-gnu.so
16198910964451933166 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5a.cpython-38-x86_64-linux-gnu.so
12844965888207067364 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_proxy.cpython-38-x86_64-linux-gnu.so
8320392657652582313 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5d.cpython-38-x86_64-linux-gnu.so
2182783899281604896 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5ds.cpython-38-x86_64-linux-gnu.so
6824284610368689248 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5f.cpython-38-x86_64-linux-gnu.so
2255611827735696210 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5g.cpython-38-x86_64-linux-gnu.so
7369759469852332435 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5i.cpython-38-x86_64-linux-gnu.so
3786189409375561029 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5fd.cpython-38-x86_64-linux-gnu.so
9453911172532451432 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5pl.cpython-38-x86_64-linux-gnu.so
7281569168338684097 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/__init__.py
6456765083458423379 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/__pycache__/__init__.cpython-38.pyc
1619390675886412049 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/filters.py
6004586721157325469 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/__pycache__/filters.cpython-38.pyc
7590428293607185835 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/compat.py
941427344463918395 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/__pycache__/compat.cpython-38.pyc
7880173344308596059 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/base.py
412714883876212763 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/__pycache__/base.cpython-38.pyc
8629723084015159315 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/files.py
1259607287293063192 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/__pycache__/files.cpython-38.pyc
1754798185250444829 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/group.py
14745068672433573323 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/__pycache__/group.cpython-38.pyc
8305533062347777536 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5o.cpython-38-x86_64-linux-gnu.so
15958169687914591130 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5l.cpython-38-x86_64-linux-gnu.so
11692197895731416386 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/dataset.py
18374151976672635155 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/__pycache__/dataset.cpython-38.pyc
5947139205337478310 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_selector.cpython-38-x86_64-linux-gnu.so
4123734041483639167 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/selections.py
17375372070410440550 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/__pycache__/selections.cpython-38.pyc
11306757759007388579 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/selections2.py
5194722322225707588 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/__pycache__/selections2.cpython-38.pyc
13359638461944338488 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/datatype.py
17244020578242913766 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/__pycache__/datatype.cpython-38.pyc
16568253267388065553 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/vds.py
8979639072945830616 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/__pycache__/vds.cpython-38.pyc
15318755126620508599 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/copy.py
12711741908924247827 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/__pycache__/copy.cpython-38.pyc
7714427140906174735 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/attrs.py
9344690504834470365 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/__pycache__/attrs.cpython-38.pyc
16918611628337716751 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/uuid.py
8511949921801793794 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/__pycache__/uuid.cpython-38.pyc
2130744796581944128 /yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_0.h5
17389075414465034664 /yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_0.h5:/dataset
4223954931879612592 /yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_1.h5
14104557419321271749 /yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_1.h5:/dataset
3586708446079182075 /yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_2.h5
14252080315496364280 /yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_2.h5:/dataset
17916052597340637289 /yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_3.h5
15211073457117916342 /yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_3.h5:/dataset
9342247733453943503 /yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_4.h5
10985895144053482380 /yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_4.h5:/dataset
4898891139486729695 /yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_5.h5
15357467061712271470 /yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_5.h5:/dataset
3386331313084499192 /yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_6.h5
409581163713740906 /yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_6.h5:/dataset
7972149951920014792 /yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_7.h5
2753966606541326699 /yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_7.h5:/dataset
5562527676113658148 /yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_8.h5
6263533231926069969 /yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_8.h5:/dataset
8261993854308713163 /yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_9.h5
1478452523172783126 /yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_9.h5:/dataset
That table has file paths including the following:
8320392657652582313 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5d.cpython-38-x86_64-linux-gnu.so
2182783899281604896 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5ds.cpython-38-x86_64-linux-gnu.so
6824284610368689248 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5f.cpython-38-x86_64-linux-gnu.so
2255611827735696210 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5g.cpython-38-x86_64-linux-gnu.so
7369759469852332435 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5i.cpython-38-x86_64-linux-gnu.so
3786189409375561029 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5fd.cpython-38-x86_64-linux-gnu.so
9453911172532451432 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5pl.cpython-38-x86_64-linux-gnu.so
But the granularity in the test below is obviously coarser than clustering that group individually, so it may be good to ask the Argonne team what they want in this case. Your actual and expected values still differ here as well, so I'm curious whether you still maintain that the "expected" value is the right one here?
actual_table:
filename_glob glob_count
0 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/[.*] 64
1 /yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_[.*].h5[.*] 20
expected_df:
filename_glob glob_count
0 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/[.*] 54
1 /yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_[.*].h5[.*] 20
2 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/[.*] 10
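A rough sketch of the kind of grouping logic those tests could exercise — collapse paths that share a parent directory into a `<dir>/.*` pattern, but leave a lone file as its literal path (per the point above that regex-condensing a single file removes information). This is illustrative only, not the PR's actual algorithm, and the singleton path below is hypothetical:

```python
import os
from collections import Counter

def glob_by_directory(paths, min_group=2):
    # Count how many of the input paths live in each parent directory.
    dir_counts = Counter(os.path.dirname(p) for p in paths)
    rows, seen = [], set()
    for p in sorted(set(paths)):
        d = os.path.dirname(p)
        if dir_counts[d] >= min_group:
            pattern = d + "/.*"
            if pattern not in seen:
                seen.add(pattern)
                rows.append((pattern, dir_counts[d]))
        else:
            # A lone file carries more information as a literal path.
            rows.append((p, 1))
    # Sort like the report table: most-matched globs first.
    return sorted(rows, key=lambda r: -r[1])

paths = [
    "/projects/radix-io/snyder/e3sm/can_I_out_h0.nc",
    "/projects/radix-io/snyder/e3sm/can_I_out_h1.nc",
    "/projects/radix-io/E3SM-IO-inputs/some_input_file.nc",  # hypothetical singleton
]
print(glob_by_directory(paths))
# -> [('/projects/radix-io/snyder/e3sm/.*', 2),
#     ('/projects/radix-io/E3SM-IO-inputs/some_input_file.nc', 1)]
```

Tests can then compare this two-column output directly against an expected `(filename_glob, glob_count)` list.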
Hopefully it doesn't throw a wrench in anything, but after looking closer at the filenames for this example, I think I'd actually suggest something like this for the expected output (ignoring the glob counts; I didn't implement this, so I don't have the final counts handy):
/users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/[.*].py
/users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/[.*].pyc
/users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/[.*].so
/yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_[.*].h5
So, at a high level, it tries to find the "common" directories that hold lots of files, then bins the files within those directories by suffix? I think that works well for this particular scenario, but maybe others have different thoughts. Of course, you'll need some algorithmic way to decide when to stop splitting into common subdirectories (and start grouping by suffix), if that makes sense.
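One way to sketch that directory-plus-suffix binning (a simplification: this only bins within a single parent directory, so deciding when to collapse deeper subdirectory trees — the open question above — is not addressed here):

```python
import os
from collections import defaultdict

def glob_by_dir_and_suffix(paths):
    # Bin paths by (parent directory, file extension), producing
    # patterns like '<dir>/[.*]<ext>' for bins with multiple members.
    bins = defaultdict(list)
    for p in paths:
        _, ext = os.path.splitext(p)
        bins[(os.path.dirname(p), ext)].append(p)
    rows = []
    for (d, ext), members in bins.items():
        if len(members) > 1:
            rows.append((f"{d}/[.*]{ext}", len(members)))
        else:
            # Singletons stay literal.
            rows.append((members[0], 1))
    return sorted(rows, key=lambda r: -r[1])
```

Run against a handful of the name records from the table above, this yields one `[.*].py` group for the `python3.8` directory and one `[.*].h5` group for the test files.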
Note that I'm also ignoring the weird HDF5 record names that include the dataset (e.g., '/yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_9.h5:/dataset'). I still think we should explicitly ignore those. The only way I know how to do that is to restrict record names to only those coming from the POSIX and STDIO modules:
- Read in POSIX and STDIO module data
- Remove all name records that don't have a corresponding match in the POSIX/STDIO record list
- Perform analysis on those records
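The filtering step above could look something like this. In PyDarshan the id-to-path mapping would come from the report's name records and the id sets from the POSIX/STDIO module records; here those inputs are passed in as plain dicts/sets, and the record ids are hypothetical:

```python
def filter_posix_stdio(name_records, posix_ids, stdio_ids):
    # Drop any name record (e.g. an HDF5 dataset name like
    # 'test_9.h5:/dataset') that no POSIX or STDIO record refers to.
    keep = set(posix_ids) | set(stdio_ids)
    return {rid: path for rid, path in name_records.items() if rid in keep}

name_records = {
    1: "/yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_1.h5",
    2: "/yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_1.h5:/dataset",
}
# Only record 1 was seen by the POSIX module, so the ':/dataset'
# record is dropped before any glob analysis.
print(filter_posix_stdio(name_records, posix_ids={1}, stdio_ids=set()))
```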
I think most of this table looks good, but I have a different thought on what might work best for the issue above. I'd actually recommend either only:
Or the list of:
Does that simplify anything or make things more complicated? I think either of those options is more helpful than the current output. Note that what we decide there could also affect what we decide to do with these two further up:
There's also a couple of examples that are maybe more complex than they need to be visually (I know that's not always easy to square with what the algorithm is doing):
In both cases, there are a couple of groups of "match anything" blocks ([.*]) separated by only one or two characters. I think they'd look better if they were collapsed down into a single [.*] (throwing out the additional context of the '00' or the 'i'). Not sure if that's easy to accomplish, but maybe you could avoid having such a short run of characters between two [.*] blocks?
@shanedsnyder Here is the same test case (acme_fcase_trace), but after it groups the common prefixes together it then groups by file extension. Would this be the desired output? Here is another example with the hdf5_diagonal_write_1_byte test case:
Yeah, I find those to be a lot more helpful personally, but maybe other folks have different thoughts. My only additional comment is on the first example you shared: I notice that all of the files that don't match a glob (i.e., count of 1, no wildcard) do still share a common prefix that's pretty deep into the path name -- I wonder if in that case we should instead glob them all into one. Thinking out loud a little more: regardless of what we decide above, it might be nice if there was a "verbose" option to the tool that could spit out the entire list of matching files in each glob (maybe indented right beneath the glob patterns in the tables you already have)? That way, even if the tool doesn't group things in a way that's super helpful for users, they can still check which files got matched to each glob, if that makes sense?
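One possible shape for that verbose listing (names and output format here are hypothetical, not what the PR implements):

```python
def format_globs(groups, verbose=False):
    """groups maps a glob pattern -> list of matched paths.  Returns
    report lines: each pattern with its count and, when verbose, the
    matched files indented beneath it."""
    lines = []
    # Most-matched globs first, as in the existing tables.
    for pattern, members in sorted(groups.items(), key=lambda kv: -len(kv[1])):
        lines.append(f"{pattern}  (glob_count={len(members)})")
        if verbose:
            lines.extend(f"    {m}" for m in sorted(members))
    return lines

groups = {
    "/yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_[.*].h5": [
        "/yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_0.h5",
        "/yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_1.h5",
    ],
}
print("\n".join(format_globs(groups, verbose=True)))
```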
…cal clustering and common file extension. It also has a verbose option (-v) which displays all the files within the respective groups. I added some modifications for the ideal values in the test_glob_feature.py script
Related to gh-749
This pull request introduces the `glob_feature.py` script, which enhances the functionality of the existing PyDarshan HTML report. The purpose of this script is to provide more detailed information about the file paths accessed in the Darshan log file specified on the command line.

Currently, the PyDarshan report displays the list of accessed files without specific details. The `glob_feature.py` script aims to address this limitation by generating a condensed, organized data frame with two columns: `filename_glob` and `glob_count`.

The `filename_glob` column contains the file paths recorded in the Darshan log file, providing a clear representation of the accessed files. The `glob_count` column indicates the number of files each path pattern matches, offering insight into the frequency of access.

Furthermore, the script goes beyond the common prefixes of file paths and handles cases where paths diverge past the last common directory. These cases are represented with the `.*` regex pattern, reflecting the diversity of the file paths.

We hope to eventually incorporate this script into the PyDarshan summary report in order to enhance the understanding of file access patterns and provide more comprehensive information to users. Feedback and suggestions for further improvements are welcome.
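For illustration, the two-column data frame described above could be constructed like this (values taken from the e3sm_io_heatmap_only example discussed earlier in the thread; counts are illustrative):

```python
import pandas as pd

# Shape of the condensed table the script produces: one row per glob
# pattern, with the number of file paths that pattern covers.
df = pd.DataFrame(
    {
        "filename_glob": [
            "/projects/radix-io/snyder/e3sm/.*",
            "/projects/radix-io/E3SM-IO-inputs/.*",
        ],
        "glob_count": [2, 1],
    }
)
print(df)
```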