
WIP glob feature #936

Open · wants to merge 11 commits into base: main
Conversation


@yariseidenbenz commented May 31, 2023

Related to gh-749

This pull request introduces the glob_feature.py script, which extends the functionality of the existing PyDarshan HTML report. The purpose of this script is to provide more detailed information about the file paths recorded in the Darshan log specified on the command line.

Currently, the PyDarshan report displays the list of accessed files without specific details. With the glob_feature.py script, we aim to address this limitation by generating a condensed and organized data frame. This data frame consists of two columns: filename_glob and glob_count.

The filename_glob column contains glob patterns summarizing the file paths recorded in the Darshan log, providing a clear representation of the accessed files. The glob_count column indicates how many file paths match each pattern, offering insight into the frequency of access.

Furthermore, this script goes beyond common path prefixes and also handles cases where paths diverge after the last common directory. These divergent portions are represented by the .* regex pattern, reflecting the diversity of the file paths.
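
For illustration, a minimal sketch of the intended two-column layout (the example patterns match the e3sm_io_heatmap_only.darshan discussion below):

import pandas as pd

# illustrative only: two glob patterns and how many accessed paths match each
df = pd.DataFrame({
    "filename_glob": ["/projects/radix-io/snyder/e3sm/.*",
                      "/projects/radix-io/E3SM-IO-inputs/.*"],
    "glob_count": [2, 1],
})
print(df.to_html(index=False))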

We hope to eventually incorporate this script into the PyDarshan summary report, in order to enhance the understanding of file access patterns and provide more comprehensive information to users. Your feedback and suggestions for further improvements are welcome.

@tylerjereddy (Collaborator) left a comment

Let me start with a bit of code review/design considerations and then I'll make a checklist of some things to try, especially since I'll be away next week. If I get carried away the Argonne folks can suggest directions they think are more promising than others since they work with filepaths a lot.

The first thing I tried was of course to run your code on a sample log file in the logs repo, e3sm_io_heatmap_only.darshan.

Here is what the small table for that fairly simple log looks like:
[image: glob table produced for e3sm_io_heatmap_only.darshan]

There are some immediately obvious things that seem possible to improve here:

  1. the filepath with a glob count of 1, /projects/radix-io/E3SM-IO-inputs/.*, doesn't make sense--why would you regex-condense a single file? There's no value in removing that information here, may as well tell the user what the file is.
  2. The filepath with the glob count of 2, /projects/radix-io/snyder/e3sm/.*, doesn't really tell me much about what comes after the prefix, since it uses the broadest regular expression metacharacter combination possible, .*.

If we print out the names of the 2 grouped files in the original data structure:

/projects/radix-io/snyder/e3sm/can_I_out_h0.nc
/projects/radix-io/snyder/e3sm/can_I_out_h1.nc

we can start to design tests (yay!) to make our output closer to something that gives the user more information about the filenames in that group. This is only rough pseudo-code meant to inspire you; it is not meant to be copy-pasted without considering what is going on, and you will need to carefully modify the code and think about where to place modules/imports, etc.:

import os

import darshan
from darshan.log_utils import get_log_path
# incorporate your module into darshan properly
# instead of this:
import glob_feature

import pandas as pd
from pandas.testing import assert_frame_equal
import pytest


@pytest.mark.parametrize("log_name, expected_df", [
     # grow this with more logs...
     ("e3sm_io_heatmap_only.darshan",
      pd.DataFrame({"filename_glob":
                    # NOTE: usage of \d digit metacharacter in regex pattern
                    # and preservation of all text that is identical in the glob
                    # group instead of using `.*` everywhere, to convey
                    # the filepath information with sufficient granularity
                    [r"/projects/radix-io/snyder/e3sm/can_I_out_h\d.nc",
                    # NOTE: for the single file group, match the entire path,
                    # no point in hiding the file name...
                     "/projects/radix-io/E3SM-IO-inputs/i_case_1344p.nc"],
                    "glob_count": [2, 1]})),
])
def test_glob_tables(tmpdir, log_name, expected_df):
    # test the glob table HTML outputs for various
    # log files in the logs repo (and new log files
    # that you creatively design yourself)
    log_path = get_log_path(log_name)
    with tmpdir.as_cwd():
        cwd = os.getcwd()
        # TODO: you shouldn't have a hardcoded HTML filename
        # like this...
        outfile = os.path.join(cwd, "name_record_glob_hd5f.html")
        glob_feature.main(log_path, outfile)
        actual_table = pd.read_html(outfile)[0]
        assert_frame_equal(actual_table, expected_df)

You'll want to study that test structure until you understand how I'm enforcing the table output that I want from your code. Of course, if we run it now, it will fail as expected because of the regex issues described above:

pytest test_globber.py

E   AssertionError: DataFrame.iloc[:, 0] (column name="filename_glob") are different
E   
E   DataFrame.iloc[:, 0] (column name="filename_glob") values are different (100.0 %)
E   [index]: [0, 1]
E   [left]:  [/projects/radix-io/snyder/e3sm/.*, /projects/radix-io/E3SM-IO-inputs/.*]
E   [right]: [/projects/radix-io/snyder/e3sm/can_I_out_h\d.nc, /projects/radix-io/E3SM-IO-inputs/i_case_1344p.nc]
E   At positional index 0, first diff: /projects/radix-io/snyder/e3sm/.* != /projects/radix-io/snyder/e3sm/can_I_out_h\d.nc

pandas/_libs/testing.pyx:172: AssertionError

You can see it complaining about the mismatch between your regexes/glob patterns and my desired output--great! Now you have a programmatic way to enforce correctness, and you can expand it to more log files as you improve the code so that you don't regress, which helps prevent you from "overfitting" to the properties of a single log file.

So, to summarize a step-by-step procedure to move the effort forward:

  • write a simple regression test for a single/simple log file, and then adjust your code to make it pass, and try not to cheat--e.g., use difflib or other creative/interesting ideas to compare the strings and programmatically generate the specific regex where the group members differ (see the sketch after this list)
  • when that initial test passes, commit your code so you don't lose your progress, then expand the test (look into pytest.parametrize docs, use git grep on the codebase, etc.) to include another log file in the darshan-logs repo
  • continue with this cycle until you have tests passing for the desired output for each of the log files in the logs repo
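
To make the difflib idea concrete, here is a rough sketch (a hypothetical helper, not code from this branch) of how SequenceMatcher can preserve flanking text and emit a specific metacharacter only where the group members differ:

import difflib
import re


def regex_from_pair(path_a, path_b):
    # keep the text both paths share; condense each differing region
    matcher = difflib.SequenceMatcher(None, path_a, path_b)
    pattern = ""
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            pattern += re.escape(path_a[i1:i2])
        elif path_a[i1:i2].isdigit() and path_b[j1:j2].isdigit():
            pattern += r"\d"  # both sides are digits, so \d is precise
        else:
            pattern += ".*"  # fall back to the broad wildcard
    return pattern


print(regex_from_pair("/projects/radix-io/snyder/e3sm/can_I_out_h0.nc",
                      "/projects/radix-io/snyder/e3sm/can_I_out_h1.nc"))
# keeps the flanking text and condenses only the digit, roughly:
# /projects/radix\-io/snyder/e3sm/can_I_out_h\d\.nc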

Some broader advice:

  • read the difflib docs carefully, there's a lot of power there
  • read the original algorithm paper published in the late 1980s by Ratcliff and Obershelp, on which difflib is built, or at least the Wikipedia article: https://en.wikipedia.org/wiki/Gestalt_pattern_matching
  • look for other papers that cite that technique on, e.g., Google Scholar (a bit like citation-map searching) to find other recent ideas/modern techniques to consider as well
  • git_project/glob_feature/glob_feature.py is a bit of a silly place to put your module long-term; it is not even in a Python project at the moment, so you'll likely have trouble following some of my testing designs until you git grep around the code base a little and find a nice place to move it to (along with the testing module(s) you write) -- the early parts of this doc and the links therein may help: https://packaging.python.org/en/latest/tutorials/packaging-projects/
  • start exploring other clustering algorithms with internet searches: https://stats.stackexchange.com/questions/123060/clustering-a-long-list-of-strings-words-into-similarity-groups, https://stackoverflow.com/questions/69006583/how-to-kmeans-cluster-strings, and look into some of the Python machine learning libraries like https://scikit-learn.org/stable/modules/clustering.html to see if they help give a "second opinion" in the complex cases that difflib can't handle (I don't think the example I initially provided is complex enough for that, but other cases may be)
  • if you manage to get tests passing for all logs in the logs repo, with expected values having nice/reasonably specific regexes (not .* everywhere), then you can really start with the crazy tests like writing your own mpi4py scripts that write random filepaths/unrelated filepaths/synthetically related filepaths (re: our initial exercises writing and monitoring those scripts with darshan-runtime)

That should be enough to keep you busy while I'm away next week (I certainly couldn't do that in a week!), and perhaps the Argonne folks can help with the occasional simple question on, e.g., Slack or if pinged here.

html = style.to_html()

# can change name of the output html report here
with open("name_record_glob_hd5f.html", "w") as html_file:
Collaborator:

You actually can't change the name on the command line with --output-path anymore because you've hardcoded this, so you should decide if you want to keep that feature or automatically name the output html table, then write a test for it using pytest so it doesn't break in the future.
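
For instance, a minimal sketch of keeping --output-path while deriving a default name instead of hardcoding one (the "_globs.html" suffix is just an example):

import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("-p", "--log-path", required=True)
parser.add_argument("-o", "--output-path", default=None)
args = parser.parse_args()

# derive a default output name from the log instead of hardcoding one
output_path = args.output_path or (
    os.path.splitext(os.path.basename(args.log_path))[0] + "_globs.html")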

df = df.sort_values(by="glob_count", ascending=False)


def find_common_prefix(paths):
Collaborator:

There's no reason to nest this function inside of regex_df_condenser -- nesting makes it harder to test because the function isn't in the module namespace anymore, and it is also just harder to read.




def main(log_path, output_path):
Collaborator:

the output_path argument isn't used in this function anymore



def regex_df_condenser(df, paths):
path_grouper_func = make_path_grouper()
Collaborator:

This function seems to discard the filenames, so you may need to preserve them in another data structure, e.g., df_orig = df.copy(), and then use difflib to also compare the filenames later below (difflib really is great, but it requires quite some effort to get it to do exactly what you want).

This function is also pretty confusing--I think you should perhaps write it out on paper or the whiteboard first and then add comments to each section explaining how it will get you to this (or similar) final output for e3sm_io_heatmap_only.darshan:

[image: desired glob table for e3sm_io_heatmap_only.darshan]

Also, now that I look at it, perhaps it is a bit hard to see the regex metacharacter \d in the path, so perhaps it may make sense to underline/highlight it somehow, though that's more of an aesthetic concern for a bit later.

…r e3sm_io_heatmap_only.darshan log file. Also, relocated the script to a more suitable spot among other Python files in the project.
@shanedsnyder (Contributor)

Agreed on Tyler's feedback, specifically:

  • globs of a single file should really just be the full filename
  • the other glob in this case ("/projects/radix-io/snyder/e3sm/can_I_out_h0.nc", "/projects/radix-io/snyder/e3sm/can_I_out_h1.nc") could be condensed into something like "/projects/radix-io/snyder/e3sm/can_I_out_h[.*].nc"

So, in the 2nd case there's probably some additional work needed to find out exactly where the "matching" globs differ, with us clearly marking where this difference occurs.

In my opinion, I don't think we need to be super elaborate in terms of how we indicate what the non-matching parts of a glob are. E.g., above I don't think we need to mark that the [.*] portion of the glob is a number (\d). My thought process there is just that it seems like it could produce some complicated regular expressions that might be confusing for users. I think the most helpful thing is just knowing exactly where the globs differ, not necessarily how. You could also offer a verbose option so users could have all of the filenames associated with a glob printed out, to see exactly what is included, or something like that.

I'll try to share some other logs I have with much more data (i.e., filenames to glob) that we could use to see how we handle more complicated use cases -- I have them, I just need to make them available in the darshan-logs repo.

@tylerjereddy (Collaborator)

I think the main additions I'd want to see near-term are:

  • addition of testing infrastructure so your new code is clearly passing/failing tests in the CI (continuous integration) here; part of that is getting a consensus with the Argonne folks on what the "expected HTML tables" should look like for a given file, which is probably best done by combining the testing code with some examples of HTML result tables from your code vs. the original raw paths from various logs
  • expanding that testing infrastructure to more and more log files, including cases where paths differ in multiple locations but are ultimately quite similar, to make sure the code is not "cheating"/over-fitted to simple cases (such pathological logs should be easy enough to generate using the mpi4py-style scripts we practiced a while back + a bit of creativity on the file paths/names chosen)

{"selector": "th", "props": [("border", "1px solid grey")]}
])

style = style.hide_index()
Collaborator:

As you pointed out on Slack, this can fail sometimes, probably depending on the pandas version in use. It looks like it was deprecated and removed--for more information on how you might solve this see: pandas-dev/pandas#43771

You may need to check the pandas version string and make a decision based on that.
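
For example, a minimal sketch of that decision, assuming style is the Styler being built:

# Styler.hide_index() was deprecated in pandas 1.4 and removed in 2.0;
# Styler.hide(axis="index") is the replacement
if hasattr(style, "hide"):
    style = style.hide(axis="index")
else:
    style = style.hide_index()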

Collaborator:

Sample traceback:

Traceback (most recent call last):
  File "/Users/treddy/github_projects/darshan/darshan-util/pydarshan/darshan/glob_feature/glob_feature.py", line 89, in <module>
    main(log_path=args.log_path, output_path=args.output_path)
  File "/Users/treddy/github_projects/darshan/darshan-util/pydarshan/darshan/glob_feature/glob_feature.py", line 77, in main
    style = style.hide_index()
            ^^^^^^^^^^^^^^^^
AttributeError: 'Styler' object has no attribute 'hide_index'. Did you mean: 'hide_index_'?

Collaborator:

Also, to_html() already has an argument to suppress the index, so I'm a bit confused why we'd even want to mess around with the styler at all here.

Studying the code base is often helpful, e.g., git grep -E -i "to_html" in the root of the repo will show some helpful examples.
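
For instance, assuming df is the final glob DataFrame:

# DataFrame.to_html can drop the index directly, no Styler needed
html = df.to_html(index=False)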

])

style = style.hide_index()
html = style.render()
Collaborator:

Why is render being used here? Isn't that a Jupyter notebook thing? I'm pretty sure to_html() is what we want. You may need to study the docs here a bit: https://pandas.pydata.org/docs/user_guide/style.html and https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_html.html

Collaborator:

For example, if I fix your hide_index issue above locally, I then get this traceback, which is another reason we'd want to see the tests running/passing before we start to iterate ourselves (to make sure it works on a basic level):

Traceback (most recent call last):
  File "/Users/treddy/github_projects/darshan/darshan-util/pydarshan/darshan/glob_feature/glob_feature.py", line 89, in <module>
    main(log_path=args.log_path, output_path=args.output_path)
  File "/Users/treddy/github_projects/darshan/darshan-util/pydarshan/darshan/glob_feature/glob_feature.py", line 78, in main
    html = style.render()
           ^^^^^^^^^^^^
AttributeError: 'Styler' object has no attribute 'render'. Did you mean: '_render'?

html = style.render()

with open(output_path, "w") as html_file:
    html_file.write(html)
Collaborator:

If I fix both of the above issues and then try to run your code via the Python command line route, I still get an error:

python glob_feature.py -p ~/github_projects/darshan-logs/darshan_logs/e3sm_io_heatmaps_and_dxt/e3sm_io_heatmap_only.darshan

Traceback (most recent call last):
  File "/Users/treddy/github_projects/darshan/darshan-util/pydarshan/darshan/glob_feature/glob_feature.py", line 90, in <module>
    main(log_path=args.log_path, output_path=args.output_path)
  File "/Users/treddy/github_projects/darshan/darshan-util/pydarshan/darshan/glob_feature/glob_feature.py", line 81, in main
    with open(output_path, "w") as html_file:
         ^^^^^^^^^^^^^^^^^^^^^^
TypeError: expected str, bytes or os.PathLike object, not NoneType

Some of the testing I mentioned a few weeks ago about handling the various output_path modalities seems to be missing? You'll want tests for the command line and module-based incantations to make sure they work as you iterate on your code.
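
A rough sketch of what such a test could look like (this assumes the module is runnable via python -m once it is packaged properly, and that a sensible default output name gets chosen when -o is omitted):

import os
import subprocess
import sys

import pytest

from darshan.log_utils import get_log_path


@pytest.mark.parametrize("use_output_flag", [True, False])
def test_cli_output_path(tmpdir, use_output_flag):
    # exercise both command line modalities so neither regresses
    log_path = get_log_path("e3sm_io_heatmap_only.darshan")
    with tmpdir.as_cwd():
        cmd = [sys.executable, "-m", "darshan.glob_feature.glob_feature",
               "-p", log_path]
        if use_output_flag:
            cmd += ["-o", "out.html"]
        subprocess.run(cmd, check=True)
        if use_output_flag:
            assert os.path.exists("out.html")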

@tylerjereddy (Collaborator)

If I hack around some of the bugs noted above, I can finally get your script to produce a table for e3sm_io_heatmap_only.darshan:

[image: glob table produced for e3sm_io_heatmap_only.darshan after the fixes]

That looks a bit closer to what we want, though I'm not sure why you decided to remove the heatmap-style coloring from the table?

The process for moving forward might then be to ask the Argonne folks for feedback on that table given the raw/original filenames:

16592106915301738621                                      heatmap:POSIX
7713204788897956166                                darshan-apmpi-header
2250748219237254214                                               APMPI
3748348667804457392   /projects/radix-io/E3SM-IO-inputs/i_case_1344p.nc
3668870418325792824                                       heatmap:MPIIO
14734109647742566553                                            <STDIN>
15920181672442173319                                           <STDOUT>
7238257241479193519                                            <STDERR>
3989511027826779520                                       heatmap:STDIO
6966057185861764086      /projects/radix-io/snyder/e3sm/can_I_out_h0.nc
12386309978061520508     /projects/radix-io/snyder/e3sm/can_I_out_h1.nc

From Shane's feedback above, it sounds like we might want to switch \d to the more general [.*] or whatever to keep it simple for users. So you'd adjust your pytest test based on Shane's feedback, then make it pass again by adjusting your code accordingly. Then move on to the next log file, and so on, trying to reach consensus on the expected results as you go, and enforcing that the tests still pass for the old and new logs as you go, which should force the code to generalize more and more as the test cases get bigger/trickier.

@tylerjereddy (Collaborator)

I'll add a few review comments now that Yaris has pushed up a first attempt at working with pytest and the package more broadly.

There are a few things to address here:

  1. There's no good reason to have two copies of the glob_feature.py module in the same branch/pull request--they aren't even the same version of the code, which is extra confusing. Which location is correct? Well, there are currently no Python library modules (only setup.py, which is a special install file) in the darshan project root, while every other Python library module resides at a deeper level of nesting:
  • cd darshan-util
  • tree pydarshan -P "*.py"
pydarshan
├── benchmarks
│   └── benchmarks
│       ├── __init__.py
│       ├── dxt_heatmap.py
│       ├── log_utils.py
│       └── report.py
├── darshan
│   ├── __init__.py
│   ├── __main__.py
│   ├── backend
│   │   ├── __init__.py
│   │   ├── api_def_c.py
│   │   └── cffi_backend.py
│   ├── cli
│   │   ├── __init__.py
│   │   ├── __main__.py
│   │   ├── info.py
│   │   ├── name_records.py
│   │   ├── summary.py
│   │   └── to_json.py
│   ├── datatypes
│   │   ├── __init__.py
│   │   └── heatmap.py
│   ├── discover_darshan.py
│   ├── examples
│   │   ├── __init__.py
│   │   ├── darshan-graph
│   │   ├── example_logs
│   │   │   └── __init__.py
│   │   └── tutorial
│   │       ├── hello.py
│   │       ├── plot.py
│   │       └── tojson.py
│   ├── experimental
│   │   ├── __init__.py
│   │   ├── aggregators
│   │   │   ├── __init__.py
│   │   │   ├── agg_ioops.py
│   │   │   ├── create_dxttimeline.py
│   │   │   ├── create_sankey.py
│   │   │   ├── create_time_summary.py
│   │   │   ├── create_timeline.py
│   │   │   ├── mod_agg_iohist.py
│   │   │   ├── name_records_summary.py
│   │   │   ├── print_module_records.py
│   │   │   ├── records_as_dict.py
│   │   │   └── summarize.py
│   │   ├── operations
│   │   │   ├── filter.py
│   │   │   ├── merge.py
│   │   │   └── reduce.py
│   │   ├── plots
│   │   │   ├── __init__.py
│   │   │   ├── data_access_by_filesystem.py
│   │   │   ├── heatmap_handling.py
│   │   │   ├── plot_access_histogram.py
│   │   │   ├── plot_common_access_table.py
│   │   │   ├── plot_dxt_heatmap.py
│   │   │   ├── plot_dxt_heatmap2.py
│   │   │   ├── plot_io_cost.py
│   │   │   ├── plot_opcounts.py
│   │   │   └── plot_posix_access_pattern.py
│   │   └── transforms
│   │       └── dxt2png.py
│   ├── lib
│   │   └── accum.py
│   ├── log_utils.py
│   ├── report.py
│   └── tests
│       ├── __init__.py
│       ├── conftest.py
│       ├── input
│       │   └── __init__.py
│       ├── test_cffi_misc.py
│       ├── test_data_access_by_filesystem.py
│       ├── test_error.py
│       ├── test_heatmap_handling.py
│       ├── test_lib_accum.py
│       ├── test_log_utils.py
│       ├── test_moddxt.py
│       ├── test_modmpiio.py
│       ├── test_modposix.py
│       ├── test_modstdio.py
│       ├── test_plot_common_access_table.py
│       ├── test_plot_dxt_heatmap.py
│       ├── test_plot_exp_common.py
│       ├── test_plot_io_cost.py
│       ├── test_report.py
│       ├── test_report_copy.py
│       └── test_summary.py
├── devel
├── docs
│   └── conf.py
└── setup.py

So, the module at path darshan-util/pydarshan/darshan/glob_feature/glob_feature.py seems like a more sensible choice. Let's try making that change on this branch and then running the test you added:

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
	deleted:    glob_feature.py

python -m pytest --pyargs darshan -k "test_glob_tables"

Now we get the same failure that you can see in the CI after you pushed up. When you see failures in the CI like that, it usually means there's a problem with your pull request/code:

============================================================================================= ERRORS ==============================================================================================
___________________________________________________________________________ ERROR collecting tests/test_glob_feature.py ___________________________________________________________________________
/Users/treddy/github_projects/darshan/darshan-util/pydarshan/darshan/tests/test_glob_feature.py:9: in <module>
    print(sys.path)  # Print sys.path again
E   NameError: name 'sys' is not defined
----------------------------------------------------------------------------------------- Captured stdout -----------------------------------------------------------------------------------------
2.0.1
===================================================================================== short test summary info =====================================================================================
ERROR tests/test_glob_feature.py - NameError: name 'sys' is not defined
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
================================================================================ 532 deselected, 1 error in 0.67s =================================================================================

Ok, so let's fix that error and try again:

(py_311_darshan_dev) treddy@pn2305118 pydarshan % git diff
diff --git a/darshan-util/pydarshan/darshan/tests/test_glob_feature.py b/darshan-util/pydarshan/darshan/tests/test_glob_feature.py
index 8a779857..b26cf309 100644
--- a/darshan-util/pydarshan/darshan/tests/test_glob_feature.py
+++ b/darshan-util/pydarshan/darshan/tests/test_glob_feature.py
@@ -6,10 +6,8 @@ print(pd.__version__)
 from pandas.testing import assert_frame_equal
 import pytest
 import re 
-print(sys.path)  # Print sys.path again
 import glob_feature
 
-print("hello")
 @pytest.mark.parametrize("log_name, expected_df", [
      # grow this with more logs...
      ("e3sm_io_heatmap_only.darshan",

Now, we run the test again and get another failure--it looks like you're still having issues with Python packaging because the import doesn't work:

___________________________________________________________________________ ERROR collecting tests/test_glob_feature.py ___________________________________________________________________________
ImportError while importing test module '/Users/treddy/github_projects/darshan/darshan-util/pydarshan/darshan/tests/test_glob_feature.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/opt/homebrew/Cellar/python@3.11/3.11.3/Frameworks/Python.framework/Versions/3.11/lib/python3.11/importlib/__init__.py:126: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
/Users/treddy/github_projects/darshan/darshan-util/pydarshan/darshan/tests/test_glob_feature.py:9: in <module>
    import glob_feature
E   ModuleNotFoundError: No module named 'glob_feature'

Well, the import works for every single other module under test in the project, so why not just study one of them as an example?

tree darshan/experimental -P "*init*.py"

darshan/experimental
├── __init__.py
├── __pycache__
├── aggregators
│   ├── __init__.py
│   └── __pycache__
├── operations
│   └── __pycache__
├── plots
│   ├── __init__.py
│   └── __pycache__
└── transforms

It looks like those __init__.py files might be useful, like I mentioned a few times in our discussions on Slack. I don't see one in your new path:

tree darshan/glob_feature

darshan/glob_feature
└── glob_feature.py

Ok, let's add that in, so:

new file:   darshan/glob_feature/__init__.py

Try the testing again, still get ModuleNotFoundError: No module named 'glob_feature'. Well, that makes sense, glob_feature isn't some external package, it is in our darshan package now that you've made the packaging changes to move it in, so let's adjust the import based on what we do for other modules in our package:

git grep -E -i "import heatmap"

from darshan.experimental.plots import heatmap_handling

So, for us, maybe something like:

--- a/darshan-util/pydarshan/darshan/tests/test_glob_feature.py
+++ b/darshan-util/pydarshan/darshan/tests/test_glob_feature.py
@@ -6,7 +6,7 @@ print(pd.__version__)
 from pandas.testing import assert_frame_equal
 import pytest
 import re 
-import glob_feature
+from darshan.glob_feature import glob_feature
 
 @pytest.mark.parametrize("log_name, expected_df", [
      # grow this with more logs...

And now the test will actually run--of course it fails, and we've already discussed this failure before:

>       style = style.hide_index()
E       AttributeError: 'Styler' object has no attribute 'hide_index'

Part of the confusion may be that you had two copies of the module that weren't synced (that's always a terrible idea!), or maybe you copy-pasted something without genuinely understanding it, but in any case, this should help you move forward quite a bit, since if your local env is set up properly, the collection of changes above should allow you to run the test with:

python -m pytest --pyargs darshan -k "test_glob_tables"

There are quite a few other issues in the code, and one is so major that I'll add an inline comment right away, but the others I'm hoping you'll study carefully and clean up a bit.

outfile = os.path.join(cwd, "name_record_glob_hd5f.html")
glob_feature.main(log_path, outfile)
actual_table = pd.read_html(outfile)[0]
actual_table.drop("Unnamed: 0", axis=1, inplace=True) # Drop the "Unnamed: 0" column
Collaborator:

Why are you doing this? If there's a problem in the actual table, there's a problem in your actual code, and you should fix the code, not conceal the problem by mutating the test.

Collaborator:

My guess would be the index column, perhaps; I'm not convinced you have a full understanding of the HTML conversion process vs. pandas versions just yet, so maybe look into that a bit more.

@tylerjereddy (Collaborator)

tylerjereddy commented Jun 14, 2023

I'll wait for another round of cleanups before commenting again--there are a lot of changes I suggested, and having two separate copies of the new code module in the same branch is basically insanity to review, so I won't try to guess what you're doing there. But hopefully I've given you a boost here at least...

@tylerjereddy (Collaborator)

Possible checklist for updating the PR might be something like:

  • make sure the full test suite passes for you locally, including your new tests
  • make sure you've carefully proofread your code with i.e., git show or other appropriate git commands to avoid accidental commits/typos, etc.
  • make sure the various command-line incantations of your module also work:
    • python glob_feature.py -p path/to/log.darshan
    • python glob_feature.py -p path/to/log.darshan -o output.html
  • Add a comment, in this PR, summarizing what has changed since your last update, what has improved, what still has problems, what you tried, what you need help with, etc.

@yariseidenbenz (Author)

Changes:
I got rid of the glob_feature.py script located in /vast/home/yaris/darshan/darshan-util/pydarshan/darshan.

So now the only glob_feature.py script is located in darshan/darshan-util/pydarshan/darshan/glob_feature.

I fixed the styling issues

  • back to
    html = style.to_html()
  • This method aligns the text in the glob_count column to the right
    style = style.set_properties(subset=["glob_count"], **{"text-align": "right"}).
  • Reapplied gmap which creates a nice color scheme for both filename_glob and glob_count
Screenshot 2023-06-19 at 10 59 40 AM

I also added a few new lines of code which iterate over the list of paths within a group, find the differences, and replace those differences with [.*]. These new paths are then represented within the data frame.

Here are some examples:

This is the test for the log file apmpi/darshan-apmpi-2nodes-64mpi.darshan
Screenshot 2023-06-19 at 11 25 41 AM

This is the test for the log file hdf5_diagonal_write_only/hdf5_diagonal_write_1_byte_dxt.darshan
Screenshot 2023-06-19 at 11 28 53 AM

This is the test for the log file partial_data_stdio/partial_data_stdio.darshan
Screenshot 2023-06-19 at 12 34 28 PM

Questions:
The paths are correct and condensed, but emitting [.*] for each differing character makes the output long. Should I restrict the number of uses of [.*]?

The ratio() method in the difflib library is what is used to group file paths together, depending on whether they are at least 80% similar. This method does work but might not be the best way to do it: file paths might be very similar up to the 80% ratio but have big differences afterwards. The matcher.ratio() method could be considered a weakness because it is not programmatic enough.
Here is an example with the log file partial_data_stdio/partial_data_stdio.darshan
Without condensing
Screenshot 2023-06-19 at 12 58 19 PM
With condensing
Screenshot 2023-06-19 at 12 34 28 PM

As you can see, the file paths are the same up until the last directory. With the current glob_feature.py script it would group everything together because of the similarity score and then add [.*] for each differing character. However, in this test case it would probably be better to have two file paths displayed:
/home/carns/working/dbg/darshan-examples/test.out
/home/carns/working/dbg/darshan-examples/foo[.*][.*][.*]

Is what I have currently acceptable or would you want the output above?
You gave me some resources on the ndiff method in the difflib library, which might help solve the issue, and there are also machine learning clustering algorithms we could use.
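
To make the weakness concrete, here is a condensed sketch of the grouping idea (hypothetical filenames; the real script groups the paths parsed from the log):

import difflib


def group_by_ratio(paths, threshold=0.8):
    # greedily place each path into the first group whose representative
    # is at least `threshold` similar
    groups = []
    for path in paths:
        for group in groups:
            if difflib.SequenceMatcher(None, group[0], path).ratio() >= threshold:
                group.append(path)
                break
        else:
            groups.append([path])
    return groups


paths = ["/home/carns/working/dbg/darshan-examples/test.out",
         "/home/carns/working/dbg/darshan-examples/foo61.txt",
         "/home/carns/working/dbg/darshan-examples/foo62.txt"]
print(group_by_ratio(paths))
# the long shared directory prefix pushes the ratio over 0.8, so test.out
# lands in the same group as the unrelated foo* files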

Should I restrict the count of HDF5 files?
These are the hierarchical files that usually look something like
[image: example of hierarchical HDF5 dataset paths]
They usually end in a dataset name, such as in the case above. In cases where there are many HDF5 dataset records, should I restrict their count, because from my understanding aren't they technically one file?
Like if there is
/path/to/file/Dataset:000
/path/to/file/Dataset:001
Should the glob_count for both of these entries be one because they are the same file, or should they both have a combined glob_count of two?

To do:

  • Maybe restrict the usage of [.*] for more readable output
  • Maybe restrict the count of hierarchical files
  • Look into ndiff method
  • Try to find alternative to current ratio method being used in difflib

@tylerjereddy (Collaborator)

tylerjereddy commented Jun 19, 2023

Looks like a step in the right direction now.

Should I restrict the number of uses of [.*]?

Yes, you should only have one [.*] per region that has a difference, irrespective of the size of the difference, but all flanking matches should be explicit and not replaced--I'm pretty sure that is what Shane meant. That's also the whole point of our meetings: to make sure you are 100% certain what the team wants. If it isn't 100% clear what you should do for a given input file (and some of the files above are repeated from the meeting), you should try to clear that up in the meetings, but doing it here is the second-best way at least!

partial_data_stdio.darshan should probably have: /home/carns/working/dbg/darshan-examples/foo[.*] and /home/carns/working/dbg/darshan-examples/test.out. However, if all the foo.. files actually start with 6, then foo6[.*] should probably be used to show the exact point where they differ vs. the same.

As for algorithms, I only sent you ndiff as an example; the point is to carefully compare the various combinations of strings with a higher degree of granularity than just the similarity ratio. I don't see the pytest testing expanded for the new files you've discussed above, so maybe start to expand the parametrized test for those cases.
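
To be concrete, here is a rough sketch of that flanking-match idea (hypothetical filenames; real code would also need to handle patterns that already contain [.*]):

import difflib


def condense_group(group):
    # keep flanking matches explicit; collapse each differing region
    # into a single [.*]
    pattern = group[0]
    for path in group[1:]:
        matcher = difflib.SequenceMatcher(None, pattern, path)
        merged = ""
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op == "equal":
                merged += pattern[i1:i2]
            elif not merged.endswith("[.*]"):
                merged += "[.*]"
        pattern = merged
    return pattern


print(condense_group(["/home/carns/working/dbg/darshan-examples/foo61.txt",
                      "/home/carns/working/dbg/darshan-examples/foo62.txt"]))
# -> /home/carns/working/dbg/darshan-examples/foo6[.*].txt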

Also, this is still broken on your branch, so you should try to fix it and add a regression test using pytest similar to what we discussed the other day for the main HTML reports that also get tested for their command line incantation:

python glob_feature.py -p ~/github_projects/darshan-logs/darshan_logs/e3sm_io_heatmaps_and_dxt/e3sm_io_heatmap_only.darshan

Traceback (most recent call last):
  File "/Users/treddy/github_projects/darshan/darshan-util/pydarshan/darshan/glob_feature/glob_feature.py", line 95, in <module>
    main(log_path=args.log_path, output_path=args.output_path)
  File "/Users/treddy/github_projects/darshan/darshan-util/pydarshan/darshan/glob_feature/glob_feature.py", line 86, in main
    with open(output_path, "w") as html_file:
         ^^^^^^^^^^^^^^^^^^^^^^
TypeError: expected str, bytes or os.PathLike object, not NoneType

The incantation with -o does work at least, so that's good, though you should add a test for that incantation as well, to make sure it doesn't break as you make changes.

Finally, the CI is failing for this PR, so you should try to add in the appropriate dependency for now, so that we can see your tests passing or failing appropriately.

@tylerjereddy (Collaborator)

In case it wasn't clear from the previous meeting, here is some guidance on the approach to testing/vision for moving forward:

  1. It makes sense to continue expanding test_glob_tables with more cases from darshan-logs, which you already appear to be working on, making sure there's a reasonable consensus on the expected values from the team.
  2. It also makes sense to come up with your own test cases, for example drawing out your own file tree and then testing for the appropriate clustering/globbing behavior on those data structures. This is likely to be a separate test, since the input is the data structure that gets extracted from the log rather than the log proper, as discussed at the last meeting. The advantage is that you don't have to generate logs this way, so darshan-logs doesn't fill up too much just for this one thing. Creativity in designing these test cases may mean that you have filepath trees that differ in 1, 2, 3, 4 different locations (start, middle, end of file paths), or have tons of files that aren't related, etc.
  3. Once you've got the above tests, with agreement from the team on expected values, encoded into pytest test cases, you can start to worry about the algorithm that will accomplish the clustering/globbing task. If you genuinely think difflib and the Python standard library won't suffice, it will be important to clearly communicate why that is, and exactly what capability you'll need. As it stands, it looks to me like a combination of difflib and careful manual algorithm design may do the trick, but if there is a clear and concise explanation of why the plethora of capabilities in difflib and the Python standard library don't suffice, and you can show the tests passing for the large number of diverse cases described above, I suspect we might be swayed to deviate from the standard library at that point.

At the end of the project, even if the final product from the summer is basically just a bunch of agreed-upon but very well crafted test cases, that's already considerable value delivered, since the task of identifying exactly what folks want is a big part of the challenge of a software project, and encoding that in tests will make it easier to move forward.

… uses agglomerative hierarchical clustering for grouping. The test_glob_feature.py module was expanded to more log files. The dependencies for these scripts were added to main_ci.yml.
@github-actions github-actions bot added the CI continuous integration label Jul 24, 2023
@yariseidenbenz (Author)

Description of Changes:
I have made significant improvements to the file path grouping process by adopting agglomerative hierarchical clustering from the scikit-learn library. This method is more efficient than the previous difflib sequence matcher approach. Moreover, the number of clusters is no longer fixed; it is now determined dynamically using the silhouette method. This dynamic clustering capability allows the script to handle various test cases of different sizes more effectively.

Additionally, I have expanded the test_glob_feature.py to include more log files from the darshan log repository. This expansion ensures a more comprehensive testing process, covering a broader range of scenarios and data.

I've added the necessary dependencies to the main_ci.yml.
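
Here is the shape of the clustering step (a simplified sketch; the exact vectorizer parameters in the script may differ, and it assumes at least three paths):

from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score


def cluster_paths(paths):
    # character n-grams place lexically similar paths close together
    vectors = TfidfVectorizer(analyzer="char", ngram_range=(2, 4)).fit_transform(paths)
    dense = vectors.toarray()
    best_labels, best_score = None, -1.0
    # choose the number of clusters dynamically via the silhouette score
    for n_clusters in range(2, min(len(paths), 10)):
        labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(dense)
        score = silhouette_score(dense, labels)
        if score > best_score:
            best_labels, best_score = labels, score
    return best_labels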

Next Steps:
Create a verbose option to display all the file paths if the user wishes to.

Moreover, I am currently exploring ways to display more commonalities within the representing file path. While the current implementation effectively finds differences and replaces them with [.*], there is room for improvement in handling commonalities at the end of file paths. For example, if multiple file paths have common elements at their endings but different positions, these commonalities are currently not accounted for in the representing file path. To address this, I am considering indexing from the back to the front and concatenating the common elements to the merged_path, which would create a more informative and accurate representing file path.

Here is an example of the issue:
Say this is a group:
path/to/file000.nc
path/to/file9999000.nc
path/to/file8888888000.nc

The representing filepath would be
path/to/file[.*].nc glob count: 3

Even though the file paths all contain 000 in the last directory.
To me it would make sense to make the ideal representing path: path/to/file[.*]000.nc glob count: 3

Any insight on how to approach this problem?

Here are the lines of code that deal with replacing differing characters within a group with [.*] and with finding whether the paths in a group share a common file extension (these lines create the representing file path for a group):

    new_paths = []
    for _, group in grouped_paths.items():
        if len(group) > 1:
            merged_path = ""
            max_length = max(len(path) for path in group)
            # True while we are inside a run of differing columns, so each
            # differing region contributes exactly one "[.*]" placeholder
            in_differing_run = False

            for i in range(max_length):
                # the set of characters at column i ("" once a path has ended)
                chars = set(path[i] if len(path) > i else "" for path in group)
                if len(chars) == 1:
                    # every path agrees at this column: keep the character
                    merged_path += chars.pop()
                    in_differing_run = False
                elif not in_differing_run:
                    # entering a differing region: emit a single placeholder
                    merged_path += "[.*]"
                    in_differing_run = True

            # check whether all paths share the same file extension
            extensions = [os.path.splitext(path)[1] for path in group]
            common_extension = None
            if len(set(extensions)) == 1:
                common_extension = extensions[0]

            # append the common extension if it exists and it's not already
            # in the merged path
            if common_extension and common_extension not in merged_path:
                merged_path += common_extension

            new_paths.append((merged_path, len(group)))
        else:
            new_paths.append((group[0], 1))
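
For the suffix question above, here is a rough sketch of the back-to-front idea (a hypothetical helper, not yet in the script):

import os


def merge_with_suffix(group):
    # anchor on the common prefix AND the common suffix so that shared
    # endings like "000.nc" survive in the representing path
    prefix = os.path.commonprefix(group)
    suffix = os.path.commonprefix([p[::-1] for p in group])[::-1]
    # avoid the prefix and suffix overlapping on the shortest path
    overlap = len(prefix) + len(suffix) - min(len(p) for p in group)
    if overlap > 0:
        suffix = suffix[overlap:]
    return prefix + "[.*]" + suffix


print(merge_with_suffix(["path/to/file000.nc",
                         "path/to/file9999000.nc",
                         "path/to/file8888888000.nc"]))
# -> path/to/file[.*]000.nc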

print(expected_df)

# Compare the two DataFrames
diff = actual_table['filename_glob'].compare(expected_df['filename_glob'])
Collaborator:

This line is causing problems--I see this locally:

            # Compare the two DataFrames
>           diff = actual_table['filename_glob'].compare(expected_df['filename_glob'])

/Users/treddy/github_projects/darshan/darshan-util/pydarshan/darshan/tests/test_glob_feature.py:301: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/Users/treddy/python_venvs/py_311_darshan_dev/lib/python3.11/site-packages/pandas/core/series.py:3185: in compare
    return super().compare(
/Users/treddy/python_venvs/py_311_darshan_dev/lib/python3.11/site-packages/pandas/core/generic.py:9212: in compare
    mask = ~((self == other) | (self.isna() & other.isna()))
/Users/treddy/python_venvs/py_311_darshan_dev/lib/python3.11/site-packages/pandas/core/ops/common.py:81: in new_method
    return method(self, other)
/Users/treddy/python_venvs/py_311_darshan_dev/lib/python3.11/site-packages/pandas/core/arraylike.py:40: in __eq__
    return self._cmp_method(other, operator.eq)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = 0              /home/shane/software/ior/build/testFile
1    /home/shane/software/ior/build/testFile:/Datas...
Name: filename_glob, dtype: object
other = 0    /home/shane/software/ior/build/testFile[.*]
Name: filename_glob, dtype: object, op = <built-in function eq>

    def _cmp_method(self, other, op):
        res_name = ops.get_op_result_name(self, other)
    
        if isinstance(other, Series) and not self._indexed_same(other):
>           raise ValueError("Can only compare identically-labeled Series objects")
E           ValueError: Can only compare identically-labeled Series objects

/Users/treddy/python_venvs/py_311_darshan_dev/lib/python3.11/site-packages/pandas/core/series.py:6090: ValueError

Collaborator:

There's really no reason to have this line present anymore; the assert_frame_equal should suffice.

Collaborator:

If I delete that stuff, then I see a bunch of test failures like this:

FAILED tests/test_glob_feature.py::test_glob_tables[hdf5_diagonal_write_bytes_range_dxt.darshan-expected_df6] - AssertionError: DataFrame are different
FAILED tests/test_glob_feature.py::test_glob_tables[runtime_and_dxt_heatmaps_diagonal_write_only.darshan-expected_df20] - AssertionError: DataFrame are different
FAILED tests/test_glob_feature.py::test_glob_tables[hdf5_diagonal_write_half_flush_dxt.darshan-expected_df7] - AssertionError: DataFrame are different
FAILED tests/test_glob_feature.py::test_glob_tables[treddy_runtime_heatmap_inactive_ranks.darshan-expected_df21] - ValueError: empty vocabulary; perhaps the documents only contain stop words
FAILED tests/test_glob_feature.py::test_glob_tables[hdf5_diagonal_write_half_ranks_dxt.darshan-expected_df8] - AssertionError: DataFrame are different
FAILED tests/test_glob_feature.py::test_glob_tables[hdf5_file_opens_only.darshan-expected_df9] - AssertionError: DataFrame are different
FAILED tests/test_glob_feature.py::test_glob_tables[darshan-apmpi-2nodes-64mpi.darshan-expected_df2] - AssertionError: DataFrame are different
FAILED tests/test_glob_feature.py::test_glob_tables[hdf5_diagonal_write_1_byte_dxt.darshan-expected_df5] - AssertionError: DataFrame are different
FAILED tests/test_glob_feature.py::test_glob_tables[imbalanced-io.darshan-expected_df11] - AssertionError: DataFrame are different
FAILED tests/test_glob_feature.py::test_glob_tables[shane_ior-HDF5_id438090-438090_11-9-41522-17417065676046418211_1.darshan-expected_df12] - AssertionError: DataFrame are different
FAILED tests/test_glob_feature.py::test_glob_tables[shane_ior-PNETCDF_id438100-438100_11-9-41525-10280033558448664385_1.darshan-expected_df13] - AssertionError: DataFrame are different

"glob_count": [2, 1]})),


("hdf5_diagonal_write_1_byte_dxt.darshan",
Collaborator:

Taking this log file as an example of something you might want to get feedback from the Argonne team on.

The original DataFrame seems to be this:

6329959920961583796                                  /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/__init__.py
12081676647536809530         /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/__pycache__/__init__.cpython-38.pyc
9019270385924777942       /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_errors.cpython-38-x86_64-linux-gnu.so
17233738746347876168                                  /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/version.py
10563276375451711199          /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/__pycache__/version.cpython-38.pyc
16409530009729583527           /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5.cpython-38-x86_64-linux-gnu.so
13248362266925776299         /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/defs.cpython-38-x86_64-linux-gnu.so
4638787767486382508      /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_objects.cpython-38-x86_64-linux-gnu.so
10170641380995794425                            /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5py_warnings.py
8697898997669214892     /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/__pycache__/h5py_warnings.cpython-38.pyc
7051742911845937566         /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_conv.cpython-38-x86_64-linux-gnu.so
9236461081719577833           /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5r.cpython-38-x86_64-linux-gnu.so
2181399625010544346           /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5t.cpython-38-x86_64-linux-gnu.so
6528078802042034082           /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5p.cpython-38-x86_64-linux-gnu.so
2575676175211135484           /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5s.cpython-38-x86_64-linux-gnu.so
11727641573666397676        /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/utils.cpython-38-x86_64-linux-gnu.so
17901672647343149144         /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5ac.cpython-38-x86_64-linux-gnu.so
9518436409801048664                                             /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/logging/__init__.py
2863652681302921111                     /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/logging/__pycache__/__init__.cpython-38.pyc
2204154745069321427                                                    /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/traceback.py
555215874445445411                             /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/__pycache__/traceback.cpython-38.pyc
3096762644015249429                                                       /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/string.py
3433981731425580030                               /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/__pycache__/string.cpython-38.pyc
6622959456894657244           /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5z.cpython-38-x86_64-linux-gnu.so
16198910964451933166          /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5a.cpython-38-x86_64-linux-gnu.so
12844965888207067364       /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_proxy.cpython-38-x86_64-linux-gnu.so
8320392657652582313           /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5d.cpython-38-x86_64-linux-gnu.so
2182783899281604896          /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5ds.cpython-38-x86_64-linux-gnu.so
6824284610368689248           /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5f.cpython-38-x86_64-linux-gnu.so
2255611827735696210           /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5g.cpython-38-x86_64-linux-gnu.so
7369759469852332435           /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5i.cpython-38-x86_64-linux-gnu.so
3786189409375561029          /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5fd.cpython-38-x86_64-linux-gnu.so
9453911172532451432          /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5pl.cpython-38-x86_64-linux-gnu.so
7281569168338684097                              /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/__init__.py
6456765083458423379      /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/__pycache__/__init__.cpython-38.pyc
1619390675886412049                               /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/filters.py
6004586721157325469       /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/__pycache__/filters.cpython-38.pyc
7590428293607185835                                /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/compat.py
941427344463918395         /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/__pycache__/compat.cpython-38.pyc
7880173344308596059                                  /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/base.py
412714883876212763           /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/__pycache__/base.cpython-38.pyc
8629723084015159315                                 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/files.py
1259607287293063192         /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/__pycache__/files.cpython-38.pyc
1754798185250444829                                 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/group.py
14745068672433573323        /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/__pycache__/group.cpython-38.pyc
8305533062347777536           /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5o.cpython-38-x86_64-linux-gnu.so
15958169687914591130          /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5l.cpython-38-x86_64-linux-gnu.so
11692197895731416386                              /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/dataset.py
18374151976672635155      /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/__pycache__/dataset.cpython-38.pyc
5947139205337478310     /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_selector.cpython-38-x86_64-linux-gnu.so
4123734041483639167                            /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/selections.py
17375372070410440550   /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/__pycache__/selections.cpython-38.pyc
11306757759007388579                          /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/selections2.py
5194722322225707588   /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/__pycache__/selections2.cpython-38.pyc
13359638461944338488                             /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/datatype.py
17244020578242913766     /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/__pycache__/datatype.cpython-38.pyc
16568253267388065553                                  /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/vds.py
8979639072945830616           /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/__pycache__/vds.cpython-38.pyc
15318755126620508599                                                        /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/copy.py
12711741908924247827                                /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/__pycache__/copy.cpython-38.pyc
7714427140906174735                                 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/attrs.py
9344690504834470365         /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/_hl/__pycache__/attrs.cpython-38.pyc
16918611628337716751                                                        /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/uuid.py
8511949921801793794                                 /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/__pycache__/uuid.cpython-38.pyc
2130744796581944128                                               /yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_0.h5
17389075414465034664                                     /yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_0.h5:/dataset
4223954931879612592                                               /yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_1.h5
14104557419321271749                                     /yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_1.h5:/dataset
3586708446079182075                                               /yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_2.h5
14252080315496364280                                     /yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_2.h5:/dataset
17916052597340637289                                              /yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_3.h5
15211073457117916342                                     /yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_3.h5:/dataset
9342247733453943503                                               /yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_4.h5
10985895144053482380                                     /yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_4.h5:/dataset
4898891139486729695                                               /yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_5.h5
15357467061712271470                                     /yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_5.h5:/dataset
3386331313084499192                                               /yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_6.h5
409581163713740906                                       /yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_6.h5:/dataset
7972149951920014792                                               /yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_7.h5
2753966606541326699                                      /yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_7.h5:/dataset
5562527676113658148                                               /yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_8.h5
6263533231926069969                                      /yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_8.h5:/dataset
8261993854308713163                                               /yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_9.h5
1478452523172783126                                      /yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_9.h5:/dataset

That table has file paths including the following:

8320392657652582313           /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5d.cpython-38-x86_64-linux-gnu.so
2182783899281604896          /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5ds.cpython-38-x86_64-linux-gnu.so
6824284610368689248           /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5f.cpython-38-x86_64-linux-gnu.so
2255611827735696210           /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5g.cpython-38-x86_64-linux-gnu.so
7369759469852332435           /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5i.cpython-38-x86_64-linux-gnu.so
3786189409375561029          /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5fd.cpython-38-x86_64-linux-gnu.so
9453911172532451432          /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/h5pl.cpython-38-x86_64-linux-gnu.so

But the granularity in the test below is obviously coarser than clustering that group individually, so it may be good to ask the Argonne team what they want in this case. Your actual and expected values still differ here as well, so I'm curious whether you are still maintaining that the "expected" value is the right one?

actual_table:
                                                                            filename_glob  glob_count
0                      /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/[.*]          64
1  /yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_[.*].h5[.*]          20
expected_df:
                                                                            filename_glob  glob_count
0   /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/site-packages/h5py/[.*]          54
1  /yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_[.*].h5[.*]          20
2                      /users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/[.*]          10

Contributor

Hopefully it doesn't throw a wrench in anything, but after looking closer at the filenames for this example, I'd actually suggest something like this for the expected output (ignoring the glob counts; I didn't implement this, so I don't have the final counts handy):

/users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/[.*].py
/users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/[.*].pyc
/users/nawtrey/.conda/envs/pydarshan_hdf5_py38/lib/python3.8/[.*].so
/yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_[.*].h5

So, at a high level, the idea is to find the "common" directories that have lots of files in them, then, within those common directories, bin things based on suffix. I think that works well for this particular scenario, but maybe others have different thoughts. Of course, you'll need some algorithmic way to decide when to stop splitting into common subdirectories (and start grouping by suffix), if that makes sense. A rough sketch of that idea follows.
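
This is only a hypothetical, simplified sketch -- it groups within an exact parent directory rather than walking up to a common ancestor as suggested above, and min_group is an arbitrary threshold:

import os
from collections import Counter, defaultdict

def group_by_dir_then_suffix(paths, min_group=3):
    # bucket paths by their immediate parent directory
    by_dir = defaultdict(list)
    for path in paths:
        by_dir[os.path.dirname(path)].append(path)
    globs = Counter()
    for dirname, files in by_dir.items():
        if len(files) >= min_group:
            # enough files here: collapse to one glob per suffix
            for f in files:
                suffix = os.path.splitext(f)[1]
                globs[f"{dirname}/[.*]{suffix}"] += 1
        else:
            # too few files to be worth globbing: keep them verbatim
            for f in files:
                globs[f] += 1
    return globs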

Note that I'm also ignoring the weird HDF5 record names that include the dataset (e.g., '/yellow/users/nawtrey/projects/hdf5_testing/test_files_write_1_bytes/test_9.h5:/dataset'). I still think we should explicitly ignore those. The only way I know how to do that is to restrict record names to only those coming from the POSIX and STDIO modules (a rough sketch follows the list):

  • Read in POSIX and STDIO module data
  • Remove all name records that don't have a corresponding match in the POSIX/STDIO record list
  • Perform analysis on those records
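
A minimal sketch of that filtering, assuming the pydarshan DarshanReport API (the log file name here is just a placeholder):

import darshan

report = darshan.DarshanReport("example.darshan", read_all=True)

# record IDs actually referenced by the POSIX/STDIO modules
used_ids = set()
for mod in ("POSIX", "STDIO"):
    if mod in report.records:
        counters = report.records[mod].to_df()["counters"]
        used_ids.update(counters["id"].unique())

# drop name records (e.g. the "test_9.h5:/dataset" entries above) that
# have no corresponding POSIX/STDIO record
filtered_names = {rec_id: name
                  for rec_id, name in report.name_records.items()
                  if rec_id in used_ids}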

@yariseidenbenz
Author

Here are the file paths for the acme_fcase_trace log file:
[screenshot: glob table for the acme_fcase_trace log]

Here are the 3 file paths that are being grouped together in the last row under the representative file path name (/projects/ccsm/inputdata/[.*]n.[.*].[.*]1[.*]0[.*].nc):

/projects/ccsm/inputdata/share/domains/domain.lnd.ne120np4_gx1v6.110502.nc
/projects/ccsm/inputdata/ocn/docn7/domain.ocn.1x1.111007.nc
/projects/ccsm/inputdata/share/domains/domain.ocn.ne120np4_gx1v6.121113.nc

It would probably be more beneficial to display them individually rather than grouped in this case.

@shanedsnyder
Contributor

Here are the 3 file paths that are being grouped together in the last row under the representative file path name (/projects/ccsm/inputdata/[.*]n.[.*].[.*]1[.*]0[.*].nc):

/projects/ccsm/inputdata/share/domains/domain.lnd.ne120np4_gx1v6.110502.nc
/projects/ccsm/inputdata/ocn/docn7/domain.ocn.1x1.111007.nc
/projects/ccsm/inputdata/share/domains/domain.ocn.ne120np4_gx1v6.121113.nc

It would probably be more beneficial to display them individually rather than grouped in this case.

I think most of this table looks good, but I have a different thought on what might work best for the issue above. I'd actually recommend either just:

/projects/ccsm/inputdata/[.*].nc

Or the list of:

/projects/ccsm/inputdata/atm/[.*].nc
/projects/ccsm/inputdata/ocn/[.*].nc
/projects/ccsm/inputdata/share/[.*].nc

Does that simplify anything or make things more complicated? I think either of those options is more helpful than the current output.

Note, that what we decide there could also affect what we decide to do with these 2 further up:

/projects/ccsm/inputdata/atm/cam/physprops/[.*].nc
/projects/ccsm/inputdata/atm/cam/[.*].nc

There are also a couple of examples that are maybe more visually complex than they need to be (I know that's not always easy to square with what the algorithm is doing):

test_F_case_cetus_dxt.[.*]00[.*]
run/timing/[.*]i[.*]

In both cases, there are a couple of "match anything" blocks ([.*]) separated by only one or two characters. I think they'd look better if they were just collapsed down into a single [.*] (throwing out the additional context of the '00' or the 'i'). Not sure if that's easy to accomplish, but maybe you could avoid having such a short run of literal characters between two [.*] blocks? See the sketch below.
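
For what it's worth, a quick post-processing pass over the generated glob strings could probably handle that collapsing; here's a rough illustrative sketch (the function name and the max_run threshold are made up):

import re

def collapse_short_runs(pattern, max_run=2):
    # merge neighboring [.*] blocks separated by at most max_run
    # literal characters into a single [.*]
    wildcard_pair = r"\[\.\*\].{1,%d}\[\.\*\]" % max_run
    prev = None
    while prev != pattern:  # repeat until nothing more collapses
        prev = pattern
        pattern = re.sub(wildcard_pair, "[.*]", pattern)
    return pattern

print(collapse_short_runs("test_F_case_cetus_dxt.[.*]00[.*]"))  # test_F_case_cetus_dxt.[.*]
print(collapse_short_runs("run/timing/[.*]i[.*]"))              # run/timing/[.*]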

@yariseidenbenz
Author

yariseidenbenz commented Jul 31, 2023

[screenshot: glob table for acme_fcase_trace, grouped by common prefix and then by file extension]

@shanedsnyder Here is the same test case (acme_fcase_trace), but now, after grouping the common prefixes together, it also groups by file extension.

Would this be the desired output?

Here is another example with the hdf5_diagonal_write_1_byte test case:
[screenshot: glob table for the hdf5_diagonal_write_1_byte test case]

@shanedsnyder
Contributor

Yeah, I find those to be a lot more helpful personally, but maybe other folks have a different thought.

My only additional comment is on the first example you shared: I notice that all of the files that don't match a glob (i.e., count of 1, no wildcard) still share a common prefix that's pretty deep into the path name -- I wonder whether in that case we should instead glob them all into one /whatever/run/[.*] glob? I think it'd look cleaner visually, at least, to have those further grouped, if it's not too complicated or seemingly arbitrary.

Thinking out loud a little more: regardless of what we decide above, it might be nice if there were a "verbose" option to the tool that could spit out the entire list of files matched by each glob (maybe indented right beneath the glob patterns in the tables you already have). That way, even if the tool doesn't group things in a way that's super helpful for users, they can still check and see which files got matched to each glob, if that makes sense? Something like the sketch below.
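
Here's roughly what I mean -- the -v flag wiring and the groups data are hypothetical, just to show the shape of the output, not the PR's actual code:

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-v", "--verbose", action="store_true",
                    help="also list every file matched by each glob")
args = parser.parse_args()

# made-up example data: glob pattern -> files it matched
groups = {
    "/projects/ccsm/inputdata/atm/[.*].nc": [
        "/projects/ccsm/inputdata/atm/example_a.nc",
        "/projects/ccsm/inputdata/atm/example_b.nc",
    ],
}

for pattern, matched in groups.items():
    print(f"{pattern}  (glob_count: {len(matched)})")
    if args.verbose:
        for path in matched:
            print(f"    {path}")  # indented right beneath the glob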

…cal clustering and common file extension.

It also has a verbose option (-v), which displays all the files within their respective groups.
I also updated the expected values in the test_glob_feature.py script accordingly.
Labels
CI continuous integration pydarshan