mdtest: -I and -n behavior #106

Closed
marcvef opened this issue Oct 17, 2018 · 5 comments

Comments

@marcvef
Contributor

marcvef commented Oct 17, 2018

I've been testing the current master of mdtest and ran into some issues with the -I and -n parameters, whose behavior seems to have changed fundamentally with commit 0870ad7, causing wrong output and floating point exceptions. Below I compare the current behavior with the prior behavior.

-I argument

For instance, a concurrent file creation benchmark in a single directory could be run like this: mpiexec -n 4 src/mdtest -a POSIX -z 0 -b 1 -i 1 -d /tmp/test -I 10 -F with a depth of 0 resulting in 10 files per process being created in the same directory. The resulting mdtest output is the following:

-- started at 10/17/2018 11:33:12 --

mdtest-1.9.3 was launched with 4 total task(s) on 1 node(s)
Command line used: src/mdtest "-a" "POSIX" "-z" "0" "-b" "1" "-i" "1" "-d" "/tmp/test" "-I" "10" "-F"

Path: /tmp
FS: 15.6 GiB   Used FS: 0.3%   Inodes: 3.9 Mi   Used Inodes: 0.0%

4 tasks, 0 files

SUMMARY rate: (of 1 iterations)
   Operation                      Max            Min           Mean        Std Dev
   ---------                      ---            ---           ----        -------
   File creation     :          0.000          0.000          0.000          0.000
   File stat         :          0.000          0.000          0.000          0.000
   File read         :          0.000          0.000          0.000          0.000
   File removal      :          0.000          0.000          0.000          0.000
   Tree creation     :     121998.787     121998.787     121998.787          0.000
   Tree removal      :     172246.533     172246.533     172246.533          0.000

-- finished at 10/17/2018 11:33:12 --

In fact, mdtest runs the 40-file workload correctly, but because -I now sets items_per_dir and no longer assigns a value to the items variable (used to calculate the throughput), mdtest reports 0 files and a rate of 0 for all file operations. The reason is that -I now depends on -u (see https://github.com/hpc/ior/blob/master/src/mdtest.c#L2340-L2342 ), and I am not sure why this dependency was added. However, -u creates a completely different workload (each process operates on its own directory instead of all processes sharing the same directory). I understand that I could just use -n 10 instead of -I 10. Nevertheless, -I without -u still reports something different from what was actually executed.
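To illustrate why the summary shows 0.000 even though files were created, here is a minimal sketch built around the shape of the rate formula in mdtest.c (items * size / elapsed time). The function names, the made-up elapsed time, and the reading of size as the number of MPI tasks are assumptions for illustration, not the actual mdtest code.

```c
#include <stdio.h>

/* Hypothetical helper with the same shape as the rate formula in mdtest.c:
 * items * size / elapsed time. 'num_tasks' stands in for 'size', which is
 * assumed here to be the number of MPI tasks. */
double op_rate(unsigned long long items, int num_tasks, double elapsed_sec)
{
    return (double)items * num_tasks / elapsed_sec;
}

/* Prints the rate for a run in which 'items' was never assigned (stays 0)
 * next to a run in which it holds the per-process file count. */
void show_rates(void)
{
    double elapsed = 0.5; /* made-up duration in seconds */

    printf("items = 0  -> %.3f ops/s\n", op_rate(0, 4, elapsed));
    printf("items = 10 -> %.3f ops/s\n", op_rate(10, 4, elapsed));
}
```

If items stays 0 because only items_per_dir was set, the numerator is 0 and every file rate is reported as 0, no matter how many files were actually created.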

-n argument

Floating point exception

When I execute mpiexec -n 4 src/mdtest -a POSIX -z 1 -b 3 -i 1 -d /tmp/test -n 4 -F, mdtest should create 4 directories (1 root, 3 leaves) and then distribute the workload across all of them, i.e., each process creates, stats, and removes 1 file in each directory. This is how it worked in the past. At the moment, mdtest exits with a floating point exception (integer divide-by-zero, see the error below) because num_dirs_in_tree, which is used to calculate the number of files each process should handle in each directory, is 0. Previously, this variable was set correctly and the test ran as expected.

Error message:
[evie:27297] *** Process received signal ***                                                                                                                                                                                                   
[evie:27297] Signal: Floating point exception (8)                                                                                                                                                                                              
[evie:27297] Signal code: Integer divide-by-zero (1)                                                                                                                                                                                           
[evie:27297] Failing at address: 0x55e51ba01863                                                                                                                                                                                                
[evie:27295] *** Process received signal ***                                                                                                                                                                                                   
[evie:27295] Signal: Floating point exception (8)                                                                                                                                                                                              
[evie:27295] Signal code: Integer divide-by-zero (1)                                                                                                                                                                                           
[evie:27295] Failing at address: 0x56083a9cb863                                                                                                                                                                                                
[evie:27296] *** Process received signal ***                                                                                                                                                                                                   
[evie:27296] Signal: Floating point exception (8)                                                                                                                                                                                              
[evie:27296] Signal code: Integer divide-by-zero (1)                                                                                                                                                                                           
[evie:27296] Failing at address: 0x563ec590e863                                                                                                                                                                                                
-- started at 10/17/2018 12:05:16 --                                                                                                                                                                                                           
                                                                                                                                                                                                                                               
mdtest-1.9.3 was launched with 4 total task(s) on 1 node(s)                                                                                                                                                                                    
Command line used: src/mdtest "-a" "POSIX" "-z" "1" "-b" "3" "-i" "1" "-d" "/tmp/test" "-n" "4" "-F" "-C"                                                                                                                                      
[evie:27294] *** Process received signal ***                                                                                                                                                                                                   
[evie:27294] Signal: Floating point exception (8)                                                                                                                                                                                              
[evie:27294] Signal code: Integer divide-by-zero (1)                                                                                                                                                                                           
[evie:27294] Failing at address: 0x55a8d2de7863                                                                                                                                                                                                
[evie:27296] [ 0] [evie:27297] [ 0] /lib64/libpthread.so.0(+0x144c0)[0x7f5d52e9b4c0]                                                                                                                                                           
[evie:27297] [evie:27295] [ 0] /lib64/libpthread.so.0(+0x144c0)[0x7fd3e8e634c0]                                                                                                                                                                
[evie:27295] [ 1] src/mdtest(+0xf863)[0x56083a9cb863]                                                                                                                                                                                          
/lib64/libpthread.so.0(+0x144c0)[0x7f86f28134c0]                                                                                                                                                                                               
[evie:27296] [ 1] src/mdtest(+0xf863)[0x563ec590e863]                                                                                                                                                                                          
[evie:27296] [ 2] src/mdtest(+0x346b)[0x563ec590246b]                                                                                                                                                                                          
[evie:27296] [ 3] [evie:27294] [ 0] /lib64/libpthread.so.0(+0x144c0)[0x7fc9f3a754c0]                                                                                                                                                           
[evie:27294] [ 1] src/mdtest(+0xf863)[0x55a8d2de7863]                                                                                                                                                                                          
[evie:27294] [ 2] src/mdtest(+0x346b)[0x55a8d2ddb46b]                                                                                                                                                                                          
[evie:27294] [ 3] [ 1] src/mdtest(+0xf863)[0x55e51ba01863]                                                                                                                                                                                     
[evie:27297] [ 2] src/mdtest(+0x346b)[0x55e51b9f546b]                                                                                                                                                                                          
[evie:27297] [ 3] [evie:27295] [ 2] src/mdtest(+0x346b)[0x56083a9bf46b]                                                                                                                                                                        
[evie:27295] [ 3] /lib64/libc.so.6(__libc_start_main+0xf1)[0x7f86f245e011]                                                                                                                                                                     
[evie:27296] [ 4] src/mdtest(+0x34aa)[0x563ec59024aa]                                                                                                                                                                                          
[evie:27296] *** End of error message ***                                                                                                                                                                                                      
/lib64/libc.so.6(__libc_start_main+0xf1)[0x7f5d52ae6011]                                                                                                                                                                                       
[evie:27297] [ 4] src/mdtest(+0x34aa)[0x55e51b9f54aa]                                                                                                                                                                                          
[evie:27297] *** End of error message ***                                                                                                                                                                                                      
/lib64/libc.so.6(__libc_start_main+0xf1)[0x7fd3e8aae011]                                                                                                                                                                                       
[evie:27295] [ 4] src/mdtest(+0x34aa)[0x56083a9bf4aa]                                                                                                                                                                                          
[evie:27295] *** End of error message ***                                                                                                                                                                                                      
/lib64/libc.so.6(__libc_start_main+0xf1)[0x7fc9f36c0011]                                                                                                                                                                                       
[evie:27294] [ 4] src/mdtest(+0x34aa)[0x55a8d2ddb4aa]                                                                                                                                                                                          
[evie:27294] *** End of error message ***                                                                                                                                                                                                      
--------------------------------------------------------------------------                                                                                                                                                                     
mpiexec noticed that process rank 1 with PID 0 on node evie exited on signal 8 (Floating point exception).
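
As a rough sketch of the arithmetic behind this crash: with -z 1 -b 3 the tree should contain 1 + 3 = 4 directories, and the per-directory file count is the item count divided by that number. The function names below are hypothetical and the directory-count formula is an assumption based on the "1 root, 3 leaves" description above, not a copy of the mdtest code.

```c
/* Number of directories in a tree with branching factor 'branch' (-b) and
 * depth 'depth' (-z), counted level by level: 1 + b + b^2 + ... + b^depth.
 * Assumed formula, derived from the "1 root, 3 leaves" example. */
unsigned long long dirs_in_tree(unsigned branch, unsigned depth)
{
    unsigned long long total = 0;
    unsigned long long level = 1; /* directories on the current level */

    for (unsigned i = 0; i <= depth; i++) {
        total += level;
        level *= branch;
    }
    return total;
}

/* Files each process handles per directory. Returning 0 when num_dirs is 0
 * avoids the integer divide-by-zero (SIGFPE) seen in the error message. */
unsigned long long files_per_dir(unsigned long long items,
                                 unsigned long long num_dirs)
{
    if (num_dirs == 0) {
        return 0; /* a real tool would more likely report an error here */
    }
    return items / num_dirs;
}
```

For the command above, dirs_in_tree(3, 1) gives 4 and files_per_dir(4, 4) gives 1, which matches the expected one file per process in each directory.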

Representation of results

When I execute mpiexec -n 4 src/mdtest -a POSIX -z 1 -b 3 -i 1 -d /tmp/test -n 4 -I 4 -F, I am actually not quite sure what I am telling mdtest to do. In general, I am now a bit confused about how -n and -I interact. Previously, the two parameters couldn't be used at the same time. Perhaps the documentation could make this clearer? With the above command, mdtest creates, stats, and removes 4 files per process in each directory, i.e., 16 files per process and 64 files in total. However, mdtest tells me that the workload is 16 files instead of 64 files:

[...]
FS: 15.6 GiB   Used FS: 0.3%   Inodes: 3.9 Mi   Used Inodes: 0.0%

4 tasks, 16 files

SUMMARY rate: (of 1 iterations)
[...]

In this context, I am not sure whether the throughput calculation is even correct. In general, I think it would be a good idea to explicitly print the number of files/directories in total, per process, and per directory to avoid confusion.
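
A minimal sketch of what such an explicit breakdown could look like, using the numbers from the command above (4 files per process in each of 4 directories, 4 tasks). The struct, the function names, and the output wording are hypothetical and are not part of mdtest.

```c
#include <stdio.h>

/* Hypothetical breakdown of a file workload. */
struct workload {
    unsigned long long per_dir;     /* files per process in each directory */
    unsigned long long per_process; /* files per process over the whole tree */
    unsigned long long total;       /* files over all processes */
};

/* Derives the per-process and total counts from the per-directory count. */
struct workload describe_workload(unsigned long long files_per_dir,
                                  unsigned long long num_dirs,
                                  unsigned long long num_tasks)
{
    struct workload w;

    w.per_dir = files_per_dir;
    w.per_process = files_per_dir * num_dirs;
    w.total = w.per_process * num_tasks;
    return w;
}

/* Prints all three counts on one line so none of them has to be guessed. */
void print_workload(const struct workload *w, unsigned long long num_tasks)
{
    printf("%llu tasks, %llu files in total (%llu per process, "
           "%llu per process and directory)\n",
           num_tasks, w->total, w->per_process, w->per_dir);
}
```

With describe_workload(4, 4, 4), the breakdown is 4 files per process and directory, 16 per process, and 64 in total, so the "16 files" in the output above would correspond to the per-process count.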

Sorry for the long write-up. I've been using mdtest for some time now and I am a bit confused by this new behavior, as it may break the scripts of many users who rely on these parameters.

Thanks!

JulianKunkel added a commit that referenced this issue Oct 17, 2018
@JulianKunkel
Collaborator

Please check the bugfix branch, which seems to address the bug.
Note that in your second example, mpiexec -n 4 src/mdtest -a POSIX -z 1 -b 3 -i 1 -d /tmp/test -n 4 -F,
4 files are actually created per process; I think that was the expected behavior.

Regarding the documentation, this will be done.
What it does when specifying e.g.:
./src/mdtest -a POSIX -z 1 -b 3 -i 1 -d /tmp/test -n 40 -I 4 -F
It will create batches of subdirectories containing 4 items (-I 4) until 40 (-n 40) is satisfied. This makes it possible to test extremely large directories even when the file system limits the number of files allowed per directory. Note that this only emulates a large directory, so the performance might be a bit different.
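
The batching described here can be sketched as a simple ceiling division: how many subdirectories of at most -I items are needed to reach -n items. The function name is hypothetical, and rounding up when -n is not a multiple of -I is an assumption; mdtest may handle that case differently.

```c
/* Number of batches (subdirectories) of at most 'items_per_dir' items that
 * are needed to hold 'items' items. Rounds up so that a final, partially
 * filled batch is counted too (assumed behavior). */
unsigned long long batches_needed(unsigned long long items,
                                  unsigned long long items_per_dir)
{
    if (items_per_dir == 0) {
        return 0; /* no batching possible; avoids dividing by zero */
    }
    return (items + items_per_dir - 1) / items_per_dir;
}
```

For -n 40 -I 4, batches_needed(40, 4) yields 10, i.e., ten subdirectories with 4 items each.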

@marcvef
Contributor Author

marcvef commented Oct 17, 2018

Thanks for the quick fix! The issues with -I and the floating point exception were fixed. You are right, it should create 4 files per process.

What it does when specifying e.g.:
./src/mdtest -a POSIX -z 1 -b 3 -i 1 -d /tmp/test -n 40 -I 4 -F
It will create batches of subdirectories containing 4 items (-I 4) until 40 (-n 40) is satisfied.

That makes sense and it looks like mdtest is doing exactly that. So actually only the throughput computation is incorrect. It relies only on items, which is set by -n. It doesn't take num_dirs_in_tree into account, although that variable has the correct value (the same applies to the number of files stated in the output). See e.g.,

ior/src/mdtest.c

Line 1265 in f4afa63

summary_table[iteration].rate[4] = items*size/(t[1] - t[0]);
for create throughput calculation.

@JulianKunkel
Collaborator

JulianKunkel commented Oct 17, 2018 via email

@marcvef
Contributor Author

marcvef commented Oct 17, 2018

Great, thanks! In terms of correctness, code and behavior look good to me.

@glennklockwood
Contributor

Thanks for fixing this so quickly, @JulianKunkel. Can you merge the fix branch into master? I'll cherry-pick it into the RC for the 3.2.0 release from there.

glennklockwood pushed a commit to glennklockwood/ior that referenced this issue Oct 19, 2018
3 participants