Bug when calculating landscape analysis agglomerative clustering mean volumes #423

michal-g · 2024-11-18T23:40:05Z

@alkinkaz found that a recent change to analyze_landscape introduced a bug in how mean volumes were being calculated for the classes recovered through agglomerative clustering of sketched volumes (saved under e.g. landscape.39/clustering_L2_average_10/state_<i>_mean.mrc and state_<i>_std.mrc):

cryodrgn/cryodrgn/commands/analyze_landscape.py

Lines 340 to 363 in 1a1a7da

    
           for i in range(M): 
        
               vol_i = np.where(labels == i)[0] 
        
               logger.info(f"State {i}: {len(vol_i)} volumes") 
        
               if vol_ind is not None: 
        
                   vol_i = np.arange(K)[vol_ind][vol_i] 
        
               vol_i_all = torch.stack( 
        
                   [ 
        
                       ImageSource.from_file( 
        
                           f"{outdir}/kmeans{K}/vol_{vol_start_index+i:03d}.mrc" 
        
                       ).images() 
        
                       for i in vol_i 
        
                   ] 
        
               ) 
        
               nparticles = np.array([kmeans_counts[i] for i in vol_i]) 
        
               vol_i_mean = np.average(vol_i_all, axis=0, weights=nparticles) 
        
               vol_i_std = ( 
        
                   np.average((vol_i_all - vol_i_mean) ** 2, axis=0, weights=nparticles) ** 0.5 
        
               ) 
        
               write_mrc( 
        
                   f"{subdir}/state_{i}_mean.mrc", vol_i_mean.astype(np.float32), Apix=Apix 
        
               ) 
        
               write_mrc( 
        
                   f"{subdir}/state_{i}_std.mrc", vol_i_std.astype(np.float32), Apix=Apix 
        
               )

became

cryodrgn/cryodrgn/commands/analyze_landscape.py

Lines 348 to 366 in 9956ded

    
           for i in range(M): 
        
               vol_i = np.where(labels == i)[0] 
        
               logger.info(f"State {i}: {len(vol_i)} volumes") 
        
               if vol_ind is not None: 
        
                   vol_i = np.arange(K)[vol_ind][vol_i] 
        
               vol_fl = os.path.join(outdir, f"kmeans{K}", f"vol_{vol_start_index+i:03d}.mrc") 
        
               vol_i_all = torch.stack([ImageSource.from_file(vol_fl).images() for i in vol_i]) 
        
               nparticles = np.array([kmeans_counts[i] for i in vol_i]) 
        
               vol_i_mean = np.average(vol_i_all, axis=0, weights=nparticles) 
        
               vol_i_std = ( 
        
                   np.average((vol_i_all - vol_i_mean) ** 2, axis=0, weights=nparticles) ** 0.5 
        
               ) 
        
               write_mrc( 
        
                   f"{subdir}/state_{i}_mean.mrc", vol_i_mean.astype(np.float32), Apix=Apix 
        
               ) 
        
               write_mrc( 
        
                   f"{subdir}/state_{i}_std.mrc", vol_i_std.astype(np.float32), Apix=Apix 
        
               )

Thus vol_i_all became an iteration over the same volume vol_fl indexed by the current i, instead of iterating over all the volumes in the cluster as indexed by vol_i. This manifests itself most clearly in producing an empty volume under state_<i>_std.mrc.

This has been fixed at 6a1e93e, where we also improved the clarity of the code to make it more robust against future bugs:

cryodrgn/cryodrgn/commands/analyze_landscape.py

Lines 351 to 369 in 6a1e93e

    
           for cluster_i in range(M): 
        
               vol_indices = np.where(labels == cluster_i)[0] 
        
               logger.info(f"State {cluster_i}: {len(vol_indices)} volumes") 
        
               if vol_ind is not None: 
        
                   vol_indices = np.arange(K)[vol_ind][vol_indices] 
        
               vol_fls = [ 
        
                   os.path.join(kmean_dir, f"vol_{vol_start_index + vol_i:03d}.mrc") 
        
                   for vol_i in vol_indices 
        
               ] 
        
               vol_i_all = torch.stack( 
        
                   [torch.Tensor(parse_mrc(vol_fl)[0]) for vol_fl in vol_fls] 
        
               ) 
        
               nparticles = np.array([kmeans_counts[vol_i] for vol_i in vol_indices]) 
        
               vol_i_mean = np.average(vol_i_all, axis=0, weights=nparticles) 
        
               vol_i_std = ( 
        
                   np.average((vol_i_all - vol_i_mean) ** 2, axis=0, weights=nparticles) ** 0.5 
        
               )

These changes will be part of the upcoming v3.4.3 release and are presently available through our beta release channel:

pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ 'cryodrgn<=3.4.3' --pre

The text was updated successfully, but these errors were encountered:

michal-g self-assigned this Nov 18, 2024

michal-g changed the title ~~Bug when calculating landscape analysis k-means mean volumes~~ Bug when calculating landscape analysis agglomerative clustering mean volumes Nov 18, 2024

michal-g mentioned this issue Dec 20, 2024

v3.4.3: Making movies, improving filtering interface, and fixing landscape analysis #427

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug when calculating landscape analysis agglomerative clustering mean volumes #423

Bug when calculating landscape analysis agglomerative clustering mean volumes #423

michal-g commented Nov 18, 2024 •

edited

Loading

Bug when calculating landscape analysis agglomerative clustering mean volumes #423

Bug when calculating landscape analysis agglomerative clustering mean volumes #423

Comments

michal-g commented Nov 18, 2024 • edited Loading

michal-g commented Nov 18, 2024 •

edited

Loading