Setup
I am reporting a problem with GSEApy version, Python version, and operating
system as follows:
python 3.9.6 (default, Nov 10 2023, 13:38:27)
[Clang 15.0.0 (clang-1500.1.0.2.5)]
CPython
macOS-14.1-arm64-arm-64bit
1.1.1
Expected behaviour
2024-02-28 10:42:18,872 [WARNING] Duplicated values found in preranked stats: 4.97% of genes
The order of those genes will be arbitrary, which may produce unexpected results.
2024-02-28 10:42:18,872 [INFO] Parsing data files for GSEA.............................
2024-02-28 10:42:18,872 [INFO] Enrichr library gene sets already downloaded in: /Users/kpbr532/.cache/gseapy, use local file
2024-02-28 10:42:18,880 [INFO] 0000 gene_sets have been filtered out when max_size=1000 and min_size=5
2024-02-28 10:42:18,880 [INFO] 0050 gene_sets used for further statistical testing.....
2024-02-28 10:42:18,880 [INFO] Start to run GSEA...Might take a while..................
2024-02-28 10:42:20,297 [INFO] Congratulations. GSEApy runs successfully................
Actual behaviour
In [76]: pre_res = gp.prerank(rnk='RNAseq.rnk', # or rnk = rnk,
...: gene_sets='MSigDB_Hallmark_2020',
...: threads=4,
...: min_size=5,
...: max_size=1000,
...: permutation_num=1000, # reduce number to speed up testing
...: outdir=None, # don't write to disk
...: seed=6,
...: verbose=True, # see what's going on behind the scenes
...: )
2024-02-28 10:42:50,362 [INFO] Input gene rankings contains duplicated IDs
KeyError Traceback (most recent call last)
Cell In[76], line 1
----> 1 pre_res = gp.prerank(rnk='RNAseq.rnk', # or rnk = rnk,
2 gene_sets='MSigDB_Hallmark_2020',
3 threads=4,
4 min_size=5,
5 max_size=1000,
6 permutation_num=1000, # reduce number to speed up testing
7 outdir=None, # don't write to disk
8 seed=6,
9 verbose=True, # see what's going on behind the scenes
10 )
File ~/Library/Python/3.9/lib/python/site-packages/gseapy/__init__.py:396, in prerank(rnk, gene_sets, outdir, pheno_pos, pheno_neg, min_size, max_size, permutation_num, weight, ascending, threads, figsize, format, graph_num, no_plot, seed, verbose, *arg, **kwargs)
375 weight = kwargs["weighted_score_type"]
377 pre = Prerank(
378 rnk,
379 gene_sets,
(...)
394 verbose,
395 )
--> 396 pre.run()
397 return pre
File ~/Library/Python/3.9/lib/python/site-packages/gseapy/gsea.py:444, in Prerank.run(self)
441 assert self.min_size <= self.max_size
443 # parsing rankings
--> 444 dat2 = self.load_ranking()
445 assert len(dat2) > 1
446 self.ranking = dat2
File ~/Library/Python/3.9/lib/python/site-packages/gseapy/gsea.py:418, in Prerank.load_ranking(self)
415 rank_metric = self._load_data(self.rnk) # gene id is the first column
416 if rank_metric.select_dtypes(np.number).shape[1] == 1:
417 # return series
--> 418 return self._load_ranking(rank_metric)
419 ## In case the input type multi-column ranking dataframe
420 # drop na gene id values
421 rank_metric = rank_metric.dropna(subset=rank_metric.columns[0])
File ~/Library/Python/3.9/lib/python/site-packages/gseapy/gsea.py:385, in Prerank._load_ranking(self, rank_metric)
383 rank_metric.dropna(how="any", inplace=True)
384 # rename duplicate id, make them unique
--> 385 rank_metric = self.make_unique(rank_metric, col_idx=0)
386 # reset ranking index, because you have sort values and drop duplicates.
387 rank_metric.reset_index(drop=True, inplace=True)
File ~/Library/Python/3.9/lib/python/site-packages/gseapy/base.py:246, in GSEAbase.make_unique(self, rank_metric, col_idx)
243 self._logger.info("Input gene rankings contains duplicated IDs")
244 mask = rank_metric.duplicated(subset=id_col, keep=False)
245 dups = (
--> 246 rank_metric.loc[mask, id_col]
247 .groupby(id_col)
248 .cumcount()
249 .map(lambda c: "_" + str(c) if c else "")
250 )
251 rank_metric.loc[mask, id_col] = rank_metric.loc[mask, id_col] + dups
252 return rank_metric
File ~/Library/Python/3.9/lib/python/site-packages/pandas/core/series.py:2076, in Series.groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, observed, dropna)
2073 raise TypeError("You have to supply one of 'by' and 'level'")
2074 axis = self._get_axis_number(axis)
-> 2076 return SeriesGroupBy(
2077 obj=self,
2078 keys=by,
2079 axis=axis,
2080 level=level,
2081 as_index=as_index,
2082 sort=sort,
2083 group_keys=group_keys,
2084 squeeze=squeeze,
2085 observed=observed,
2086 dropna=dropna,
2087 )
File ~/Library/Python/3.9/lib/python/site-packages/pandas/core/groupby/groupby.py:965, in GroupBy.__init__(self, obj, keys, axis, level, grouper, exclusions, selection, as_index, sort, group_keys, squeeze, observed, mutated, dropna)
962 if grouper is None:
963 from pandas.core.groupby.grouper import get_grouper
--> 965 grouper, exclusions, obj = get_grouper(
966 obj,
967 keys,
968 axis=axis,
969 level=level,
970 sort=sort,
971 observed=observed,
972 mutated=self.mutated,
973 dropna=self.dropna,
974 )
976 self.obj = obj
977 self.axis = obj._get_axis_number(axis)
File ~/Library/Python/3.9/lib/python/site-packages/pandas/core/groupby/grouper.py:888, in get_grouper(obj, key, axis, level, sort, observed, mutated, validate, dropna)
886 in_axis, level, gpr = False, gpr, None
887 else:
--> 888 raise KeyError(gpr)
889 elif isinstance(gpr, Grouper) and gpr.key is not None:
890 # Add key to exclusions
891 exclusions.add(gpr.key)
KeyError: 0
Steps to reproduce
See my attached rnk file; I can't identify the problem with it. Please note that I had to append a .txt extension to the file name, because GitHub would not accept the upload otherwise.
RNAseq.rnk.txt
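For reference, the step that fails in the traceback (suffixing repeated gene IDs so that they become unique) can be sketched with plain pandas. This is a minimal sketch, not GSEApy's own code: `make_ids_unique` and the toy data are hypothetical names for illustration, and it assumes the ranking was read without a header row, so the ID column is labelled with the integer 0.

```python
import pandas as pd


def make_ids_unique(rnk: pd.DataFrame, col_idx: int = 0) -> pd.DataFrame:
    """Return a copy of rnk with repeated IDs suffixed _1, _2, ... (hypothetical helper)."""
    out = rnk.copy()
    id_col = out.columns[col_idx]
    ids = out[id_col].astype(str)
    # Group the ID Series by its own values instead of by a column label,
    # so the lookup does not depend on what the column is called.
    counts = ids.groupby(ids).cumcount()
    # First occurrence keeps its name; later ones get a numeric suffix.
    suffix = counts.map(lambda c: "_" + str(c) if c else "")
    out[id_col] = ids + suffix
    return out


# Toy ranking without a header row: the columns are labelled 0 and 1.
toy = pd.DataFrame(
    {
        0: ["TP53", "MYC", "TP53", "EGFR", "TP53"],
        1: [2.5, 1.2, 0.3, -0.7, -1.9],
    }
)
unique = make_ids_unique(toy)
print(unique[0].tolist())  # ['TP53', 'MYC', 'TP53_1', 'EGFR', 'TP53_2']
```

A DataFrame cleaned this way could then be passed as `rnk=` instead of the file path, which may avoid the failing code path. Whether that sidesteps the error on a given GSEApy and pandas combination is untested here.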