This repository has been archived by the owner on May 6, 2021. It is now read-only.
<span id="groupby"></span><h1><span class="yiyi-st" id="yiyi-76">Group By: split-apply-combine</span></h1>
<blockquote>
<p>Original: <a href="http://pandas.pydata.org/pandas-docs/stable/groupby.html">http://pandas.pydata.org/pandas-docs/stable/groupby.html</a></p>
<p>Translators: <a href="https://github.com/wizardforcel">飞龙</a> <a href="http://usyiyi.cn/">UsyiyiCN</a></p>
<p>Proofreader: (position open)</p>
</blockquote>
<p><span class="yiyi-st" id="yiyi-77">By “group by” we are referring to a process involving one or more of the following steps:</span></p>
<blockquote>
<div><ul class="simple">
<li><span class="yiyi-st" id="yiyi-78"><strong>Splitting</strong> the data into groups based on some criteria</span></li>
<li><span class="yiyi-st" id="yiyi-79"><strong>Applying</strong> a function to each group independently</span></li>
<li><span class="yiyi-st" id="yiyi-80"><strong>Combining</strong> the results into a data structure</span></li>
</ul>
</div></blockquote>
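<p><span class="yiyi-st" id="yiyi-81">Of these, the split step is the most straightforward.</span> <span class="yiyi-st" id="yiyi-82">In fact, in many situations you may wish to split the data set into groups and do something with those groups yourself.</span> <span class="yiyi-st" id="yiyi-83">In the apply step, we might wish to do one of the following:</span></p>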
<blockquote>
<div><ul>
<li><p class="first"><span class="yiyi-st" id="yiyi-84"><strong>Aggregation</strong>: computing a summary statistic (or statistics) about each group.</span> <span class="yiyi-st" id="yiyi-85">Some examples:</span></p>
<blockquote>
<div><ul class="simple">
<li><span class="yiyi-st" id="yiyi-86">Compute group sums or means</span></li>
<li><span class="yiyi-st" id="yiyi-87">Compute group sizes / counts</span></li>
</ul>
</div></blockquote>
</li>
<li><p class="first"><span class="yiyi-st" id="yiyi-88"><strong>Transformation</strong>: perform some group-specific computations and return a like-indexed object.</span> <span class="yiyi-st" id="yiyi-89">Some examples:</span></p>
<blockquote>
<div><ul class="simple">
<li><span class="yiyi-st" id="yiyi-90">Standardizing data (zscore) within each group</span></li>
<li><span class="yiyi-st" id="yiyi-91">Filling NAs within groups with a value derived from each group</span></li>
</ul>
</div></blockquote>
</li>
<li><p class="first"><span class="yiyi-st" id="yiyi-92"><strong>Filtration</strong>: discard some groups, according to a group-wise computation that evaluates to True or False.</span> <span class="yiyi-st" id="yiyi-93">Some examples:</span></p>
<blockquote>
<div><ul class="simple">
<li><span class="yiyi-st" id="yiyi-94">Discarding data that belongs to groups with only a few members</span></li>
<li><span class="yiyi-st" id="yiyi-95">Filtering out data based on the group sum or mean</span></li>
</ul>
</div></blockquote>
</li>
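<li><p class="first"><span class="yiyi-st" id="yiyi-96">Some combination of the above: GroupBy will examine the results of the apply step and try to return a sensibly combined result if it does not fit into any of the categories above</span></p>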
</li>
</ul>
</div></blockquote>
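<p>The three kinds of apply step listed above can be sketched on a small frame. The data and the column names <code class="docutils literal"><span class="pre">key</span></code> and <code class="docutils literal"><span class="pre">val</span></code> are made up for illustration; this is a minimal sketch, not a complete tour of the API.</p>

```python
import pandas as pd

# Hypothetical toy data, not taken from the original text.
df = pd.DataFrame({
    "key": ["a", "a", "b", "b", "b", "c"],
    "val": [1.0, 3.0, 2.0, 4.0, 6.0, 10.0],
})
g = df.groupby("key")["val"]

# Aggregation: one summary value per group.
means = g.mean()

# Transformation: a like-indexed result, here the within-group z-score.
zscores = g.transform(lambda x: (x - x.mean()) / x.std())

# Filtration: drop every group that has fewer than two members.
kept = df.groupby("key").filter(lambda grp: len(grp) >= 2)
```

<p>Note that the transformed result keeps the index of the input, whereas the aggregated result is indexed by the group keys.</p>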
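<p><span class="yiyi-st" id="yiyi-97">Since the set of object instance methods on pandas data structures is generally rich and expressive, we often simply want to invoke, say, a DataFrame function on each group.</span> <span class="yiyi-st" id="yiyi-98">The name GroupBy should be quite familiar to those who have used a SQL-based tool (or <code class="docutils literal"><span class="pre">itertools</span></code>), in which you can write code like:</span></p>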
<div class="highlight-sql"><div class="highlight"><pre><span></span><span class="k">SELECT</span> <span class="n">Column1</span><span class="p">,</span> <span class="n">Column2</span><span class="p">,</span> <span class="n">mean</span><span class="p">(</span><span class="n">Column3</span><span class="p">),</span> <span class="k">sum</span><span class="p">(</span><span class="n">Column4</span><span class="p">)</span>
<span class="k">FROM</span> <span class="n">SomeTable</span>
<span class="k">GROUP</span> <span class="k">BY</span> <span class="n">Column1</span><span class="p">,</span> <span class="n">Column2</span>
</pre></div>
</div>
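<p><span class="yiyi-st" id="yiyi-99">We aim to make operations like this natural and easy to express using pandas.</span> <span class="yiyi-st" id="yiyi-100">We will address each area of GroupBy functionality and then provide some non-trivial examples / use cases.</span></p>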
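<p>As a rough sketch, the SQL query above could be expressed in pandas as follows. The frame <code class="docutils literal"><span class="pre">some_table</span></code> and its contents are hypothetical stand-ins for <code class="docutils literal"><span class="pre">SomeTable</span></code>.</p>

```python
import pandas as pd

# Hypothetical stand-in for SomeTable; the values are invented.
some_table = pd.DataFrame({
    "Column1": ["x", "x", "y", "y"],
    "Column2": ["p", "p", "q", "r"],
    "Column3": [1.0, 3.0, 5.0, 7.0],
    "Column4": [10, 20, 30, 40],
})

# GROUP BY Column1, Column2 with mean(Column3) and sum(Column4).
result = (
    some_table
    .groupby(["Column1", "Column2"])
    .agg({"Column3": "mean", "Column4": "sum"})
    .reset_index()  # turn the group keys back into ordinary columns
)
```

<p>The call to <code class="docutils literal"><span class="pre">reset_index()</span></code> is optional; without it the group keys form the index of the result.</p>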
<p><span class="yiyi-st" id="yiyi-101">See the <a class="reference internal" href="cookbook.html#cookbook-grouping"><span class="std std-ref">cookbook</span></a> for some advanced strategies.</span></p>
<div class="section" id="splitting-an-object-into-groups">
<span id="groupby-split"></span><h2><span class="yiyi-st" id="yiyi-102">Splitting an object into groups</span></h2>
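<p><span class="yiyi-st" id="yiyi-103">pandas objects can be split on any of their axes.</span> <span class="yiyi-st" id="yiyi-104">The abstract definition of grouping is to provide a mapping of labels to group names.</span> <span class="yiyi-st" id="yiyi-105">To create a GroupBy object (more on what the GroupBy object is later), do the following:</span></p>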
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="go">>>> grouped = obj.groupby(key)</span>
<span class="go">>>> grouped = obj.groupby(key, axis=1)</span>
<span class="go">>>> grouped = obj.groupby([key1, key2])</span>
</pre></div>
</div>
<p><span class="yiyi-st" id="yiyi-106">The mapping can be specified in many different ways:</span></p>
<blockquote>
<div><ul class="simple">
<li><span class="yiyi-st" id="yiyi-107">A Python function, to be called on each of the axis labels</span></li>
<li><span class="yiyi-st" id="yiyi-108">A list or NumPy array of the same length as the selected axis</span></li>
<li><span class="yiyi-st" id="yiyi-109">A dict or Series, providing a <code class="docutils literal"><span class="pre">label</span> <span class="pre">-&gt;</span> <span class="pre">group</span> <span class="pre">name</span></code> mapping</span></li>
<li><span class="yiyi-st" id="yiyi-110">For DataFrame objects, a string indicating a column to be used to group.</span> <span class="yiyi-st" id="yiyi-111">Of course <code class="docutils literal"><span class="pre">df.groupby('A')</span></code> is just syntactic sugar for <code class="docutils literal"><span class="pre">df.groupby(df['A'])</span></code>, but it makes life simpler</span></li>
<li><span class="yiyi-st" id="yiyi-112">A list of any of the above things</span></li>
</ul>
</div></blockquote>
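<p>The ways of specifying a mapping listed above can be compared side by side. The frame below, its index labels and the dict of group names are invented for illustration.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical data with string index labels so that every kind of key applies.
df = pd.DataFrame(
    {"A": ["foo", "bar", "foo", "bar"], "C": [1, 2, 3, 4]},
    index=["w", "x", "y", "z"],
)

# A column name, and the Series it is shorthand for.
by_column = df.groupby("A")["C"].sum()
by_series = df.groupby(df["A"])["C"].sum()

# An array with the same length as the grouped axis.
by_array = df.groupby(np.array([0, 0, 1, 1]))["C"].sum()

# A dict providing a label -> group name mapping.
by_dict = df.groupby({"w": "first", "x": "first", "y": "last", "z": "last"})["C"].sum()

# A function called on each index label.
by_func = df.groupby(lambda label: label in "wx")["C"].sum()
```

<p>All five calls group the same four rows; only the way the group of each row is determined differs.</p>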
<p><span class="yiyi-st" id="yiyi-113">Collectively we refer to the grouping objects as the <strong>keys</strong>.</span> <span class="yiyi-st" id="yiyi-114">For example, consider the following DataFrame:</span></p>
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [1]: </span><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s1">'A'</span> <span class="p">:</span> <span class="p">[</span><span class="s1">'foo'</span><span class="p">,</span> <span class="s1">'bar'</span><span class="p">,</span> <span class="s1">'foo'</span><span class="p">,</span> <span class="s1">'bar'</span><span class="p">,</span>
<span class="gp"> ...:</span> <span class="s1">'foo'</span><span class="p">,</span> <span class="s1">'bar'</span><span class="p">,</span> <span class="s1">'foo'</span><span class="p">,</span> <span class="s1">'foo'</span><span class="p">],</span>
<span class="gp"> ...:</span> <span class="s1">'B'</span> <span class="p">:</span> <span class="p">[</span><span class="s1">'one'</span><span class="p">,</span> <span class="s1">'one'</span><span class="p">,</span> <span class="s1">'two'</span><span class="p">,</span> <span class="s1">'three'</span><span class="p">,</span>
<span class="gp"> ...:</span> <span class="s1">'two'</span><span class="p">,</span> <span class="s1">'two'</span><span class="p">,</span> <span class="s1">'one'</span><span class="p">,</span> <span class="s1">'three'</span><span class="p">],</span>
<span class="gp"> ...:</span> <span class="s1">'C'</span> <span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">8</span><span class="p">),</span>
<span class="gp"> ...:</span> <span class="s1">'D'</span> <span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">8</span><span class="p">)})</span>
<span class="gp"> ...:</span>
<span class="gp">In [2]: </span><span class="n">df</span>
<span class="gr">Out[2]: </span>
<span class="go"> A B C D</span>
<span class="go">0 foo one 0.469112 -0.861849</span>
<span class="go">1 bar one -0.282863 -2.104569</span>
<span class="go">2 foo two -1.509059 -0.494929</span>
<span class="go">3 bar three -1.135632 1.071804</span>
<span class="go">4 foo two 1.212112 0.721555</span>
<span class="go">5 bar two -0.173215 -0.706771</span>
<span class="go">6 foo one 0.119209 -1.039575</span>
<span class="go">7 foo three -1.044236 0.271860</span>
</pre></div>
</div>
<p><span class="yiyi-st" id="yiyi-115">We could naturally group by either the <code class="docutils literal"><span class="pre">A</span></code> or <code class="docutils literal"><span class="pre">B</span></code> columns or both:</span></p>
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [3]: </span><span class="n">grouped</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">'A'</span><span class="p">)</span>
<span class="gp">In [4]: </span><span class="n">grouped</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'A'</span><span class="p">,</span> <span class="s1">'B'</span><span class="p">])</span>
</pre></div>
</div>
<p><span class="yiyi-st" id="yiyi-116">These will split the DataFrame on its index (rows).</span> <span class="yiyi-st" id="yiyi-117">We could also split by the columns:</span></p>
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [5]: </span><span class="k">def</span> <span class="nf">get_letter_type</span><span class="p">(</span><span class="n">letter</span><span class="p">):</span>
<span class="gp"> ...:</span> <span class="k">if</span> <span class="n">letter</span><span class="o">.</span><span class="n">lower</span><span class="p">()</span> <span class="ow">in</span> <span class="s1">'aeiou'</span><span class="p">:</span>
<span class="gp"> ...:</span> <span class="k">return</span> <span class="s1">'vowel'</span>
<span class="gp"> ...:</span> <span class="k">else</span><span class="p">:</span>
<span class="gp"> ...:</span> <span class="k">return</span> <span class="s1">'consonant'</span>
<span class="gp"> ...:</span>
<span class="gp">In [6]: </span><span class="n">grouped</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">get_letter_type</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
</pre></div>
</div>
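<p>To the best of our knowledge, recent pandas releases (2.1 and later) deprecate the <code class="docutils literal"><span class="pre">axis</span></code> argument of <code class="docutils literal"><span class="pre">groupby</span></code>. A possible alternative, sketched here on a small made-up frame, is to transpose first so that the column labels become the index.</p>

```python
import pandas as pd

# Hypothetical frame with the same column labels as the example above.
df = pd.DataFrame({"A": ["foo", "bar"], "B": ["one", "two"],
                   "C": [1.0, 2.0], "D": [3.0, 4.0]})

def get_letter_type(letter):
    return "vowel" if letter.lower() in "aeiou" else "consonant"

# After transposing, the function is called on each former column label.
groups = df.T.groupby(get_letter_type).groups
column_groups = {name: list(labels) for name, labels in groups.items()}
```

<p>Here <code class="docutils literal"><span class="pre">column_groups</span></code> maps each group name to the list of column labels that fall into it.</p>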
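<p><span class="yiyi-st" id="yiyi-118">Starting with 0.8, pandas Index objects support duplicate values.</span> <span class="yiyi-st" id="yiyi-119">If a non-unique index is used as the group key in a groupby operation, all values for the same index value are considered to be in one group, and thus the output of aggregation functions contains only unique index values:</span></p>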
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [7]: </span><span class="n">lst</span> <span class="o">=</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">]</span>
<span class="gp">In [8]: </span><span class="n">s</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="mi">30</span><span class="p">],</span> <span class="n">lst</span><span class="p">)</span>
<span class="gp">In [9]: </span><span class="n">grouped</span> <span class="o">=</span> <span class="n">s</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="gp">In [10]: </span><span class="n">grouped</span><span class="o">.</span><span class="n">first</span><span class="p">()</span>
<span class="gr">Out[10]: </span>
<span class="go">1 1</span>
<span class="go">2 2</span>
<span class="go">3 3</span>
<span class="go">dtype: int64</span>
<span class="gp">In [11]: </span><span class="n">grouped</span><span class="o">.</span><span class="n">last</span><span class="p">()</span>
<span class="gr">Out[11]: </span>
<span class="go">1 10</span>
<span class="go">2 20</span>
<span class="go">3 30</span>
<span class="go">dtype: int64</span>
<span class="gp">In [12]: </span><span class="n">grouped</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
<span class="gr">Out[12]: </span>
<span class="go">1 11</span>
<span class="go">2 22</span>
<span class="go">3 33</span>
<span class="go">dtype: int64</span>
</pre></div>
</div>
<p><span class="yiyi-st" id="yiyi-120">Note that <strong>no splitting occurs</strong> until it is needed.</span> <span class="yiyi-st" id="yiyi-121">Creating the GroupBy object only verifies that you have passed a valid mapping.</span></p>
<div class="admonition note">
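<p class="first admonition-title"><span class="yiyi-st" id="yiyi-122">Note</span></p>
<p class="last"><span class="yiyi-st" id="yiyi-123">Many kinds of complicated data manipulations can be expressed in terms of GroupBy operations (though they are not guaranteed to be the most efficient).</span> <span class="yiyi-st" id="yiyi-124">You can get quite creative with the label mapping functions.</span></p>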
</div>
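<p>As one illustration of a label mapping function, a date-indexed Series can be grouped by calendar month. The dates and values below are invented; this is only a sketch of the idea.</p>

```python
import pandas as pd

# Hypothetical daily data spanning the end of January and the start of February.
s = pd.Series(range(6), index=pd.date_range("2000-01-30", periods=6, freq="D"))

# The function receives each index label (a timestamp) and returns its group name.
monthly = s.groupby(lambda ts: ts.month).sum()
```

<p>The result has one entry per month number: the two January values and the four February values are summed separately.</p>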
<div class="section" id="groupby-sorting">
<span id="id1"></span><h3><span class="yiyi-st" id="yiyi-125">GroupBy sorting</span></h3>
<p><span class="yiyi-st" id="yiyi-126">By default the group keys are sorted during the <code class="docutils literal"><span class="pre">groupby</span></code> operation.</span> <span class="yiyi-st" id="yiyi-127">You may, however, pass <code class="docutils literal"><span class="pre">sort=False</span></code> for potential speedups:</span></p>
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [13]: </span><span class="n">df2</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s1">'X'</span> <span class="p">:</span> <span class="p">[</span><span class="s1">'B'</span><span class="p">,</span> <span class="s1">'B'</span><span class="p">,</span> <span class="s1">'A'</span><span class="p">,</span> <span class="s1">'A'</span><span class="p">],</span> <span class="s1">'Y'</span> <span class="p">:</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">]})</span>
<span class="gp">In [14]: </span><span class="n">df2</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'X'</span><span class="p">])</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
<span class="gr">Out[14]: </span>
<span class="go"> Y</span>
<span class="go">X </span>
<span class="go">A 7</span>
<span class="go">B 3</span>
<span class="gp">In [15]: </span><span class="n">df2</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'X'</span><span class="p">],</span> <span class="n">sort</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
<span class="gr">Out[15]: </span>
<span class="go"> Y</span>
<span class="go">X </span>
<span class="go">B 3</span>
<span class="go">A 7</span>
</pre></div>
</div>
<p><span class="yiyi-st" id="yiyi-128">Note that <code class="docutils literal"><span class="pre">groupby</span></code> will preserve the order in which <em>observations</em> are sorted <em>within</em> each group. </span><span class="yiyi-st" id="yiyi-129">For example, the groups created by <code class="docutils literal"><span class="pre">groupby()</span></code> below are in the order they appeared in the original <code class="docutils literal"><span class="pre">DataFrame</span></code>:</span></p>
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [16]: </span><span class="n">df3</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s1">'X'</span> <span class="p">:</span> <span class="p">[</span><span class="s1">'A'</span><span class="p">,</span> <span class="s1">'B'</span><span class="p">,</span> <span class="s1">'A'</span><span class="p">,</span> <span class="s1">'B'</span><span class="p">],</span> <span class="s1">'Y'</span> <span class="p">:</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">2</span><span class="p">]})</span>
<span class="gp">In [17]: </span><span class="n">df3</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'X'</span><span class="p">])</span><span class="o">.</span><span class="n">get_group</span><span class="p">(</span><span class="s1">'A'</span><span class="p">)</span>
<span class="gr">Out[17]: </span>
<span class="go"> X Y</span>
<span class="go">0 A 1</span>
<span class="go">2 A 3</span>
<span class="gp">In [18]: </span><span class="n">df3</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'X'</span><span class="p">])</span><span class="o">.</span><span class="n">get_group</span><span class="p">(</span><span class="s1">'B'</span><span class="p">)</span>
<span class="gr">Out[18]: </span>
<span class="go"> X Y</span>
<span class="go">1 B 4</span>
<span class="go">3 B 2</span>
</pre></div>
</div>
</div>
<div class="section" id="groupby-object-attributes">
<span id="groupby-attributes"></span><h3><span class="yiyi-st" id="yiyi-130">GroupBy object attributes</span></h3>
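<p><span class="yiyi-st" id="yiyi-131">The <code class="docutils literal"><span class="pre">groups</span></code> attribute is a dict whose keys are the computed unique groups and whose values are the axis labels belonging to each group.</span> <span class="yiyi-st" id="yiyi-132">In the above example we have:</span></p>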
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [19]: </span><span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">'A'</span><span class="p">)</span><span class="o">.</span><span class="n">groups</span>
<span class="gr">Out[19]: </span>
<span class="go">{'bar': Int64Index([1, 3, 5], dtype='int64'),</span>
<span class="go"> 'foo': Int64Index([0, 2, 4, 6, 7], dtype='int64')}</span>
<span class="gp">In [20]: </span><span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">get_letter_type</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="n">groups</span>
<span class="gr">Out[20]: </span>
<span class="go">{'consonant': Index([u'B', u'C', u'D'], dtype='object'),</span>
<span class="go"> 'vowel': Index([u'A'], dtype='object')}</span>
</pre></div>
</div>
<p><span class="yiyi-st" id="yiyi-133">Calling the standard Python <code class="docutils literal"><span class="pre">len</span></code> function on the GroupBy object just returns the length of the <code class="docutils literal"><span class="pre">groups</span></code> dict, so it is largely just a convenience:</span></p>
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [21]: </span><span class="n">grouped</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'A'</span><span class="p">,</span> <span class="s1">'B'</span><span class="p">])</span>
<span class="gp">In [22]: </span><span class="n">grouped</span><span class="o">.</span><span class="n">groups</span>
<span class="gr">Out[22]: </span>
<span class="go">{('bar', 'one'): Int64Index([1], dtype='int64'),</span>
<span class="go"> ('bar', 'three'): Int64Index([3], dtype='int64'),</span>
<span class="go"> ('bar', 'two'): Int64Index([5], dtype='int64'),</span>
<span class="go"> ('foo', 'one'): Int64Index([0, 6], dtype='int64'),</span>
<span class="go"> ('foo', 'three'): Int64Index([7], dtype='int64'),</span>
<span class="go"> ('foo', 'two'): Int64Index([2, 4], dtype='int64')}</span>
<span class="gp">In [23]: </span><span class="nb">len</span><span class="p">(</span><span class="n">grouped</span><span class="p">)</span>
<span class="gr">Out[23]: </span><span class="mi">6</span>
</pre></div>
</div>
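<p>The relationship between <code class="docutils literal"><span class="pre">groups</span></code>, <code class="docutils literal"><span class="pre">len</span></code> and the per-group row counts can be checked directly. The sketch below rebuilds only the <code class="docutils literal"><span class="pre">A</span></code> and <code class="docutils literal"><span class="pre">B</span></code> columns of the example frame; the use of <code class="docutils literal"><span class="pre">size()</span></code> is our addition.</p>

```python
import pandas as pd

# The key columns of the example DataFrame (numeric columns omitted).
df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
})
grouped = df.groupby(["A", "B"])

n_groups = len(grouped)  # the number of distinct (A, B) pairs
labels = {key: list(idx) for key, idx in grouped.groups.items()}
sizes = grouped.size()   # number of rows in each group
```

<p>Each value in <code class="docutils literal"><span class="pre">labels</span></code> lists the row labels of one group, so its length matches the corresponding entry of <code class="docutils literal"><span class="pre">sizes</span></code>.</p>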
<p id="groupby-tabcompletion"><span class="yiyi-st" id="yiyi-134"><code class="docutils literal"><span class="pre">GroupBy</span></code> will tab complete column names (and other attributes):</span></p>
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [24]: </span><span class="n">df</span>
<span class="gr">Out[24]: </span>
<span class="go"> gender height weight</span>
<span class="go">2000-01-01 male 42.849980 157.500553</span>
<span class="go">2000-01-02 male 49.607315 177.340407</span>
<span class="go">2000-01-03 male 56.293531 171.524640</span>
<span class="go">2000-01-04 female 48.421077 144.251986</span>
<span class="go">2000-01-05 male 46.556882 152.526206</span>
<span class="go">2000-01-06 female 68.448851 168.272968</span>
<span class="go">2000-01-07 male 70.757698 136.431469</span>
<span class="go">2000-01-08 female 58.909500 176.499753</span>
<span class="go">2000-01-09 female 76.435631 174.094104</span>
<span class="go">2000-01-10 male 45.306120 177.540920</span>
<span class="gp">In [25]: </span><span class="n">gb</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">'gender'</span><span class="p">)</span>
</pre></div>
</div>
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [26]: </span><span class="n">gb</span><span class="o">.<</span><span class="n">TAB</span><span class="o">></span>
<span class="go">gb.agg gb.boxplot gb.cummin gb.describe gb.filter gb.get_group gb.height gb.last gb.median gb.ngroups gb.plot gb.rank gb.std gb.transform</span>
<span class="go">gb.aggregate gb.count gb.cumprod gb.dtype gb.first gb.groups gb.hist gb.max gb.min gb.nth gb.prod gb.resample gb.sum gb.var</span>
<span class="go">gb.apply gb.cummax gb.cumsum gb.fillna gb.gender gb.head gb.indices gb.mean gb.name gb.ohlc gb.quantile gb.size gb.tail gb.weight</span>
</pre></div>
</div>
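<p>The column names that show up in tab completion can also be used as attributes. A minimal sketch, using a small invented frame in place of the one above:</p>

```python
import pandas as pd

# Hypothetical data; only the column names mirror the example above.
df = pd.DataFrame({
    "gender": ["male", "female", "male", "female"],
    "height": [70.0, 64.0, 72.0, 66.0],
})
gb = df.groupby("gender")

by_attribute = gb.height.mean()    # attribute access to the column
by_brackets = gb["height"].mean()  # equivalent [] selection
```

<p>Attribute access only works for column names that are valid Python identifiers and do not clash with existing GroupBy attributes, so the bracket form is the safer choice in library code.</p>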
</div>
<div class="section" id="groupby-with-multiindex">
<span id="groupby-multiindex"></span><h3><span class="yiyi-st" id="yiyi-135">GroupBy with MultiIndex</span></h3>
<p><span class="yiyi-st" id="yiyi-136">With <a class="reference internal" href="advanced.html#advanced-hierarchical"><span class="std std-ref">hierarchically-indexed data</span></a>, it is quite natural to group by one of the levels of the hierarchy.</span></p>
<p><span class="yiyi-st" id="yiyi-137">Let's create a Series with a two-level <code class="docutils literal"><span class="pre">MultiIndex</span></code>.</span></p>
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [27]: </span><span class="n">arrays</span> <span class="o">=</span> <span class="p">[[</span><span class="s1">'bar'</span><span class="p">,</span> <span class="s1">'bar'</span><span class="p">,</span> <span class="s1">'baz'</span><span class="p">,</span> <span class="s1">'baz'</span><span class="p">,</span> <span class="s1">'foo'</span><span class="p">,</span> <span class="s1">'foo'</span><span class="p">,</span> <span class="s1">'qux'</span><span class="p">,</span> <span class="s1">'qux'</span><span class="p">],</span>
<span class="gp"> ....:</span> <span class="p">[</span><span class="s1">'one'</span><span class="p">,</span> <span class="s1">'two'</span><span class="p">,</span> <span class="s1">'one'</span><span class="p">,</span> <span class="s1">'two'</span><span class="p">,</span> <span class="s1">'one'</span><span class="p">,</span> <span class="s1">'two'</span><span class="p">,</span> <span class="s1">'one'</span><span class="p">,</span> <span class="s1">'two'</span><span class="p">]]</span>
<span class="gp"> ....:</span>
<span class="gp">In [28]: </span><span class="n">index</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">MultiIndex</span><span class="o">.</span><span class="n">from_arrays</span><span class="p">(</span><span class="n">arrays</span><span class="p">,</span> <span class="n">names</span><span class="o">=</span><span class="p">[</span><span class="s1">'first'</span><span class="p">,</span> <span class="s1">'second'</span><span class="p">])</span>
<span class="gp">In [29]: </span><span class="n">s</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">8</span><span class="p">),</span> <span class="n">index</span><span class="o">=</span><span class="n">index</span><span class="p">)</span>
<span class="gp">In [30]: </span><span class="n">s</span>
<span class="gr">Out[30]: </span>
<span class="go">first second</span>
<span class="go">bar one -0.575247</span>
<span class="go"> two 0.254161</span>
<span class="go">baz one -1.143704</span>
<span class="go"> two 0.215897</span>
<span class="go">foo one 1.193555</span>
<span class="go"> two -0.077118</span>
<span class="go">qux one -0.408530</span>
<span class="go"> two -0.862495</span>
<span class="go">dtype: float64</span>
</pre></div>
</div>
<p><span class="yiyi-st" id="yiyi-138">We can then group by one of the levels in <code class="docutils literal"><span class="pre">s</span></code>.</span></p>
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [31]: </span><span class="n">grouped</span> <span class="o">=</span> <span class="n">s</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="gp">In [32]: </span><span class="n">grouped</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
<span class="gr">Out[32]: </span>
<span class="go">first</span>
<span class="go">bar -0.321085</span>
<span class="go">baz -0.927807</span>
<span class="go">foo 1.116437</span>
<span class="go">qux -1.271025</span>
<span class="go">dtype: float64</span>
</pre></div>
</div>
<p><span class="yiyi-st" id="yiyi-139">If the MultiIndex has names specified, these can be passed instead of the level number:</span></p>
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [33]: </span><span class="n">s</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="s1">'second'</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
<span class="gr">Out[33]: </span>
<span class="go">second</span>
<span class="go">one -0.933926</span>
<span class="go">two -0.469555</span>
<span class="go">dtype: float64</span>
</pre></div>
</div>
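<p><span class="yiyi-st" id="yiyi-140">The aggregation functions such as <code class="docutils literal"><span class="pre">sum</span></code> will take the level parameter directly.</span> <span class="yiyi-st" id="yiyi-141">Additionally, the resulting index will be named according to the chosen level:</span></p>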
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [34]: </span><span class="n">s</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="s1">'second'</span><span class="p">)</span>
<span class="gr">Out[34]: </span>
<span class="go">second</span>
<span class="go">one -0.933926</span>
<span class="go">two -0.469555</span>
<span class="go">dtype: float64</span>
</pre></div>
</div>
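<p>As far as we know, the <code class="docutils literal"><span class="pre">level</span></code> argument of reductions such as <code class="docutils literal"><span class="pre">sum</span></code> was removed in pandas 2.0, whereas the <code class="docutils literal"><span class="pre">groupby(level=...)</span></code> spelling works in both old and new versions. A sketch with made-up values:</p>

```python
import pandas as pd

# Hypothetical two-level Series with fixed values instead of random ones.
arrays = [["bar", "bar", "baz", "baz"], ["one", "two", "one", "two"]]
index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=index)

by_second = s.groupby(level="second").sum()  # select the level by name
by_position = s.groupby(level=1).sum()       # select the same level by number
```

<p>Both results are indexed by the values of the chosen level, and the index carries that level's name.</p>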
<p><span class="yiyi-st" id="yiyi-142">Also as of v0.6, grouping with multiple levels is supported.</span></p>
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [35]: </span><span class="n">s</span>
<span class="gr">Out[35]: </span>
<span class="go">first second third</span>
<span class="go">bar doo one 1.346061</span>
<span class="go"> two 1.511763</span>
<span class="go">baz bee one 1.627081</span>
<span class="go"> two -0.990582</span>
<span class="go">foo bop one -0.441652</span>
<span class="go"> two 1.211526</span>
<span class="go">qux bop one 0.268520</span>
<span class="go"> two 0.024580</span>
<span class="go">dtype: float64</span>
<span class="gp">In [36]: </span><span class="n">s</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="p">[</span><span class="s1">'first'</span><span class="p">,</span> <span class="s1">'second'</span><span class="p">])</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
<span class="gr">Out[36]: </span>
<span class="go">first second</span>
<span class="go">bar doo 2.857824</span>
<span class="go">baz bee 0.636499</span>
<span class="go">foo bop 0.769873</span>
<span class="go">qux bop 0.293100</span>
<span class="go">dtype: float64</span>
</pre></div>
</div>
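<p>As a minimal, self-contained sketch of grouping by index levels, the following builds a small Series with a named three-level MultiIndex. The index labels and the values are made up for illustration and are not the data shown above.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical three-level index; labels and values are illustrative only.
arrays = [
    ['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
    ['doo', 'doo', 'bee', 'bee', 'bop', 'bop', 'bop', 'bop'],
    ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'],
]
index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second', 'third'])
s = pd.Series(np.arange(8, dtype=float), index=index)

# Group by one level, referenced by name or by position.
by_name = s.groupby(level='second').sum()
by_number = s.groupby(level=1).sum()

# Group by several levels at once; the result keeps a two-level index.
by_two = s.groupby(level=['first', 'second']).sum()
print(by_two)
```

<p>Referring to levels by name tends to be more robust than using positions, since the code keeps working if the order of the levels changes.</p>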
<p><span class="yiyi-st" id="yiyi-143">More on the <code class="docutils literal"><span class="pre">sum</span></code> function and aggregation later.</span></p>
</div>
<div class="section" id="dataframe-column-selection-in-groupby">
<h3><span class="yiyi-st" id="yiyi-144">DataFrame column selection in GroupBy</span></h3>
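<p><span class="yiyi-st" id="yiyi-145">Once you have created the GroupBy object from a DataFrame, for example, you might want to do something different for each of the columns.</span> <span class="yiyi-st" id="yiyi-146">Thus, using <code class="docutils literal"><span class="pre">[]</span></code> in the same way you would get a column from a DataFrame, you can do:</span></p>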
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [37]: </span><span class="n">grouped</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'A'</span><span class="p">])</span>
<span class="gp">In [38]: </span><span class="n">grouped_C</span> <span class="o">=</span> <span class="n">grouped</span><span class="p">[</span><span class="s1">'C'</span><span class="p">]</span>
<span class="gp">In [39]: </span><span class="n">grouped_D</span> <span class="o">=</span> <span class="n">grouped</span><span class="p">[</span><span class="s1">'D'</span><span class="p">]</span>
</pre></div>
</div>
<p><span class="yiyi-st" id="yiyi-147">This is mainly syntactic sugar for the alternative, which is much more verbose:</span></p>
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [40]: </span><span class="n">df</span><span class="p">[</span><span class="s1">'C'</span><span class="p">]</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="s1">'A'</span><span class="p">])</span>
<span class="gr">Out[40]: </span><span class="o"><</span><span class="n">pandas</span><span class="o">.</span><span class="n">core</span><span class="o">.</span><span class="n">groupby</span><span class="o">.</span><span class="n">SeriesGroupBy</span> <span class="nb">object</span> <span class="n">at</span> <span class="mh">0x7ff26f58b810</span><span class="o">></span>
</pre></div>
</div>
<p><span class="yiyi-st" id="yiyi-148">Additionally, this method avoids recomputing the internal grouping information derived from the passed key.</span></p>
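<p>A small sketch of the equivalence between the two spellings, using a made-up DataFrame with the same column names as <code class="docutils literal"><span class="pre">df</span></code> (the values are assumptions chosen so the sums are easy to check):</p>

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same column layout as df; values are illustrative.
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': np.arange(8, dtype=float),
    'D': np.arange(8, dtype=float) * 10,
})

grouped = df.groupby('A')

# Selecting a column from the GroupBy object ...
short_form = grouped['C'].sum()
# ... gives the same result as grouping the column by the key Series.
verbose_form = df['C'].groupby(df['A']).sum()

print(short_form)
```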
</div>
</div>
<div class="section" id="iterating-through-groups">
<span id="groupby-iterating"></span><h2><span class="yiyi-st" id="yiyi-149">Iterating through groups</span></h2>
<p><span class="yiyi-st" id="yiyi-150">With the GroupBy object in hand, iterating through the grouped data is very natural and works similarly to <code class="docutils literal"><span class="pre">itertools.groupby</span></code>:</span></p>
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [41]: </span><span class="n">grouped</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">'A'</span><span class="p">)</span>
<span class="gp">In [42]: </span><span class="k">for</span> <span class="n">name</span><span class="p">,</span> <span class="n">group</span> <span class="ow">in</span> <span class="n">grouped</span><span class="p">:</span>
<span class="gp"> ....:</span> <span class="k">print</span><span class="p">(</span><span class="n">name</span><span class="p">)</span>
<span class="gp"> ....:</span> <span class="k">print</span><span class="p">(</span><span class="n">group</span><span class="p">)</span>
<span class="gp"> ....:</span>
<span class="go">bar</span>
<span class="go"> A B C D</span>
<span class="go">1 bar one -0.042379 -0.089329</span>
<span class="go">3 bar three -0.009920 -0.945867</span>
<span class="go">5 bar two 0.495767 1.956030</span>
<span class="go">foo</span>
<span class="go"> A B C D</span>
<span class="go">0 foo one -0.919854 -1.131345</span>
<span class="go">2 foo two 1.247642 0.337863</span>
<span class="go">4 foo two 0.290213 -0.932132</span>
<span class="go">6 foo one 0.362949 0.017587</span>
<span class="go">7 foo three 1.548106 -0.016692</span>
</pre></div>
</div>
<p><span class="yiyi-st" id="yiyi-151">When grouping by multiple keys, the group name is a tuple:</span></p>
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [43]: </span><span class="k">for</span> <span class="n">name</span><span class="p">,</span> <span class="n">group</span> <span class="ow">in</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'A'</span><span class="p">,</span> <span class="s1">'B'</span><span class="p">]):</span>
<span class="gp"> ....:</span> <span class="k">print</span><span class="p">(</span><span class="n">name</span><span class="p">)</span>
<span class="gp"> ....:</span> <span class="k">print</span><span class="p">(</span><span class="n">group</span><span class="p">)</span>
<span class="gp"> ....:</span>
<span class="go">('bar', 'one')</span>
<span class="go"> A B C D</span>
<span class="go">1 bar one -0.042379 -0.089329</span>
<span class="go">('bar', 'three')</span>
<span class="go"> A B C D</span>
<span class="go">3 bar three -0.00992 -0.945867</span>
<span class="go">('bar', 'two')</span>
<span class="go"> A B C D</span>
<span class="go">5 bar two 0.495767 1.95603</span>
<span class="go">('foo', 'one')</span>
<span class="go"> A B C D</span>
<span class="go">0 foo one -0.919854 -1.131345</span>
<span class="go">6 foo one 0.362949 0.017587</span>
<span class="go">('foo', 'three')</span>
<span class="go"> A B C D</span>
<span class="go">7 foo three 1.548106 -0.016692</span>
<span class="go">('foo', 'two')</span>
<span class="go"> A B C D</span>
<span class="go">2 foo two 1.247642 0.337863</span>
<span class="go">4 foo two 0.290213 -0.932132</span>
</pre></div>
</div>
<p><span class="yiyi-st" id="yiyi-152">It's standard Python-fu, but remember that you can unpack the tuple in the for loop statement if you wish: <code class="docutils literal"><span class="pre">for</span> <span class="pre">(k1,</span> <span class="pre">k2),</span> <span class="pre">group</span> <span class="pre">in</span> <span class="pre">grouped:</span></code>.</span></p>
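<p>For instance, the tuple unpacking might look like the sketch below. The DataFrame and the <code class="docutils literal"><span class="pre">sizes</span></code> dictionary are hypothetical and exist only to make the loop runnable.</p>

```python
import pandas as pd

# Illustrative data; not the df used elsewhere on this page.
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'B': ['one', 'one', 'two', 'one'],
    'C': [1.0, 2.0, 3.0, 4.0],
})

# With two grouping keys, each group name is a (k1, k2) tuple,
# which can be unpacked directly in the for statement.
sizes = {}
for (k1, k2), group in df.groupby(['A', 'B']):
    sizes[(k1, k2)] = len(group)
    print(k1, k2, len(group))
```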
</div>
<div class="section" id="selecting-a-group">
<h2><span class="yiyi-st" id="yiyi-153">Selecting a group</span></h2>
<p><span class="yiyi-st" id="yiyi-154">A single group can be selected using <code class="docutils literal"><span class="pre">GroupBy.get_group()</span></code>:</span></p>
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [44]: </span><span class="n">grouped</span><span class="o">.</span><span class="n">get_group</span><span class="p">(</span><span class="s1">'bar'</span><span class="p">)</span>
<span class="gr">Out[44]: </span>
<span class="go"> A B C D</span>
<span class="go">1 bar one -0.042379 -0.089329</span>
<span class="go">3 bar three -0.009920 -0.945867</span>
<span class="go">5 bar two 0.495767 1.956030</span>
</pre></div>
</div>
<p><span class="yiyi-st" id="yiyi-155">Or for an object grouped on multiple columns:</span></p>
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [45]: </span><span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'A'</span><span class="p">,</span> <span class="s1">'B'</span><span class="p">])</span><span class="o">.</span><span class="n">get_group</span><span class="p">((</span><span class="s1">'bar'</span><span class="p">,</span> <span class="s1">'one'</span><span class="p">))</span>
<span class="gr">Out[45]: </span>
<span class="go"> A B C D</span>
<span class="go">1 bar one -0.042379 -0.089329</span>
</pre></div>
</div>
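<p>The two calling conventions can be put side by side in a short sketch. The data below is made up; the variable names <code class="docutils literal"><span class="pre">bar_rows</span></code> and <code class="docutils literal"><span class="pre">bar_one_rows</span></code> are illustrative.</p>

```python
import pandas as pd

# Illustrative data; not the df used elsewhere on this page.
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'B': ['one', 'one', 'two', 'three'],
    'C': [1.0, 2.0, 3.0, 4.0],
})

# Single key: pass the group name itself.
bar_rows = df.groupby('A').get_group('bar')

# Multiple keys: pass a tuple with one entry per key.
bar_one_rows = df.groupby(['A', 'B']).get_group(('bar', 'one'))

print(bar_rows)
print(bar_one_rows)
```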
</div>
<div class="section" id="aggregation">
<span id="groupby-aggregate"></span><h2><span class="yiyi-st" id="yiyi-156">Aggregation</span></h2>
<p><span class="yiyi-st" id="yiyi-157">Once the GroupBy object has been created, several methods are available to perform a computation on the grouped data.</span></p>
<p><span class="yiyi-st" id="yiyi-158">An obvious one is aggregation via the <code class="docutils literal"><span class="pre">aggregate</span></code> method or, equivalently, the <code class="docutils literal"><span class="pre">agg</span></code> method:</span></p>
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [46]: </span><span class="n">grouped</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">'A'</span><span class="p">)</span>
<span class="gp">In [47]: </span><span class="n">grouped</span><span class="o">.</span><span class="n">aggregate</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">sum</span><span class="p">)</span>
<span class="gr">Out[47]: </span>
<span class="go"> C D</span>
<span class="go">A </span>
<span class="go">bar 0.443469 0.920834</span>
<span class="go">foo 2.529056 -1.724719</span>
<span class="gp">In [48]: </span><span class="n">grouped</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'A'</span><span class="p">,</span> <span class="s1">'B'</span><span class="p">])</span>
<span class="gp">In [49]: </span><span class="n">grouped</span><span class="o">.</span><span class="n">aggregate</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">sum</span><span class="p">)</span>
<span class="gr">Out[49]: </span>
<span class="go"> C D</span>
<span class="go">A B </span>
<span class="go">bar one -0.042379 -0.089329</span>
<span class="go"> three -0.009920 -0.945867</span>
<span class="go"> two 0.495767 1.956030</span>
<span class="go">foo one -0.556905 -1.113758</span>
<span class="go"> three 1.548106 -0.016692</span>
<span class="go"> two 1.537855 -0.594269</span>
</pre></div>
</div>
<p><span class="yiyi-st" id="yiyi-159">As you can see, the result of the aggregation has the group names as the new index along the grouped axis.</span> <span class="yiyi-st" id="yiyi-160">With multiple keys, the result is a <a class="reference internal" href="advanced.html#advanced-hierarchical"><span class="std std-ref">MultiIndex</span></a> by default, though this can be changed with the <code class="docutils literal"><span class="pre">as_index</span></code> option:</span></p>
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [50]: </span><span class="n">grouped</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'A'</span><span class="p">,</span> <span class="s1">'B'</span><span class="p">],</span> <span class="n">as_index</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="gp">In [51]: </span><span class="n">grouped</span><span class="o">.</span><span class="n">aggregate</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">sum</span><span class="p">)</span>
<span class="gr">Out[51]: </span>
<span class="go"> A B C D</span>
<span class="go">0 bar one -0.042379 -0.089329</span>
<span class="go">1 bar three -0.009920 -0.945867</span>
<span class="go">2 bar two 0.495767 1.956030</span>
<span class="go">3 foo one -0.556905 -1.113758</span>
<span class="go">4 foo three 1.548106 -0.016692</span>
<span class="go">5 foo two 1.537855 -0.594269</span>
<span class="gp">In [52]: </span><span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">'A'</span><span class="p">,</span> <span class="n">as_index</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
<span class="gr">Out[52]: </span>
<span class="go"> A C D</span>
<span class="go">0 bar 0.443469 0.920834</span>
<span class="go">1 foo 2.529056 -1.724719</span>
</pre></div>
</div>
<p><span class="yiyi-st" id="yiyi-161">Note that you could use the <code class="docutils literal"><span class="pre">reset_index</span></code> DataFrame function to achieve the same result, since the column names are stored in the resulting <code class="docutils literal"><span class="pre">MultiIndex</span></code>:</span></p>
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [53]: </span><span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'A'</span><span class="p">,</span> <span class="s1">'B'</span><span class="p">])</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span><span class="o">.</span><span class="n">reset_index</span><span class="p">()</span>
<span class="gr">Out[53]: </span>
<span class="go"> A B C D</span>
<span class="go">0 bar one -0.042379 -0.089329</span>
<span class="go">1 bar three -0.009920 -0.945867</span>
<span class="go">2 bar two 0.495767 1.956030</span>
<span class="go">3 foo one -0.556905 -1.113758</span>
<span class="go">4 foo three 1.548106 -0.016692</span>
<span class="go">5 foo two 1.537855 -0.594269</span>
</pre></div>
</div>
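<p>The sketch below compares the two routes on a small made-up DataFrame. String column names are used here only to keep the example short; the frame and its values are assumptions.</p>

```python
import pandas as pd

# Illustrative data; not the df used elsewhere on this page.
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'B': ['one', 'one', 'one', 'two'],
    'C': [1.0, 2.0, 3.0, 4.0],
    'D': [10.0, 20.0, 30.0, 40.0],
})

# Keys stay as ordinary columns when as_index=False ...
flat = df.groupby(['A', 'B'], as_index=False).sum()
# ... which matches moving the MultiIndex levels back into columns.
via_reset = df.groupby(['A', 'B']).sum().reset_index()

print(flat)
```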
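<p><span class="yiyi-st" id="yiyi-162">Another simple aggregation example is to compute the size of each group.</span> <span class="yiyi-st" id="yiyi-163">This is included in GroupBy as the <code class="docutils literal"><span class="pre">size</span></code> method.</span> <span class="yiyi-st" id="yiyi-164">It returns a Series whose index holds the group names and whose values are the sizes of each group.</span></p>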
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [54]: </span><span class="n">grouped</span><span class="o">.</span><span class="n">size</span><span class="p">()</span>
<span class="gr">Out[54]: </span>
<span class="go">A B </span>
<span class="go">bar one 1</span>
<span class="go"> three 1</span>
<span class="go"> two 1</span>
<span class="go">foo one 2</span>
<span class="go"> three 1</span>
<span class="go"> two 2</span>
<span class="go">dtype: int64</span>
</pre></div>
</div>
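<p>A short sketch of <code class="docutils literal"><span class="pre">size</span></code> on made-up data with missing values. The comparison with <code class="docutils literal"><span class="pre">count</span></code> is an addition for contrast, not part of the example above.</p>

```python
import numpy as np
import pandas as pd

# Illustrative data containing missing values.
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo'],
    'C': [1.0, np.nan, 3.0, 4.0, np.nan],
})

grouped = df.groupby('A')

# size() counts every row in the group, including rows holding NaN.
sizes = grouped.size()
# count() is shown for contrast: it counts only non-missing values.
counts = grouped['C'].count()

print(sizes)
print(counts)
```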
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [55]: </span><span class="n">grouped</span><span class="o">.</span><span class="n">describe</span><span class="p">()</span>
<span class="gr">Out[55]: </span>
<span class="go"> C D</span>
<span class="go">0 count 1.000000 1.000000</span>
<span class="go"> mean -0.042379 -0.089329</span>
<span class="go"> std NaN NaN</span>
<span class="go"> min -0.042379 -0.089329</span>
<span class="go"> 25% -0.042379 -0.089329</span>
<span class="go"> 50% -0.042379 -0.089329</span>
<span class="go"> 75% -0.042379 -0.089329</span>
<span class="go">... ... ...</span>
<span class="go">5 mean 0.768928 -0.297134</span>
<span class="go"> std 0.677005 0.898022</span>
<span class="go"> min 0.290213 -0.932132</span>
<span class="go"> 25% 0.529570 -0.614633</span>
<span class="go"> 50% 0.768928 -0.297134</span>
<span class="go"> 75% 1.008285 0.020364</span>
<span class="go"> max 1.247642 0.337863</span>
<span class="go">[48 rows x 2 columns]</span>
</pre></div>
</div>
<div class="admonition note">
<p class="first admonition-title"><span class="yiyi-st" id="yiyi-165">Note</span></p>
<p><span class="yiyi-st" id="yiyi-166">With <code class="docutils literal"><span class="pre">as_index=True</span></code> (the default), aggregation functions <strong>will not</strong> return the groups that you are aggregating over if they are named <em>columns</em>.</span> <span class="yiyi-st" id="yiyi-167">The grouped columns will be the <strong>indices</strong> of the returned object.</span></p>
<p><span class="yiyi-st" id="yiyi-168">Passing <code class="docutils literal"><span class="pre">as_index=False</span></code> <strong>will</strong> return the groups that you are aggregating over, if they are named <em>columns</em>.</span></p>
<p><span class="yiyi-st" id="yiyi-169">Aggregating functions are those that reduce the dimension of the returned object, for example: <code class="docutils literal"><span class="pre">mean,</span> <span class="pre">sum,</span> <span class="pre">size,</span> <span class="pre">count,</span> <span class="pre">std,</span> <span class="pre">var,</span> <span class="pre">sem,</span> <span class="pre">describe,</span> <span class="pre">first,</span> <span class="pre">last,</span> <span class="pre">nth,</span> <span class="pre">min,</span> <span class="pre">max</span></code>.</span> <span class="yiyi-st" id="yiyi-170">This is what happens when you call, for example, <code class="docutils literal"><span class="pre">DataFrame.sum()</span></code> and get back a <code class="docutils literal"><span class="pre">Series</span></code>.</span></p>
<p class="last"><span class="yiyi-st" id="yiyi-171"><code class="docutils literal"><span class="pre">nth</span></code> can act as a reducer <em>or</em> a filter, see <a class="reference internal" href="#groupby-nth"><span class="std std-ref">here</span></a>.</span></p>
</div>
<div class="section" id="applying-multiple-functions-at-once">
<span id="groupby-aggregate-multifunc"></span><h3><span class="yiyi-st" id="yiyi-172">Applying multiple functions at once</span></h3>
<p><span class="yiyi-st" id="yiyi-173">With a grouped Series you can also pass a list or dict of functions to aggregate with, which outputs a DataFrame:</span></p>
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [56]: </span><span class="n">grouped</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">'A'</span><span class="p">)</span>
<span class="gp">In [57]: </span><span class="n">grouped</span><span class="p">[</span><span class="s1">'C'</span><span class="p">]</span><span class="o">.</span><span class="n">agg</span><span class="p">([</span><span class="n">np</span><span class="o">.</span><span class="n">sum</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">std</span><span class="p">])</span>
<span class="gr">Out[57]: </span>
<span class="go"> sum mean std</span>
<span class="go">A </span>
<span class="go">bar 0.443469 0.147823 0.301765</span>
<span class="go">foo 2.529056 0.505811 0.966450</span>
</pre></div>
</div>
<p><span class="yiyi-st" id="yiyi-174">If a dict is passed, the keys are used to name the columns.</span> <span class="yiyi-st" id="yiyi-175">Otherwise the function's name (stored in the function object) is used.</span></p>
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [58]: </span><span class="n">grouped</span><span class="p">[</span><span class="s1">'D'</span><span class="p">]</span><span class="o">.</span><span class="n">agg</span><span class="p">({</span><span class="s1">'result1'</span> <span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">sum</span><span class="p">,</span>
<span class="gp"> ....:</span> <span class="s1">'result2'</span> <span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">})</span>
<span class="gp"> ....:</span>
<span class="gr">Out[58]: </span>
<span class="go"> result2 result1</span>
<span class="go">A </span>
<span class="go">bar 0.306945 0.920834</span>
<span class="go">foo -0.344944 -1.724719</span>
</pre></div>
</div>
<p><span class="yiyi-st" id="yiyi-176">On a grouped DataFrame, you can pass a list of functions to apply to each column, which produces an aggregated result with a hierarchical index:</span></p>
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [59]: </span><span class="n">grouped</span><span class="o">.</span><span class="n">agg</span><span class="p">([</span><span class="n">np</span><span class="o">.</span><span class="n">sum</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">std</span><span class="p">])</span>
<span class="gr">Out[59]: </span>
<span class="go"> C D </span>
<span class="go"> sum mean std sum mean std</span>
<span class="go">A </span>
<span class="go">bar 0.443469 0.147823 0.301765 0.920834 0.306945 1.490982</span>
<span class="go">foo 2.529056 0.505811 0.966450 -1.724719 -0.344944 0.645875</span>
</pre></div>
</div>
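<p>The shape of the two results can be sketched on a small made-up DataFrame. The functions are given as string names here rather than as <code class="docutils literal"><span class="pre">np.sum</span></code> and friends, which is a choice made for this sketch; the data and variable names are assumptions.</p>

```python
import pandas as pd

# Illustrative data; not the df used elsewhere on this page.
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'C': [1.0, 2.0, 3.0, 4.0],
    'D': [10.0, 20.0, 30.0, 60.0],
})
grouped = df.groupby('A')

# A list of functions on one column yields one output column per function.
c_stats = grouped['C'].agg(['sum', 'mean'])

# The same list on the whole GroupBy yields hierarchical columns:
# the outer level is the source column, the inner level the function.
all_stats = grouped.agg(['sum', 'mean'])

print(c_stats)
print(all_stats)
```

<p>A single statistic can then be picked out of the hierarchical result with a tuple, for example <code class="docutils literal"><span class="pre">all_stats.loc['bar',</span> <span class="pre">('D',</span> <span class="pre">'mean')]</span></code>.</p>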
<p><span class="yiyi-st" id="yiyi-177">Passing a dict of functions has different behavior by default, see the next section.</span></p>
</div>
<div class="section" id="applying-different-functions-to-dataframe-columns">
<h3><span class="yiyi-st" id="yiyi-178">Applying different functions to DataFrame columns</span></h3>
<p><span class="yiyi-st" id="yiyi-179">By passing a dict to <code class="docutils literal"><span class="pre">aggregate</span></code> you can apply a different aggregation to each column of a DataFrame:</span></p>
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [60]: </span><span class="n">grouped</span><span class="o">.</span><span class="n">agg</span><span class="p">({</span><span class="s1">'C'</span> <span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">sum</span><span class="p">,</span>
<span class="gp"> ....:</span> <span class="s1">'D'</span> <span class="p">:</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">ddof</span><span class="o">=</span><span class="mi">1</span><span class="p">)})</span>
<span class="gp"> ....:</span>
<span class="gr">Out[60]: </span>
<span class="go"> C D</span>
<span class="go">A </span>
<span class="go">bar 0.443469 1.490982</span>
<span class="go">foo 2.529056 0.645875</span>
</pre></div>
</div>
<p><span class="yiyi-st" id="yiyi-180">The function names can also be strings.</span> <span class="yiyi-st" id="yiyi-181">For a string to be valid, it must be either implemented on GroupBy or available via <a class="reference internal" href="#groupby-dispatch"><span class="std std-ref">dispatching</span></a>:</span></p>
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [61]: </span><span class="n">grouped</span><span class="o">.</span><span class="n">agg</span><span class="p">({</span><span class="s1">'C'</span> <span class="p">:</span> <span class="s1">'sum'</span><span class="p">,</span> <span class="s1">'D'</span> <span class="p">:</span> <span class="s1">'std'</span><span class="p">})</span>
<span class="gr">Out[61]: </span>
<span class="go"> C D</span>
<span class="go">A </span>
<span class="go">bar 0.443469 1.490982</span>
<span class="go">foo 2.529056 0.645875</span>
</pre></div>
</div>
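<p>To make the per-column mapping concrete, the sketch below runs a dict aggregation on made-up data and recomputes each column separately for comparison. The frame and the names <code class="docutils literal"><span class="pre">c_sum</span></code> and <code class="docutils literal"><span class="pre">d_std</span></code> are illustrative.</p>

```python
import pandas as pd

# Illustrative data; not the df used elsewhere on this page.
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'C': [1.0, 2.0, 3.0, 4.0],
    'D': [10.0, 20.0, 30.0, 60.0],
})
grouped = df.groupby('A')

# Each dict key names a column; each value is the aggregation for it.
result = grouped.agg({'C': 'sum', 'D': 'std'})

# The same numbers computed one column at a time, for comparison.
c_sum = grouped['C'].sum()
d_std = grouped['D'].std()

print(result)
```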
<div class="admonition note">
<p class="first admonition-title"><span class="yiyi-st" id="yiyi-182">Note</span></p>
<p class="last"><span class="yiyi-st" id="yiyi-183">If you pass a dict to <code class="docutils literal"><span class="pre">aggregate</span></code>, the ordering of the output columns is non-deterministic.</span> <span class="yiyi-st" id="yiyi-184">If you want to be sure the output columns are in a specific order, you can use an <code class="docutils literal"><span class="pre">OrderedDict</span></code>.</span> <span class="yiyi-st" id="yiyi-185">Compare the output of the following two commands:</span></p>
</div>
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [62]: </span><span class="n">grouped</span><span class="o">.</span><span class="n">agg</span><span class="p">({</span><span class="s1">'D'</span><span class="p">:</span> <span class="s1">'std'</span><span class="p">,</span> <span class="s1">'C'</span><span class="p">:</span> <span class="s1">'mean'</span><span class="p">})</span>
<span class="gr">Out[62]: </span>
<span class="go"> C D</span>
<span class="go">A </span>
<span class="go">bar 0.147823 1.490982</span>
<span class="go">foo 0.505811 0.645875</span>
<span class="gp">In [63]: </span><span class="n">grouped</span><span class="o">.</span><span class="n">agg</span><span class="p">(</span><span class="n">OrderedDict</span><span class="p">([(</span><span class="s1">'D'</span><span class="p">,</span> <span class="s1">'std'</span><span class="p">),</span> <span class="p">(</span><span class="s1">'C'</span><span class="p">,</span> <span class="s1">'mean'</span><span class="p">)]))</span>
<span class="gr">Out[63]: </span>
<span class="go"> D C</span>
<span class="go">A </span>
<span class="go">bar 1.490982 0.147823</span>
<span class="go">foo 0.645875 0.505811</span>
</pre></div>
</div>
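<p>A self-contained version of the ordered call, including the import that the session above relies on, could look like this. The data is made up for illustration.</p>

```python
from collections import OrderedDict

import pandas as pd

# Illustrative data; not the df used elsewhere on this page.
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'C': [1.0, 2.0, 3.0, 4.0],
    'D': [10.0, 20.0, 30.0, 60.0],
})
grouped = df.groupby('A')

# The OrderedDict fixes the column order of the result: D first, then C.
ordered = grouped.agg(OrderedDict([('D', 'std'), ('C', 'mean')]))

print(ordered)
```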
</div>
<div class="section" id="cython-optimized-aggregation-functions">
<span id="groupby-aggregate-cython"></span><h3><span class="yiyi-st" id="yiyi-186">Cython-optimized aggregation functions</span></h3>
<p><span class="yiyi-st" id="yiyi-187">Some common aggregations, currently only <code class="docutils literal"><span class="pre">sum</span></code>, <code class="docutils literal"><span class="pre">mean</span></code>, <code class="docutils literal"><span class="pre">std</span></code>, and <code class="docutils literal"><span class="pre">sem</span></code>, have optimized Cython implementations:</span></p>
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [64]: </span><span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">'A'</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
<span class="gr">Out[64]: </span>
<span class="go"> C D</span>
<span class="go">A </span>
<span class="go">bar 0.443469 0.920834</span>
<span class="go">foo 2.529056 -1.724719</span>
<span class="gp">In [65]: </span><span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span class="s1">'A'</span><span class="p">,</span> <span class="s1">'B'</span><span class="p">])</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="gr">Out[65]: </span>
<span class="go"> C D</span>
<span class="go">A B </span>
<span class="go">bar one -0.042379 -0.089329</span>
<span class="go"> three -0.009920 -0.945867</span>
<span class="go"> two 0.495767 1.956030</span>
<span class="go">foo one -0.278452 -0.556879</span>
<span class="go"> three 1.548106 -0.016692</span>
<span class="go"> two 0.768928 -0.297134</span>
</pre></div>
</div>
<p><span class="yiyi-st" id="yiyi-188">Of course, <code class="docutils literal"><span class="pre">sum</span></code> and <code class="docutils literal"><span class="pre">mean</span></code> are implemented on pandas objects, so the above code would work via dispatching (see below) even without the special versions.</span></p>
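<p>As a rough illustration, the built-in method and a generic Python callable should produce the same numbers, with the difference lying in how the work is carried out. The data and the names <code class="docutils literal"><span class="pre">fast</span></code> and <code class="docutils literal"><span class="pre">generic</span></code> are assumptions of this sketch, and no timing is claimed.</p>

```python
import pandas as pd

# Illustrative data; not the df used elsewhere on this page.
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'C': [1.0, 2.0, 3.0, 4.0],
    'D': [10.0, 20.0, 30.0, 60.0],
})
grouped = df.groupby('A')

# Built-in method: eligible for the optimized code path.
fast = grouped.sum()
# Generic callable: evaluated by calling the Python function.
generic = grouped.agg(lambda x: x.sum())

print(fast)
print(generic)
```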
</div>
</div>
<div class="section" id="transformation">
<span id="groupby-transform"></span><h2><span class="yiyi-st" id="yiyi-189">Transformation</span></h2>
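<p><span class="yiyi-st" id="yiyi-190">The <code class="docutils literal"><span class="pre">transform</span></code> method returns an object that is indexed the same (same size) as the one being grouped.</span> <span class="yiyi-st" id="yiyi-191">Thus, the passed transform function should return a result that is the same size as the group chunk.</span> <span class="yiyi-st" id="yiyi-192">For example, suppose we wish to standardize the data within each group:</span></p>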
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [66]: </span><span class="n">index</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">date_range</span><span class="p">(</span><span class="s1">'10/1/1999'</span><span class="p">,</span> <span class="n">periods</span><span class="o">=</span><span class="mi">1100</span><span class="p">)</span>
<span class="gp">In [67]: </span><span class="n">ts</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">normal</span><span class="p">(</span><span class="mf">0.5</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1100</span><span class="p">),</span> <span class="n">index</span><span class="p">)</span>
<span class="gp">In [68]: </span><span class="n">ts</span> <span class="o">=</span> <span class="n">ts</span><span class="o">.</span><span class="n">rolling</span><span class="p">(</span><span class="n">window</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span><span class="n">min_periods</span><span class="o">=</span><span class="mi">100</span><span class="p">)</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span><span class="o">.</span><span class="n">dropna</span><span class="p">()</span>
<span class="gp">In [69]: </span><span class="n">ts</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
<span class="gr">Out[69]: </span>
<span class="go">2000-01-08 0.779333</span>
<span class="go">2000-01-09 0.778852</span>
<span class="go">2000-01-10 0.786476</span>
<span class="go">2000-01-11 0.782797</span>
<span class="go">2000-01-12 0.798110</span>
<span class="go">Freq: D, dtype: float64</span>
<span class="gp">In [70]: </span><span class="n">ts</span><span class="o">.</span><span class="n">tail</span><span class="p">()</span>
<span class="gr">Out[70]: </span>
<span class="go">2002-09-30 0.660294</span>
<span class="go">2002-10-01 0.631095</span>
<span class="go">2002-10-02 0.673601</span>
<span class="go">2002-10-03 0.709213</span>
<span class="go">2002-10-04 0.719369</span>
<span class="go">Freq: D, dtype: float64</span>
<span class="gp">In [71]: </span><span class="n">key</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">year</span>
<span class="gp">In [72]: </span><span class="n">zscore</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="p">(</span><span class="n">x</span> <span class="o">-</span> <span class="n">x</span><span class="o">.</span><span class="n">mean</span><span class="p">())</span> <span class="o">/</span> <span class="n">x</span><span class="o">.</span><span class="n">std</span><span class="p">()</span>
<span class="gp">In [73]: </span><span class="n">transformed</span> <span class="o">=</span> <span class="n">ts</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">key</span><span class="p">)</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">zscore</span><span class="p">)</span>
</pre></div>
</div>
<p><span class="yiyi-st" id="yiyi-193">We would expect the result to now have mean 0 and standard deviation 1 within each group, which we can easily check:</span></p>
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="c"># Original Data</span>
<span class="gp">In [74]: </span><span class="n">grouped</span> <span class="o">=</span> <span class="n">ts</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">key</span><span class="p">)</span>
<span class="gp">In [75]: </span><span class="n">grouped</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="gr">Out[75]: </span>
<span class="go">2000 0.442441</span>
<span class="go">2001 0.526246</span>
<span class="go">2002 0.459365</span>
<span class="go">dtype: float64</span>
<span class="gp">In [76]: </span><span class="n">grouped</span><span class="o">.</span><span class="n">std</span><span class="p">()</span>
<span class="gr">Out[76]: </span>
<span class="go">2000 0.131752</span>
<span class="go">2001 0.210945</span>
<span class="go">2002 0.128753</span>
<span class="go">dtype: float64</span>
<span class="c"># Transformed Data</span>
<span class="gp">In [77]: </span><span class="n">grouped_trans</span> <span class="o">=</span> <span class="n">transformed</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">key</span><span class="p">)</span>
<span class="gp">In [78]: </span><span class="n">grouped_trans</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="gr">Out[78]: </span>
<span class="go">2000 1.168208e-15</span>
<span class="go">2001 1.454544e-15</span>
<span class="go">2002 1.726657e-15</span>
<span class="go">dtype: float64</span>
<span class="gp">In [79]: </span><span class="n">grouped_trans</span><span class="o">.</span><span class="n">std</span><span class="p">()</span>
<span class="gr">Out[79]: </span>
<span class="go">2000 1.0</span>
<span class="go">2001 1.0</span>
<span class="go">2002 1.0</span>
<span class="go">dtype: float64</span>
</pre></div>
</div>
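<p>As a self-contained sketch of the same check, the snippet below standardizes a synthetic series within each calendar year. The names <code>ts_demo</code>, <code>by_year</code>, <code>standardized</code> and <code>check</code>, as well as the random data, are assumptions made for illustration; they are not the <code>ts</code>, <code>key</code> or <code>zscore</code> objects used above.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: three years of daily values (not the `ts` used above).
rng = np.random.default_rng(0)
index = pd.date_range("2000-01-01", "2002-12-31", freq="D")
ts_demo = pd.Series(rng.normal(0.5, 2.0, len(index)), index=index)

# Group by calendar year and standardize each value within its own year.
by_year = ts_demo.groupby(ts_demo.index.year)
standardized = by_year.transform(lambda x: (x - x.mean()) / x.std())

# Regroup the result: each year should now have mean ~0 and std ~1.
check = standardized.groupby(standardized.index.year).agg(["mean", "std"])
print(check)
```

<p>Because <code>transform</code> returns an object indexed like its input, <code>standardized</code> lines up with <code>ts_demo</code> row for row, which is what makes the regrouping step possible.</p>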
<p><span class="yiyi-st" id="yiyi-194">We can also visually compare the original and transformed data sets.</span></p>
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [80]: </span><span class="n">compare</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s1">'Original'</span><span class="p">:</span> <span class="n">ts</span><span class="p">,</span> <span class="s1">'Transformed'</span><span class="p">:</span> <span class="n">transformed</span><span class="p">})</span>
<span class="gp">In [81]: </span><span class="n">compare</span><span class="o">.</span><span class="n">plot</span><span class="p">()</span>
<span class="gr">Out[81]: </span><span class="o"><</span><span class="n">matplotlib</span><span class="o">.</span><span class="n">axes</span><span class="o">.</span><span class="n">_subplots</span><span class="o">.</span><span class="n">AxesSubplot</span> <span class="n">at</span> <span class="mh">0x7ff26ffe62d0</span><span class="o">></span>
</pre></div>
</div>
<img alt="Line plot of the Original and Transformed series" src="http://pandas.pydata.org/pandas-docs/version/0.19.2/_images/groupby_transform_plot.png">
<p><span class="yiyi-st" id="yiyi-195">Another common data transform is to replace missing data with the group mean.</span></p>
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [82]: </span><span class="n">data_df</span>
<span class="gr">Out[82]: </span>
<span class="go"> A B C</span>
<span class="go">0 1.539708 -1.166480 0.533026</span>
<span class="go">1 1.302092 -0.505754 NaN</span>
<span class="go">2 -0.371983 1.104803 -0.651520</span>
<span class="go">3 -1.309622 1.118697 -1.161657</span>
<span class="go">4 -1.924296 0.396437 0.812436</span>
<span class="go">5 0.815643 0.367816 -0.469478</span>
<span class="go">6 -0.030651 1.376106 -0.645129</span>
<span class="go">.. ... ... ...</span>
<span class="go">993 0.012359 0.554602 -1.976159</span>
<span class="go">994 0.042312 -1.628835 1.013822</span>
<span class="go">995 -0.093110 0.683847 -0.774753</span>
<span class="go">996 -0.185043 1.438572 NaN</span>
<span class="go">997 -0.394469 -0.642343 0.011374</span>
<span class="go">998 -1.174126 1.857148 NaN</span>
<span class="go">999 0.234564 0.517098 0.393534</span>
<span class="go">[1000 rows x 3 columns]</span>
<span class="gp">In [83]: </span><span class="n">countries</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="s1">'US'</span><span class="p">,</span> <span class="s1">'UK'</span><span class="p">,</span> <span class="s1">'GR'</span><span class="p">,</span> <span class="s1">'JP'</span><span class="p">])</span>
<span class="gp">In [84]: </span><span class="n">key</span> <span class="o">=</span> <span class="n">countries</span><span class="p">[</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">1000</span><span class="p">)]</span>
<span class="gp">In [85]: </span><span class="n">grouped</span> <span class="o">=</span> <span class="n">data_df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">key</span><span class="p">)</span>
<span class="c"># Non-NA count in each group</span>
<span class="gp">In [86]: </span><span class="n">grouped</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
<span class="gr">Out[86]: </span>
<span class="go"> A B C</span>
<span class="go">GR 209 217 189</span>
<span class="go">JP 240 255 217</span>
<span class="go">UK 216 231 193</span>
<span class="go">US 239 250 217</span>
<span class="gp">In [87]: </span><span class="n">f</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">mean</span><span class="p">())</span>
<span class="gp">In [88]: </span><span class="n">transformed</span> <span class="o">=</span> <span class="n">grouped</span><span class="o">.</span><span class="n">transform</span><span class="p">(</span><span class="n">f</span><span class="p">)</span>
</pre></div>
</div>
<p><span class="yiyi-st" id="yiyi-196">We can verify that the group means have not changed in the transformed data and that the transformed data contains no NAs.</span></p>
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [89]: </span><span class="n">grouped_trans</span> <span class="o">=</span> <span class="n">transformed</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">key</span><span class="p">)</span>
<span class="gp">In [90]: </span><span class="n">grouped</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span> <span class="c1"># original group means</span>
<span class="gr">Out[90]: </span>
<span class="go"> A B C</span>
<span class="go">GR -0.098371 -0.015420 0.068053</span>
<span class="go">JP 0.069025 0.023100 -0.077324</span>
<span class="go">UK 0.034069 -0.052580 -0.116525</span>
<span class="go">US 0.058664 -0.020399 0.028603</span>
<span class="gp">In [91]: </span><span class="n">grouped_trans</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span> <span class="c1"># transformation did not change group means</span>
<span class="gr">Out[91]: </span>
<span class="go"> A B C</span>
<span class="go">GR -0.098371 -0.015420 0.068053</span>
<span class="go">JP 0.069025 0.023100 -0.077324</span>
<span class="go">UK 0.034069 -0.052580 -0.116525</span>
<span class="go">US 0.058664 -0.020399 0.028603</span>
<span class="gp">In [92]: </span><span class="n">grouped</span><span class="o">.</span><span class="n">count</span><span class="p">()</span> <span class="c1"># original has some missing data points</span>
<span class="gr">Out[92]: </span>
<span class="go"> A B C</span>
<span class="go">GR 209 217 189</span>
<span class="go">JP 240 255 217</span>
<span class="go">UK 216 231 193</span>
<span class="go">US 239 250 217</span>
<span class="gp">In [93]: </span><span class="n">grouped_trans</span><span class="o">.</span><span class="n">count</span><span class="p">()</span> <span class="c1"># counts after transformation</span>
<span class="gr">Out[93]: </span>
<span class="go"> A B C</span>
<span class="go">GR 228 228 228</span>
<span class="go">JP 267 267 267</span>
<span class="go">UK 247 247 247</span>
<span class="go">US 258 258 258</span>
<span class="gp">In [94]: </span><span class="n">grouped_trans</span><span class="o">.</span><span class="n">size</span><span class="p">()</span> <span class="c1"># Verify non-NA count equals group size</span>
<span class="gr">Out[94]: </span>
<span class="go">GR 228</span>
<span class="go">JP 267</span>
<span class="go">UK 247</span>
<span class="go">US 258</span>
<span class="go">dtype: int64</span>
</pre></div>
</div>
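<p>The same idea can be traced by hand on a very small frame. This is a minimal sketch; <code>df_demo</code> and its values are hypothetical and chosen so that the group means are easy to compute mentally.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per group.
df_demo = pd.DataFrame({"country": ["US", "US", "US", "UK", "UK", "UK"],
                        "value": [1.0, np.nan, 3.0, 10.0, 20.0, np.nan]})

grouped_demo = df_demo.groupby("country")["value"]

# Replace each missing value with the mean of its own group.
filled = grouped_demo.transform(lambda x: x.fillna(x.mean()))

# Group means before and after the fill, for comparison.
means_before = grouped_demo.mean()
means_after = filled.groupby(df_demo["country"]).mean()
print(filled.tolist())
```

<p>Filling with the group mean leaves each group's mean unchanged, since the inserted value equals the mean of the values already present.</p>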
<div class="admonition note">
<p class="first admonition-title"><span class="yiyi-st" id="yiyi-197">Note</span></p>
<p><span class="yiyi-st" id="yiyi-198">Some functions, when applied to a groupby object, will automatically transform the input, returning an object of the same shape as the original. </span><span class="yiyi-st" id="yiyi-199">Passing <code class="docutils literal"><span class="pre">as_index=False</span></code> will not affect these transformation methods.</span></p>
<p><span class="yiyi-st" id="yiyi-200">For example: <code class="docutils literal"><span class="pre">fillna,</span> <span class="pre">ffill,</span> <span class="pre">bfill,</span> <span class="pre">shift</span></code>.</span></p>
<div class="last highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [95]: </span><span class="n">grouped</span><span class="o">.</span><span class="n">ffill</span><span class="p">()</span>
<span class="gr">Out[95]: </span>
<span class="go"> A B C</span>
<span class="go">0 1.539708 -1.166480 0.533026</span>
<span class="go">1 1.302092 -0.505754 0.533026</span>
<span class="go">2 -0.371983 1.104803 -0.651520</span>
<span class="go">3 -1.309622 1.118697 -1.161657</span>
<span class="go">4 -1.924296 0.396437 0.812436</span>
<span class="go">5 0.815643 0.367816 -0.469478</span>
<span class="go">6 -0.030651 1.376106 -0.645129</span>
<span class="go">.. ... ... ...</span>
<span class="go">993 0.012359 0.554602 -1.976159</span>
<span class="go">994 0.042312 -1.628835 1.013822</span>
<span class="go">995 -0.093110 0.683847 -0.774753</span>
<span class="go">996 -0.185043 1.438572 -0.774753</span>
<span class="go">997 -0.394469 -0.642343 0.011374</span>
<span class="go">998 -1.174126 1.857148 -0.774753</span>
<span class="go">999 0.234564 0.517098 0.393534</span>
<span class="go">[1000 rows x 3 columns]</span>
</pre></div>
</div>
</div>
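<p>To see that these methods work within each group while keeping the original index, consider the following sketch. The frame <code>df_demo</code> is hypothetical; its two groups are interleaved on purpose so that a plain, ungrouped <code>ffill</code> would give a different answer.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical data: groups "a" and "b" alternate row by row.
df_demo = pd.DataFrame({"key": ["a", "b", "a", "b", "a"],
                        "val": [1.0, 2.0, np.nan, np.nan, np.nan]})

# Forward-fill within each group; the result keeps the original index.
filled = df_demo.groupby("key")["val"].ffill()

# Shift within each group: the first row of every group becomes NaN.
shifted = df_demo.groupby("key")["val"].shift(1)

print(filled.tolist())
```

<p>Row 2 is filled from row 0 (the previous row of group <code>a</code>), not from row 1, which belongs to group <code>b</code>.</p>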
<div class="section" id="new-syntax-to-window-and-resample-operations">
<span id="groupby-transform-window-resample"></span><h3><span class="yiyi-st" id="yiyi-201">New syntax to window and resample operations</span></h3>
<div class="versionadded">
<p><span class="yiyi-st" id="yiyi-202"><span class="versionmodified">New in version 0.18.1.</span></span></p>
</div>
<p><span class="yiyi-st" id="yiyi-203">Resample, expanding, or rolling operations at the groupby level used to require the application of helper functions. </span><span class="yiyi-st" id="yiyi-204">However, it is now possible to use <code class="docutils literal"><span class="pre">resample()</span></code>, <code class="docutils literal"><span class="pre">expanding()</span></code> and <code class="docutils literal"><span class="pre">rolling()</span></code> as methods on groupbys.</span></p>
<p><span class="yiyi-st" id="yiyi-205">The example below applies the <code class="docutils literal"><span class="pre">rolling()</span></code> method to the samples of column B, based on the groups of column A.</span></p>
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [96]: </span><span class="n">df_re</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s1">'A'</span><span class="p">:</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">*</span> <span class="mi">10</span> <span class="o">+</span> <span class="p">[</span><span class="mi">5</span><span class="p">]</span> <span class="o">*</span> <span class="mi">10</span><span class="p">,</span>
<span class="gp"> ....:</span> <span class="s1">'B'</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">20</span><span class="p">)})</span>
<span class="gp"> ....:</span>
<span class="gp">In [97]: </span><span class="n">df_re</span>
<span class="gr">Out[97]: </span>
<span class="go"> A B</span>
<span class="go">0 1 0</span>
<span class="go">1 1 1</span>
<span class="go">2 1 2</span>
<span class="go">3 1 3</span>
<span class="go">4 1 4</span>
<span class="go">5 1 5</span>
<span class="go">6 1 6</span>
<span class="go">.. .. ..</span>
<span class="go">13 5 13</span>
<span class="go">14 5 14</span>
<span class="go">15 5 15</span>
<span class="go">16 5 16</span>
<span class="go">17 5 17</span>
<span class="go">18 5 18</span>
<span class="go">19 5 19</span>
<span class="go">[20 rows x 2 columns]</span>
<span class="gp">In [98]: </span><span class="n">df_re</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">'A'</span><span class="p">)</span><span class="o">.</span><span class="n">rolling</span><span class="p">(</span><span class="mi">4</span><span class="p">)</span><span class="o">.</span><span class="n">B</span><span class="o">.</span><span class="n">mean</span><span class="p">()</span>
<span class="gr">Out[98]: </span>
<span class="go">A </span>
<span class="go">1 0 NaN</span>
<span class="go"> 1 NaN</span>
<span class="go"> 2 NaN</span>
<span class="go"> 3 1.5</span>
<span class="go"> 4 2.5</span>
<span class="go"> 5 3.5</span>
<span class="go"> 6 4.5</span>
<span class="go"> ... </span>
<span class="go">5 13 11.5</span>
<span class="go"> 14 12.5</span>
<span class="go"> 15 13.5</span>
<span class="go"> 16 14.5</span>
<span class="go"> 17 15.5</span>
<span class="go"> 18 16.5</span>
<span class="go"> 19 17.5</span>
<span class="go">Name: B, dtype: float64</span>
</pre></div>
</div>
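<p>A smaller variant makes the group boundary easier to see. This is a sketch on a hypothetical frame <code>df_demo</code>; it selects column <code>B</code> with bracket indexing before calling <code>rolling()</code>, which is a choice made here for clarity rather than the form used in the example above.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical data: two groups of four rows each.
df_demo = pd.DataFrame({"A": [1] * 4 + [5] * 4, "B": np.arange(8)})

# Rolling mean over a window of 2, computed separately inside each group.
rolled = df_demo.groupby("A")["B"].rolling(2).mean()
print(rolled)
```

<p>The result is indexed by the group key and the original row label. The first row of each group is <code>NaN</code> because the window never reaches back into the previous group.</p>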
<p><span class="yiyi-st" id="yiyi-206">The <code class="docutils literal"><span class="pre">expanding()</span></code> method accumulates a given operation (<code class="docutils literal"><span class="pre">sum()</span></code> in the example) over all the members of each particular group.</span></p>
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [99]: </span><span class="n">df_re</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">'A'</span><span class="p">)</span><span class="o">.</span><span class="n">expanding</span><span class="p">()</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span>
<span class="gr">Out[99]: </span>
<span class="go"> A B</span>
<span class="go">A </span>
<span class="go">1 0 1.0 0.0</span>
<span class="go"> 1 2.0 1.0</span>
<span class="go"> 2 3.0 3.0</span>
<span class="go"> 3 4.0 6.0</span>
<span class="go"> 4 5.0 10.0</span>
<span class="go"> 5 6.0 15.0</span>
<span class="go"> 6 7.0 21.0</span>
<span class="go">... ... ...</span>
<span class="go">5 13 20.0 46.0</span>
<span class="go"> 14 25.0 60.0</span>
<span class="go"> 15 30.0 75.0</span>
<span class="go"> 16 35.0 91.0</span>
<span class="go"> 17 40.0 108.0</span>
<span class="go"> 18 45.0 126.0</span>
<span class="go"> 19 50.0 145.0</span>
<span class="go">[20 rows x 2 columns]</span>
</pre></div>
</div>
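<p>The accumulation restarts at the beginning of every group, as the following sketch on a hypothetical frame <code>df_demo</code> shows. Column <code>B</code> is selected explicitly here; whether the grouping column itself appears in the output may differ between pandas versions, so selecting the column of interest is assumed to be the more portable form.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical data: two groups of four rows each.
df_demo = pd.DataFrame({"A": [1] * 4 + [5] * 4, "B": np.arange(8)})

# Running (expanding) sum of B, restarted for each value of A.
running = df_demo.groupby("A")["B"].expanding().sum()
print(running.tolist())
```

<p>For group 1 the running sums are 0, 1, 3, 6; for group 5 they start again from the first member of that group: 4, 9, 15, 22.</p>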
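<p><span class="yiyi-st" id="yiyi-207">Suppose you want to use the <code class="docutils literal"><span class="pre">resample()</span></code> method to get a daily frequency in each group of your dataframe, and wish to complete the missing values with the <code class="docutils literal"><span class="pre">ffill()</span></code> method.</span></p>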
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [100]: </span><span class="n">df_re</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s1">'date'</span><span class="p">:</span> <span class="n">pd</span><span class="o">.</span><span class="n">date_range</span><span class="p">(</span><span class="n">start</span><span class="o">=</span><span class="s1">'2016-01-01'</span><span class="p">,</span>
<span class="gp"> .....:</span> <span class="n">periods</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span>
<span class="gp"> .....:</span> <span class="n">freq</span><span class="o">=</span><span class="s1">'W'</span><span class="p">),</span>
<span class="gp"> .....:</span> <span class="s1">'group'</span><span class="p">:</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">],</span>
<span class="gp"> .....:</span> <span class="s1">'val'</span><span class="p">:</span> <span class="p">[</span><span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">8</span><span class="p">]})</span><span class="o">.</span><span class="n">set_index</span><span class="p">(</span><span class="s1">'date'</span><span class="p">)</span>
<span class="gp"> .....:</span>
<span class="gp">In [101]: </span><span class="n">df_re</span>
<span class="gr">Out[101]: </span>
<span class="go"> group val</span>
<span class="go">date </span>
<span class="go">2016-01-03 1 5</span>
<span class="go">2016-01-10 1 6</span>
<span class="go">2016-01-17 2 7</span>
<span class="go">2016-01-24 2 8</span>
<span class="gp">In [102]: </span><span class="n">df_re</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">'group'</span><span class="p">)</span><span class="o">.</span><span class="n">resample</span><span class="p">(</span><span class="s1">'1D'</span><span class="p">)</span><span class="o">.</span><span class="n">ffill</span><span class="p">()</span>
<span class="gr">Out[102]: </span>
<span class="go"> group val</span>
<span class="go">group date </span>
<span class="go">1 2016-01-03 1 5</span>
<span class="go"> 2016-01-04 1 5</span>
<span class="go"> 2016-01-05 1 5</span>
<span class="go"> 2016-01-06 1 5</span>
<span class="go"> 2016-01-07 1 5</span>
<span class="go"> 2016-01-08 1 5</span>
<span class="go"> 2016-01-09 1 5</span>
<span class="go">... ... ...</span>
<span class="go">2 2016-01-18 2 7</span>
<span class="go"> 2016-01-19 2 7</span>
<span class="go"> 2016-01-20 2 7</span>
<span class="go"> 2016-01-21 2 7</span>
<span class="go"> 2016-01-22 2 7</span>
<span class="go"> 2016-01-23 2 7</span>
<span class="go"> 2016-01-24 2 8</span>
<span class="go">[16 rows x 2 columns]</span>
</pre></div>
</div>
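<p>The sketch below repeats the pattern on a shorter, hypothetical date range so that every output row can be listed. The frame <code>df_demo</code> and its dates are assumptions, and the <code>val</code> column is selected before resampling so that only the values are upsampled.</p>

```python
import pandas as pd

# Hypothetical data: two groups with gaps of a few days between observations.
df_demo = pd.DataFrame({
    "date": pd.to_datetime(["2016-01-03", "2016-01-06",
                            "2016-01-17", "2016-01-19"]),
    "group": [1, 1, 2, 2],
    "val": [5, 6, 7, 8],
}).set_index("date")

# Upsample each group to daily frequency and carry the last value forward.
daily = df_demo.groupby("group")["val"].resample("D").ffill()
print(daily)
```

<p>Each group is resampled only between its own first and last dates, so no rows are created for the days between the end of group 1 and the start of group 2.</p>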
</div>
</div>
<div class="section" id="filtration">
<span id="groupby-filter"></span><h2><span class="yiyi-st" id="yiyi-208">Filtration</span></h2>
<div class="versionadded">
<p><span class="yiyi-st" id="yiyi-209"><span class="versionmodified">New in version 0.12.</span></span></p>
</div>
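<p><span class="yiyi-st" id="yiyi-210">The <code class="docutils literal"><span class="pre">filter</span></code> method returns a subset of the original object. </span><span class="yiyi-st" id="yiyi-211">Suppose we want to take only the elements that belong to groups whose group sum is greater than 2.</span></p>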
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [103]: </span><span class="n">sf</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">])</span>
<span class="gp">In [104]: </span><span class="n">sf</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="n">sf</span><span class="p">)</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span> <span class="o">></span> <span class="mi">2</span><span class="p">)</span>
<span class="gr">Out[104]: </span>
<span class="go">3 3</span>
<span class="go">4 3</span>
<span class="go">5 3</span>
<span class="go">dtype: int64</span>
</pre></div>
</div>
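<p>One way to reason about <code>filter</code> is that it keeps or drops whole groups at once. The sketch below compares it with an equivalent boolean mask built from <code>transform</code>; the mask formulation and the names <code>kept</code>, <code>mask</code> and <code>same</code> are illustrative assumptions, not part of the original example.</p>

```python
import pandas as pd

s = pd.Series([1, 1, 2, 3, 3, 3])

# Keep only the elements whose group (rows with equal value) sums to more than 2.
kept = s.groupby(s).filter(lambda x: x.sum() > 2)

# Equivalent formulation: broadcast each group's sum back to its rows, then mask.
mask = s.groupby(s).transform("sum") > 2
same = s[mask]

print(kept)
```

<p>The groups for values 1 and 2 both sum to exactly 2, so they are dropped; only the three rows of the group for value 3 (sum 9) remain, with their original index labels.</p>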
<p><span class="yiyi-st" id="yiyi-212">The argument of <code class="docutils literal"><span class="pre">filter</span></code> must be a function that, applied to the group as a whole, returns <code class="docutils literal"><span class="pre">True</span></code> or <code class="docutils literal"><span class="pre">False</span></code>.</span></p>
<p><span class="yiyi-st" id="yiyi-213">Another useful operation is filtering out elements that belong to groups with only a couple of members.</span></p>
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [105]: </span><span class="n">dff</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">({</span><span class="s1">'A'</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">8</span><span class="p">),</span> <span class="s1">'B'</span><span class="p">:</span> <span class="nb">list</span><span class="p">(</span><span class="s1">'aabbbbcc'</span><span class="p">)})</span>
<span class="gp">In [106]: </span><span class="n">dff</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">'B'</span><span class="p">)</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="o">></span> <span class="mi">2</span><span class="p">)</span>
<span class="gr">Out[106]: </span>
<span class="go"> A B</span>
<span class="go">2 2 b</span>
<span class="go">3 3 b</span>
<span class="go">4 4 b</span>
<span class="go">5 5 b</span>
</pre></div>
</div>
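<p><span class="yiyi-st" id="yiyi-214">Alternatively, instead of dropping the offending groups, we can return a like-indexed object in which the groups that do not pass the filter are filled with NaNs.</span></p>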
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [107]: </span><span class="n">dff</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">'B'</span><span class="p">)</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="o">></span> <span class="mi">2</span><span class="p">,</span> <span class="n">dropna</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="gr">Out[107]: </span>
<span class="go"> A B</span>
<span class="go">0 NaN NaN</span>
<span class="go">1 NaN NaN</span>
<span class="go">2 2.0 b</span>
<span class="go">3 3.0 b</span>
<span class="go">4 4.0 b</span>
<span class="go">5 5.0 b</span>
<span class="go">6 NaN NaN</span>
<span class="go">7 NaN NaN</span>
</pre></div>
</div>
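<p>The two behaviours can be put side by side on a hypothetical frame <code>dff_demo</code>. This is a sketch for comparison only; the variable names are assumptions.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical data: group "b" has three members, "a" has two, "c" has one.
dff_demo = pd.DataFrame({"A": np.arange(6), "B": list("aabbbc")})

# Default behaviour: rows of failing groups are dropped.
dropped = dff_demo.groupby("B").filter(lambda g: len(g) > 2)

# dropna=False: the original index is kept and failing rows become NaN.
kept_shape = dff_demo.groupby("B").filter(lambda g: len(g) > 2, dropna=False)

print(dropped.index.tolist())
print(kept_shape["A"].isna().tolist())
```

<p>Keeping the original shape can be convenient when the filtered result has to be aligned with other columns of the same frame, at the cost of integer columns being converted to floats to hold the NaNs.</p>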
<p><span class="yiyi-st" id="yiyi-215">For DataFrames with multiple columns, filters should explicitly specify a column as the filter criterion.</span></p>
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [108]: </span><span class="n">dff</span><span class="p">[</span><span class="s1">'C'</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">8</span><span class="p">)</span>
<span class="gp">In [109]: </span><span class="n">dff</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">'B'</span><span class="p">)</span><span class="o">.</span><span class="n">filter</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="s1">'C'</span><span class="p">])</span> <span class="o">></span> <span class="mi">2</span><span class="p">)</span>
<span class="gr">Out[109]: </span>
<span class="go"> A B C</span>
<span class="go">2 2 b 2</span>
<span class="go">3 3 b 3</span>
<span class="go">4 4 b 4</span>
<span class="go">5 5 b 5</span>
</pre></div>
</div>
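<p>Naming the column matters most when the criterion is a statistic of that column rather than the group length. A minimal sketch, using a hypothetical frame <code>dff_demo</code> and an arbitrary threshold of 3:</p>

```python
import pandas as pd

# Hypothetical data: three groups with different average values of C.
dff_demo = pd.DataFrame({"B": list("aabbcc"),
                         "C": [1, 2, 10, 20, 3, 4]})

# The function receives each group as a DataFrame, so it picks column C
# explicitly and reduces it to a single True/False value.
big = dff_demo.groupby("B").filter(lambda g: g["C"].mean() > 3)
print(big)
```

<p>Group <code>a</code> has a mean of 1.5 and is removed; groups <code>b</code> (mean 15) and <code>c</code> (mean 3.5) pass the filter and are returned with all of their columns.</p>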
<div class="admonition note">
<p class="first admonition-title"><span class="yiyi-st" id="yiyi-216">Note</span></p>
<p><span class="yiyi-st" id="yiyi-217">Some functions, when applied to a groupby object, will act as a <strong>filter</strong> on the input, returning a reduced shape of the original (and potentially eliminating groups), but with the index unchanged. </span><span class="yiyi-st" id="yiyi-218">Passing <code class="docutils literal"><span class="pre">as_index=False</span></code> will not affect these transformation methods.</span></p>
<p><span class="yiyi-st" id="yiyi-219">For example: <code class="docutils literal"><span class="pre">head,</span> <span class="pre">tail</span></code>.</span></p>
<div class="last highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [110]: </span><span class="n">dff</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">'B'</span><span class="p">)</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
<span class="gr">Out[110]: </span>
<span class="go"> A B C</span>
<span class="go">0 0 a 0</span>
<span class="go">1 1 a 1</span>
<span class="go">2 2 b 2</span>
<span class="go">3 3 b 3</span>
<span class="go">6 6 c 6</span>
<span class="go">7 7 c 7</span>
</pre></div>
</div>
</div>
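<p>The following sketch shows both methods on a hypothetical frame <code>dff_demo</code> and checks that <code>as_index=False</code> makes no difference to the result.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical data: groups of size 3 ("a"), 2 ("b") and 1 ("c").
dff_demo = pd.DataFrame({"A": np.arange(6), "B": list("aaabbc")})

# First two rows and last row of every group, in original row order.
first_two = dff_demo.groupby("B").head(2)
last_one = dff_demo.groupby("B").tail(1)

# as_index=False does not change what head() returns.
first_two_flat = dff_demo.groupby("B", as_index=False).head(2)

print(first_two.index.tolist())
print(last_one.index.tolist())
```

<p>The rows keep their original index labels, and groups with fewer rows than requested simply contribute all of their rows.</p>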
</div>
<div class="section" id="dispatching-to-instance-methods">
<span id="groupby-dispatch"></span><h2><span class="yiyi-st" id="yiyi-220">Dispatching to instance methods</span></h2>
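<p><span class="yiyi-st" id="yiyi-221">When doing an aggregation or transformation, you might just want to call an instance method on each data group. </span><span class="yiyi-st" id="yiyi-222">This is pretty easy to do by passing lambda functions:</span></p>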
<div class="highlight-ipython"><div class="highlight"><pre><span></span><span class="gp">In [111]: </span><span class="n">grouped</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">(</span><span class="s1">'A'</span><span class="p">)</span>