<?xml version="1.0" encoding="utf-8"?>
<search>
<entry>
<title>Multi-Function Streamlit App</title>
<url>/2023/10/24/%E5%A4%9A%E5%8A%9F%E8%83%BDstreamlit-App/</url>
<content><![CDATA[<p>✨ Multi-Function Streamlit App</p>
<h4 id="深度学习模型部署实现图片检测、视频检测、人脸识别(判断是否是同一个人)、图片分类。"><span class="heading-link">Deploying deep learning models for image detection, video detection, face recognition (deciding whether two photos show the same person), and image classification.</span></h4><p>Note: the face recognition feature was implemented from scratch; a detailed walkthrough is given below.</p>
<h1 id="demo演示"><span class="heading-link">Demo</span></h1><p><span class="external-link"><a href="https://live.csdn.net/v/embed/314987" target="_blank" rel="noopener">Demo video (hosted on CSDN)</a><i class="fa fa-external-link"></i></span></p>
<h1 id="安装依赖"><span class="heading-link">Installing dependencies</span></h1><p><code>pip install -r requirements.txt # local install</code></p>
<h1 id="运行项目"><span class="heading-link">Running the project</span></h1><p>First train the face recognition model inside the FaceModel folder, then replace the pkl file referenced in Facemodel.py with the resulting model. Finally run the command below in a terminal.</p>
<p><code>streamlit run login.py</code></p>
<p><img src="./Page_data/users.png" alt="user accounts"></p>
<p>Log in with a username and password to reach the main page.</p>
<p><img src="/2023/10/24/%E5%A4%9A%E5%8A%9F%E8%83%BDstreamlit-App/1.png" alt="screenshot"></p>
<p><img src="/2023/10/24/%E5%A4%9A%E5%8A%9F%E8%83%BDstreamlit-App/34dce2f01724488ca08098921401fac7.png" alt="screenshot"></p>
<p><img src="/2023/10/24/%E5%A4%9A%E5%8A%9F%E8%83%BDstreamlit-App/db366e868b984d28a4b1e9451679e833.png" alt="screenshot"></p>
<h1 id="功能介绍"><span class="heading-link">Features</span></h1><h2 id="a-图片检测"><span class="heading-link">a. Image detection</span></h2><p><img src="/2023/10/24/%E5%A4%9A%E5%8A%9F%E8%83%BDstreamlit-App/29633413150346a7a0db7ad1148344c3.png" alt="screenshot"></p>
<h2 id="b-视频检测"><span class="heading-link">b. Video detection</span></h2><p><img src="/2023/10/24/%E5%A4%9A%E5%8A%9F%E8%83%BDstreamlit-App/7285107c49234feea310ee7c1834cec8.png" alt="screenshot"></p>
<h2 id="c-人脸识别"><span class="heading-link">c. Face recognition</span></h2><p>Since the face recognition feature was written and trained by me, the code is described in detail below.<br><img src="/2023/10/24/%E5%A4%9A%E5%8A%9F%E8%83%BDstreamlit-App/76f0c5e7fdee436a8ee2666d5ae99102.png" alt="screenshot"></p>
<h3 id="1-训练模型数据集"><span class="heading-link">1. Training datasets</span></h3><figure class="highlight plain"><div class="table-container"><table><tbody><tr><td class="code"><pre><span class="line">The first 500 identities of the CeleA dataset plus the CASIA dataset</span><br></pre></td></tr></tbody></table></div></figure>
<h3 id="2-数据处理"><span class="heading-link">2. Data processing</span></h3><p><strong>Note</strong>: the <code>CeleA</code> and <code>CASIA</code> images are not tightly cropped faces, so an <code>MTCNN</code> model is used to extract the faces, which are then saved to a folder.</p>
<h4 id="2-1-数据预处理"><span class="heading-link">2.1 Data preprocessing</span></h4><p>The two datasets differ both in the total number of faces and in the number of faces per identity, which creates a data imbalance problem. The preprocessing stage therefore uses <strong>data fusion</strong> to address it: the <code>CASIA</code> dataset is merged with the first 500 identities of the <code>CELEA</code> dataset, and the result is stored in the intermediate data folder <code>temp_data</code>.</p>
<ol>
<li><code>CASIA_face_detection.py</code> runs face detection on the <code>CASIA</code> dataset and saves the results. In <code>CASIA</code>, each identity has 5 different images. Run the following command to obtain the preprocessed <code>CASIA</code> dataset.</li>
</ol>
<figure class="highlight shell"><div class="table-container"><table><tbody><tr><td class="code"><pre><span class="line">python CASIA_face_detection.py trainSetCASIA_Dir CASIA_face_train_save_path</span><br><span class="line">[Arguments]</span><br><span class="line">trainSetCASIA_Dir: path to the CASIA dataset.</span><br><span class="line">CASIA_face_train_save_path: directory where the detected face crops are saved.</span><br></pre></td></tr></tbody></table></div></figure>
<ol start="2">
<li><code>Data_processing_and_detection.py</code> runs face detection on the <code>CELEA</code> dataset and saves the results. In <code>CELEA</code>, each identity has multiple images (more than 5). Run the following command to obtain the preprocessed <code>CELEA</code> dataset.</li>
</ol>
<figure class="highlight shell"><div class="table-container"><table><tbody><tr><td class="code"><pre><span class="line">python Data_processing_and_detection.py identify_path img_path save_path CeleA_and_CASIA_save_path </span><br><span class="line">[Arguments]</span><br><span class="line">identify_path: path to the CELEA label file.</span><br><span class="line">img_path: path to the CELEA face images.</span><br><span class="line">save_path: directory where the detected face crops are saved.</span><br><span class="line">CeleA_and_CASIA_save_path: directory for the renumbered CeleA face images (identity ids start at 500).</span><br></pre></td></tr></tbody></table></div></figure>
<p><strong>After these two steps we have the preprocessed <code>CASIA</code> and <code>CELEA</code> datasets. They are merged into a single folder containing face images of 1000 people: identities 000-499 come from <code>CASIA</code> and identities 500-999 come from <code>CeleA</code>.</strong></p>
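The renumbering step above (CASIA identities keep ids 000-499, CeleA identities are shifted to start at 500) can be sketched as follows. This is a minimal illustration, not the project's actual script: the function name `merge_datasets` and the assumption of one numeric folder per identity are mine.

```python
import os
import shutil

def merge_datasets(casia_dir, celea_dir, out_dir, offset=500):
    """Copy per-identity folders from both datasets into one folder,
    renumbering CeleA identities so their ids start at `offset`."""
    os.makedirs(out_dir, exist_ok=True)
    # CASIA identities keep their original ids (000-499).
    for name in sorted(os.listdir(casia_dir)):
        shutil.copytree(os.path.join(casia_dir, name),
                        os.path.join(out_dir, f"{int(name):03d}"))
    # CeleA identities are shifted by `offset` to avoid id collisions.
    for name in sorted(os.listdir(celea_dir)):
        shutil.copytree(os.path.join(celea_dir, name),
                        os.path.join(out_dir, f"{int(name) + offset:03d}"))
    return sorted(os.listdir(out_dir))
```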
<h4 id="2-2-MTCNN模型原理"><span class="heading-link">2.2 How MTCNN works</span></h4><p>The first stage uses a convolutional network called <code>PNet</code> (Proposal Network) to produce candidate windows and bounding-box regression vectors; the candidates are calibrated with the regression vectors, and overlapping windows are removed with non-maximum suppression. The second stage feeds the images containing the candidates selected by <code>P-Net</code> into <code>R-Net</code> (Refine Network), which classifies them with a fully connected head, fine-tunes the windows with the bounding-box vectors, and again removes overlaps with non-maximum suppression. The third stage uses <code>O-Net</code> (Output Network), which has one more convolutional layer than <code>R-Net</code> and plays a similar role. The network architecture is shown below.</p>
<p><img src="/2023/10/24/%E5%A4%9A%E5%8A%9F%E8%83%BDstreamlit-App/31529324a94c41a5b41726cfa07290c1.jpeg" alt="MTCNN architecture"></p>
<h3 id="3-模型处理"><span class="heading-link">3. Model pipeline</span></h3><h4 id="3-1-模型训练"><span class="heading-link">3.1 Model training</span></h4><figure class="highlight markdown"><div class="table-container"><table><tbody><tr><td class="code"><pre><span class="line">Command: python train.py trainSetDir modelPath</span><br><span class="line">Note: replace trainSetDir with the training set path and modelPath with the path where the model is saved.</span><br></pre></td></tr></tbody></table></div></figure>
<h4 id="数据增强:从数据集随机选择两张图片,对其进行添加椒盐噪声,水平翻转、高斯噪声、平移缩放旋转,其中数据增强是随机的。"><span class="heading-link">Data augmentation: two images are randomly drawn from the dataset and augmented with salt-and-pepper noise, horizontal flips, Gaussian noise, and shift/scale/rotate; each augmentation is applied at random.</span></h4><figure class="highlight python"><div class="table-container"><table><tbody><tr><td class="code"><pre><span class="line">transform = transforms.Compose([transforms.Resize((<span class="number">256</span>, <span class="number">256</span>)),</span><br><span class="line"> AddPepperNoise(<span class="number">0.9</span>,<span class="number">0.5</span>),</span><br><span class="line"> transforms.ToTensor()])</span><br><span class="line"></span><br><span class="line"></span><br><span class="line">albu_transform = A.Compose([A.HorizontalFlip(p = <span class="number">0.5</span>),</span><br><span class="line"> A.OneOf([</span><br><span class="line"> A.IAAAdditiveGaussianNoise(),</span><br><span class="line"> A.GaussNoise(var_limit=(<span class="number">10</span>,<span class="number">80</span>))</span><br><span class="line"> ],p=<span class="number">0.8</span>),</span><br><span class="line"> A.ShiftScaleRotate(scale_limit = <span class="number">0.1</span>,rotate_limit=<span class="number">15</span>,p=<span class="number">0.6</span>)</span><br><span class="line"> ],p=<span class="number">0.6</span>)</span><br></pre></td></tr></tbody></table></div></figure>
<h4 id="训练参数设置"><span class="heading-link">Training hyperparameters</span></h4><figure class="highlight python"><div class="table-container"><table><tbody><tr><td class="code"><pre><span class="line">train_epochs = <span class="number">200</span></span><br><span class="line">in_shape = [<span class="number">256</span>, <span class="number">256</span>]</span><br><span class="line">train_batch_size = <span class="number">16</span></span><br><span class="line">optimizer = torch.optim.Adam(net.parameters(), <span class="number">0.0001</span>, betas=(<span class="number">0.9</span>, <span class="number">0.999</span>)) <span class="comment"># initial learning rate 0.0001</span></span><br><span class="line">torch.manual_seed(<span class="number">22</span>) <span class="comment"># fix the random seeds for reproducibility</span></span><br><span class="line">np.random.seed(<span class="number">22</span>)</span><br><span class="line">criterion = torch.nn.BCEWithLogitsLoss() <span class="comment"># binary cross-entropy loss on logits</span></span><br></pre></td></tr></tbody></table></div></figure>
<h4 id="3-2-模型预测"><span class="heading-link">3.2 Model inference</span></h4><figure class="highlight markdown"><div class="table-container"><table><tbody><tr><td class="code"><pre><span class="line">Command: python test.py testSetDir resultPath</span><br><span class="line">Note: replace testSetDir with the test set path and resultPath with the path where the results are saved.</span><br></pre></td></tr></tbody></table></div></figure>
<h4 id="3-3-模型原理"><span class="heading-link">3.3 How the model works</span></h4><p>Informally, <strong>the core idea of <code>SENet</code> is to let the network learn per-feature-map weights from the loss, so that informative feature maps receive large weights and uninformative ones receive small weights, which yields better results</strong>. <code>SENet-154</code> is built by incorporating <code>SE</code> blocks into a modified <code>64×4d ResNeXt-152</code>, which extends the original <code>ResNeXt-101</code> using the block stacking strategy of <code>ResNet-152</code>. The <code>SE</code> structure is shown below.</p>
<p><img src="/2023/10/24/%E5%A4%9A%E5%8A%9F%E8%83%BDstreamlit-App/7f47dcc0a0a94f76ae384e8941359fdf.jpeg" alt="SE block structure"></p>
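The squeeze-and-excitation idea described above (learn a per-channel weight from global context, then rescale the feature maps) can be sketched as a minimal PyTorch module. This is an illustrative sketch, not the project's code; the reduction ratio of 16 comes from the SENet paper, not from this post:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Minimal squeeze-and-excitation block: global average pool ("squeeze"),
    a two-layer bottleneck MLP ("excitation"), then channel-wise rescaling."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # squeeze: B x C x H x W -> B x C x 1 x 1
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                    # per-channel weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                         # reweight each feature map
```

The output has the same shape as the input, so the block can be dropped after any convolutional stage, which is exactly how SENet-154 inserts it into the ResNeXt trunk.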
<p>Other differences between <code>SENET-154</code> and a plain <code>SE</code> network are as follows:</p>
<ul>
<li>The first 7×7 convolutional layer is replaced with three consecutive 3×3 convolutional layers.</li>
<li>The number of channels in the first 1×1 convolution of each <code>bottleneck building block</code> is halved, reducing the model's computational cost with minimal performance loss.</li>
<li>To reduce overfitting, a <code>dropout layer (dropout ratio 0.2)</code> is inserted before the classification layer.</li>
<li>Label smoothing regularization is used during training.</li>
</ul>
<h2 id="d-图片分类"><span class="heading-link">d. Image classification</span></h2><p><img src="/2023/10/24/%E5%A4%9A%E5%8A%9F%E8%83%BDstreamlit-App/fd6d24f448684cb6982add980312880e.png" alt="screenshot"></p>
<h1 id="参考链接"><span class="heading-link">References</span></h1><p>[1] <span class="external-link"><a href="https://streamlit.io/" target="_blank" rel="noopener">streamlit</a><i class="fa fa-external-link"></i></span></p>
<p>[2] <span class="external-link"><a href="https://github.com/xugaoxiang/yolov5-streamlit/tree/main" target="_blank" rel="noopener">YOLOv5 detection</a><i class="fa fa-external-link"></i></span></p>
<p>[3] <span class="external-link"><a href="https://github.com/1648027181/image_classification_pytorch_app" target="_blank" rel="noopener">image classification</a><i class="fa fa-external-link"></i></span></p>
<script>
document.querySelectorAll('.github-emoji')
.forEach(el => {
if (!el.dataset.src) { return; }
const img = document.createElement('img');
img.style = 'display:none !important;';
img.src = el.dataset.src;
img.addEventListener('error', () => {
img.remove();
el.style.color = 'inherit';
el.style.backgroundImage = 'none';
el.style.background = 'none';
});
img.addEventListener('load', () => {
img.remove();
});
document.body.appendChild(img);
});
</script>]]></content>
<categories>
<category>Deep learning deployment</category>
</categories>
<tags>
<tag>CV</tag>
</tags>
</entry>
<entry>
<title>[AI Summer Camp #1 | NLP] Solving a problem with a BERT model</title>
<url>/2023/10/24/%E3%80%90%E7%AC%AC%E4%B8%80%E6%9C%9FAI%E5%A4%8F%E4%BB%A4%E8%90%A5%E4%B8%A8%E8%87%AA%E7%84%B6%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86%E3%80%91%E4%BD%BF%E7%94%A8BERT%E6%A8%A1%E5%9E%8B%E8%A7%A3%E5%86%B3%E9%97%AE%E9%A2%98/</url>
<content><![CDATA[<p>Part 1: Solving a binary text classification problem with a pretrained BERT model</p>
<h2 id="深度学习模型训练的一般步骤:"><span class="heading-link">General steps for training a deep learning model:</span></h2><ol>
<li>Import the dependencies</li>
<li>Set the global configuration</li>
<li>Read and preprocess the data</li>
<li>Build the dataset and dataloader needed for training</li>
<li>Define the prediction model</li>
<li>Define the loss function and optimizer</li>
<li>Define a validation routine that reports accuracy and loss on the validation set</li>
<li>Train the model and save the best checkpoint</li>
<li>Load the best checkpoint and run prediction on the test set</li>
<li>Feed the test data to the model and collect the results</li>
</ol>
<h2 id="1-导入前置依赖"><span class="heading-link">1. Import the dependencies</span></h2><figure class="highlight python"><div class="table-container"><table><tbody><tr><td class="code"><pre><span class="line"><span class="keyword">import</span> os</span><br><span class="line"><span class="keyword">import</span> pandas <span class="keyword">as</span> pd</span><br><span class="line"><span class="keyword">import</span> torch</span><br><span class="line"><span class="keyword">from</span> torch <span class="keyword">import</span> nn</span><br><span class="line"><span class="keyword">from</span> torch.utils.data <span class="keyword">import</span> Dataset, DataLoader</span><br><span class="line"><span class="comment"># tokenizer for the bert model</span></span><br><span class="line"><span class="keyword">from</span> transformers <span class="keyword">import</span> AutoTokenizer</span><br><span class="line"><span class="comment"># used to load the bert model</span></span><br><span class="line"><span class="keyword">from</span> transformers <span class="keyword">import</span> BertModel</span><br><span class="line"><span class="keyword">from</span> pathlib <span class="keyword">import</span> Path</span><br></pre></td></tr></tbody></table></div></figure>
<p>To import a specific function from elsewhere in the project, the pattern is:</p>
<blockquote>
<p>from folder_name.module import function </p>
</blockquote>
<p>For example, suppose the current directory contains a FaceModel folder, the folder contains faceModel.py, and that file defines a predict function. The import looks like this: </p>
<figure class="highlight python"><div class="table-container"><table><tbody><tr><td class="code"><pre><span class="line"><span class="keyword">from</span> FaceModel.faceModel <span class="keyword">import</span> predict</span><br></pre></td></tr></tbody></table></div></figure>
<h2 id="2-设置全局配置"><span class="heading-link">2. Set the global configuration</span></h2><p>This step mainly sets the hyperparameters. Hyperparameters are values fixed before the learning process begins, rather than parameters learned from the data; they usually need to be tuned so that the learner uses the combination that gives the best performance.</p>
<figure class="highlight python"><div class="table-container"><table><tbody><tr><td class="code"><pre><span class="line">batch_size = <span class="number">16</span></span><br><span class="line"><span class="comment"># maximum text length</span></span><br><span class="line">text_max_length = <span class="number">128</span></span><br><span class="line"><span class="comment"># total number of training epochs</span></span><br><span class="line">epochs = <span class="number">100</span></span><br><span class="line"><span class="comment"># learning rate</span></span><br><span class="line">lr = <span class="number">3e-5</span></span><br><span class="line"><span class="comment"># fraction of the training set held out as the validation set</span></span><br><span class="line">validation_ratio = <span class="number">0.1</span></span><br><span class="line">device = torch.device(<span class="string">'cuda'</span> <span class="keyword">if</span> torch.cuda.is_available() <span class="keyword">else</span> <span class="string">'cpu'</span>)</span><br><span class="line"></span><br><span class="line"><span class="comment"># print the loss every this many steps</span></span><br><span class="line">log_per_step = <span class="number">50</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># dataset location</span></span><br><span class="line">dataset_dir = Path(<span class="string">"./data"</span>)</span><br><span class="line">os.makedirs(dataset_dir) <span class="keyword">if</span> <span class="keyword">not</span> os.path.exists(dataset_dir) <span class="keyword">else</span> <span class="string">''</span> <span class="comment"># create the folder if it does not exist</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># model checkpoint directory</span></span><br><span class="line">model_dir = Path(<span class="string">"./model/bert_checkpoints"</span>)</span><br><span class="line"><span class="comment"># create the model directory if it does not exist</span></span><br><span class="line">os.makedirs(model_dir) <span class="keyword">if</span> <span class="keyword">not</span> os.path.exists(model_dir) <span class="keyword">else</span> <span class="string">''</span></span><br><span class="line"></span><br><span class="line">print(<span class="string">"Device:"</span>, device)</span><br></pre></td></tr></tbody></table></div></figure>
<h2 id="3-进行数据读取与数据预处理"><span class="heading-link">3. Read and preprocess the data</span></h2><p>Common preprocessing steps:</p>
<ol>
<li><strong>Data cleaning</strong>: check the data for missing values, outliers, duplicates, etc., and handle them accordingly, e.g. impute missing values, drop outliers, or treat them with statistical methods.</li>
<li><strong>Feature selection</strong>: pick the most relevant and useful features based on the problem and domain knowledge, e.g. via correlation analysis or feature importance scores.</li>
<li><strong>Feature scaling</strong>: rescale features with different ranges or magnitudes to keep the model accurate and stable; standardization and normalization are the common choices.</li>
<li><strong>Feature encoding</strong>: convert non-numeric features into numeric form the model can consume, e.g. one-hot encoding or label encoding.</li>
<li><strong>Dataset splitting</strong>: split the data into training, validation, and test sets; the training set fits the model, the validation set tunes and selects it, and the test set evaluates its performance.</li>
<li><strong>Handling class imbalance</strong>: if the classes are imbalanced, apply techniques such as undersampling or oversampling.</li>
</ol>
<p>The exact preprocessing steps depend on the data and the problem. In practice, choosing the right methods for the situation at hand matters a great deal for model performance and accuracy.</p>
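As a concrete illustration of the oversampling option mentioned above, a minority class can be resampled with replacement until it matches the majority class. This is a generic sketch with made-up data, not part of this competition's pipeline; the helper name `oversample` is mine:

```python
import pandas as pd

def oversample(df: pd.DataFrame, label_col: str, seed: int = 22) -> pd.DataFrame:
    """Resample every class with replacement up to the majority class count."""
    counts = df[label_col].value_counts()
    target = counts.max()
    parts = []
    for cls in counts.index:
        grp = df[df[label_col] == cls]
        # sample with replacement so small classes can exceed their own size
        parts.append(grp.sample(n=target, replace=True, random_state=seed))
    return pd.concat(parts).reset_index(drop=True)
```

Oversampling should be applied to the training split only; duplicating rows before the train/validation split would leak copies of the same example into both sets.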
<figure class="highlight python"><div class="table-container"><table><tbody><tr><td class="code"><pre><span class="line"><span class="comment"># read the datasets and preprocess them</span></span><br><span class="line"></span><br><span class="line">pd_train_data = pd.read_csv(<span class="string">'./data/train.csv'</span>)</span><br><span class="line">pd_train_data[<span class="string">'title'</span>] = pd_train_data[<span class="string">'title'</span>].fillna(<span class="string">''</span>) <span class="comment"># missing values are replaced with empty strings</span></span><br><span class="line">pd_train_data[<span class="string">'abstract'</span>] = pd_train_data[<span class="string">'abstract'</span>].fillna(<span class="string">''</span>)</span><br><span class="line"></span><br><span class="line">test_data = pd.read_csv(<span class="string">'./data/test.csv'</span>)</span><br><span class="line">test_data[<span class="string">'title'</span>] = test_data[<span class="string">'title'</span>].fillna(<span class="string">''</span>)</span><br><span class="line">test_data[<span class="string">'abstract'</span>] = test_data[<span class="string">'abstract'</span>].fillna(<span class="string">''</span>)</span><br><span class="line"><span class="comment"># concatenate several fields into one text string per row</span></span><br><span class="line">pd_train_data[<span class="string">'text'</span>] = pd_train_data[<span class="string">'title'</span>].fillna(<span class="string">''</span>) + <span class="string">' '</span> + pd_train_data[<span class="string">'author'</span>].fillna(<span class="string">''</span>) + <span class="string">' '</span> + pd_train_data[<span class="string">'abstract'</span>].fillna(<span class="string">''</span>)+ <span class="string">' '</span> + pd_train_data[<span class="string">'Keywords'</span>].fillna(<span class="string">''</span>)</span><br><span class="line">test_data[<span class="string">'text'</span>] = test_data[<span class="string">'title'</span>].fillna(<span class="string">''</span>) + <span class="string">' '</span> + test_data[<span class="string">'author'</span>].fillna(<span class="string">''</span>) + <span class="string">' '</span> + test_data[<span class="string">'abstract'</span>].fillna(<span class="string">''</span>)+ <span class="string">' '</span> + test_data[<span class="string">'Keywords'</span>].fillna(<span class="string">''</span>)</span><br><span class="line"></span><br><span class="line"><span class="comment"># randomly sample a validation set from the training set</span></span><br><span class="line">validation_data = pd_train_data.sample(frac=validation_ratio)</span><br><span class="line">train_data = pd_train_data[~pd_train_data.index.isin(validation_data.index)]<span class="comment"># keep the rows whose index is not in the validation set</span></span><br></pre></td></tr></tbody></table></div></figure>
<p>Build the Dataset objects for the training, validation, and test splits.</p>
<figure class="highlight python"><div class="table-container"><table><tbody><tr><td class="code"><pre><span class="line"><span class="comment"># build the Dataset</span></span><br><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">MyDataset</span><span class="params">(Dataset)</span>:</span></span><br><span class="line"></span><br><span class="line"> <span class="function"><span class="keyword">def</span> <span class="title">__init__</span><span class="params">(self, mode=<span class="string">'train'</span>)</span>:</span></span><br><span class="line"> super(MyDataset, self).__init__()</span><br><span class="line"> self.mode = mode</span><br><span class="line"> <span class="comment"># pick the split for this mode</span></span><br><span class="line"> <span class="keyword">if</span> mode == <span class="string">'train'</span>:</span><br><span class="line"> self.dataset = train_data</span><br><span class="line"> <span class="keyword">elif</span> mode == <span class="string">'validation'</span>:</span><br><span class="line"> self.dataset = validation_data</span><br><span class="line"> <span class="keyword">elif</span> mode == <span class="string">'test'</span>:</span><br><span class="line"> <span class="comment"># in test mode return the text and the uuid; using the uuid as the target makes it easy to write out the results later.</span></span><br><span class="line"> self.dataset = test_data</span><br><span class="line"> <span class="keyword">else</span>:</span><br><span class="line"> <span class="keyword">raise</span> Exception(<span class="string">"Unknown mode {}"</span>.format(mode))</span><br><span class="line"></span><br><span class="line"> <span class="function"><span class="keyword">def</span> <span class="title">__getitem__</span><span class="params">(self, index)</span>:</span></span><br><span class="line"> <span class="comment"># fetch the index-th row</span></span><br><span class="line"> data = self.dataset.iloc[index]</span><br><span class="line"> <span class="comment"># take its text</span></span><br><span class="line"> text = data[<span class="string">'text'</span>]</span><br><span class="line"> <span class="comment"># return according to the mode</span></span><br><span class="line"> <span class="keyword">if</span> self.mode == <span class="string">'test'</span>:</span><br><span class="line"> <span class="comment"># in test mode, use the uuid as the target</span></span><br><span class="line"> label = data[<span class="string">'uuid'</span>]</span><br><span class="line"> <span class="keyword">else</span>:</span><br><span class="line"> label = data[<span class="string">'label'</span>]</span><br><span class="line"> <span class="comment"># return the text and the label</span></span><br><span class="line"> <span class="keyword">return</span> text, label</span><br><span class="line"></span><br><span class="line"> <span class="function"><span class="keyword">def</span> <span class="title">__len__</span><span class="params">(self)</span>:</span></span><br><span class="line"> <span class="keyword">return</span> len(self.dataset)</span><br><span class="line"></span><br><span class="line">train_dataset = MyDataset(<span class="string">'train'</span>)</span><br><span class="line">validation_dataset = MyDataset(<span class="string">'validation'</span>)</span><br><span class="line"><span class="comment"># get the pretrained BERT tokenizer</span></span><br><span class="line">tokenizer = AutoTokenizer.from_pretrained(<span class="string">"bert-base-uncased"</span>) <span class="comment"># load the pretrained BERT tokenizer with Hugging Face's AutoTokenizer class</span></span><br></pre></td></tr></tbody></table></div></figure>
<h2 id="4-构建训练所需的dataloader与dataset"><span class="heading-link">4. Build the dataloader and dataset needed for training</span></h2><p>Next, construct the DataLoader. We need to define a collate_fn that encodes the sentences, pads them, and assembles them into batches:</p>
<figure class="highlight python"><div class="table-container"><table><tbody><tr><td class="code"><pre><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">collate_fn</span><span class="params">(batch)</span>:</span></span><br><span class="line"> <span class="string">"""</span></span><br><span class="line"><span class="string"> Convert a batch of text sentences into tensors and assemble them into a batch.</span></span><br><span class="line"><span class="string"> :param batch: a batch of sentences, e.g. [('text', target), ('text', target), ...]</span></span><br><span class="line"><span class="string"> :return: the processed result, e.g.</span></span><br><span class="line"><span class="string"> src: {'input_ids': tensor([[ 101, ..., 102, 0, 0, ...], ...]), 'attention_mask': tensor([[1, ..., 1, 0, ...], ...])}</span></span><br><span class="line"><span class="string"> target:[1, 1, 0, ...]</span></span><br><span class="line"><span class="string"> """</span></span><br><span class="line"> text, label = zip(*batch) <span class="comment"># the leading star (*) unpacks the batch into parallel tuples.</span></span><br><span class="line"> print(<span class="string">'text:'</span>,text, <span class="string">'label:'</span>,label)</span><br><span class="line"> text, label = list(text), list(label)</span><br><span class="line"></span><br><span class="line"> <span class="comment"># src is fed straight to bert, so the tokenizer output needs no further processing.</span></span><br><span class="line"> <span class="comment"># padding='max_length': pad sequences that are too short</span></span><br><span class="line"> <span class="comment"># truncation=True: clip sequences that are too long</span></span><br><span class="line"> <span class="comment"># return_tensors='pt': return PyTorch tensor objects.</span></span><br><span class="line"> src = tokenizer(text, padding=<span class="string">'max_length'</span>, max_length=text_max_length, return_tensors=<span class="string">'pt'</span>, truncation=<span class="literal">True</span>)</span><br><span class="line"> print(<span class="string">'src:'</span>,src)</span><br><span class="line"> <span class="keyword">return</span> src, torch.LongTensor(label)</span><br><span class="line"></span><br><span class="line">train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=<span class="literal">True</span>, collate_fn=collate_fn)</span><br><span class="line">validation_loader = DataLoader(validation_dataset, batch_size=batch_size, shuffle=<span class="literal">False</span>, collate_fn=collate_fn)</span><br></pre></td></tr></tbody></table></div></figure>
<p><img src="/2023/10/24/%E3%80%90%E7%AC%AC%E4%B8%80%E6%9C%9FAI%E5%A4%8F%E4%BB%A4%E8%90%A5%E4%B8%A8%E8%87%AA%E7%84%B6%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86%E3%80%91%E4%BD%BF%E7%94%A8BERT%E6%A8%A1%E5%9E%8B%E8%A7%A3%E5%86%B3%E9%97%AE%E9%A2%98/2.png" alt="image.png"><br><img src="/2023/10/24/%E3%80%90%E7%AC%AC%E4%B8%80%E6%9C%9FAI%E5%A4%8F%E4%BB%A4%E8%90%A5%E4%B8%A8%E8%87%AA%E7%84%B6%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86%E3%80%91%E4%BD%BF%E7%94%A8BERT%E6%A8%A1%E5%9E%8B%E8%A7%A3%E5%86%B3%E9%97%AE%E9%A2%98/1.png" alt=""><br><strong>A closer look at the BERT model:</strong></p>
<ol>
<li><strong>Architecture</strong>: the core of BERT is the Transformer architecture, a stack of encoder layers. Each encoder layer consists of multi-head self-attention and a feed-forward neural network.</li>
<li><strong>Pretraining</strong>: during pretraining, BERT learns text representations through two self-supervised tasks: Masked Language Model (MLM) and Next Sentence Prediction (NSP).</li>
<li><strong>MLM</strong>: the model randomly masks part of the input tokens and is trained to predict the masked tokens. This teaches it to understand context, relationships within a sentence, and token representations.</li>
<li><strong>NSP</strong>: the model receives two sentences and predicts whether they are adjacent. This teaches it sentence-level relationships and correlations between contexts.</li>
<li><strong>Fine-tuning</strong>: the pretrained BERT model can be fine-tuned on specific downstream tasks such as text classification, named entity recognition, or question answering, where it is further optimized and adapted via supervised learning.</li>
<li><strong>Input representation</strong>: the input is usually tokenized text. BERT uses WordPiece tokenization to split the input sequence into subwords; each subword has a unique id and is mapped to a vector through the embedding table.</li>
<li><strong>Output representation</strong>: every layer of BERT outputs a representation for every input position. Usually only the last layer's output is used as the text representation, although outputs from several layers can also be combined.</li>
<li><strong>Context independence vs. context sensitivity</strong>: in the author's framing, BERT is pretrained in a context-independent fashion, i.e. each input can be encoded independently of surrounding inputs; during fine-tuning and deployment it can encode context-sensitively as needed.</li>
</ol>
<p><strong>BERT's strength</strong> is that it learns better language representations: it can understand word meanings and sentence relationships from context, and it performs well across a wide range of downstream tasks. It also has limitations, such as high compute requirements, a large model size, and long training times.</p>
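To make the MLM objective above concrete, here is a toy sketch of BERT-style masking with the standard 15% / 80-10-10 scheme (those numbers come from the BERT paper, not from this post; the token ids are made up, and 103 is `[MASK]` in the bert-base-uncased vocabulary):

```python
import random

MASK_ID = 103  # [MASK] in the bert-base-uncased vocabulary

def mask_tokens(token_ids, vocab_size, p=0.15, seed=0):
    """Return (masked_ids, labels): labels keep the original id at masked
    positions and -100 (the usual ignore index) everywhere else."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tid in token_ids:
        if rng.random() < p:
            labels.append(tid)           # the model must predict the original token
            r = rng.random()
            if r < 0.8:
                masked.append(MASK_ID)   # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(rng.randrange(vocab_size))  # 10%: random token
            else:
                masked.append(tid)       # 10%: keep unchanged
        else:
            labels.append(-100)          # position not scored by the loss
            masked.append(tid)
    return masked, labels
```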
<h2 id="5-定义预测模型"><span class="heading-link">5. Define the prediction model</span></h2><figure class="highlight python"><div class="table-container"><table><tbody><tr><td class="code"><pre><span class="line"><span class="class"><span class="keyword">class</span> <span class="title">MyModel</span><span class="params">(nn.Module)</span>:</span></span><br><span class="line"></span><br><span class="line"> <span class="function"><span class="keyword">def</span> <span class="title">__init__</span><span class="params">(self)</span>:</span></span><br><span class="line"> super(MyModel, self).__init__()</span><br><span class="line"></span><br><span class="line"> <span class="comment"># load the bert model</span></span><br><span class="line"> self.bert = BertModel.from_pretrained(<span class="string">'bert-base-uncased'</span>, mirror=<span class="string">'tuna'</span>)</span><br><span class="line"></span><br><span class="line"> <span class="comment"># final prediction head</span></span><br><span class="line"> self.predictor = nn.Sequential(</span><br><span class="line"> nn.Linear(<span class="number">768</span>, <span class="number">256</span>),</span><br><span class="line"> nn.ReLU(),</span><br><span class="line"> nn.Linear(<span class="number">256</span>, <span class="number">1</span>),</span><br><span class="line"> nn.Sigmoid()</span><br><span class="line"> )</span><br><span class="line"></span><br><span class="line"> <span class="function"><span class="keyword">def</span> <span class="title">forward</span><span class="params">(self, src)</span>:</span></span><br><span class="line"> <span class="string">"""</span></span><br><span class="line"><span class="string"> :param src: the tokenized text</span></span><br><span class="line"><span class="string"> """</span></span><br><span class="line"></span><br><span class="line"> <span class="comment"># unpack src directly into bert; the tokenizer and the model come from the same checkpoint, so this is safe.</span></span><br><span class="line"> <span class="comment"># take the encoder output and use the leading [CLS] vector as input to the final linear layers</span></span><br><span class="line"> </span><br><span class="line"> <span class="comment"># ".last_hidden_state[:, 0, :]" extracts the useful part of the final hidden states:</span></span><br><span class="line"> <span class="comment"># the slice "[:, 0, :]" keeps, for every sample, the hidden state at position 0,</span></span><br><span class="line"> <span class="comment"># which corresponds to BERT's CLS (classification) token and is commonly used as the whole-text representation for classification or sequence labeling.</span></span><br><span class="line"> outputs = self.bert(**src).last_hidden_state[:, <span class="number">0</span>, :] </span><br><span class="line"></span><br><span class="line"> <span class="comment"># final prediction with the linear head</span></span><br><span class="line"> <span class="keyword">return</span> self.predictor(outputs)</span><br><span class="line"></span><br><span class="line">model = MyModel()</span><br><span class="line">model = model.to(device)</span><br></pre></td></tr></tbody></table></div></figure>
<p>BERT inserts a [CLS] token at the front of the text and uses the output vector at that position as the semantic representation of the whole text for classification. Intuitively, compared with the actual words already in the text, this token carries no semantic content of its own, so it aggregates the semantics of all the words in the text more "fairly" than any single word would.</p>
<h2 id="6-定义出损失函数和优化器"><span class="heading-link">6. Define the loss function and optimizer</span></h2><p><strong>Binary Cross Entropy</strong> is a loss function that measures the difference between two probability distributions; it is commonly used for binary classification.<br><img src="/2023/10/24/%E3%80%90%E7%AC%AC%E4%B8%80%E6%9C%9FAI%E5%A4%8F%E4%BB%A4%E8%90%A5%E4%B8%A8%E8%87%AA%E7%84%B6%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86%E3%80%91%E4%BD%BF%E7%94%A8BERT%E6%A8%A1%E5%9E%8B%E8%A7%A3%E5%86%B3%E9%97%AE%E9%A2%98/2690707a80bb483fa77edc1eec5ad60a.png" alt="Binary cross-entropy formula"><br>Here L is the loss, y is the ground-truth label (0 or 1), and p is the probability output by the model (the predicted probability of class 1). When y is 1, the first term of the loss is active and measures the log of the probability the model assigns to class 1. When y is 0, the second term is active and measures the log of the probability the model assigns to class 0.<br>Minimizing the binary cross-entropy loss drives the model toward more accurate predictions for both classes.</p>
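As a quick standalone illustration (not part of the pipeline below), the per-sample loss L = -(y·log(p) + (1-y)·log(1-p)) can be computed directly:

```python
import math

def binary_cross_entropy(p, y):
    # L = -(y*log(p) + (1-y)*log(1-p)); p is the predicted
    # probability of class 1, y is the true label (0 or 1)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# A confident correct prediction gives a small loss,
# a confident wrong prediction a large one:
print(binary_cross_entropy(0.9, 1))  # ≈ 0.105
print(binary_cross_entropy(0.9, 0))  # ≈ 2.303
```

This is exactly what `nn.BCELoss` computes per element before averaging over the batch.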
<figure class="highlight python"><div class="table-container"><table><tbody><tr><td class="code"><pre><span class="line">criteria = nn.BCELoss()</span><br><span class="line">optimizer = torch.optim.Adam(model.parameters(), lr=lr)</span><br><span class="line"></span><br><span class="line"><span class="comment"># inputs is a dict, so define a helper that moves every value to(device)</span></span><br><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">to_device</span><span class="params">(dict_tensors)</span>:</span></span><br><span class="line"> result_tensors = {}</span><br><span class="line"> <span class="keyword">for</span> key, value <span class="keyword">in</span> dict_tensors.items():</span><br><span class="line"> result_tensors[key] = value.to(device) <span class="comment"># move each tensor to the target device</span></span><br><span class="line"> <span class="keyword">return</span> result_tensors</span><br></pre></td></tr></tbody></table></div></figure>
<h2 id="7-定义一个验证方法,获取到验证集的精准率和loss"><span class="heading-link">7. Define a validation routine that returns accuracy and loss on the validation set</span></h2><figure class="highlight python"><div class="table-container"><table><tbody><tr><td class="code"><pre><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">validate</span><span class="params">()</span>:</span></span><br><span class="line"> model.eval() <span class="comment"># switch the model to evaluation mode</span></span><br><span class="line"> total_loss = <span class="number">0.</span></span><br><span class="line"> total_correct = <span class="number">0</span></span><br><span class="line"> <span class="keyword">for</span> inputs, targets <span class="keyword">in</span> validation_loader:</span><br><span class="line"> inputs, targets = to_device(inputs), targets.to(device)</span><br><span class="line"> outputs = model(inputs)</span><br><span class="line"> <span class="comment"># view(-1) flattens all elements of the output into a single dimension</span></span><br><span class="line"> loss = criteria(outputs.view(<span class="number">-1</span>), targets.float())</span><br><span class="line"> total_loss += float(loss)</span><br><span class="line"></span><br><span class="line"> <span class="comment"># Count how many predictions match the targets</span></span><br><span class="line"> correct_num = (((outputs >= <span class="number">0.5</span>).float() * <span class="number">1</span>).flatten() == targets).sum()</span><br><span class="line"> total_correct += correct_num <span class="comment"># accumulate correct predictions across batches</span></span><br><span class="line"></span><br><span class="line"> <span class="comment"># total_loss is a sum of per-batch means, so divide by the number of batches</span></span><br><span class="line"> <span class="keyword">return</span> total_correct / len(validation_dataset), total_loss / len(validation_loader)</span><br></pre></td></tr></tbody></table></div></figure>
<h2 id="8-模型训练,保存最好的模型"><span class="heading-link">8. 模型训练,保存最好的模型</span></h2><figure class="highlight python"><div class="table-container"><table><tbody><tr><td class="code"><pre><span class="line"><span class="comment"># 首先将模型调成训练模式</span></span><br><span class="line">model.train()</span><br><span class="line"></span><br><span class="line"><span class="comment"># 清空一下cuda缓存</span></span><br><span class="line"><span class="keyword">if</span> torch.cuda.is_available():</span><br><span class="line"> torch.cuda.empty_cache()</span><br><span class="line"></span><br><span class="line"><span class="comment"># 定义几个变量,帮助打印loss</span></span><br><span class="line">total_loss = <span class="number">0.</span></span><br><span class="line"><span class="comment"># 记录步数</span></span><br><span class="line">step = <span class="number">0</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># 记录在验证集上最好的准确率</span></span><br><span class="line">best_accuracy = <span class="number">0</span></span><br><span class="line"></span><br><span class="line"><span class="comment"># 开始训练</span></span><br><span class="line"><span class="keyword">for</span> epoch <span class="keyword">in</span> range(epochs):</span><br><span class="line"> model.train()</span><br><span class="line"> <span class="keyword">for</span> i, (inputs, targets) <span class="keyword">in</span> enumerate(train_loader):</span><br><span class="line"> <span class="comment"># 从batch中拿到训练数据</span></span><br><span class="line"> inputs, targets = to_device(inputs), targets.to(device)</span><br><span class="line"> <span class="comment"># 传入模型进行前向传递</span></span><br><span class="line"> outputs = model(inputs)</span><br><span class="line"> <span class="comment"># 计算损失</span></span><br><span class="line"> loss = criteria(outputs.view(<span class="number">-1</span>), targets.float())</span><br><span class="line"> loss.backward()</span><br><span class="line"> optimizer.step()</span><br><span class="line"> 
optimizer.zero_grad()</span><br><span class="line"></span><br><span class="line"> total_loss += float(loss)</span><br><span class="line"> step += <span class="number">1</span></span><br><span class="line"></span><br><span class="line"> <span class="keyword">if</span> step % log_per_step == <span class="number">0</span>:</span><br><span class="line"> print(<span class="string">"Epoch {}/{}, Step: {}/{}, total loss:{:.4f}"</span>.format(epoch+<span class="number">1</span>, epochs, i, len(train_loader), total_loss))</span><br><span class="line"> total_loss = <span class="number">0</span></span><br><span class="line"></span><br><span class="line"> <span class="keyword">del</span> inputs, targets</span><br><span class="line"></span><br><span class="line"> <span class="comment"># 一个epoch后,使用过验证集进行验证</span></span><br><span class="line"> accuracy, validation_loss = validate()</span><br><span class="line"> print(<span class="string">"Epoch {}, accuracy: {:.4f}, validation loss: {:.4f}"</span>.format(epoch+<span class="number">1</span>, accuracy, validation_loss))</span><br><span class="line"> torch.save(model, model_dir / <span class="string">f"model_<span class="subst">{epoch}</span>.pt"</span>)</span><br><span class="line"></span><br><span class="line"> <span class="comment"># 保存最好的模型</span></span><br><span class="line"> <span class="keyword">if</span> accuracy > best_accuracy:</span><br><span class="line"> torch.save(model, model_dir / <span class="string">f"model_best.pt"</span>)</span><br><span class="line"> best_accuracy = accuracy</span><br></pre></td></tr></tbody></table></div></figure>
<h2 id="9-加载最好的模型,然后进行测试集的预测"><span class="heading-link">9. Load the best model and predict on the test set</span></h2><figure class="highlight python"><div class="table-container"><table><tbody><tr><td class="code"><pre><span class="line">model = torch.load(model_dir / <span class="string">f"model_best.pt"</span>)</span><br><span class="line">model = model.eval()</span><br><span class="line"></span><br><span class="line">test_dataset = MyDataset(<span class="string">'test'</span>)</span><br><span class="line">test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=<span class="literal">False</span>, collate_fn=collate_fn)</span><br></pre></td></tr></tbody></table></div></figure>
<h2 id="10-将测试数据送入模型,得到结果"><span class="heading-link">10. Feed the test data into the model to get the results</span></h2><figure class="highlight python"><div class="table-container"><table><tbody><tr><td class="code"><pre><span class="line">results = []</span><br><span class="line"><span class="keyword">for</span> inputs, ids <span class="keyword">in</span> test_loader:</span><br><span class="line"> outputs = model(to_device(inputs)) <span class="comment"># inputs is a dict, so reuse the to_device helper</span></span><br><span class="line"> outputs = (outputs >= <span class="number">0.5</span>).int().flatten().tolist()</span><br><span class="line"> ids = ids.tolist()</span><br><span class="line"> results = results + [(id, result) <span class="keyword">for</span> result, id <span class="keyword">in</span> zip(outputs, ids)]</span><br><span class="line"></span><br><span class="line"></span><br><span class="line">test_label = [pair[<span class="number">1</span>] <span class="keyword">for</span> pair <span class="keyword">in</span> results]</span><br><span class="line">test_data[<span class="string">'label'</span>] = test_label</span><br><span class="line">test_data[[<span class="string">'uuid'</span>, <span class="string">'label'</span>]].to_csv(<span class="string">'submit_task1_test.csv'</span>, index=<span class="literal">None</span>)</span><br></pre></td></tr></tbody></table></div></figure>
<h1 id="二、Bert-for-关键词提取"><span class="heading-link">II. BERT for keyword extraction</span></h1><h2 id="1-导入前置依赖-1"><span class="heading-link">1. Import dependencies</span></h2><figure class="highlight python"><div class="table-container"><table><tbody><tr><td class="code"><pre><span class="line"><span class="comment"># pandas for reading tabular data</span></span><br><span class="line"><span class="keyword">import</span> pandas <span class="keyword">as</span> pd</span><br><span class="line"></span><br><span class="line"><span class="comment"># Bag-of-words import; CountVectorizer can be replaced with TfidfVectorizer (TF-IDF, term frequency-inverse document frequency) -- update the rest of the code to match; in my tests the latter works better</span></span><br><span class="line"><span class="keyword">from</span> sklearn.feature_extraction.text <span class="keyword">import</span> TfidfVectorizer</span><br><span class="line"><span class="comment"># BERT sentence-embedding model</span></span><br><span class="line"><span class="keyword">from</span> sentence_transformers <span class="keyword">import</span> SentenceTransformer</span><br><span class="line"></span><br><span class="line"><span class="comment"># Similarity utilities: to compare candidates with the document we use cosine similarity between vectors, which behaves quite well in high dimensions.</span></span><br><span class="line"><span class="keyword">from</span> sklearn.metrics.pairwise <span class="keyword">import</span> cosine_similarity</span><br><span class="line"></span><br><span class="line"><span class="comment"># Suppress warning messages</span></span><br><span class="line"><span class="keyword">from</span> warnings <span class="keyword">import</span> simplefilter</span><br><span class="line"><span class="keyword">from</span> sklearn.exceptions <span class="keyword">import</span> ConvergenceWarning</span><br><span class="line">simplefilter(<span class="string">"ignore"</span>, category=ConvergenceWarning)</span><br><span class="line"></span><br><span class="line">texts=[<span class="string">"dog cat fish"</span>,<span class="string">"dog cat cat"</span>,<span class="string">"fish bird"</span>, <span class="string">'bird'</span>] <span class="comment"># each element, e.g. "dog cat fish", is the string of one document</span></span><br><span class="line">cv = TfidfVectorizer() <span class="comment"># build the vectorizer</span></span><br><span class="line">cv_fit=cv.fit_transform(texts)</span><br><span class="line"><span class="comment"># the line above is equivalent to the two lines below</span></span><br><span class="line"><span class="comment">#cv.fit(texts)</span></span><br><span class="line"><span class="comment">#cv_fit=cv.transform(texts)</span></span><br><span class="line"></span><br><span class="line">print(cv.get_feature_names()) <span class="comment">#['bird', 'cat', 'dog', 'fish'] -- the learned vocabulary as a list</span></span><br><span class="line"></span><br><span class="line">print(cv.vocabulary_ ) <span class="comment"># {'dog': 2, 'cat': 1, 'fish': 3, 'bird': 0} -- dict mapping each word to its column index (not its frequency)</span></span><br><span class="line"></span><br><span class="line">print(cv_fit) <span class="comment"># sparse matrix: (document index, vocabulary index) -> TF-IDF weight</span></span><br></pre></td></tr></tbody></table></div></figure>
<h2 id="2-读取数据集并处理"><span class="heading-link">2. Read and preprocess the dataset</span></h2><figure class="highlight python"><div class="table-container"><table><tbody><tr><td class="code"><pre><span class="line"><span class="comment"># Read the dataset</span></span><br><span class="line">test = pd.read_csv(<span class="string">'./data/testB.csv'</span>)</span><br><span class="line">test[<span class="string">'title'</span>] = test[<span class="string">'title'</span>].fillna(<span class="string">''</span>)</span><br><span class="line">test[<span class="string">'abstract'</span>] = test[<span class="string">'abstract'</span>].fillna(<span class="string">''</span>)</span><br><span class="line"></span><br><span class="line"></span><br><span class="line">test[<span class="string">'text'</span>] = test[<span class="string">'title'</span>].fillna(<span class="string">''</span>) + <span class="string">' '</span> + test[<span class="string">'abstract'</span>].fillna(<span class="string">''</span>)</span><br><span class="line"></span><br><span class="line"><span class="comment"># Define stop words: frequent words that carry little information about the article</span></span><br><span class="line">stops = [i.strip() <span class="keyword">for</span> i <span class="keyword">in</span> open(<span class="string">r'stop.txt'</span>, encoding=<span class="string">'utf-8'</span>).readlines()]</span><br></pre></td></tr></tbody></table></div></figure>
<p>Use n_gram_range to control the length of the candidate phrases. For example, setting it to (3, 3) makes every candidate a phrase of three words. The variable candidates is then simply a list of strings holding our candidate keywords or keyphrases.</p>
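To make the n-gram idea concrete, here is a minimal dependency-free sketch of how word n-grams are enumerated (the helper name ngram_candidates is made up for illustration; the actual pipeline relies on the vectorizer's ngram_range parameter instead):

```python
def ngram_candidates(text, n):
    # Enumerate the distinct n-word phrases of a text -- the same
    # candidate set that ngram_range=(n, n) restricts a vectorizer to.
    words = text.lower().split()
    return sorted({" ".join(words[i:i + n]) for i in range(len(words) - n + 1)})

print(ngram_candidates("deep learning for keyword extraction", 3))
# ['deep learning for', 'for keyword extraction', 'learning for keyword']
```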
<h2 id="3-Embeddings"><span class="heading-link">3. Embeddings</span></h2><figure class="highlight python"><div class="table-container"><table><tbody><tr><td class="code"><pre><span class="line">model = SentenceTransformer(<span class="string">r'xlm-r-distilroberta-base-paraphrase-v1'</span>)</span><br><span class="line"></span><br><span class="line">test_words = []</span><br><span class="line"><span class="keyword">for</span> row <span class="keyword">in</span> test.iterrows():</span><br><span class="line"> <span class="comment"># Read each row's title and abstract and extract keywords</span></span><br><span class="line"> </span><br><span class="line"> n_gram_range = (<span class="number">2</span>,<span class="number">2</span>) <span class="comment"># consider only 2-grams, i.e. phrases of two consecutive words</span></span><br><span class="line"> <span class="comment"># Use TF-IDF to obtain the candidate keywords</span></span><br><span class="line"> count = TfidfVectorizer(ngram_range=n_gram_range, stop_words=stops).fit([row[<span class="number">1</span>].text]) <span class="comment"># fit a TF-IDF vectorizer on this document's text</span></span><br><span class="line"> candidates = count.get_feature_names()</span><br><span class="line"> print(candidates)</span><br><span class="line"> <span class="comment"># Convert the title and the candidate keywords/keyphrases into numerical vectors, using BERT</span></span><br><span class="line"> title_embedding = model.encode([row[<span class="number">1</span>].title])</span><br><span class="line"> </span><br><span class="line"> candidate_embeddings = model.encode(candidates)</span><br></pre></td></tr></tbody></table></div></figure>
<h2 id="4-Cosine-Similarity"><span class="heading-link">4. Cosine Similarity</span></h2><p>We want the candidate words or phrases most similar to the document, on the assumption that the candidates most similar to a document are good keywords/keyphrases for it. To measure the similarity between candidates and the document we use cosine similarity between their vectors, which performs quite well in high dimensions.</p>
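Cosine similarity is just the normalized dot product of two vectors; a self-contained sketch (independent of the sklearn helper used in the code below):

```python
import math

def cosine(u, v):
    # cos(u, v) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # 1.0 -- same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0 -- orthogonal
```

Because it depends only on direction and not on magnitude, it is a natural choice for comparing high-dimensional embeddings.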
<figure class="highlight python"><div class="table-container"><table><tbody><tr><td class="code"><pre><span class="line"><span class="comment"># Change this parameter to adjust the number of keywords</span></span><br><span class="line">top_n = <span class="number">35</span></span><br><span class="line"><span class="comment"># Use the article title to further extract keywords</span></span><br><span class="line">distances = cosine_similarity(title_embedding, candidate_embeddings)</span><br><span class="line">keywords = [candidates[index] <span class="keyword">for</span> index <span class="keyword">in</span> distances.argsort()[<span class="number">0</span>][-top_n:]]</span><br><span class="line"></span><br><span class="line"><span class="keyword">if</span> len(keywords) == <span class="number">0</span>:</span><br><span class="line"> keywords = [<span class="string">'A'</span>, <span class="string">'B'</span>]</span><br><span class="line">test_words.append(<span class="string">'; '</span>.join(keywords))</span><br></pre></td></tr></tbody></table></div></figure>
<figure class="highlight python"><div class="table-container"><table><tbody><tr><td class="code"><pre><span class="line">print(keywords)</span><br><span class="line"></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">输出结果:</span></span><br><span class="line"><span class="string">['monomers roasting', 'ara monomers', 'enzyme linked', 'degranulation basophils', </span></span><br><span class="line"><span class="string">'matrix amount', 'total proteins', 'stimulate degranulation', 'roasting ara', </span></span><br><span class="line"><span class="string">'allergenicity increase', 'structure ara', 'allergenicity cross', 'allergenicity change',</span></span><br><span class="line"><span class="string">'proteins iac', 'addition methylation', 'processing roasting', 'food allergy', </span></span><br><span class="line"><span class="string">'derivatives roasting', 'ara roasted', 'ara matrix', 'processing structure', </span></span><br><span class="line"><span class="string">'reflect allergenicity', 'oxidation modification', 'allergenicity ara', </span></span><br><span class="line"><span class="string">'blotting enzyme', 'reduce allergenicity', 'potential allergenicity', </span></span><br><span class="line"><span class="string">'terms allergenicity', 'roasted matrix', 'peanut allergy', 'matrix peanut', </span></span><br><span class="line"><span class="string">'methylation oxidation', 'structure allergenicity', 'allergenicity processing', </span></span><br><span class="line"><span class="string">'allergenicity peanut', 'peanut allergen']</span></span><br><span class="line"><span class="string"></span></span><br><span class="line"><span class="string">'''</span></span><br></pre></td></tr></tbody></table></div></figure>
<p>The keywords/keyphrases are all so similar to one another that it is worth considering a strategy to diversify the results.</p>
<h2 id="5-Diversification"><span class="heading-link">5. Diversification</span></h2><p>Diversifying the results means striking a delicate balance between the accuracy of the keywords/keyphrases and the diversity among them. We use two algorithms to diversify the results. <strong>See also:</strong> <span class="external-link"><a href="https://www.heywhale.com/mw/project/5fe7457e5e24ed0030239a11" target="_blank" rel="noopener"><strong>基于上下文语境的文档关键词提取</strong></a><i class="fa fa-external-link"></i></span></p>
<ul>
<li>Max Sum Similarity</li>
<li>Maximal Marginal Relevance</li>
</ul>
<h3 id="5-1-Max-Sum-Similarity(最大相似度)"><span class="heading-link">5.1 Max Sum Similarity(最大相似度)</span></h3><figure class="highlight python"><div class="table-container"><table><tbody><tr><td class="code"><pre><span class="line"><span class="keyword">import</span> numpy <span class="keyword">as</span> np</span><br><span class="line"><span class="keyword">import</span> itertools</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">max_sum_sim</span><span class="params">(doc_embedding, word_embeddings, words, top_n, nr_candidates)</span>:</span></span><br><span class="line"> <span class="string">'''</span></span><br><span class="line"><span class="string"> 使用余弦相似度计算文档嵌入向量和候选词嵌入向量之间的相似度。</span></span><br><span class="line"><span class="string"> 根据余弦相似度的值,选择具有最高相似度的候选词作为候选集合。</span></span><br><span class="line"><span class="string"> 构建候选词之间相似度的矩阵。</span></span><br><span class="line"><span class="string"> 使用贪心算法,选择使得相似度之和最大化的词组合作为最终的多样化结果。</span></span><br><span class="line"><span class="string"> '''</span></span><br><span class="line"> <span class="comment"># Calculate distances and extract keywords</span></span><br><span class="line"> distances = cosine_similarity(doc_embedding, candidate_embeddings)</span><br><span class="line"> distances_candidates = cosine_similarity(candidate_embeddings, </span><br><span class="line"> candidate_embeddings)</span><br><span class="line"></span><br><span class="line"> <span class="comment"># Get top_n words as candidates based on cosine similarity</span></span><br><span class="line"> words_idx = list(distances.argsort()[<span class="number">0</span>][-nr_candidates:])</span><br><span class="line"> words_vals = [candidates[index] <span class="keyword">for</span> index <span class="keyword">in</span> words_idx]</span><br><span class="line"> distances_candidates = distances_candidates[np.ix_(words_idx, words_idx)]</span><br><span 
class="line"></span><br><span class="line"> <span class="comment"># Calculate the combination of words that are the least similar to each other</span></span><br><span class="line"> min_sim = np.inf</span><br><span class="line"> candidate = <span class="literal">None</span></span><br><span class="line"> <span class="keyword">for</span> combination <span class="keyword">in</span> itertools.combinations(range(len(words_idx)), top_n):</span><br><span class="line"> sim = sum([distances_candidates[i][j] <span class="keyword">for</span> i <span class="keyword">in</span> combination <span class="keyword">for</span> j <span class="keyword">in</span> combination <span class="keyword">if</span> i != j])</span><br><span class="line"> <span class="keyword">if</span> sim < min_sim:</span><br><span class="line"> candidate = combination</span><br><span class="line"> min_sim = sim</span><br><span class="line"></span><br><span class="line"> <span class="keyword">return</span> [words_vals[idx] <span class="keyword">for</span> idx <span class="keyword">in</span> candidate]</span><br></pre></td></tr></tbody></table></div></figure>
<figure class="highlight python"><div class="table-container"><table><tbody><tr><td class="code"><pre><span class="line">max_sum_sim(doc_embedding=title_embedding, </span><br><span class="line"> word_embeddings=candidate_embeddings, </span><br><span class="line"> words=candidates, </span><br><span class="line"> top_n=<span class="number">10</span>, </span><br><span class="line"> nr_candidates=<span class="number">10</span>)</span><br><span class="line"></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">输出结果:</span></span><br><span class="line"><span class="string">['potential allergenicity',</span></span><br><span class="line"><span class="string"> 'terms allergenicity',</span></span><br><span class="line"><span class="string"> 'roasted matrix',</span></span><br><span class="line"><span class="string"> 'peanut allergy',</span></span><br><span class="line"><span class="string"> 'matrix peanut',</span></span><br><span class="line"><span class="string"> 'methylation oxidation',</span></span><br><span class="line"><span class="string"> 'structure allergenicity',</span></span><br><span class="line"><span class="string"> 'allergenicity processing',</span></span><br><span class="line"><span class="string"> 'allergenicity peanut',</span></span><br><span class="line"><span class="string"> 'peanut allergen']</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"></span><br><span class="line"></span><br><span class="line">max_sum_sim(doc_embedding=title_embedding, </span><br><span class="line"> word_embeddings=candidate_embeddings, </span><br><span class="line"> words=candidates, </span><br><span class="line"> top_n=<span class="number">10</span>, </span><br><span class="line"> nr_candidates=<span class="number">20</span>)</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="string">'''</span></span><br><span 
class="line"><span class="string">输出结果:</span></span><br><span class="line"><span class="string">['derivatives roasting',</span></span><br><span class="line"><span class="string"> 'ara roasted',</span></span><br><span class="line"><span class="string"> 'ara matrix',</span></span><br><span class="line"><span class="string"> 'processing structure',</span></span><br><span class="line"><span class="string"> 'oxidation modification',</span></span><br><span class="line"><span class="string"> 'reduce allergenicity',</span></span><br><span class="line"><span class="string"> 'potential allergenicity',</span></span><br><span class="line"><span class="string"> 'peanut allergy',</span></span><br><span class="line"><span class="string"> 'matrix peanut',</span></span><br><span class="line"><span class="string"> 'allergenicity processing']</span></span><br><span class="line"><span class="string"> </span></span><br><span class="line"><span class="string">'''</span></span><br></pre></td></tr></tbody></table></div></figure>
<p><strong>A higher nr_candidates value produces more diverse keywords/keyphrases, but they represent the document less well.</strong></p>
<figure class="highlight python"><div class="table-container"><table><tbody><tr><td class="code"><pre><span class="line">test_words = []</span><br><span class="line"><span class="keyword">for</span> row <span class="keyword">in</span> test.iterrows():</span><br><span class="line"> <span class="comment"># 读取第每一行数据的标题与摘要并提取关键词</span></span><br><span class="line"> </span><br><span class="line"> n_gram_range = (<span class="number">2</span>,<span class="number">2</span>) <span class="comment">#要考虑的 n-gram 的范围是 2-gram 到 2-gram,也就是只考虑连续两个词组成的序列。</span></span><br><span class="line"> <span class="comment"># 这里我们使用TF-IDF算法来获取候选关键词 </span></span><br><span class="line"> count = TfidfVectorizer(ngram_range=n_gram_range, stop_words=stops).fit([row[<span class="number">1</span>].text]) <span class="comment"># 从一个文本数据集中创建了一个 TF-IDF 特征计数器(feature counter)。</span></span><br><span class="line"> candidates = count.get_feature_names()</span><br><span class="line"> print(candidates)</span><br><span class="line"> <span class="comment"># 将文本标题以及候选关键词/关键短语转换为数值型数据(numerical data)。我们使用BERT来实现这一目的</span></span><br><span class="line"> title_embedding = model.encode([row[<span class="number">1</span>].title])</span><br><span class="line"> </span><br><span class="line"> candidate_embeddings = model.encode(candidates)</span><br><span class="line"> </span><br><span class="line"> <span class="comment"># 通过修改这个参数来更改关键词数量</span></span><br><span class="line"> top_n = <span class="number">35</span></span><br><span class="line"> <span class="comment"># 利用文章标题进一步提取关键词.</span></span><br><span class="line"> <span class="comment">###########################################################################</span></span><br><span class="line"> keywords = max_sum_sim(doc_embedding=title_embedding, </span><br><span class="line"> word_embeddings=candidate_embeddings, </span><br><span class="line"> words=candidates, </span><br><span class="line"> top_n=<span class="number">10</span>, </span><br><span 
class="line"> nr_candidates=<span class="number">10</span>)</span><br><span class="line"> <span class="comment">###########################################################################</span></span><br><span class="line"><span class="comment"># distances = cosine_similarity(title_embedding, candidate_embeddings)</span></span><br><span class="line"><span class="comment"># keywords = [candidates[index] for index in distances.argsort()[0][-top_n:]]</span></span><br><span class="line"> </span><br><span class="line"> <span class="keyword">if</span> len( keywords) == <span class="number">0</span>:</span><br><span class="line"> keywords = [<span class="string">'A'</span>, <span class="string">'B'</span>]</span><br><span class="line"> test_words.append(<span class="string">'; '</span>.join( keywords))</span><br></pre></td></tr></tbody></table></div></figure>
<h3 id="5-2-Maximal-Marginal-Relevance(最大边际相关性)"><span class="heading-link">5.2 Maximal Marginal Relevance</span></h3><p><strong>Maximal Marginal Relevance tries to minimize redundancy and maximize the diversity of the results in text summarization tasks.</strong></p>
<figure class="highlight python"><div class="table-container"><table><tbody><tr><td class="code"><pre><span class="line"><span class="keyword">import</span> numpy <span class="keyword">as</span> np</span><br><span class="line"></span><br><span class="line"><span class="function"><span class="keyword">def</span> <span class="title">mmr</span><span class="params">(doc_embedding, word_embeddings, words, top_n, diversity)</span>:</span></span><br><span class="line"> <span class="string">'''</span></span><br><span class="line"><span class="string"> 使用余弦相似度计算每个候选词与文档嵌入向量的相似度以及候选词之间的相似度。</span></span><br><span class="line"><span class="string"> 初始化已选择的关键词列表,首先选择与文档嵌入向量相似度最高的候选词作为第一个关键词。</span></span><br><span class="line"><span class="string"> 根据要选择的关键词数量 top_n 进行循环迭代,每次选择与已选择关键词之间边际相关性最大的候选词作为下一个关键词。</span></span><br><span class="line"><span class="string"> 更新已选择的关键词列表和候选词列表。</span></span><br><span class="line"><span class="string"> '''</span></span><br><span class="line"> <span class="comment"># Extract similarity within words, and between words and the document</span></span><br><span class="line"> word_doc_similarity = cosine_similarity(word_embeddings, doc_embedding)</span><br><span class="line"> word_similarity = cosine_similarity(word_embeddings)</span><br><span class="line"></span><br><span class="line"> <span class="comment"># Initialize candidates and already choose best keyword/keyphras</span></span><br><span class="line"> keywords_idx = [np.argmax(word_doc_similarity)]</span><br><span class="line"> candidates_idx = [i <span class="keyword">for</span> i <span class="keyword">in</span> range(len(words)) <span class="keyword">if</span> i != keywords_idx[<span class="number">0</span>]]</span><br><span class="line"></span><br><span class="line"> <span class="keyword">for</span> _ <span class="keyword">in</span> range(top_n - <span class="number">1</span>):</span><br><span class="line"> <span class="comment"># Extract similarities within candidates 
and</span></span><br><span class="line"> <span class="comment"># between candidates and selected keywords/phrases</span></span><br><span class="line"> candidate_similarities = word_doc_similarity[candidates_idx, :]</span><br><span class="line"> target_similarities = np.max(word_similarity[candidates_idx][:, keywords_idx], axis=<span class="number">1</span>)</span><br><span class="line"></span><br><span class="line"> <span class="comment"># Calculate MMR</span></span><br><span class="line"> mmr = (<span class="number">1</span>-diversity) * candidate_similarities - diversity * target_similarities.reshape(<span class="number">-1</span>, <span class="number">1</span>)</span><br><span class="line"> mmr_idx = candidates_idx[np.argmax(mmr)]</span><br><span class="line"></span><br><span class="line"> <span class="comment"># Update keywords & candidates</span></span><br><span class="line"> keywords_idx.append(mmr_idx)</span><br><span class="line"> candidates_idx.remove(mmr_idx)</span><br><span class="line"></span><br><span class="line"> <span class="keyword">return</span> [words[idx] <span class="keyword">for</span> idx <span class="keyword">in</span> keywords_idx]</span><br></pre></td></tr></tbody></table></div></figure>
<figure class="highlight python"><div class="table-container"><table><tbody><tr><td class="code"><pre><span class="line">mmr(doc_embedding=title_embedding, </span><br><span class="line"> word_embeddings=candidate_embeddings,</span><br><span class="line"> words=candidates, </span><br><span class="line"> top_n=<span class="number">20</span>, </span><br><span class="line"> diversity=<span class="number">0.2</span>)</span><br><span class="line"></span><br><span class="line"><span class="comment">#---------------------------------------------------------------</span></span><br><span class="line"></span><br><span class="line">mmr(doc_embedding=title_embedding, </span><br><span class="line"> word_embeddings=candidate_embeddings,</span><br><span class="line"> words=candidates, </span><br><span class="line"> top_n=<span class="number">20</span>, </span><br><span class="line"> diversity=<span class="number">0.8</span>)</span><br><span class="line"></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">同样的,较高的多样性数值会生成非常多样化的关键词/关键短语</span></span><br><span class="line"><span class="string">'''</span></span><br></pre></td></tr></tbody></table></div></figure>
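<p>上面的 mmr 依赖外部的嵌入向量,为了直观看到 diversity 的作用,这里给出一个可独立运行的最小示意:用 numpy 自己实现余弦相似度(与 sklearn 的 cosine_similarity 行为一致),并在一组<strong>假设性的玩具嵌入</strong>上跑同样的选词逻辑。其中 doc、cand、words 均为演示用的虚构数据。</p>

```python
import numpy as np

def cosine_similarity(a, b=None):
    # sklearn.metrics.pairwise.cosine_similarity 的简化等价实现
    if b is None:
        b = a
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T

def mmr(doc_embedding, word_embeddings, words, top_n, diversity):
    # 与正文中的 mmr 相同的选词过程
    word_doc_similarity = cosine_similarity(word_embeddings, doc_embedding)
    word_similarity = cosine_similarity(word_embeddings)
    keywords_idx = [int(np.argmax(word_doc_similarity))]
    candidates_idx = [i for i in range(len(words)) if i != keywords_idx[0]]
    for _ in range(top_n - 1):
        candidate_similarities = word_doc_similarity[candidates_idx, :]
        target_similarities = np.max(word_similarity[candidates_idx][:, keywords_idx], axis=1)
        score = (1 - diversity) * candidate_similarities - diversity * target_similarities.reshape(-1, 1)
        mmr_idx = candidates_idx[int(np.argmax(score))]
        keywords_idx.append(mmr_idx)
        candidates_idx.remove(mmr_idx)
    return [words[idx] for idx in keywords_idx]

# 玩具数据:文档向量靠近 "deep" 与 "learning";"pizza" 与文档几乎无关
doc = np.array([[1.0, 1.0, 0.0]])
cand = np.array([[1.0, 0.9, 0.0],   # deep
                 [0.9, 1.0, 0.0],   # learning
                 [0.0, 0.1, 1.0]])  # pizza
words = ['deep', 'learning', 'pizza']
print(mmr(doc, cand, words, top_n=2, diversity=0.2))
print(mmr(doc, cand, words, top_n=2, diversity=0.9))
```

<p>低 diversity 时第二个关键词选与文档最相关(但与第一个词高度重复)的 learning;高 diversity 时则被惩罚项压过,选出与已选词差异最大的 pizza,这正对应"较高的多样性数值会生成非常多样化的关键词"的现象。</p>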
<h1 id="参考"><span class="heading-link">参考</span></h1><p>[1]<span class="external-link"><a href="https://datawhaler.feishu.cn/docx/EVoodR6WroWZxXxa3a0cukIanRO" target="_blank" rel="noopener"> AI夏令营 - NLP实践教程</a><i class="fa fa-external-link"></i></span><br>[2] <span class="external-link"><a href="https://www.heywhale.com/mw/project/5fe7457e5e24ed0030239a11" target="_blank" rel="noopener">基于上下文语境的文档关键词提取</a><i class="fa fa-external-link"></i></span></p>
<script>
document.querySelectorAll('.github-emoji')
.forEach(el => {
if (!el.dataset.src) { return; }
const img = document.createElement('img');
img.style = 'display:none !important;';
img.src = el.dataset.src;
img.addEventListener('error', () => {
img.remove();
el.style.color = 'inherit';
el.style.backgroundImage = 'none';
el.style.background = 'none';
});
img.addEventListener('load', () => {
img.remove();
});
document.body.appendChild(img);
});
</script>]]></content>
<categories>
<category>【第一期AI夏令营丨自然语言处理】</category>
</categories>
<tags>
<tag>NLP</tag>
</tags>
</entry>
<entry>
<title>【第一期AI夏令营丨自然语言处理】任务一:文本二分类_&&Baseline代码分析及改进</title>
<url>/2023/10/24/%E3%80%90%E7%AC%AC%E4%B8%80%E6%9C%9FAI%E5%A4%8F%E4%BB%A4%E8%90%A5%E4%B8%A8%E8%87%AA%E7%84%B6%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86%E3%80%91%E4%BB%BB%E5%8A%A1%E4%B8%80%EF%BC%9A%E6%96%87%E6%9C%AC%E4%BA%8C%E5%88%86%E7%B1%BB-Baseline%E4%BB%A3%E7%A0%81%E5%88%86%E6%9E%90%E5%8F%8A%E6%94%B9%E8%BF%9B/</url>
<content><![CDATA[
<h2 id="一、问题分析"><span class="heading-link">一、问题分析</span></h2><p><strong>从论文标题、摘要、作者等信息,判断该论文是否属于医学领域的文献。</strong><br>针对文本分类任务,可以提供两种实践思路,一种是使用传统的特征提取方法(如TF-IDF/BOW)结合机器学习模型,另一种是使用预训练的BERT模型进行建模。使用特征提取 + 机器学习的思路步骤如下:</p>
<ol>
<li><strong>数据预处理</strong> :首先,对文本数据进行预处理,包括文本清洗(如去除特殊字符、标点符号)、分词等操作。可以使用常见的NLP工具包(如NLTK或spaCy)来辅助进行预处理。</li>
<li><strong>特征提取</strong>:使用TF-IDF(词频-逆文档频率)或BOW(词袋模型)方法将文本转换为向量表示。TF-IDF可以计算文本中词语的重要性,而BOW则简单地统计每个词语在文本中的出现次数。可以使用scikit-learn库的TfidfVectorizer或CountVectorizer来实现特征提取。</li>
<li><strong>构建训练集和测试集</strong>:将预处理后的文本数据分割为训练集和测试集,确保数据集的样本分布均匀。</li>
<li><strong>选择机器学习模型</strong>:根据实际情况选择适合的机器学习模型,如朴素贝叶斯、支持向量机(SVM)、随机森林等。这些模型在文本分类任务中表现良好。可以使用scikit-learn库中相应的分类器进行模型训练和评估。</li>
<li><strong>模型训练和评估</strong>:使用训练集对选定的机器学习模型进行训练,然后使用测试集进行评估。评估指标可以选择准确率、精确率、召回率、F1值等。</li>
<li><strong>调参优化</strong>:如果模型效果不理想,可以尝试调整特征提取的参数(如词频阈值、词袋大小等)或机器学习模型的参数,以获得更好的性能。</li>
</ol>
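<p>上面第 2 步中 BOW 与 TF-IDF 的区别,可以用一个纯 Python 的最小示意来理解(仅为原理示意,语料为假设性的玩具数据;实际任务中请直接使用 sklearn 的 CountVectorizer / TfidfVectorizer):</p>

```python
import math
from collections import Counter

docs = ["the cat sat", "the dog sat", "the cat ran"]  # 玩具语料(假设性示例)

# 1) BOW:统计每个词在每个文档中的出现次数
vocab = sorted({w for d in docs for w in d.split()})
bow = [[Counter(d.split())[w] for w in vocab] for d in docs]

# 2) TF-IDF:词频乘以逆文档频率,压低 "the" 这类在所有文档都出现的高频词
n = len(docs)
df = {w: sum(1 for d in docs if w in d.split()) for w in vocab}
idf = {w: math.log(n / df[w]) + 1 for w in vocab}  # 平滑方式有多种,这里取最简单的一种
tfidf = [[Counter(d.split())[w] * idf[w] for w in vocab] for d in docs]

print(vocab)    # ['cat', 'dog', 'ran', 'sat', 'the']
print(bow[0])   # [1, 0, 0, 1, 1]
```

<p>可以看到,在 BOW 中 "the" 与 "cat" 的权重相同,而在 TF-IDF 中出现在全部文档里的 "the" 权重被压低,这就是"TF-IDF可以计算文本中词语的重要性"的含义。</p>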
<h2 id="二、Baseline代码分析"><span class="heading-link">二、Baseline代码分析</span></h2><figure class="highlight python"><div class="table-container"><table><tbody><tr><td class="code"><pre><span class="line"><span class="comment"># 导入pandas用于读取表格数据</span></span><br><span class="line"><span class="keyword">import</span> pandas <span class="keyword">as</span> pd</span><br><span class="line"></span><br><span class="line"><span class="comment"># 导入BOW(词袋模型),可以选择将CountVectorizer替换为TfidfVectorizer(TF-IDF(词频-逆文档频率)),注意上下文要同时修改,亲测后者效果更佳</span></span><br><span class="line"><span class="keyword">from</span> sklearn.feature_extraction.text <span class="keyword">import</span> CountVectorizer</span><br><span class="line"></span><br><span class="line"><span class="comment"># 导入LogisticRegression回归模型</span></span><br><span class="line"><span class="keyword">from</span> sklearn.linear_model <span class="keyword">import</span> LogisticRegression</span><br><span class="line"></span><br><span class="line"><span class="comment"># 过滤警告消息</span></span><br><span class="line"><span class="keyword">from</span> warnings <span class="keyword">import</span> simplefilter</span><br><span class="line"><span class="keyword">from</span> sklearn.exceptions <span class="keyword">import</span> ConvergenceWarning</span><br><span class="line">simplefilter(<span class="string">"ignore"</span>, category=ConvergenceWarning)</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="comment"># 读取数据集</span></span><br><span class="line">train = pd.read_csv(<span class="string">'/home/aistudio/data/data231041/train.csv'</span>)</span><br><span class="line">train[<span class="string">'title'</span>] = train[<span class="string">'title'</span>].fillna(<span class="string">''</span>) <span class="comment"># 缺失值填充为一个空字符串</span></span><br><span class="line">train[<span class="string">'abstract'</span>] = train[<span class="string">'abstract'</span>].fillna(<span 
class="string">''</span>)</span><br><span class="line"></span><br><span class="line">test = pd.read_csv(<span class="string">'/home/aistudio/data/data231041/test.csv'</span>)</span><br><span class="line">test[<span class="string">'title'</span>] = test[<span class="string">'title'</span>].fillna(<span class="string">''</span>)</span><br><span class="line">test[<span class="string">'abstract'</span>] = test[<span class="string">'abstract'</span>].fillna(<span class="string">''</span>)</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="comment"># 提取文本特征,生成训练集与测试集</span></span><br><span class="line">train[<span class="string">'text'</span>] = train[<span class="string">'title'</span>].fillna(<span class="string">''</span>) + <span class="string">' '</span> + train[<span class="string">'author'</span>].fillna(<span class="string">''</span>) + <span class="string">' '</span> + train[<span class="string">'abstract'</span>].fillna(<span class="string">''</span>)+ <span class="string">' '</span> + train[<span class="string">'Keywords'</span>].fillna(<span class="string">''</span>) <span class="comment"># 合并成一行数据</span></span><br><span class="line">test[<span class="string">'text'</span>] = test[<span class="string">'title'</span>].fillna(<span class="string">''</span>) + <span class="string">' '</span> + test[<span class="string">'author'</span>].fillna(<span class="string">''</span>) + <span class="string">' '</span> + test[<span class="string">'abstract'</span>].fillna(<span class="string">''</span>)+ <span class="string">' '</span> + test[<span class="string">'Keywords'</span>].fillna(<span class="string">''</span>)</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">vector = CountVectorizer().fit(train['text']): 它使用了 CountVectorizer 对象对文本数据进行了分词、去除停用词、统计词频等操作,将每个文本转换为一个向量。</span></span><br><span 
class="line"><span class="string">最后,使用 fit 方法根据训练数据构建词表,并将其保存在 CountVectorizer 对象中,以便后续使用。</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line">vector = CountVectorizer().fit(train[<span class="string">'text'</span>]) <span class="comment"># 输出: CountVectorizer()</span></span><br><span class="line"><span class="comment"># print (vector)</span></span><br><span class="line">train_vector = vector.transform(train[<span class="string">'text'</span>]) <span class="comment"># 每个文本转换为一个向量。这里使用了 transform 方法,是为了将训练数据转换为向量形式</span></span><br><span class="line"><span class="keyword">print</span> (<span class="string">'train_vector'</span>,type(train_vector),train_vector) <span class="comment"># 输出结果: (0, 2345) 2 意思是 (0, 2345) 2 表示第一个文本中词表中编号为 2345 的单词出现了两次。</span></span><br><span class="line"></span><br><span class="line">feature_names = vector.get_feature_names_out() <span class="comment"># 获取词表中每个单词的名称,然后查找对应编号的单词。</span></span><br><span class="line">print(<span class="string">'feature_names'</span>,feature_names[<span class="number">2345</span>]) <span class="comment"># 查找2345编号的单词为access,在第一个文本中出现两次。</span></span><br></pre></td></tr></tbody></table></div></figure>
<figure class="highlight python"><div class="table-container"><table><tbody><tr><td class="code"><pre><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">输出结果如下</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line">train_vector <<span class="class"><span class="keyword">class</span> '<span class="title">scipy</span>.<span class="title">sparse</span>.<span class="title">_csr</span>.<span class="title">csr_matrix</span>'> <span class="params">(<span class="number">0</span>, <span class="number">1469</span>)</span> 1</span></span><br><span class="line"><span class="class"> <span class="params">(<span class="number">0</span>, <span class="number">2345</span>)</span> 2 # "<span class="params">(<span class="number">0</span>, <span class="number">2345</span>)</span> 2 "表示第一个文本中词表中编号为 2345 的单词出现了两次。</span></span><br><span class="line"><span class="class"> <span class="params">(<span class="number">0</span>, <span class="number">2348</span>)</span> 1</span></span><br><span class="line"><span class="class"> <span class="params">(<span class="number">0</span>, <span class="number">2349</span>)</span> 4</span></span><br><span class="line"><span class="class"> <span class="params">(<span class="number">0</span>, <span class="number">3869</span>)</span> 1</span></span><br><span class="line"><span class="class"> <span class="params">(<span class="number">0</span>, <span class="number">4268</span>)</span> 1</span></span><br><span class="line"><span class="class"> <span class="params">(<span class="number">0</span>, <span class="number">4382</span>)</span> 15</span></span><br><span class="line"><span class="class"> <span class="params">(<span class="number">0</span>, <span class="number">5112</span>)</span> 3</span></span><br><span class="line"><span class="class"> <span class="params">(<span class="number">0</span>, <span class="number">5290</span>)</span> 1</span></span><br><span 
class="line"><span class="class"> <span class="params">(<span class="number">0</span>, <span class="number">5505</span>)</span> 5</span></span><br><span class="line"><span class="class"> <span class="params">(<span class="number">0</span>, <span class="number">5575</span>)</span> 1</span></span><br><span class="line"><span class="class"> <span class="params">(<span class="number">0</span>, <span class="number">5585</span>)</span> 1</span></span><br><span class="line"><span class="class"> <span class="params">(<span class="number">0</span>, <span class="number">5586</span>)</span> 4</span></span><br><span class="line"><span class="class"> <span class="params">(<span class="number">0</span>, <span class="number">5891</span>)</span> 1</span></span><br><span class="line"><span class="class"> <span class="params">(<span class="number">0</span>, <span class="number">6096</span>)</span> 3</span></span><br><span class="line"><span class="class"> <span class="params">(<span class="number">0</span>, <span class="number">6106</span>)</span> 1</span></span><br><span class="line"><span class="class"> <span class="params">(<span class="number">0</span>, <span class="number">6224</span>)</span> 1</span></span><br><span class="line"><span class="class"> <span class="params">(<span class="number">0</span>, <span class="number">6545</span>)</span> 1</span></span><br><span class="line"><span class="class"> <span class="params">(<span class="number">0</span>, <span class="number">7019</span>)</span> 1</span></span><br><span class="line"><span class="class"> <span class="params">(<span class="number">0</span>, <span class="number">7233</span>)</span> 1</span></span><br><span class="line"><span class="class"> <span class="params">(<span class="number">0</span>, <span class="number">8463</span>)</span> 3</span></span><br><span class="line"><span class="class"> :</span> :</span><br><span class="line"> (<span class="number">5999</span>, <span class="number">66915</span>) <span 
class="number">1</span></span><br><span class="line"> (<span class="number">5999</span>, <span class="number">67033</span>) <span class="number">1</span></span><br><span class="line">feature_names access <span class="comment"># # 查找2345编号的单词为access,在第一个文本中出现两次。</span></span><br></pre></td></tr></tbody></table></div></figure>
<p><strong>于是我去数据集中的第一个文本中进行验证发现,access确实是出现了两次(如下图所示)。</strong></p>
<p><img src="https://img-blog.csdnimg.cn/img_convert/f5617264b160ea92a0dccb2d31f08ff1.png#averageHue=#f3eee7&clientId=u41e8680b-ee9d-4&from=paste&height=429&id=u368707f7&originHeight=429&originWidth=797&originalType=binary&ratio=1&rotation=0&showTitle=false&size=37609&status=done&style=none&taskId=u69171e0b-ba32-404d-b9d3-21033749c39&title=&width=797" alt="image.png"></p>
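<p>这种核对也可以直接用代码完成:CountVectorizer 默认把文本转小写、按至少两个字符的词元切分,下面用正则近似模拟这一分词方式再统计词频(示例文本为假设性数据,仅演示核对方法):</p>

```python
import re
from collections import Counter

# 假设性示例文本,仅用于演示如何核对 CountVectorizer 的词频统计
text = "Open access is growing; access to papers matters."

# 近似 CountVectorizer 的默认 token_pattern:转小写、取至少两个字符的词元
tokens = re.findall(r"\b\w\w+\b", text.lower())
counts = Counter(tokens)
print(counts["access"])  # → 2
```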
<p><strong>引入模型及标签预测:</strong></p>
<figure class="highlight python"><div class="table-container"><table><tbody><tr><td class="code"><pre><span class="line">test_vector = vector.transform(test[<span class="string">'text'</span>])</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="comment"># 引入模型</span></span><br><span class="line">model = LogisticRegression()</span><br><span class="line"></span><br><span class="line"><span class="comment"># 开始训练,这里可以考虑修改默认的batch_size与epoch来取得更好的效果</span></span><br><span class="line">model.fit(train_vector, train[<span class="string">'label'</span>])</span><br><span class="line"></span><br><span class="line"><span class="comment"># 利用模型对测试集label标签进行预测</span></span><br><span class="line">test[<span class="string">'label'</span>] = model.predict(test_vector)</span><br><span class="line">print(<span class="string">"test['label']"</span>,test[<span class="string">'label'</span>])</span><br><span class="line">print(<span class="string">'test\n:'</span>,test)</span><br><span class="line"><span class="comment"># 生成任务一推测结果</span></span><br><span class="line">test[[<span class="string">'uuid'</span>, <span class="string">'Keywords'</span>, <span class="string">'label'</span>]].to_csv(<span class="string">'submit_task1.csv'</span>, index=<span class="literal">None</span>)</span><br></pre></td></tr></tbody></table></div></figure>
<figure class="highlight python"><div class="table-container"><table><tbody><tr><td class="code"><pre><span class="line"><span class="string">'''</span></span><br><span class="line"><span class="string">输出结果如下:</span></span><br><span class="line"><span class="string">'''</span></span><br><span class="line">test[<span class="string">'label'</span>] <span class="number">0</span> <span class="number">1</span></span><br><span class="line"><span class="number">1</span> <span class="number">0</span></span><br><span class="line"><span class="number">2</span> <span class="number">0</span></span><br><span class="line"><span class="number">3</span> <span class="number">0</span></span><br><span class="line"><span class="number">4</span> <span class="number">0</span></span><br><span class="line"> ..</span><br><span class="line"><span class="number">2353</span> <span class="number">1</span></span><br><span class="line"><span class="number">2354</span> <span class="number">0</span></span><br><span class="line"><span class="number">2355</span> <span class="number">0</span></span><br><span class="line"><span class="number">2356</span> <span class="number">0</span></span><br><span class="line"><span class="number">2357</span> <span class="number">1</span></span><br><span class="line">Name: label, Length: <span class="number">2358</span>, dtype: int64</span><br><span class="line">test:</span><br><span class="line">uuid title author abstract Keywords text label </span><br><span class="line"><span class="number">0</span> <span class="number">0</span> Monitoring Changes <span class="keyword">in</span> ... </span><br><span class="line"><span class="number">1</span> <span class="number">1</span> Source Printer Classification ... ... ... ... ... </span><br><span class="line"><span class="number">2</span> <span class="number">2</span> Plasma-processed CoSn/RGO </span><br><span class="line"><span class="number">3</span> <span class="number">3</span> Immediate Antiretroviral ... 
</span><br><span class="line"><span class="number">4</span> <span class="number">4</span> Design <span class="keyword">and</span> analysis of an ...</span><br><span class="line">[<span class="number">2358</span> rows x <span class="number">7</span> columns]</span><br></pre></td></tr></tbody></table></div></figure>
<h2 id="三、代码修改"><span class="heading-link">三、代码修改</span></h2><p>在跑通Baseline代码后,得到的分数为:<strong>0.99384</strong>。 以下是尝试的几种修改策略。</p>
<ol>
<li><p><strong>将CountVectorizer替换为TfidfVectorizer</strong></p>
<p>未设置参数时得到的评分为:<strong>0.97655</strong><br>设置参数后得到的评分为:<strong>0.98171</strong></p>
</li>
</ol>
<figure class="highlight python"><div class="table-container"><table><tbody><tr><td class="code"><pre><span class="line"><span class="comment"># 使用TfidfVectorizer进行文本向量化</span></span><br><span class="line"><span class="keyword">from</span> sklearn.feature_extraction.text <span class="keyword">import</span> TfidfVectorizer</span><br><span class="line"></span><br><span class="line"><span class="comment"># 定义TfidfVectorizer,设置参数,例如调整ngram范围,最大特征数等</span></span><br><span class="line"><span class="comment"># vector = TfidfVectorizer().fit(train['text']) # 未设置参数时得到的评分为:0.97655 </span></span><br><span class="line">vector = TfidfVectorizer(ngram_range=(<span class="number">1</span>, <span class="number">2</span>), max_features=<span class="number">5000</span>).fit(train[<span class="string">'text'</span>]) <span class="comment"># 设置参数后得到的评分为:0.98171</span></span><br><span class="line">train_vector = vector.transform(train[<span class="string">'text'</span>])</span><br><span class="line">test_vector = vector.transform(test[<span class="string">'text'</span>])</span><br></pre></td></tr></tbody></table></div></figure>
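<p>其中 ngram_range=(1, 2) 表示同时把 unigram 和 bigram 作为特征。n-gram 的生成过程可以用几行纯 Python 示意(仅作原理说明):</p>

```python
def ngrams(tokens, ngram_range=(1, 2)):
    # 依次枚举长度为 lo..hi 的连续词片段
    lo, hi = ngram_range
    feats = []
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            feats.append(" ".join(tokens[i:i + n]))
    return feats

print(ngrams("deep learning model".split()))
# ['deep', 'learning', 'model', 'deep learning', 'learning model']
```

<p>bigram 能捕捉 "deep learning" 这类词组信息,但特征数也随之膨胀,因此常配合 max_features 限制词表规模。</p>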
<ol start="2">
<li><p><strong>尝试使用SVM模型</strong></p>
<p>使用SVM模型得到的评分为:<strong>0.99489</strong></p>
</li>
</ol>
<figure class="highlight python"><div class="table-container"><table><tbody><tr><td class="code"><pre><span class="line"><span class="keyword">from</span> sklearn.svm <span class="keyword">import</span> SVC</span><br><span class="line"></span><br><span class="line"><span class="comment"># 尝试使用SVM模型</span></span><br><span class="line">model = SVC(kernel=<span class="string">'linear'</span>)</span><br><span class="line">model.fit(train_vector, train[<span class="string">'label'</span>])</span><br><span class="line"></span><br><span class="line"><span class="comment"># 进行预测</span></span><br><span class="line">test[<span class="string">'label'</span>] = model.predict(test_vector)</span><br></pre></td></tr></tbody></table></div></figure>
<ol start="3">
<li><p><strong>尝试使用随机森林模型</strong></p>
<p>使用随机森林模型得到的评分为:<strong>0.98995</strong></p>
</li>
</ol>
<figure class="highlight python"><div class="table-container"><table><tbody><tr><td class="code"><pre><span class="line"><span class="keyword">from</span> sklearn.ensemble <span class="keyword">import</span> RandomForestClassifier</span><br><span class="line"></span><br><span class="line"><span class="comment"># 尝试使用随机森林模型</span></span><br><span class="line">model = RandomForestClassifier(n_estimators=<span class="number">100</span>)</span><br><span class="line">model.fit(train_vector, train[<span class="string">'label'</span>])</span><br><span class="line"></span><br><span class="line"><span class="comment"># 进行预测</span></span><br><span class="line">test[<span class="string">'label'</span>] = model.predict(test_vector)</span><br></pre></td></tr></tbody></table></div></figure>
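<p>随机森林的最终类别由 n_estimators 棵决策树投票决定,"多数表决"这一步可以用一个最小示意说明(预测值为假设性数据):</p>

```python
from collections import Counter

def majority_vote(predictions):
    # predictions: 各棵树对同一样本给出的类别预测
    return Counter(predictions).most_common(1)[0][0]

# 假设 5 棵树对某样本的预测:3 票类别 1、2 票类别 0
print(majority_vote([1, 0, 1, 1, 0]))  # → 1
```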
<h2 id="【参考】"><span class="heading-link">【参考】</span></h2><p><span class="external-link"><a href="https://aistudio.baidu.com/aistudio/projectdetail/6522950?sUid=377372&shared=1&ts=1689827255213" target="_blank" rel="noopener">手把手打一场NLP赛事</a><i class="fa fa-external-link"></i></span><br>赛事链接:<span class="external-link"><a href="https://challenge.xfyun.cn/topic/info?type=abstract-of-the-paper" target="_blank" rel="noopener">基于论文摘要的文本分类与关键词抽取挑战赛</a><i class="fa fa-external-link"></i></span></p>
<script>
document.querySelectorAll('.github-emoji')
.forEach(el => {
if (!el.dataset.src) { return; }
const img = document.createElement('img');
img.style = 'display:none !important;';
img.src = el.dataset.src;
img.addEventListener('error', () => {
img.remove();
el.style.color = 'inherit';
el.style.backgroundImage = 'none';
el.style.background = 'none';
});
img.addEventListener('load', () => {
img.remove();
});
document.body.appendChild(img);
});
</script>]]></content>
<categories>
<category>【第一期AI夏令营丨自然语言处理】</category>
</categories>
<tags>
<tag>NLP</tag>
</tags>
</entry>
<entry>
<title>【第一期AI夏令营丨自然语言处理】赛事信息</title>
<url>/2023/10/24/%E3%80%90%E7%AC%AC%E4%B8%80%E6%9C%9FAI%E5%A4%8F%E4%BB%A4%E8%90%A5%E4%B8%A8%E8%87%AA%E7%84%B6%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86%E3%80%91%E8%B5%9B%E4%BA%8B%E4%BF%A1%E6%81%AF/</url>
<content><![CDATA[<p>基于论文摘要的文本分类与关键词抽取挑战赛 :<span class="external-link"><a href="https://challenge.xfyun.cn/topic/info?type=abstract-of-the-paper&ch=ZuoaKcY" target="_blank" rel="noopener">https://challenge.xfyun.cn/topic/info?type=abstract-of-the-paper&ch=ZuoaKcY</a><i class="fa fa-external-link"></i></span></p>
<p><img src="/2023/10/24/%E3%80%90%E7%AC%AC%E4%B8%80%E6%9C%9FAI%E5%A4%8F%E4%BB%A4%E8%90%A5%E4%B8%A8%E8%87%AA%E7%84%B6%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86%E3%80%91%E8%B5%9B%E4%BA%8B%E4%BF%A1%E6%81%AF/1.png" alt=""></p>
<h1 id="赛题解析"><span class="heading-link">赛题解析</span></h1><p>本任务分为两个子任务:<br><strong>1. 从论文标题、摘要、作者等信息,判断该论文是否属于医学领域的文献。</strong><br><strong>2. 从论文标题、摘要、作者等信息,提取出该论文关键词。</strong></p>
<p><strong>第一个任务</strong>可以看作是一个<strong>文本二分类任务</strong>:机器需要根据对论文摘要等信息的理解,将论文划分为医学领域的文献和非医学领域的文献两个类别之一。<strong>第二个任务</strong>可以看作是一个<strong>文本关键词识别任务</strong>:机器需要从给定的论文中识别和提取出与论文内容相关的关键词。<br><em>train.csv</em> 部分信息如下图:<br><img src="/2023/10/24/%E3%80%90%E7%AC%AC%E4%B8%80%E6%9C%9FAI%E5%A4%8F%E4%BB%A4%E8%90%A5%E4%B8%A8%E8%87%AA%E7%84%B6%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86%E3%80%91%E8%B5%9B%E4%BA%8B%E4%BF%A1%E6%81%AF/2.png" alt="image.png"></p>
<p><em>test.csv</em> 部分信息如下图:<br><img src="/2023/10/24/%E3%80%90%E7%AC%AC%E4%B8%80%E6%9C%9FAI%E5%A4%8F%E4%BB%A4%E8%90%A5%E4%B8%A8%E8%87%AA%E7%84%B6%E8%AF%AD%E8%A8%80%E5%A4%84%E7%90%86%E3%80%91%E8%B5%9B%E4%BA%8B%E4%BF%A1%E6%81%AF/3.png" alt="image.png"></p>
<h2 id="数据集解析"><span class="heading-link">数据集解析</span></h2><p>训练集与测试集数据为CSV格式文件,各字段包括标题、作者和摘要等。Keywords为任务2的标签,label为任务1的标签。训练集和测试集都可以通过pandas读取。</p>
<h2 id="【参考】"><span class="heading-link">【参考】</span></h2><p><span class="external-link"><a href="https://aistudio.baidu.com/aistudio/projectdetail/6522950?sUid=377372&shared=1&ts=1689827255213" target="_blank" rel="noopener">手把手打一场NLP赛事</a><i class="fa fa-external-link"></i></span></p>
<script>
document.querySelectorAll('.github-emoji')
.forEach(el => {
if (!el.dataset.src) { return; }
const img = document.createElement('img');
img.style = 'display:none !important;';
img.src = el.dataset.src;
img.addEventListener('error', () => {
img.remove();
el.style.color = 'inherit';
el.style.backgroundImage = 'none';
el.style.background = 'none';
});
img.addEventListener('load', () => {
img.remove();
});
document.body.appendChild(img);
});
</script>]]></content>
<categories>
<category>【第一期AI夏令营丨自然语言处理】</category>
</categories>
<tags>
<tag>NLP</tag>
</tags>
</entry>
<entry>
<title>带你用油猴插件免费观看各大平台(B站、爱奇艺等)VIP视频</title>
<url>/2023/10/24/%E5%B8%A6%E4%BD%A0%E7%94%A8%E6%B2%B9%E7%8C%B4%E6%8F%92%E4%BB%B6%E5%85%8D%E8%B4%B9%E8%A7%82%E7%9C%8B%E5%90%84%E5%A4%A7%E5%B9%B3%E5%8F%B0%EF%BC%88B%E7%AB%99%E3%80%81%E7%88%B1%E5%A5%87%E8%89%BA%E7%AD%89%EF%BC%89VIP%E8%A7%86%E9%A2%91/</url>
<content><![CDATA[<h2 id="前言"><span class="heading-link">前言</span></h2>
<p>这里使用谷歌浏览器来操作示范,其他浏览器也是类似操作。很多朋友无法使用谷歌商店,这里提供两个好用方便的网站下载插件:</p>
<ol>
<li><span class="external-link"><a href="https://chrome.zzzmh.cn/index#ext" target="_blank" rel="noopener">极简插件</a><i class="fa fa-external-link"></i></span></li>
<li><span class="external-link"><a href="https://www.crxsoso.com/" target="_blank" rel="noopener">Crx搜搜</a><i class="fa fa-external-link"></i></span> </li>
</ol>
<h2 id="安装步骤"><span class="heading-link">安装步骤</span></h2><p>废话不多说直接进入主题。</p>
<h3 id="1-第一步:安装油猴插件"><span class="heading-link">1. 第一步:安装油猴插件</span></h3><p>在上面提供的两个链接中搜索<code>Tampermonkey</code></p>
<p>[<img src="/2023/10/24/%E5%B8%A6%E4%BD%A0%E7%94%A8%E6%B2%B9%E7%8C%B4%E6%8F%92%E4%BB%B6%E5%85%8D%E8%B4%B9%E8%A7%82%E7%9C%8B%E5%90%84%E5%A4%A7%E5%B9%B3%E5%8F%B0%EF%BC%88B%E7%AB%99%E3%80%81%E7%88%B1%E5%A5%87%E8%89%BA%E7%AD%89%EF%BC%89VIP%E8%A7%86%E9%A2%91/piEK4Tx.md.png" alt="piEK4Tx.md.png"><br><img src="/2023/10/24/%E5%B8%A6%E4%BD%A0%E7%94%A8%E6%B2%B9%E7%8C%B4%E6%8F%92%E4%BB%B6%E5%85%8D%E8%B4%B9%E8%A7%82%E7%9C%8B%E5%90%84%E5%A4%A7%E5%B9%B3%E5%8F%B0%EF%BC%88B%E7%AB%99%E3%80%81%E7%88%B1%E5%A5%87%E8%89%BA%E7%AD%89%EF%BC%89VIP%E8%A7%86%E9%A2%91/8ac52c3f87f845fa8a2ad652de704849.png" alt=""><br>如果安装失败,无法拖动到扩展程序中,这里可以下载<code>Chrome伴侣</code></p>
<p><img src="/2023/10/24/%E5%B8%A6%E4%BD%A0%E7%94%A8%E6%B2%B9%E7%8C%B4%E6%8F%92%E4%BB%B6%E5%85%8D%E8%B4%B9%E8%A7%82%E7%9C%8B%E5%90%84%E5%A4%A7%E5%B9%B3%E5%8F%B0%EF%BC%88B%E7%AB%99%E3%80%81%E7%88%B1%E5%A5%87%E8%89%BA%E7%AD%89%EF%BC%89VIP%E8%A7%86%E9%A2%91/87287d12a06949ab83c68a2d81493677.png" alt="在这里插入图片描述"><br>选择插件位置,开始安装即可,亲测有效。</p>
<h3 id="2-第二步:安装油猴插件"><span class="heading-link">2. 第二步:安装油猴插件</span></h3><p>下载安装成功后,开始使用油猴插件下载脚本。<br><img src="/2023/10/24/%E5%B8%A6%E4%BD%A0%E7%94%A8%E6%B2%B9%E7%8C%B4%E6%8F%92%E4%BB%B6%E5%85%8D%E8%B4%B9%E8%A7%82%E7%9C%8B%E5%90%84%E5%A4%A7%E5%B9%B3%E5%8F%B0%EF%BC%88B%E7%AB%99%E3%80%81%E7%88%B1%E5%A5%87%E8%89%BA%E7%AD%89%EF%BC%89VIP%E8%A7%86%E9%A2%91/f022d1da2fba43f8be229873387212a3.png" alt="在这里插入图片描述"><br>然后开始再搜索框搜索你所需要的脚本<br><img src="/2023/10/24/%E5%B8%A6%E4%BD%A0%E7%94%A8%E6%B2%B9%E7%8C%B4%E6%8F%92%E4%BB%B6%E5%85%8D%E8%B4%B9%E8%A7%82%E7%9C%8B%E5%90%84%E5%A4%A7%E5%B9%B3%E5%8F%B0%EF%BC%88B%E7%AB%99%E3%80%81%E7%88%B1%E5%A5%87%E8%89%BA%E7%AD%89%EF%BC%89VIP%E8%A7%86%E9%A2%91/1845866b322f4374949a2e9ee3e98078.png" alt="在这里插入图片描述"><br><img src="/2023/10/24/%E5%B8%A6%E4%BD%A0%E7%94%A8%E6%B2%B9%E7%8C%B4%E6%8F%92%E4%BB%B6%E5%85%8D%E8%B4%B9%E8%A7%82%E7%9C%8B%E5%90%84%E5%A4%A7%E5%B9%B3%E5%8F%B0%EF%BC%88B%E7%AB%99%E3%80%81%E7%88%B1%E5%A5%87%E8%89%BA%E7%AD%89%EF%BC%89VIP%E8%A7%86%E9%A2%91/8fb7147b2360450b899dd2ec8a83e7d0.png" alt="在这里插入图片描述"><br>可以下载不同的脚本,如果不好用的话。<br><img src="/2023/10/24/%E5%B8%A6%E4%BD%A0%E7%94%A8%E6%B2%B9%E7%8C%B4%E6%8F%92%E4%BB%B6%E5%85%8D%E8%B4%B9%E8%A7%82%E7%9C%8B%E5%90%84%E5%A4%A7%E5%B9%B3%E5%8F%B0%EF%BC%88B%E7%AB%99%E3%80%81%E7%88%B1%E5%A5%87%E8%89%BA%E7%AD%89%EF%BC%89VIP%E8%A7%86%E9%A2%91/4e895f99b61c48f48b6fe227a35a26cd.png" alt="在这里插入图片描述"><br>点击安装即可</p>
<h3 id="2-第三步:-使用脚本"><span class="heading-link">3. 第三步:使用脚本</span></h3><p>原本需要VIP才能观看,现在我们点击左边的VIP脚本。<br><img src="/2023/10/24/%E5%B8%A6%E4%BD%A0%E7%94%A8%E6%B2%B9%E7%8C%B4%E6%8F%92%E4%BB%B6%E5%85%8D%E8%B4%B9%E8%A7%82%E7%9C%8B%E5%90%84%E5%A4%A7%E5%B9%B3%E5%8F%B0%EF%BC%88B%E7%AB%99%E3%80%81%E7%88%B1%E5%A5%87%E8%89%BA%E7%AD%89%EF%BC%89VIP%E8%A7%86%E9%A2%91/703a88389a584592a32b9e53e0ded897.png" alt="在这里插入图片描述"><br><img src="/2023/10/24/%E5%B8%A6%E4%BD%A0%E7%94%A8%E6%B2%B9%E7%8C%B4%E6%8F%92%E4%BB%B6%E5%85%8D%E8%B4%B9%E8%A7%82%E7%9C%8B%E5%90%84%E5%A4%A7%E5%B9%B3%E5%8F%B0%EF%BC%88B%E7%AB%99%E3%80%81%E7%88%B1%E5%A5%87%E8%89%BA%E7%AD%89%EF%BC%89VIP%E8%A7%86%E9%A2%91/670d91853e8448b2aeaf1b3a7c86c55c.png" alt="在这里插入图片描述"><br>如果一个端口不行,就换下一个端口。</p>
<h2 id="最后"><span class="heading-link">最后</span></h2><p>有很多好用的脚本和插件,值得大家继续探索!真的太好用啦!!!</p>
<script>
document.querySelectorAll('.github-emoji')
.forEach(el => {
if (!el.dataset.src) { return; }
const img = document.createElement('img');
img.style = 'display:none !important;';
img.src = el.dataset.src;
img.addEventListener('error', () => {
img.remove();
el.style.color = 'inherit';
el.style.backgroundImage = 'none';
el.style.background = 'none';
});
img.addEventListener('load', () => {
img.remove();
});
document.body.appendChild(img);
});
</script>]]></content>
<categories>
<category>小技巧</category>
</categories>
<tags>
<tag>资源</tag>
</tags>
</entry>
<entry>
<title>3 线性神经网络</title>
<url>/2023/03/23/3%20%E7%BA%BF%E6%80%A7%E7%A5%9E%E7%BB%8F%E7%BD%91%E7%BB%9C/</url>
<content><![CDATA[<h1 id="3-1-线性回归"><span class="heading-link">3.1 线性回归</span></h1><h2 id="小结"><span class="heading-link">小结</span></h2><ol>
<li><strong>由果到因</strong>,根据已经发生的观测结果去猜想真实参数,这个过程叫做<strong>估计</strong>;估计正确的可能性叫做<strong>似然性</strong>。求可能性最大的推测,这个过程就是<strong>极大似然估计</strong>。由因到果,根据真实参数(或已经发生的观测结果)去推测未来的观测结果,这个过程叫做<strong>预测</strong>;预测正确的可能性叫做<strong>概率</strong>。</li>
<li><strong>损失函数</strong>:能够量化目标的实际值与预测值之间的差距。</li>
<li><strong>解析解</strong>:可以用一个公式简单地表达出来。</li>
<li><strong>随机梯度下降</strong>:通常会在每次需要计算更新的时候随机抽取一小批样本, 这种变体叫做小批量随机梯度下降(minibatch stochastic gradient descent)</li>
<li>算法的步骤如下:(1)初始化模型参数的值,如随机初始化; (2)从数据集中随机抽取小批量样本且在负梯度的方向上更新参数,并不断迭代这一步骤。</li>
<li><strong>超参数</strong>:可以调整但不在训练过程中更新的参数。<strong>调参</strong>:选择超参数的过程。</li>
<li><strong>泛化</strong>(generalization):找到一组参数,这组参数能够在我们从未见过的数据上实现较低的损失。</li>
</ol>
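<p>小结第 5 条的两步算法可以用一个极简的纯 Python 示意(玩具数据 y = 2x + 1,学习率与迭代次数均为假设性取值;为保持结果确定,这里用全批量梯度下降代替小批量随机梯度下降,更新规则相同):</p>

```python
import random

random.seed(0)
# 玩具数据:真实参数 w=2, b=1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]

w, b = random.random(), 0.0    # (1) 初始化模型参数
lr = 0.05
for _ in range(2000):          # (2) 沿负梯度方向不断迭代更新
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    w -= lr * dw
    b -= lr * db

print(round(w, 2), round(b, 2))  # → 2.0 1.0
```

<p>其中 lr 就是一个典型的超参数:它可以调整,但不在训练过程中被梯度更新。</p>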
<h2 id="练习"><span class="heading-link">练习</span></h2><ol>
<li>假设我们有一些数据$x_1, \ldots, x_n \in \mathbb{R}$。我们的目标是找到一个常数$b$,使得最小化$\sum_i (x_i - b)^2$。 <ol>
<li>找到最优值$b$的解析解。</li>
<li>这个问题及其解与正态分布有什么关系?</li>
</ol>
</li>
</ol>
<p><img src="https://cdn.nlark.com/yuque/0/2023/jpeg/1866697/1679575137426-6355cc99-8deb-4003-9f8b-56ae9e17fcdb.jpeg?x-oss-process=image/auto-orient,1#averageHue=%23c6bfaf&clientId=u257d3d6a-dacf-4&from=paste&height=467&id=udec6df17&name=f220011d9f56d741001e058085bf3d7.jpg&originHeight=700&originWidth=1819&originalType=binary&ratio=1.5&rotation=0&showTitle=false&size=193443&status=done&style=none&taskId=u9eb22437-f7ec-4517-8e17-14d38d51d00&title=&width=1212.6666666666667" alt="f220011d9f56d741001e058085bf3d7.jpg"><br><img src="https://cdn.nlark.com/yuque/0/2023/jpeg/1866697/1679575164638-f9afce37-486c-4994-9ff9-1b9c2a508fb1.jpeg?x-oss-process=image/auto-orient,1#averageHue=%23bcb5a6&clientId=u257d3d6a-dacf-4&from=paste&height=629&id=u2fa4c110&name=4b4a7a7be951be654ca83a2fc8091c2.jpg&originHeight=944&originWidth=1470&originalType=binary&ratio=1.5&rotation=0&showTitle=false&size=228144&status=done&style=none&taskId=u50d9aff1-4102-41ee-b704-25c53806b5b&title=&width=980" alt="4b4a7a7be951be654ca83a2fc8091c2.jpg"></p>
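<p>练习 1 的结论是最优 $b$ 为样本均值,可以在一组玩具数据上数值验证:在均值附近扫描候选值,损失的最小值恰好落在均值处。</p>

```python
xs = [1.0, 2.0, 4.0, 5.0]  # 假设性玩具数据

def loss(b):
    return sum((x - b) ** 2 for x in xs)

b_mean = sum(xs) / len(xs)   # 解析解:b* = 样本均值 = 3.0
# 在均值附近以 0.01 为步长扫描,确认没有损失更小的点
candidates = [b_mean + d / 100 for d in range(-200, 201)]
best = min(candidates, key=loss)
print(b_mean, round(best, 2))  # → 3.0 3.0
```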
<ol start="2">
<li>推导出使用平方误差的线性回归优化问题的解析解。为了简化问题,可以忽略偏置$b$(我们可以通过向$\mathbf X$添加所有值为1的一列来做到这一点)。 <ol>
<li>用矩阵和向量表示法写出优化问题(将所有数据视为单个矩阵,将所有目标值视为单个向量)。</li>
<li>计算损失对$w$的梯度。</li>
<li>通过将梯度设为0、求解矩阵方程来找到解析解。</li>
<li>什么时候可能比使用随机梯度下降更好?这种方法何时会失效?</li>
</ol>
</li>
</ol>
<p><img src="https://cdn.nlark.com/yuque/0/2023/jpeg/1866697/1679575198670-0cf0dde2-4e9f-4f4f-81bb-d9252d0b2ffc.jpeg?x-oss-process=image/auto-orient,1#averageHue=%23c2bcad&clientId=u257d3d6a-dacf-4&from=paste&height=485&id=u8edc42a5&name=20a3cd63e0e9b71742d6ea41a6c4ad2.jpg&originHeight=727&originWidth=1402&originalType=binary&ratio=1.5&rotation=0&showTitle=false&size=178129&status=done&style=none&taskId=ub288724b-e8bf-461b-8c7e-a2da8d4e4cf&title=&width=934.6666666666666" alt="20a3cd63e0e9b71742d6ea41a6c4ad2.jpg"><br>d: 当模型简单的时候,通过求W的解析解是比随机梯度下降更好,但是当$X^{T}X$不可逆时,无法求出解析解。</p>
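<p>练习 2 得到的解析解 $w = (X^TX)^{-1}X^Ty$ 在"单特征 + 偏置"的情形下可以化简为熟悉的闭式公式(斜率为协方差除以方差),下面在玩具数据上直接验证:</p>

```python
# 一维线性回归的解析解(正规方程在单特征+偏置情形下的化简),玩具数据 y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n
w = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / \
    sum((x - x_mean) ** 2 for x in xs)
b = y_mean - w * x_mean
print(w, b)  # → 2.0 1.0
```

<p>分母 $\sum_i (x_i - \bar x)^2$ 为零(所有样本取值相同)对应 $X^TX$ 不可逆的退化情形,此时解析解不存在。</p>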
<ol start="3">
<li>Assume that the noise model governing the additive noise $\epsilon$ is the exponential (Laplace) distribution, i.e. $p(\epsilon) = \frac{1}{2} \exp(-|\epsilon|)$ <ol>
<li>Write out the negative log-likelihood of the data under the model, $-\log P(\mathbf y \mid \mathbf X)$.</li>
<li>Try to write out the analytic solution.</li>
<li>Propose a stochastic gradient descent algorithm to solve this problem. What could go wrong? (Hint: what happens near the stationary point as we keep updating the parameters?) Try to fix it.</li>
</ol>
</li>
</ol>
<p><img src="https://cdn.nlark.com/yuque/0/2023/jpeg/1866697/1679575242913-4a360559-4f73-4c71-a36b-1da0082833c8.jpeg?x-oss-process=image/auto-orient,1#averageHue=%23c8c1b1&clientId=u257d3d6a-dacf-4&from=paste&height=660&id=u0b242022&name=714017c91023bf8de42d41d88cdb4df.jpg&originHeight=990&originWidth=1432&originalType=binary&ratio=1.5&rotation=0&showTitle=false&size=221078&status=done&style=none&taskId=u4cd9a677-e658-4de3-a627-c9b89af3ca3&title=&width=954.6666666666666" alt="714017c91023bf8de42d41d88cdb4df.jpg"><br>The absolute-value function has no derivative at zero.<br>c: The resulting loss takes the form of an $L_1$ norm, which is not differentiable at the stationary point.<br>A possible problem with gradient descent: near the stationary point the parameters oscillate sharply and fail to converge. One fix is to switch from the $L_1$ norm to the $L_2$ norm once the loss falls below a threshold; this avoids both the overly large gradients (unstable training) of the $L_2$ norm far from the stationary point, and the oscillation of the $L_1$ norm near it.</p>
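<p>The oscillation described in c can be seen in one dimension with the loss $|w - 3|$: the gradient magnitude never shrinks, so a fixed learning rate keeps overshooting, while a Huber-style quadratic branch near the optimum converges. A toy sketch (the target 3.0 and the learning rate are arbitrary):</p>

```python
# 1-D illustration: for the L1-style loss |w - 3| the gradient is +/-1
# everywhere except at the optimum, so plain SGD with a fixed learning rate
# bounces around w = 3 instead of settling there.
def grad_l1(w, target=3.0):
    return 1.0 if w > target else -1.0

def grad_huber(w, target=3.0, sigma=1.0):
    # quadratic (L2-like) branch near the optimum, linear branch far away
    e = w - target
    return e / sigma if abs(e) <= sigma else (1.0 if e > 0 else -1.0)

lr = 0.4
w = v = 0.0
for _ in range(50):
    w -= lr * grad_l1(w)      # keeps bouncing between ~2.8 and ~3.2
    v -= lr * grad_huber(v)   # step length shrinks with the error

assert abs(w - 3.0) > 0.19    # pure L1: stuck oscillating, amplitude ~ lr/2
assert abs(v - 3.0) < 1e-3    # Huber-style switch: converges
```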
<h1 id="3-2-线性回归的从零开始实现"><span class="heading-link">3.2 Linear Regression Implementation from Scratch</span></h1><h2 id="练习-1"><span class="heading-link">Exercises</span></h2><ol>
<li>What would happen if we were to initialize the weights to zero? Would the algorithm still work?</li>
</ol>
<p>In a single-layer network, initializing the weights to zero is fine. But once the network gets deeper, in the fully connected case the weight symmetry during backpropagation makes the hidden neurons symmetric, so multiple hidden neurons act like a single one. The algorithm still runs, but the results are poor. (<span class="external-link"><a href="https://zhuanlan.zhihu.com/p/75879624" target="_blank" rel="noopener">https://zhuanlan.zhihu.com/p/75879624</a><i class="fa fa-external-link"></i></span>)</p>
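<p>A toy numpy sketch of the symmetry problem described above (made-up data): two hidden units whose weights start identical receive identical gradients at every step, so they stay identical and act as one unit.</p>

```python
import numpy as np

# Symmetry demo: a 2-hidden-unit MLP whose hidden weights start identical
# (a constant init). Forward and backward passes keep the two columns of W1
# equal forever, so the network behaves like a single hidden unit.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=(8, 1))

W1 = np.full((3, 2), 0.1)   # both hidden units identical
W2 = np.full((2, 1), 0.1)
for _ in range(20):
    H = np.tanh(X @ W1)              # hidden layer
    out = H @ W2
    dout = 2 * (out - y) / len(X)    # d MSE / d out
    dW2 = H.T @ dout
    dH = dout @ W2.T
    dW1 = X.T @ (dH * (1 - H ** 2))  # tanh derivative
    W1 -= 0.1 * dW1
    W2 -= 0.1 * dW2

# Both hidden units' weights remain identical after training:
assert np.allclose(W1[:, 0], W1[:, 1])
assert np.allclose(W2[0], W2[1])
```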
<ol start="2">
<li>Assume that we are trying to build a model for the relationship between voltage and current. Can automatic differentiation be used to learn the parameters of the model?</li>
</ol>
<p>Yes. Build the model $U = IW + b$ and learn $W$ and $b$ from data with automatic differentiation.</p>
<ol start="3">
<li><p>Can you determine the temperature of an object from its spectral energy density, based on <span class="external-link"><a href="https://en.wikipedia.org/wiki/Planck%27s_law" target="_blank" rel="noopener">Planck's law</a><i class="fa fa-external-link"></i></span>?</p>
</li>
<li><p>What problems might you encounter when computing second derivatives? How can they be resolved?</p>
</li>
</ol>
<p>The computation graph of the first derivative is not kept by default, so it cannot be differentiated again directly; retaining the first derivative's graph (e.g. <code>create_graph=True</code> in PyTorch) makes the second derivative computable.</p>
<ol start="5">
<li>Why is the <code>reshape</code> function needed in the <code>squared_loss</code> function?</li>
</ol>
<p>Because $\hat{y}$ and $y$ may be a row vector and a column vector; <code>reshape</code> ensures they have the same shape, so the subtraction is elementwise rather than broadcast.</p>
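<p>A small numpy example of the pitfall that <code>reshape</code> prevents: subtracting a shape-(n,) vector from a shape-(n, 1) vector silently broadcasts to an n×n matrix.</p>

```python
import numpy as np

# Why squared_loss reshapes: if y_hat has shape (n, 1) and y has shape (n,),
# y_hat - y broadcasts to an (n, n) matrix instead of elementwise errors.
y_hat = np.arange(4.0).reshape(4, 1)   # column vector, shape (4, 1)
y = np.arange(4.0)                     # flat vector, shape (4,)

assert (y_hat - y).shape == (4, 4)                       # silent broadcast bug
assert (y_hat - y.reshape(y_hat.shape)).shape == (4, 1)  # what we actually want
```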
<ol start="6">
<li>Experiment with different learning rates and observe how quickly the loss function value drops.</li>
</ol>
<p>① With a learning rate that is too large, the loss drops fast early on but then has trouble converging;<br>② with a learning rate that is too small, the loss drops very slowly.</p>
<ol start="7">
<li>If the number of examples cannot be divided by the batch size, what happens to the behavior of the <code>data_iter</code> function?</li>
</ol>
<p>Nothing breaks: because the slice is clamped with <code>min(i + batch_size, num_examples)</code>, the final mini-batch simply contains the remaining examples and is smaller than the others.</p>
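<p>A minimal sketch in the spirit of the book's <code>data_iter</code> (plain Python lists instead of tensors) showing what happens when the sample count is not a multiple of the batch size:</p>

```python
import random

# A data_iter in the spirit of the chapter: when num_examples is not a
# multiple of batch_size, the final slice is simply smaller, not an error.
def data_iter(batch_size, features, labels):
    n = len(features)
    indices = list(range(n))
    random.shuffle(indices)
    for i in range(0, n, batch_size):
        batch = indices[i:min(i + batch_size, n)]
        yield [features[j] for j in batch], [labels[j] for j in batch]

features = list(range(10))
labels = list(range(10))
sizes = [len(xb) for xb, _ in data_iter(3, features, labels)]
assert sizes == [3, 3, 3, 1]   # the last mini-batch holds the remainder
```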
<h1 id="3-3-线性回归的简洁实现"><span class="heading-link">3.3 Concise Implementation of Linear Regression</span></h1><h2 id="小结-1"><span class="heading-link">Summary</span></h2><ol>
<li>The Sequential class chains multiple layers together. Given input data, a Sequential instance passes the data into the first layer, then uses the first layer's output as the second layer's input, and so on.</li>
<li>Select the first layer of the network via net[0], then access its parameters through the weight.data and bias.data attributes. Parameter values can also be overwritten with the in-place methods normal_ and fill_.</li>
<li><strong>Mean squared error is computed with the MSELoss class, also known as the squared $L_2$ norm.</strong> By default it returns the average loss over all examples. <code>loss = nn.MSELoss()</code></li>
<li><strong>When instantiating an SGD instance</strong>, we specify the parameters to optimize (obtainable from our model via net.parameters()) and a dictionary of hyperparameters required by the optimization algorithm. <code>trainer = torch.optim.SGD(net.parameters(), lr=0.03)</code></li>
<li>For each mini-batch, we perform the following steps:</li>
</ol>
<ul>
<li>Generate predictions by calling net(X) and compute the loss l (forward pass).</li>
<li>Compute the gradients by running backpropagation.</li>
<li>Update the model parameters by calling the optimizer.</li>
</ul>
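<p>The three per-mini-batch steps above can be sketched framework-free (numpy instead of torch; synthetic data in the spirit of the chapter's $w=[2, -3.4]$, $b=4.2$ example):</p>

```python
import numpy as np

# Framework-free sketch of the three per-mini-batch steps for a linear model:
# 1) forward pass + loss, 2) backward pass (gradients), 3) parameter update.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
w_true, b_true = np.array([2.0, -3.4]), 4.2
y = X @ w_true + b_true + 0.01 * rng.normal(size=200)

w, b, lr, batch = np.zeros(2), 0.0, 0.03, 10
for epoch in range(30):
    for i in range(0, len(X), batch):
        Xb, yb = X[i:i + batch], y[i:i + batch]
        err = Xb @ w + b - yb                # 1) forward: prediction error
        grad_w = 2 * Xb.T @ err / len(Xb)    # 2) backward: mean-squared-loss grads
        grad_b = 2 * err.mean()
        w -= lr * grad_w                     # 3) update parameters
        b -= lr * grad_b

assert np.allclose(w, w_true, atol=0.02) and abs(b - b_true) < 0.02
```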
<h2 id="练习-2"><span class="heading-link">Exercises</span></h2><ol>
<li>How do we need to change the learning rate if we replace the aggregate loss over the mini-batch with an average over the loss on the mini-batch?</li>
</ol>
<p>Multiply the learning rate by the batch size: the gradient of the mean loss is $1/\text{batch\_size}$ times the gradient of the summed loss, so scaling the learning rate up by the batch size keeps the parameter updates unchanged.</p>
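<p>A tiny numeric check of the scaling argument (hypothetical per-example gradients): the summed loss's gradient is batch_size times the mean loss's gradient, so multiplying the learning rate by batch_size reproduces the same update.</p>

```python
# The gradient of the summed loss is batch_size times the gradient of the
# mean loss; rescaling the learning rate by batch_size keeps the parameter
# update identical.
batch_size, lr = 8, 0.01
errs = [0.5, -1.0, 2.0, 0.25, -0.5, 1.5, -2.0, 0.75]  # per-example d loss / d w

grad_sum = sum(errs)
grad_mean = grad_sum / batch_size

step_sum = lr * grad_sum
step_mean = (lr * batch_size) * grad_mean   # rescaled lr recovers the same step
assert abs(step_sum - step_mean) < 1e-12
```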
<ol start="2">
<li>Review the deep learning framework documentation: which loss functions and initialization methods are provided? Replace the loss with the Huber loss, i.e. $l(y,y') = \begin{cases}|y-y'| - \frac{\sigma}{2} & \text{if } |y-y'| > \sigma \\ \frac{1}{2 \sigma} (y-y')^2 & \text{otherwise}\end{cases}$ </li>
</ol>
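<p>A plain-Python sketch of the Huber loss from the exercise, checking that the linear and quadratic branches meet continuously at $|y-y'| = \sigma$ (both give $\sigma/2$ there):</p>

```python
# Huber loss from the exercise: quadratic for small errors, linear for large
# ones; the sigma/2 and 1/(2*sigma) constants make the branches meet at
# |y - y'| = sigma, where both evaluate to sigma / 2.
def huber(y, y_pred, sigma=1.0):
    e = abs(y - y_pred)
    return e - sigma / 2 if e > sigma else e ** 2 / (2 * sigma)

sigma = 1.0
assert abs(huber(0.0, sigma, sigma) - sigma / 2) < 1e-12  # branch point
assert huber(0.0, 2.0) == 2.0 - 0.5                       # linear branch
assert huber(0.0, 0.5) == 0.125                           # quadratic: 0.25 / 2
```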
<ol start="3">
<li>How do you access the gradients of the linear regression model?<figure class="highlight python"><div class="table-container"><table><tbody><tr><td class="code"><pre><span class="line">net[<span class="number">0</span>].weight.grad</span><br><span class="line">net[<span class="number">0</span>].bias.grad</span><br></pre></td></tr></tbody></table></div></figure>
</li>
</ol>
<script>
document.querySelectorAll('.github-emoji')
.forEach(el => {
if (!el.dataset.src) { return; }
const img = document.createElement('img');
img.style = 'display:none !important;';
img.src = el.dataset.src;
img.addEventListener('error', () => {
img.remove();
el.style.color = 'inherit';
el.style.backgroundImage = 'none';
el.style.background = 'none';
});
img.addEventListener('load', () => {
img.remove();
});
document.body.appendChild(img);
});
</script>]]></content>
</entry>
<entry>
<title>第十章 期中大作业</title>
<url>/2023/03/07/%E7%AC%AC%E5%8D%81%E7%AB%A0%20%E6%9C%9F%E4%B8%AD%E5%A4%A7%E4%BD%9C%E4%B8%9A/</url>
<content><![CDATA[<h2 id="10-1-面试题"><span class="heading-link">10.1 Interview Questions</span></h2><h3 id="10-1-1-hive外部表和内部表的区别"><span class="heading-link">10.1.1 Differences between Hive external and internal tables</span></h3><ul>
<li><strong>Internal table:</strong> a table not marked with the external keyword; tables created without it are internal by default.</li>
<li><strong>External table:</strong> a table marked with the external keyword.</li>
</ul>
<p><strong>Differences:</strong></p>
<p>Internal and external tables in Hive differ mainly in how the data is managed and where it is stored.</p>
<p>An internal (managed) table is fully controlled by Hive's data warehouse: its data files live in Hive's own directory structure, under the path given in the table definition. When you drop an internal table, Hive automatically deletes all of the table's data as well.</p>
<p>An external table is managed outside Hive: its data may actually live in your local file system or in another file system outside the Hadoop cluster. When creating an external table you specify the location of the data files, and if you drop the table, the data files are not deleted automatically. This is the main difference between external and internal tables.</p>
<p>In short: use internal tables when Hive manages the data and you want Hive to control its integrity and lifecycle; use external tables when you need to access data not controlled by Hive, or when other programs also need to access the data files.</p>
<h3 id="10-1-2-简述对Hive桶的理解?"><span class="heading-link">10.1.2 Explain your understanding of Hive buckets</span></h3><p>A Hive bucket splits a table's (or a partition's) data into a fixed number of parts by hashing a chosen column, in order to improve query efficiency and performance. Bucketing is similar in spirit to partitioning but finer-grained: each bucket is stored on HDFS as its own file with a unique number. Buckets speed up queries because Hive can look only at the files that can contain the required data instead of the whole file or table, skipping much unnecessary scanning and filtering.</p>
<h3 id="10-1-3-HBase和Hive的区别?"><span class="heading-link">10.1.3 Differences between HBase and Hive</span></h3><p>Hive is a tool that runs on top of Hadoop, more precisely a query tool. When searching massive data, Hadoop's compute engine is MapReduce, but programming MapReduce directly is very complex; Hive reduces that complexity to SQL-like operations over massive data, which greatly lightens the programmer's workload. Hive supports many query statements and functions, as well as user-defined functions and nested queries. It is suited to offline batch data and is mainly used for data analysis and mining.</p>
<p>HBase, by contrast, is a distributed database built on HDFS. HDFS provides Hadoop's distributed storage, but by itself it offers no ordered, random access to individual records; HBase fills that gap. HBase performs well for complex data handling and real-time applications, suiting scenarios such as real-time serving, caching, and log processing.</p>
<p>The two therefore differ in application scenarios and in how they process data, and should be chosen according to the specific requirements.</p>
<h3 id="10-1-4-简述Spark宽窄依赖"><span class="heading-link">10.1.4 Spark wide and narrow dependencies</span></h3><p>In Spark, a wide dependency means a parent RDD partition is depended on by multiple child RDD partitions, while a narrow dependency means a parent RDD partition is depended on by exactly one child RDD partition.</p>
<p>Wide dependencies may involve a data shuffle, which degrades performance, so narrow dependencies should be preferred in Spark where possible. However, when the number of partitions of the parent RDD and the child RDD differ, a shuffle may be required, producing a wide dependency.</p>
<h3 id="10-1-5-Hadoop和Spark的相同点和不同点"><span class="heading-link">10.1.5 Similarities and differences between Hadoop and Spark</span></h3><p>Hadoop and Spark are both open-source big data processing frameworks, with the following similarities and differences:</p>
<p><strong>Similarities:</strong></p>
<ul>
<li>Both process data at large scale and handle the storage and processing of large volumes of data.</li>
<li>Both support a variety of data sources and formats.</li>
<li>Both provide failure recovery and node management.</li>
</ul>
<p><strong>Differences:</strong></p>
<ul>
<li>Spark is better suited to iterative applications and stream processing tasks.</li>
<li>Spark is faster than Hadoop because it computes in memory.</li>
<li>Hadoop is better suited to offline batch processing tasks.</li>
<li>Hadoop is based on the MapReduce model, whereas Spark offers more flexible APIs and computation models.</li>
</ul>
<h3 id="10-1-6-Spark为什么比MapReduce块?"><span class="heading-link">10.1.6 Why is Spark faster than MapReduce?</span></h3><p>Spark executes faster than MapReduce for the following reasons:</p>
<p><strong>1. In-memory computation:</strong> unlike MapReduce, Spark computes over large datasets in memory rather than repeatedly reading and writing them to HDFS. This reduces the disk I/O burden and saves valuable time.</p>
<p><strong>2. DAG-based scheduling:</strong> Spark schedules data processing as a DAG of tasks, so data can be shared between tasks and unnecessary computation avoided. MapReduce must execute simple Map and Reduce tasks in a fixed order.</p>
<p><strong>3. Faster communication:</strong> Spark uses a lightweight memory-based communication mechanism, while MapReduce's is disk-based, so Spark completes communication faster, raising execution speed.</p>
<p>In summary, Spark applies a series of optimizations that let it process large datasets faster.</p>
<h3 id="10-1-7-说说你对Hadoop生态的认识"><span class="heading-link">10.1.7 Describe your understanding of the Hadoop ecosystem</span></h3><p>The Hadoop ecosystem is an open-source big data framework comprising the Hadoop core components (HDFS, YARN, and MapReduce) plus a range of related tools and libraries, such as HBase, Hive, Pig, Spark, and ZooKeeper. Thanks to its high reliability, scalability, and throughput, it is widely used in big data processing.</p>
<p>HDFS, the core storage component, is a distributed file system that spreads large volumes of data across multiple cluster nodes, improving the storage system's reliability and fault tolerance. YARN is the resource scheduling platform, allocating the cluster's compute resources sensibly among applications so the cluster is used efficiently. MapReduce is a distributed computing framework that splits large datasets into small blocks processed in parallel, speeding up computation. On top of these sit many related tools and libraries: HBase for real-time reads and writes at scale, Hive for data warehouse queries and analysis, Pig for data analysis, Spark for in-memory computation, and so on.</p>
<p>In short, the Hadoop ecosystem is an essential toolset for big data processing, whose strong design and capabilities are delivering great business value to enterprises and organizations of all kinds.</p>
<h2 id="10-2-实战"><span class="heading-link">10.2 Hands-on</span></h2><p>Discovering hot and trending topics in news articles is an important task in public-opinion monitoring. In this project, your task is to analyze text from a news dataset with the RDD and DataFrame APIs of pySpark in Python. The problem is to compute the weight of each word per year in the dataset of news articles, then select the top-k most important words for each year.</p>
<p>PS: a fill-in-the-blank template of the solution source code and the result-validation data are in the <code>\juicy-bigdata\experiments\10</code> final-project directory.</p>
<p>Also install the <code>pyspark</code> package in advance.</p>
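<p>The core of the task (per-year word statistics, then top-k) can be sketched in plain Python before porting it to pySpark. The headlines below are made-up illustrative data, and the real solution would use RDD/DataFrame transformations and a TF-IDF-style weight rather than the raw counts used here:</p>

```python
from collections import Counter

# Simplified sketch of "top-k words per year": group articles by year,
# count words, take the k most frequent. (Toy data; no stop-word removal.)
articles = [
    ("2019", "stock market rises as trade talks resume"),
    ("2019", "trade deal hopes lift stock prices"),
    ("2020", "virus outbreak hits global market"),
    ("2020", "market reacts to virus lockdown news"),
]

def top_k_per_year(rows, k):
    counts = {}
    for year, text in rows:
        counts.setdefault(year, Counter()).update(text.split())
    return {year: [w for w, _ in c.most_common(k)] for year, c in counts.items()}

result = top_k_per_year(articles, 2)
assert set(result["2019"]) == {"stock", "trade"}   # each appears twice in 2019
```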
<script>
document.querySelectorAll('.github-emoji')
.forEach(el => {
if (!el.dataset.src) { return; }
const img = document.createElement('img');
img.style = 'display:none !important;';
img.src = el.dataset.src;
img.addEventListener('error', () => {
img.remove();
el.style.color = 'inherit';
el.style.backgroundImage = 'none';
el.style.background = 'none';
});
img.addEventListener('load', () => {
img.remove();
});
document.body.appendChild(img);
});
</script>]]></content>
</entry>
<entry>
<title>第六章 期中大作业</title>
<url>/2023/02/26/%E7%AC%AC%E5%85%AD%E7%AB%A0%20%E6%9C%9F%E4%B8%AD%E5%A4%A7%E4%BD%9C%E4%B8%9A/</url>
<content><![CDATA[<h1 id="6-1-面试题"><span class="heading-link">6.1 Interview Questions</span></h1><h2 id="6-1-1-简述Hadoop小文件弊端"><span class="heading-link">6.1.1 Drawbacks of small files in Hadoop</span></h2><ol>
<li>Every small file on HDFS needs an index entry in the NameNode of roughly 150 bytes. With many small files, these entries both occupy a large amount of NameNode memory and, as the index grows, slow down index lookups.</li>
<li>HDFS is designed for high throughput rather than low-latency access; storing a large number of small files at once takes a long time.</li>
<li>HDFS favors streaming reads and is not suited to multiple writers or writes at arbitrary offsets. Reading many small files forces jumps from one DataNode to another, which greatly reduces read performance.</li>
</ol>
<h2 id="6-1-2-HDFS中DataNode挂掉如何处理?"><span class="heading-link">6.1.2 How to handle a failed DataNode in HDFS?</span></h2><p>When some DataNodes disappear, the NameNode stops receiving their heartbeats, marks them dead, and re-replicates the now under-replicated blocks onto other DataNodes; a simple, workable recovery path is to restore from a snapshot and return directly to the previous state.</p>
<h2 id="6-1-3-HDFS中NameNode挂掉如何处理?"><span class="heading-link">6.1.3 How to handle a failed NameNode in HDFS?</span></h2><p>The first step after a crash is certainly a restart; if it happens during a peak period, restore service quickly and leave the incident analysis until after the peak. Then copy the checkpoint data from the SecondaryNameNode into the directory where the NameNode stores its metadata.</p>
<h2 id="6-1-4-HBase读写流程?"><span class="heading-link">6.1.4 HBase read and write flows</span></h2>
<p><strong>HBase write flow</strong>:<br>1) The client initiates the write request and first establishes a connection to ZooKeeper.<br>2) From ZooKeeper it learns which RegionServer manages the hbase:meta table.<br>3) It reads hbase:meta to obtain the address of the RegionServer managing the region that the write targets (a single RegionServer address is returned).<br>4) It connects to that RegionServer and starts writing: the data first goes to the HLog, then to the MemStore of the corresponding store module (possibly several); once both writes complete, the write is done.<br><strong>HBase read flow:</strong><br>1. The client asks ZooKeeper which RegionServer holds the metadata.<br>2. Using that address, it contacts the corresponding RegionServer to find the RegionServer storing the target table.<br>3. It goes to the RegionServer where the table resides to read the data.<br>4. It locates the right region and, within it, the column family; it looks first in the MemStore, then in the BlockCache, and if still not found, scans the StoreFiles.<br>5. Data found on disk is first cached in the BlockCache before the result is returned; as the BlockCache gradually fills, entries are evicted with an LRU policy.</p>
<h2 id="6-1-5-MapReduce为什么一定要有Shuffle过程"><span class="heading-link">6.1.5 Why MapReduce must have a Shuffle phase</span></h2><p>A single machine's resources cannot perform global partitioning, sorting, or grouping over a distributed large dataset. Shuffle therefore builds a task on each machine that tags the data with a partition (via a hash or range partitioner); once tagged, every record can flow to its designated partition, which is what makes global partitioning, grouping, and sorting possible.</p>
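<p>The mechanism can be illustrated with a toy, single-process word count: map tasks emit (key, value) pairs, the shuffle routes each key to a fixed partition by hash, and each reduce task aggregates the keys of its partition. This is only a sketch of the idea, not Hadoop code:</p>

```python
from collections import defaultdict

# Toy shuffle simulation: hashing the key fixes its partition, so every
# value for a given key reaches the same reducer, whichever mapper made it.
def map_task(lines):
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs, num_reducers):
    partitions = defaultdict(list)
    for key, value in pairs:
        partitions[hash(key) % num_reducers].append((key, value))
    return partitions

def reduce_task(pairs):
    totals = defaultdict(int)
    for key, value in sorted(pairs):   # shuffle output arrives sorted by key
        totals[key] += value
    return dict(totals)

lines = ["big data big ideas", "data beats ideas"]
partitions = shuffle(map_task(lines), num_reducers=2)
counts = {}
for part in partitions.values():
    counts.update(reduce_task(part))
assert counts == {"big": 2, "data": 2, "ideas": 2, "beats": 1}
```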
<h2 id="6-1-6-MapReduce中的三次排序"><span class="heading-link">6.1.6 The three sorts in MapReduce</span></h2><p>1) When the map function produces output, it is first written to an in-memory circular buffer. When the set threshold is reached, before spilling to disk a background thread divides the buffer's contents into the corresponding partitions, and within each partition sorts by key in memory.<br>2) Before the map task finishes, the disk holds several spill files, each already partitioned and sorted, about the size of the buffer. These are merged into a single partitioned, sorted output file; since the spills already went through the first sort, the merge needs only one more sort to make the output globally ordered.<br>3) In the reduce phase, the output files of multiple map tasks are copied to the ReduceTask and merged. Because of the second sort, the merge again needs only one more sort to make the output globally ordered.</p>
<h2 id="6-1-7-MapReduce为什么不能产生过多小文件"><span class="heading-link">6.1.7 Why MapReduce must not produce too many small files</span></h2><p>For MapReduce, each small file is a block and produces its own InputSplit, so each small file ultimately spawns its own map task. This launches far too many map tasks at once; starting a map task is expensive, yet each one finishes shortly after starting because its input is tiny, so the time spent executing can be less than the time spent launching tasks, which hurts MapReduce's efficiency.</p>
<h1 id="6-2-实战"><span class="heading-link">6.2 Hands-on</span></h1><p>In this assignment you need to use MRJob to analyze a dataset from an online social network. Since I am just starting out with this part, I don't yet know how to use it well and need to keep studying.</p>
<h1 id="参考资料"><span class="heading-link">References</span></h1><p><span class="external-link"><a href="https://datawhalechina.github.io/juicy-bigdata/#/" target="_blank" rel="noopener">Juicy Big Data</a><i class="fa fa-external-link"></i></span></p>
<script>
document.querySelectorAll('.github-emoji')
.forEach(el => {
if (!el.dataset.src) { return; }
const img = document.createElement('img');
img.style = 'display:none !important;';
img.src = el.dataset.src;
img.addEventListener('error', () => {
img.remove();
el.style.color = 'inherit';
el.style.backgroundImage = 'none';
el.style.background = 'none';
});
img.addEventListener('load', () => {
img.remove();
});
document.body.appendChild(img);
});
</script>]]></content>
</entry>
<entry>
<title>回首2022,迎接2023</title>
<url>/2023/01/01/%E5%9B%9E%E9%A6%962022%EF%BC%8C%E8%BF%8E%E6%8E%A52023/</url>
<content><![CDATA[<p><span class="external-link"><a href="https://imgse.com/i/pSCzAts" target="_blank" rel="noopener"><img src="https://s1.ax1x.com/2023/01/01/pSCzAts.jpg" alt="pSCzAts.jpg"></a><i class="fa fa-external-link"></i></span></p>
<h1 id="回首2022"><span class="heading-link">Looking back on 2022</span></h1><p>Time flies. In 2022 I graduated from university, yet I still feel like a child at the bottom of a well, longing for the day I can leap out of it. From January to June 2022 I was busy rushing my graduation thesis; through my own effort and my advisor's help, it was ultimately awarded "Outstanding Graduation Thesis". From July to August I was busy practicing driving, so I rented a room outside the school for a month.</p>
<script>
document.querySelectorAll('.github-emoji')
.forEach(el => {
if (!el.dataset.src) { return; }
const img = document.createElement('img');
img.style = 'display:none !important;';
img.src = el.dataset.src;
img.addEventListener('error', () => {
img.remove();
el.style.color = 'inherit';
el.style.backgroundImage = 'none';
el.style.background = 'none';
});
img.addEventListener('load', () => {
img.remove();
});
document.body.appendChild(img);
});
</script>]]></content>
<categories>
<category>自我总结</category>
</categories>
<tags>
<tag>随笔</tag>
</tags>
</entry>
<entry>
<title>用PS将图片中的英文变成中文</title>
<url>/2022/01/05/%E7%94%A8PS%E5%B0%86%E5%9B%BE%E7%89%87%E4%B8%AD%E7%9A%84%E8%8B%B1%E6%96%87%E5%8F%98%E6%88%90%E4%B8%AD%E6%96%87/</url>
<content><![CDATA[<h2 id="用PS将图片中的英文变成中文"><span class="heading-link">Replacing English with Chinese in an image using PS</span></h2><p>Many students are surely facing graduation pressure right now and anxiously writing their theses. For tables or figures from English papers, how do we turn the English inside them into Chinese??? Here is how to use PS to replace the English with Chinese; let's get started. If you haven't downloaded Photoshop, you can also use the convenient and easy <span class="external-link"><a href="https://ps.gaoding.com/#/" target="_blank" rel="noopener">online PS</a><i class="fa fa-external-link"></i></span>.</p>
<h5 id="1-首先,先保存你所需要修改的图片"><span class="heading-link">1. First, save the image you need to modify</span></h5><p><span class="external-link"><a href="https://imgtu.com/i/TOUHAK" target="_blank" rel="noopener"><img src="https://s4.ax1x.com/2022/01/04/TOUHAK.png" alt="TOUHAK.png"></a><i class="fa fa-external-link"></i></span></p>
<h5 id="2-点击污点修复画笔工具"><span class="heading-link">2. Click the Spot Healing Brush tool</span></h5><p><span class="external-link"><a href="https://imgtu.com/i/TOUbtO" target="_blank" rel="noopener"><img src="https://s4.ax1x.com/2022/01/04/TOUbtO.png" alt="TOUbtO.png"></a><i class="fa fa-external-link"></i></span></p>
<h5 id="3-将要P掉的英文擦掉"><span class="heading-link">3. Erase the English you want to remove</span></h5><p><span class="external-link"><a href="https://imgtu.com/i/TOUO9e" target="_blank" rel="noopener"><img src="https://s4.ax1x.com/2022/01/04/TOUO9e.png" alt="TOUO9e.png"></a><i class="fa fa-external-link"></i></span></p>
<h5 id="4-然后点击字体排版工具,根据情况调整字体颜色,大小等"><span class="heading-link">4. Then click the Type tool and adjust the font color, size, etc. as needed</span></h5><p><span class="external-link"><a href="https://imgtu.com/i/TOUX1H" target="_blank" rel="noopener"><img src="https://s4.ax1x.com/2022/01/04/TOUX1H.png" alt="TOUX1H.png"></a><i class="fa fa-external-link"></i></span></p>
<h5 id="5-点击移动工具,可以调整字体的位置"><span class="heading-link">5. Click the Move tool to adjust the position of the text</span></h5><p><span class="external-link"><a href="https://imgtu.com/i/TOUqhD" target="_blank" rel="noopener"><img src="https://s4.ax1x.com/2022/01/04/TOUqhD.png" alt="TOUqhD.png"></a><i class="fa fa-external-link"></i></span></p>
<h5 id="6-最后快速保存为png格式即可"><span class="heading-link">6. Finally, quick-save as PNG</span></h5><p><span class="external-link"><a href="https://imgtu.com/i/TOUjcd" target="_blank" rel="noopener"><img src="https://s4.ax1x.com/2022/01/04/TOUjcd.png" alt="TOUjcd.png"></a><i class="fa fa-external-link"></i></span></p>
<script>
document.querySelectorAll('.github-emoji')
.forEach(el => {
if (!el.dataset.src) { return; }
const img = document.createElement('img');
img.style = 'display:none !important;';
img.src = el.dataset.src;
img.addEventListener('error', () => {
img.remove();
el.style.color = 'inherit';
el.style.backgroundImage = 'none';
el.style.background = 'none';
});
img.addEventListener('load', () => {
img.remove();
});
document.body.appendChild(img);
});
</script>]]></content>
<categories>
<category>小技巧</category>
</categories>
<tags>
<tag>Photoshop</tag>
</tags>
</entry>
<entry>
<title>回首2021,展望2022</title>
<url>/2022/01/01/%E5%9B%9E%E9%A6%962021%EF%BC%8C%E5%B1%95%E6%9C%9B2022/</url>
<content><![CDATA[<h1 id="回首2021,展望2022"><span class="heading-link">回首2021,展望2022</span></h1><hr>
<h2 id="回首2021"><span class="heading-link">回首2021</span></h2><hr>
<h3 id="望着窗台的时间一点点的流逝,距离2022年还有不到一个小时,就着急着记录2021年走过的点点滴滴。当拿起笔时,又不知改写下什么。"><span class="heading-link">望着窗台的时间一点点的流逝,距离2022年还有不到一个小时,就着急着记录2021年走过的点点滴滴。当拿起笔时,又不知改写下什么。</span></h3><p>2021年喝过苦水也尝过甜汤,这一年对我未来有着深远的意义。从2021年初开始每天都提心吊胆的生活着。因为我需要一边准备着考研,一边准备推免材料,从而让我十分害怕两者都不能兼顾。于是从1月份开始便于哥哥讨论需要购买的考研课本,和想要报考的学校,同时也在慢慢地看一些考研的视频。另外,我还报名参加了2021年的美国大学生数学建模竞赛,这次竞赛给我的推免带来了希望,也让我有了一丝宽慰。在竞赛是三四天里,我们小组成员线上沟通,有一点儿不方便,大家的时间和事情都是不可预见的,所以写起来还是有一定的难度。在最后一天里,我通宵了,一直到早上交完论文才敢去床上躺着。因为我知道,这可能是我最后一次参加这个比赛,也是唯一的机会了,只能拼尽全力,不留一丝遗憾。上天还是很眷顾我的,我们小组也取得了较好的成绩,为我们大学的建模生涯画上了圆满的句号。</p>
<p>2021年我想的最多的就是觉得时间不够,一直在抱怨两手准备的困难,不能全身心的投入到某一方面。现在回头想想,还是觉得自己的时间管理能力不够,学习效率不高,一直在抱怨,不曾知,在抱怨的时候,也许自己已经可以完成某一件事情。</p>
<p>2021年的春学期开学,由于优先准备推免的事情,所以当时也十分担心自己的学分绩会不会下降,在这之前,觉得提高学分绩应该没那么难,但是事实上还是高估了自己,最终学分绩还是下降了一点,有一点伤心和担忧。一直觉得自己是一个感性的人,看电视也可能感动伤心落泪,所以一点点事情都会影响到自己的情绪,内心总是控制不住的想太多。</p>
<p>2021年的暑假,我留校了。每天重复三点一线——宿舍,食堂,基地。这时候我一边准备着考研复习(当时已经看了数二、英二)一边准备着推免材料的准备(准备的较晚,很多是夏令营5,6月份就已经开始了),主要写了简历、自我介绍、获奖材料等。在这个暑假,我参加了YN、XD、HN等多所大学的夏令营,有一些在投递简历的时候就已经结束了。由于报名时间较晚,所以夏令营只得到了一个offer——YN。在这个过程中,由于联系老师较晚,错失了XD,最后联系的时候,要么不理你,要么说已经招满了。每每等到一封邮件,不懂是高兴还是伤心,因为你不知道是好消息还是坏消息。当时还觉得自己面试挺好了,但是上帝往往会跟你开各种玩笑。YN大学的offer让我有了一点宽慰,让自己心里有底了,所以还是十分感谢YN大学信息院的老师们给了我机会。</p>
<p>很快迎来了9月份,9月底将会公布获得推免资格的学生。心里十分沉重,也把不断的给自己鼓励打气<span class="github-emoji" style="color: transparent;background:no-repeat url(https://github.githubassets.com/images/icons/emoji/unicode/1f4aa.png?v8) center/contain" data-src="https://github.githubassets.com/images/icons/emoji/unicode/1f4aa.png?v8">💪</span><span class="github-emoji" style="color: transparent;background:no-repeat url(https://github.githubassets.com/images/icons/emoji/unicode/1f4aa.png?v8) center/contain" data-src="https://github.githubassets.com/images/icons/emoji/unicode/1f4aa.png?v8">💪</span>,大不了就考研,或者找工作,我并不比别人差。在公布名单的前一晚,我哭了<span class="github-emoji" style="color: transparent;background:no-repeat url(https://github.githubassets.com/images/icons/emoji/unicode/1f62d.png?v8) center/contain" data-src="https://github.githubassets.com/images/icons/emoji/unicode/1f62d.png?v8">😭</span><span class="github-emoji" style="color: transparent;background:no-repeat url(https://github.githubassets.com/images/icons/emoji/unicode/1f62d.png?v8) center/contain" data-src="https://github.githubassets.com/images/icons/emoji/unicode/1f62d.png?v8">😭</span>。因为老师的一句话,让我心里十分的难受,于是控制不住的哭了。哭完后,也好受了很多。在那天里,我坐立不安,焦急的等待的领导开会后的结果。终于,名单上有了我的名字<span class="github-emoji" style="color: transparent;background:no-repeat url(https://github.githubassets.com/images/icons/emoji/unicode/1f60a.png?v8) center/contain" data-src="https://github.githubassets.com/images/icons/emoji/unicode/1f60a.png?v8">😊</span>,虽然有一篇文章没有加分,让我有点儿伤心,但是最后的结果还是值得庆祝的。得到推免名额后,并不代表就结束了,手上并没有满意的offer。于是就必须在预推免阶段拿到offer了,期间我也参加了很多学校的预推免,有成功,也有失败。(天啊,看了一下时间,已经23:59分了,看来不能在新年前写完了~~)最后还是拿到了满意的offer——JN大学,虽然过程很艰难,但是一定不要放弃,不放过每一个机会。我也十分感谢我的导师,听着声音,很温柔,亲切没有。如果没有导师,可能我也不能去理想的学校。</p>
<p>最后总结一下:2021年是有意义的一年,道路十分坎坷,感谢每一个在身边陪伴的人。我也十分珍惜每一段经历,让我成长了很多,收获了很多。不忘初心,方得始终!附上一张自己2021年的部分学习事项,由于比较长,所以以图片链接来记录一下(请忽略后面的延长时间)。附上<span class="external-link"><a href="https://s4.ax1x.com/2022/01/01/T5Svzd.jpg" target="_blank" rel="noopener">学习事项图</a><i class="fa fa-external-link"></i></span>,今后也要脚踏实地,珍惜每一分钟。</p>
<h2 id="展望2022"><span class="heading-link">展望2022</span></h2><hr>
<p>2022年,新的一年里,希望自己能改掉自己的缺点,提高时间管理能力。改掉拖沓,看剧的毛病,希望2022好运不请自来。阳光,就在隧道的尽头等着我们,保持韧性和耐力,就会一步步地接近光明。</p>
<p>2022上半年的主要任务:</p>
<ul>
<li><input disabled="" type="checkbox"> <p>完成毕业论文</p>
</li>
<li><input disabled="" type="checkbox"> <p>学习自然语言处理的相关知识</p>
<p><span class="external-link"><a href="https://imgtu.com/i/TIPWVI" target="_blank" rel="noopener"><img src="https://s4.ax1x.com/2022/01/01/TIPWVI.png" alt="TIPWVI.png"></a><i class="fa fa-external-link"></i></span></p>
<p><span class="external-link"><a href="https://imgtu.com/i/TIP2qA" target="_blank" rel="noopener"><img src="https://s4.ax1x.com/2022/01/01/TIP2qA.png" alt="TIP2qA.png"></a><i class="fa fa-external-link"></i></span></p>
</li>
</ul>
<p>摘自<span class="external-link"><a href="https://datawhale.feishu.cn/docs/doccn0AOicI3LJ8RwhY0cuDPSOc#" target="_blank" rel="noopener">Datawhale人工智能培养方案</a><i class="fa fa-external-link"></i></span>(欢迎关注博客<a href="https://lvshaomei.github.io/">Mia</a>)</p>
<script>
document.querySelectorAll('.github-emoji')
.forEach(el => {
if (!el.dataset.src) { return; }
const img = document.createElement('img');
img.style = 'display:none !important;';
img.src = el.dataset.src;
img.addEventListener('error', () => {
img.remove();
el.style.color = 'inherit';
el.style.backgroundImage = 'none';
el.style.background = 'none';
});
img.addEventListener('load', () => {
img.remove();
});
document.body.appendChild(img);
});
</script>]]></content>
<categories>
<category>自我总结</category>
</categories>
<tags>
<tag>随笔</tag>
</tags>
</entry>
<entry>
<title>nodeJs</title>
<url>/2020/11/25/nodeJs/</url>
<content><![CDATA[<h2 id="Node-js服务"><span class="heading-link">Node.js services</span></h2><h3 id="1-了解Node-js"><span class="heading-link">1. Getting to know Node.js</span></h3><ul>
<li><p>Official site: <span class="external-link"><a href="https://nodejs.org/en/" target="_blank" rel="noopener">nodejs</a><i class="fa fa-external-link"></i></span></p>
</li>
<li><p>Definition: Node.js is a JavaScript runtime built on Chrome's V8 engine</p>
</li>
</ul>
<h3 id="2-Koa框架"><span class="heading-link">2. The Koa framework</span></h3><ul>
<li><p>Official site: <span class="external-link"><a href="https://koa.bootcss.com/" target="_blank" rel="noopener">koa</a><i class="fa fa-external-link"></i></span></p>
</li>
<li><p>Koa is a next-generation web framework for Node.js</p>
</li>
<li><p>As a Node.js framework, Koa has a very small code base</p>
</li>
</ul>
<h3 id="3-XMLHttpRequest对象"><span class="heading-link">3. The XMLHttpRequest object</span></h3><ul>
<li><p>Abbreviated XHR</p>
</li>
<li><p>XMLHttpRequest = XML + Http + Request</p>
</li>
<li><p>Essentially an object that can send HTTP requests, handle HTTP responses, and exchange data asynchronously with the server; its core is HTTP.</p>
</li>
</ul>
<blockquote>
<p><span class="external-link"><a href="https://imgchr.com/i/DUvldO" target="_blank" rel="noopener"><img src="https://s3.ax1x.com/2020/11/25/DUvldO.png" alt="DUvldO.png"></a><i class="fa fa-external-link"></i></span></p>
</blockquote>
<ul>
<li><strong>Using XMLHttpRequest</strong></li>
</ul>
<blockquote>
<p><strong>Four steps:</strong></p>
<p><span class="external-link"><a href="https://imgchr.com/i/DUv1oD" target="_blank" rel="noopener"><img src="https://s3.ax1x.com/2020/11/25/DUv1oD.png" alt="DUv1oD.png"></a><i class="fa fa-external-link"></i></span></p>
<p>So to understand the XMLHttpRequest object, you must first understand HTTP</p>
</blockquote>
<ol start="4">
<li><h3 id="Http请求"><span class="heading-link">HTTP requests</span></h3></li>
</ol>
<blockquote>
<p>An HTTP request consists of 4 parts:</p>
</blockquote>
<ul>
<li><p>The HTTP request method: get, post, delete, put</p>
</li>
<li><p>The URL being requested (/home/index.html)</p>
</li>
<li><p>The request headers (optional)</p>
</li>
<li><p>The request body (optional)</p>
</li>
</ul>
<p>So after creating the XMLHttpRequest object, call its open() method to specify the two required parts of the request: the request method and the URL</p>
<blockquote>
<p><span class="external-link"><a href="https://imgchr.com/i/DUvNQI" target="_blank" rel="noopener"><img src="https://s3.ax1x.com/2020/11/25/DUvNQI.png" alt="DUvNQI.png"></a><i class="fa fa-external-link"></i></span></p>
</blockquote>
<p>The first argument of open() specifies the HTTP request method, the second argument is the URL — the main content of the request — and the third argument, "true", means the request is asynchronous. If there are request headers, the next step of the request process is to set them. For example, a POST request needs "Content-type".</p>
<blockquote>
<p><span class="external-link"><a href="https://imgchr.com/i/DUvUyt" target="_blank" rel="noopener"><img src="https://s3.ax1x.com/2020/11/25/DUvUyt.png" alt="DUvUyt.png"></a><i class="fa fa-external-link"></i></span></p>
</blockquote>
<p>(used only for POST requests)</p>
<p>The last step of issuing an HTTP request with XMLHttpRequest is to specify the request body (optional) and send it to the server: xhr.send()</p>
<ol start="5">
<li><h3 id="Http响应"><span class="heading-link">HTTP responses</span></h3></li>
</ol>
<p>The HTTP response returned by the server has 3 parts:</p>
<ul>
<li><p>The status code, indicating the success or failure of the request</p>
</li>
<li><p>The response headers</p>
</li>
<li><p>The response body</p>
</li>
</ul>
<p>readyState is an integer that specifies the state of the HTTP request</p>
<p><span class="external-link"><a href="https://imgchr.com/i/DUv0w8" target="_blank" rel="noopener"><img src="https://s3.ax1x.com/2020/11/25/DUv0w8.png" alt="DUv0w8.png"></a><i class="fa fa-external-link"></i></span></p>
<ol start="6">
<li><h3 id="跨域请求"><span class="heading-link">Cross-origin requests</span></h3></li>
</ol>
<ul>
<li>What is the same-origin policy</li>
</ul>
<blockquote>
<p>"Same origin" means the protocol, domain name, and port are all the same; even two different domains pointing at the same IP address are not same-origin. The same-origin policy (SOP) is a convention introduced into browsers by Netscape in 1995. It is the browser's most central and basic security feature and a well-known security policy; all browsers supporting JavaScript now use it. Without the same-origin policy, browsers would be vulnerable to attacks such as XSS and CSRF.</p>
</blockquote>
<ul>
<li>What is an origin</li>
</ul>
<ul>
<li><p>An origin is the protocol, the domain name, and the port number</p>
</li>
<li><p>Two addresses are same-origin if their protocol, domain, and port are all identical</p>
</li>
</ul>
<ul>
<li><p><span class="external-link"><a href="https://imgchr.com/i/DUvsYQ" target="_blank" rel="noopener"><img src="https://s3.ax1x.com/2020/11/25/DUvsYQ.png" alt="DUvsYQ.png"></a><i class="fa fa-external-link"></i></span></p>
</li>
<li><p>Cross-origin (successfully obtaining another domain's information from between two different domains)</p>
</li>
</ul>
<blockquote>
<p>Cross-origin techniques: JSONP, Proxy, iframe, CORS</p>
<p>Tags that can cross origins natively: script (img, link)</p>
<p><span class="external-link"><a href="https://imgchr.com/i/DUv2yq" target="_blank" rel="noopener"><img src="https://s3.ax1x.com/2020/11/25/DUv2yq.png" alt="DUv2yq.png"></a><i class="fa fa-external-link"></i></span></p>
</blockquote>
<script>
document.querySelectorAll('.github-emoji')
.forEach(el => {
if (!el.dataset.src) { return; }
const img = document.createElement('img');
img.style = 'display:none !important;';
img.src = el.dataset.src;
img.addEventListener('error', () => {
img.remove();
el.style.color = 'inherit';
el.style.backgroundImage = 'none';
el.style.background = 'none';
});
img.addEventListener('load', () => {
img.remove();
});
document.body.appendChild(img);
});
</script>]]></content>
<categories>
<category>node初学</category>
</categories>
<tags>
<tag>node js</tag>
</tags>
</entry>
<entry>
<title>JS笔记--运算符(续)</title>
<url>/2020/11/25/js%E8%BF%90%E7%AE%97%E7%AC%A6/</url>
<content><![CDATA[<h2 id="JS笔记(续)—运算符"><span class="heading-link">JS Notes (continued) — Operators</span></h2><ol>
<li><h3 id="运算符定义:"><span class="heading-link">Definition of operators:</span></h3></li>
</ol>
<blockquote>
<p><span class="external-link"><a href="https://imgchr.com/i/DU7xuF" target="_blank" rel="noopener"><img src="https://s3.ax1x.com/2020/11/25/DU7xuF.png" alt="DU7xuF.png"></a><i class="fa fa-external-link"></i></span></p>
</blockquote>
<ol start="2">
<li><h3 id="算术运算符概述"><span class="heading-link">Overview of arithmetic operators</span></h3></li>
</ol>
<blockquote>
<p><span class="external-link"><a href="https://imgchr.com/i/DUHF9x" target="_blank" rel="noopener"><img src="https://s3.ax1x.com/2020/11/25/DUHF9x.png" alt="DUHF9x.png"></a><i class="fa fa-external-link"></i></span></p>
</blockquote>
<ol start="3">
<li><h3 id="浮点数精度问题"><span class="heading-link">Floating-point precision issues</span></h3></li>
</ol>
<blockquote>
<p><u>Do not compute with floating-point numbers directly; avoid them wherever possible.</u></p>
<p><span class="external-link"><a href="https://imgchr.com/i/DUHZuD" target="_blank" rel="noopener"><img src="https://s3.ax1x.com/2020/11/25/DUHZuD.png" alt="DUHZuD.png"></a><i class="fa fa-external-link"></i></span></p>
</blockquote>
<ol start="4">
<li><h3 id="表达式和返回值"><span class="heading-link">Expressions and return values</span></h3></li>
</ol>
<blockquote>
<p><span class="external-link"><a href="https://imgchr.com/i/DUHB80" target="_blank" rel="noopener"><img src="https://s3.ax1x.com/2020/11/25/DUHB80.png" alt="DUHB80.png"></a><i class="fa fa-external-link"></i></span></p>
</blockquote>
<ol start="5">
<li><h3 id="递增、递减"><span class="heading-link">Increment and decrement</span></h3></li>
</ol>
<blockquote>
<p><span class="external-link"><a href="https://imgchr.com/i/DUHhP1" target="_blank" rel="noopener"><img src="https://s3.ax1x.com/2020/11/25/DUHhP1.png" alt="DUHhP1.png"></a><i class="fa fa-external-link"></i></span></p>
</blockquote>
<ul>
<li><strong>Pre-increment</strong></li>
</ul>
<blockquote>
<p><span class="external-link"><a href="https://imgchr.com/i/DUHqVH" target="_blank" rel="noopener"><img src="https://s3.ax1x.com/2020/11/25/DUHqVH.png" alt="DUHqVH.png"></a><i class="fa fa-external-link"></i></span></p>
</blockquote>
<ul>
<li><strong>Post-increment</strong></li>
</ul>
<blockquote>
<p><span class="external-link"><a href="https://imgchr.com/i/DUb9sS" target="_blank" rel="noopener"><img src="https://s3.ax1x.com/2020/11/25/DUb9sS.png" alt="DUb9sS.png"></a><i class="fa fa-external-link"></i></span></p>
</blockquote>
<ol start="5">
<li><h3 id="比较运算符"><span class="heading-link">Comparison operators</span></h3></li>
</ol>
<blockquote>
<p><span class="external-link"><a href="https://imgchr.com/i/DUbFaj" target="_blank" rel="noopener"><img src="https://s3.ax1x.com/2020/11/25/DUbFaj.png" alt="DUbFaj.png"></a><i class="fa fa-external-link"></i></span></p>
<p><span class="external-link"><a href="https://imgchr.com/i/DUbNQK" target="_blank" rel="noopener"><img src="https://s3.ax1x.com/2020/11/25/DUbNQK.png" alt="DUbNQK.png"></a><i class="fa fa-external-link"></i></span></p>
</blockquote>
<ol start="7">
<li><h4 id="逻辑运算符"><span class="heading-link">Logical operators</span></h4></li>
</ol>
<blockquote>
<p><span class="external-link"><a href="https://imgchr.com/i/DUq6AJ" target="_blank" rel="noopener"><img src="https://s3.ax1x.com/2020/11/25/DUq6AJ.png" alt="DUq6AJ.png"></a><i class="fa fa-external-link"></i></span></p>
</blockquote>
<ul>
<li><strong>Short-circuiting of logical AND</strong></li>
</ul>
<blockquote>
<p><span class="external-link"><a href="https://imgchr.com/i/DULD2t" target="_blank" rel="noopener"><img src="https://s3.ax1x.com/2020/11/25/DULD2t.png" alt="DULD2t.png"></a><i class="fa fa-external-link"></i></span></p>
</blockquote>
<ul>
<li><strong>Short-circuiting of logical OR</strong></li>
</ul>
<blockquote>
<p><span class="external-link"><a href="https://imgchr.com/i/DULhPs" target="_blank" rel="noopener"><img src="https://s3.ax1x.com/2020/11/25/DULhPs.png" alt="DULhPs.png"></a><i class="fa fa-external-link"></i></span></p>
</blockquote>
<ol start="8">
<li><h3 id="赋值运算符"><span class="heading-link">Assignment operators</span></h3></li>
</ol>
<blockquote>
<p><span class="external-link"><a href="https://imgchr.com/i/DUL4Gn" target="_blank" rel="noopener"><img src="https://s3.ax1x.com/2020/11/25/DUL4Gn.png" alt="DUL4Gn.png"></a><i class="fa fa-external-link"></i></span></p>
</blockquote>
<ol start="9">
<li><h3 id="运算符优先级"><span class="heading-link">Operator precedence</span></h3></li>
</ol>
<blockquote>
<p><span class="external-link"><a href="https://imgchr.com/i/DULTMV" target="_blank" rel="noopener"><img src="https://s3.ax1x.com/2020/11/25/DULTMV.png" alt="DULTMV.png"></a><i class="fa fa-external-link"></i></span></p>
</blockquote>
<ol start="10">
<li><h3 id="Switch语法使用"><span class="heading-link">Using the switch statement</span></h3></li>
</ol>
<blockquote>
<p><span class="external-link"><a href="https://imgchr.com/i/DULLa4" target="_blank" rel="noopener"><img src="https://s3.ax1x.com/2020/11/25/DULLa4.png" alt="DULLa4.png"></a><i class="fa fa-external-link"></i></span></p>
<p><strong>Notes:</strong></p>
<p><span class="external-link"><a href="https://imgchr.com/i/DULzxx" target="_blank" rel="noopener"><img src="https://s3.ax1x.com/2020/11/25/DULzxx.png" alt="DULzxx.png"></a><i class="fa fa-external-link"></i></span></p>
</blockquote>
<ol start="11">
<li><h3 id="Switch语句和if-else-if-语句的区别"><span class="heading-link">Differences between switch and if else if statements</span></h3></li>
</ol>
<blockquote>
<p><span class="external-link"><a href="https://imgchr.com/i/DULvGR" target="_blank" rel="noopener"><img src="https://s3.ax1x.com/2020/11/25/DULvGR.png" alt="DULvGR.png"></a><i class="fa fa-external-link"></i></span></p>
</blockquote>
<ol start="12">
<li><h3 id="断点调试"><span class="heading-link">Breakpoint debugging</span></h3></li>
</ol>
<blockquote>
<p><span class="external-link"><a href="https://imgchr.com/i/DUOkIH" target="_blank" rel="noopener"><img src="https://s3.ax1x.com/2020/11/25/DUOkIH.png" alt="DUOkIH.png"></a><i class="fa fa-external-link"></i></span></p>
</blockquote>
<script>
document.querySelectorAll('.github-emoji')
.forEach(el => {
if (!el.dataset.src) { return; }
const img = document.createElement('img');
img.style = 'display:none !important;';
img.src = el.dataset.src;
img.addEventListener('error', () => {
img.remove();
el.style.color = 'inherit';
el.style.backgroundImage = 'none';
el.style.background = 'none';
});
img.addEventListener('load', () => {
img.remove();
});
document.body.appendChild(img);
});
</script>]]></content>
<categories>
<category>JS Notes</category>
</categories>
<tags>
<tag>js</tag>
</tags>
</entry>
<entry>
<title>AJAX</title>
<url>/2020/11/25/AJAX/</url>
<content><![CDATA[<h2 id="第一节AJAX"><span class="heading-link">Section 1: AJAX</span></h2><h3 id="1-AJAX的产生"><span class="heading-link">1. The Origin of AJAX</span></h3><p> <span class="external-link"><a href="https://imgchr.com/i/DU452F" target="_blank" rel="noopener"><img src="https://s3.ax1x.com/2020/11/25/DU452F.png" alt="DU452F.png"></a><i class="fa fa-external-link"></i></span></p>
<h3 id="2.-AJAX-的优势"><span class="heading-link">2. Advantages of AJAX</span></h3><p><span class="external-link"><a href="https://imgchr.com/i/DU4zxe" target="_blank" rel="noopener"><img src="https://s3.ax1x.com/2020/11/25/DU4zxe.png" alt="DU4zxe.png"></a><i class="fa fa-external-link"></i></span></p>
<h3 id="3-AJAX简介"><span class="heading-link">3. AJAX简介</span></h3><ul>
<li><p><strong>AJAX is a technique for updating parts of a web page without reloading the whole page.</strong></p>
</li>
<li><p><strong>AJAX = Asynchronous JavaScript And XML.</strong></p>
</li>
<li><p><strong>AJAX is a technique for creating fast, dynamic web pages.</strong></p>
</li>
<li><p><strong>By exchanging small amounts of data with the server in the background, AJAX lets a web page update asynchronously: parts of the page can be refreshed without reloading the whole page.</strong></p>
</li>
<li><p><strong>Traditional web pages (without AJAX) must reload the entire page whenever their content needs to change.</strong></p>
</li>
</ul>
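A minimal sketch of such an asynchronous exchange with the classic `XMLHttpRequest` API (the function name, callback names, and URL are placeholders, not part of any particular library):

```javascript
// Fetch JSON in the background; the page itself is never reloaded.
function getJSON(url, onSuccess, onError) {
  const xhr = new XMLHttpRequest();
  xhr.open('GET', url, true);              // true -> asynchronous request
  xhr.onreadystatechange = function () {
    if (xhr.readyState !== 4) return;      // 4 = request finished
    if (xhr.status === 200) {
      onSuccess(JSON.parse(xhr.responseText));
    } else {
      onError(xhr.status);
    }
  };
  xhr.send();
}
```

In a page this might be called as `getJSON('/api/news', render, showError)`, with `render` replacing only the affected part of the DOM.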
<blockquote>
<p>For details, see:</p>
<p><span class="external-link"><a href="https://www.w3school.com.cn/ajax/ajax_intro.asp" target="_blank" rel="noopener">w3cschool</a><i class="fa fa-external-link"></i></span></p>
</blockquote>
<h3 id="4-那XML有是什么呢?"><span class="heading-link">4. So What Is XML?</span></h3><ul>
<li><p><strong>XML was designed to transport and store data, while HTML was designed to display data.</strong></p>
</li>
<li><p><strong>XML stands for eXtensible Markup Language.</strong></p>
</li>
<li><p><strong>XML is a markup language, much like HTML.</strong></p>
</li>
<li><p><strong>XML was designed to transport data, not to display it.</strong></p>
</li>
<li><p><strong>XML tags are not predefined; you define your own tags.</strong></p>
</li>
<li><p><strong>XML is designed to be self-descriptive.</strong></p>
</li>
<li><p><strong>XML is a software- and hardware-independent tool for transporting information.</strong></p>
</li>
</ul>
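The self-describing nature of XML shows in even a tiny document: the `<note>` structure below is the classic w3school example, and `tagText` is a naive illustrative extractor only (real code should use a proper parser such as `DOMParser`):

```javascript
// User-defined tags carry the meaning of the data they wrap.
const note =
  '<note>' +
  '<to>Tove</to>' +
  '<from>Jani</from>' +
  '<body>Reminder</body>' +
  '</note>';

// Naive extraction for illustration: pull the text between <tag> and </tag>.
function tagText(xml, tag) {
  const m = xml.match(new RegExp('<' + tag + '>([^<]*)</' + tag + '>'));
  return m ? m[1] : null;
}
```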
<h2 id="第二节-网络通信"><span class="heading-link">Section 2: Network Communication</span></h2><ol>
<li><h3 id="网络架构"><span class="heading-link">Network Architecture</span></h3></li>
</ol>
<blockquote>
<p><span class="external-link"><a href="https://imgchr.com/i/DU5ZRS" target="_blank" rel="noopener"><img src="https://s3.ax1x.com/2020/11/25/DU5ZRS.png" alt="DU5ZRS.png"></a><i class="fa fa-external-link"></i></span></p>
</blockquote>
<ol start="2">
<li><h3 id="TCP-IP"><span class="heading-link">TCP/IP</span></h3></li>
</ol>
<blockquote>
<p><span class="external-link"><a href="https://imgchr.com/i/DU5nMQ" target="_blank" rel="noopener"><img src="https://s3.ax1x.com/2020/11/25/DU5nMQ.png" alt="DU5nMQ.png"></a><i class="fa fa-external-link"></i></span></p>
</blockquote>
<ol start="3">
<li><h3 id="传输过程"><span class="heading-link">Transmission Process</span></h3></li>
</ol>
<blockquote>
<p><span class="external-link"><a href="https://imgchr.com/i/DU5JRU" target="_blank" rel="noopener"><img src="https://s3.ax1x.com/2020/11/25/DU5JRU.png" alt="DU5JRU.png"></a><i class="fa fa-external-link"></i></span></p>