# Working with Hugging Face
## Getting Started with Hugging Face
### Searching the Hub with Python
The Hugging Face Hub provides a nice user interface for searching for
models and learning more about them. At times, you may find it
convenient to be able to do the same thing without leaving the
development environment. Fortunately, Hugging Face also provides a
Python package which allows you to find models through code.
Use the `huggingface_hub` package to find the most downloaded model for
text classification.
`HfApi` and `ModelFilter` from the `huggingface_hub` package are already
loaded for you.
**Instructions**
- Create an instance of the API for the Hugging Face Hub.
- Return a list of one item: the most downloaded model for the text
  classification task.
- Store the returned object as a list named `modelList`.
**Answer**
```{python}
# Create the instance of the API
api = HfApi()
# Return the filtered list from the Hub
models = api.list_models(
    filter=ModelFilter(task="text-classification"),
    sort="downloads",
    direction=-1,
    limit=1
)
# Store as a list
modelList = list(models)
print(modelList[0].modelId)
```
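Outside the exercise environment you would import these names yourself. Note that `ModelFilter` has been deprecated in recent `huggingface_hub` releases, where `list_models()` accepts the task directly; a rough equivalent, assuming a recent release, is:

```{python}
from huggingface_hub import HfApi

api = HfApi()
# Recent releases let you filter by task without ModelFilter
models = api.list_models(task="text-classification", sort="downloads", direction=-1, limit=1)
print(list(models)[0].modelId)
```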
### Saving a model
There may be situations where downloading and storing a model locally
(i.e., in a directory on your computer) is desired, for example, when
working offline.
Practice downloading and saving here. An instance of `AutoModel` is
already loaded for you under the same name.
**Instructions**
- Instantiate the model class for the
`distilbert-base-uncased-finetuned-sst-2-english` model.
- Save the model as the `modelId` under "models/".
**Answer**
```{python}
modelId = "distilbert-base-uncased-finetuned-sst-2-english"
# Instantiate the AutoModel class
model = AutoModel.from_pretrained(modelId)
# Save the model
model.save_pretrained(save_directory=f"models/{modelId}")
```
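Once saved, the local directory can be passed straight back to `from_pretrained()`, which then works without a network connection:

```{python}
# Reload the model from the local directory (no download needed)
local_model = AutoModel.from_pretrained(f"models/{modelId}")
```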
### Inspecting datasets
The datasets on Hugging Face range in size, information, and features.
Therefore, it's beneficial to inspect a dataset before committing to
loading it into your environment.
Let's inspect the "derenrich/wikidata-en-descriptions-small" dataset.
*Note: this exercise may take a minute due to the dataset size.*
**Instructions**
- Import `load_dataset_builder`.
- Create the dataset builder to inspect the dataset.
- Print the features for the dataset.
**Answer**
```{python}
# Load the module
from datasets import load_dataset_builder
# Create the dataset builder
reviews_builder = load_dataset_builder("derenrich/wikidata-en-descriptions-small")
# Print the features
print(reviews_builder.info.features)
```
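The same `DatasetInfo` object exposes other useful metadata; depending on how the dataset card was filled in, some of these fields may be empty:

```{python}
# Additional metadata available before downloading anything
print(reviews_builder.info.description)
print(reviews_builder.info.splits)
print(reviews_builder.info.dataset_size)
```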
### Loading datasets
Hugging Face built the `datasets` package for interacting with datasets.
There are a lot of convenient functions, including
`load_dataset_builder`, which we just used. After inspecting a dataset to
ensure it's the right one for your project, it's time to load the
dataset! For this, we can leverage input parameters for `load_dataset`
to specify which parts of a dataset to load, e.g., the "train" split of
English Wikipedia articles.
The `load_dataset` module from the `datasets` package is already loaded
for you. Note: the `load_dataset` function was modified for the purpose
of this exercise.
**Instructions**
- Load the `"wikimedia/wikipedia"` dataset and save as `wikipedia`.
**Answer**
```{python}
# Load the train portion of the dataset
wikipedia = load_dataset("wikimedia/wikipedia", language="20231101.en", split="train")
print(f"The length of the dataset is {len(wikipedia)}")
```
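Since the note above mentions `load_dataset` was modified for this exercise, keep in mind that the standard `datasets` API passes the configuration name as the second positional argument rather than a `language` keyword:

```{python}
# Standard datasets API (outside this exercise)
wikipedia = load_dataset("wikimedia/wikipedia", "20231101.en", split="train")
```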
### Manipulating datasets
There will likely be many occasions when you need to manipulate a
dataset before using it within an ML task. Two common manipulations are
filtering and selecting (or slicing). Given the size of these datasets,
Hugging Face stores them as Apache Arrow files.
This means performing manipulations is slightly different than what you
might be used to. Fortunately, there are already methods to help with
this!
The dataset is already loaded for you under `wikipedia`.
**Instructions**
- Filter the dataset for rows with the term "football" in the `text`
column and save as `filtered`.
- Select a single example from the filtered dataset and save as
`example`.
**Answer**
```{python}
# Filter the documents
filtered = wikipedia.filter(lambda row: "football" in row["text"])
# Create a sample dataset
example = filtered.select(range(1))
print(example[0]["text"])
```
## Building Pipelines with Hugging Face
### Getting started with pipelines
Hugging Face has an ecosystem of libraries that allow users to leverage
tools at different levels. The `pipeline` module from the `transformers`
library is a great place to get started with performing ML tasks. It
removes the requirement for training models, allowing for quicker
experimentation and results. It does this by being a wrapper around
underlying objects, functions, and processes.
Getting started with pipeline can be done by defining a task or model.
This helps with quick experimentation as you become familiar with the
library.
Create your first pipelines for sentiment analysis. The `input` is a
sentence string that is already loaded for you.
**Instructions**
- Import `pipeline` from the `transformers` library.
- Create the first pipeline by specifying the task "sentiment-analysis"
and save as `task_pipeline`.
- Create another pipeline but only specify the model,
`distilbert-base-uncased-finetuned-sst-2-english` and save as
`model_pipeline`.
- Predict the sentiment of `input` using both pipelines.
**Answer**
```{python}
# Import pipeline
from transformers import pipeline
# Create the task pipeline
task_pipeline = pipeline(task="sentiment-analysis")
# Create the model pipeline
model_pipeline = pipeline(model="distilbert-base-uncased-finetuned-sst-2-english")
# Predict the sentiment
task_output = task_pipeline(input)
model_output = model_pipeline(input)
print(f"Sentiment from task_pipeline: {task_output[0]['label']}; Sentiment from model_pipeline: {model_output[0]['label']}")
```
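Note that when only a task is given, `transformers` falls back to a default checkpoint for that task and logs a warning suggesting you pin a model explicitly; for sentiment analysis the default is currently the same DistilBERT checkpoint, so both pipelines here should agree.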
### Using AutoClasses
AutoClasses offer more control for machine learning tasks, and they can
also be used with `pipeline()` for quick application. It's a nice
balance of control and convenience.
Continue with the sentiment analysis task and combine AutoClasses with
the pipeline module.
`AutoModelForSequenceClassification` and `AutoTokenizer` from the
`transformers` library have already been imported for you and the input
text is saved as `input`.
**Instructions**
- Download the model and tokenizer for
`"distilbert-base-uncased-finetuned-sst-2-english"` and save as
`model` and `tokenizer`, respectively.
- Create the pipeline using `model` and `tokenizer` and save as
`sentimentAnalysis`.
- Predict the output using `sentimentAnalysis` and save as `output`.
**Answer**
```{python}
# Download the model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
# Create the pipeline
sentimentAnalysis = pipeline(task="sentiment-analysis", model=model, tokenizer=tokenizer)
# Predict the sentiment
output = sentimentAnalysis(input)
print(f"Sentiment using AutoClasses: {output[0]['label']}")
```
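To see what `pipeline()` is wrapping, here is a rough sketch of the same prediction done manually, assuming PyTorch is installed: tokenize, run a forward pass, softmax the logits, and map the winning id to a label.

```{python}
import torch

# Tokenize the exercise's preloaded `input` string and run a forward pass
inputs = tokenizer(input, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
# Convert logits to probabilities and look up the label
probs = torch.softmax(logits, dim=-1)
pred_id = int(probs.argmax())
print(model.config.id2label[pred_id], float(probs[0, pred_id]))
```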
### Comparing models with the pipeline
One of the great benefits of the `pipeline()` module is the ease with
which you can experiment with different models simply by changing the
`model` input. This is a good way to determine which model works best
for a particular task or dataset that you are working with.
Experiment with two sentiment analysis models by creating pipelines for
each, then using them to predict the sentiment for a sentence.
`pipeline` from the `transformers` library is already loaded for you.
The example input sentence is saved as `input`.
**Instructions**
- Create a pipeline for labeling text as positive or negative, using the
model `"distilbert-base-uncased-finetuned-sst-2-english"`, and save as
`distil_pipeline`.
- Predict the sentiment for the `input` and save as `distil_output`.
- Repeat the same steps for the model, `"kwang123/bert-sentiment-analysis"` and save as `bert_pipeline` and `bert_output`.
**Answer**
```{python}
# Create the pipeline
distil_pipeline = pipeline(task="sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
# Predict the sentiment
distil_output = distil_pipeline(input)
# Create the second pipeline and predict the sentiment
bert_pipeline = pipeline(task="sentiment-analysis", model="kwang123/bert-sentiment-analysis")
bert_output = bert_pipeline(input)
print(f"Bert Output: {bert_output[0]['label']}")
print(f"Distil Output: {distil_output[0]['label']}")
```
### Normalizing text
An important step to performing an NLP task is tokenizing the input
text. This makes the text more understandable and manageable for the ML
models, or other algorithms.
Before performing tokenization, it's best to run normalization steps,
e.g., removing white spaces and accents, lowercasing, and more. Each
tokenizer available in Hugging Face uses its own normalization and
tokenization processes.
Let's take a look at what normalization the `distilbert-base-uncased`
tokenizer applies to the `input_string`, "HOWDY, how aré yoü?".
**Instructions**
- Import `AutoTokenizer` from `transformers`.
- Download the tokenizer for `distilbert-base-uncased` using
`AutoTokenizer` and save as `tokenizer`.
- Run the normalization process for the `tokenizer`.
**Answer**
```{python}
# Import the AutoTokenizer
from transformers import AutoTokenizer
# Download the tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
# Normalize the input string
output = tokenizer.backend_tokenizer.normalizer.normalize_str(input_string)
print(output)
```
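For this uncased checkpoint, the normalizer lowercases the text and strips accents, so the printed output should be along the lines of `howdy, how are you?`.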
### Comparing tokenizer output
Most models in Hugging Face will have an associated tokenizer that will
help prepare the input data based on what the model expects. After
normalization, the tokenizer will split the input into smaller chunks
based on the chosen algorithm. This is known as "pre-tokenization".
Let's explore the different types of pre-tokenization by performing this
process with two tokenizers on the same input. We will be using
`DistilBertTokenizer` and `GPT2Tokenizer` which have already been loaded
for you. The input text string, "Pineapple on pizza is pretty good, I
guess" is saved as `input`.
**Instructions**
- Download the `"gpt2"` tokenizer and save as `gpt_tokenizer`.
- Use `gpt_tokenizer` to create the tokens from the `input`.
- Repeat the same two steps for
`"distilbert-base-uncased-finetuned-sst-2-english"`.
**Answer**
```{python}
# Download the gpt tokenizer
gpt_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# Tokenize the input
gpt_tokens = gpt_tokenizer.tokenize(text=input)
# Repeat for distilbert
distil_tokenizer = DistilBertTokenizer.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)
distil_tokens = distil_tokenizer.tokenize(text=input)
# Compare the output
print(f"GPT tokenizer: {gpt_tokens}")
print(f"DistilBERT tokenizer: {distil_tokens}")
```
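The two outputs illustrate different subword schemes: GPT-2 uses byte-level BPE and marks word-initial spaces with `Ġ`, while DistilBERT uses WordPiece and marks word-continuation pieces with `##`.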
### Grammatical correctness
Text classification is the process of labeling an input text into a
pre-defined category. This can take the form of sentiment (`positive`
or `negative`), spam detection (`spam` or `not spam`), and even
grammatical correctness.
Explore the use of a `text-classification` pipeline for checking an
input sentence for grammatical errors.
`pipeline` from the `transformers` library is already loaded for you.
**Instructions**
- Create a pipeline for the task `text-classification` and use the model
`"abdulmatinomotoso/English_Grammar_Checker"`, saving the pipeline as
`classifier`.
- Use the `classifier` to predict the grammatical correctness of the
input sentence provided and save as `output`.
**Answer**
```{python}
# Create a pipeline
classifier = pipeline(
    task="text-classification",
    model="abdulmatinomotoso/English_Grammar_Checker"
)
# Predict classification
output = classifier("I will walk dog")
print(output)
```
### Question Natural Language Inference
Another task under the text classification umbrella is Question Natural
Language Inference, or QNLI. This determines whether a piece of text
contains enough information to answer a posed question. This requires
the model to perform logical reasoning, which is important for Q&A
applications.
Performing different tasks with the `text-classification` pipeline can
be done by choosing different models. Each model is trained to predict
specific labels and optimized for learning different contexts within a
text.
`pipeline` from the `transformers` library is already loaded for you.
**Instructions**
- Create a text classification QNLI pipeline using the model
`"cross-encoder/qnli-electra-base"` and save as `classifier`.
- Use this classifier to predict if the text contains information to
answer the question.
**Answer**
```{python}
# Create the pipeline
classifier = pipeline(task="text-classification", model="cross-encoder/qnli-electra-base")
# Predict the output
output = classifier("Where is the capital of France?, Brittany is known for their kouign-amann.")
print(output)
```
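Cross-encoder models like this one are usually fed the question and the passage as an explicit pair rather than one concatenated string; the `text-classification` pipeline accepts a dictionary for this. A sketch:

```{python}
# Pass the question and passage as an explicit text pair
output = classifier({
    "text": "Where is the capital of France?",
    "text_pair": "Brittany is known for their kouign-amann."
})
print(output)
```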
### Zero-shot classification
Zero-shot classification is the ability for a transformer to predict a
label from a new set of classes which it wasn't originally trained to
identify. This is possible through its transfer learning capabilities.
It can be an extremely valuable tool.
Hugging Face `pipeline()` also has a `zero-shot-classification` task.
These pipelines require both an input text and candidate labels.
Build a zero-shot classifier to predict the label for the input `text`,
a news headline that has been loaded for you.
`pipeline` from the `transformers` library is already loaded for you.
Note that we are using our own version of the pipeline function to
enable you to learn how to use these functions without having to
download the model.
**Instructions**
- Build the pipeline for a zero-shot-classification task and save as
`classifier`.
- Create a list of the labels - "politics", "science", "sports" - and
save as `candidate_labels`.
- Predict the label of `text` using the classifier and candidate labels.
**Answer**
```{python}
# Build the zero-shot classifier
classifier = pipeline(task="zero-shot-classification", model="facebook/bart-large-mnli")
# Create the list
candidate_labels = ["politics", "science", "sports"]
# Predict the output
output = classifier(text, candidate_labels)
print(f"Top Label: {output['labels'][0]} with score: {output['scores'][0]}")
```
### Summarizing long text
Summarization is a useful task for reducing a large piece of text into
something more manageable. This could be beneficial for multiple reasons,
like reducing the amount of time a reader needs to spend to obtain the
important points of a piece of text.
The Hugging Face `pipeline()` task, "summarization", builds a
summarization pipeline, which is a quick way to perform summarization on
a piece of text. You'll do that by creating a new pipeline and using it
to summarize a piece of text from a Wikipedia page on Greece.
`pipeline` from the `transformers` library and the `original_text` have
already been loaded for you.
**Instructions**
- Create the summarization `pipeline` using the task "summarization" and
save as `summarizer`.
- Use the new pipeline to create a summary of the text and save as
`summary_text`.
- Compare the length of the original and summary text.
**Answer**
```{python}
# Create the summarization pipeline
summarizer = pipeline(task="summarization", model="cnicu/t5-small-booksum")
# Summarize the text
summary_text = summarizer(original_text)
# Compare the length
print(f"Original text length: {len(original_text)}")
print(f"Summary length: {len(summary_text[0]['summary_text'])}")
```
### Using min_length and max_length
The `pipeline()` function has two important parameters: `min_length`
and `max_length`. These are useful for adjusting the length of the
resulting summary text to be short, longer, or within a certain number
of words. You might want to do this if there are space constraints
(i.e., small storage), to enhance readability, or to improve the quality
of the summary.
You'll experiment with a short and long summarizer by setting these two
parameters to a small range, then a wider range.
`pipeline` from the `transformers` library and the `original_text` have
already been loaded for you.
**Instructions**
- Create a summarization pipeline using a `min_length` of 1 and
`max_length` of 10; save as `short_summarizer`.
- Summarize the `original_text` using the `short_summarizer` and save
the result as `short_summary_text`.
- Repeat these steps for a summarization pipeline that has a minimum length of 50 and maximum of 150; save as `long_summarizer` and `long_summary_text`, respectively.
**Answer**
```{python}
# Create a short summarizer
short_summarizer = pipeline(task="summarization", model="cnicu/t5-small-booksum", min_length=1, max_length=10)
# Summarize the input text
short_summary_text = short_summarizer(original_text)
# Print the short summary
print(short_summary_text[0]["summary_text"])
# Repeat for a long summarizer
long_summarizer = pipeline(task="summarization", model="cnicu/t5-small-booksum", min_length=50, max_length=150)
long_summary_text = long_summarizer(original_text)
# Print the long summary
print(long_summary_text[0]["summary_text"])
```
### Summarizing several inputs
Oftentimes, you'll be working on projects where summarization will
occur over an entire dataset or list of items, not just a single piece
of text. Fortunately, this can be done by passing in a list of text
items, which will return a list of summarized texts.
You'll build a final summarization pipeline and use it to summarize a
list of text items from the `wiki` dataset.
`pipeline` from the `transformers` library and the dataset `wiki` have
already been loaded for you.
**Instructions**
- Create a list of text items to summarize from the `wiki` dataset and
save as `text_to_summarize`.
- Create a summarization pipeline using a `min_length` of 20 and a
`max_length` of 50 and save as `summarizer`.
- Summarize the first three items in the `text_to_summarize` list
setting `truncation` to `True`.
- Create a for-loop that will print each text summary.
**Answer**
```{python}
# Create the list
text_to_summarize = [w["text"] for w in wiki]
# Create the pipeline
summarizer = pipeline("summarization", model="cnicu/t5-small-booksum", min_length=20, max_length=50)
# Summarize each item in the list
summaries = summarizer(text_to_summarize[:3], truncation=True)
# Create for-loop to print each summary
for i in range(3):
    print(f"Summary {i+1}: {summaries[i]['summary_text']}")
```
## Building Pipelines for Image and Audio
### Processing image data
Just like text inputs, image inputs will typically require
pre-processing before being used with a pipeline for an image-based
machine learning task, such as image classification. Some common transformations
include cropping and resizing. Fortunately, Hugging Face provides
modules for performing these steps via the `image_transforms` module in
the `transformers` library.
Use this module to apply a transformation to a fashion image.
<img
src="https://assets.datacamp.com/production/repositories/6536/datasets/b8f0d79998622cb33ed09a3c48f1d20b25712bbe/fashion.jpeg"
height="500" />
`image_transforms` from the `transformers` library has already been
loaded for you, as well as the JPEG saved as `original_image`.
**Instructions**
- Convert the image to a numpy array.
- Crop the center of the image to keep a new 200 x 200 image using
`image_transforms` and save as `cropped_image`.
**Answer**
```{python}
# Create the numpy array
image_array = np.array(original_image)
# Crop the center of the image
cropped_image = image_transforms.center_crop(image=image_array, size=(200, 200))
imgplot = plt.imshow(cropped_image)
plt.show()
```
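As a standalone sketch, spelling out the imports the exercise pre-loads for you and assuming a recent `transformers` release, the same module also offers `resize()` for the other common transformation mentioned above:

```{python}
import numpy as np
import matplotlib.pyplot as plt
from transformers import image_transforms

image_array = np.array(original_image)
# Resize instead of cropping
resized_image = image_transforms.resize(image=image_array, size=(200, 200))
plt.imshow(resized_image)
plt.show()
```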
### Creating an image classifier
Image classification is the process of labeling an image based on its
content. This is useful for many reasons, from improving search to saving
agricultural crops from disease. It is also helpful for identifying
clothing items in fashion photos.
Build an image classification pipeline using a model trained on
identifying clothing items to classify the image you just cropped.
Both `pipeline` from the `transformers` library and the image, saved as
`cropped_image`, have already been loaded for you.
**Instructions**
- Create the image classifier pipeline using the model provided and save
as `image_classifier`.
- Predict the class of the `cropped_image` and save as `results`.
- Print the predicted `"label"` of the result.
**Answer**
```{python}
# Create the pipeline
image_classifier = pipeline(task="image-classification",
                            model="abhishek/autotrain_fashion_mnist_vit_base")
# Predict the class of the image
results = image_classifier(cropped_image)
# Print the results
print(results[0]["label"])
```
### Document question and answering
Document question and answering is a multi-modal ML task which analyzes
an image of a document, such as a contract, converts it to text, and
allows a question to be asked about the text. This is useful when there
are many scanned documents that need to be searched, for example,
financial records.
Build a pipeline for document question and answering, then ask the
pre-loaded question `Which meeting is this document about?`.
`pipeline` from the `transformers` library and the `question` are
already loaded for you. Note that we are using our own pipeline and dqa
functions to enable you to learn how to use these functions without some
of the extra setup. Please visit the Hugging Face documentation to dive
deeper.
**Instructions**
- Create a pipeline for `document-question-answering` and save as `dqa`.
- Save the path to the image, `document.png`, as `image`.
- Get the answer for the `question` of the `image` using the `dqa`
pipeline and save as `results`.
**Answer**
```{python}
# Create the pipeline
dqa = pipeline(task="document-question-answering", model="naver-clova-ix/donut-base-finetuned-docvqa")
# Set the image and question
image = "document.png"
question = "Which meeting is this document about?"
# Get the answer
results = dqa(image=image, question=question)
print(results)
```
### Visual question and answering
Visual question and answering is an ML task that attempts to provide the
best answer for a question about an image. The model will analyze the
content of the image and return a label as the answer.
For example, if asking about the clothes a model is wearing, the model
may return clothing items as the label. Such a task can be beneficial
for people who are visually impaired or as a classification method
(similar to image classification but more open-ended).
`pipeline` from the `transformers` library and both the `question` and
`image` are already loaded for you.
**Instructions**
- Create a visual question and answering pipeline by setting the `task`
to `visual-question-answering` and save as `vqa`.
- Use the `vqa` pipeline to get an answer for the `image` and
`question`, then save as `results`.
**Answer**
```{python}
# Create pipeline
vqa = pipeline(task="visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")
# Use image and question in vqa
results = vqa(image=image, question=question)
print(results)
```
### Resampling audio files
The sampling rate of an audio file determines the resolution. The higher
the sampling rate, the higher the resolution, which provides more detail
about the sound wave itself.
When performing ML tasks, it is important to ensure each file has the
same sampling rate. This will maintain consistency and prepare the audio
files based on what the model expects regarding the number of data
points per audio file.
The dataset, `audio_file`, and `Audio` from the `datasets` library are
already loaded for you.
**Instructions**
- Save the old sampling rate, for reference, as `old_sampling_rate`.
- Resample the `audio` column to a new rate of 16,000 Hz (16 kHz) and
  save to `audio_file`.
- Compare the old and new sampling rates.
**Answer**
```{python}
# Save the old sampling rate
old_sampling_rate = audio_file[1]["audio"]["sampling_rate"]
# Resample the audio files
audio_file = audio_file.cast_column("audio", Audio(sampling_rate=16_000))
# Compare the old and new sampling rates
print("Old sampling rate:", old_sampling_rate)
print("New sampling rate:", audio_file[1]["audio"]["sampling_rate"])
```
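Note that `cast_column` with `Audio(sampling_rate=...)` doesn't rewrite the underlying files; the resampling is applied on the fly each time an example's `audio` column is accessed.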
### Filtering out audio files
There will be occasions where you will want, or need, to filter a
dataset based on a specific criterion. A common example of this is
filtering for audio files that are under a specified duration.
The `librosa` and `numpy` libraries, as well as the `dataset`, have
already been loaded for you. Note: we have modified the `librosa`
library for the purposes of this exercise, but the functionality and
pattern is the same.
**Instructions**
- Loop over each `row` of the audio paths in the `dataset` and calculate
the duration, appending to the `old_durations_list`.
- Create a new column called `duration` using `old_durations_list` and
save to `dataset`.
- Filter the `dataset` for audio under 6.0 seconds using a lambda function and the column `duration`; save as `filtered_dataset`.
- Save the new durations as a list called `new_durations_list`.
**Answer**
```{python}
# Create a list of durations
old_durations_list = []
# Loop over the dataset
for row in dataset["path"]:
    old_durations_list.append(librosa.get_duration(path=row))
# Create a new column
dataset = dataset.add_column("duration", old_durations_list)
# Filter the dataset
filtered_dataset = dataset.filter(lambda d: d < 6.0, input_columns=["duration"], keep_in_memory=True)
# Save new durations
new_durations_list = filtered_dataset["duration"]
print("Old duration:", np.mean(old_durations_list))
print("New duration:", np.mean(new_durations_list))
```
### Classifying audio files
Audio classification can be used for any task that requires labeling a
piece of audio based on its content. A common use case is identifying
spoken languages.
You will do just that using an example from the `common_language`
dataset. The model `facebook/mms-lid-126` from Meta is commonly used for
this task given its coverage of languages.
`pipeline` from the `transformers` library as well as the `dataset` have
been loaded for you. It has been modified for the purposes of this
exercise.
**Instructions**
- Create the pipeline for audio classification and save as `classifier`.
- Extract the sample audio and sentence and save as `audio` and
`sentence`, respectively.
- Predict the label for the audio using the `classifier` and save as
`prediction`.
**Answer**
```{python}
# Create the pipeline
classifier = pipeline(task="audio-classification", model="facebook/mms-lid-126")
# Extract the sample
audio = dataset[1]["audio"]["array"]
sentence = dataset[1]["sentence"]
# Predict the language
prediction = classifier(audio)
print(f"Predicted language is '{prediction[0]['label'].upper()}' for the sentence '{sentence}'")
```
### Instantiating an ASR pipeline
You've been tasked with generating text from a dataset of audio files.
Accuracy is important, so you need to make sure you choose the best
model for automatic speech recognition. You also don't have the time to
train your own model.
Compare the predictions between the Wav2Vec and Whisper models by
instantiating two pipelines for automatic speech recognition. You want
to test out the functionality, so you should try it out on one example
first.
`pipeline` from the `transformers` package is already loaded for you.
Likewise, the dataset has already been loaded for you, resampled, and
saved as `english`. One audio file with its associated metadata is
saved as `example`.
**Instructions**
- Instantiate the first automatic speech recognition pipeline for the
"wav2vec2" model from Meta.
- Predict the text from the `example` audio.
- Repeat these two steps for the "whisper-tiny" model from OpenAI in
order to compare the predictions.
**Answer**
```{python}
# Create an ASR pipeline using Meta's wav2vec model
meta_asr = pipeline(task="automatic-speech-recognition", model="facebook/wav2vec2-base-960h")
# Predict the text from the example audio
meta_pred = meta_asr(example["audio"]["array"])["text"].lower()
# Repeat for OpenAI's Whisper model
open_asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")
open_pred = open_asr(example["audio"]["array"])["text"].lower()
# Print the prediction from both models
print("META:", meta_pred)
print("OPENAI:", open_pred)
```
### Word error rate
The Wav2Vec and Whisper models predicted very similar text, with only
some minor differences. Luckily, for this first example record, you
have the true sentence for reference. You can use Word Error Rate (WER)
to determine which model quantitatively performed best.
`load` from the `evaluate` package has already been loaded for you.
Likewise, the `example` and predictions - `meta_pred` and `open_pred` -
were saved from the previous exercise.
**Instructions**
- Instantiate the word error rate metric object and save as `wer`.
- Save the true sentence of the example as `true_sentence`.
- Compute the word error rate for each model prediction and save as
`meta_wer` and `open_wer`, respectively.
**Answer**
```{python}
# Create the word error rate metric
wer = load("wer")
# Save the true sentence of the example
true_sentence = example["sentence"].lower()
# Compute the wer for each model prediction
meta_wer = wer.compute(predictions=[meta_pred], references=[true_sentence])
open_wer = wer.compute(predictions=[open_pred], references=[true_sentence])
print(f"The WER for the Meta model is {meta_wer} and for the OpenAI model is {open_wer}")
```
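WER is a word-level edit distance: (substitutions + deletions + insertions) divided by the number of words in the reference, so lower is better and a perfect transcription scores 0. A quick hand-check with a toy pair:

```{python}
# One inserted word against a 3-word reference: 1/3 ≈ 0.33
toy_wer = wer.compute(predictions=["the cat sat down"], references=["the cat sat"])
print(toy_wer)
```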
### Iterating over a dataset
You were able to test the functionality and understand the performance
on one example from the dataset. Now, let's evaluate the models over the
first 100 audio files to make a final decision about which is best for
this dataset.
In order to do this efficiently, you can create a function that will
iterate over the rows of the dataset and yield a set of audio and true
sentence pairs on each iteration.
The dataset, `english`, ASR models - `meta_asr` and `open_asr` - and
`pandas` have all been loaded for you.
**Instructions**
- Create the `data()` function to iterate over the first 3 rows of the
dataset.
- Within `data()`, predict the text for each audio file using both the
`meta_asr` and `open_asr` pipelines.
- Append the results as a dictionary in the `output` list.
**Answer**
```{python}
# Create the data function
def data(n=3):
    for i in range(n):