1
00:00:05,330 --> 00:00:12,500
Now we'll proceed with Lesson 6, Project 3 on audio scene recognition with the microphone, specifically

2
00:00:12,800 --> 00:00:14,570
for theory and data collection.

3
00:00:15,080 --> 00:00:16,190
Here we are.

4
00:00:16,190 --> 00:00:21,740
Wio Terminal and Edge Impulse will be used in this project to train and deploy an audio scene classifier.

5
00:00:22,700 --> 00:00:29,030
We will learn how to do this. Audio scene classification is a task in which a machine learning

6
00:00:29,030 --> 00:00:38,360
model has to figure out what kind of sound a piece of audio is, like a crying baby, a cough, a dog barking,

7
00:00:38,360 --> 00:00:39,140
and so on.

8
00:00:39,680 --> 00:00:43,130
Now, let's learn more about how computers process sound.

9
00:00:43,820 --> 00:00:50,480
Sound is a vibration that moves through a transmission medium like a gas, liquid, or solid.

10
00:00:50,960 --> 00:00:54,230
It moves through the medium as an acoustic wave.

11
00:00:54,980 --> 00:00:59,750
Now there are molecules in the surrounding medium that are being pushed by the source of the sound.

12
00:01:00,290 --> 00:01:04,820
They push the molecules that are next to them and so on and so forth.

13
00:01:05,330 --> 00:01:08,570
When they touch another object, it also vibrates a little.

14
00:01:09,020 --> 00:01:10,880
Now this is how microphones work.

15
00:01:11,330 --> 00:01:18,350
The microphone membrane is pushed inward by the molecules and then moves back to where it was before.

16
00:01:19,430 --> 00:01:26,810
In the circuit, the voltage is related to the sound amplitude, so the louder the sound, the

17
00:01:26,810 --> 00:01:28,220
more the membrane is pushed.

18
00:01:28,730 --> 00:01:31,010
This causes an alternating current in the circuit.

19
00:01:31,430 --> 00:01:41,330
With a sampling rate of 8000 Hz, we take a measurement of the sound 8000 times a second using an analog

20
00:01:41,330 --> 00:01:42,650
to digital converter.

21
00:01:43,220 --> 00:01:49,910
We read this voltage with the converter and record it at equal time intervals so that we get

22
00:01:49,910 --> 00:01:52,340
the same amount of data each time we do this.

23
00:01:52,940 --> 00:01:58,250
If we sample too slowly, we might miss important parts of the sound, and if we do it too

24
00:01:58,250 --> 00:01:58,760
quickly, we collect more data than we need.

25
00:01:59,480 --> 00:02:03,170
The numbers we use to record sound digitally also play a role.

26
00:02:03,860 --> 00:02:09,620
The larger the range of numbers we use, the more nuances we can keep from the original sound.

27
00:02:10,010 --> 00:02:17,870
That is called the audio bit depth. You might have heard terms like 8-bit sound and 16-bit sound.

28
00:02:18,350 --> 00:02:26,210
It refers to how many bits are used for each sample. For 8-bit sound, an unsigned 8-bit integer with a

29
00:02:26,210 --> 00:02:31,880
range of 0 to 255 is used, which is exactly what it says on the box.

30
00:02:32,360 --> 00:02:33,680
For 16-bit sound, a signed 16-bit integer is used, so the range is

31
00:02:34,040 --> 00:02:38,480
-32,768 to 32,767.

32
00:02:39,400 --> 00:02:39,820
All right.

33
00:02:40,360 --> 00:02:46,240
So in the end, we have a long string of numbers, with bigger numbers indicating louder parts of the

34
00:02:46,240 --> 00:02:53,740
sound. We can visualize it like this: this is one second of a gunshot sound recorded at an 8000 Hz sampling rate

35
00:02:53,740 --> 00:02:57,850
with an 8-bit depth, so values from 0 to 255.
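To make the sampling rate and bit depth concrete, here is a minimal sketch in Python. It is not part of the course materials: the 440 Hz test tone and the helper names sample_sine, to_uint8, and to_int16 are made up for illustration, and a synthetic sine wave stands in for the microphone signal.

```python
import math

SAMPLE_RATE = 8000  # measurements per second (Hz)

def sample_sine(freq_hz, duration_s, sample_rate=SAMPLE_RATE):
    """Return samples of a sine wave, each in the range -1.0..1.0."""
    n = int(duration_s * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]

def to_uint8(x):
    """Map -1.0..1.0 to an unsigned 8-bit value in 0..255."""
    return max(0, min(255, round((x + 1.0) * 127.5)))

def to_int16(x):
    """Map -1.0..1.0 to a signed 16-bit value in -32768..32767."""
    return max(-32768, min(32767, round(x * 32767)))

samples = sample_sine(440, 1.0)            # hypothetical 440 Hz test tone
print(len(samples))                        # 8000 values for one second of sound
print([to_uint8(s) for s in samples[:5]])  # coarse 8-bit steps
print([to_int16(s) for s in samples[:5]])  # finer 16-bit steps
```

The same second of sound always produces 8000 numbers at this sampling rate. Only the range of each number changes with the bit depth.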

36
00:02:58,300 --> 00:03:00,250
And this is an example of it.

37
00:03:02,390 --> 00:03:06,860
Now we can cut and paste parts of it, but for studying sound, it's too raw.

38
00:03:07,840 --> 00:03:14,500
It's at this point that the Fourier transform, the mel scale, spectrograms, and cepstral coefficients come

39
00:03:14,500 --> 00:03:15,010
into play.

40
00:03:15,280 --> 00:03:21,160
In this part of the project, we'll talk about the Fourier transform, which is a mathematical transformation

41
00:03:21,430 --> 00:03:26,680
that lets us break a signal down into its individual frequencies and their amplitudes.

42
00:03:27,160 --> 00:03:32,620
Or if you'd like to use a metaphor: given the smoothie, it outputs the recipe.

43
00:03:33,280 --> 00:03:37,990
This is what our sound looks like after applying the Fourier transform.

44
00:03:38,470 --> 00:03:43,060
So the higher bars correspond to frequencies with larger amplitudes.
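As a rough illustration of "given the smoothie, it outputs the recipe", here is a minimal sketch of a naive discrete Fourier transform in plain Python. The function name dft_magnitudes, the 50 Hz and 120 Hz test tones, and the low 800 Hz sampling rate are assumptions chosen so the slow loop finishes quickly. Real tools use a fast Fourier transform instead.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Naive discrete Fourier transform: one amplitude per frequency bin."""
    n = len(samples)
    mags = []
    for k in range(n // 2 + 1):  # bins from 0 Hz up to half the sampling rate
        total = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        # Scale so a pure sine of amplitude A shows up as roughly A
        # (valid for bins other than 0 Hz and the highest bin).
        mags.append(abs(total) * 2 / n)
    return mags

SAMPLE_RATE = 800  # Hz, hypothetical value for the demo
N = 400            # number of samples, so each bin is 2 Hz wide

# The "smoothie": a loud 50 Hz tone mixed with a quieter 120 Hz tone.
signal = [1.0 * math.sin(2 * math.pi * 50 * t / SAMPLE_RATE)
          + 0.3 * math.sin(2 * math.pi * 120 * t / SAMPLE_RATE)
          for t in range(N)]

mags = dft_magnitudes(signal)
bin_hz = SAMPLE_RATE / N
loudest = max(range(len(mags)), key=lambda k: mags[k])
print(loudest * bin_hz)  # 50.0, the frequency with the highest bar
```

The "recipe" is the list mags: bin 25 (50 Hz) has an amplitude close to 1.0 and bin 60 (120 Hz) close to 0.3, while the other bins are near zero.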

45
00:03:43,660 --> 00:03:49,960
Now, for example, if we want to reduce the size of an audio file, we can cut the low-amplitude frequencies.

46
00:03:50,140 --> 00:03:52,740
We can also remove noise or the sound of a voice.

47
00:03:52,870 --> 00:03:58,600
However, if we do a Fourier transform, we lose all the information about the time domain of the signal,

48
00:03:58,930 --> 00:04:01,990
which is not good for things like human speech that don't repeat.

49
00:04:02,590 --> 00:04:06,130
So we take the signal and do a Fourier transform on it multiple times.

50
00:04:06,580 --> 00:04:13,090
This is like cutting it up and then putting the data from the multiple Fourier transforms back together

51
00:04:13,090 --> 00:04:14,170
to make a spectrogram.

52
00:04:14,860 --> 00:04:23,860
Now, here the x-axis is time, the y-axis is frequency, and the amplitude of each frequency is expressed

53
00:04:23,860 --> 00:04:24,910
through a color.

54
00:04:25,360 --> 00:04:28,060
Brighter colors correspond to a larger amplitude.
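The slicing and stitching described above can be sketched as follows. This is a simplified illustration, not the processing that Edge Impulse performs: the frame size of 64 samples, the hop of 32 samples, the 1600 Hz sampling rate, and the two test tones are assumptions, and no window function is applied.

```python
import cmath
import math

def frame_magnitudes(frame):
    """Amplitude of each frequency bin for one short slice of the signal."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]

def spectrogram(samples, frame_size=64, hop=32):
    """Cut the signal into overlapping frames and transform each one."""
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frames.append(frame_magnitudes(samples[start:start + frame_size]))
    return frames  # indexed as frames[time_index][frequency_bin]

SAMPLE_RATE = 1600  # Hz, hypothetical value for the demo

def tone(freq_hz, n):
    return [math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE) for t in range(n)]

# 0.2 s of a 100 Hz tone followed by 0.2 s of a 400 Hz tone.
signal = tone(100, 320) + tone(400, 320)
spec = spectrogram(signal)
bin_hz = SAMPLE_RATE / 64  # each frequency bin is 25 Hz wide

def peak_hz(column):
    """Frequency of the brightest cell in one time column."""
    return max(range(len(column)), key=lambda k: column[k]) * bin_hz

print(peak_hz(spec[0]), peak_hz(spec[-1]))
```

A single Fourier transform over the whole signal would show both tones but not when each one played. Here the first time column peaks at the low tone and the last column at the high tone, which is the time information a spectrogram keeps.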

55
00:04:28,990 --> 00:04:31,750
Now, can we recognize sounds at this point?

56
00:04:32,440 --> 00:04:38,800
No, yes, maybe. There is too much information in a normal spectrogram

57
00:04:39,040 --> 00:04:42,940
if we only want to recognize sounds that the human ear can hear.

58
00:04:43,330 --> 00:04:47,860
Studies have shown that humans don't perceive frequencies on a linear scale.

59
00:04:48,580 --> 00:04:55,390
As humans, we can better tell the difference in lower frequencies than we can tell the differences

60
00:04:55,780 --> 00:04:57,190
in higher frequencies.

61
00:04:57,610 --> 00:05:03,910
For example, we can easily tell the difference between 500 and 1000 Hz, but hardly between 10,000 and 10,500 Hz,

62
00:05:04,240 --> 00:05:06,550
even though the distance between the two pairs is the same.

63
00:05:07,670 --> 00:05:15,770
In 1937, Stevens, Volkmann, and Newman came up with a way to measure pitch so that equal distances

64
00:05:16,130 --> 00:05:18,260
in pitch sounded equally distant to the listener.

65
00:05:18,680 --> 00:05:20,600
Now this scale is called the mel scale.
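To get a feel for the mel scale, here is a small sketch that converts frequencies in Hz to mels. It assumes the commonly used formula m = 2595 * log10(1 + f / 700), which is a later approximation and not necessarily the exact curve from the 1937 work. The function name hz_to_mel is chosen for illustration.

```python
import math

def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels using a common approximation."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

# Two pairs of frequencies that are both 500 Hz apart.
low_gap = hz_to_mel(1000) - hz_to_mel(500)
high_gap = hz_to_mel(10500) - hz_to_mel(10000)

print(round(low_gap, 1), round(high_gap, 1))
```

The low pair ends up several times further apart in mels than the high pair, which matches the idea that we hear differences between low frequencies more easily.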

66
00:05:22,070 --> 00:05:29,480
A mel spectrogram is a spectrogram where the frequencies are converted to the mel scale.

67
00:05:29,960 --> 00:05:32,480
There are more steps involved in recognizing speech,

68
00:05:32,780 --> 00:05:40,580
for example the cepstral coefficients that we mentioned above, and we will be discussing them further as

69
00:05:40,580 --> 00:05:41,780
we go on with our lesson.

70
00:05:42,260 --> 00:05:45,830
Now it is time to finally start with the practical implementation.

71
00:05:46,130 --> 00:05:51,800
To start our preparation, install the Anaconda environment manager if you didn't install it in the first

72
00:05:51,800 --> 00:06:00,320
lesson. See Lesson 1, Introduction to TinyML with Wio Terminal, for information

73
00:06:00,320 --> 00:06:03,980
on how to install Anaconda and create a virtual environment.

74
00:06:04,730 --> 00:06:08,560
Then, in the virtual environment, install librosa with pip.

75
00:06:08,630 --> 00:06:16,940
Run pip install librosa and conda install -c conda-forge ffmpeg.

76
00:06:18,440 --> 00:06:23,330
Take note that the audio signal needs to be sampled at a very high sampling rate.

77
00:06:23,690 --> 00:06:28,550
It will be best if it is 8000 Hz or, ideally, 16,000 Hz.

78
00:06:29,030 --> 00:06:32,690
The Edge Impulse data forwarder tool is too slow to handle this sampling rate.

79
00:06:33,020 --> 00:06:38,330
So we will need to use dedicated data collection firmware to get the data for this project.

80
00:06:38,870 --> 00:06:42,230
So make sure to download a new version of the Wio Terminal Edge Impulse

81
00:06:42,230 --> 00:06:48,920
firmware with microphone support and flash it to your device as described in Lesson 4.

82
00:06:49,490 --> 00:06:56,300
After that, create a new project on the Edge Impulse platform and launch the Edge Impulse ingestion service.

83
00:06:57,550 --> 00:07:04,030
For the next step, if you have used edge-impulse-daemon before, you will need to add --clean to the command above

84
00:07:04,030 --> 00:07:08,980
to clean the project data. Then log in with your credentials and choose the project

85
00:07:09,310 --> 00:07:14,770
you have just created. Go to the Data acquisition tab and you can start getting data samples.

86
00:07:15,340 --> 00:07:17,260
We will have three classes of data.

87
00:07:17,620 --> 00:07:26,590
We have background, coughing, and crying. Now record 10 samples for each class, 5000 milliseconds duration each.

88
00:07:27,250 --> 00:07:31,660
You can record the sounds played from the computer speakers, except for the background class.

89
00:07:32,290 --> 00:07:37,810
But if you have the opportunity to record real sounds, that would be even better. For the background class,

90
00:07:38,110 --> 00:07:41,320
record sounds that should not be classified as coughing or crying.

91
00:07:41,380 --> 00:07:43,180
For example, people talking.

92
00:07:43,630 --> 00:07:44,290
No sounds.

93
00:07:44,560 --> 00:07:45,970
Air conditioning or a fan.

94
00:07:46,330 --> 00:07:46,810
And so on.

95
00:07:46,810 --> 00:07:47,380
And so forth.

96
00:07:49,510 --> 00:07:53,830
Thirty samples is way too few, so we're also going to upload more data.

97
00:07:54,130 --> 00:08:00,370
You can download sounds from the internet, resample them to 16,000 Hz, and save them in the

98
00:08:00,370 --> 00:08:04,180
right format, the WAV format.

99
00:08:04,510 --> 00:08:08,890
You can use this converter script to change the format.

100
00:08:09,820 --> 00:08:18,130
Now, copy the code and paste it into a text document using Notepad++, the IDLE IDE, or another suitable

101
00:08:18,130 --> 00:08:18,400
IDE.

102
00:08:19,120 --> 00:08:21,010
Do not use the default Windows Notepad.

103
00:08:21,760 --> 00:08:28,210
Now save the document as converter.py, and then from the Anaconda environment run python converter.py

104
00:08:28,680 --> 00:08:36,490
followed by the name of the downloaded file and class_name.number.wav, the output WAV file.
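The course's converter script itself is not reproduced in this transcript. As a stand-in, here is a minimal sketch of the same idea that uses only the Python standard library wave module and simple linear interpolation instead of a library such as librosa. It assumes the input is already a 16-bit PCM WAV file, and the function name resample_wav is made up for illustration.

```python
import struct
import wave

TARGET_RATE = 16000  # Hz, the sampling rate used for this project

def resample_wav(src_path, dst_path, target_rate=TARGET_RATE):
    """Read a 16-bit PCM WAV file and write a mono copy at target_rate."""
    with wave.open(src_path, "rb") as src:
        if src.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM input")
        channels = src.getnchannels()
        src_rate = src.getframerate()
        raw = src.readframes(src.getnframes())

    values = struct.unpack("<%dh" % (len(raw) // 2), raw)
    mono = values[::channels]  # keep only the first channel

    out_len = int(len(mono) * target_rate / src_rate)
    out = []
    for i in range(out_len):
        pos = i * src_rate / target_rate      # position in the source signal
        left = int(pos)
        right = min(left + 1, len(mono) - 1)
        frac = pos - left
        # Linear interpolation between the two nearest source samples.
        out.append(int(round(mono[left] * (1 - frac) + mono[right] * frac)))

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(target_rate)
        dst.writeframes(struct.pack("<%dh" % len(out), *out))
    return out_len
```

For example, resample_wav("downloaded.wav", "coughing.1.wav") would write a 16,000 Hz mono file, where both file names are placeholders. Linear interpolation does not filter the signal before lowering the sampling rate, so a dedicated resampler is the better choice for real training data.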

105
00:08:37,590 --> 00:08:42,540
You can also find example sound files already converted to the right format in the

106
00:08:42,540 --> 00:08:48,510
materials for this course. Then split all the sound samples to leave only the interesting pieces.

107
00:08:48,900 --> 00:08:51,660
Now do that for every class, except for background.

108
00:08:52,530 --> 00:08:58,230
After the data collection is done, it is time to choose processing blocks and define our neural network

109
00:08:58,230 --> 00:08:58,590
model.

110
00:08:59,430 --> 00:09:05,700
Lastly, make sure to record in different environments such as outside, on the street, inside of the

111
00:09:05,700 --> 00:09:08,100
classroom or anywhere you want.