1
00:00:00,060 --> 00:00:04,770
So now, let's take a look at how we use multiple GPUs in PyTorch Lightning.

2
00:00:05,340 --> 00:00:11,730
Now, firstly, we can't actually run this on Google Colab because Colab only gives us one GPU, so

3
00:00:11,730 --> 00:00:14,760
we can't actually test this out in practice, unfortunately.

4
00:00:15,330 --> 00:00:20,010
However, you can follow this notebook for guidelines if you have two GPUs installed in your system.

5
00:00:20,310 --> 00:00:25,380
You can definitely utilize both of them in training and roughly double your training speed.

6
00:00:26,070 --> 00:00:28,090
So let's take a look and see how we do that.

7
00:00:28,110 --> 00:00:31,830
So firstly, you're going to have to do a bit of code cleanup.

8
00:00:31,830 --> 00:00:34,980
So let's take a look at the good habits that you need to establish.

9
00:00:35,070 --> 00:00:38,730
Firstly, delete all your .cuda() or .to() calls.

10
00:00:39,210 --> 00:00:42,140
So those are the calls that send data.

11
00:00:42,790 --> 00:00:47,070
They actually send tensors to the GPU or to the CPU.

12
00:00:47,430 --> 00:00:50,730
We need to delete these lines from the code.

13
00:00:50,850 --> 00:00:53,670
Next, what we have to do is synchronize validation and test logging.

14
00:00:54,090 --> 00:00:56,520
So when you're logging, remember,

15
00:00:56,690 --> 00:01:00,090
in the validation step and test step, as well as in the training step,

16
00:01:00,540 --> 00:01:02,190
you have this self.log call.

17
00:01:02,370 --> 00:01:07,710
And then what we need to set here is sync_dist equal to True.

18
00:01:08,580 --> 00:01:09,330
What does this do?

19
00:01:09,330 --> 00:01:11,310
This sets distributed syncing to true, so the logged value is reduced across devices.

20
00:01:11,760 --> 00:01:18,540
So now we can actually have the two GPUs synchronized properly when logging, otherwise

21
00:01:18,810 --> 00:01:20,010
it could get messy.

22
00:01:20,040 --> 00:01:24,210
So these are the things that PyTorch Lightning is taking care of for us.

23
00:01:24,540 --> 00:01:25,560
So it's quite convenient.

24
00:01:26,340 --> 00:01:32,280
So you can take a look at the documentation if you want to see more. Here is

25
00:01:32,280 --> 00:01:33,850
a typo right there.

26
00:01:33,870 --> 00:01:35,430
Let me just fix that.

27
00:01:35,670 --> 00:01:36,900
It's actually two typos.

28
00:01:36,900 --> 00:01:40,800
This happens in Colab because there's no spell check.

29
00:01:41,550 --> 00:01:46,320
Anyway, this here is the PyTorch model that we've been using previously.

30
00:01:46,830 --> 00:01:53,190
However, this is the one that supports multiple GPUs, and you can see we have this here.

31
00:01:53,550 --> 00:01:55,860
Actually, we don't have it in the training step.

32
00:01:55,860 --> 00:01:57,900
It's only in the validation steps, actually.

33
00:01:58,380 --> 00:02:00,790
So you have sync_dist here.

34
00:02:00,810 --> 00:02:03,420
It should be equal to True, and the same in the test step.

35
00:02:04,170 --> 00:02:05,670
And that's it.

36
00:02:06,090 --> 00:02:07,200
This is ready to go.

37
00:02:07,210 --> 00:02:10,740
All those .cuda() and .to() calls have been deleted from that code.

38
00:02:11,460 --> 00:02:12,750
So you're ready to go.

39
00:02:12,990 --> 00:02:14,250
This code can be used now.

40
00:02:14,940 --> 00:02:16,350
So let's run this.

41
00:02:17,160 --> 00:02:18,990
Keep track of it now.

42
00:02:19,410 --> 00:02:19,770
There isn't.

43
00:02:19,770 --> 00:02:23,730
Everything is commented out here because it's not really going to run in Colab.

44
00:02:24,180 --> 00:02:29,370
But this is an example of how we can use different GPUs, so you can list the GPUs, get a range of them

45
00:02:29,370 --> 00:02:37,950
here available, or you can specify GPUs 0 and 1, you can do it this way as well, or you can set this

46
00:02:37,950 --> 00:02:39,840
one to -1 to use all GPUs.

47
00:02:43,110 --> 00:02:49,820
So this is the way you can configure using multiple GPUs or different combinations of GPUs in PyTorch

48
00:02:49,830 --> 00:02:50,430
Lightning.

49
00:02:51,000 --> 00:02:54,630
So next, we'll take a look at how we actually run that, and it's actually quite simple.

50
00:02:54,630 --> 00:02:59,670
Whatever configuration you want to use, like this configuration, which is what I've commonly

51
00:02:59,670 --> 00:03:00,510
seen used.

52
00:03:00,960 --> 00:03:05,910
You can just place that argument right here on the gpus line.

53
00:03:05,910 --> 00:03:08,490
Replace it here and you can just run this.

54
00:03:08,490 --> 00:03:13,680
So this won't work because we don't have multiple GPUs in Colab.

55
00:03:13,800 --> 00:03:19,470
However, this is the way to do it, and it should work on your multi-GPU system.
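To make those options concrete, here is a small sketch of the gpus values just described. The dictionary and its key names are hypothetical, and so are the max_epochs value and the model and loader names in the comments. The Trainer lines stay commented out, as in the notebook, because they need a multi-GPU machine. Note that newer Lightning releases replaced the gpus argument with accelerator="gpu" together with devices, so which form to use depends on the installed version.

```python
# Keyword arguments for pl.Trainer, one entry per option mentioned in the lesson.
# The dictionary and its key names are hypothetical; only the `gpus` values matter.
gpu_options = {
    "by_count": {"gpus": 2},              # use two GPUs
    "by_index_list": {"gpus": [0, 1]},    # pick GPUs 0 and 1 explicitly
    "by_index_string": {"gpus": "0, 1"},  # same selection, written as a string
    "all_available": {"gpus": -1},        # -1 means every available GPU
}

chosen = gpu_options["all_available"]

# Commented out because it needs a multi-GPU system (names below are assumed):
# import pytorch_lightning as pl
# trainer = pl.Trainer(max_epochs=10, **chosen)
# trainer.fit(model, train_loader, val_loader)
```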

56
00:03:20,310 --> 00:03:22,160
So next, we'll take a look at the profiler.

57
00:03:22,200 --> 00:03:25,560
So what is a profiler? A profiler is a way to,

58
00:03:25,560 --> 00:03:25,920
you can,

59
00:03:26,130 --> 00:03:28,410
you can use this as a tool in software engineering.

60
00:03:28,740 --> 00:03:32,250
You can profile each line of your code and see how long it takes.

61
00:03:32,370 --> 00:03:37,670
This is a very good way to find bottlenecks in your training process, and there are lots of bottlenecks

62
00:03:37,680 --> 00:03:42,450
in deep learning training. As you can tell, there are a lot of issues with data loading and data transformations.

63
00:03:43,080 --> 00:03:48,220
There's a lot of other stuff, like just passing data around from CPU to GPU, that type of stuff.

64
00:03:48,240 --> 00:03:54,890
So by running this here, what it does is run it for one epoch, which we specify here, and we can

65
00:03:54,900 --> 00:03:59,040
set the profiler to "simple", which still gives us a lot of information.

66
00:03:59,520 --> 00:04:01,080
You can see this here.

67
00:04:01,080 --> 00:04:03,420
You can see these are all the functions that are running here.

68
00:04:03,990 --> 00:04:07,080
This is the time it takes to run, the number of times

69
00:04:07,080 --> 00:04:13,260
it executes, the total time it took, and the percentage of time that was attributed to that.

70
00:04:13,740 --> 00:04:15,810
So you can see what functions are quite fast.

71
00:04:15,810 --> 00:04:18,120
Like these here, and this one is quite slow.

72
00:04:18,120 --> 00:04:18,990
It's understandable.

73
00:04:18,990 --> 00:04:24,390
It's running the training epoch, and that takes roughly just over a minute and several seconds.

74
00:04:24,960 --> 00:04:31,440
But you can see things like get training batch, which has some delay as well, as well as these things

75
00:04:31,440 --> 00:04:31,730
here.

76
00:04:31,740 --> 00:04:34,310
So you can definitely see what's going on.

77
00:04:34,320 --> 00:04:35,220
And that's quite cool.

78
00:04:35,340 --> 00:04:40,920
It's quite a good way to assess the performance of your modules, and you can see how many calls as

79
00:04:40,920 --> 00:04:43,890
well here, and the total time attributed to it.

80
00:04:44,160 --> 00:04:45,300
So that's pretty cool.

81
00:04:46,350 --> 00:04:48,990
Now let's take a look at training on TPUs.

82
00:04:49,200 --> 00:04:50,880
So what exactly is a TPU?

83
00:04:51,300 --> 00:04:55,350
Well, the TPU is a hardware accelerator that was developed by Google.

84
00:04:55,620 --> 00:04:59,820
It's short for Tensor Processing Unit, and it was

85
00:04:59,890 --> 00:05:03,060
designed specifically for deep learning applications.

86
00:05:03,530 --> 00:05:09,100
In fact, you can take a look at this here: a TPU has eight cores, and each core is optimized

87
00:05:09,100 --> 00:05:12,760
for 128 by 128 matrix multiplies.

88
00:05:13,210 --> 00:05:19,480
So that means a single TPU is about as fast as five V100 GPUs, so that's pretty cool.

89
00:05:19,540 --> 00:05:22,570
I'm not sure which TPU you get in Colab right now.

90
00:05:22,990 --> 00:05:28,900
It changes from time to time, but nevertheless, hopefully we can use the TPU right now.

91
00:05:29,290 --> 00:05:34,170
However, there's an issue right now. When I created this notebook, it was working.

92
00:05:34,180 --> 00:05:40,300
This is following the instructions from the official PyTorch Lightning library to use

93
00:05:40,300 --> 00:05:40,800
it on Colab.

94
00:05:40,810 --> 00:05:44,080
However, it's a bit broken. This installs correctly.

95
00:05:44,590 --> 00:05:51,740
However, it's not loading something properly here with the TPU support in PyTorch, because of a

96
00:05:51,760 --> 00:05:55,630
Cython or CPython issue, perhaps. I'm not entirely sure.

97
00:05:56,500 --> 00:05:59,520
So this code is broken right now,

98
00:05:59,530 --> 00:06:02,680
unfortunately. However, hopefully Google does fix it,

99
00:06:02,830 --> 00:06:09,570
maybe, and we can actually start training our neural networks using TPUs here again with PyTorch

100
00:06:09,580 --> 00:06:10,120
Lightning.

101
00:06:10,120 --> 00:06:14,560
And it actually trains this network much faster than the GPU, which is pretty cool to observe.

102
00:06:15,190 --> 00:06:16,690
So we'll stop there for now.

103
00:06:16,690 --> 00:06:20,170
And next, we'll take a look at something that's very important.

104
00:06:20,170 --> 00:06:22,330
It's called transfer learning and fine-tuning.

105
00:06:22,930 --> 00:06:26,800
That's basically almost everything that deep learning practitioners like myself do.

106
00:06:27,250 --> 00:06:28,840
So we will get started with that.

107
00:06:29,410 --> 00:06:30,100
Stay tuned.
