WEBVTT

1
00:00:00.660 --> 00:00:01.800
<v Maximilian>So let's start simple.</v>

2
00:00:01.800 --> 00:00:06.180
Let's try to find out which kind of hardware we need

3
00:00:06.180 --> 00:00:09.540
in order to run a large language model

4
00:00:09.540 --> 00:00:11.850
on our local machine, on our system,

5
00:00:11.850 --> 00:00:13.980
or on a server you may have rented.

6
00:00:13.980 --> 00:00:15.750
Doesn't really matter here.

7
00:00:15.750 --> 00:00:19.410
So we wanna use a large language model for inference,

8
00:00:19.410 --> 00:00:20.370
as it's also called.

9
00:00:20.370 --> 00:00:22.980
And inference is simply the process

10
00:00:22.980 --> 00:00:25.530
of using a trained large language model,

11
00:00:25.530 --> 00:00:27.930
no matter if it's open or proprietary,

12
00:00:27.930 --> 00:00:29.220
to generate output.

13
00:00:29.220 --> 00:00:31.080
So we're done with the training phase,

14
00:00:31.080 --> 00:00:32.250
we've got our model,

15
00:00:32.250 --> 00:00:35.010
and now we wanna use it to generate output.

16
00:00:35.010 --> 00:00:37.350
And that's of course exactly what's going on

17
00:00:37.350 --> 00:00:38.640
behind the scenes

18
00:00:38.640 --> 00:00:41.190
when you send a message to ChatGPT

19
00:00:41.190 --> 00:00:43.920
or one of those other AI chatbots.

20
00:00:43.920 --> 00:00:48.060
That message is processed by the ChatGPT application

21
00:00:48.060 --> 00:00:50.970
and then sent to the underlying AI model

22
00:00:50.970 --> 00:00:53.430
to generate new output

23
00:00:53.430 --> 00:00:55.590
based on that message.

24
00:00:55.590 --> 00:00:56.700
That's inference,

25
00:00:56.700 --> 00:00:59.730
it's the trained model generating output

26
00:00:59.730 --> 00:01:03.120
based on some input, based on your prompt.

27
00:01:03.120 --> 00:01:05.820
So in order to perform this inference,

28
00:01:05.820 --> 00:01:09.450
this trained model must be executed somewhere,

29
00:01:09.450 --> 00:01:12.960
must be hosted somewhere, you could say.

30
00:01:12.960 --> 00:01:15.150
So in case of ChatGPT,

31
00:01:15.150 --> 00:01:17.400
that would of course be the servers

32
00:01:17.400 --> 00:01:20.340
owned or rented by OpenAI.

33
00:01:20.340 --> 00:01:22.950
When talking about open models,

34
00:01:22.950 --> 00:01:25.980
like the ones you find on Hugging Face,

35
00:01:25.980 --> 00:01:29.490
you could either use a paid hosting provider,

36
00:01:29.490 --> 00:01:31.710
or, and that's the focus of this course,

37
00:01:31.710 --> 00:01:33.900
host them locally on your system

38
00:01:33.900 --> 00:01:36.210
or on servers owned by you.

39
00:01:36.210 --> 00:01:38.430
And therefore, this hosting system

40
00:01:38.430 --> 00:01:42.660
of course needs to meet certain hardware requirements.

41
00:01:42.660 --> 00:01:44.640
For example, it would be great

42
00:01:44.640 --> 00:01:47.850
if the underlying machine that runs the model

43
00:01:47.850 --> 00:01:49.680
has a good GPU,

44
00:01:49.680 --> 00:01:51.660
but you don't need that.

45
00:01:51.660 --> 00:01:54.180
You can also follow along with this course

46
00:01:54.180 --> 00:01:57.300
and run large language models locally on your laptop,

47
00:01:57.300 --> 00:01:58.590
your system,

48
00:01:58.590 --> 00:02:00.360
even if you don't have a GPU at all,

49
00:02:00.360 --> 00:02:01.800
let alone a good one.

50
00:02:01.800 --> 00:02:03.870
You can also run them on the CPU,

51
00:02:03.870 --> 00:02:05.520
it will just be slower.

52
00:02:05.520 --> 00:02:06.870
Now, why is that the case?

53
00:02:06.870 --> 00:02:08.370
Because these models,

54
00:02:08.370 --> 00:02:11.100
when they receive some input,

55
00:02:11.100 --> 00:02:15.180
need to perform many calculations in parallel

56
00:02:15.180 --> 00:02:17.430
in order to produce that output.

57
00:02:17.430 --> 00:02:19.320
And it turns out that GPUs

58
00:02:19.320 --> 00:02:24.210
are amazing at performing many calculations in parallel.

59
00:02:24.210 --> 00:02:26.340
A CPU, on the other hand,

60
00:02:26.340 --> 00:02:28.380
has far fewer cores than a GPU,

61
00:02:28.380 --> 00:02:32.280
so it can run far fewer calculations at the same time.

62
00:02:32.280 --> 00:02:34.350
That's why GPUs are preferred

63
00:02:34.350 --> 00:02:36.420
to handle all these computations

64
00:02:36.420 --> 00:02:38.880
that need to be performed in parallel

65
00:02:38.880 --> 00:02:41.160
when performing inference.

66
00:02:41.160 --> 00:02:42.720
So in an ideal world,

67
00:02:42.720 --> 00:02:44.460
we have a great GPU,

68
00:02:44.460 --> 00:02:47.100
but a CPU will also do.

69
00:02:47.100 --> 00:02:50.820
In addition to a GPU or CPU,

70
00:02:50.820 --> 00:02:53.880
preferably a good GPU, as mentioned,

71
00:02:53.880 --> 00:02:58.800
we also need enough RAM, or preferably video RAM.

72
00:02:58.800 --> 00:03:00.840
Now, video RAM is the memory

73
00:03:00.840 --> 00:03:03.510
that's directly attached to the GPU.

74
00:03:03.510 --> 00:03:04.740
So if we have a GPU,

75
00:03:04.740 --> 00:03:06.480
we typically also have VRAM,

76
00:03:06.480 --> 00:03:07.710
and that's preferable

77
00:03:07.710 --> 00:03:11.100
because the GPU has more direct and quicker access

78
00:03:11.100 --> 00:03:12.330
to the VRAM

79
00:03:12.330 --> 00:03:15.090
than to the regular system memory.

80
00:03:15.090 --> 00:03:16.950
But again, if you don't have a GPU,

81
00:03:16.950 --> 00:03:20.160
regular RAM, your regular system memory,

82
00:03:20.160 --> 00:03:21.810
will also do.

83
00:03:21.810 --> 00:03:22.770
Why do we need that?

84
00:03:22.770 --> 00:03:24.480
Well, because the model,

85
00:03:24.480 --> 00:03:26.640
all its parameters essentially,

86
00:03:26.640 --> 00:03:28.950
and also the context window,

87
00:03:28.950 --> 00:03:31.320
which essentially is the chat history,

88
00:03:31.320 --> 00:03:34.290
your input messages and the output messages,

89
00:03:34.290 --> 00:03:38.010
all that must be loaded into memory during inference,

90
00:03:38.010 --> 00:03:40.650
and it must stay in memory.

91
00:03:40.650 --> 00:03:42.960
So you must have enough memory,

92
00:03:42.960 --> 00:03:44.520
preferably VRAM,

93
00:03:44.520 --> 00:03:45.960
to load the entire model

94
00:03:45.960 --> 00:03:47.190
with all its parameters

95
00:03:47.190 --> 00:03:50.640
and the entire context window into memory.
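
NOTE
The memory point above can be sketched as a rough estimate in Python.
This is only an illustration, not something shown in the lecture: the
function names, the 2 bytes per parameter (16-bit weights) and the 1 GB
allowance for the context window are all assumptions chosen for the example.
```python
# Rough sketch: does a model fit into the available (V)RAM?
# Assumption (not from the lecture): each parameter takes 2 bytes (16-bit).
def weights_memory_gb(num_params: int, bytes_per_param: float = 2.0) -> float:
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9
#
# Assumption: the context window is covered by a flat, hypothetical allowance.
def fits_in_memory(num_params: int, available_gb: float, context_gb: float = 1.0) -> bool:
    """True if weights plus the context allowance fit into available memory."""
    return weights_memory_gb(num_params) + context_gb <= available_gb
```
Under these assumptions, a hypothetical 7-billion-parameter model needs
about 14 GB for its weights alone, before the context window is added.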

96
00:03:50.640 --> 00:03:52.530
Now, if you don't have a lot of memory,

97
00:03:52.530 --> 00:03:54.570
that's not necessarily a problem,

98
00:03:54.570 --> 00:03:58.800
you just won't be able to load the bigger models,

99
00:03:58.800 --> 00:04:01.980
but you will be able to load smaller models.

100
00:04:01.980 --> 00:04:04.260
And we'll soon also learn about a technique

101
00:04:04.260 --> 00:04:07.020
that reduces the hardware requirements

102
00:04:07.020 --> 00:04:08.520
of all those models

103
00:04:08.520 --> 00:04:10.560
so that even the bigger models

104
00:04:10.560 --> 00:04:15.060
can be run and used on normal laptops,

105
00:04:15.060 --> 00:04:17.040
especially high-end laptops.

106
00:04:17.040 --> 00:04:19.293
You don't need a supercomputer.

