WEBVTT

1
00:00:00.240 --> 00:00:02.580
<v ->So in this video, we're going to dive deep</v>

2
00:00:02.580 --> 00:00:05.910
into the context flow when using subagents

3
00:00:05.910 --> 00:00:10.260
and we're going to understand why subagents are so powerful

4
00:00:10.260 --> 00:00:11.880
and so useful.

5
00:00:11.880 --> 00:00:14.070
We have here the main agent thread,

6
00:00:14.070 --> 00:00:16.410
and this is our main conversation.

7
00:00:16.410 --> 00:00:19.050
And every message that we're going to be sending here

8
00:00:19.050 --> 00:00:22.290
is going to increase our token count.

9
00:00:22.290 --> 00:00:25.440
Now, when the main agent decides

10
00:00:25.440 --> 00:00:27.720
that it's going to use a subagent,

11
00:00:27.720 --> 00:00:30.570
it's going to create a brand new prompt,

12
00:00:30.570 --> 00:00:33.890
and this prompt is going to be passed into the subagent.

13
00:00:33.890 --> 00:00:38.280
And this is the only context that the subagent sees

14
00:00:38.280 --> 00:00:40.050
when it starts to execute.

15
00:00:40.050 --> 00:00:42.510
So it's not aware of the entire conversation

16
00:00:42.510 --> 00:00:45.540
of what happened so far, it's just aware of the prompt

17
00:00:45.540 --> 00:00:48.480
that the main agent has generated for it.

18
00:00:48.480 --> 00:00:50.250
And I'm going to show you a way

19
00:00:50.250 --> 00:00:52.980
where we can manipulate this prompt

20
00:00:52.980 --> 00:00:56.310
where the main agent is going to prompt the subagent

21
00:00:56.310 --> 00:00:58.290
such that the subagent

22
00:00:58.290 --> 00:01:00.720
is going to have a much easier time working

23
00:01:00.720 --> 00:01:02.880
and we're going to get better results.

24
00:01:02.880 --> 00:01:06.270
So we can totally influence this prompt over here.

25
00:01:06.270 --> 00:01:09.090
So when we start to execute the subagent,

26
00:01:09.090 --> 00:01:11.130
it's only as good as this prompt

27
00:01:11.130 --> 00:01:12.960
which is going to be passed into it

28
00:01:12.960 --> 00:01:15.870
because that's the only context that it's going to receive.

29
00:01:15.870 --> 00:01:17.850
And it's going to work independently.

30
00:01:17.850 --> 00:01:20.160
It's going to invoke the tools that it needs.

31
00:01:20.160 --> 00:01:22.560
It's going to maybe do some integrations

32
00:01:22.560 --> 00:01:23.700
and at the end,

33
00:01:23.700 --> 00:01:27.390
it's going to return one condensed response

34
00:01:27.390 --> 00:01:29.430
to that main agent.

35
00:01:29.430 --> 00:01:32.190
And every time we're going to spawn a new subagent,

36
00:01:32.190 --> 00:01:35.400
then we're going to start with a fresh context

37
00:01:35.400 --> 00:01:37.500
with only this prompt being sent.

38
00:01:37.500 --> 00:01:39.180
Now you can see the advantage here

39
00:01:39.180 --> 00:01:43.590
is that we can run the main thread and we can keep it lean.

40
00:01:43.590 --> 00:01:45.660
And if we're going to use subagents,

41
00:01:45.660 --> 00:01:49.080
we're going to delegate a lot of context to those subagents,

42
00:01:49.080 --> 00:01:51.570
which are going to work in isolation,

43
00:01:51.570 --> 00:01:54.750
and we're only going to get that response, the artifact,

44
00:01:54.750 --> 00:01:57.810
which is going to be passed back into this main agent.

45
00:01:57.810 --> 00:02:00.690
So in this way, the main conversation

46
00:02:00.690 --> 00:02:02.580
is going to have a much easier time

47
00:02:02.580 --> 00:02:05.220
maintaining a lean context

48
00:02:05.220 --> 00:02:08.640
and we won't need to use the /compact

49
00:02:08.640 --> 00:02:10.560
or /clear command

50
00:02:10.560 --> 00:02:12.420
because we know that the more context we have

51
00:02:12.420 --> 00:02:15.330
in our main agent, the more performance degrades.

52
00:02:15.330 --> 00:02:18.180
So this is a very useful tool to have.

53
00:02:18.180 --> 00:02:20.970
All right, so just to summarize, we have our main agent,

54
00:02:20.970 --> 00:02:24.510
which is going to give one input to the subagent.

55
00:02:24.510 --> 00:02:26.940
Then the subagent is going to do the work

56
00:02:26.940 --> 00:02:29.010
and is going to return one output

57
00:02:29.010 --> 00:02:30.930
back to the main agent.

58
00:02:30.930 --> 00:02:35.160
And this is a very smart way to compress our context.

59
00:02:35.160 --> 00:02:37.170
All righty, now let me show you this in action

60
00:02:37.170 --> 00:02:39.660
and let me show you why this is so important.

61
00:02:39.660 --> 00:02:42.600
And here, I'm going to illustrate the context window

62
00:02:42.600 --> 00:02:46.170
as our conversation with Claude Code progresses.

63
00:02:46.170 --> 00:02:50.040
Now, as you know, LLMs have token limits.

64
00:02:50.040 --> 00:02:53.460
So we are capped in the number of tokens we can send them.

65
00:02:53.460 --> 00:02:57.930
Now we can use a model with 200K or with 1 million tokens

66
00:02:57.930 --> 00:03:01.320
or in the future maybe it would be 2 million or 10 million,

67
00:03:01.320 --> 00:03:03.660
but this number is finite.

68
00:03:03.660 --> 00:03:07.380
Now obviously, we do not want to reach the token limit

69
00:03:07.380 --> 00:03:09.630
because one, if we surpass it,

70
00:03:09.630 --> 00:03:12.870
then our request to the LLM is going to fail,

71
00:03:12.870 --> 00:03:14.700
and even if the request didn't fail,

72
00:03:14.700 --> 00:03:17.010
then the answer we're going to get back,

73
00:03:17.010 --> 00:03:18.660
it's going to cost us more

74
00:03:18.660 --> 00:03:21.600
because we pay for every token,

75
00:03:21.600 --> 00:03:25.200
and it's going to be slower because we have more tokens.

76
00:03:25.200 --> 00:03:27.600
So the latency is going to increase.

77
00:03:27.600 --> 00:03:29.490
And most importantly,

78
00:03:29.490 --> 00:03:32.220
if we're going to get near this context limit,

79
00:03:32.220 --> 00:03:33.300
it's pretty safe to say

80
00:03:33.300 --> 00:03:36.450
that we're going to encounter some context pollution

81
00:03:36.450 --> 00:03:39.510
and we're not going to get the results that we want

82
00:03:39.510 --> 00:03:41.430
because we have tons of context,

83
00:03:41.430 --> 00:03:43.830
which is maybe not relevant for the task,

84
00:03:43.830 --> 00:03:45.960
and really this is not optimal.

85
00:03:45.960 --> 00:03:50.220
So we want to keep our context as lean as we can here.

86
00:03:50.220 --> 00:03:53.490
And in our interaction with Claude Code,

87
00:03:53.490 --> 00:03:56.700
every turn that we make, every message that we send,

88
00:03:56.700 --> 00:03:58.620
it's going to consume tokens

89
00:03:58.620 --> 00:04:02.790
and it's going to add tokens into our context window.

90
00:04:02.790 --> 00:04:06.450
So maybe in the first turn we added 10K tokens,

91
00:04:06.450 --> 00:04:09.240
then in the second it increased to 30K.

92
00:04:09.240 --> 00:04:11.790
And by getting to the fifth turn,

93
00:04:11.790 --> 00:04:14.460
we reached 100K tokens.

94
00:04:14.460 --> 00:04:17.880
And if we're going to be using one instance of Claude Code,

95
00:04:17.880 --> 00:04:19.471
then at some point or another,

96
00:04:19.471 --> 00:04:23.730
we will need to either compact our window

97
00:04:23.730 --> 00:04:25.440
with the /compact command

98
00:04:25.440 --> 00:04:28.620
or we will need to maybe clear everything

99
00:04:28.620 --> 00:04:31.320
or even open up a new instance of Claude Code

100
00:04:31.320 --> 00:04:32.460
and start fresh.

101
00:04:32.460 --> 00:04:35.700
And that way we're going to lose our history.

102
00:04:35.700 --> 00:04:38.190
And the only thing I want to demonstrate here

103
00:04:38.190 --> 00:04:41.430
is that we are limited in the context

104
00:04:41.430 --> 00:04:44.280
and we can't simply use one Claude Code instance

105
00:04:44.280 --> 00:04:46.170
to do everything that we need.

106
00:04:46.170 --> 00:04:49.740
The context limit is limiting our interaction

107
00:04:49.740 --> 00:04:52.830
with Claude Code, and this is my main point here,

108
00:04:52.830 --> 00:04:57.360
but subagents are a very elegant solution

109
00:04:57.360 --> 00:05:00.150
to effectively stretch this limit.

110
00:05:00.150 --> 00:05:04.170
And every time we're going to be using a subagent,

111
00:05:04.170 --> 00:05:06.378
it's going to be running with its own context window

112
00:05:06.378 --> 00:05:09.900
and every token it's going to consume,

113
00:05:09.900 --> 00:05:12.060
every token that it's going to be using,

114
00:05:12.060 --> 00:05:16.890
it's not going to be counted in our main Claude agent.

115
00:05:16.890 --> 00:05:20.580
So by the end of this subagent execution,

116
00:05:20.580 --> 00:05:24.120
it's only going to return one condensed response.

117
00:05:24.120 --> 00:05:28.080
So it may be 15K or 20K tokens with the summary

118
00:05:28.080 --> 00:05:29.580
and the code that it changed.

119
00:05:29.580 --> 00:05:30.780
But the point here

120
00:05:30.780 --> 00:05:34.590
is that we won't be accumulating this context here,

121
00:05:34.590 --> 00:05:37.170
and you can see how elegant this is

122
00:05:37.170 --> 00:05:39.630
for our context engineering efforts.

123
00:05:39.630 --> 00:05:43.050
And each side chain, each subagent, which is going to run,

124
00:05:43.050 --> 00:05:45.780
it's going to be running with its own system prompt,

125
00:05:45.780 --> 00:05:48.330
which we tailor-made for that need,

126
00:05:48.330 --> 00:05:50.730
so it would be able to solve a task

127
00:05:50.730 --> 00:05:53.880
that it should do in a better way than the main agent.

128
00:05:53.880 --> 00:05:56.130
That's the entire point of subagents here.

129
00:05:56.130 --> 00:05:59.640
So this is super, super elegant, this idea here,

130
00:05:59.640 --> 00:06:01.503
and it's very, very powerful.

