1
00:00:03,710 --> 00:00:11,660
What are document loaders, and why is it important for you to understand

2
00:00:12,580 --> 00:00:16,750
their versatility, the number of things you can do with them?

3
00:00:16,750 --> 00:00:25,600
So the important thing in this chapter is to understand that with LangChain, you can extract data

4
00:00:25,600 --> 00:00:36,550
from a lot of sources and formats. You can extract data from web pages, databases, YouTube

5
00:00:36,550 --> 00:00:43,990
videos, Twitter, Excel, pandas, Notion, Figma, Hugging Face, GitHub, whatever you want, and

6
00:00:43,990 --> 00:00:46,030
in many different formats.

7
00:00:46,030 --> 00:00:52,840
You can extract data from PDF, from HTML, from JSON files, from Word files, PowerPoint, etc.

8
00:00:52,840 --> 00:00:56,950
So that's the most important concept to get

9
00:00:56,950 --> 00:01:01,840
from this chapter. We are going to experiment with three examples.

10
00:01:02,350 --> 00:01:08,710
We are going to load, that is, extract, data from a PDF document.

11
00:01:08,710 --> 00:01:16,690
We are going to extract data from a YouTube video, and we are going to extract data from a web page.

12
00:01:16,720 --> 00:01:21,520
The first exercise is going to be very simple, very easy and very fast.

13
00:01:21,520 --> 00:01:29,590
The second one is a little bit slow, because extracting information from a YouTube video

14
00:01:29,590 --> 00:01:32,620
can take time,

15
00:01:32,620 --> 00:01:35,650
depending on the length of the video.

16
00:01:35,650 --> 00:01:42,790
And you will see that in this case you need to install a tool called FFmpeg.

17
00:01:42,790 --> 00:01:43,540
Right.

18
00:01:43,960 --> 00:01:45,760
This is very easy.

19
00:01:45,760 --> 00:01:53,950
Remember, if you have any problem installing these packages, you can go to ChatGPT and ask how to do it,

20
00:01:53,950 --> 00:01:56,860
or go to Google, or go to Stack Overflow.

21
00:01:56,860 --> 00:02:06,010
Remember all these usual places where we find information on a daily basis. So that is regarding

22
00:02:06,010 --> 00:02:08,590
the YouTube data extraction.

23
00:02:08,590 --> 00:02:17,380
And finally, we will have three exercises about loading content or extracting data

24
00:02:17,380 --> 00:02:18,700
from a web page.

25
00:02:18,700 --> 00:02:20,830
So we will see three different options.

26
00:02:20,830 --> 00:02:29,380
And in this case we will see the necessity to clean and prepare the loaded data before using it.

27
00:02:29,380 --> 00:02:33,850
So this is one of the main tasks of the data scientist:

28
00:02:34,120 --> 00:02:35,710
cleaning data.

29
00:02:35,710 --> 00:02:42,850
Because usually, when you extract data from whatever source you want, the

30
00:02:42,850 --> 00:02:48,460
data is not going to come in the exact format you want it to be.

31
00:02:48,460 --> 00:02:55,630
So after extracting the data, you will need a preparation phase, right?

32
00:02:55,630 --> 00:03:00,310
You will need to clean the data and to prepare it according to whatever you want.

33
00:03:00,580 --> 00:03:06,490
So you will see in this exercise that sometimes the data, and you will see it here especially,

34
00:03:06,490 --> 00:03:14,020
the data coming from a web page, probably needs cleaning and preparation, etc. So let's

35
00:03:14,020 --> 00:03:17,050
go to the code on the right side of the screen.

36
00:03:17,260 --> 00:03:25,360
First, as always, we connect with the .env file, where we have the credentials, in order to start

37
00:03:25,360 --> 00:03:29,650
using OpenAI, ChatGPT, etc.

38
00:03:29,650 --> 00:03:30,220
Right.

39
00:03:30,520 --> 00:03:35,170
So the first thing we do is we extract data from a PDF document.

40
00:03:35,440 --> 00:03:38,410
You will need to install pypdf.

41
00:03:38,440 --> 00:03:44,680
You can do it from the terminal or from the Jupyter notebook.

42
00:03:44,680 --> 00:03:56,050
And once you have that, you can start setting up the extraction and performing

43
00:03:56,050 --> 00:04:00,250
the data extraction, as you see, in just two steps.

44
00:04:00,250 --> 00:04:07,030
And once we have the extracted data in this variable, we can start asking questions

45
00:04:07,030 --> 00:04:12,610
of this variable, like: how many pages do you have? What is the content,

46
00:04:12,610 --> 00:04:19,779
the initial 500 characters, of the first page? Etc. As you can see, this is just a user

47
00:04:19,779 --> 00:04:27,940
guide, nothing important. Now, about how to extract data from a YouTube video.

48
00:04:27,940 --> 00:04:33,250
So we are going to extract the data from the YouTube audio.

49
00:04:33,250 --> 00:04:33,580
Right?

50
00:04:33,580 --> 00:04:39,250
And we are going to convert this audio into text. In order to do that,

51
00:04:39,250 --> 00:04:44,740
as I was saying, you are going to need this package installed on your computer.

52
00:04:44,740 --> 00:04:45,700
If you have a Mac,

53
00:04:45,700 --> 00:04:49,390
it's very easy to do with Homebrew, that is, with brew.

54
00:04:49,840 --> 00:04:57,490
But in case of doubt, you go to ChatGPT and ask how to do that. And then you load

55
00:04:57,490 --> 00:04:59,560
the necessary modules.

56
00:04:59,560 --> 00:05:04,240
You install the necessary packages to proceed.

57
00:05:04,240 --> 00:05:06,610
And then the process is very easy.

58
00:05:06,610 --> 00:05:09,220
It's not super fast, but it is very easy.

59
00:05:09,220 --> 00:05:11,380
In this case we are extracting the data

60
00:05:11,670 --> 00:05:14,160
from a not very long video.

61
00:05:14,190 --> 00:05:16,980
I think it's like a seven-minute video, something like that.

62
00:05:16,980 --> 00:05:20,790
And as you can see, we are extracting, well,

63
00:05:20,790 --> 00:05:24,360
we are extracting the audio and converting it into text.

64
00:05:24,360 --> 00:05:31,080
So it's a very interesting tool.

65
00:05:31,080 --> 00:05:31,770
Okay.

66
00:05:32,190 --> 00:05:35,310
Third extraction: websites.

67
00:05:35,910 --> 00:05:38,100
We are going to use three options.

68
00:05:38,100 --> 00:05:43,590
The first one is with the WebBaseLoader module.

69
00:05:43,590 --> 00:05:46,110
You will see the process is very easy.

70
00:05:46,110 --> 00:05:53,910
But when you look at how we get the data, you see it's not super nice.

71
00:05:53,910 --> 00:05:54,210
Right.

72
00:05:54,210 --> 00:05:59,610
So we will need to clean and prepare this data further.

73
00:05:59,940 --> 00:06:05,520
The second option is with the UnstructuredHTMLLoader module.

74
00:06:05,820 --> 00:06:11,340
So in this case our data comes in a more compact way,

75
00:06:11,340 --> 00:06:15,150
but it also requires preparation and cleaning.

76
00:06:15,720 --> 00:06:24,240
And in the third option, with Beautiful Soup, you can see that it's like a solution in the middle of the

77
00:06:24,240 --> 00:06:24,990
previous two.

78
00:06:25,020 --> 00:06:29,130
So it also requires preparation.
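
To give an idea of what that cleaning step could look like, here is a minimal sketch using only the Python standard library. It is not the code shown on screen: the function name clean_loaded_text and the sample string are made up for illustration, and it assumes the main problem is the extra blank lines, tabs and repeated spaces that web page loaders tend to leave in the text.

```python
import re


def clean_loaded_text(raw: str) -> str:
    """Tidy up whitespace in text extracted from a web page (illustrative helper)."""
    # Normalise Windows and old Mac line endings to "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Turn tabs and non-breaking spaces into ordinary spaces.
    text = text.replace("\t", " ").replace("\u00a0", " ")
    # Collapse runs of spaces inside each line and trim the ends.
    lines = [re.sub(r" {2,}", " ", line).strip() for line in text.split("\n")]
    # Drop the lines that are now empty.
    lines = [line for line in lines if line]
    return "\n".join(lines)


# Made-up sample imitating the messy output of a web page loader.
sample = "\n\n\n  Home \t About \n\n\n\n   Welcome to   the site\u00a0\n\n"
cleaned = clean_loaded_text(sample)
print(cleaned)
```

With a loaded LangChain document, the string passed in would be its page_content. Real pages usually need more than this, for example removing menus and footers, so take it as a starting point only.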

79
00:06:29,130 --> 00:06:38,550
So the main point of this chapter is to understand that with LangChain, we have the possibility

80
00:06:38,550 --> 00:06:43,740
to extract data from many different sources and many different formats.

81
00:06:43,740 --> 00:06:45,720
Okay, so this is the important thing.

82
00:06:45,720 --> 00:06:51,390
The second important thing is to remember that in the LangChain documentation, you have information

83
00:06:51,390 --> 00:06:56,310
about all the data loaders and about all the operations you can perform with them.

84
00:06:56,310 --> 00:07:04,320
That's the place to go whenever you want to look for a data loader that responds to

85
00:07:04,320 --> 00:07:10,410
the particular need you may have for an LLM application you are building.

