1
00:00:01,370 --> 00:00:06,920
‫So Pandas is a software library written for the Python programming language for data manipulation

2
00:00:06,920 --> 00:00:07,760
‫and analysis.

3
00:00:08,000 --> 00:00:11,300
‫It is designed specifically for working with data and analyzing it.

4
00:00:14,240 --> 00:00:16,550
‫First, we need to import pandas.

5
00:00:19,530 --> 00:00:20,730
‫So we'll write import.

6
00:00:22,350 --> 00:00:24,180
‫pandas as pd.

7
00:00:27,070 --> 00:00:32,240
‫If you are using Anaconda, it has already installed pandas on your system,

8
00:00:32,250 --> 00:00:37,780
‫so you don't need to install it separately. You just have to import it into your workspace.

9
00:00:39,700 --> 00:00:40,450
‫For pandas,

10
00:00:40,480 --> 00:00:42,820
‫we will be using our customer data

11
00:00:43,030 --> 00:00:43,720
‫CSV file.

12
00:00:44,170 --> 00:00:47,680
‫You can find this file in the resources section of this video.

13
00:00:48,520 --> 00:00:53,120
‫So go ahead, download this file, and put it in your folder.

14
00:00:55,060 --> 00:01:02,100
‫We will start by importing the customer.csv file. So we will write data1.

15
00:01:02,410 --> 00:01:03,480
‫This is our variable.

16
00:01:06,100 --> 00:01:08,580
‫Then we'll write the pandas function to import

17
00:01:08,670 --> 00:01:09,200
‫a CSV file.

18
00:01:09,320 --> 00:01:13,130
‫That is pd.read_csv.

19
00:01:15,560 --> 00:01:17,970
‫Then we provide the location of our CSV file.

20
00:01:20,850 --> 00:01:21,790
‫Remember to change

21
00:01:22,020 --> 00:01:24,270
‫the backslashes in the path into forward slashes.

22
00:01:30,680 --> 00:01:34,090
‫Then the file name customer.csv

23
00:01:37,770 --> 00:01:40,090
‫And then header = 0,

24
00:01:40,150 --> 00:01:42,160
‫since our first row contains the header.

25
00:01:44,890 --> 00:01:49,140
‫If we run this, we'll get our table in the variable data1.

26
00:01:50,650 --> 00:01:58,640
‫If we write data1.head(), we'll get the first five rows of our data.
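
The import and head steps described so far can be sketched roughly as follows. This is only an illustration: the real customer.csv is not reproduced here, so the sketch reads a small made-up sample from a string instead of a file path, and every row except the customer ID CG-12520 is invented.

```python
import io
import pandas as pd

# Hypothetical stand-in for customer.csv; the rows below are invented for illustration.
csv_text = """Customer ID,Customer Name,Segment,Age,Postal Code,Region
CG-12520,Customer A,Consumer,67,42420,South
XX-00002,Customer B,Corporate,31,90036,West
XX-00003,Customer C,Consumer,65,33311,South
XX-00004,Customer D,Consumer,20,90032,West
XX-00005,Customer E,Corporate,50,28027,South
XX-00006,Customer F,Consumer,66,19140,East
"""

# header=0 tells pandas that the first row holds the column names.
# With a real file you would pass its path (using forward slashes) instead of StringIO.
data1 = pd.read_csv(io.StringIO(csv_text), header=0)

# head() returns the first five rows by default; head(10) would return up to ten.
first_rows = data1.head()
print(first_rows)
```

With the real file, only the first argument of read_csv would change; the rest of the call stays the same.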

27
00:02:04,500 --> 00:02:11,820
‫You can see we have customer id, customer name, segment, age, country, city, state, postal code and

28
00:02:11,820 --> 00:02:13,450
‫the region as our columns.

29
00:02:14,550 --> 00:02:17,340
‫Then we have the details of individual customers as rows.

30
00:02:18,440 --> 00:02:20,330
‫First is the customer

31
00:02:20,370 --> 00:02:20,610
‫ID.

32
00:02:20,640 --> 00:02:22,860
‫This is a unique id for each customer.

33
00:02:23,610 --> 00:02:25,380
‫The second column is the customer name.

34
00:02:25,780 --> 00:02:27,710
‫Here we have the customer's full name.

35
00:02:28,350 --> 00:02:31,610
‫Then there is the segment, whether the customer belongs

36
00:02:32,330 --> 00:02:34,710
‫to the consumer segment or the corporate segment.

37
00:02:35,430 --> 00:02:38,950
‫Then we have a column for age, the age of the customer.

38
00:02:39,090 --> 00:02:43,380
‫Then the country, city, state, postal code, and region of that customer.

39
00:02:44,220 --> 00:02:48,150
‫If you want to grab more rows, you can provide the number

40
00:02:48,270 --> 00:02:50,850
‫inside these parentheses. By default it is five.

41
00:02:50,940 --> 00:02:55,290
‫If you write 10, you will get 10 rows in the output.

42
00:02:58,070 --> 00:03:01,760
‫Now, here you can see the numbers zero, one, two, three, four.

43
00:03:02,510 --> 00:03:05,870
‫These are the index values of this table.

44
00:03:06,900 --> 00:03:10,390
‫For example, the 0th row is this row.

45
00:03:11,150 --> 00:03:17,130
‫We may want to use our customer ID as the index. Since the customer ID is the primary key,

46
00:03:17,250 --> 00:03:19,200
‫we can set it as the index.

47
00:03:19,650 --> 00:03:22,680
‫We'll read this CSV file into another variable,

48
00:03:22,760 --> 00:03:26,960
‫that is data2. We'll copy

49
00:03:27,470 --> 00:03:28,020
‫the above command,

50
00:03:33,430 --> 00:03:35,590
‫name it data2, and write another

51
00:03:35,640 --> 00:03:38,750
‫parameter, that is index_col.

52
00:03:39,370 --> 00:03:45,310
‫We provide the position of the index column, and for our data it is the customer ID,

53
00:03:46,350 --> 00:03:47,390
‫which is the 0th

54
00:03:47,430 --> 00:03:49,060
‫column of our data.

55
00:03:49,920 --> 00:03:53,190
‫Since this is the first column, the index is zero.

56
00:03:53,220 --> 00:03:54,810
‫That's why we are providing zero.

57
00:03:55,000 --> 00:03:57,600
‫This is similar to what we provided for header.

58
00:03:58,090 --> 00:04:02,490
‫Our header was present in the first row, so I passed 0, its position.

59
00:04:04,290 --> 00:04:06,720
‫Here also, our index column is at position zero.

60
00:04:06,990 --> 00:04:13,050
‫If we run this and then run head on it again,

61
00:04:17,510 --> 00:04:21,740
‫you can see that the zero, one, two, three, four index values are gone.

62
00:04:22,050 --> 00:04:24,050
‫Now the customer IDs are our index values.
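
As a rough sketch of the index_col step just described, assuming the customer ID is the first column of the file. The sample rows are made up for illustration (only the ID CG-12520 appears in the video), and the data is read from a string rather than from the real customer.csv.

```python
import io
import pandas as pd

# Hypothetical stand-in for customer.csv; the rows below are invented for illustration.
csv_text = """Customer ID,Customer Name,Segment,Age,Postal Code,Region
CG-12520,Customer A,Consumer,67,42420,South
XX-00002,Customer B,Corporate,31,90036,West
XX-00003,Customer C,Consumer,65,33311,South
"""

# header=0: the column names are in the first row.
# index_col=0: the first column (Customer ID) becomes the row index.
data2 = pd.read_csv(io.StringIO(csv_text), header=0, index_col=0)

print(data2.head())
print(data2.index.name)  # the index now carries the name of the first column
```

The default 0, 1, 2 numbering no longer appears; each row is labelled by its customer ID instead.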

63
00:04:28,010 --> 00:04:31,600
‫These are important, and we will discuss them in a short while.

64
00:04:32,860 --> 00:04:36,290
‫Now, the head method is used to view a sample of your data.

65
00:04:37,010 --> 00:04:44,240
‫If you want to see summary statistics of your data, you write data1

66
00:04:46,930 --> 00:04:50,430
‫dot describe. Here describe is the method name.

67
00:04:52,960 --> 00:04:53,870
‫And run this.

68
00:04:57,750 --> 00:05:00,810
‫So there are only two numeric columns in our data.

69
00:05:01,440 --> 00:05:05,490
‫That's why we are only getting two columns here.

70
00:05:05,740 --> 00:05:06,460
‫The first is age,

71
00:05:06,540 --> 00:05:08,070
‫and the second is postal code.

72
00:05:09,350 --> 00:05:15,740
‫Here you can see the total count of values, the mean of age, the standard deviation of age, the minimum

73
00:05:15,750 --> 00:05:16,970
‫age, and the maximum age.

74
00:05:17,360 --> 00:05:18,650
‫These are the percentile values.

75
00:05:18,720 --> 00:05:20,960
‫This is the 25th percentile value.

76
00:05:21,470 --> 00:05:30,110
‫So if you arrange all the ages in ascending order, the value at the 25th percentile of that data

77
00:05:30,260 --> 00:05:31,100
‫is this value.

78
00:05:31,940 --> 00:05:39,110
‫Similarly, this is the 50th percentile, also known as the median value. This is the 75th percentile value of

79
00:05:39,110 --> 00:05:39,410
‫age.

80
00:05:39,760 --> 00:05:41,780
‫And this is the maximum value of age.
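
The describe output discussed here can be sketched on a small made-up sample. The rows are invented for illustration, so the numbers will not match the video; the point is only which columns and which statistics describe returns by default.

```python
import io
import pandas as pd

# Hypothetical stand-in for customer.csv; the rows below are invented for illustration.
csv_text = """Customer ID,Customer Name,Segment,Age,Postal Code,Region
CG-12520,Customer A,Consumer,67,42420,South
XX-00002,Customer B,Corporate,31,90036,West
XX-00003,Customer C,Consumer,65,33311,South
XX-00004,Customer D,Consumer,20,90032,West
XX-00005,Customer E,Corporate,50,28027,South
"""

data1 = pd.read_csv(io.StringIO(csv_text), header=0)

# By default describe() summarizes only the numeric columns:
# count, mean, std, min, 25%, 50% (the median), 75%, and max.
summary = data1.describe()
print(summary)

# The 50% row is the median of each numeric column.
median_age = summary.loc["50%", "Age"]
print(median_age)
```

In this sample only Age and Postal Code are numeric, so the text columns are left out of the summary.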

81
00:05:44,300 --> 00:05:51,530
‫We will discuss this in univariate analysis, which we will cover in a later part of this course.

82
00:05:52,880 --> 00:05:55,830
‫There are two ways to index a data frame.

83
00:05:57,710 --> 00:06:00,360
‫As we discussed earlier, while importing the data

84
00:06:00,410 --> 00:06:04,350
‫we can provide an index column. For data1,

85
00:06:04,400 --> 00:06:09,610
‫we did not provide any index column. For data2, our index column

86
00:06:09,660 --> 00:06:10,680
‫is the customer ID.

87
00:06:12,140 --> 00:06:17,820
‫So if we want to view the first row of our data, we have to use either loc

88
00:06:18,310 --> 00:06:22,100
‫or iloc. If we use data1

89
00:06:22,460 --> 00:06:24,140
‫dot iloc

90
00:06:26,230 --> 00:06:27,520
‫and then provide zero,

91
00:06:29,210 --> 00:06:36,560
‫iloc will grab the row that is present at the 0th position of our data frame.

92
00:06:38,400 --> 00:06:44,300
‫So our output is the same as the first row of our data frame.

93
00:06:45,360 --> 00:06:52,830
‫If we want to use the index column, which we defined while creating our data frame, we have to use loc.

94
00:06:57,810 --> 00:07:00,390
‫In the brackets, we write the customer ID,

95
00:07:02,740 --> 00:07:06,140
‫CG-12520.

96
00:07:09,020 --> 00:07:14,180
‫In data2, we defined our index column as the customer ID,

97
00:07:14,250 --> 00:07:18,800
‫so now we can use loc to get the data for this

98
00:07:19,120 --> 00:07:21,230
‫customer ID. If we run this,

99
00:07:23,090 --> 00:07:28,940
‫you can see we are getting all the details of our customer except the customer ID, since this

100
00:07:28,940 --> 00:07:31,160
‫is the index column. Similarly,

101
00:07:31,910 --> 00:07:37,850
‫if I don't know this ID and I just want to grab the first customer,

102
00:07:38,480 --> 00:07:39,260
‫I can use

103
00:07:39,380 --> 00:07:39,840
‫iloc.

104
00:07:51,160 --> 00:07:51,440
‫Here also,

105
00:07:51,560 --> 00:07:54,600
‫I'm getting the same details. Here

106
00:07:54,840 --> 00:08:01,470
‫I was using iloc. With iloc, you have to use the positional number, zero and so on. With loc,

107
00:08:01,620 --> 00:08:05,530
‫you can use the values of the index column that you provided earlier.

108
00:08:06,180 --> 00:08:10,410
‫So if you know the position, you can use iloc.

109
00:08:11,460 --> 00:08:14,880
‫And if you know the index value, you can use loc.
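
The difference between loc and iloc described above can be sketched like this. The sample rows are made up for illustration; only the customer ID CG-12520 comes from the video, and it is assumed here to be the first row.

```python
import io
import pandas as pd

# Hypothetical stand-in for customer.csv; the rows below are invented for illustration.
csv_text = """Customer ID,Customer Name,Segment,Age,Postal Code,Region
CG-12520,Customer A,Consumer,67,42420,South
XX-00002,Customer B,Corporate,31,90036,West
XX-00003,Customer C,Consumer,65,33311,South
"""

# Customer ID (column 0) is used as the index, as for data2 in the video.
data2 = pd.read_csv(io.StringIO(csv_text), header=0, index_col=0)

# iloc selects by position: 0 is the first row, whatever its label is.
row_by_position = data2.iloc[0]

# loc selects by index label: here, the customer ID.
row_by_label = data2.loc["CG-12520"]

print(row_by_position)
print(row_by_label)
```

Both calls return the same row here, because the customer with ID CG-12520 happens to sit at position 0 in this sample.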

110
00:08:17,430 --> 00:08:23,520
‫Just as with other Python sequences, in a data frame you can also select multiple rows using the colon operator.

111
00:08:23,580 --> 00:08:29,370
‫So for example, if I write data2.iloc

112
00:08:31,920 --> 00:08:33,210
‫0 colon 5,

113
00:08:36,570 --> 00:08:39,570
‫this will give me the data of the first five rows,

114
00:08:40,960 --> 00:08:47,260
‫where the position is zero, one, two, three, and four. The number five is excluded from

115
00:08:47,260 --> 00:08:47,770
‫this result.

116
00:08:48,350 --> 00:08:51,310
‫So I'm getting the data of these five customers.

117
00:08:52,620 --> 00:08:56,100
‫You can use a step as well. If I add a step of 2

118
00:08:57,650 --> 00:08:58,470
‫and run this,

119
00:09:00,270 --> 00:09:07,190
‫I'm getting only three results, since I'm using a step. That's all for pandas.
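
The slicing with a colon and a step can be sketched as follows, again on made-up sample rows (only the ID CG-12520 comes from the video). It assumes the step is written as a third number after a second colon, which is the usual Python slice syntax.

```python
import io
import pandas as pd

# Hypothetical stand-in for customer.csv; the rows below are invented for illustration.
csv_text = """Customer ID,Customer Name,Segment,Age,Postal Code,Region
CG-12520,Customer A,Consumer,67,42420,South
XX-00002,Customer B,Corporate,31,90036,West
XX-00003,Customer C,Consumer,65,33311,South
XX-00004,Customer D,Consumer,20,90032,West
XX-00005,Customer E,Corporate,50,28027,South
XX-00006,Customer F,Consumer,66,19140,East
"""

data2 = pd.read_csv(io.StringIO(csv_text), header=0, index_col=0)

# Positions 0, 1, 2, 3, 4: the end position 5 is excluded.
first_five = data2.iloc[0:5]

# Same range with a step of 2: positions 0, 2, and 4.
every_other = data2.iloc[0:5:2]

print(first_five)
print(every_other)
```

The step version returns three rows because only every second position in the range 0 to 4 is kept.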

120
00:09:08,330 --> 00:09:14,790
‫We will be using pandas a lot more while doing our work, and we will discuss new topics as they come up.