WEBVTT

00:02.700 --> 00:07.410
In this session, we will begin with pandas DataFrames.

00:08.370 --> 00:14.940
So DataFrames are the very basic data structure which is used in pandas.

00:16.350 --> 00:23.360
These are two-dimensional labelled data structures with columns of potentially different types.

00:23.610 --> 00:29.280
So we can have a row and column structure, which is a kind of tabular structure to work with.

00:29.850 --> 00:37.390
Now, you can think of it like a spreadsheet or a SQL table or a normal Excel spreadsheet.

00:37.650 --> 00:45.210
Usually what we work with are tables from databases, or we work with Excel files or CSV files.

00:46.780 --> 00:56.080
Now, these are all created from a dictionary or tabular data, so we will learn how we can work with

00:56.080 --> 00:59.900
these data structures and how we can create a DataFrame.

01:00.370 --> 01:05.200
So let us first begin with the creation of a DataFrame.

01:08.280 --> 01:15.120
Now, before starting, the first thing which we will have to do is import the pandas

01:15.120 --> 01:18.990
library and the NumPy library, so we will type this.

01:20.770 --> 01:21.610
import

01:23.710 --> 01:28.360
numpy as np, and import

01:30.700 --> 01:33.060
pandas as pd.

01:34.920 --> 01:43.950
Now, the next step would be to start creating different datasets. Now, to create a dataset, let us create

01:43.950 --> 01:49.110
certain attributes which we will be looking for or different features which we will be looking for.

01:49.440 --> 01:51.630
OK, so let us create a feature.

01:51.630 --> 01:52.490
say, ID.

01:53.860 --> 01:58.420
And let this ID be np.random

02:00.170 --> 02:06.800
.randint, and I want this ID to range from,

02:07.900 --> 02:21.580
so let us create these IDs in a random manner, and let's keep those values to be low from 20 and high

02:21.580 --> 02:28.100
value to be, say, 2000, and let there be 20 such values.

02:28.360 --> 02:31.390
So I give size equal to.

02:33.270 --> 02:35.490
Let us say 20.

02:36.450 --> 02:38.810
So let me explain it for you.

02:42.260 --> 02:45.950
So we give size equal to.

02:49.370 --> 02:49.940
20.

02:51.510 --> 02:54.860
So here we have random IDs which have been created.
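
NOTE
A minimal sketch of the ID step described in the cues above, assuming the call being dictated is np.random.randint. The seed is an assumption added only so the sketch is repeatable; it is not part of the session. Note that randint treats the high value as exclusive.
```python
import numpy as np
# Assumed seed, used only to make this sketch repeatable.
np.random.seed(0)
# 20 random integer IDs, each at least 20 and strictly below 2000.
ID = np.random.randint(low=20, high=2000, size=20)
print(ID)
```
The same call with low=15 and high=63 would produce the age column created in the next step.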

02:55.880 --> 02:58.580
Now, the next thing which we want is.

03:01.380 --> 03:09.900
Let us say the age of a person, so let us create an age column, and it should be np.random

03:11.320 --> 03:12.040
dot

03:14.130 --> 03:25.410
randint, and let it range from a low of 15, high to be, let's say,

03:27.160 --> 03:28.210
sixty-three.

03:29.690 --> 03:34.520
And size should again be 20.

03:36.830 --> 03:41.060
We are giving the size as 20 because we want 20 rows of data.

03:41.920 --> 03:45.340
So we have created age.

03:46.640 --> 03:53.780
Now, one more thing to take care of is that when we are

03:53.780 --> 04:03.680
creating these IDs, we may not want them to be randomly generated, or we might keep

04:03.680 --> 04:07.200
them random, or we can simply create them in a different way.

04:07.370 --> 04:11.440
It is completely your choice if you want to put replace equal to False.
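
NOTE
One possible reading of the remark above, offered as an assumption rather than as what was typed in the session: np.random.randint has no replace parameter, but np.random.choice does, and replace=False guarantees that no ID is drawn twice. The name unique_ID is hypothetical.
```python
import numpy as np
# Candidate IDs 20, 21, ..., 1999 (same range as the randint call).
candidates = np.arange(20, 2000)
# replace=False means every drawn ID is distinct.
unique_ID = np.random.choice(candidates, size=20, replace=False)
print(unique_ID)
```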

04:13.950 --> 04:27.120
The next thing, let us create a column, say city, and in city we can put np.random.

04:27.600 --> 04:28.560
choice.

04:29.480 --> 04:36.110
And the choice should be taken from these cities, say New York.

04:37.490 --> 04:38.600
And I want.

04:41.000 --> 04:41.720
Darlene.

04:42.910 --> 04:44.800
And I want, say,

04:46.320 --> 04:50.220
Paris, and I want,

04:54.050 --> 04:58.330
OK, and I want 20 such occurrences of this again.

04:59.350 --> 05:07.750
So I view city, so we have these cities present. Now, the next thing what we can do is we can have

05:07.750 --> 05:11.110
whether someone's loan has been approved or not.

05:12.090 --> 05:13.710
So we can say approval.

05:14.670 --> 05:24.540
And we can give np.random.choice, and in this we can give

05:25.680 --> 05:33.180
Zero and one, and you will get 20 such values, so here we have approval.
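
NOTE
A short sketch of the two np.random.choice columns described above. The city names other than New York are placeholders chosen for illustration, since the captions do not record them reliably.
```python
import numpy as np
# Assumed city list; only "New York" is clearly audible in the session.
cities = ["New York", "Berlin", "Paris"]
# 20 city labels sampled with repetition from the list.
city = np.random.choice(cities, size=20)
# 20 approval flags: 0 = not approved, 1 = approved.
approval = np.random.choice([0, 1], size=20)
print(city)
print(approval)
```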

05:35.560 --> 05:41.270
Similarly, we can have as many columns, we can create as many columns as we want, so I will just keep

05:41.270 --> 05:42.520
it to these only.

05:42.880 --> 05:53.050
And from this, now, there are different ways how we can convert data to a DataFrame. The first way, which

05:53.050 --> 06:02.500
I would tell you is converting this entire data into a list and then converting that list into a data

06:02.500 --> 06:02.830
frame.

06:03.160 --> 06:11.320
So what we can do is we can simply say my data, or we can say df_list.

06:12.360 --> 06:20.370
And in this, I want to have a list which has been generated from the zip of.

06:23.200 --> 06:24.130
Age.

06:26.190 --> 06:26.940
City.

06:29.630 --> 06:30.770
Approval.

06:32.410 --> 06:35.740
And ID, so I'm giving ID,

06:37.010 --> 06:38.890
age, city and approval.

06:40.870 --> 06:42.820
So I will simply.

06:43.900 --> 06:46.630
Run this and view df_list.

06:48.000 --> 06:49.560
So this is my df_list.

06:49.860 --> 06:57.540
Now, each of these values has been combined and created into tuples, and these tuples are put into

06:57.540 --> 06:58.110
a list.

06:58.140 --> 06:59.110
So this is what we get.
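
NOTE
A sketch of the zip step described above, assuming the four arrays are passed in the order ID, age, city, approval and that the result is stored as df_list. The city names are placeholders.
```python
import numpy as np
ID = np.random.randint(low=20, high=2000, size=20)
age = np.random.randint(low=15, high=63, size=20)
city = np.random.choice(["New York", "Berlin", "Paris"], size=20)
approval = np.random.choice([0, 1], size=20)
# zip pairs the i-th element of every array into one tuple (one row).
df_list = list(zip(ID, age, city, approval))
# Show the first three rows only.
print(df_list[:3])
```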

06:59.460 --> 07:04.410
Now, the next thing which we will be doing is converting it into a data frame.

07:04.410 --> 07:10.380
So to convert anything to a DataFrame, what we simply do is we give the name, df, and we'll

07:10.380 --> 07:14.510
say pd.DataFrame.

07:15.870 --> 07:26.250
And then we give the data, so the data is my df_list, so I will give data is equal to df_list, and

07:26.250 --> 07:32.700
I have to give the column names, because if I will not provide the column names, then what will happen

07:32.700 --> 07:36.360
is it will by default, consider the indexes to be the column names.

07:36.510 --> 07:42.600
And I don't want to work with the indexes because it will be really difficult to work with the indexes.

07:42.600 --> 07:45.840
And I will not get to know which index belongs to which column.

07:45.840 --> 07:52.890
And when we are working with a huge dataset, at that time we can get confused and we can do something

07:52.890 --> 07:58.180
which is not actually necessary or required, and we can just mess it up.

07:58.440 --> 08:03.870
So what we will be doing is we will create column names, so we can say columns.

08:05.790 --> 08:12.750
And we will give the names, so let us give the column names like this. We will need these as my column

08:12.750 --> 08:13.260
names.

08:16.010 --> 08:18.290
I give this in a list form.

08:19.930 --> 08:23.050
And these have to be in string form.

08:33.620 --> 08:42.740
Now, if I run this, I will be able to view this DataFrame by typing df, so this is

08:42.740 --> 08:49.340
the DataFrame which has been created. I can view it, and you can see the indexes from zero

08:49.340 --> 08:54.350
to 19 and the column names ID, age, city and approval.
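
NOTE
A small sketch of building a DataFrame from a list of tuples, with and without column names. The three rows are made-up sample values, used here so the sketch stays short; the session uses 20 random rows.
```python
import pandas as pd
# Made-up sample rows in the order ID, age, city, approval.
df_list = [(101, 25, "New York", 1), (102, 40, "Paris", 0), (103, 58, "Berlin", 1)]
# With explicit column names.
df = pd.DataFrame(data=df_list, columns=["ID", "age", "city", "approval"])
# Without names, pandas labels the columns 0, 1, 2, 3.
no_names = pd.DataFrame(data=df_list)
print(df)
print(list(no_names.columns))
```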

08:54.620 --> 08:58.130
And what I can do is I can view different data points.

08:58.130 --> 09:05.420
We can view different columns also by simply saying, say, df.age.

09:08.020 --> 09:18.400
So this gives me the data from the age column. I can do something like df.ID, so I was able

09:18.400 --> 09:20.620
to see the values from the ID column.

09:20.890 --> 09:27.760
Now, this is possible only if we don't have any spaces in my column names. In case I have any spaces

09:27.760 --> 09:30.020
in my column names, then this would not be possible.

09:30.400 --> 09:31.900
So what I can do is.

09:34.320 --> 09:41.820
Here, if I have something like ID ID, something like this, then what will happen is, now I try to

09:41.820 --> 09:44.700
retrieve these IDs as df.ID space ID.

09:44.730 --> 09:48.360
Now, this would throw an error because it doesn't really know what this ID is.

09:48.630 --> 09:57.060
So it is always suggested to either replace any spaces present in the column names by underscore or

09:57.120 --> 09:59.460
just don't have any spaces in the column names.

09:59.850 --> 10:07.780
Now, if you want to access these values, instead of using this, we can also use this indexing.

10:08.310 --> 10:12.890
So in this way I can view the ID column. Now, another thing.
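
NOTE
A sketch of the two ways of reading a column, and of why spaces in a column name matter. The column "ID ID" and the sample values are illustrative assumptions. Writing df.ID ID is not valid Python, so a name with a space can only be reached with square brackets, or after renaming.
```python
import pandas as pd
df = pd.DataFrame({"ID": [101, 102], "age": [25, 40], "ID ID": [1, 2]})
# Dot access works only for names that are valid Python identifiers.
ages_attr = df.age
# Bracket access works for any column name.
ages_key = df["age"]
spaced = df["ID ID"]
# Replace spaces with underscores so dot access works everywhere.
df.columns = df.columns.str.replace(" ", "_")
renamed = df.ID_ID
print(list(df.columns))
```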

10:12.900 --> 10:21.580
What I can do is I can say df.head, and it will show me the top five rows of the data.

10:22.320 --> 10:27.330
I can also define the number of values which I want, but by default it will give five.

10:27.540 --> 10:30.960
If I want to view nine, then it will give me nine rows of data.

10:31.140 --> 10:34.440
So it is completely up to me what all values I want to see.

10:34.920 --> 10:42.810
If I want to view the tail, then I can say df.tail, and again, it will show the bottom five to

10:42.810 --> 10:43.020
me.
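
NOTE
A brief sketch of head and tail on a 20-row frame. The column values are placeholders chosen so the row positions are easy to see.
```python
import pandas as pd
df = pd.DataFrame({"ID": range(20), "age": range(20, 40)})
# Default is five rows.
top5 = df.head()
# Any row count can be passed explicitly.
top9 = df.head(9)
# tail returns rows from the bottom, also five by default.
bottom5 = df.tail()
print(top9)
print(bottom5)
```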

10:44.580 --> 10:47.820
Now, the next thing what I can do is.

10:49.720 --> 11:00.340
I can create a DataFrame using a dictionary format, so let us call it df again, and this

11:00.340 --> 11:01.570
will override the previous one.

11:02.560 --> 11:05.080
So it's a fresh DataFrame which I will be creating.

11:05.110 --> 11:09.200
So it's pd.DataFrame.

11:09.220 --> 11:13.810
Please make sure that the F is capital in this.

11:15.060 --> 11:21.390
And then we give the round brackets, and then we create the dictionary, and in the dictionary we will simply

11:21.390 --> 11:23.440
give the column name which we want to have.

11:23.760 --> 11:24.850
So I give ID.

11:27.870 --> 11:38.700
Along with it, I will give the data which I have, so I give ID. Then I'll give age

11:40.260 --> 11:41.610
and age.

11:43.120 --> 11:44.170
Then we have

11:45.450 --> 11:46.440
approval

11:48.480 --> 11:54.030
and approval, and we also have city, so we give city

11:55.610 --> 11:57.460
and city.

11:59.000 --> 12:04.140
So like this, we can create another DataFrame, and this is the DataFrame that we get.

12:05.360 --> 12:07.430
So we have created this DataFrame.

12:07.460 --> 12:13.700
Now, one thing you can notice here is that the sequence has also changed once I change the sequence

12:13.700 --> 12:14.810
in the dictionary.
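
NOTE
A sketch of the dictionary route, assuming the keys are given in the order ID, age, approval, city as dictated above. The column order of the result follows the order of the keys. The city names are placeholders.
```python
import numpy as np
import pandas as pd
ID = np.random.randint(low=20, high=2000, size=20)
age = np.random.randint(low=15, high=63, size=20)
approval = np.random.choice([0, 1], size=20)
city = np.random.choice(["New York", "Berlin", "Paris"], size=20)
# Each key becomes a column name; each value becomes that column's data.
df = pd.DataFrame({"ID": ID, "age": age, "approval": approval, "city": city})
print(df.head())
```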

12:15.980 --> 12:20.300
So this is one thing to take care of.

12:23.060 --> 12:32.780
The next thing is, let us say, I want to access a particular value.

12:40.520 --> 12:50.620
So we can access values, so let us say I want to access ID again, so I can say df and then I can directly

12:50.690 --> 12:51.800
give ID here.

12:52.940 --> 12:59.260
Or I can use a dot; that is completely up to me, so I can do it this way now.

13:00.270 --> 13:03.900
What I can do is I can say df,

13:08.430 --> 13:09.120
age

13:10.130 --> 13:11.450
greater than, say,

13:13.030 --> 13:13.540
30.

13:15.260 --> 13:22.710
So it will give me the data where people have age greater than 30, or say greater than 50,

13:22.740 --> 13:24.050
if you want greater than 50.

13:24.500 --> 13:28.140
So we get the data where people have age greater than 50.
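
NOTE
A sketch of filtering rows by age, assuming the expression used is a boolean mask inside square brackets. The four sample rows are made up for illustration.
```python
import pandas as pd
df = pd.DataFrame({"ID": [101, 102, 103, 104], "age": [25, 52, 61, 50]})
# The comparison yields one True/False value per row.
mask = df["age"] > 50
# Indexing with the mask keeps only the rows where it is True.
over_50 = df[mask]
print(over_50)
```
An age of exactly 50 is excluded because the comparison is strict.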

13:28.670 --> 13:40.280
Now, another thing, what we can do is we can access the data from files also, so we can use read_csv

13:40.280 --> 13:40.990
for that.

13:43.280 --> 13:51.230
So we will simply give the path of the file, so file path is equal to the file name

13:51.230 --> 13:54.480
or path name, in case it is in the same folder.

13:54.830 --> 13:56.770
So I have bank-

13:58.160 --> 13:58.660
full

14:00.130 --> 14:02.130
.csv.

14:02.830 --> 14:09.880
So I have this file and I want to read this file into a DataFrame, so I will say df is equal

14:09.880 --> 14:15.760
to read_csv, and this has to be prefixed with pd.

14:16.480 --> 14:24.160
So it is pd.read_csv, and I give the file path, so I will get the DataFrame, and you can view

14:24.160 --> 14:26.930
the DataFrame again by using df.head.

14:29.850 --> 14:37.140
So this is the entire DataFrame. Now, please see that all this data, which I have provided,

14:37.350 --> 14:39.570
has come in a single column.

14:40.970 --> 14:46.790
Here, this is the view which we get when we have the column separator, but here all the data is coming

14:46.790 --> 14:48.180
into simply a single column.

14:48.200 --> 14:52.820
So what we can do is we have to provide the delimiter, so we'll say

14:56.510 --> 14:57.440
Is equal to.

14:58.810 --> 14:59.650
Semicolon.

15:01.080 --> 15:08.750
So now it has separated all the data into different columns.
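
NOTE
A sketch of the separator problem, assuming the parameter being set is sep. To keep it self-contained, an in-memory string stands in for the file; its three columns and two rows are made-up sample data, not the contents of the file used in the session.
```python
import io
import pandas as pd
# Semicolon-separated sample text standing in for a file on disk.
raw = "age;job;balance\n58;management;2143\n44;technician;29\n"
# Default separator is a comma, so each line lands in a single column.
one_col = pd.read_csv(io.StringIO(raw))
# Passing sep=";" splits every line into its fields.
df = pd.read_csv(io.StringIO(raw), sep=";")
print(one_col.shape)
print(df.shape)
```
With a real file, the io.StringIO(raw) argument would be replaced by the file path.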

15:09.790 --> 15:17.020
How we access these data and how we modify these data is something which we will learn in the next

15:17.020 --> 15:24.220
session, so we will learn how we will convert these columns into numeric form, which we have already

15:24.910 --> 15:29.320
taken a look at in the session which we have discussed.

15:30.830 --> 15:36.120
So let us see how we can modify this DataFrame and how we can work with this data.