WEBVTT

00:01.440 --> 00:07.500
Hello again! In this video, we are going to look at binary files. A binary file is much closer to

00:07.500 --> 00:13.800
the computer than a text file, so we need to work with the computer. And computers are very literal.

00:14.190 --> 00:22.170
You need to say exactly what you mean, and you need to mean exactly what you say! To open a binary file,

00:22.170 --> 00:23.340
we use binary mode.

00:23.760 --> 00:25.050
So we have seen this before.

00:25.080 --> 00:29.070
We use fstream double colon binary. And then we have our open() call.

00:29.280 --> 00:35.580
So this is going to open the image dot BMP file in binary. Or we could just pass that to the constructor.
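
NOTE
As a rough sketch of the two ways of opening a file in binary mode mentioned here:
calling open() with the binary flag, or passing the same arguments to the constructor.
The function name open_both_ways and the use of an output stream are assumptions made
for this illustration; they are not taken from the video.
```cpp
#include <fstream>
#include <string>
//
// Opens the named file for binary output in two equivalent ways.
// Returns true only if both attempts succeeded.
bool open_both_ways(const std::string& name) {
    // 1) Default-construct the stream, then call open() with the binary flag.
    std::ofstream first;
    first.open(name, std::ofstream::binary);
    const bool first_ok = first.is_open();
    first.close();
    // 2) Pass the file name and the mode straight to the constructor.
    std::ofstream second(name, std::ofstream::binary);
    return first_ok && second.is_open();
}
```
Either form leaves you with a stream in binary mode; the constructor form is just shorter.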

00:37.050 --> 00:40.830
We cannot use the shift operators for working with binary files.

00:41.460 --> 00:44.610
They perform conversions between numeric data and text.

00:45.030 --> 00:50.640
If we are doing output, they format the text data and, if we are doing input, they throw away whitespace

00:50.640 --> 00:51.180
characters.

00:52.320 --> 00:55.290
So the data in memory does not correspond to what is in the file.

00:55.740 --> 01:00.980
So we always need to use the read() and write() member functions when we are working with binary files.
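
NOTE
To make the difference concrete, here is a small sketch that sends the same 32-bit
integer to a stream twice: once with the shift operator and once with write().
The helper names formatted and unformatted are hypothetical, and a string stream
stands in for a file so that the result is easy to inspect.
```cpp
#include <cstdint>
#include <sstream>
#include <string>
//
// Formatted output: operator<< converts the number to its text representation.
std::string formatted(std::int32_t value) {
    std::ostringstream out;
    out << value;
    return out.str();
}
//
// Unformatted output: write() copies the bytes of the object unchanged.
std::string unformatted(std::int32_t value) {
    std::ostringstream out;
    out.write(reinterpret_cast<const char*>(&value), sizeof value);
    return out.str();
}
```
The formatted result has one character per decimal digit, while the unformatted result
always has exactly sizeof(std::int32_t) bytes, whatever the value is.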

01:04.040 --> 01:09.890
If you look at a binary file, it is all 1's and 0's.

01:10.190 --> 01:12.500
There is no obvious indication what all that data means.

01:12.920 --> 01:18.650
So normally we use some kind of file format, to give the file a structure and make the data meaningful

01:18.650 --> 01:19.580
to applications

01:19.820 --> 01:20.870
that are going to work with it.

01:22.430 --> 01:24.140
Often, we can use standard formats.

01:24.140 --> 01:26.660
If we are creating an image, we could use JPEG.

01:26.660 --> 01:32.450
If we are creating a compressed file, we can use ZIP. Or we can make our own formats, if the standard ones

01:32.750 --> 01:33.950
do not provide what we need.

01:36.320 --> 01:42.440
The best way of working with a binary file is to create a struct which represents the file format.

01:42.920 --> 01:49.400
So each member of the structure represents one field in the file format. And then you can just grab this

01:49.400 --> 01:55.070
struct, in memory, and write it straight to the file. And for going the other way, you can just take the

01:55.070 --> 01:58.100
data out of the file and read it straight into your struct.

02:00.200 --> 02:03.500
We are going to use this very simple struct, just to talk about it.

02:03.980 --> 02:09.380
So this has a member which is a single character, and it has two members which are ints.

02:10.850 --> 02:13.820
And here we run into the first problem with binary data.

02:14.390 --> 02:17.360
The size of an int is dependent on the implementation.

02:18.020 --> 02:25.130
If, for example, we write a binary file on a 32-bit system and read it on a 64-bit system, then the

02:25.130 --> 02:27.770
data is going to be in the wrong place and we are going to get the wrong results.

02:28.190 --> 02:32.960
So we need to make sure that all the integers have the same size on every system, so we use the fixed

02:32.960 --> 02:33.800
size integers.

02:34.040 --> 02:36.230
In this case, we are using 32-bit integers.
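
NOTE
A struct like the one described might look roughly like this. The member names c, x
and y match the ones used later in the video; the static_assert is an addition for
this sketch, to show that the width of the fixed size type can be checked at compile time.
```cpp
#include <climits>
#include <cstdint>
//
// One character followed by two integers that are 32 bits wide on every system.
struct point {
    char c;
    std::int32_t x;
    std::int32_t y;
};
//
// std::int32_t is exactly 32 bits wide wherever the type exists.
static_assert(sizeof(std::int32_t) * CHAR_BIT == 32, "int32_t must be 32 bits");
```
With plain int, the same struct could have a different size on a different implementation.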

02:39.800 --> 02:45.890
When we call read() or write(), the first argument is the address of the data. The start of the data.

02:46.490 --> 02:48.980
So this is going to be the address of the point object.

02:49.910 --> 02:52.370
We need to cast that to a pointer to char.

02:52.760 --> 02:57.070
We could use the old C cast, which is just char star inside brackets.

02:57.560 --> 03:02.060
But there is a specific cast for this in C++, which is reinterpret cast.

03:02.660 --> 03:08.030
So this says we are throwing away all the type information and this is just binary data. All ones and zeros.

03:09.650 --> 03:14.240
The argument to this is the address of the data, so that will be the address of the point object.

03:16.040 --> 03:20.180
The second argument to read() or write() will be the number of bytes in the object.

03:20.600 --> 03:22.580
So that is the size of the struct.

03:24.380 --> 03:26.280
And then our calls are going to look like this.

03:26.300 --> 03:33.200
So for write(), we are going to take the address of the point struct, and cast it, and pass the size

03:33.200 --> 03:33.680
of the point.

03:34.130 --> 03:35.900
And similarly for the read() operation.

03:36.350 --> 03:39.410
So this will have data in the point and it is going to write it to the file.

03:39.860 --> 03:43.990
This will read the data from the file and store it in the point object.
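
NOTE
The two calls can be sketched as a pair of helper functions. The names write_point and
read_point are hypothetical; the calls inside them follow the description above: the
address of the struct cast with reinterpret_cast, then the size of the struct.
```cpp
#include <cstdint>
#include <fstream>
#include <string>
//
struct point {
    char c;
    std::int32_t x;
    std::int32_t y;
};
//
// Writes the bytes of p to the named file. Returns false if anything failed.
bool write_point(const std::string& name, const point& p) {
    std::ofstream out(name, std::ofstream::binary);
    if (!out) return false;
    out.write(reinterpret_cast<const char*>(&p), sizeof(point));
    return static_cast<bool>(out);
}
//
// Reads sizeof(point) bytes from the named file straight into p.
bool read_point(const std::string& name, point& p) {
    std::ifstream in(name, std::ifstream::binary);
    if (!in) return false;
    in.read(reinterpret_cast<char*>(&p), sizeof(point));
    return static_cast<bool>(in);  // read() sets failbit if the file was too short
}
```
The output stream is closed when write_point returns, so the data is on disk before
read_point opens the file again.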

03:46.420 --> 03:50.380
Before we rush off and start reading and writing binary files, there are some more problems we need to

03:50.380 --> 03:50.920
think about.

03:51.550 --> 03:54.190
We need to think about memory alignment and padding.

03:58.480 --> 04:04.420
In modern computers, the memory circuits are optimized for accessing data which is so-called "word-

04:04.420 --> 04:04.870
aligned".

04:05.770 --> 04:11.080
So this means that each object in the data is at a multiple of the word size.

04:12.880 --> 04:20.290
If we have a 32-bit system, for example, and then the first object is at, let's say, 1000 hex in memory,

04:20.770 --> 04:26.350
then the next object must be at an offset of 4 or 8 or 16 from that, or some other multiple of 4, and so on.

04:26.770 --> 04:28.210
And that data is word-aligned.

04:29.500 --> 04:31.870
If the data is not word-aligned, then it is all over the place.

04:31.870 --> 04:35.830
So the first one could be at 1000, the next one at 1002 and so on.

04:36.460 --> 04:40.750
And accessing memory this way [word-aligned] is much faster than accessing it that way [not word-aligned].

04:41.140 --> 04:44.830
In fact, there are some systems on which you cannot actually access memory like that [not word-aligned].

04:47.150 --> 04:50.840
So that is word alignment. We need to have the data on multiples of four.

04:54.670 --> 04:59.650
So what happens if we have a struct which is not word-aligned? Like ours, which has a character followed

04:59.650 --> 05:00.190
by an int.

05:01.480 --> 05:06.710
In this case, the compiler will usually add extra bytes to make sure that the data is laid out at

05:06.730 --> 05:07.570
multiples of four.

05:07.960 --> 05:12.010
So in this case, the compiler is going to add three bytes on a 32-bit system.

05:13.210 --> 05:18.280
So we have char, which is 1 byte, then we have these 3 bytes, so that makes 4.

05:18.430 --> 05:20.500
So the int is going to be at a multiple of 4.

05:20.860 --> 05:26.800
So if this is 1000, then this is going to be 4 bytes further. And then the int is 32 bits.

05:26.800 --> 05:28.060
So that is four bytes again.

05:28.240 --> 05:29.980
So the next one is four bytes further along.

05:30.400 --> 05:31.780
So that is all correctly aligned.
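
NOTE
You can ask the compiler where it actually put each member. This sketch uses the
standard offsetof macro and alignof; the helper names are hypothetical. The numbers
depend on the implementation, so the comments say "typically" rather than promising
a particular value.
```cpp
#include <cstddef>
#include <cstdint>
//
struct point {
    char c;
    std::int32_t x;
    std::int32_t y;
};
//
// Offset of x from the start of the struct (typically 4 with default alignment).
std::size_t offset_of_x() { return offsetof(point, x); }
//
// Offset of y from the start of the struct (typically 8 with default alignment).
std::size_t offset_of_y() { return offsetof(point, y); }
//
// Number of padding bytes the compiler inserted between c and x (typically 3).
std::size_t padding_after_c() { return offsetof(point, x) - sizeof(char); }
```
Whatever the exact numbers are, each int will sit at a multiple of alignof(std::int32_t).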

05:33.730 --> 05:38.770
Some people think they can be clever and use these padding bytes for their own nefarious purposes.

05:39.280 --> 05:40.250
This is not a good idea.

05:40.780 --> 05:44.190
This is not portable. If you have a different system,

05:44.230 --> 05:49.900
for example, if this is 64-bit, then there are going to be 7 bytes here, not 3. To make it

05:50.230 --> 05:50.890
8 bytes.

05:52.390 --> 05:56.440
And then we will have some extra padding bytes here, which will not be on the 32-bit one.

05:57.880 --> 06:01.810
And also, these are internal to the compiler. So it is possible that the compiler could do something

06:01.810 --> 06:04.630
itself with these. In which case your code is going to conflict with it.

06:05.050 --> 06:09.580
So, as far as you are concerned, these bytes are inaccessible. And they may not even exist!

06:11.650 --> 06:17.440
If we have a file format which expects data fields to be at offsets which are not multiples of 4,

06:17.920 --> 06:18.940
then we have a slight problem.

06:19.360 --> 06:21.670
For example, the bitmap format is very old.

06:21.670 --> 06:26.770
It was designed for 16-bit computers, so it has everything at 2-byte offsets.

06:28.090 --> 06:33.360
If we just go ahead and write the file, then the compiler is going to add the padding bytes, which will

06:33.370 --> 06:38.050
put everything on 4 bytes. And then the data will not match up with the bitmap format.

06:38.710 --> 06:44.020
So if you generate a bitmap like this and then try to display it, the program will not understand

06:44.120 --> 06:47.470
it. It will probably say that the data is corrupt, or it is not in the correct format.

06:49.320 --> 06:54.480
So compilers provide a means of affecting the way that they pad the data.

06:56.040 --> 07:01.650
There is a directive called hash pragma, which means everything after this is non-standard. So

07:01.650 --> 07:05.580
we have the nonstandard instruction pack which will change the alignment.

07:06.090 --> 07:07.050
And then we put push.

07:07.350 --> 07:09.870
And then the second argument is the number of bytes.

07:11.550 --> 07:14.790
So if we have 1 as the argument, then we have 1-byte alignment.

07:15.120 --> 07:16.440
So in effect, we have no padding

07:16.440 --> 07:19.080
at all. All the data just follows straight on, one item after another.

07:20.460 --> 07:25.590
If we put 2 in here, then we have 2-byte alignment, so the data elements will be at multiples

07:25.590 --> 07:26.070
of 2.

07:26.970 --> 07:31.740
And then when we have finished, when we get to the end of our struct, we need to reset back to the normal

07:31.740 --> 07:32.250
alignment.

07:32.790 --> 07:34.890
So we just do a pop operation.

07:35.550 --> 07:40.350
If we do not do that, then all the code which follows will still be using the one byte alignment that

07:40.350 --> 07:41.070
we set up here.

07:41.640 --> 07:46.530
And if we call some library code, which assumes that we are using the standard alignment and we are using a different

07:46.530 --> 07:48.540
one, then we could well have problems.
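
NOTE
Here is a sketch of the push and pop pair wrapped around a struct. It assumes a
compiler that supports the non-standard pack pragma, as GCC, Clang and MSVC do. The
struct names packed_point and normal_point are made up for the comparison, and the
static_assert lines are there to fail loudly if the pragma is ignored.
```cpp
#include <cstddef>
#include <cstdint>
//
#pragma pack(push, 1)   // save the current alignment, then switch to 1-byte alignment
struct packed_point {
    char c;
    std::int32_t x;
    std::int32_t y;
};
#pragma pack(pop)       // restore the alignment that was in force before the push
//
// Declared after the pop, so this one gets the normal padding again.
struct normal_point {
    char c;
    std::int32_t x;
    std::int32_t y;
};
//
// Assumption: the compiler honours the pragma, so there is no padding at all.
static_assert(sizeof(packed_point) == 9, "expected 1 + 4 + 4 bytes");
static_assert(offsetof(packed_point, x) == 1, "x should follow c directly");
```
Only the struct between the push and the pop is affected; normal_point keeps the default layout.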

07:50.340 --> 07:53.280
So this is non-standard, but it actually works with all the main compilers.

07:53.310 --> 07:55.950
So it is a kind of "de facto" standard, if you like.

07:57.390 --> 08:00.660
And the compilers also provide a compiler option you can use.

08:00.660 --> 08:03.570
But this will set the alignment for all the code in the source file.

08:04.620 --> 08:08.790
So that will be something on the command line or in the IDE, or you can add it to a Makefile.

08:11.350 --> 08:16.090
And then finally, there is a standard way to do this in C++11, which is the alignas keyword.

08:16.600 --> 08:23.590
So you can actually do this on a member by member basis. So you can say that you want this int to be on

08:23.590 --> 08:25.210
a multiple of four bytes.

08:27.600 --> 08:33.210
Unfortunately, this only works for multiples of the word size. If you want to have an alignment

08:33.210 --> 08:37.530
which is less than the word size, if you want 1 byte or 2 bytes, then you cannot do that.

08:38.070 --> 08:40.110
The reason for this is that this is standard.

08:40.110 --> 08:45.450
It has to work on every computer. And there are some computers which do not support unaligned access.
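
NOTE
A sketch of alignas applied to a single member. The struct name aligned_point and the
choice of 8 bytes are assumptions for this illustration; 8 is used instead of 4 so that
the effect is visible, because 4 would normally be the alignment of the int anyway.
```cpp
#include <cstddef>
#include <cstdint>
//
struct aligned_point {
    char c;
    alignas(8) std::int32_t x;   // ask for x to start on a multiple of 8 bytes
    std::int32_t y;
};
//
// alignas can make the alignment stricter, but never weaker: writing
// alignas(1) std::int32_t x; would be rejected by the compiler.
std::size_t aligned_offset_of_x() { return offsetof(aligned_point, x); }
```
This is why alignas cannot replace the pack pragma when a file format needs 1-byte or 2-byte offsets.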

08:48.150 --> 08:49.950
Okay, so let's have an example of this.

08:50.530 --> 08:57.810
We are using the fixed size integers, so we include the cstdint header. Here is our point structure.

08:58.050 --> 09:03.740
We are going to try it first without the pragmas, to see what effect that has. In the main() function,

09:03.750 --> 09:06.690
we are creating an object of this point structure.

09:06.700 --> 09:08.700
So we have 'c' is the character

09:08.700 --> 09:11.010
'a', x is 1 and 'y' is 2.

09:11.640 --> 09:15.360
Then we open this file, file dot bin, in binary mode.

09:16.290 --> 09:18.000
We check it is open and we do the write().

09:18.000 --> 09:24.360
So we are casting the address of this object to a pointer to char, and we are passing the number of bytes

09:24.360 --> 09:25.620
in this structure as the argument.

09:26.310 --> 09:30.360
And then we close the file, because we are about to read from it and we want to make sure that all the

09:30.360 --> 09:31.920
data gets written to disk.

09:34.440 --> 09:39.210
Then we open the file, in binary mode again. We create another point object.

09:39.630 --> 09:42.000
So we are going to read the data into this object.

09:42.990 --> 09:44.250
Then we check the file is open.

09:44.250 --> 09:46.770
We read the data into this object.

09:47.520 --> 09:54.000
Then we close the file. Then we call the gcount() member of the stream to find out how much data we read and then

09:54.000 --> 09:55.140
we print out what we got.

09:56.400 --> 09:57.600
So let's see what happens.

10:00.510 --> 10:05.430
So we read 12 bytes. And we get x equals 1, y equals 2.

10:10.960 --> 10:17.350
If we look at this file that we have generated, we see we get 61. So that is the ASCII code for the letter

10:17.380 --> 10:20.530
'a'. Then we get to these three padding bytes.

10:21.820 --> 10:23.380
Then we get the value

10:23.380 --> 10:29.020
1 for x, which is a 4-byte int (if my mouse can reach that far!)

10:29.410 --> 10:32.650
So this int is four bytes into the file, so it is on a multiple of four.

10:33.250 --> 10:34.510
And then two for y.

10:34.780 --> 10:37.450
And again, that is 8 bytes into the file. So

10:37.450 --> 10:39.490
that is also a multiple of 4.
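
NOTE
If you do not have a hex viewer to hand, a few lines of code can produce a similar
view of the file. This is only a sketch: the function name hex_dump and its output
format (two hex digits per byte, separated by spaces) are choices made for this example
and are not the tool shown in the video.
```cpp
#include <cstddef>
#include <fstream>
#include <iomanip>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>
//
// Returns the contents of the named file as hex bytes, for example "61 01".
// Returns an empty string if the file cannot be opened or is empty.
std::string hex_dump(const std::string& name) {
    std::ifstream in(name, std::ifstream::binary);
    // Read every byte of the file, without any whitespace skipping.
    std::vector<char> bytes((std::istreambuf_iterator<char>(in)),
                            std::istreambuf_iterator<char>());
    std::ostringstream out;
    out << std::hex << std::setfill('0');
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        if (i != 0) out << ' ';
        // Go through unsigned char so that bytes above 7f are not sign-extended.
        out << std::setw(2)
            << static_cast<unsigned>(static_cast<unsigned char>(bytes[i]));
    }
    return out.str();
}
```
In a dump of the file written by the example, the letter 'a' shows up as 61, and any
padding bytes appear between it and the first int.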

10:43.950 --> 10:46.230
If we try it again with these pragmas...

10:49.300 --> 10:50.920
So then we get 9 bytes.

10:52.720 --> 10:57.310
So in this case, the ints follow directly after the character. There are not any padding bytes between

10:57.430 --> 10:59.740
the character and the first int.

11:01.030 --> 11:04.090
And if we look at the file we see - there it is.

11:04.090 --> 11:08.860
There's the letter 'a'. And the value 1 follows straight after it.

11:08.890 --> 11:10.690
So this is actually at offset 1.

11:11.860 --> 11:13.180
There is no padding here.

11:16.800 --> 11:19.080
OK, so that is it for this video.

11:19.500 --> 11:20.510
I will see you next time.

11:20.760 --> 11:22.650
Meanwhile, keep coding!