File I/O

Introduction

We’ll start our discussion of the UNIX System by describing the functions available for ﬁle I/O—open a ﬁle, read a ﬁle, write a ﬁle, and so on. Most ﬁle I/O on a UNIX system can be performed using only ﬁve functions: open, read, write, lseek, and close.We then examine the effect of various buffer sizes on the read and write functions.

The functions described in this chapter are often referred to as unbuffered I/O, in contrast to the standard I/O routines, which we describe in Chapter 5. The term unbuffered means that each read or write invokes a system call in the kernel. These unbuffered I/O functions are not part of ISO C, but are part of POSIX.1 and the Single UNIX Speciﬁcation.

Whenever we describe the sharing of resources among multiple processes, the concept of an atomic operation becomes important. We examine this concept with regard to ﬁle I/O and the arguments to the open function. This leads to a discussion of how ﬁles are shared among multiple processes and which kernel data structures are involved. After describing these features, we describe the dup, fcntl, sync, fsync, and ioctl functions.

File Descriptors

To the kernel, all open ﬁles are referred to by ﬁle descriptors. A ﬁle descriptor is a non-negative integer. When we open an existing ﬁle or create a new ﬁle, the kernel returns a ﬁle descriptor to the process. When we want to read or write a ﬁle, we identify the ﬁle with the ﬁle descriptor that was returned by open or creat as an argument to either read or write.

By convention, UNIX System shells associate ﬁle descriptor 0 with the standard input of a process, ﬁle descriptor 1 with the standard output, and ﬁle descriptor 2 with the standard error. This convention is used by the shells and many applications; it is not a feature of the UNIX kernel. Nevertheless, many applications would break if these associations weren’t followed.

Although their values are standardized by POSIX.1, the magic numbers 0, 1, and 2 should be replaced in POSIX-compliant applications with the symbolic constants STDIN_FILENO, STDOUT_FILENO, and STDERR_FILENO to improve readability.These constants are deﬁned in the <unistd.h> header.

File descriptors range from 0 through OPEN_MAX−1. (Recall Figure 2.11.) Early historical implementations of the UNIX System had an upper limit of 19, allowing a maximum of 20 open ﬁles per process, but many systems subsequently increased this limit to 63.

With FreeBSD 8.0, Linux 3.2.0, Mac OS X 10.6.8, and Solaris 10, the limit is essentially inﬁnite, bounded by the amount of memory on the system, the size of an integer, and any hard and soft limits conﬁgured by the system administrator.

open and openat Functions

A ﬁle is opened or created by calling either the open function or the openat function.

#include <fcntl.h>

int open(const char *path,int oﬂag,... /* mode_t mode */ );

int openat(int fd,const char *path,int oﬂag,... /* mode_t mode */ );

Both return: ﬁle descriptor if OK, −1 on error

We show the last argument as ..., which is the ISO C way to specify that the number and types of the remaining arguments may vary. For these functions, the last argument is used only when a new ﬁle is being created, as we describe later. We show this argument as a comment in the prototype.

The path parameter is the name of the ﬁle to open or create. This function has a multitude of options, which are speciﬁed by the oﬂag argument. This argument is formed by ORing together one or more of the following constants from the <fcntl.h> header:

O_RDONLY	Open for reading only.
O_WRONLY	Open for writing only.
O_RDWR	Open for reading and writing.
Most implementations deﬁne O_RDONLY as 0, O_WRONLY as 1, and O_RDWR as 2, for compatibility with older programs.
O_EXEC	Open for execute only.
O_SEARCH	Open for search only (applies to directories).
The purpose of the O_SEARCH constant is to evaluate search permissions at the time a directory is opened. Further operations using the directory’s ﬁle descriptor will not reevaluate permission to search the directory. None of the versions of the operating systems covered in this book support O_SEARCH yet.

One and only one of the previous ﬁve constants must be speciﬁed. The following constants are optional:

O_APPEND	Append to the end of ﬁle on each write. We describe this option in detail in Section 3.11.
O_CLOEXEC	Set the FD_CLOEXEC ﬁle descriptor ﬂag. We discuss ﬁle descriptor ﬂags in Section 3.14.
O_CREAT	Create the ﬁle if it doesn’t exist. This option requires a third argument to the open function (a fourth argument to the openat function) — the mode, which speciﬁes the access permission bits of the new ﬁle. (When we describe a ﬁle’s access permission bits in Section 4.5, we’ll see how to specify the mode and how it can be modiﬁed by the umask value of a process.)
O_DIRECTORY	Generate an error if path doesn’t refer to a directory.
O_EXCL	Generate an error if O_CREAT is also speciﬁed and the ﬁle already exists. This test for whether the ﬁle already exists and the creation of the ﬁle if it doesn’t exist is an atomic operation. We describe atomic operations in more detail in Section 3.11.
O_NOCTTY	If path refers to a terminal device, do not allocate the device as the controlling terminal for this process. We talk about controlling terminals in Section 9.6.
O_NOFOLLOW	Generate an error if path refers to a symbolic link. We discuss symbolic links in Section 4.17.
O_NONBLOCK	If path refers to a FIFO, a block special ﬁle, or a character special ﬁle, this option sets the nonblocking mode for both the opening of the ﬁle and subsequent I/O. We describe this mode in Section 14.2.
In earlier releases of System V, the O_NDELAY (no delay) ﬂag was introduced. This option is similar to the O_NONBLOCK (nonblocking) option, but an ambiguity was introduced in the return value from a read operation. The no-delay option causes a read operation to return 0 if there is no data to be read from a pipe, FIFO, or device, but this conﬂicts with a return value of 0, indicating an end of ﬁle. SVR4-based systems still support the no-delay option, with the old semantics, but new applications should use the nonblocking option instead.
O_SYNC	Have each write wait for physical I/O to complete, including I/O necessary to update ﬁle attributes modiﬁed as a result of the write. We use this option in Section 3.14.
O_TRUNC	If the ﬁle exists and if it is successfully opened for either write-only or read–write, truncate its length to 0.
O_TTY_INIT	When opening a terminal device that is not already open, set the nonstandard termios parameters to values that result in behavior that conforms to the Single UNIX Speciﬁcation. We discuss the termios structure when we discuss terminal I/O in Chapter 18.

The following two ﬂags are also optional. They are part of the synchronized input and output option of the Single UNIX Speciﬁcation (and thus POSIX.1).

O_DSYNC	Have each write wait for physical I/O to complete, but don’t wait for ﬁle attributes to be updated if they don’t affect the ability to read the data just written.
The O_DSYNC and O_SYNC ﬂags are similar, but subtly different. The O_DSYNC ﬂag affects a ﬁle’s attributes only when they need to be updated to reﬂect a change in the ﬁle’s data (for example, update the ﬁle’s size to reﬂect more data). With the O_SYNC ﬂag, data and attributes are always updated synchronously. When overwriting an existing part of a ﬁle opened with the O_DSYNC ﬂag, the ﬁle times wouldn’t be updated synchronously. In contrast, if we had opened the ﬁle with the O_SYNC ﬂag, every write to the ﬁle would update the ﬁle’s times before the write returns, regardless of whether we were writing over existing bytes or appending to the ﬁle.
O_RSYNC	Have each read operation on the ﬁle descriptor wait until any pending writes for the same portion of the ﬁle are complete.
Solaris 10 supports all three synchronization ﬂags. Historically, FreeBSD (and thus Mac OS X) have used the O_FSYNC ﬂag, which has the same behavior as O_SYNC. Because the two ﬂags are equivalent, they deﬁne the ﬂags to have the same value. FreeBSD 8.0 doesn’t support the O_DSYNC or O_RSYNC ﬂags. Mac OS X doesn’t support the O_RSYNC ﬂag, but deﬁnes the O_DSYNC ﬂag, treating it the same as the O_SYNC ﬂag. Linux 3.2.0 supports the O_DSYNC ﬂag, but treats the O_RSYNC ﬂag the same as O_SYNC.

The ﬁle descriptor returned by open and openat is guaranteed to be the lowest numbered unused descriptor. This fact is used by some applications to open a new ﬁle on standard input, standard output, or standard error. For example, an application might close standard output — normally, ﬁle descriptor 1—and then open another ﬁle, knowing that it will be opened on ﬁle descriptor 1. We’ll see a better way to guarantee that a ﬁle is open on a given descriptor in Section 3.12, when we explore the dup2 function.

The fd parameter distinguishes the openat function from the open function. There are three possibilities:

1) The path parameter speciﬁes an absolute pathname. In this case, the fd parameter is ignored and the openat function behaves like the open function.

2) The path parameter speciﬁes a relative pathname and the fd parameter is a ﬁle descriptor that speciﬁes the starting location in the ﬁle system where the relative pathname is to be evaluated. The fd parameter is obtained by opening the directory wherethe relative pathname is to be evaluated.

3) The path parameter speciﬁes a relative pathname and the fd parameter has the special value AT_FDCWD. In this case, the pathname is evaluated starting in the current working directory and the openat function behaves like the open function.

The openat function is one of a class of functions added to the latest version of POSIX.1 to address two problems. First, it gives threads a way to use relative pathnames to open ﬁles in directories other than the current working directory. As we’ll see in Chapter 11, all threads in the same process share the same current working directory, so this makes it difﬁcult for multiple threads in the same process to work in different directories at the same time. Second, it provides a way to avoid time-of-checkto-time-of-use (TOCTTOU) errors.

The basic idea behind TOCTTOU errors is that a program is vulnerable if it makes two ﬁle-based function calls where the second call depends on the results of the ﬁrst call. Because the two calls are not atomic, the ﬁle can change between the two calls, thereby invalidating the results of the ﬁrst call, leading to a program error. TOCTTOU errors in the ﬁle system namespace generally deal with attempts to subvert ﬁle system permissions by tricking a privileged program into either reducing permissions on a privileged ﬁle or modifying a privileged ﬁle to open up a security hole. Wei and Pu [2005] discuss TOCTTOU weaknesses in the UNIX ﬁle system interface.

Filename and Pathname Truncation

What happens if NAME_MAX is 14 and we try to create a new ﬁle in the current directory with a ﬁlename containing 15 characters? Traditionally, early releases of System V, such as SVR2, allowed this to happen, silently truncating the ﬁlename beyond the 14th character. BSD-derived systems, in contrast, returned an error status, with errno set to ENAMETOOLONG. Silently truncating the ﬁlename presents a problem that affects more than simply the creation of new ﬁles. If NAME_MAX is 14 and a ﬁle exists whose name is exactly 14 characters, any function that accepts a pathname argument, such as open or stat, has no way to determine what the original name of the ﬁle was, as the original name might have been truncated.

With POSIX.1, the constant _POSIX_NO_TRUNC determines whether long ﬁlenames and long components of pathnames are truncated or an error is returned. As we saw in Chapter 2, this value can vary based on the type of the ﬁle system, and we can use fpathconf or pathconf to query a directory to see which behavior is supported.

Whether an error is returned is largely historical. For example, SVR4-based systems do not generate an error for the traditional System V ﬁle system, S5. For the BSD-style ﬁle system (known as UFS), however, SVR4-based systems do generate an error. Figure 2.20 illustrates another example: Solaris will return an error for UFS, but not for PCFS, the DOS-compatible ﬁle system, as DOS silently truncates ﬁlenames that don’t ﬁt in an 8.3 format. BSD-derived systems and Linux always return an error.

If _POSIX_NO_TRUNC is in effect, errno is set to ENAMETOOLONG, and an error status is returned if any ﬁlename component of the pathname exceeds NAME_MAX.

Most modern ﬁle systems support a maximum of 255 characters for ﬁlenames. Because ﬁlenames are usually shorter than this limit, this constraint tends to not present problems for most applications.

creat Function

A new ﬁle can also be created by calling the creat function.

#include <fcntl.h>

int creat(const char *path,mode_t mode);

Returns: ﬁle descriptor opened for write-only if OK, −1 on error

Note that this function is equivalent to

open(path,O_WRONLY | O_CREAT | O_TRUNC, mode);

Historically, in early versions of the UNIX System, the second argument to open could be only 0, 1, or 2. There was no way to open a ﬁle that didn’t already exist. Therefore, a separate system call, creat, was needed to create new ﬁles. With the O_CREAT and O_TRUNC options now provided by open,a separate creat function is no longer needed.

We’ll show how to specify mode in Section 4.5 when we describe a ﬁle’s access permissions in detail.

One deﬁciency with creat is that the ﬁle is opened only for writing. Before the new version of open was provided, if we were creating a temporary ﬁle that we wanted to write and then read back, we had to call creat, close, and then open. A better way is to use the open function, as in

open(path,O_RDWR | O_CREAT | O_TRUNC, mode);

close Function

An open ﬁle is closed by calling the close function.

#include <unistd.h>

int close(int fd);

Returns: 0 if OK, −1 on error

Closing a ﬁle also releases any record locks that the process may have on the ﬁle. We’ll discuss this point further in Section 14.3.

When a process terminates, all of its open ﬁles are closed automatically by the kernel. Many programs take advantage of this fact and don’t explicitly close open ﬁles. See the program in Figure1.4, for example.

重定向和管道的命令行简介

重定向

假设您想要一张 images 目录中所有以 .png 结尾的文件列表

$ ls images/*.png 1>file_list

这表示把该命令的标准输出(1)重定向到(>)file_list 文件。其中的 > 操作符是输出重定向符。如果要重定向到的文件不存在，它将被创建；如果它已经存在，那么它先前的内容将被覆盖。

该操作符默认的描述符就是标准输出，因此就不用在命令行上特意指出。所以，上述命令可以简化为

$ ls images/*.png >file_list

其结果是一样的。然后您就可以用某个文本文件查看器(比如 less)来查看。

现在，假定您想要知道这样的文件有多少

wc -l 0<file_list

其中的 < 操作符是输入重定向符，并且其默认重定向描述符是标准输入(即 0)。因此您只需

wc -l <file_list

假定您又想去掉其中所有文件的“扩展名”，并将结果保存到另一个文件。您只要将 sed 的标准输入重定向为 file_list，并将其输出重定向到结果文件 the_list

sed -e 's/.png$//g' <file_list >the_list

重定向标准错误输出也很有用。例如：您会想要知道在 /shared 中有哪些目录您不能够访问。一个办法是递归地列出该目录并重定向错误输出到某个文件，并且不要显示标准输出：

ls -R /shared >/dev/null 2>errors

这表示标准输出将被重定向到(>)/dev/null，并将标准错误输出(2)重定向到(>)errors 文件。

管道

管道在某种程度上是标准输入和标准输出重定向的结合。其原理同物理管道类似：一个进程向管道的一端发送数据，而另一个进程从该管道的另一端读取数据。如Figure 3.0，通过管道之后cmd1,cmd2的标准输出（standard output）不会显示在屏幕上面。

管道符是 |。

Figure 3.0 管道

让我们再来看看上述文件列表的例子。假设您想直接找出有多少对应的文件，而不想先将它们保存到一个临时文件，您可以

ls images/*.png | wc -l

这表示将 ls 命令的标准输出(即文件列表)重定向到 wc 命令的输入。这样您就直接得到了想要的结果。

注意：
1）管道命令只处理前一个命令正确输出（standard output），不处理错误输出（standard error）

2）管道命令右边命令，必须能够接收标准输入流（standard input）命令才行

您也可以使用下述命令得到“除去扩展名”的文件列表

ls images/*.png | sed -e 's/.png$//g' >the_list

或者，如果您想要直接查看结果而不想保存到某个文件：

ls images/*.png | sed -e 's/.png$//g' | less

lseek Function

Every open ﬁle has an associated ‘‘current ﬁle offset,’’ normally a non-negative integer that measures the number of bytes from the beginning of the ﬁle. (We describe some exceptions to the ‘‘non-negative’’ qualiﬁer later in this section.) Read and write operations normally start at the current ﬁle offset and cause the offset to be incremented by the number of bytes read or written. By default, this offset is initialized to 0 when a ﬁle is opened, unless the O_APPEND option is speciﬁed.

An open ﬁle’s offset can be set explicitly by calling lseek.

#include <unistd.h>

off_t lseek(int fd,off_t offset,int whence);

Returns: new ﬁle offset if OK, −1 on error

The interpretation of the offset depends on the value of the whence argument.

If whence is SEEK_SET, the ﬁle’s offset is set to offset bytes from the beginning of the ﬁle.
If whence is SEEK_CUR, the ﬁle’s offset is set to its current value plus the offset. The offset can be positive or negative.
If whence is SEEK_END, the ﬁle’s offset is set to the size of the ﬁle plus the offset. The offset can be positive or negative.

Because a successful call to lseek returns the new ﬁle offset, we can seek zero bytes from the current position to determine the current offset:

off_t currpos;
currpos = lseek(fd, 0, SEEK_CUR);

This technique can also be used to determine if a ﬁle is capable of seeking. If the ﬁle descriptor refers to a pipe, FIFO, or socket, lseek sets errno to ESPIPE and returns −1.

下列是较特别的使用方式:
欲将读写位置移到文件开头时: lseek(fd, 0, SEEK_SET);
欲将读写位置移到文件尾时: lseek(fd, 0, SEEK_END);
想要取得目前文件位置时: lseek(fd, 0, SEEK_CUR);

The three symbolic constants—SEEK_SET, SEEK_CUR, and SEEK_END—were introduced with System V. Prior to this, whence was speciﬁed as 0 (absolute), 1 (relative to the current offset), or 2 (relative to the end of ﬁle). Much software still exists with these numbers hard coded.
The character l in the name lseek means ‘‘long integer.’’ Before the introduction of the off_t data type, the offset argument and the return value were long integers. lseek was introduced with Version 7 when long integers were added to C. (Similar functionality was provided in Version 6 by the functions seek and tell.)

Example

The program in Figure3.1 tests its standard input to see whether it is capable of seeking.

/**
* 文件名: fileio/seek.c
* 内容：用于测试对其标准输入能否设置偏移量
* 时间: 2016年 08月 23日星期二 16:03:00 CST
* 作者：firewaywei
*
*/
#include"apue.h"
int
main(void)
{
if(lseek(STDIN_FILENO,0, SEEK_CUR)==-1)
{
printf("cannot seek ");
}
else
{
printf("seek OK ");
}
exit(0);
}

Figure 3.1 Test whether standard input is capable of seeking

If we invoke this program interactively, we get

$ ./a.out < /etc/passwd
seek OK
$ cat < /etc/passwd | ./a.out
cannot seek
$ ./a.out < /var/spool/cron/FIFO
cannot seek

Normally, a ﬁle’s current offset must be a non-negative integer. It is possible, however, that certain devices could allow negative offsets. But for regular ﬁles, the offset must be non-negative. Because negative offsets are possible, we should be careful to compare the return value from lseek as being equal to or not equal to −1, rather than testing whether it is less than 0.

The /dev/kmem device on FreeBSD for the Intel x86 processor supports negative offsets. Because the offset (off_t) is a signed data type (Figure 2.21), we lose a factor of 2 in the maximum ﬁle size. If off_t is a 32-bit integer,the maximum ﬁle size is 2^³¹−1bytes.

lseek only records the current ﬁle offset within the kernel—it does not cause any I/O to take place. This offset is then used by the next read or write operation.

The ﬁle’s offset can be greater than the ﬁle’s current size, in which case the next write to the ﬁle will extend the ﬁle. This is referred to as creating a hole in a ﬁle and is allowed. Any bytes in a ﬁle that have not been written are read back as 0.

A hole in a ﬁle isn’t required to have storage backing it on disk. Depending on the ﬁle system implementation, when you write after seeking past the end of a ﬁle, new disk blocks might be allocated to store the data, but there is no need to allocate disk blocks for the data between the old end of ﬁle and the location where you start writing.

Example

The program shown in Figure 3.2 creates a file with a hole in it.

/**
* 文件名: fileio/hole.c
* 内容：用于创建一个具有空洞的文件。
* 时间: 2016年 08月 23日星期二 16:03:00 CST
* 作者：firewaywei
*/
#include"apue.h"
#include<fcntl.h>
char buf1[]="abcdefghij";
char buf2[]="ABCDEFGHIJ";
int
main(void)
{
int fd;
if((fd = creat("file.hole", FILE_MODE))<0)
{
err_sys("creat error");
}
/* offset now = 10 */
if(write(fd, buf1,10)!=10)
{
err_sys("buf1 write error");
}
/* offset now = 16384 */
if(lseek(fd,16384, SEEK_SET)==-1)
{
err_sys("lseek error");
}
/* offset now = 16394 */
if(write(fd, buf2,10)!=10)
{
err_sys("buf2 write error");
}
exit(0);
}

Figure 3.2 Create a file with a hole in it

Running this program gives us

$ ./hole
$ ll file.hole
-rw-r--r-- 1 fireway fireway 16394 8月 23 16:18 file.hole
fireway:~/study/apue.3e/fileio$ od -c file.hole
0000000 a b c d e f g h i j
0000020
*
0040000 A B C D E F G H I J
0040012
$

We use the od(1) command to look at the contents of the ﬁle. The -c ﬂag tells it to print the contents as characters. We can see that the unwritten bytes in the middle are read back as zero. The seven-digit number at the beginning of each line is the byte offset in octal.

To prove that there is really a hole in the ﬁle, let’s compare the ﬁle we just created with a ﬁle of the same size, but without holes:

$ ls -ls file.hole file.nohole // 比较长度
8 -rw-r--r-- 1 fireway fireway 16394 8月 23 16:18 file.hole
20 -rw-r--r-- 1 fireway fireway 16394 8月 23 16:37 file.nohole

Although both ﬁles are the same size, the ﬁle without holes consumes 20 disk blocks, whereas the ﬁle with holes consumes only 8 blocks.

In this example, we call the write function (Section 3.8). We’ll have more to say about ﬁles with holes in Section 4.12.

Because the offset address that lseek uses is represented by an off_t, implementations are allowed to support whatever size is appropriate on their particular platform. Most platforms today provide two sets of interfaces to manipulate ﬁle offsets: one set that uses 32-bit ﬁle offsets and another set that uses 64-bit ﬁle offsets.

The Single UNIX Speciﬁcation provides a way for applications to determine which environments are supported through the sysconf function (Section 2.5.4). Figure 3.3 summarizes the sysconf constants that are deﬁned.

Name of option	Description	name argument
_POSIX_V7_ILP32_OFF32	int, long,pointer,and off_t types are 32 bits.	_SC_V7_ILP32_OFF32
_POSIX_V7_ILP32_OFFBIG	int, long,and pointer types are 32 bits; off_t types are at least 64 bits.	_SC_V7_ILP32_OFFBIG
_POSIX_V7_LP64_OFF64	int types are 32 bits; long,pointer, and off_t types are 64 bits.	_SC_V7_LP64_OFF64
_POSIX_V7_LP64_OFFBIG	int types are at least 32 bits; long, pointer,and *off_t* types areat least 64 bits.	_SC_V7_LP64_OFFBIG

Figure 3.3 Data size options and name arguments to sysconf

The c99 compiler requires that we use the getconf(1) command to map the desired data size model to the ﬂags necessary to compile and link our programs. Different ﬂags and libraries might be needed, depending on the environments supported by each platform.

Unfortunately, this is one area in which implementations haven’t caught up to the standards. If your system does not match the latest version of the standard, the system might support the option names from the previous version of the Single UNIX Speciﬁcation: _POSIX_V6_ILP32_OFF32, _POSIX_V6_ILP32_OFFBIG, _POSIX_V6_LP64_OFF64, and _POSIX_V6_LP64_OFFBIG.
To get around this, applications can set the _FILE_OFFSET_BITS constant to 64 to enable 64-bit offsets. Doing so changes the deﬁnition of off_t to be a 64-bit signed integer. Setting _FILE_OFFSET_BITS to 32 enables 32-bit ﬁle offsets. Be aware, however, that although all four platforms discussed in this text support both 32-bit and 64-bit ﬁle offsets, setting _FILE_OFFSET_BITS is not guaranteed to be portable and might not have the desired effect.
Figure 3.4 summarizes the size in bytes of the off_t data type for the platforms covered in this book when an application doesn’t deﬁne _FILE_OFFSET_BITS, as well as the size when an application deﬁnes _FILE_OFFSET_BITS to have a value of either 32 or 64.

Operating system	CPU architecture	_FILE_OFFSET_BITS value
Operating system	CPU architecture	Undeﬁned	32	64
FreeBSD 8.0	x86 32-bit	8	8	8
Linux 3.2.0	x86 64-bit	8	8	8
Mac OS X 10.6.8	x86 64-bit	8	8	8
Solaris 10	SPARC 64-bit	8	4	8

Figure 3.4 Size in bytes of off_t for different platforms

Note that even though you might enable 64-bit ﬁle offsets, your ability to create a ﬁle larger than 2 GB (2^31−1bytes) depends on the underlying ﬁle system type.

read Function

Data is read from an open ﬁle with the read function.

#include <unistd.h>

ssize_t read(int fd,void *buf,size_t nbytes);

Returns: number of bytes read, 0 if end of ﬁle, −1 on error

If the read is successful, the number of bytes read is returned. If the end of ﬁle is encountered, 0 is returned.

There are several cases in which the number of bytes actually read is less than the amount requested:

When reading from a regular ﬁle, if the end of ﬁle is reached before the requested number of bytes has been read. For example, if 30 bytes remain until the end of ﬁle and we try to read 100 bytes, read returns 30. The next time we call read, it will return 0 (end of ﬁle).
When reading from a terminal device. Normally, up to one line is read at a time. (We’ll see how to change this default in Chapter 18.)
When reading from a network. Buffering within the network may cause less than the requested amount to be returned.
When reading from a pipe or FIFO. If the pipe contains fewer bytes than requested, read will return only what is available.
When reading from a record-oriented device. Some record-oriented devices, such as magnetic tape, can return up to a single record at a time.
When interrupted by a signal and a partial amount of data has already been read. We discuss this further in Section 10.5.

The read operation starts at the ﬁle’s current offset. Before a successful return, the offset is incremented by the number of bytes actually read.

POSIX.1 changed the prototype for this function in several ways. The classic deﬁnition is

int read(int fd,char *buf,unsigned nbytes);

First, the second argument was changed from char * to void * to be consistent with ISO C: the type void * is used for generic pointers.
Next, the return value was required to be a signed integer (ssize_t) to return a positive byte count, 0 (for end of ﬁle), or −1(for an error).
Finally, the third argument historically has been an unsigned integer, to allow a 16-bit implementation to read or write up to 65,534 bytes at a time. With the 1990 POSIX.1 standard, the primitive system data type ssize_t was introduced to provide the signed return value, and the unsigned size_t was used for the third argument. (Recall the SSIZE_MAX constant from Section 2.5.2.)

write Function

Data is written to an open ﬁle with the write function.

#include <unistd.h>

ssize_t write(int fd,const void *buf,size_t nbytes);

Returns: number of bytes written if OK, −1 on error

The return value is usually equal to the nbytes argument; otherwise, an error has occurred. A common cause for a write error is either ﬁlling up a disk or exceeding the ﬁle size limit for a given process (Section 7.11and Exercise 10.11).

For a regular ﬁle, the write operation starts at the ﬁle’s current offset. If the O_APPEND option was speciﬁed when the ﬁle was opened, the ﬁle’s offset is set to the current end of ﬁle before each write operation. After a successful write, the ﬁle’s offset is incremented by the number of bytes actually written.

I/O Efﬁciency

The program in Figure3.5 copies a ﬁle, using only the read and write functions.

/**
* 文件名: fileio/mycat.c
* 内容：只使用read和write函数复制一个文件.
* 时间: 2016年 08月 25日星期四 11:00:18 CST
* 作者：firewaywei
* 执行命令：
* ./a.out < infile > outfile
*/
#include"apue.h"
#define BUFFSIZE 4096
int
main(void)
{
int n;
char buf[BUFFSIZE];
while((n = read(STDIN_FILENO, buf, BUFFSIZE))>0)
{
if(write(STDOUT_FILENO, buf, n)!= n)
{
err_sys("write error");
}
}
if(n <0)
{
err_sys("read error");
}
exit(0);
}

Figure 3.5 Copy standardinput to standardoutput

The following caveats apply to this program.

It reads from standard input and writes to standard output, assuming that these have been set up by the shell before this program is executed. Indeed, all normal UNIX system shells provide a way to open a ﬁle for reading on standard input and to create (or rewrite) a ﬁle on standard output. This prevents the program from having to open the input and output ﬁles, and allows the user to take advantage of the shell’s I/O redirection facilities.
The program doesn’t close the input ﬁle or output ﬁle. Instead, the program uses the feature of the UNIX kernel that closes all open ﬁle descriptors in a process when that process terminates.
This example works for both text ﬁles and binary ﬁles, since there is no difference between the two to the UNIX kernel.

One question we haven’t answered, however, is how we chose the BUFFSIZE value. Before answering that, let’s run the program using different values for BUFFSIZE. Figure 3.6 shows the results for reading a 516,581,760-byte ﬁle, using 20 different buffer sizes.

BUFFSIZE	User CPU(seconds)	System CPU(seconds)	Clock time(seconds)	Number of loops
1	20.03	117.50	138.73	516,581,760
2	9.69	58.76	68.60	258,290,880
4	4.60	36.47	41.27	129,145,440
8	2.47	15.44	18.38	64,572,720
16	1.07	7.93	9.38	32,286,360
32	0.56	4.51	8.82	16,143,180
64	0.34	2.72	8.66	8,071,590
128	0.34	1.84	8.69	4,035,795
256	0.15	1.30	8.69	2,017,898
512	0.09	0.95	8.63	1,008,949
1,024	0.02	0.78	8.58	504,475
2,048	0.04	0.66	8.68	252,238
4,096	0.03	0.58	8.62	126,119
8,192	0.00	0.54	8.52	63,060
16,384	0.01	0.56	8.69	31,530
32,768	0.00	0.56	8.51	15,765
65,536	0.01	0.56	9.12	7,883
131,072	0.00	0.58	9.08	3,942
262,144	0.00	0.60	8.70	1,971
524,288	0.01	0.58	8.58	986

Figure 3.6 Timing results for reading with different buffer sizes on Linux

The file was read using the program shown in Figure 3.5, with standard output redirected to /dev/null. The file system used for this test was the Linux ext4 file system with 4,096-byte blocks. (The st_blksize value, which we describe in Section 4.12, is 4,096.) This accounts for the minimum in the system time occurring at the few timing measurements starting around a BUFFSIZE of 4,096. Increasing the buffer size beyond this limit has little positive effect.

/dev/null设备文件只有一个作用，往它里面写任何数据都被直接丢弃。因此保证了该命令执行时屏幕上没有任何输出，既不打印正常信息也不打印错误信息，让命令安静地执行，这种写法在Shell脚本中很常见。

Most file systems support some kind of read-ahead to improve performance. When sequential reads are detected, the system tries to read in more data than an application requests, assuming that the application will read it shortly. The effect of read-ahead can be seen in Figure 3.6, where the elapsed time for buffer sizes as small as 32 bytes is as good as the elapsed time for larger buffer sizes.

We’ll return to this timing example later in the text. In Section 3.14, we show the effect of synchronous writes; in Section 5.8, we compare these unbuffered I/O times with the standard I/O library.

Beware when trying to measure the performance of programs that read and write files. The operating system will try to cache the file incore, so if you measure the performance of the program repeatedly, the successive timings will likely be better than the first. This improvement occurs because the first run causes the file to be entered into the system’s cache, and successive runs access the file from the system’s cache instead of from the disk. (The term incore means in main memory. Back in the day, a computer ’s main memory was built out of ferrite core. This is where the phrase ‘‘core dump’’ comes from: the main memory image of a program stored in a file on disk for diagnosis.)
In the tests reported in Figure 3.6, each run with a different buffer size was made using a different copy of the file so that the current run didn’t find the data in the cache from the previous run. The files are large enough that they all don’t remain in the cache (the test system was configured with 6 GB of RAM).

File Sharing

The UNIX System supports the sharing of open files among different processes. Before describing the dup function, we need to describe this sharing. To do this, we’ll examine the data structures used by the kernel for all I/O.

The following description is conceptual; it may or may not match a particular implementation. Refer to Bach [1986] for a discussion of these structures in System V. McKusick et al. [1996] describe these structures in 4.4BSD. McKusick and Neville-Neil [2005] cover FreeBSD 5.2. For a similar discussion of Solaris, see McDougall and Mauro [2007]. The Linux 2.6 kernel architecture is discussed in Bovet and Cesati [2006].

The kernel uses three data structures to represent an open file, and the relationships among them determine the effect one process has on another with regard to file sharing.

1) Every process has an entry in the process table. Within each process table entry is a table of open file descriptors, which we can think of as a vector, with one entry per descriptor. Associated with each file descriptor are

The file descriptor flags (close-on-exec; refer to Figure 3.7 and Section 3.14)
A pointer to a file table entry

A table containing all of the information that must be saved when the CPU switches from running one process to another in a multitasking system.
The information in the process table allows the suspended process to be restarted at a later time as if it had never been stopped. Every process has an entry in the table. These entries are known as process control blocks and contain the following information:
1) process state - information needed so that the process can be loaded into memory and run, such as the program counter, the stack pointer, and the values of registers.
2) memory state - details of the memory allocation such as pointers to the various memory areas used by the program
3) resource state - information regarding the status of files being used by the process such as user ID.
A process that has died but still has an entry in the process table is called a zombie process.

Figure 3.6.1 process table

2) The kernel maintains a file table for all open files. Each file table entry contains

The file status flags for the file, such as read, write, append, sync, and nonblocking; more on these in Section 3.14
The current file offset
A pointer to the v-node table entry for the file

3) Each open file (or device) has a v-node structure that contains information about the type of file and pointers to functions that operate on the file. For most files, the v-node also contains the i-node for the file. This information is read from disk when the file is opened, so that all the pertinent information about the file is available. For example, the i-node contains the owner of the file, the size of the file, pointers to where the actual data blocks for the file are located on disk, and so on. (We talk more about i-nodes in Section 4.14 when we describe the typical UNIX file system in more detail.)

Linux has no v-node. Instead, a generic i-node structure is used. Although the implementations differ, the v-node is conceptually the same as a generic i-node. Both point to an i-node structure specific to the file system.

We’re ignoring some implementation details that don’t affect our discussion. For example, the table of open file descriptors can be stored in the user area (a separate per-process structure that can be paged out) instead of the process table. Also, these tables can be implemented in numerous ways—they need not be arrays; one alternate implementation is a linked lists of structures. Regardless of the implementation details, the general concepts remain the same.

Figure 3.7 shows a pictorial arrangement of these three tables for a single process that has two different files open: one file is open on standard input (file descriptor 0), and the other is open on standard output (file descriptor 1).

Figure 3.7 Kernel data structures for open files

The arrangement of these three tables has existed since the early versions of the UNIX System [Thompson 1978]. This arrangement is critical to the way files are shared among processes. We’ll return to this figure in later chapters, when we describe additional ways that files are shared.

The v-node was invented to provide support for multiple file system types on a single computer system. This work was done independently by Peter Weinberger (Bell Laboratories) and Bill Joy (Sun Microsystems). Sun called this the Virtual File System and called the file system–independent portion of the i-node the v-node [Kleiman 1986]. The v-node propagated through various vendor implementations as support for Sun’s Network File System (NFS) was added. The first release from Berkeley to provide v-nodes was the 4.3BSD Reno release, when NFS was added.
In SVR4, the v-node replaced the file system–independent i-node of SVR3. Solaris is derived from SVR4 and, therefore, uses v-nodes.
Instead of splitting the data structures into a v-node and an i-node, Linux uses a file system–independent i-node and a file system–dependent i-node.

If two independent processes have the same file open, we could have the arrangement shown in Figure 3.8.

Figure 3.8 Two independent processes with the same file open

We assume here that the first process has the file open on descriptor 3 and that the second process has that same file open on descriptor 4. Each process that opens the file gets its own file table entry, but only a single v-node table entry is required for a given file. One reason each process gets its own file table entry is so that each process has its own current offset for the file.

Given these data structures, we now need to be more specific about what happens with certain operations that we’ve already described.

After each write is complete, the current file offset in the file table entry is incremented by the number of bytes written. If this causes the current file offset to exceed the current file size, the current file size in the i-node table entry is set to the current file offset (for example, the file is extended).
If a file is opened with the O_APPEND flag, a corresponding flag is set in the file status flags of the file table entry. Each time a write is performed for a file with this append flag set, the current file offset in the file table entry is first set to the current file size from the i-node table entry. This forces every write to be appended to the current end of file.
If a file is positioned to its current end of file using lseek, all that happens is the current file offset in the file table entry is set to the current file size from the i-node table entry. (Note that this is not the same as if the file was opened with the O_APPEND flag, as we will see in Section 3.11.)
The lseek function modifies only the current file offset in the file table entry. No I/O takes place.

It is possible for more than one file descriptor entry to point to the same file table entry, as we’ll see when we discuss the dup function in Section 3.12. This also happens after a fork when the parent and the child share the same file table entry for each open descriptor (Section 8.3).

Note the difference in scope between the file descriptor flags and the file status flags. The former apply only to a single descriptor in a single process, whereas the latter apply to all descriptors in any process that point to the given file table entry. When we describe the fcntl function in Section 3.14, we’ll see how to fetch and modify both the file descriptor flags and the file status flags.

Figure 3.8.1 process table and file table

Everything that we’ve described so far in this section works fine for multiple processes that are reading the same file. Each process has its own file table entry with its own current file offset. Unexpected results can arise, however, when multiple processes write to the same file. To see how to avoid some surprises, we need to understand the concept of atomic operations.

Atomic Operations

Appending to a File

Consider a single process that wants to append to the end of a file. Older versions of the UNIX System didn’t support the O_APPEND option to open, so the program was coded as follows:

if(lseek(fd,0L,2)<0)/* position to EOF */
{
err_sys("lseek error");
}
if(write(fd, buf,100)!=100)/* and write */
{
err_sys("write error");
}

This works fine for a single process, but problems arise if multiple processes use this technique to append to the same file. (This scenario can arise if multiple instances of the same program are appending messages to a log file, for example.)

Assume that two independent processes, A and B, are appending to the same file. Each has opened the file but without the O_APPEND flag. This gives us the same picture as Figure 3.8. Each process has its own file table entry, but they share a single v-node table entry. Assume that process A does the lseek and that this sets the current offset for the file for process A to byte offset 1,500 (the current end of file). Then the kernel switches processes, and B continues running. Process B then does the lseek, which sets the current offset for the file for process B to byte offset 1,500 also (the current end of file). Then B calls write, which increments B’s current file offset for the file to 1,600. Because the file’s size has been extended, the kernel also updates the current file size in the v-node to 1,600. Then the kernel switches processes and A resumes. When A calls write, the data is written starting at the current file offset for A, which is byte offset 1,500. This overwrites the data that B wrote to the file.

The problem here is that our logical operation of ‘‘position to the end of file and write’’ requires two separate function calls (as we’ve shown it). The solution is to have the positioning to the current end of file and the write be an atomic operation with regard to other processes. Any operation that requires more than one function call cannot be atomic, as there is always the possibility that the kernel might temporarily suspend the process between the two function calls (as we assumed previously).

The UNIX System provides an atomic way to do this operation if we set the O_APPEND flag when a file is opened. As we described in the previous section, this causes the kernel to position the file to its current end of file before each write. We no longer have to call lseek before each write.

pread and pwrite Functions

The Single UNIX Specification includes two functions that allow applications to seek and perform I/O atomically: pread and pwrite.

#include <unistd.h>

ssize_t pread(int fd, void *buf, size_t nbytes, off_t offset);
Returns: number of bytes read, 0 if end of file, −1 on error

ssize_t pwrite(int fd, const void *buf, size_t nbytes, off_t offset);
Returns: number of bytes written if OK, −1 on error

Calling pread is equivalent to calling lseek followed by a call to read, with the following exceptions.

There is no way to interrupt the two operations that occur when we call pread.
The current file offset is not updated.

Calling pwrite is equivalent to calling lseek followed by a call to write, with similar exceptions.

Creating a File

We saw another example of an atomic operation when we described the O_CREAT and O_EXCL options for the open function. When both of these options are specified, the open will fail if the file already exists. We also said that the check for the existence of the file and the creation of the file was performed as an atomic operation. If we didn’t have this atomic operation, we might try

if((fd = open(path, O_WRONLY))<0)
{
if(errno == ENOENT)
{
if((fd = creat(path, mode))<0)
{
err_sys("creat error");
}
}
else
{
err_sys("open error");
}
}

The problem occurs if the file is created by another process between the open and the creat. If the file is created by another process between these two function calls, and if that other process writes something to the file, that data is erased when this creat is executed. Combining the test for existence and the creation into a single atomic operation avoids this problem.

In general, the term atomic operation refers to an operation that might be composed of multiple steps. If the operation is performed atomically, either all the steps are performed (on success) or none are performed (on failure). It must not be possible for only a subset of the steps to be performed. We’ll return to the topic of atomic operations when we describe the link function (Section 4.15) and record locking (Section 14.3).

dup and dup2 Functions

An existing file descriptor is duplicated by either of the following functions:

#include <unistd.h>
int dup(int fd);
int dup2(int fd, int fd2);
Both return: new file descriptor if OK, −1 on error

The new file descriptor returned by dup is guaranteed to be the lowest-numbered available file descriptor. With dup2, we specify the value of the new descriptor with the fd2 argument. If fd2 is already open, it is first closed. If fd equals fd2, then dup2 returns fd2 without closing it. Otherwise, the FD_CLOEXEC file descriptor flag is cleared for fd2, so that fd2 is left open if the process calls exec.

The new file descriptor that is returned as the value of the functions shares the same file table entry as the fd argument. We show this in Figure 3.9.

Figure 3.9 Kernel data structures after dup(1)

In this figure, we assume that when it’s started, the process executes

newfd = dup(1);

We assume that the next available descriptor is 3 (which it probably is, since 0, 1, and 2 are opened by the shell). Because both descriptors point to the same file table entry, they share the same file status flags—read, write, append, and so on—and the same current file offset.

Each descriptor has its own set of file descriptor flags. As we describe in Section 3.14, the close-on-exec file descriptor flag for the new descriptor is always cleared by the dup functions.

Another way to duplicate a descriptor is with the fcntl function, which we describe in Section 3.14. Indeed, the call

dup(fd);

is equivalent to

fcntl(fd, F_DUPFD, 0);

Similarly, the call

dup2(fd, fd2);

is equivalent to

close(fd2);
fcntl(fd, F_DUPFD, fd2);

In this last case, the dup2 is not exactly the same as a close followed by an fcntl. The differences are as follows:

dup2 is an atomic operation, whereas the alternate form involves two function calls. It is possible in the latter case to have a signal catcher called between the close and the fcntl that could modify the file descriptors. (We describe signals in Chapter 10.) The same problem could occur if a different thread changes the file descriptors. (We describe threads in Chapter 11.)
There are some errno differences between dup2 and fcntl.

The dup2 system call originated with Version 7 and propagated through the BSD releases. The fcntl method for duplicating file descriptors appeared with System III and continued with System V. SVR3.2 picked up the dup2 function, and 4.2BSD picked up the fcntl function and the F_DUPFD functionality. POSIX.1 requires both dup2 and the F_DUPFD feature of fcntl.

sync, fsync, and fdatasync Functions

Traditional implementations of the UNIX System have a buffer cache or page cache in the kernel through which most disk I/O passes. When we write data to a file, the data is normally copied by the kernel into one of its buffers and queued for writing to disk at some later time. This is called delayed write. (Chapter 3 of Bach [1986] discusses this buffer cache in detail.)

The kernel eventually writes all the delayed-write blocks to disk, normally when it needs to reuse the buffer for some other disk block. To ensure consistency of the file system on disk with the contents of the buffer cache, the sync, fsync, and fdatasync functions are provided.

#include <unistd.h>

int fsync(int fd);

int fdatasync(int fd);

Returns: 0 if OK, −1 on error

void sync(void);

The sync function simply queues all the modified block buffers for writing and returns; it does not wait for the disk writes to take place.

The function sync is normally called periodically (usually every 30 seconds) from a system daemon, often called update. This guarantees regular flushing of the kernel’s block buffers. The command sync(1) also calls the sync function.

The function fsync refers only to a single file, specified by the file descriptor fd, and waits for the disk writes to complete before returning. This function is used when an application, such as a database, needs to be sure that the modified blocks have been written to the disk.

The fdatasync function is similar to fsync, but it affects only the data portions of a file. With fsync, the file’s attributes are also updated synchronously.

All four of the platforms described in this book support sync and fsync. However, FreeBSD 8.0 does not support fdatasync.

fcntl Function

The fcntl function can change the properties of a file that is already open.

#include <fcntl.h>
int fcntl(int fd, int cmd, ... /* int arg */ );
Returns: depends on cmd if OK (see following), −1 on error

In the examples in this section, the third argument is always an integer, corresponding to the comment in the function prototype just shown. When we describe record locking in Section 14.3, however, the third argument becomes a pointer to a structure.

The fcntl function is used for five different purposes.

Duplicate an existing descriptor (cmd = F_DUPFD or F_DUPFD_CLOEXEC)
Get/set file descriptor flags (cmd = F_GETFD or F_SETFD)
Get/set file status flags (cmd = F_GETFL or F_SETFL)
Get/set asynchronous I/O ownership (cmd = F_GETOWN or F_SETOWN)
Get/set record locks (cmd = F_GETLK, F_SETLK, or F_SETLKW)

We’ll now describe the first 8 of these 11 cmd values. (We’ll wait until Section 14.3 to describe the last 3, which deal with record locking.) Refer to Figure 3.7, as we’ll discuss both the file descriptor flags associated with each file descriptor in the process table entry and the file status flags associated with each file table entry.

F_DUPFD	Duplicate the file descriptor fd. The new file descriptor is returned as the value of the function. It is the lowest-numbered descriptor that is not already open, and that is greater than or equal to the third argument (taken as an integer). The new descriptor shares the same file table entry as fd. (Refer to Figure 3.9.) But the new descriptor descriptor flags, and its FD_CLOEXEC file descriptor flag is cleared. (This means that the descriptor is left open across an exec, which we discuss in Chapter 8.)
F_DUPFD_CLOEXEC	Duplicate the file descriptor and set the FD_CLOEXEC file descriptor flag associated with the new descriptor. Returns the new file descriptor.
F_GETFD	Return the file descriptor flags for fd as the value of the function. Currently, only one file descriptor flag is defined: the FD_CLOEXEC flag.
F_SETFD	Set the file descriptor flags for fd. The new flag value is set from the third argument (taken as an integer).
Be aware that some existing programs that deal with the file descriptor flags don’t use the constant FD_CLOEXEC. Instead, these programs set the flag to either 0 (don’t close-on-exec, the default) or 1 (do close-on-exec).
F_GETFL	Return the file status flags for fd as the value of the function. We described the file status flags when we described the open function. They are listed in Figure 3.10. Unfortunately, the five access-mode flags—O_RDONLY, O_WRONLY, O_RDWR, O_EXEC, and O_SEARCH—are not separate bits that can be tested. (As we mentioned earlier, the first three often have the values 0, 1, and 2, respectively, for historical reasons. Also, these five values are mutually exclusive; a file can have only one of them enabled.) Therefore, we must first use the O_ACCMODE mask to obtain the access-mode bits and then compare the result against any of the five values.
F_SETFL	Set the file status flags to the value of the third argument (taken as an integer). The only flags that can be changed are O_APPEND, O_NONBLOCK, O_SYNC, O_DSYNC, O_RSYNC, O_FSYNC, and O_ASYNC.
F_GETOWN	Get the process ID or process group ID currently receiving the SIGIO and SIGURG signals. We describe these asynchronous I/O signals in Section 14.5.2.
F_SETOWN	Set the process ID or process group ID to receive the SIGIO and SIGURG signals. A positive arg specifies a process ID. A negative arg implies a process group ID equal to the absolute value of arg.

File status flag	Description
O_RDONLY	open for reading only
O_WRONLY	open for writing only
O_RDWR	open for reading and writing
O_EXEC	open for execute only
O_SEARCH	open directory for searching only
O_APPEND	append on each write
O_NONBLOCK	nonblocking mode
O_SYNC	wait for writes to complete (data and attributes)
O_DSYNC	wait for writes to complete (data only)
O_RSYNC	synchronize reads and writes
O_FSYNC	wait for writes to complete (FreeBSD and Mac OS X only)
O_ASYNC	asynchronous I/O (FreeBSD and Mac OS X only)

Figure 3.10 File status flags for fcntl

The return value from fcntl depends on the command. All commands return −1 on an error or some other value if OK. The following four commands have special return values: F_DUPFD, F_GETFD, F_GETFL, and F_GETOWN. The first command returns the new file descriptor, the next two return the corresponding flags, and the final command returns a positive process ID or a negative process group ID.

Example

The program in Figure 3.11 takes a single command-line argument that specifies a file descriptor and prints a description of selected file flags for that descriptor.

/**
* 文件名: fileio/fileflags.c
* 内容：程序的第1个参数指定文件描述符，并对于该描述符打印其所选择的文件标志说明
* 时间: 2016年 08月 28日星期日 21:24:48 CST
* 作者：firewaywei
*/
#include"apue.h"
#include<fcntl.h>
int
main(int argc,char*argv[])
{
int val;
if(argc !=2)
{
err_quit("usage: a.out <descriptor#>");
}
if((val = fcntl(atoi(argv[1]), F_GETFL,0))<0)
{
err_sys("fcntl error for fd %d", atoi(argv[1]));
}
switch(val & O_ACCMODE)
{
case O_RDONLY:
printf("read only");
break;
case O_WRONLY:
printf("write only");
break;
case O_RDWR:
printf("read write");
break;
default:
err_dump("unknown access mode");
}
if(val & O_APPEND)
{
printf(", append");
}
if(val & O_NONBLOCK)
{
printf(", nonblocking");
}
if(val & O_SYNC)
{
printf(", synchronous writes");
}
#if !defined(_POSIX_C_SOURCE) && defined(O_FSYNC) && (O_FSYNC != O_SYNC)
if(val & O_FSYNC)
{
printf(", synchronous writes");
}
#endif
putchar(' ');
exit(0);
}

Figure 3.11 Print file flags for specified descriptor

Note that we use the feature test macro _POSIX_C_SOURCE and conditionally compile the file access flags that are not part of POSIX.1. The following script shows the operation of the program, when invoked from bash (the Bourne-again shell). Results will vary, depending on which shell you use.

$ ./fileflags 0 < /dev/tty
read only
$ ./fileflags 1 > temp.foo
$ cat temp.foo
write only
$ ./fileflags 2 2>>temp.foo
write only, append
$ ./fileflags 5 5<>temp.foo
read write

The clause 5<>temp.foo opens the file temp.foo for reading and writing on file descriptor 5.

Example

When we modify either the file descriptor flags or the file status flags, we must be careful to fetch the existing flag value, modify it as desired, and then set the new flag value. We can’t simply issue an F_SETFD or an F_SETFL command, as this could turn off flag bits that were previously set.

Figure 3.12 shows a function that sets one or more of the file status flags for a descriptor.

/**
* 文件名: fileio/setfl.c
* 内容：对于一个文件描述符设置一个或多个文件状态标志的函数
* 时间: 2016年 08月 29日星期一 13:00:12 CST
* 作者：firewaywei
*/
#include"apue.h"
#include<fcntl.h>
void
set_fl(int fd,int flags)/* flags are file status flags to turn on */
{
int val;
if((val = fcntl(fd, F_GETFL,0))<0)
{
err_sys("fcntl F_GETFL error");
}
val |= flags;/* turn on flags */
if(fcntl(fd, F_SETFL, val)<0)
{
err_sys("fcntl F_SETFL error");
}
}

Figure 3.12 Turn on one or more of the file status flags for a descriptor

If we change the middle statement to

val &= ~falgs; /* turn flags off */

we have a function named clr_fl, which we’ll use in some later examples. This statement logically ANDs the one’s complement of flags with the current val.

If we add the line

set_fl(STDOUT_FILENO, O_SYNC);

to the beginning of the program shown in Figure 3.5, we’ll turn on the synchronous-write flag. This causes each write to wait for the data to be written to disk before returning. Normally in the UNIX System, a write only queues the data for writing; the actual disk write operation can take place sometime later. A database system is a likely candidate for using O_SYNC, so that it knows on return from a write that the data is actually on the disk, in case of an abnormal system failure.

We expect the O_SYNC flag to increase the system and clock times when the program runs. To test this, we can run the program in Figure 3.5, copying 492.6 MB of data from one file on disk to another and compare this with a version that does the same thing with the O_SYNC flag set. The results from a Linux system using the ext4 file system are shown in Figure 3.13.

Operation	User CPU (seconds)	System CPU (seconds)	Clock time (seconds)
read time from Figure 3.6 for BUFFSIZE = 4,096	0.03	0.58	8.62
normal write to disk file	0.00	1.05	9.70
write to disk file with O_SYNC set	0.02	1.09	10.28
write to disk followed by fdatasync	0.02	1.14	17.93
write to disk followed by fsync	0.00	1.19	18.17
write to disk with O_SYNC set followed by fsync	0.02	1.15	17.88

Figure 3.13 Linux ext4 timing results using various synchronization mechanisms

The six rows in Figure 3.13 were all measured with a BUFFSIZE of 4,096 bytes. The results in Figure 3.6 were measured while reading a disk file and writing to /dev/null, so there was no disk output. The second row in Figure 3.13 corresponds to reading a disk file and writing to another disk file. This is why the first and second rows in Figure 3.13 are different. The system time increases when we write to a disk file, because the kernel now copies the data from our process and queues the data for writing by the disk driver. We expect the clock time to increase as well when we write to a disk file.

When we enable synchronous writes, the system and clock times should increase significantly. As the third row shows, the system time for writing synchronously is not much more expensive than when we used delayed writes. This implies that the Linux operating system is doing the same amount of work for delayed and synchronous writes (which is unlikely), or else the O_SYNC flag isn’t having the desired effect. In this case, the Linux operating system isn’t allowing us to set the O_SYNC flag using fcntl, instead failing without returning an error (but it would have honored the flag if we were able to specify it when the file was opened).

The clock time in the last three rows reflects the extra time needed to wait for all of the writes to be committed to disk. After writing a file synchronously, we expect that a call to fsync will have no effect. This case is supposed to be represented by the last row in Figure 3.13, but since the O_SYNC flag isn’t having the intended effect, the last row behaves the same way as the fifth row.

Figure 3.14 shows timing results for the same tests run on Mac OS X 10.6.8, which uses the HFS file system. Note that the times match our expectations: synchronous writes are far more expensive than delayed writes, and using fsync with synchronous writes makes very little difference. Note also that adding a call to fsync at the end of the delayed writes makes little measurable difference. It is likely that the operating system flushed previously written data to disk as we were writing new data to the file, so by the time that we called fsync, very little work was left to be done.

Operation	User CPU (seconds)	System CPU (seconds)	Clock time (seconds)
write to /dev/null	0.14	1.02	5.28
normal write to disk file	0.14	3.21	17.04
write to disk file with O_SYNC set	0.39	16.89	60.82
write to disk followed by fsync	0.13	3.07	17.10
write to disk with O_SYNC set followed by fsync	0.39	18.18	62.39

Figure 3.14 Mac OS X HFS timing results using various synchronization mechanisms

Compare fsync and fdatasync, both of which update a file’s contents when we say so, with the O_SYNC flag, which updates a file’s contents every time we write to the file. The performance of each alternative will depend on many factors, including the underlying operating system implementation, the speed of the disk drive, and the type of the file system.

With this example, we see the need for fcntl. Our program operates on a descriptor (standard output), never knowing the name of the file that was opened on that descriptor. We can’t set the O_SYNC flag when the file is opened, since the shell opened the file. With fcntl, we can modify the properties of a descriptor, knowing only the descriptor for the open file. We’ll see another need for fcntl when we describe nonblocking pipes (Section 15.2), since all we have with a pipe is a descriptor.

ioctl Function

The ioctl function has always been the catchall for I/O operations. Anything that couldn’t be expressed using one of the other functions in this chapter usually ended up being specified with an ioctl. Terminal I/O was the biggest user of this function. (When we get to Chapter 18, we’ll see that POSIX.1 has replaced the terminal I/O operations with separate functions.)

#include <unistd.h>/* System V */
#include <sys/ioctl.h>/* BSD and Linux */
int ioctl(int fd, int request, ...);
Returns: −1 on error, something else if OK

The ioctl function was included in the Single UNIX Specification only as an extension for dealing with STREAMS devices [Rago 1993], but it was moved to obsolescent status in SUSv4. UNIX System implementations use ioctl for many miscellaneous device operations. Some implementations have even extended it for use with regular files.

The prototype that we show corresponds to POSIX.1. FreeBSD 8.0 and Mac OS X 10.6.8 declare the second argument as an unsigned long. This detail doesn’t matter, since the second argument is always a #defined name from a header.

For the ISO C prototype, an ellipsis is used for the remaining arguments. Normally, however, there is only one more argument, and it’s usually a pointer to a variable or a structure.

In this prototype, we show only the headers required for the function itself. Normally, additional device-specific headers are required. For example, the ioctl commands for terminal I/O, beyond the basic operations specified by POSIX.1, all require the <termios.h> header.

Each device driver can define its own set of ioctl commands. The system, however, provides generic ioctl commands for different classes of devices. Examples of some of the categories for these generic ioctl commands supported in FreeBSD are summarized in Figure 3.15.

Category	Constant names	Header	Number of ioctls
disk labels	DIOxxx	<sys/disklabel.h>	4
file I/O	FIOxxx	<sys/filio.h>	14
mag tape I/O	MTIOxxx	<sys/mtio.h>	11
socket I/O	SIOxxx	<sys/sockio.h>	73
terminal I/O	TIOxxx	<sys/ttycom.h>	43

Figure 3.15 Common FreeBSD ioctl operations

The mag tape operations allow us to write end-of-file marks on a tape, rewind a tape, space forward over a specified number of files or records, and the like. None of these operations is easily expressed in terms of the other functions in the chapter (read, write, lseek, and so on), so the easiest way to handle these devices has always been to access their operations using ioctl.

We use the ioctl function in Section 18.12 to fetch and set the size of a terminal’s window, and in Section 19.7 when we access the advanced features of pseudo terminals.

/dev/fd

Newer systems provide a directory named /dev/fd whose entries are files named 0, 1, 2, and so on. Opening the file /dev/fd/n is equivalent to duplicating descriptor n, assuming that descriptor n is open.

The /dev/fd feature was developed by Tom Duff and appeared in the 8th Edition of the Research UNIX System. It is supported by all of the systems described in this book: FreeBSD 8.0, Linux 3.2.0, Mac OS X 10.6.8, and Solaris 10. It is not part of POSIX.1.

In the function call

fd = open("/dev/fd/0", mode);

most systems ignore the specified mode, whereas others require that it be a subset of the mode used when the referenced file (standard input, in this case) was originally opened. Because the previous open is equivalent to

fd = dup(0);

the descriptors 0 and fd share the same file table entry (Figure 3.9). For example, if descriptor 0 was opened read-only, we can only read on fd. Even if the system ignores the open mode and the call

fd = open("/dev/fd/0", O_RDWR);

succeeds, we still can’t write to fd.

The Linux implementation of /dev/fd is an exception. It maps file descriptors into symbolic links pointing to the underlying physical files. When you open /dev/fd/0, for example, you are really opening the file associated with your standard input. Thus the mode of the new file descriptor returned is unrelated to the mode of the /dev/fd file descriptor.

We can also call creat with a /dev/fd pathname argument as well as specify O_CREAT in a call to open. This allows a program that calls creat to still work if the pathname argument is /dev/fd/1, for example.

Beware of doing this on Linux. Because the Linux implementation uses symbolic links to the real files, using creat on a /dev/fd file will result in the underlying file being truncated.

Some systems provide the pathnames /dev/stdin, /dev/stdout, and /dev/stderr. These pathnames are equivalent to /dev/fd/0, /dev/fd/1, and /dev/fd/2, respectively.

The main use of the /dev/fd files is from the shell. It allows programs that use pathname arguments to handle standard input and standard output in the same manner as other pathnames. For example, the cat(1) program specifically looks for an input filename of - and uses it to mean standard input. The command

filter file2 | cat file1 - file3 | lpr

is an example. First, cat reads file1, then its standard input (the output of the filter program on file2), and then file3. If /dev/fd is supported, the special handling of - can be removed from cat, and we can enter

filter file2 | cat file1 /dev/fd/0 file3 | lpr

The special meaning of - as a command-line argument to refer to the standard input or the standard output is a kludge that has crept into many programs. There are also problems if we specify - as the first file, as it looks like the start of another command-line option. Using /dev/fd is a step toward uniformity and cleanliness.

Summary

This chapter has described the basic I/O functions provided by the UNIX System. These are often called the unbuffered I/O functions because each read or write invokes a system call into the kernel. Using only read and write, we looked at the effect of various I/O sizes on the amount of time required to read a file. We also looked at several ways to flush written data to disk and their effect on application performance.

Atomic operations were introduced when multiple processes append to the same file and when multiple processes create the same file. We also looked at the data structures used by the kernel to share information about open files. We’ll return to these data structures later in the text.

We also described the ioctl and fcntl functions. We return to both of these functions later in the book. In Chapter 14, we’ll use fcntl for record locking. In Chapter 18 and Chapter 19, we’ll use ioctl when we deal with terminal devices.

参考

《Advanced Programming in the UNIX Envinronment,》Third Edition Chapter 3. File I/O
重定向和管道，第 3 章命令行简介 http://man.chinaunix.net/linux/mandrake/101/zh_cn/Command-Line.html/shell-pipes.html

File I/O

File I/O

Introduction

File Descriptors

open and openat Functions

Filename and Pathname Truncation

creat Function

close Function

重定向和管道的命令行简介

重定向

管道

lseek Function

Example

Example

read Function

write Function

I/O Efﬁciency

File Sharing

Atomic Operations

Appending to a File

pread and pwrite Functions

Creating a File

dup and dup2 Functions

sync, fsync, and fdatasync Functions

fcntl Function

Example

Example

ioctl Function

/dev/fd

Summary

参考

附件列表