8. File Access and Processing – C Programming Essentials

Chapter 8. File Access and Processing

Any commercially available programming language is able to communicate with disc data files. This communication depends on the operating system under which the language is used. C was originally developed under UNIX; nowadays, C compilers are available under most of the major operating systems. In fact, most of the I/O functions in C behave consistently across operating systems, though not every operating system provides all the facilities available in UNIX. In this chapter, we will discuss how disc data files are used in C.

Introduction

Reading and writing data stored on disc can be accomplished at two levels: high-level disc I/O and low-level disc I/O. High-level disc I/O is designed to work with data files and is convenient when small and manageable amounts of data are to be processed. It is also independent of the operating system. In fact, the functions for high-level access are built up from low-level functions. In contrast, low-level disc I/O functions are operating system-dependent. They read data from disc in blocks of a size convenient for the underlying operating system, and so they differ from one operating system to another.
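The difference between the two levels can be illustrated with a small sketch. It assumes a POSIX system (such as UNIX or Linux), where the low-level calls are open, read, and close; other operating systems provide different low-level calls. The function names count_high and count_low and the block size of 512 are chosen here only for illustration.

```c
#include <stdio.h>
#include <fcntl.h>      /* open (POSIX, not standard C) */
#include <unistd.h>     /* read, close (POSIX, not standard C) */

/* High-level: count the bytes of a file one character at a time
   through a FILE stream. Returns -1 if the file cannot be opened. */
long count_high(const char *name)
{
    FILE *fp = fopen(name, "rb");
    long n = 0;

    if (fp == NULL)
        return -1;
    while (fgetc(fp) != EOF)
        ++n;
    fclose(fp);
    return n;
}

/* Low-level: count the bytes by asking the operating system for
   whole blocks. Returns -1 on any failure. */
long count_low(const char *name)
{
    char block[512];            /* block size chosen arbitrarily */
    long n = 0;
    ssize_t got;
    int fd = open(name, O_RDONLY);

    if (fd < 0)
        return -1;
    while ((got = read(fd, block, sizeof block)) > 0)
        n += (long)got;
    close(fd);
    return (got < 0) ? -1 : n;
}
```

Both functions give the same count for the same file. The first one would compile unchanged under any C compiler, whereas the second depends on the headers and calls of the operating system.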

Opening and Closing a Disc File

In order to use a disc data file, either for reading or writing, we must first open the file. At the time of opening the file, and while the file is open, the program and the underlying operating system share certain information regarding the file. This information is stored in a structure, which is declared within the header file stdio.h supplied with the compiler. This declaration is somewhat similar to the following:

#define OPEN_MAX 20    /*Max number of files open at a time*/
typedef struct _iobuf{
      int _fd;         /*File descriptor*/
      int _cnt;        /*Characters left in buffer*/
      int _mode;       /*File access mode*/
      char *_next;     /*Pointer to next character*/
      char *_base;     /*Base address of file buffer*/
} FILE;
extern FILE _iob[OPEN_MAX];

The symbolic constant OPEN_MAX limits the maximum number of files that can be opened at a time in a particular program. This value of OPEN_MAX depends upon the compiler. A program gets the information about a particular data file from the structure named FILE. The header file, stdio.h, defines an array of such structures, namely, _iob of size OPEN_MAX, to keep the information about all the files that are opened by a program.

Although the structure FILE is shown explicitly, a programmer need not keep track of the individual members of the structure. The programmer may open a disc data file by calling the fopen( ) function. The function fopen( ) is called with two arguments, which are as shown below:

  1. The filename — a string of characters or a pointer to the file name

  2. The file access mode — a string

Essentially, the function fopen( ) performs two tasks. Firstly, it stores the information that will be shared by the program and the operating system into the FILE structure, and then returns a pointer to FILE structure where it is stored. This pointer is known as file pointer and is used as the logical name of the opened file.

While opening a data file, the following must be known:

  1. The filename we want to use

  2. The mode of using the file, i.e., what type of work we want to do with it

  3. Where to store the relevant information about the file

An actual call to this function looks like:

fp = fopen("student.dat","r");

This opens the disc file called student.dat in read “r” mode, i.e., for reading purpose, and stores the information about the file within a structure FILE and returns the file pointer which is collected in the variable fp. So, before calling fopen( ), the file pointer fp is to be defined as:

FILE *fp;

There are three different modes of operating a file. These modes are:

  • “r” for reading from a file

  • “w” for writing to a file

  • “a” for appending data to a file

In addition to these three modes, two modifiers may also be attached. To use a binary file, the modifier “b” is attached. A plus (+) modifier opens the file for updating, that is, for both reading and writing. For example, if the mode of opening a file is “ab+” or “a+b” (they have the same meaning), we are opening a binary file for updating with all writes going to the end of the file, and the file is created if it does not exist. A list of different modes of opening a file is given in Table 8.1.

Table 8.1. Modes for Operating a File

Mode      Interpretation

“r”       Opens a file for reading
“w”       Creates a file for writing or truncates to zero length
“a”       Opens a file for writing at end or creates for writing
“r+”      Opens an existing file for reading/writing
“w+”      Truncates to zero length or creates file for updating
“a+”      Opens for updating at end
“rb”      Opens a binary file for reading
“wb”      Opens a binary file for writing
“ab”      Opens a binary file for writing at end
“rb+”     Opens an existing binary file for reading/writing
“wb+”     Creates a binary file for updating/writing
“ab+”     Opens a binary file for appending at end
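The effect of the three basic modes can be seen in a short sketch. The function name mode_demo and the text written to the file are assumptions made for this illustration; the caller supplies the file name.

```c
#include <stdio.h>
#include <string.h>

/* Writes one line with "w", adds a second with "a", then reads the
   whole file back with "r" into out (capacity cap).
   Returns 0 on success, -1 on failure. */
int mode_demo(const char *name, char *out, size_t cap)
{
    FILE *fp;
    size_t used = 0;

    if (cap == 0)
        return -1;
    out[0] = '\0';

    if ((fp = fopen(name, "w")) == NULL)    /* create or truncate */
        return -1;
    fputs("first\n", fp);
    fclose(fp);

    if ((fp = fopen(name, "a")) == NULL)    /* keep contents, write at end */
        return -1;
    fputs("second\n", fp);
    fclose(fp);

    if ((fp = fopen(name, "r")) == NULL)    /* existing file, read only */
        return -1;
    while (used + 1 < cap && fgets(out + used, (int)(cap - used), fp) != NULL)
        used += strlen(out + used);
    fclose(fp);
    return 0;
}
```

Calling mode_demo twice on the same file name leaves the same two lines in the file both times, because “w” discards the old contents before “a” adds to them.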

It should be noted that if an error condition like ‘disc full’ while writing, or ‘open drive door’, or any other hardware failure occurs at the time of opening a file, then fopen( ) returns a NULL pointer, which is defined as a symbolic constant within the stdio.h header file. So, in practice, when a file is opened, the file pointer is checked against NULL pointer before using it.

Consider the following code fragment:

...

FILE *fpin;

/* the inner parentheses are required: == binds tighter than = */
if((fpin=fopen(filename,"r")) == NULL){
      printf("Cannot open %s file\n",filename);
      return;
}
...

This opens a file with FILE pointer fpin and, while opening it, tests whether the open was successful. Assume that the file is opened with FILE pointer fpin for reading. Similarly, let us assume that a file with FILE pointer fpout is opened for writing. The following functions may be used to read a character from the stream fp:

 i)  int fgetc(FILE *fp);
ii)  int getc(FILE *fp);

The function fclose( ) is just the converse of fopen( ). It disconnects the file pointer from the external name of the file that was established by fopen( ) function. When the work with a file is over, that is, we do not need the file any more, we must close the file. This is done by calling the fclose( ) function with the file pointer as argument to it, as shown below:

int fclose(FILE *fp);

Character Input/Output

The character input/output functions read and write single characters as well as strings of characters.

For the segment of code:

int c;
...
...
c=fgetc(fpin);

The function fgetc( ) reads the next character from the stream pointed to by fpin and assigns it to the variable c. Essentially, fgetc reads in a character (as unsigned char) from the specified stream and returns it as a value of type int (if it is successful) or returns EOF (if it is unsuccessful or the end of file has been encountered). This is why c is declared as int and not as char. The function getc( ) is the same as fgetc( ), except for the fact that getc( ) is usually implemented as a macro instead of a function.

At this point of time, we should note that when a C program starts executing, the operating-system environment automatically opens three files and provides file pointers for them. These files are standard input, standard output, and standard error. The corresponding file pointers are stdin, stdout, and stderr, respectively. Normally, stdin is connected with the standard input device (usually the keyboard) and stdout is connected to the standard output device (usually the terminal screen). However, stdin and stdout may be redirected to files or pipes. The file pointer stderr is connected to the terminal screen. The function getchar( ) is identical to getc( ) with argument stdin, and can be defined as:

#define getchar() getc(stdin)

There is another useful function, ungetc( ), which allows us to push back the character just read from a stream back to the stream, so that the next read from the stream reads the previous character again. This is not really done by pushing back the character to the stream physically. Instead, the character is placed in a buffer to serve the purpose.
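A typical use of ungetc( ) is reading a number whose end is known only after one character too many has been read. The following is a minimal sketch; the function name read_number is an assumption made for this illustration.

```c
#include <stdio.h>
#include <ctype.h>

/* Reads an unsigned decimal number from fp. The first character that
   is not a digit is pushed back with ungetc(), so the caller's next
   read sees it. Returns the number, or -1 if no digit was found. */
long read_number(FILE *fp)
{
    long value = 0;
    int c, digits = 0;

    while ((c = fgetc(fp)) != EOF && isdigit(c)) {
        value = value * 10 + (c - '0');
        ++digits;
    }
    if (c != EOF)
        ungetc(c, fp);      /* give back the character read too far */
    return digits ? value : -1;
}
```

If the stream contains 123+45, the first call returns 123 and the next fgetc( ) on the stream returns the character '+', which would have been lost without the call to ungetc( ).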

Just as with the read functions, the following functions may be used to write the character specified by the integer c to the stream fp at the current position of the file. Each returns the character written as an int, or EOF if an error occurs.

 i)  int fputc(int c,FILE *fp);
ii)  int putc(int c,FILE *fp);

Again, the function putc( ) behaves like fputc( ) except for the fact that it is usually implemented as a macro instead of a function. In fact, putchar( ) may be defined as:

#define putchar(c) putc((c),stdout)

The input and output of strings of characters may be achieved by using the following functions:

 i)  char *fgets(char *buff, int buff_size, FILE *fp);
ii)  int fputs(const char *buff, FILE *fp);
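Since fgets( ) stores at most buff_size - 1 characters and stops after a newline, a line longer than the buffer arrives in several pieces. The sketch below makes this visible by counting calls; the function name count_fgets_calls and the internal buffer size of 64 are assumptions made for this illustration.

```c
#include <stdio.h>

/* Counts how many fgets() calls are needed to consume fp when each
   call may store buff_size characters, including the final '\0'.
   Returns -1 if buff_size does not fit the internal buffer. */
int count_fgets_calls(FILE *fp, int buff_size)
{
    char buff[64];
    int calls = 0;

    if (buff_size < 2 || buff_size > (int)sizeof buff)
        return -1;
    while (fgets(buff, buff_size, fp) != NULL)
        ++calls;
    return calls;
}
```

For a stream holding the two lines abcdefghij and xy, a buff_size of 64 needs two calls, while a buff_size of 5 needs four: the first line comes back as abcd, efgh, and ij followed by the newline.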

As a file read/write example, we write a simple program that reads a source text file and writes to another file the same source with line numbers. The program is listed in the following example program.

Example . 

Program Ch08n01.c

 /*A File read-write example*/

 #include <stdio.h>
 #include <stdlib.h>    /* declares exit( ) */

 int main()
 {
    FILE *rfp,*wfp;
    char buff[80];
    int lineno=0;

    /*Open input and output files*/
    if((rfp=fopen("inpfile.c","r"))==NULL)
    {
        printf("Cannot open inpfile.c for reading !\n");
        exit(1);
    }
    if((wfp=fopen("outfile.c","w"))==NULL)
    {
        printf("Cannot open outfile.c for writing !\n");
        exit(2);
    }

    /*File read and write*/
    while(fgets(buff,80,rfp)!= NULL)
    {
        fprintf(wfp, "%4d:",++lineno);
        fputs(buff,wfp);
    }
    fclose(rfp);
    fclose(wfp);
    return 0;
 }
 End Program

Explanation: The program starts by defining two file pointers, one for the input file and another for the output file. A character array buff of size 80 is defined to store the line that is currently read from inpfile.c, which is the external name of the input source file. An integer variable lineno is defined and initialized with zero to keep the line count. The first two if blocks are used to open the input and output files in the proper modes. The standard library function exit( ) is called to terminate the program execution. The function exit( ) does not return; its argument is passed back to the environment as the exit status of the program and is available to the process that invoked this program. Hence, the success or failure of the program may be tested by another program that uses this program as a sub-process. The while loop does the actual job of the program. The call to fgets( ) reads the next line (a maximum of 79 characters) from the input file into the buff array; fprintf( ) writes the current line number to the output file; and then the call to fputs( ) writes the buff array into the output file. This continues as long as there are lines left in the input file. Finally, two calls to the fclose( ) function are made to close the input and the output files.

As our next example, we write a program, the task of which is very similar to that of the cat command in UNIX. This uses reading and writing a file character by character, and inside the program we make use of the functions fgetc( ) and fputc( ). These functions are called from the function copyfile, which we wrote to copy a file from the file whose file pointer is rfp, to the file having file pointer wfp. This is a simple function that reads a character from rfp stream and writes to the stream wfp until the end of file. The program listing is given in the next example program. The program uses command-line arguments. For reference, the cat command is of the form:

cat filename1 filename2...

If the program name is the only argument on the command line, that is, if no filename is given, the program copies the standard input to the standard output, so each line typed at the keyboard is displayed. When filenames are given on the command line, the program opens the files one by one and displays them.

Example . 

Program Ch08n02.c

 /*Program listing for simulating the cat command in UNIX*/

 #include <stdio.h>
 #include <stdlib.h>    /* declares exit( ) */

 void copyfile(FILE *, FILE *);

 int main(int argc,char *argv[])
 {
     FILE *fileptr;
     if(argc>1)
         while(--argc>0)
             if((fileptr=fopen(*++argv, "r"))==NULL)
             {
                 printf("cat : can't open %s\n",*argv);
                 exit(1);
             }
             else {
                 copyfile(fileptr,stdout);
                 fclose(fileptr);
             }
     else
        copyfile(stdin, stdout);
     return(0);
 }

 /* Function to copy file from rfp to wfp */
 void copyfile (FILE *rfp,FILE *wfp)
 {
     int c;

     while((c=fgetc(rfp))!=EOF)
          fputc(c, wfp);
     return;
 }
End Program

Error-Handling

Although the cat program in the last section seems to run properly, it still has some problems. The difficulty is that if one of the files in the command line cannot be opened for some reason, the error message is printed at the end of the output and the program terminates. This might be acceptable if the output is going to the standard output device (the display unit), but is not reasonable if it is going into a disc file or into some other program through a pipeline.

To rid ourselves of this problem, we can make use of another output stream called stderr, which is normally assigned to the screen. This is the file pointer of the standard error and, like stdin and stdout, it is opened automatically when a C program starts executing. The advantage of using this is that the output written on stderr appears on the screen even if the standard output is redirected. We rewrite our cat program to write the error messages on the standard error device. The program listing is given in the following example program.

Example . 

Program Ch08n03.c

 /*Program listing for simulating the cat command in UNIX with error
 handling*/

 #include <stdio.h>
 #include <stdlib.h>    /* declares exit( ) */

 void copyfile(FILE *,FILE *);

 int main(int argc, char *argv[])
 {
     FILE *fileptr;

     if(argc>1)
         while(--argc>0)
             if((fileptr=fopen(*++argv, "r"))==NULL)
             {
                 fprintf(stderr,"cat : can't open %s\n",*argv);
                 exit(1);
             }
             else{
                  copyfile(fileptr,stdout);
                  fclose(fileptr);
             }
     else
         copyfile(stdin, stdout);
     if(ferror(stdout))
     {
         fprintf(stderr, "cat : error in writing at stdout\n" );
         exit(2);
     }
     return(0);
 }
 /* Function to copy file from rfp to wfp */
 void copyfile(FILE *rfp, FILE *wfp)
 {
     int c;

     while((c=fgetc(rfp))!=EOF)
            fputc(c, wfp);
     return;
 }
 End Program

Explanation: Notice that in the above example the diagnostic output is written to stderr by fprintf. So, it will be directed to the screen instead of any redirected output file or pipeline. The failure or success of the program may be tested by another program that uses this program as a sub-process, since we called the exit( ) function from the program with arguments to indicate it. By convention, an argument of 0 to the exit function indicates success, while non-zero values are treated as failures. The function exit( ), in turn, flushes and closes each file that is still open in the program.

We used another function ferror( ), whose general form is:

int ferror(FILE *fp);

This function returns a non-zero value if an error occurred on the stream fp. There is another function analogous to ferror( ). It returns a non-zero value if end of file has occurred on the specified stream, and is given by:

int feof(FILE *fp);
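Because fgetc( ) returns EOF both at end of file and on an error, the two functions are usually consulted after a read loop has finished, to find out which of the two happened. A minimal sketch follows; the function name copy_checked and its return codes are assumptions made for this illustration.

```c
#include <stdio.h>

/* Copies rfp to wfp character by character, then reports why the
   loop ended: 0 = end of file reached cleanly, 1 = read error,
   2 = write error. */
int copy_checked(FILE *rfp, FILE *wfp)
{
    int c;

    while ((c = fgetc(rfp)) != EOF)
        if (fputc(c, wfp) == EOF)
            return 2;
    if (ferror(rfp))
        return 1;
    return feof(rfp) ? 0 : 1;
}
```

Note that feof( ) becomes non-zero only after a read has actually run into the end of the file, so it is tested after the loop and not used as the loop condition.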

Reading and Writing a File in Convenient Chunks

While reading or writing a binary file, it does not make much sense to read or write the file one character or line at a time. Instead, in such situations we want to read or write a certain number of data items of specified size. To achieve this we may use the following functions:

i)   size_t fread(void *buff,size_t size,size_t count,FILE *fptr);
ii)  size_t fwrite(const void *buff,size_t size,size_t count,FILE *fptr);

Here, size_t is the typedef for the unsigned integer type returned by the sizeof operator, and is defined in the header file stddef.h (stdio.h defines it as well). The function fread( ) reads count items of size bytes each from the specified stream fptr and stores them in memory beginning at the address buff. The return value from fread( ) is the number of items read. The return value is less than count if an error occurs or end of file is reached.

The function fwrite( ) is just the converse of fread, and writes count items each of size bytes from the memory address buff to the file stream fptr. Here, the function also returns the number of items written, which may be less than count if an output error occurs. In both cases, the file-position indicator is advanced by the number of characters successfully read or written.
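As a small sketch of chunked I/O, the following pair of functions writes and reads whole records. The structure student and the function names save_students and load_students are assumptions made for this illustration; the stream is expected to be opened in a binary mode.

```c
#include <stdio.h>

struct student {            /* hypothetical record layout */
    int   roll;
    char  name[20];
    float marks;
};

/* Writes count records to fp; returns the number actually written. */
size_t save_students(FILE *fp, const struct student *s, size_t count)
{
    return fwrite(s, sizeof *s, count, fp);
}

/* Reads up to max records from fp; returns the number actually read,
   which is smaller than max when the file holds fewer records. */
size_t load_students(FILE *fp, struct student *s, size_t max)
{
    return fread(s, sizeof *s, max, fp);
}
```

If three records are saved and five are then requested from the beginning of the file, load_students returns 3. A file written this way holds the records in the memory layout of the machine that wrote it, so it is generally readable only by a program compiled for the same kind of system.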

File Positioning

Primarily, there are two ways to access a file: sequentially or randomly. In sequential access, to access a specified position of a file, we must access all the preceding data before it in a sequential fashion. This means, if we want to access the 5th (say) record of a file, we must access records from 1 to 4 in a sequential manner, and only then we can have access to read the 5th record. On the other hand, in random access, we move the file-position indicator to the beginning of 5th record and then access the record without accessing the previous records. That is, random access allows us a direct access to a specified portion of the file. Now, it should be clear that to use random access files, we need some functions that will allow us to move around in a file. These functions are known as file-positioning functions.

There are three such functions whose descriptions are given below:

i)    int fseek(FILE *fptr,long offset,int base);
ii)   long ftell(FILE *fptr);
iii)  void rewind(FILE *fptr);

The function fseek( ) moves the file-position indicator of the stream fptr to offset bytes from the beginning of the file, from the current position of the file-position indicator, or from the end of the file, if the value of base is 0, 1, or 2, respectively. The header file stdio.h provides the symbolic constants SEEK_SET, SEEK_CUR, and SEEK_END for these three bases, and using them is preferable to the bare numbers. If a call to the fseek( ) function is successful, fseek( ) returns a 0; otherwise, it will return a non-zero value.

The function ftell( ) returns the offset of the current file-position indicator in bytes from the beginning of the file, referenced by the stream fptr. It returns -1 in case of any error. Notice that the data type of these returned values is long int.
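The direct access to a record described earlier can be sketched with fseek( ) and ftell( ) as follows. The function names read_record and record_count are assumptions made for this illustration, the stream is assumed to be a binary file of fixed-size records, and SEEK_SET and SEEK_END are the names that stdio.h gives to the bases "beginning of file" and "end of file".

```c
#include <stdio.h>

/* Reads record number n (counting from 1) of rec_size bytes from a
   binary stream without touching the records before it.
   Returns 1 on success, 0 on failure. */
int read_record(FILE *fp, long n, size_t rec_size, void *out)
{
    if (n < 1 || fseek(fp, (n - 1) * (long)rec_size, SEEK_SET) != 0)
        return 0;
    return fread(out, rec_size, 1, fp) == 1;
}

/* Number of whole records in the stream, found by seeking to the end
   and asking ftell() for the offset. The original position is
   restored afterwards. Returns -1 on failure. */
long record_count(FILE *fp, size_t rec_size)
{
    long here = ftell(fp), size;

    if (here < 0 || fseek(fp, 0L, SEEK_END) != 0)
        return -1;
    size = ftell(fp);
    fseek(fp, here, SEEK_SET);
    return (size < 0) ? -1 : size / (long)rec_size;
}
```

To fetch the 5th record, read_record moves the indicator by four record lengths from the beginning of the file and reads once; records 1 to 4 are never read.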

The function rewind( ) repositions the file-position indicator in the file referenced by the stream fptr to the beginning of the file and returns nothing. We write a program as the next example to illustrate the functions fseek, fread, fwrite, and rewind.

Example . 

Program Ch08n04.c

 /*Program code to retrieve user specified lines
 from a text & store these lines to another destination file */

 #include <stdio.h>
 #include <stdlib.h>    /* declares exit( ) */

 #define MAX 500
 void makendx(FILE *rfp, FILE *nfp)
 {
    long line=0, spot=0;
    int ch;

    /* leave space for storing total number of lines in file*/
    fseek(nfp,(long)sizeof(long),0);

    /* Write the position of the first line*/
    fwrite((char*)&spot, sizeof(long),1,nfp);

    /* loop to store the positions of the subsequent lines */
    while((ch=fgetc(rfp))!= EOF){
           if(ch=='\n'){
                  /* ftell gives a position that fseek accepts on any
                  system, whatever its newline convention */
                  spot=ftell(rfp);
                  line++;
                  fwrite((char*)&spot,sizeof(long),1,nfp);
           }
    }

    /* Go to the top of the index file and store total lines*/
    rewind(nfp);
    fwrite((char*)&line,sizeof(long),1,nfp);

    /* the index stays open: makeout reads it through the same
    stream, and main closes it exactly once */
    return;
 }

 void makeout(FILE *rfp, FILE *wfp, FILE *nfp)
    {
    long lines,lineno,start;
    char chunk[MAX];

    /* go back to the top of the index before reading it*/
    rewind(nfp);

    /* read total numbers of lines in source*/
    if(fread((char*)&lines,sizeof(long),1,nfp)!=1){
           fprintf(stderr,"Cannot read tmp.ndx\n");
           exit(5);
    }

    printf("Enter line numbers to retrieve one by one:\n");

    /* loop to get desired line and store to destination*/
    while(1){
           /* stop on 0, on end of input, or on non-numeric input*/
           if(scanf("%ld",&lineno)!=1 || lineno==0)
                  return;

           /* ignore line numbers outside the file*/
           if(lineno<1 || lineno>lines)
                  continue;

           /* set file position indicator to get starting position*/
           fseek(nfp,lineno*(long)sizeof(long),0);

           /* read starting position*/
           if(fread((char*)&start,sizeof(long),1,nfp)!=1)
                  continue;

           /* position the indicator to the start of the line in source*/
           fseek(rfp,start,0);

           /* read the line (at most MAX-1 characters) and
           write it to destination file*/
           if(fgets(chunk,MAX,rfp)!=NULL)
                  fputs(chunk,wfp);
    }
 }

 int main(int argc, char *argv[])
    {
    FILE *rfp, *wfp, *nfp;
    if(argc!=3){
           fprintf(stderr,"Usage : %s <source> <destination>\n",argv[0]);
           exit(1);
    }
    if((rfp=fopen(*++argv,"r"))== NULL){
           fprintf(stderr,"Cannot open %s \n",*argv);
           exit(2);
    }
    if((wfp=fopen(*++argv,"w"))==NULL){
           fprintf(stderr,"Cannot open %s\n", *argv);
           exit(3);
    }
    if((nfp=fopen("tmp.ndx","wb+"))==NULL){
           fprintf(stderr,"Cannot open tmp.ndx\n");
           exit(4);
    }
    makendx(rfp,nfp);
    makeout(rfp,wfp,nfp);
    fclose(rfp);
    fclose(wfp);
    fclose(nfp);
    return 0;
 }

The function ftell( ) reports the current value of the file-position indicator and is also handy when debugging a program. Our example program retrieves user-specified lines from a text file. To execute the program, we need to submit a command of the following form in UNIX:

$ Ch08n04.out <source_file> <destination_file>

Here, the <source_file> is the input text file from which we want to retrieve the lines. The <destination_file> is the file in which we store the retrieved lines one by one. The main function of the program opens the <source_file> in read (“r”) mode, the destination file in write (“w”) mode, and an index file, which is a temporary binary file, in the mode that creates it for updating (“wb+”). On successful opening of these files, the main function calls the functions makendx( ) and makeout( ) one after the other. The function makendx( ) prepares the index file by scanning the <source_file> once at the beginning. Each record of this index file is sizeof(long) bytes in length. The first record holds the number of lines in the source text file, and the subsequent records hold the starting positions of each line. The makeout( ) function performs the following task. When a particular line is requested by its line number, an fseek is done on the index file to position its file-position indicator at the record holding the starting position of that line. After reading this starting position from the index file, another fseek is done on the source text file to move to this starting position. Thereafter, the line is read from this position of the source file into a character array, and this character array is then written to the output file.

Summary

The file I/O mechanisms available in C are through the usage of functions. It is easy to use files because operations on files are almost similar to console I/O operations. In fact, file I/O functions like fprintf, fscanf, fgetc, and fputc are replicas of their console counterparts except for an additional parameter indicating the file referred.

Random file access is achieved using the fseek function, which moves the file-position indicator to a specific position in the file.

The reader is expected to write programs involving files as much as possible, so that a certain degree of ease is attained with respect to file manipulation.

New Terminology Checklist

Access

Error

Disk

Operations

Exercises

1.

Write a program to convert all the characters in a text file to uppercase.

2.

Write a program to count the number of words and characters (excluding spaces, tabs, and new lines) in a file.

3.

Write a program to concatenate two input files into a resultant output file.

4.

Write a program to search for a pattern in a file.

5.

Write a program to find the number of occurrences of the most frequently occurring character (excluding whitespaces) in a text file.

6.

Repeat the above exercise for counting the top-three most frequently occurring words in a text file.

7.

Write a program that compares two files given in command line. If they are identical, the program prints a proper message; otherwise prints the line number and column position where the files differ for the first time.

8.

Write a program to output the common words in two given text files.

9.

Consider that a C program is to be executed where the input stream is redirected to a disk file. How will the program know that the input is redirected to that specified file?

10.

If we attempt to do I/O on a stream that is not opened properly, what is expected to happen?