..  _file-input-output:

#################
File Input/Output
#################

..  include::   /header.inc
..  vim:filetype=rst spell:

We now have a decent set of tools for solving problems that are more
realistic than the simple ones we have studied up to now. We have one more
basic concept to get into, though - reading and writing *data files*. 

In the real world (wherever that is!) data sets that you need to process can
get huge. In my supercomputer days, we regularly processed data sets 10
gigabytes big (or more). Often, these data sets were created by other programs
or by experimental systems, and needed to be analyzed to figure out what went
on during a test. In order to understand how we can deal with problems like
this, we need to understand how a computer can access that thing we call a
*file*! 

What is a File, anyway?
#######################

Basically, a *file* is just a sequence of bytes of data written onto a disk and
given a name so the operating system can find it. We will *stream* those bytes
into our programs, putting them someplace (like in an array, perhaps) and
*stream* them back out to be written into another file when we are done. 

What the *stream* of bytes represents is very problem dependent, and we will
need to teach our programs how to break up the stream into chunks that make
sense. For instance, a stream of integer numbers would be broken every four
bytes (if integers are 32 bits long in our system). Such a file would be called
a *binary* file since the data is really written out in binary form. Other
files may be made up of *character* data. In this case, the stream is just a
bunch of characters written one character per byte. Files containing streams of
characters are called *text files* for obvious reasons.
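
To make the binary/text distinction concrete, here is a minimal sketch that
writes the same number both ways and counts the bytes that land in each file.
The function names (``writeAsText``, ``writeAsBinary``, ``fileSize``) are made
up for this illustration, and using ``ios::binary`` with ``write()`` is just
one common way to store raw bytes, not something the rest of this lecture
depends on.

..  code-block:: cpp

    #include <fstream>
    #include <string>

    using namespace std;

    // Store the number as characters, one digit per byte.
    void writeAsText(const string& fileName, int value) {
        ofstream out(fileName.c_str());
        out << value;
    }

    // Store the number exactly as it sits in memory (sizeof(int) bytes).
    void writeAsBinary(const string& fileName, int value) {
        ofstream out(fileName.c_str(), ios::binary);
        out.write(reinterpret_cast<const char*>(&value), sizeof(value));
    }

    // Count the bytes in a file by reading it one byte at a time.
    // Returns -1 if the file cannot be opened.
    long fileSize(const string& fileName) {
        ifstream in(fileName.c_str(), ios::binary);
        if (!in.is_open()) {
            return -1;
        }
        long count = 0;
        char ch;
        while (in.get(ch)) {    // get() reports failure at the end of the file
            count++;
        }
        return count;
    }

On a system with 32-bit integers, writing 1234567 as text leaves a 7-byte file
(one byte per digit), while writing it in binary leaves a 4-byte file.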

Character streams get treated a bit differently in the computer world, since
they get processed so often. We usually think of streams of characters in terms
of lines of text broken up by some kind of end-line marker. Unfortunately, the
exact end-line marker differs between systems (ask Microsoft why) but we will
not need to worry about this unless we move a file from Windows to Linux, for
example. 

Accessing a File
################

So, how do we access a file? Well, we have been doing something like this all
along, anytime we use the ``cout`` and ``cin`` features of the C++ language.
These two constructs are more properly called *character stream* routines - one
for pulling characters into a program, and one for pushing characters out of a
program. The **cin** stream is pulled in from a weird file called the
*keyboard* which is attached to YOU! Only you know what the stream will
contain! The **cout** stream is attached to the console, and gets displayed on
a funny black background window displayed when you run the program. 

The two operators you use are really called "stream operators":

    * "<<" inserts characters into the output stream

    * ">>" extracts a stream of characters from the input stream

Console Streams
===============

Your operating system will connect your program to one of these files, and you
can access that file by streaming data into or out of your program. The special
files ``cin`` and ``cout`` are streams that are unique in that they are attached to
physical devices on your system, and those devices either produce a stream or
accept a stream. There are many physical devices that act this way. Soon, you
will think of data moving from place to place as a pretty natural thing to do,
and start thinking up clever ways to do that movement.

..  note::

    Throughout history humans have been thinking up ways to send data from
    place to place. It may have been a message wrapped up in animal skins on
    some rock, thrown to another human, or attached to an arrow to get it to
    move further. We used campfires and blankets to send signals, using light
    or smoke. Then we got smarter and used lights we could turn on and off in
    some pattern. We figured out how to encode data some way. When electricity
    came along, we moved signals over wires using Morse code. Today, we
    "stream" gazillion bits of data, just so we can watch some silly movie on
    our iPhone. Progress, right?

Your program does not really care about all of that. In fact it is common to
tell the operating system to stream data into your program through ``cin`` from
a text file, not a human. You do that this way:

..  code-block::    bash

    $ myprogram < datafile.txt

This is called "redirection", and you can also tell the operating system to
"redirect" the output of a program to another file, not the console:

..  code-block:: bash

    $ myprog > savedata.txt

If you want, you can do both at the same time:

..  code-block:: text

    $ myprog < mydata.txt > savedata.txt

Cute!

It gets even cooler when you discover that one program can feed another one. If
you send the output from one program into the input of another, you can build
tools out of small programs that do simple things to the stream moving through
them. This was a key concept in designing the Unix operating system:

..  code-block:: bash

    $ prog1 | prog2

We call that vertical bar a "pipe" connecting the two programs together. The
operating system launches both programs at the same time and connects the
streams to the console and to each other as needed.
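
A program that works well in a pipe is usually a small *filter*: it reads its
input stream until the stream runs dry and writes its results to its output
stream. Here is a minimal sketch of such a filter that converts its input to
upper case. The names ``shout`` and ``shoutText`` are hypothetical, and the
``shoutText`` wrapper uses string streams (from ``<sstream>``, not covered in
this lecture) only so the filter can be tried without a keyboard or a pipe.

..  code-block:: cpp

    #include <cctype>
    #include <iostream>
    #include <sstream>
    #include <string>

    using namespace std;

    // Copy every character from 'in' to 'out', converting letters to upper case.
    void shout(istream& in, ostream& out) {
        char ch;
        while (in.get(ch)) {
            out << static_cast<char>(toupper(static_cast<unsigned char>(ch)));
        }
    }

    // Convenience wrapper: run the filter over a string instead of a live stream.
    string shoutText(const string& text) {
        istringstream in(text);
        ostringstream out;
        shout(in, out);
        return out.str();
    }

Calling ``shout(cin, cout);`` from ``main`` would turn this into a program
that can sit on either side of a pipe, or be fed with ``<`` and ``>``
redirection.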

Reading, Writing
================

(No Arithmetic here!)


We call the process of accessing an existing file "reading". When we send data
to a file, usually a new file, it is called "writing". If you want to tack
data onto the end of an existing file, it is called "appending". Most of the
time, we write at the end of a file, not in the middle. If you started writing
over data in the middle of a file, you might end up making a mess, so we rarely
do that unless we really understand how the data file is laid out. We will not
explore that kind of file access here.

It is important to think about which way you want data to move relative to our
programs. We "read" data into our programs from an input file stream, and
"write" data from our programs to an output file stream. 

..  note::

    That header file you always add to your program is called ``iostream``,
    right? What do you think is going on in that package of code?

Before we can do either, we need to open the files! 

..  note::

    Your operating system automatically "opens" access to the console. These
    files are called "standard input" (``cin``) and "standard output"
    (``cout``).

Opening a file
==============

The first step we need to perform to access a file is to ask the operating
system to hook us up with one (or more). To do this, we will need to come up
with a name for the file we want to work with. This file can be new - meaning
we want to create it, or it might already be in the file system the operating
system controls, meaning we need to know its name. The name for a file can be
anything the operating system allows - including a full path name which on
Windows systems could be something like ``c:\COSC1337\lab11\data.set``. (For
reasons known only to M$, Windows uses backslashes for path separators, and
inside a C++ string literal you must double each one, as in
``"c:\\COSC1337\\lab11\\data.set"`` - Unix uses single forward slashes.)

..  note::

    C++ will accept forward slashes on any system, so most folks write code that way.

..  warning::

    If you just enter a name with no path information, the file is assumed to
    live in the current working directory, which is normally the folder you
    ran the program from (often the same folder as the program executable
    file). (Yikes - another file we could read - wonder what that one looks
    like! Only the loader knows for sure!) If the file is in some subdirectory
    inside of that folder, add that folder name to the name you use in your
    code:

    * docs/README.rst

    Never write code with file names hard-coded to your system, unless you are
    sure your code will never run anywhere else. I see students do that all the
    time, and when I try to run their code on my system, nothing works
    properly! YMMV!


We also need to tell the operating system which way our stream will go - in or
out. (It is possible to do both at the same time, but that is something you
learn in a later course!) 

Here is how we do this magic in C++: 

..  code-block:: c

    #include <fstream>

    using namespace std;

    int main()
    {
        ifstream inFile;
        ofstream outFile;

        inFile.open("test.dat", ios::in);
        outFile.open("testcopy.dat", ios::out);

        if (inFile.is_open()) {
            // process the input file
            inFile.close();
        }

        if (outFile.is_open()) {
            // process the output file
            outFile.close();
        }
    }

In the above example, we create two variables which are very different critters
from those we have been building up to now. These variables (more properly -
*objects*) will be connected to files by the operating system. They are weird
kinds of boxes that can get really big! We include the ``fstream`` header now,
because we intend to access real files, not the standard ones we get with
``iostream``. The header defines file stream classes, so we create objects
from those classes as usual. In this example, we create one for input, and one
for output.

The **ifstream** and **ofstream** classes are the two one-way file stream
classes that the ``fstream`` header provides. The **if** stands for *input
file*, and the **of** stands for *output file*. Make sense?

..  note::

    The header also provides a two-way ``fstream`` class, but we will not use
    it here. Use ``ifstream`` or ``ofstream`` to set up a file with the right
    direction ready to go.

Once we have created these variables, we hook them up to the operating system
files by calling the **open** method associated with those new file variables.
The **open** function interacts with the operating system and sends it the name
of the file we want to work with, and information needed to say which way we
want data to move. (In both of these open functions the **ios::in** and
**ios::out** stuff is actually not needed, since we already said which way we
want to go when we created the variables. We will need to include something
here later, though - so stay tuned!)

There is always the possibility that something can go wrong when you try to
open files for access. If you want to open a file for input, you expect the
file to already exist. What happens if it does not? The operating system should
tell you that so you can handle that situation. The **inFile.is_open()**
method returns a *boolean* value telling you if the file was found and
successfully opened for reading. The **outFile.is_open()** method says the
same thing for an output file. In this case, maybe there is no room for a new
file, or you are trying to write some place where you are not allowed to write.
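
As a minimal sketch of that checking, the two helper functions below try to
open a file and report what happened on the console. The function names and
messages are invented for this example. Be aware that a successful open for
writing creates the file, or empties it if it already exists.

..  code-block:: cpp

    #include <fstream>
    #include <iostream>
    #include <string>

    using namespace std;

    // Try to open a file for reading and report the result.
    // Returns true if the file was found and opened.
    bool reportOpenForReading(const string& fileName) {
        ifstream inFile;
        inFile.open(fileName.c_str());   // c_str() keeps older compilers happy
        if (!inFile.is_open()) {
            cout << "Cannot open " << fileName << " for reading" << endl;
            return false;
        }
        cout << fileName << " is ready for reading" << endl;
        inFile.close();
        return true;
    }

    // Try to open a file for writing and report the result.
    // Returns true if the file could be created (or overwritten).
    bool reportOpenForWriting(const string& fileName) {
        ofstream outFile;
        outFile.open(fileName.c_str());
        if (!outFile.is_open()) {
            cout << "Cannot open " << fileName << " for writing" << endl;
            return false;
        }
        cout << fileName << " is ready for writing" << endl;
        outFile.close();
        return true;
    }

Asking ``reportOpenForReading`` for a file that does not exist, or asking
``reportOpenForWriting`` for a file inside a folder that does not exist,
should print the "Cannot open" message and return ``false``.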

Reading, Writing, and Arithmetic
================================

Now that we have files available, reading and writing (and a little math)
should look familiar to you: 

..  code-block:: c

    int mydata;
    
    inFile >> mydata;
    mydata = mydata * 5;
    outFile << mydata << endl;


What has changed here? Simply the name of the variable we need to use. We use
**inFile** here, where we used **cin** before, and we use **outFile** here,
where we used **cout** before. In both cases, we will be working with streams
of characters. The **inFile** variable expects to find characters in the
stream, and will convert what it finds to whatever data type you tell it to
read (an integer in this case). On output the **outFile** variable will take
whatever data container you specify, and convert the data contained in that
container into a sequence of characters and send it to the output file. Pretty
simple, huh? 
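
Putting that fragment together with the open checks from the previous section
gives a complete routine. This is only a sketch: the function name
``timesFive`` and its true/false return convention are assumptions made for
the example, and it adds a ``fail()`` check in case the file does not start
with a number.

..  code-block:: cpp

    #include <fstream>
    #include <string>

    using namespace std;

    // Read one integer from inName, multiply it by 5, and write it to outName.
    // Returns false if a file cannot be opened or no integer could be read.
    bool timesFive(const string& inName, const string& outName) {
        ifstream inFile;
        inFile.open(inName.c_str());
        if (!inFile.is_open()) {
            return false;
        }

        int mydata;
        inFile >> mydata;               // characters "42" become the integer 42
        if (inFile.fail()) {
            return false;               // the file did not start with a number
        }
        inFile.close();

        ofstream outFile;
        outFile.open(outName.c_str());
        if (!outFile.is_open()) {
            return false;
        }
        mydata = mydata * 5;
        outFile << mydata << endl;      // the integer becomes characters again
        outFile.close();
        return true;
    }

If the input file holds the characters ``42``, the output file ends up holding
the characters ``210`` followed by an end-line marker.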

What about the end of the file?
===============================

We do have a bit of a problem now, that we did not have earlier. What happens
when the program runs out of stuff to read when asking for data? With **cin**,
you were the source of the data, and you never run out of stuff to say! If the
file we are reading from runs out of stuff - it does have a real end! - we need
a way to figure this out. There is another method that we can use -
**eof()**. We will see this in action in the next example: 

Let's copy a file!
##################

Just to do something simple, let's open a file for input, another for output,
and make a copy of the file! 

..  code-block:: c

    #include <cstdlib>
    #include <iostream>
    #include <fstream>
    
    using namespace std;
    
    int main(int argc, char * argv[])
    {
        cout << "File Copy Program" << endl;
        ifstream inFile;
        ofstream outFile;
        
        char ch;
    
        inFile.open("mymain.cpp");
        if (inFile.is_open()) {
            // we have a file for reading
            outFile.open("newcopy.cpp");
            if(outFile.is_open()) {
                //we have a file for writing
                inFile.get(ch);
                while(!inFile.eof()) {
                    outFile << ch;
                    inFile.get(ch);
                }
                outFile.close();
            } else {
                cout << "Cannot open output file" << endl;
            }
            inFile.close();
        } else {
            cout << "input file not found" << endl;
        }
    }


Boy, that is a lot of work just to copy a file. Well, it is, but it shows
several patterns we will want to use when we play with files! 

Notice how I set this code up. For each of the files, I checked if the file was
successfully opened before proceeding to do the copy. If not, I write out an
error message on the console so I can figure out what happened. Here is what
the program generates when run as it stands now:

..  code-block:: bash

    ./demo
    File Copy Program
    input file not found

That makes sense, I did not have a file called **mymain.cpp** in the current
directory where this program ran! Let's change the file name to **main.cpp**
(assuming you saved your project file as **main.cpp**) and see what happens: 

..  code-block:: bash

    ./demo
    File Copy Program


Hmmm, what happened? Check the directory where your project lives. Do you see a
new file called **newcopy.cpp**? If so, check it to see what it contains! 

File Read Access Patterns
#########################

The most important pattern we see in the above example is the one we use to
safely read data from a file. In the example above, rather than use the funky
**>>** notation to read data from a file, I used **inFile.get(ch)**, which
reads a single character from the input file. (This avoids a situation where
C++ is too smart for its own good when we are just trying to read a bunch of
characters!) 

The pattern here wants to make sure we have something useful to read from the
file. Think about what could happen when you read from the file. The file could
be missing, it could be empty, or it could have good stuff to read. What will
we do in each case?

Checking for file missing
=========================

We handle the first case by making sure we get a good file open action. The
**is_open** function will tell us if we got the result we want. 

Checking if we have a character to read
=======================================

The only way we will be able to see if we have a character to read is to try it
and see! C++ sets up an error condition if it is asked to read something from a
file and fails. We can check this error condition to make sure we got what we
were after. So, the pattern we want to use to read however much stuff is in the
file looks like this: 

..  code-block:: c

    inFile.get(ch);
    while(!inFile.eof()) {
        // do something with ch
        inFile.get(ch);
    }

This is called a *seeded loop* since we *seed* the loop by trying our first
read outside the loop. If we read a good data item, the system will not set an
error flag. The function **eof()** - End of File - will tell us if we hit the
end (or never had anything to begin with!) If we have a good data item, we
enter the loop and immediately process that data item. After we finish, we try
for the next data item. Once again, if we run out of stuff to read, the *end of
file* error flag will be set and we will see that when we loop back and check
it with **eof()** again. You should see how this will handle both the nothing
to read situation, and the general situation where we have an arbitrary amount
of stuff to read. 
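
The same seeded loop works for jobs other than copying. As a minimal sketch,
the function below uses it to count the end-of-line markers in a text file.
The name ``countLines`` and the choice of -1 as the "could not open" answer
are assumptions made for this example.

..  code-block:: cpp

    #include <fstream>
    #include <string>

    using namespace std;

    // Count the end-of-line markers in a text file using the seeded loop.
    // Returns -1 if the file cannot be opened.
    int countLines(const string& fileName) {
        ifstream inFile;
        inFile.open(fileName.c_str());
        if (!inFile.is_open()) {
            return -1;
        }

        int lines = 0;
        char ch;
        inFile.get(ch);                 // seed: try the first read
        while (!inFile.eof()) {
            if (ch == '\n') {           // process the character we just read
                lines++;
            }
            inFile.get(ch);             // try for the next character
        }
        inFile.close();
        return lines;
    }

An empty file never enters the loop, so the function returns 0 for it, which
is exactly the "nothing to read" case described above.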

The copy routine above just reads characters from the input file, then turns
around and writes them back out to the output file. Everything gets copied -
even the end of line markers (which we see internally as a ``\n`` character).
Once we hit the end of the data, we close both files, which disconnects them
from the program and tells the operating system to finish up for us. 

If we fail to close files, the C++ system will do that for us when the program
ends. However, if your program bombs and you do not get to the normal end, you
may end up with files in an undefined state. Normally, the operating system
will keep you out of trouble, but you might need to clean up after yourself if
things go really wrong! 

File Write Access
=================

Normally, you will not run into problems writing to a file that you succeed in
opening. The operating system will create a new, empty, file with the name you
specify when it opens the file. Once that has been done, you write data to that
file until you are finished. The data is streamed to the output file, like
writing on a tape. You cannot back up and undo anything. (It is possible to
back up, but we will not worry about that here.) 

If the file you name for output already exists in the operating system files,
the old copy of that file gets zapped and a new one is set up just like
before. For this reason, you need to be careful that you do not kill a file you
really want to keep around. Careful editing programs avoid writing directly to
the file they are working on (they create a new working file to play with) so
they do not kill the old file you are working on before they save the new
changes safely away. In this way, you can always kill the editor and not lose
the original file.

What about adding stuff to a file?
==================================

Sometimes, we want to add data to the end of an existing file. We can do this
very easily by using a different kind of open command. It looks like this: 

..  code-block:: c

    outFile.open("datafile.txt", ios::app);


Here the **ios::app** says open the file for appending, and anything you
write to this file will be added to the end of the previous contents. 
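
Here is a minimal sketch of appending wrapped up as a reusable function. The
name ``appendLine`` is hypothetical. It opens the file in append mode, adds one
line of text, and closes the file again, so each call leaves the earlier
contents alone.

..  code-block:: cpp

    #include <fstream>
    #include <string>

    using namespace std;

    // Add one line of text to the end of a file, creating the file if needed.
    // Returns false if the file cannot be opened.
    bool appendLine(const string& fileName, const string& text) {
        ofstream outFile;
        outFile.open(fileName.c_str(), ios::app);
        if (!outFile.is_open()) {
            return false;
        }
        outFile << text << endl;
        outFile.close();
        return true;
    }

Calling ``appendLine`` twice on the same file leaves both lines in it, in the
order they were written. Opening the same file without ``ios::app`` would
instead wipe out whatever was there.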

Sorting an Array of Names
##########################

As an example of using files, let's read in a list of names, and sort them.
Then, let's write out the names in order. This example is going to use code we
have discussed in the last few lectures. 

As usual, we will build up the program in baby steps! 

Step one - getting started:
===========================

Here is our standard starting point: 

..  code-block:: c

    #include <cstdlib>
    #include <iostream>
    
    using namespace std;
    
    int main(int argc, char * argv[])
    {
        cout << "Name Sorter Program" << endl;
        
        system("PAUSE");
        return EXIT_SUCCESS;
    }


I am not going to run this, you know what it will do! Next, let's build a data
file to use. Create this file and save it as **names.dat** in the folder with
your project (you can use Dev-C++ to do this). 

..  code-block:: bash

    Flintstone,Wilma
    Flintstone,Fred
    Rubble,Barney
    Dinosaur,Dino


Hmmm, how am I going to read this? Well, we saw an example of reading a bunch
of characters earlier, can we read a bunch of characters into some string
variables (like an array?) Sure! We need to set up the input file and string
variables first: 

..  code-block:: c

    #include <string>
    #include <fstream>
    
        ifstream inFile;
    
        string myline;
    ...
        inFile.open("names.dat");
        inFile >> myline;
        while(!inFile.eof()) {
            cout << myline.length() << " " << myline << endl;
            inFile >> myline;
        }
        inFile.close();
    ...


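This fragment sets up the input file, then starts off by trying a line read
using the normal **>>** operator. It then just prints out what it read, and
tries again. The loop will keep this up until the inner attempt to read fails,
at which time, we leave the loop. Note that **>>** stops at the first space or
end-of-line marker, so it only picks up a whole line when the line has no
spaces in it. The loop also counts on the file ending with an end-of-line
marker; without one, the last name is read but never processed.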

..  code-block:: bash

    Name Sorter Program
    16 Flintstone,Wilma
    15 Flintstone,Fred
    13 Rubble,Barney
    13 Dinosaur,Dino
    Press any key to continue . . .
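
If your names do contain spaces, the ``>>`` operator will chop each line into
pieces. One alternative, sketched below, is the standard ``getline`` function,
which reads everything up to the end-of-line marker. This is not the approach
the rest of this lecture uses. The function name ``readLines`` and its
parameters are made up for the example, and it tests the result of each read
directly instead of calling **eof()**.

..  code-block:: cpp

    #include <fstream>
    #include <string>

    using namespace std;

    // Read up to 'capacity' whole lines from a file into the 'lines' array.
    // Returns the number of lines stored, or -1 if the file cannot be opened.
    int readLines(const string& fileName, string lines[], int capacity) {
        ifstream inFile;
        inFile.open(fileName.c_str());
        if (!inFile.is_open()) {
            return -1;
        }

        int count = 0;
        string line;
        // getline() keeps the spaces and throws away the end-of-line marker.
        // The loop stops when the array is full or a read fails.
        while (count < capacity && getline(inFile, line)) {
            lines[count] = line;
            count = count + 1;
        }
        inFile.close();
        return count;
    }

Because the loop checks each read as it happens, a last line with no
end-of-line marker after it is still stored, and the ``capacity`` check keeps
the function from writing past the end of the array.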


Now, we need to store the names somewhere - like in an array! 

..  code-block:: c

    string names[25];
    int i = 0;
    ...
        cout << myline.length() << " " << myline << endl;
        names[i++] = myline;
        inFile >> myline;
    }
    cout << "You read " << i << " names." << endl;


This is an interesting pattern. We are in a loop, reading lines. We have set up
an integer (initialized to zero on the declaration) and we use that integer to
index into the **names** array. As we save the current line into the right
spot, we use a *post increment* operator on the index. What does this do? 

Well, the real meaning of the ``i++`` notation is this:

* take the current value out of ``i`` and use it in this spot

  * That means we save the line string in index ``names[i]``

* When you get through using that value, increment it by one and put it back

  * That means when the smoke clears, ``i`` will be one bigger, for the next pass!

Just the kind of behaviour we want. When the loop ends, **i** will be the
number of names we read! 
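
To watch the post increment work one step at a time, here is a tiny sketch
with the loop unrolled. The function name ``storeThree`` and the three names
are just for illustration.

..  code-block:: cpp

    #include <string>

    using namespace std;

    // Store three names using the post-increment pattern and return how many
    // were stored. The array must have room for at least three strings.
    int storeThree(string names[]) {
        int i = 0;
        names[i++] = "Wilma";   // stored in names[0], then i becomes 1
        names[i++] = "Fred";    // stored in names[1], then i becomes 2
        names[i++] = "Barney";  // stored in names[2], then i becomes 3
        return i;               // i is now the count of names stored
    }

After calling ``storeThree``, the returned value is 3, and the names sit in
slots 0, 1, and 2 of the array.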

..  note::

    Now you know what the name C++ really means - it is one better than the
    old language `C`!

..  code-block:: bash

    Name Sorter Program
    16 Flintstone,Wilma
    15 Flintstone,Fred
    13 Rubble,Barney
    13 Dinosaur,Dino
    You read 4 names.
    Press any key to continue . . .


Looks like we have the input side handled. Let's prove it by displaying the
names in the string array: 

..  code-block:: c
    
    for(int j=0;j<i;j++) {
        cout << "Name " << j << " " << names[j] << endl;
    }


..  code-block:: bash

    Name Sorter Program
    16 Flintstone,Wilma
    15 Flintstone,Fred
    13 Rubble,Barney
    13 Dinosaur,Dino
    You read 4 names.
    Name 0 Flintstone,Wilma
    Name 1 Flintstone,Fred
    Name 2 Rubble,Barney
    Name 3 Dinosaur,Dino
    Press any key to continue . . .


We are making progress. We have read the names into the array, and we can print
them out. Now, let's change the output part to write to our final file: 

..  code-block:: c

    ofstream outFile;
    
    outFile.open("SortedNames.dat");
    ...
    
    for(int j=0;j<i;j++) {
        outFile << names[j];
    }
    outFile.close();

This is what we get in the file after running the program:

..  code-block:: bash

    Flintstone,WilmaFlintstone,FredRubble,BarneyDinosaur,Dino


Tilt! It looks like we lost the end of line markers. We need to place one in
the output file ourselves (by adding an **endl** to the end of the output
line). Then we get what we want:

..  code-block:: bash

    Flintstone,Wilma
    Flintstone,Fred
    Rubble,Barney
    Dinosaur,Dino


That worked just fine. Only, we wanted to sort these lines, and all we did was
copy them! On to the next part! 

Now, for the sorting
====================

Here is a chunk of code that can sort an array of strings. It is called
*BubbleSort* since it works by looking at all the names in the array and
*swapping* them if they are out of order. This is pretty fancy code, but you
should understand what each line does, even if the exact logic of the sort is a
bit fuzzy. (It might help to think about how you might do this if the names
were written on a deck of cards - on the first pass, you could examine any two
successive cards in the deck, and exchange them if they are out of order. As you
work your way through the deck, high cards will "bubble" their way toward the
bottom of the deck. Once the first pass is complete, you do the job again. You
keep this up until all the cards are in the right order.) Note that the routine
has some output statements so you can see how it works. We would pull those
lines from a real routine we wanted to use. 

..  code-block:: c

    void BubbleSort(string A[], int count) {
        for(int i=0;i<count-1;i++) {
            cout << "Pass" << i << endl;
            for(int j=0;j<count-i-1;j++) {
                cout << A[j] << " " << A[j+1] << endl;
                if(A[j] > A[j+1]) {
                    string temp = A[j];
                    A[j] = A[j+1];
                    A[j+1] = temp;
                } 
            }
        }
    }

This code sweeps over the array of strings from top to bottom. On the first
loop it compares the first two items to see if they are in order. If they are,
it moves on to the next two items and checks them. If the first two are out of
order, it swaps them moving the first one down. It then moves down and repeats
this check on the next two. As it works its way through the array, you might be
able to see that the "bigger" (in an alphabetic sense) string "bubbles" to the
bottom of the array. We go back up to the top and do this again. Since we
already put the last string at the bottom, we can stop one item from the bottom
on the second pass, effectively bubbling the second to the last string into
place. Keep this up and eventually, all the strings are in place. Think about
it!

Here is what we get:

..  code-block:: bash

    Dinosaur,Dino
    Flintstone,Fred
    Flintstone,Wilma
    Rubble,Barney


Cool! Looks like we got it done! 

Here is the entire program: 

..  code-block:: c

    #include <cstdlib>
    #include <iostream>
    #include <fstream>
    #include <string>
    
    using namespace std;
    void BubbleSort(string[], int);
    
    int main(int argc, char * argv[])
    {
        string names[25];
        string myline;
        int i = 0;
        ifstream inFile;
        ofstream outFile;
        
        cout << "Name Sorter Program" << endl;
        inFile.open("names.dat");
        inFile >> myline;
        while(!inFile.eof()) {
            cout << myline.length() << " " << myline << endl;
            names[i++] = myline;
            inFile >> myline;
        }
        inFile.close();
        cout << "You read " << i << " names." << endl;   
        cout << "Sorting data" << endl;
        BubbleSort(names, i);
           
        outFile.open("SortedNames.dat");
    
        for(int j=0;j<i;j++) {
            outFile << names[j] << endl;
        }
        outFile.close();
        system("PAUSE");
        return EXIT_SUCCESS;
    }
    
    void BubbleSort(string A[], int count) {
        for(int i=0;i<count-1;i++) {
            cout << "Pass" << i << endl;
            for(int j=0;j<count-i-1;j++) {
                cout << A[j] << " " << A[j+1] << endl;
                if(A[j] > A[j+1]) {
                    string temp = A[j];
                    A[j] = A[j+1];
                    A[j+1] = temp;
                } 
            }
        }
    }


This week's assignment
######################

Before you try this week's lab, I want you to set up the code in this lecture and make sure you can make it work. Change the list of names and make it bigger to make sure your code works the way it is supposed to! Then you should be in good shape to try the lab.

This is the last lecture on material you are responsible for in this class. I have two additional lectures that I will post to give you a look at what else is available in the world of computer programming, but that material is not anything you need to learn for the last exam.

Be sure to check out the end of term guidance posted on the web!