READ [FILE ("fileName" | fileNameStringVariable) | ARGFILE] [MISSING literalNumber] [ROWS] [HEADING] [FILL] resultVariable[formatSpec] {resultVariable[formatSpec]}

Reads a file into one or more result variables (vectors). The keywords may be written in any order on the command line.

By default, each column in the file is copied into a vector listed in the READ command. The first column goes into the first vector, the second column into the second vector, etc.

If the rows keyword is present, then each row (line) in the file is copied into a vector listed in the READ command. The columns in the file must be separated by delimiters or width fields as described below.

By default, if there are fewer columns in the file than vectors on the READ command line, then the remaining vectors will be empty. If the fill keyword is present, then the remaining vectors will contain NaNs. If there are more columns in the file than vectors on the READ command line, the extra columns will be ignored.

If the rows keyword is present and there are fewer rows (lines) in the file than vectors on the READ command line, the remaining vectors will be empty by default. If the fill keyword is also present, the remaining vectors will contain NaNs. If there are more rows (lines) in the file than vectors on the READ command line, the extra rows will be ignored.

If there is an empty or blank field between delimiters or in a field defined by a width format, it will be interpreted as NaN.

The data in the file may be numbers, including "." and "NaN" (without the quotes), and strings that are the names of named constants. If the names of named constants are used, those constants must be defined in the program before the READ command is reached. See the last example in the right panel.

If the file keyword and its argument are absent, an Open File dialog will appear requesting the user to select a file.

If the file keyword and its filename argument are present, the filename may be the complete path name or a relative path. A relative path will be relative to the current user directory (the directory containing the Statistics101.jar file).

The filename must be enclosed in double-quotes.

Nested directories (such as "C:/dir1/dir2/myFile") should be delimited by "/" or by "\\" characters. The reason for the double backslash is that in Java as in other C-derived languages, a backslash is interpreted as an escape character and two backslashes are interpreted as a single true backslash. This is true whether using Statistics101 in Windows, Unix, Linux, or MacOS.

When the user program executes the READ command, if the file is not found, a dialog will appear asking if the user wants to search for the file. If the user clicks "yes", a file dialog will appear to allow the search. If the user clicks "no" or cancels the search, an error message will be printed to the Output Window and the program will be aborted.

If the argfile keyword is present, then the filename will be the path entered using the -r switch on the command line that invoked Statistics101. See Command Line Invocation. If the -r switch is absent or did not name a file, then the user will be requested to select a file.

If the missing keyword and its argument are present, any value read from the file that matches the number will be interpreted as NaN (Not a Number). Also, if the string "NaN" (without quotes and without regard to case) is found in the file, it will be interpreted as NaN even if the missing keyword is present.

If the missing keyword is absent and the word "NaN" (without quotes and without regard to case) is found in the input file, it is interpreted as NaN. If "NaN" is used in a fixed column-width format input file, the column width must be at least three for any columns containing "NaN".

The input file's columns may be described using an optional formatSpec that specifies the column's width. A width formatSpec follows the variable name it applies to and consists of a percent sign ("%") followed by an integer that says how many characters make up that variable's column. See examples at right.

The general READ format specification is:

%width

where:

  • % introduces the formatSpec, separating it from its related variable name.

  • width is a number that specifies the exact width, in characters, of this variable's data in the file.

If one result variable vector has a formatSpec, they all must have one.

In the absence of a formatSpec, the READ command assumes that the columns in the input file are separated by

  • A comma,

  • A tab, or

  • One or more spaces.

In the input file, leading blanks are ignored, blanks on either side of a comma or tab are ignored, and multiple consecutive blanks are considered to be one delimiter.

It is best to use the same delimiter (all commas, or all tabs, or all spaces) throughout one input file. It is recommended that, if possible, you use commas as your delimiter to avoid possible interpretation surprises as shown in the examples under the heading "Differences between delimiters" at right.

The keyword heading informs the READ that the file contains column headings or row headings, depending on whether the rows keyword is absent or present. The column headings must be the first line in the input file. Row headings must be the first item in each row. If headings are present but the heading keyword is not, that will cause an error in reading the file. If headings are absent but the heading keyword is present, then that will also cause a read error. The WRITE command can output data with or without column or row headings.
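
As an illustration of the heading keyword, suppose a hypothetical file "c:/sizes.txt" begins with a line of column headings (the file name, its contents, and the vector names below are assumptions made up for this sketch):

height,weight
60,110
72,210

A command along these lines should then read the two data columns without reporting a read error:

READ file "c:/sizes.txt" HEADING height weight

Omitting the heading keyword for this file would, according to the rule above, cause a read error.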

The following command will ask the user for the input file.

READ a b c

The following command will try to open the file "myFile" and will not ask for the filename:

READ file "myFile" a b c

The next command,

READ file "c:\\folder1\\myFile" a b c

is equivalent to:

READ file "c:/folder1/myFile" a b c
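
The argfile keyword takes the place of the file keyword and its filename. A minimal sketch, assuming Statistics101 was started with the -r switch naming a data file (the vector names are illustrative):

READ argfile a b c

If the -r switch was not given, this command should fall back to asking the user to select a file, as described above.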

The next command interprets any data item in the file, whose value is 99, as a missing data value (NaN).

READ missing 99 file "DataFile.txt" a b c
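
As a sketch of how the missing keyword behaves, suppose "DataFile.txt" held the following two lines (these contents are assumed purely for illustration):

1,99,3
4,5,NaN

With the command above, the 99 in the first line and the "NaN" in the second line should both be read as NaN, so that b would begin with NaN and c would end with NaN. A PRINT command can be added to check the result:

READ missing 99 file "DataFile.txt" a b c
PRINT a b c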

The following command expects that the input file has no delimiters and that the first two characters of each line contain data for variable "a", the next three characters belong to "b", and the next four characters belong to "c". Any value equal to 13 is treated as missing (NaN).

READ file "DataFile2.txt" missing 13 a%2 b%3 c%4
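
To illustrate how the widths carve up each line, suppose "DataFile2.txt" contained the following two lines (assumed contents, shown only as a sketch):

120340056
130770001

Under the format above, the first line would be split into "12" for a, "034" for b, and "0056" for c. In the second line, the first field is "13", which matches the number given to the missing keyword and should therefore be read as NaN. The vectors can be inspected with:

READ file "DataFile2.txt" missing 13 a%2 b%3 c%4
PRINT a b c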

Given a file "c:/input.txt" with these contents:

1 11
2 22
3 33
4 44
5 55
6 66

the following program:

READ file "c:/input.txt" vec1 vec2 vec3
PRINT vec1 vec2 vec3

will produce the following output:

vec1: (1.0 2.0 3.0 4.0 5.0 6.0)
vec2: (11.0 22.0 33.0 44.0 55.0 66.0)
vec3: ()

While adding the fill keyword:

READ file "c:/input.txt" FILL vec1 vec2 vec3
PRINT vec1 vec2 vec3

will produce the following output:

vec1: (1.0 2.0 3.0 4.0 5.0 6.0)
vec2: (11.0 22.0 33.0 44.0 55.0 66.0)
vec3: (NaN NaN NaN NaN NaN NaN)
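
The rows keyword can be tried on the same file. As a sketch based on the rules described above (the vector names row1, row2, and row3 are illustrative), the following program reads the first three lines of the file into three vectors and ignores the remaining lines:

READ file "c:/input.txt" ROWS row1 row2 row3
PRINT row1 row2 row3

Under those rules, row1 should receive 1 and 11, row2 should receive 2 and 22, and row3 should receive 3 and 33.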

Differences between delimiters

Since multiple blanks are ignored in the absence of format specifications, but multiple commas or tabs are not ignored, be aware that a file with these contents:

1 11 111
2 22 222
3 33 333
4    444
5 55 555
6 66 666
7

is not read the same as a file with these contents:

1,11,111
2,22,222
3,33,333
4,,444
5,55,555
6,66,666
7,,

If the file "c:/input.txt" contained the first set of numbers, then this program,

READ file "c:/input.txt" vec1 vec2 vec3
PRINT vec1 vec2 vec3

would produce:

vec1: (1.0 2.0 3.0 4.0 5.0 6.0 7.0)
vec2: (11.0 22.0 33.0 444.0 55.0 66.0)
vec3: (111.0 222.0 333.0 555.0 666.0)

Here, the "444" is read as being in the second column because multiple blanks are considered to be equivalent to one blank. Also, vec3 passes over the blank space left by moving 444 to vec2.

Using the fill keyword in the above READ command would change the output to this:

vec1: (1.0 2.0 3.0 4.0 5.0 6.0 7.0)
vec2: (11.0 22.0 33.0 444.0 55.0 66.0 NaN)
vec3: (111.0 222.0 333.0 NaN 555.0 666.0 NaN)

Note that the fill keyword caused all the vectors to be extended to match the length of the longest vector, with the shorter vectors being lengthened by filling with NaNs.

If c:/input.txt contained the second set of numbers, comma delimited, then the program would produce this (without the fill keyword):

vec1: (1.0 2.0 3.0 4.0 5.0 6.0 7.0)
vec2: (11.0 22.0 33.0 55.0 66.0)
vec3: (111.0 222.0 333.0 444.0 555.0 666.0)

And if run with the fill keyword, it would produce this output:

vec1: (1.0 2.0 3.0 4.0 5.0 6.0 7.0)
vec2: (11.0 22.0 33.0 NaN 55.0 66.0 NaN)
vec3: (111.0 222.0 333.0 444.0 555.0 666.0 NaN)

To avoid such possible confusion, it is best to use commas for the delimiter in your files if possible.

This next example shows the use of named constants as data in a data file. Here are the contents of the file "myData":

male,child,10,48,78
female,adult,31,62,110
female,child,14,.,70
male,adult,25,72,210

Here is the program to read the file:

NAME male female
NAME (1 2) child adult
READ file "myData" Sex Status Age Height Weight
PRINT table sex status age height weight

and here is the output of the program in the Output Window (compressed to fit):

Sex     Status  Age  Height  Weight
male    child   10   48      78
female  adult   31   62      110
female  child   14   NaN     70
male    adult   25   72      210