Thursday, 30 July 2015

Reading a CSV file in Matlab with header rows and mixed text, timestamps, integers and doubles

I wanted to read in a CSV file in Matlab, but was having problems using csvread(), which only supports files with only numeric values. It seems as though the easiest way to read in a mixed CSV file, is to read in the header first, and then read in the data using textscan.

I based my code on an answer from StackExchange. For reading the header, the following code worked well for me:
fid = fopen(file, 'r');
Header = fgetl(fid);
fclose(fid);

Header = regexp(Header, '([^,]*)', 'tokens');
Header = cat(2, Header{:});
The values are then in cells, and we need to use the function cell2mat() to convert the data in to something we can work with. This works for both strings and numerical values.

Then to read the main body of the text file, I used textscan(). This requires a string defining the format. As my format string is quite long, due to having a lot of columns, I've defined the format string separately. I have used the following code to define the format:
format_a='%u%u%s%u%u%f%f%u%u';
format_b=repmat('%f',1,109); %repeat the string '%f' 109 times
format_c='%u%u%u';

format=strcat(format_a, format_b, format_c);
This generates a long string specifying the format of the file that will be read in. The input string '%u' is an unsigned integer, and the input string is for '%s' strings. I also has a time and date column in the CSV, which I read with the datenum() function later, so for now I have treated it as a string.

The reading in of the file is done with the textscan() function in the following few lines:
fid = fopen(file, 'r');D = textscan(fid, format, 'Delimiter', ',', 'HeaderLines', 1);fclose(fid);

By setting the Delimiter, this allows other options than only the comma. Also the number of header columns can be set by the number following 'HeaderLines'.

This reads all of the data in to the variable which is saved in cells. So to read in the numbers from the cells, I have used the function cell2mat(). Here is an example of how I read in one column:
index=cell2mat(D(1));
Similarly, to read in the data from multiple columns, I've used the following code:
 data=cell2mat(D(10:118));
 This works for integers and doubles. To handle the date, I have used the following line:
timestamps=datenum(D{3}, 'dd/mm/yyyy HH:MM');
This interprets the information in the strings, and creates a "datetime" object, which can be used much more easily when plotting data.

Thursday, 16 July 2015

Matlab PLSR and PCA Tutorial - Undefined Function pca

When following the Matlab tutorial for "Partial Least Squares Regression and Principal Components Regression", there is an error with the following line:
[PCALoadings,PCAScores,PCAVar] = pca(X,'Economy',false);
The error is as follows:
 Undefined function 'pca' for input arguments of type 'double'.
The pca() function is actually the same function as the princomp() function (source), however this function doesn't take the same arguments as pca().  Hence removing all other arguments accept for "X" and replacing the function with princomp() appears to give the same results.

The new line that should replace it is as follows:
[PCALoadings,PCAScores,PCAVar] = princomp(X);
Continuing with the tutorial produces identical plots for the remainder of the tutorial.