
Transposing an NXM matrix

Feb 28, 2011 at 10:54 PM

Please forgive my ignorance, but I'm hoping that if I outline my scenario I can get some feedback on whether I can do this with a simple implementation via Math.NET.

I need to transpose research data from rows to columns. My data is in number, string, and date formats, with one variable (plus metadata) per row, and I need to transpose it so that I can produce the data in a single row for multivariate analysis. I have an implementation in place but it is very slow. My biggest challenge has been the dynamic nature of the data, in that I need to go from:

R1 Number

R2 Date

R3 Number

R4 Text

to R1 C1 Number, C2 Date, C3 Number, C4 Text.
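
A rough sketch of the kind of row-to-column pivot I mean, in plain C#/LINQ (the EavRow type, the column ordering, and the CSV handling are just illustrative assumptions, not my real schema):

// Illustrative EAV record; the real data carries more metadata than this.
using System;
using System.Collections.Generic;
using System.Linq;

public class EavRow
{
    public int SubjectId;
    public string Variable;   // e.g. "R1", "R2", ...
    public string Value;      // number, date or text, already formatted as a string
}

public static class Pivot
{
    // One variable per input row in, one comma-delimited line per subject out.
    public static IEnumerable<string> ToWideCsv(IReadOnlyList<EavRow> rows)
    {
        // Fix the column order up front so every output line lines up.
        var columns = rows.Select(r => r.Variable).Distinct().OrderBy(v => v).ToList();

        yield return "SubjectId," + string.Join(",", columns);

        foreach (var subject in rows.GroupBy(r => r.SubjectId).OrderBy(g => g.Key))
        {
            // Assumes one value per subject/variable pair; no CSV escaping, for brevity.
            var byVariable = subject.ToDictionary(r => r.Variable, r => r.Value);
            var cells = columns.Select(c => byVariable.TryGetValue(c, out var v) ? v : "");
            yield return subject.Key + "," + string.Join(",", cells);
        }
    }
}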

Ultimately my data is output as CSV, so I could convert everything to text first; however, then I would lose the ability to do any post-transpose summary math, although I think I can probably get away without that.

My other alternative is to simply read in all my database keys as long ints, transpose them, and send that back to the database to extract and output.

I won't have huge datasets, as I believe there are limitations on the number of columns going back into SAS, SPSS, etc.

Thanks

jack

Feb 28, 2011 at 11:15 PM
Edited Mar 1, 2011 at 12:49 AM

So after trying a few things, it is obvious that only numbers are allowed. I'll have to think about whether it makes sense to just process my pkeys (int32) as a matrix, transpose it, and then link everything back together.
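
For reference, a minimal sketch of that pkey-as-matrix idea with Math.NET Numerics (written against the current Matrix<double> builder API, so the exact calls may differ in older releases; the key layout below is made up):

// Sketch of the "transpose the primary keys" idea with Math.NET Numerics.
using System;
using MathNet.Numerics.LinearAlgebra;

class KeyTranspose
{
    static void Main()
    {
        // Pretend keys[i, j] holds the int32 primary key for variable i of subject j.
        int[,] keys =
        {
            { 101, 102, 103 },
            { 201, 202, 203 },
            { 301, 302, 303 },
            { 401, 402, 403 }
        };

        int n = keys.GetLength(0), m = keys.GetLength(1);

        // An int32 is represented exactly as a double, so no keys are lost in the conversion.
        var matrix = Matrix<double>.Build.Dense(n, m, (i, j) => keys[i, j]);

        Matrix<double> transposed = matrix.Transpose();   // subjects become rows, variables become columns

        // Read the keys back out and look the real values up elsewhere (e.g. a Dictionary<int, ...>).
        for (int i = 0; i < transposed.RowCount; i++)
            for (int j = 0; j < transposed.ColumnCount; j++)
            {
                int key = (int)transposed[i, j];
                // ... fetch the stored value for this key and write it out ...
            }
    }
}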

thanks

jack

Mar 1, 2011 at 12:36 PM

Hi Jack,

Numerics isn't geared towards what you want to do. Where do you plan on doing your analysis? SAS can handle over a million variables and can be programmed to do this type of data transpose/manipulation. If you are going to be using Numerics, then you'll need to write a parser that converts your data into a CSV file (floating point only) that Numerics can parse into a matrix. No metadata at all; you'll need to keep track of that manually.
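
As a rough illustration only, a loader along these lines would do (the file name is hypothetical, and it assumes a header-free, numbers-only CSV):

// Minimal sketch: load a numbers-only CSV into a Math.NET matrix.
using System;
using System.Globalization;
using System.IO;
using System.Linq;
using MathNet.Numerics.LinearAlgebra;

class CsvToMatrix
{
    static Matrix<double> Load(string path)
    {
        // Every cell must be a parseable floating-point number; metadata has to live elsewhere.
        double[][] rows = File.ReadLines(path)
            .Where(line => line.Length > 0)
            .Select(line => line.Split(',')
                                .Select(cell => double.Parse(cell, CultureInfo.InvariantCulture))
                                .ToArray())
            .ToArray();

        return Matrix<double>.Build.DenseOfRowArrays(rows);
    }

    static void Main()
    {
        var m = Load("data.csv");                       // hypothetical file name
        Console.WriteLine(m.Transpose().ToMatrixString());
    }
}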

> I have an implementation in place but it is very slow.
If you like, I can take a look at your code. This shouldn't be that slow.  You can post it here or e-mail me at marcus@cuda.net.

Regards,
Marcus 

Mar 1, 2011 at 4:06 PM

Marcus,

Thanks for the speedy reply. I think you are correct in that this isn't really the right fit. I'm actually just the middle man: I host the data for a number of studies in an Oracle database in EAV format and have to produce a slew of generic extracts for various clients/researchers. They then choose to use SAS/SPSS/STATA or whatever to do their analysis. Some of them are quite competent, others not so much. My goal is to deliver the raw data in whatever format they like, either just transposing and dumping out a simple table format, or doing merging and re-labeling to combine measures and present the same variables at different time periods on a single line.

Also, the implementation I did wasn't using any of the numeric functionality; it was all LINQ, and then I had a dynamic class that built a DataTable that I then showed in a grid control and exported to CSV/Excel, etc. I haven't stepped through it in a while, but the dynamic-class DataTable was a dog, and it also ran in a Silverlight client so I couldn't do anything in parallel.

My new implementation is only going to run on the server and email out CSVs or XML.

Since my data is stored in EAV format (as rows of numbers, strings, and dates), I have to transpose it even for the simple base case of replicating a standard table, so I have to do 90% of the work for all scenarios anyway.

My hope had been that maybe I could extract all the metadata and transpose that via the LINQ queries I've implemented, then build a matrix of my data objects, use some fancy math algorithm to transpose it, and quickly dump out rows as text CSV. When I realized the matrix couldn't hold an object, I thought maybe I could just use the primary key, which is an int32, and transpose that. I could then loop through each row, get the keys, and pull my data out of a hash set, which would be nice and speedy.

I thought that as my NxM pure dataset got larger and larger, I would reap the benefits of a parallel processing algorithm to do the transpose, and then I could just loop. I looked at a bunch of research papers and algorithms online for in-place and out-of-core NxM non-square transpositions and just got completely overwhelmed.

The thing is, I need to do a bunch of summary counts on the number of missing variables, N/As, etc., so I'm still going to end up doing lots of work.

My current algorithm is to read all the non-normalized data from the database into a set of collections. Then I do a couple of LINQ queries to join them all together into two new collections: one for headers/column information and another with all my data. The data collection object contains all the necessary extract options to generate a key that I can loop over to create a new row. So if I am just dumping a simple data table, the key is subject_time_measure, but if I am creating a merged multivariate row, then the key is subject_measure and the time is appended to the variable name to create a new column.

Anyhow, I then use another set of LINQ queries to isolate a row of data, do all my summaries, and then (in the old version) create a new row in my DataTable. The new version will just create a comma-delimited string. I'm hoping that with the processing on the server and not in the Silverlight client, I can use PLINQ and reap some benefits.
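
Something along these lines is what I have in mind for the new server-side version (DataPoint, the row key, and the column handling are simplified stand-ins for my real collections; it also folds in the missing-value counts I mentioned):

// Rough sketch of the "one comma-delimited string per row key" step with PLINQ.
using System;
using System.Collections.Generic;
using System.Linq;

public class DataPoint
{
    public string RowKey;      // e.g. "subject_time_measure" or "subject_measure"
    public string ColumnName;  // variable name (with time appended when merging)
    public string Value;       // already formatted as text
}

public static class RowBuilder
{
    public static IEnumerable<string> BuildCsvRows(IList<DataPoint> data, IList<string> columns)
    {
        return data
            .GroupBy(d => d.RowKey)
            .AsParallel()               // server-side, so PLINQ is an option
            .AsOrdered()                // keep the rows in a stable order
            .Select(g =>
            {
                // Assumes one value per column per row key.
                var byColumn = g.ToDictionary(d => d.ColumnName, d => d.Value);

                // Simple per-row summary: how many of the requested columns are missing.
                int missing = columns.Count(c => !byColumn.ContainsKey(c));

                var cells = columns.Select(c => byColumn.TryGetValue(c, out var v) ? v : "");
                return g.Key + "," + string.Join(",", cells) + "," + missing;
            });
    }
}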

So my data comes in as Nx1, then I use LINQ to make M subsets of Nx1, prefixing some identifier columns and appending some summary columns, and then I'm either building a visual NxM table (slow) or exporting an NxM CSV.

I'm sure that at some point, where NxM gets sufficiently large, there would be some benefit in using a mathematical transposition of the data instead of looped LINQ queries, but I'm not sure where. Also, while my M could get quite large if a user requests all data variables at all time periods, my number of subjects is currently small enough that it might not matter (<2000).

Anyhow, sorry for the big email; just thinking out loud. If you have any suggestions, I would be happy to hear them.

Jack

Mar 1, 2011 at 5:48 PM

Hi Jack,

Sounds to me like what you want to achieve is essentially a PIVOT table. Here's an example doing this directly in SQL Server; I'd expect Oracle to support this as well.

For example:

SELECT * FROM (SELECT Entity, Attribute, Value FROM test) t
PIVOT (MAX(Value) FOR Attribute IN ([A], [B], [C])) p

would turn

Entity    Attribute    Value
1    A    1331
1    B    3423432
1    C    dsf
2    B    2332
2    C    fdffdfd
3    A    32323
3    C    adsds

into

Entity    A    B    C
1    1331    3423432    dsf
2    NULL    2332    fdffdfd
3    32323    NULL    adsds

Does this come close?

Thanks,
Christoph

Mar 1, 2011 at 6:09 PM

You are correct; I really am doing a giant dynamic pivot table.

However, I have an unknown number of columns, plus I have to do summary/aggregate math on the columns to explain some of the data issues. I could write PL/SQL and dump the data out from the DB; however, some things are just so much easier in C#.

Thanks

jack