
Reading from dense matrix performance

Sep 28, 2013 at 6:19 PM
I have a problem with read performance on a dense matrix. I use MKL, but it's very slow. Can you look at my code and advise something?

This is the fastest code I wrote. It takes about 6 s with a ~20,000x1000 matrix:
public static double LogLikelihood(Matrix<double> x, Matrix<double> pwd, Vector<double> pd)
{
    int rowCount = x.RowCount;
    int colCount = x.ColumnCount;
    double[] tLogL = new double[colCount];

    var tmpX = x.ToArray(); // this can be strange for you, but it's actually better than reading directly from DenseMatrix object
    Parallel.For(0, colCount, j =>
    {
        for (int i = 0; i < rowCount; i++)
        {
            if (tmpX[i, j] != 0)
            {
                tLogL[j] += tmpX[i, j] * Math.Log(pd.At(j) * pwd.At(i, j));
            }
        }
    });
    return tLogL.Sum();
}
Coordinator
Sep 29, 2013 at 10:23 PM
Edited Sep 30, 2013 at 11:33 AM
Hi,

I've run a couple of benchmarks that may give you some ideas about how it could be made faster, but first a few observations:

MKL is not used here since the code does not actually perform any vector, matrix or linear algebra computations at all but just accesses the data structures. So it doesn't matter whether it is enabled or not.

Then, concerning your comment, it is not strange but expected that directly accessing arrays is much faster than through some abstraction over it like a matrix. However, converting to a completely different format like in the "ToArray" call can be expensive itself and in some cases even be a bigger factor than what's saved by the direct access afterwards. Indeed, in my case, B was actually much faster than A. But if you know that your matrix is dense, then you can actually access the raw storage of a matrix directly without any copying or conversion, which is generally the fastest option. See C and D.

Each timing provided is the average of 5 runs on my machine. X and pwd are random matrices of 20'000x1000 and pd is a random vector of length 1000. Because of the Gamma distribution, all fields are >= 0. X, pwd and pd are reused in each run and their creation is not included in the timing:
var x = DenseMatrix.CreateRandom(20000, 1000, new Gamma(2, 0.5, new MersenneTwister(100)));
var pwd = DenseMatrix.CreateRandom(20000, 1000, new Gamma(2, 0.5, new MersenneTwister(101)));
var pd = DenseVector.CreateRandom(1000, new Gamma(2, 0.5, new MersenneTwister(102)));
  • A: 0.499s - this is exactly your implementation
  • B: 0.096s - here I access the x-matrix directly (instead of copying to a 2D array first)
  • C: 0.064s - here I access the raw matrix storage directly
  • D: 0.052s - here I access the raw matrix storage directly and skip creating the tLogL-array and the 0-check but just return the sum
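The gains in C and D come from the dense storage being laid out column-major (as the type name DenseColumnMajorMatrixStorage suggests), so element (i, j) of the raw array sits at j * rowCount + i and a whole column can be walked with a single incrementing index. A minimal standalone sketch of that indexing, using a plain array and no Math.NET types:

```csharp
using System;

class ColumnMajorDemo
{
    static void Main()
    {
        // A 3x2 matrix stored column-major:
        // | 1 4 |
        // | 2 5 |
        // | 3 6 |
        int rowCount = 3, colCount = 2;
        double[] data = { 1, 2, 3, 4, 5, 6 };

        // Element (i, j) lives at j * rowCount + i, which is why C and D
        // can walk a column range with a single incrementing index.
        for (int j = 0; j < colCount; j++)
        {
            int index = j * rowCount; // start of column j
            for (int i = 0; i < rowCount; i++)
            {
                Console.WriteLine("(" + i + "," + j + ") = " + data[index++]);
            }
        }
    }
}
```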
It seems the performance characteristics on my machine are quite different from yours. You mentioned 6s for A, and I interpret your comment as if B was even slower on your machine. Can you confirm that the B below is indeed slower than A on your machine, and that you're running a release build, with no debugger attached?

Thanks,
Christoph
public static double A(Matrix<double> x, Matrix<double> pwd, Vector<double> pd)
{
    int rowCount = x.RowCount;
    int colCount = x.ColumnCount;
    var tLogL = new double[colCount];
    var xma = x.ToArray();
    Parallel.For(0, colCount, j =>
    {
        for (int i = 0; i < rowCount; i++)
        {
            if (xma[i, j] != 0)
            {
                tLogL[j] += xma[i, j]*Math.Log(pd.At(j)*pwd.At(i, j));
            }
        }
    });
    return tLogL.Sum();
}
public static double B(Matrix<double> x, Matrix<double> pwd, Vector<double> pd)
{
    int rowCount = x.RowCount;
    int colCount = x.ColumnCount;
    var tLogL = new double[colCount];
    Parallel.For(0, colCount, j =>
    {
        for (int i = 0; i < rowCount; i++)
        {
            var xij = x.At(i, j);
            if (xij != 0)
            {
                tLogL[j] += xij * Math.Log(pd.At(j) * pwd.At(i, j));
            }
        }
    });
    return tLogL.Sum();
}
public static double C(Matrix<double> x, Matrix<double> pwd, Vector<double> pd)
{
    int rowCount = x.RowCount;
    int colCount = x.ColumnCount;
    var tLogL = new double[colCount];
    var xd = ((DenseColumnMajorMatrixStorage<double>) x.Storage).Data;
    var pwdd = ((DenseColumnMajorMatrixStorage<double>) pwd.Storage).Data;
    var pdd = ((DenseVectorStorage<double>) pd.Storage).Data;
    CommonParallel.For(0, colCount, 32, (a, b) =>
    {
        int index = a*rowCount;
        for (int j = a; j < b; j++)
        {
            for (int i = 0; i < rowCount; i++)
            {
                int ij = index++;
                var xij = xd[ij];
                if (xij != 0)
                {
                    tLogL[j] += xij*Math.Log(pdd[j]*pwdd[ij]);
                }
            }
        }
    });
    return tLogL.Sum();
}
public static double D(Matrix<double> x, Matrix<double> pwd, Vector<double> pd)
{
    int rowCount = x.RowCount;
    int colCount = x.ColumnCount;
    var xd = ((DenseColumnMajorMatrixStorage<double>)x.Storage).Data;
    var pwdd = ((DenseColumnMajorMatrixStorage<double>)pwd.Storage).Data;
    var pdd = ((DenseVectorStorage<double>)pd.Storage).Data;
    var parts = new ConcurrentBag<double>();
    CommonParallel.For(0, colCount, 8, (a, b) =>
    {
        int k = a * rowCount;
        double sum = 0.0;
        for (int j = a; j < b; j++)
        {
            for (int i = 0; i < rowCount; i++)
            {
                sum += xd[k]*Math.Log(pdd[j]*pwdd[k++]);
            }
        }
        parts.Add(sum);
    });
    return parts.Sum();
}
Marked as answer by cdrnet on 12/31/2013 at 1:45 AM
Sep 30, 2013 at 5:17 PM
Thank you very much for your answer. I've tested your solution and here is what I got:

Time taken for creating x: 75531 ms
Time taken for creating pwd: 80368 ms
Time taken for creating pd: 13 ms
Time taken for A: 105628 ms
Time taken for B: 164547 ms
Time taken for C: 1808 ms
Time taken for D: 1807 ms

C and D work awesome! Nevertheless it's confusing what's happening in A and B (not to mention creating the matrices...). Do you have any idea?

I work on an Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz, 8GB RAM 1333MHz, SSD disk. The application is in release mode, .NET Framework 4, VS 2012, Math.NET Numerics 2.6.1 with MKL for Win64 1.3.0.
public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
            MathNet.Numerics.Control.LinearAlgebraProvider = new MathNet.Numerics.Algorithms.LinearAlgebra.Mkl.MklLinearAlgebraProvider();
        }

        private void button1_Click(object sender, EventArgs e)
        {
            Stopwatch sw;

            sw = Stopwatch.StartNew();
            var x = DenseMatrix.CreateRandom(20000, 1000, new ContinuousUniform(0, 1));
            sw.Stop();
            richTextBox1.AppendText("Time taken for creating x: " + sw.Elapsed.TotalMilliseconds + "ms" + Environment.NewLine);

            sw = Stopwatch.StartNew();
            var pwd = DenseMatrix.CreateRandom(20000, 1000, new ContinuousUniform(0, 1));
            sw.Stop();
            richTextBox1.AppendText("Time taken for creating pwd: " + sw.Elapsed.TotalMilliseconds + "ms" + Environment.NewLine);

            sw = Stopwatch.StartNew();
            var pd = DenseVector.CreateRandom(1000, new ContinuousUniform(0, 1));
            sw.Stop();
            richTextBox1.AppendText("Time taken for creating pd: " + sw.Elapsed.TotalMilliseconds + "ms" + Environment.NewLine);



            sw = Stopwatch.StartNew();
            var resA = A(x, pwd, pd);
            sw.Stop();
            richTextBox1.AppendText("Time taken for A: " + sw.Elapsed.TotalMilliseconds + "ms" + Environment.NewLine);

            sw = Stopwatch.StartNew();
            var resB = B(x, pwd, pd);
            sw.Stop();
            richTextBox1.AppendText("Time taken for B: " + sw.Elapsed.TotalMilliseconds + "ms" + Environment.NewLine);

            sw = Stopwatch.StartNew();
            var resC = C(x, pwd, pd);
            sw.Stop();
            richTextBox1.AppendText("Time taken for C: " + sw.Elapsed.TotalMilliseconds + "ms" + Environment.NewLine);

            sw = Stopwatch.StartNew();
            var resD = D(x, pwd, pd);
            sw.Stop();
            richTextBox1.AppendText("Time taken for D: " + sw.Elapsed.TotalMilliseconds + "ms" + Environment.NewLine);
        }

        public double A(Matrix<double> x, Matrix<double> pwd, Vector<double> pd)
        {
            int rowCount = x.RowCount;
            int colCount = x.ColumnCount;
            var tLogL = new double[colCount];
            var xma = x.ToArray();
            Parallel.For(0, colCount, j =>
            {
                for (int i = 0; i < rowCount; i++)
                {
                    if (xma[i, j] != 0)
                    {
                        tLogL[j] += xma[i, j] * Math.Log(pd.At(j) * pwd.At(i, j));
                    }
                }
            });
            return tLogL.Sum();
        }

        public double B(Matrix<double> x, Matrix<double> pwd, Vector<double> pd)
        {
            int rowCount = x.RowCount;
            int colCount = x.ColumnCount;
            var tLogL = new double[colCount];
            Parallel.For(0, colCount, j =>
            {
                for (int i = 0; i < rowCount; i++)
                {
                    var xij = x.At(i, j);
                    if (xij != 0)
                    {
                        tLogL[j] += xij * Math.Log(pd.At(j) * pwd.At(i, j));
                    }
                }
            });
            return tLogL.Sum();
        }

        public double C(Matrix<double> x, Matrix<double> pwd, Vector<double> pd)
        {
            int rowCount = x.RowCount;
            int colCount = x.ColumnCount;
            var tLogL = new double[colCount];
            var xd = ((DenseColumnMajorMatrixStorage<double>)x.Storage).Data;
            var pwdd = ((DenseColumnMajorMatrixStorage<double>)pwd.Storage).Data;
            var pdd = ((DenseVectorStorage<double>)pd.Storage).Data;
            CommonParallel.For(0, colCount, 32, (a, b) =>
            {
                int index = a * rowCount;
                for (int j = a; j < b; j++)
                {
                    for (int i = 0; i < rowCount; i++)
                    {
                        int ij = index++;
                        var xij = xd[ij];
                        if (xij != 0)
                        {
                            tLogL[j] += xij * Math.Log(pdd[j] * pwdd[ij]);
                        }
                    }
                }
            });
            return tLogL.Sum();
        }

        public double D(Matrix<double> x, Matrix<double> pwd, Vector<double> pd)
        {
            int rowCount = x.RowCount;
            int colCount = x.ColumnCount;
            var xd = ((DenseColumnMajorMatrixStorage<double>)x.Storage).Data;
            var pwdd = ((DenseColumnMajorMatrixStorage<double>)pwd.Storage).Data;
            var pdd = ((DenseVectorStorage<double>)pd.Storage).Data;
            var parts = new ConcurrentBag<double>();
            CommonParallel.For(0, colCount, 8, (a, b) =>
            {
                int k = a * rowCount;
                double sum = 0.0;
                for (int j = a; j < b; j++)
                {
                    for (int i = 0; i < rowCount; i++)
                    {
                        sum += xd[k] * Math.Log(pdd[j] * pwdd[k++]);
                    }
                }
                parts.Add(sum);
            });
            return parts.Sum();
        }
    }
Coordinator
Sep 30, 2013 at 9:45 PM
I've just run exactly the code as you posted (in a form with a button and a rich-text box):

Time taken for creating x: 576.7017ms
Time taken for creating pwd: 577.3544ms
Time taken for creating pd: 0.0393ms // alternates between this and something like 10ms across runs
Time taken for A: 596.2983ms
Time taken for B: 110.9812ms
Time taken for C: 79.9605ms
Time taken for D: 70.3425ms

Just to clarify that I'm reading the numbers correctly: running this takes almost 3 minutes on your setup? Also, your timings do not show any sub-millisecond digits. Did you drop them manually (and add a space before the 'ms')? If not, could you add the following line to the beginning of your button1_Click method and see what it shows?
richTextBox1.AppendText("Frequency:" + Stopwatch.Frequency + "; HighRes:" + Stopwatch.IsHighResolution + Environment.NewLine);
For reference, on my system it prints "Frequency:2338115; HighRes:True"

My setup: 2x Intel(R) Xeon(R) CPU E5-2609 @ 2.40GHz, 32GB RAM 1066MHz/ECC, Win 8.0 x64, compiled against .Net 4.5 and Math.NET Numerics v3.0.0-alpha4 on VS2012.

If A really takes 105s on your system, as opposed to 0.6s on mine, I'd like to know why it seems to be nearly 200 times slower on your machine - maybe it's something we can fix. Also, didn't you mention that A took just 6s in your original post? What was different there compared to the last run? Was it just different data?

In comparison:
  • CPU: Both are from the sandy-bridge generation so they should be comparable. I have twice the cores, but they are clocked much slower than yours (2.4 vs 3.3 GHz). I have more L3 cache though (2x10MB vs 6MB).
  • RAM: I have more, but again yours is clocked faster. On the other hand I have twice the channels and thus almost twice the memory bandwidth (34.1 vs 21 GB/s) - per CPU
  • CLR: I'm on 4.5 as opposed to 4.0. I don't remember whether there have been any optimizations around large arrays though (except actually supporting much larger ones)
  • Math.NET Numerics: v3.0.0-alpha4 instead of 2.6.1 but that should not make any difference at all in this case.
Note that we're also benchmarking the logarithm function here - after all, we're calling it 20 million times. Indeed, if I replace log(x) with just (x), A drops ~20% to 490ms and D even drops ~80% to only 12ms on my system in a quick experiment. Does it make a comparable difference in your setup?
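To get a feel for how much of the total time goes into the logarithm itself, a rough standalone micro-benchmark along these lines can help (this is only a sketch, and the timings are of course machine-dependent):

```csharp
using System;
using System.Diagnostics;

class LogCost
{
    static void Main()
    {
        const int n = 20000000; // roughly as many calls as the 20'000x1000 loop makes
        double sum = 0.0;

        var sw = Stopwatch.StartNew();
        for (int i = 1; i <= n; i++)
        {
            sum += Math.Log(i); // the call under test; accumulate so it can't be optimized away
        }
        sw.Stop();

        Console.WriteLine("20M Math.Log calls: " + sw.ElapsedMilliseconds + " ms (sum=" + sum + ")");
    }
}
```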

Another observation: if I run the implementations in a long loop to keep them busy for a while, A only causes about 50% CPU usage while C saturates all cores to 100%. CommonParallel does use the TPL Parallel class internally as well, but explicitly defines a partitioning strategy which seems to be paying off.
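For reference, the same explicit range-partitioning idea can be sketched with the stock TPL types alone, using Partitioner.Create to hand each worker a contiguous chunk and a per-chunk local sum - similar in spirit to what D does with CommonParallel.For:

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

class RangePartitionDemo
{
    static void Main()
    {
        var values = Enumerable.Range(1, 1000).Select(i => (double)i).ToArray();

        // Explicit range partitioning: each worker processes a contiguous
        // chunk of 100 elements and accumulates a local sum, so there is no
        // per-element scheduling overhead or shared-array contention.
        var parts = new ConcurrentBag<double>();
        Parallel.ForEach(Partitioner.Create(0, values.Length, 100), range =>
        {
            double sum = 0.0;
            for (int i = range.Item1; i < range.Item2; i++) sum += values[i];
            parts.Add(sum);
        });

        Console.WriteLine(parts.Sum()); // 500500
    }
}
```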

Thanks,
Christoph
Sep 30, 2013 at 11:23 PM
Edited Oct 1, 2013 at 12:18 AM
Yes, you're reading the numbers correctly. I removed the sub-millisecond digits manually, because I thought they were irrelevant.

I did some tests with .NET 4.0 vs. 4.5 (matrices 20000 x 1000):

1st run (release mode, .NET 4.0):
Time taken for creating x: 75676,2516ms
Time taken for creating pwd: 80250,4029ms
Time taken for creating pd: 13,7366ms
Time taken for A: 111823,3302ms
Time taken for B: 171528,7018ms
Time taken for C: 1957,538ms
Time taken for D: 1943,2011ms

2nd run (release mode, .NET 4.0):
Time taken for creating x: 92882,2734ms
Time taken for creating pwd: 83404,0522ms
Time taken for creating pd: 13,9129ms
Time taken for A: 115798,2806ms
Time taken for B: 179194,6662ms
Time taken for C: 1811,7538ms
Time taken for D: 1956,0585ms

3rd run (release mode, .NET 4.5):
Time taken for creating x: 124662,1937ms
Time taken for creating pwd: 127566,6128ms
Time taken for creating pd: 13,9113ms
Time taken for A: 114932,8443ms
Time taken for B: 162376,3638ms
Time taken for C: 2110,7587ms
Time taken for D: 2116,1003ms

4th run (release mode, .NET 4.5):
Time taken for creating x: 73257,2871ms
Time taken for creating pwd: 78052,579ms
Time taken for creating pd: 13,3851ms
Time taken for A: 116578,3244ms
Time taken for B: 187353,0683ms
Time taken for C: 2193,5422ms
Time taken for D: 2239,871ms

This is my CPU usage while creating the matrices:
[Image: CPU usage chart]

and this while executing the methods:
[Image: CPU usage chart]

All in all I'm quite pleased, because method D works very well for me.
Coordinator
Oct 1, 2013 at 12:05 AM
Edited Oct 1, 2013 at 12:39 AM
Thanks for the follow-up runs. Yes, it is also slightly faster for me on .Net 4.0 (C: ~63 ms, D: ~50 ms).

Thanks,
Christoph
Coordinator
Oct 1, 2013 at 12:24 AM
Edited Oct 1, 2013 at 12:24 AM
Re CPU usage charts: looks like we should consider parallelizing matrix initialization.