
Adding more Parallel methods

Aug 19, 2009 at 10:41 AM

Hi Guys,

I'm working on adding a couple more methods to the Parallel class: the other version of the For method that takes init, action, and finalizer delegates so we can aggregate results across threads, and the ForEach version. The For method should be straightforward, but I'm having a problem with the ForEach method. I cannot come up with an efficient implementation. Does anyone have any thoughts on how we can implement it?
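For concreteness, that For overload could look roughly like the sketch below. The names and signature are illustrative only, not the committed API, and raw threads stand in for our own scheduler just to show the init/body/finalizer flow:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public static class ParallelSketch
{
    // Hypothetical shape of the For overload: localInit runs once per worker
    // thread, body folds each index into that worker's local state, and
    // localFinally is called once per worker to aggregate the results.
    public static void For<TLocal>(
        int fromInclusive,
        int toExclusive,
        Func<TLocal> localInit,
        Func<int, TLocal, TLocal> body,
        Action<TLocal> localFinally)
    {
        int workers = Environment.ProcessorCount;
        int count = toExclusive - fromInclusive;
        int chunk = Math.Max(1, (count + workers - 1) / workers);
        var threads = new List<Thread>();
        object aggregateLock = new object();

        for (int start = fromInclusive; start < toExclusive; start += chunk)
        {
            int from = start;
            int to = Math.Min(start + chunk, toExclusive);
            var thread = new Thread(() =>
            {
                TLocal local = localInit();
                for (int i = from; i < to; i++)
                {
                    local = body(i, local);
                }
                // Serialize only the final per-worker aggregation step.
                lock (aggregateLock) { localFinally(local); }
            });
            threads.Add(thread);
            thread.Start();
        }

        foreach (var thread in threads) thread.Join();
    }
}
```

The point of the shape is that each worker accumulates into its own local state, so the hot loop needs no locking; e.g. a sum would be `ParallelSketch.For(0, n, () => 0L, (i, local) => local + data[i], local => total += local)`.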

Thanks,
Marcus 

Aug 19, 2009 at 1:53 PM

Hi Marcus,

Yes, I thought a bit about ForEach as well when I worked on it and have not come up with a nice solution yet either.

What I've come up with so far:

  • I think we should check first whether the passed IEnumerable implements IList as well (which includes arrays and List<T>). If it does we can use For instead.

We have to assume that the provided enumerable is not thread safe, so we have to enumerate in a single thread (or synchronize access to it). Therefore:

  • We could process blockwise. I.e. get the first 512 entries and pass them to For. Then the next 512 entries etc.
  • We could process blockwise per worker thread, with increasing block size. E.g. first add a task with the first 4 entries to the queue, then one with the next 8 entries, then 16, 32 etc.
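The two ideas above (IList fast path plus growing blocks) could be combined roughly like this. This is a rough sketch, not the actual implementation: the .NET 4 TPL types stand in for our own Task/Parallel classes, and the starting block size of 4 with doubling is just the number from the bullet above:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks; // stand-in for the project's own Task/Parallel

public static class ForEachSketch
{
    public static void ForEach<T>(IEnumerable<T> source, Action<T> action)
    {
        // Fast path: arrays and List<T> implement IList<T>, so we can index
        // into them from many threads and skip the enumerator entirely.
        if (source is IList<T> list)
        {
            Parallel.For(0, list.Count, i => action(list[i]));
            return;
        }

        // Slow path: the enumerable may not be thread safe, so enumerate on
        // this thread only and hand out blocks of doubling size (4, 8, 16, ...)
        // to worker tasks.
        var tasks = new List<Task>();
        int blockSize = 4;
        using (var enumerator = source.GetEnumerator())
        {
            bool more = true;
            while (more)
            {
                var block = new List<T>(blockSize);
                while (block.Count < blockSize && (more = enumerator.MoveNext()))
                {
                    block.Add(enumerator.Current);
                }

                if (block.Count > 0)
                {
                    var captured = block;
                    tasks.Add(Task.Factory.StartNew(() =>
                    {
                        foreach (var item in captured) action(item);
                    }));
                }

                blockSize *= 2;
            }
        }

        Task.WaitAll(tasks.ToArray());
    }
}
```

The doubling amortizes the copy-into-a-block overhead: early small blocks get the workers busy quickly, while later large blocks keep the per-task cost low.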

Does the Mono TPL equivalent implement a parallel for-each?

Thanks,
Chris

Aug 19, 2009 at 3:05 PM
cdrnet wrote:
  • We could process blockwise per worker thread, with increasing block size. E.g. first add a task with the first 4 entries to the queue, then one with the next 8 entries, then 16, 32 etc.

I seem to recall reading a blog or MSDN article or something from MS saying that this is the strategy they use. I can't find the article anymore, though. I'll keep looking.

J

Aug 19, 2009 at 3:12 PM

Hi Chris,

>I think we should check first whether the passed IEnumerable implements IList as well (which includes arrays and List<T>). If it does we can use For instead.

Definitely.

>We could process blockwise per worker thread, with increasing block size. E.g. first add a task with the first 4 entries to the queue, then one with the next 8 entries, then 16, 32 etc.

This is a good idea, I'll give it a try.

>Does the Mono TPL equivalent implement a parallel for-each?

They do. I wasn't sure if looking at their code would taint ours (if we implemented it in the same way).

Thanks!
Marcus 

 

Aug 19, 2009 at 3:30 PM

There was one with nicer pictures but here is at least a pointer. -J

 

Aug 19, 2009 at 5:18 PM

cool thanks

Marcus

Aug 21, 2009 at 2:10 PM

Hey Guys,

I'll be out next week and wasn't able to finish adding the new parallel methods. I added a first take in my parallel branch (it is rough and there is still a bunch that needs to be done). My goal is to have them done the first week of September.

Regards,

Marcus

Sep 1, 2009 at 12:20 PM

I checked in the new parallel methods. There is a lot of overhead in the ForEach methods, so they shouldn't be used for computationally light tasks - for example, just doing an add and an assignment is about 40% slower than the non-parallel version.

Regards,
Marcus

Sep 2, 2009 at 6:42 PM

Thanks for the work, I'll have a look at it.

In the case of DenseVector, Parallel.ForEach should almost always get redirected to Parallel.For in the end since we're dealing with native arrays. Does this mean that this benchmark result applies to Parallel.For as well, not just Parallel.ForEach?

Do we know whether the overhead comes from the threading, the packing and delegation, subtle memory cache alignment issues, or rather from the suboptimal iteration itself (a plain direct array for-loop is usually the best-optimized and fastest way)? No need to start measuring, but if we already know, it would be interesting.

Thanks,
Chris

Sep 2, 2009 at 7:16 PM
Edited Sep 2, 2009 at 7:22 PM

>In the case of DenseVector, Parallel.ForEach should almost always get redirected to Parallel.For in the end since we're dealing with native arrays.

I don't think that is the case, since the Vector enumerator is not of type IList (it's whatever the compiler-generated class is).

>Does this mean that this benchmark result applies to Parallel.For as well, not just Parallel.ForEach?

The Parallel.For overhead is much less than the Parallel.ForEach. The worst case scenario was usually just no improvement over the serial for.

>Do we know whether the overhead comes from the threading,

I couldn't get DotTrace or AQtime to profile the code correctly (I'm going to try again when I do the sparse vectors). But playing around with the code, it seemed like most of the overhead was from copying the elements to the internal arrays.

Generally we'll only use the Parallel.ForEach for the sparse matrices and vectors.

Thanks,
Marcus 

Sep 2, 2009 at 10:14 PM

>>In the case of DenseVector, Parallel.ForEach should almost always get redirected to Parallel.For in the end since we're dealing with native arrays.
> I don't think that is the case, since the Vector enumerator is not of type IList (it's whatever the compiler-generated class is).

Can't we apply ForEach directly on _Data, which is a double array? Btw, Iridium's Vector explicitly implements IList<T>, but for other reasons.
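The redirect I have in mind is just an overload resolution trick, sketched below (hypothetical names; again the TPL Parallel class stands in for our own):

```csharp
using System;
using System.Threading.Tasks; // stand-in for the project's own Parallel class

public static class ParallelEx
{
    // Overload resolution prefers T[] over IEnumerable<T>, so callers that
    // pass a raw array never pay for the enumerator or interface dispatch.
    public static void ForEach<T>(T[] array, Action<T> action)
    {
        Parallel.For(0, array.Length, i => action(array[i]));
    }
}
```

DenseVector could then pass its backing array (the _Data field mentioned above) straight to this overload internally instead of going through IEnumerable{double}.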

I just extended my simple perf app and did some benchmarks. Each number is the average of 16 iterations, after two "warm up"/jit iterations. Nevertheless, the benchmarks are quite unreliable, so use with care.

For(array): s =>{ for (int i = 0; i < s.Length; i++) { action(s[i]); } }
Parallel.For(array): s => Parallel.For(0, s.Length, i => action(s[i]))
Foreach(array): s =>{ foreach (var x in s) { action(x); } }
Parallel.ForEach(array): s => Parallel.ForEach(s, action)

Executing 'x=exp(tanh(x))' on 1048576 Items, parallel: False
For(array): 934705 Ticks (261 ms)
Parallel.For(array): 894920 Ticks (250 ms)

Foreach(array): 895098 Ticks (250 ms)
Parallel.ForEach(array): 979485 Ticks (273 ms)


Executing 'x=exp(tanh(x))' on 1048576 Items, parallel: True
For(array): 913335 Ticks (255 ms)
Parallel.For(array): 770182 Ticks (215 ms)

Foreach(array): 925254 Ticks (258 ms)
Parallel.ForEach(array): 796317 Ticks (222 ms)


Executing 'nop' on 1048576 Items, parallel: False
For(array): 27954 Ticks (7 ms)
Parallel.For(array): 33775 Ticks (9 ms)
Foreach(array): 21971 Ticks (6 ms)
Parallel.ForEach(array): 53363 Ticks (14 ms)

Executing 'nop' on 1048576 Items, parallel: True
For(array): 21647 Ticks (6 ms)
Parallel.For(array): 384385 Ticks (107 ms)
Foreach(array): 25010 Ticks (6 ms)
Parallel.ForEach(array): 405599 Ticks (113 ms)

Executing 'x=exp(tanh(x))' on 16 Items, parallel: False
For(array): 35 Ticks (0 ms)
Parallel.For(array): 36 Ticks (0 ms)
Foreach(array): 36 Ticks (0 ms)
Parallel.ForEach(array): 37 Ticks (0 ms)

Executing 'x=exp(tanh(x))' on 16 Items, parallel: True
For(array): 35 Ticks (0 ms)
Parallel.For(array): 171267 Ticks (47 ms)
Foreach(array): 35 Ticks (0 ms)
Parallel.ForEach(array): 122328 Ticks (34 ms)

Executing 'nop' on 16 Items, parallel: False
For(array): 8 Ticks (0 ms)
Parallel.For(array): 9 Ticks (0 ms)
Foreach(array): 9 Ticks (0 ms)
Parallel.ForEach(array): 10 Ticks (0 ms)

Executing 'nop' on 16 Items, parallel: True
For(array): 8 Ticks (0 ms)
Parallel.For(array): 195750 Ticks (54 ms)

Foreach(array): 9 Ticks (0 ms)
Parallel.ForEach(array): 24496 Ticks (6 ms)

Then I tried again, but with a Parallel.ForEach overload that takes a T[] array and directly forwards to Parallel.For, because I had the impression that casting an array to IEnumerable{T} and then to IList{T} and accessing it through that interface is much slower than expected. The new results:

Executing 'x=exp(tanh(x))' on 1048576 Items, parallel: False
For(array): 935048 Ticks (261 ms)
Parallel.For(array): 896533 Ticks (250 ms)
Foreach(array): 895587 Ticks (250 ms)
Parallel.ForEach(array): 952082 Ticks (265 ms)


Executing 'x=exp(tanh(x))' on 1048576 Items, parallel: True
For(array): 911662 Ticks (254 ms)
Parallel.For(array): 809160 Ticks (226 ms)
Foreach(array): 928030 Ticks (259 ms)
Parallel.ForEach(array): 753720 Ticks (210 ms)


Executing 'nop' on 1048576 Items, parallel: False
For(array): 23621 Ticks (6 ms)
Parallel.For(array): 33787 Ticks (9 ms)
Foreach(array): 21896 Ticks (6 ms)
Parallel.ForEach(array): 57118 Ticks (15 ms)

Executing 'nop' on 1048576 Items, parallel: True
For(array): 21997 Ticks (6 ms)
Parallel.For(array): 387417 Ticks (108 ms)
Foreach(array): 25062 Ticks (7 ms)
Parallel.ForEach(array): 314128 Ticks (87 ms)

Executing 'x=exp(tanh(x))' on 16 Items, parallel: False
For(array): 36 Ticks (0 ms)
Parallel.For(array): 36 Ticks (0 ms)
Foreach(array): 35 Ticks (0 ms)
Parallel.ForEach(array): 37 Ticks (0 ms)

Executing 'x=exp(tanh(x))' on 16 Items, parallel: True
For(array): 37 Ticks (0 ms)
Parallel.For(array): 171301 Ticks (47 ms)
Foreach(array): 35 Ticks (0 ms)
Parallel.ForEach(array): 146779 Ticks (41 ms)

Executing 'nop' on 16 Items, parallel: False
For(array): 8 Ticks (0 ms)
Parallel.For(array): 10 Ticks (0 ms)
Foreach(array): 9 Ticks (0 ms)
Parallel.ForEach(array): 10 Ticks (0 ms)

Executing 'nop' on 16 Items, parallel: True
For(array): 9 Ticks (0 ms)
Parallel.For(array): 171274 Ticks (47 ms)
Foreach(array): 9 Ticks (0 ms)
Parallel.ForEach(array): 146344 Ticks (40 ms)

Then I wondered why we don't get better speed up, and tried replacing Thread.Sleep(100) in Task.Wait with Thread.Sleep(0):

Executing 'x=exp(tanh(x))' on 1048576 Items, parallel: False
For(array): 932933 Ticks (260 ms)
Parallel.For(array): 895800 Ticks (250 ms)
Foreach(array): 895194 Ticks (250 ms)
Parallel.ForEach(array): 950859 Ticks (265 ms)

Executing 'x=exp(tanh(x))' on 1048576 Items, parallel: True
For(array): 912945 Ticks (255 ms)
Parallel.For(array): 485603 Ticks (135 ms)

Foreach(array): 927728 Ticks (259 ms)
Parallel.ForEach(array): 484545 Ticks (135 ms)


Executing 'nop' on 1048576 Items, parallel: False
For(array): 21978 Ticks (6 ms)
Parallel.For(array): 34048 Ticks (9 ms)
Foreach(array): 21928 Ticks (6 ms)
Parallel.ForEach(array): 58254 Ticks (16 ms)

Executing 'nop' on 1048576 Items, parallel: True
For(array): 21700 Ticks (6 ms)
Parallel.For(array): 17851 Ticks (4 ms)

Foreach(array): 21979 Ticks (6 ms)
Parallel.ForEach(array): 17846 Ticks (4 ms)


Executing 'x=exp(tanh(x))' on 16 Items, parallel: False
For(array): 20 Ticks (0 ms)
Parallel.For(array): 20 Ticks (0 ms)
Foreach(array): 19 Ticks (0 ms)
Parallel.ForEach(array): 21 Ticks (0 ms)

Executing 'x=exp(tanh(x))' on 16 Items, parallel: True
For(array): 21 Ticks (0 ms)
Parallel.For(array): 55 Ticks (0 ms)

Foreach(array): 20 Ticks (0 ms)
Parallel.ForEach(array): 54 Ticks (0 ms)

Executing 'nop' on 16 Items, parallel: False
For(array): 7 Ticks (0 ms)
Parallel.For(array): 7 Ticks (0 ms)
Foreach(array): 7 Ticks (0 ms)
Parallel.ForEach(array): 8 Ticks (0 ms)

Executing 'nop' on 16 Items, parallel: True
For(array): 7 Ticks (0 ms)
Parallel.For(array): 47 Ticks (0 ms)

Foreach(array): 6 Ticks (0 ms)
Parallel.ForEach(array): 47 Ticks (0 ms)

Looks much better. What do you think, any concerns using 0 instead of 100?

Thanks,
Chris

Sep 2, 2009 at 10:19 PM

I think I should try again, but taking the median time instead of the average...

Sep 3, 2009 at 6:58 AM

Hi Chris,

>Can't we apply ForEach directly on _Data, which is a double array?

Of course. I thought you were saying that calling Parallel.ForEach on the DenseVector itself would get forwarded to Parallel.For.


>Btw, Iridium's Vector explicitly implements IList<T>, but for other reasons.

I'm still adding methods back to the Vector class. For dnA we originally implemented IList but dropped it for some reason (I don't recall why). What are your thoughts on Numerics' Vector - should it implement IList?


>Looks much better. What do you think, any concerns using 0 instead of 100?

We should change it to 0. After I changed the Task class, I forgot to optimize the value for Thread.Sleep - duh! Could you rerun your test using SpinWait instead of Sleep?

Thanks,
Marcus 

Sep 3, 2009 at 7:51 AM

Results with SpinWait(0):

Executing 'x=exp(tanh(x))' on 1048576 Items, parallel: False
For(array): 932058 Ticks (260 ms)
Parallel.For(array): 894900 Ticks (250 ms)
Foreach(array): 894995 Ticks (250 ms)
Parallel.ForEach(array): 977320 Ticks (273 ms)

Executing 'x=exp(tanh(x))' on 1048576 Items, parallel: True
For(array): 908375 Ticks (253 ms)
Parallel.For(array): 610533 Ticks (170 ms)

Foreach(array): 925608 Ticks (258 ms)
Parallel.ForEach(array): 708516 Ticks (197 ms)


Executing 'nop' on 1048576 Items, parallel: False
For(array): 21790 Ticks (6 ms)
Parallel.For(array): 33782 Ticks (9 ms)
Foreach(array): 21592 Ticks (6 ms)
Parallel.ForEach(array): 53392 Ticks (14 ms)

Executing 'nop' on 1048576 Items, parallel: True
For(array): 29244 Ticks (8 ms)
Parallel.For(array): 25302 Ticks (7 ms)
Foreach(array): 21842 Ticks (6 ms)
Parallel.ForEach(array): 103657 Ticks (28 ms)

Executing 'x=exp(tanh(x))' on 16 Items, parallel: False
For(array): 21 Ticks (0 ms)
Parallel.For(array): 21 Ticks (0 ms)
Foreach(array): 20 Ticks (0 ms)
Parallel.ForEach(array): 22 Ticks (0 ms)

Executing 'x=exp(tanh(x))' on 16 Items, parallel: True
For(array): 20 Ticks (0 ms)
Parallel.For(array): 75 Ticks (0 ms)
Foreach(array): 20 Ticks (0 ms)
Parallel.ForEach(array): 65 Ticks (0 ms)

Executing 'nop' on 16 Items, parallel: False
For(array): 7 Ticks (0 ms)
Parallel.For(array): 7 Ticks (0 ms)
Foreach(array): 7 Ticks (0 ms)
Parallel.ForEach(array): 8 Ticks (0 ms)

Executing 'nop' on 16 Items, parallel: True
For(array): 7 Ticks (0 ms)
Parallel.For(array): 46 Ticks (0 ms)
Foreach(array): 7 Ticks (0 ms)
Parallel.ForEach(array): 50 Ticks (0 ms)

Sleep(0) seems to perform better than SpinWait(0). Although, shouldn't SpinWait(0) be kind of a no-op, i.e. are we actually spinning in the while loop instead of inside the SpinWait method? Should I try again with a different parameter? Actually, I think we don't really want to wait and block our thread by spinning; we want the scheduler to switch back to the worker thread so it can do its work. So I think Sleep(0) fits better here, as afaik that is exactly what it does.
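To illustrate the difference, here is a minimal stand-alone version of the kind of wait loop in question (a hypothetical flag replaces our actual Task internals):

```csharp
using System;
using System.Threading;

class WaitSketch
{
    static volatile bool _done;

    static void Main()
    {
        var worker = new Thread(() => { Thread.Sleep(50); _done = true; });
        worker.Start();

        // The wait loop in question: Thread.Sleep(0) gives up the rest of the
        // current time slice so the scheduler can run a ready worker thread on
        // this core. Thread.SpinWait(n), by contrast, burns CPU in user mode
        // without yielding, so with more waiters than cores the waiting thread
        // can starve the very workers it is waiting for.
        while (!_done)
        {
            Thread.Sleep(0);
        }

        worker.Join();
        Console.WriteLine("done");
    }
}
```

With Sleep(100) in that loop, a finished computation can sit unnoticed for up to 100 ms per wait, which would explain the large constant overhead on the small inputs above.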

Thanks,
Chris

Sep 3, 2009 at 8:27 AM

>Actually, I think we don't really want to wait and block our thread by spinning; we want the scheduler to switch back to the worker thread so it can do its work.

Right, using SpinWait doesn't make sense.

Thanks,
Marcus