2021-09-24, 06:17  #56  
Jul 2003
So Cal
3·17·43 Posts 

2021-09-24, 06:21  #57  
Jul 2003
So Cal
100010010001_{2} Posts 

2021-09-24, 12:36  #58 
Apr 2020
2×251 Posts 

2021-09-24, 13:40  #59 
"Carlos Pinho"
Oct 2011
Milton Keynes, UK
3·1,657 Posts 

2021-09-24, 15:13  #60 
Jul 2003
So Cal
3·17·43 Posts 
A large fraction encounter issues when exceeding 1GB/thread, so I stay a little below that.

2021-09-24, 15:50  #61 
Apr 2020
1F6_{16} Posts 
If lims have to stay at 250M, it would probably be possible to stretch the upper limit of doable jobs a bit by using 3LP on both sides to catch some of the relations that are lost due to the low lims. This makes sec/rel ~30% worse but increases yield by ~50%, while also increasing the number of relations needed by some unknown amount (almost certainly below 50%) and making LA that bit harder as a result.
But as long as you can cope with lpb 34/34 and 3LP on only one side, there shouldn't be any need for this. 
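To put rough numbers on that trade-off, here is a small sketch. It assumes the figures above (sec/rel about 30% worse, yield about 50% higher) and treats the growth in relations needed as a free parameter, since that amount is unknown. The function name and the idea of reporting a special-q range factor are my own illustration, not anything taken from msieve or the sievers.

```python
def three_lp_tradeoff(rels_growth, sec_per_rel_factor=1.30, yield_factor=1.50):
    """Compare 3LP on both sides against the baseline.

    rels_growth: assumed factor by which the number of relations needed grows
                 (unknown in practice; the post only says almost certainly < 1.5).
    Returns (sieve_time_factor, q_range_factor), both relative to the baseline.
    """
    # Total sieving time scales with relations needed times seconds per relation.
    sieve_time_factor = sec_per_rel_factor * rels_growth
    # The special-q range needed scales with relations needed divided by yield.
    q_range_factor = rels_growth / yield_factor
    return sieve_time_factor, q_range_factor


if __name__ == "__main__":
    for growth in (1.0, 1.25, 1.5):
        t, q = three_lp_tradeoff(growth)
        print(f"rels x{growth:.2f}: sieve time x{t:.2f}, special-q range x{q:.2f}")
```

Under these assumptions the special-q range needed shrinks for any growth below 50%, which is what would let a larger job fit, at the price of more total sieving time.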
2021-10-22, 13:34  #62 
Jun 2012
Boulder, CO
328_{10} Posts 
In general, given a GPU with X GB RAM, and an N x N matrix, is there a way to determine (reasonably) optimal VBITS and block_nnz values?

2021-10-22, 23:00  #63 
Jul 2003
So Cal
3×17×43 Posts 
Technically it's an MxN matrix with M slightly less than N, but for this question we can approximate it as NxN.
Volta (and I'm hoping Turing and Ampere) GPUs aren't very sensitive to the block_nnz value, so just keep it at its default of 1.75 billion. The actual limit is that the number of nonzeros in a CUB SpMV call is stored in an int32, so each matrix block must have fewer than 2^31 nonzeros. block_nnz only sets an estimate, especially for the transpose matrix, so I've been a bit conservative in setting it at 1.75B. We want to keep the number of blocks reasonably small, since each block for both the normal and transpose matrix needs a 4*(N+1)-byte row offset array in addition to the 4*num_nonzeros-byte column array in GPU memory.
For VBITS, a global memory fetch on current NVIDIA GPUs by default moves 64 bytes into the L2 cache (although this can be reduced to 32 bytes on A100). With VBITS=128, we are only using 16 bytes of that data, with little chance of cache reuse in most of the matrix. Increasing VBITS uses more of the data and thus uses global memory bandwidth more efficiently in the SpMV. However, each iteration also has multiple VBITSxN • NxVBITS dense matrix multiplications, which require strided access to arrays. This strided access has a larger impact at VBITS=512. Also, the vectors require 7*N*VBITS/8 bytes of GPU memory.
In practice on the V100 I've gotten about equal performance from VBITS of 384 and 512, and poorer performance with decreasing values. Of the two I use 384 since it requires less GPU memory. However, lower VBITS values are useful if GPU memory is tight. Once I have access to an A100 I will compare using VBITS=256 with cudaLimitMaxL2FetchGranularity of 32 to VBITS=384 or 512 with the default.
So, in short, unless GPU memory is tight, use VBITS=384 and the default block_nnz on V100, and likely on A100 as well.
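For anyone wanting to check whether a given matrix fits, here is a rough back-of-the-envelope sketch built only from the sizes quoted above. It is not msieve code: the function names are made up for illustration, the block count is assumed to be ceil(nnz / block_nnz) (block_nnz is only an estimate, so the real count may differ), and other allocations are ignored, so treat the result as a lower bound.

```python
import math

DEFAULT_BLOCK_NNZ = 1_750_000_000  # default block_nnz quoted in the post
FETCH_BYTES = 64                   # default global memory fetch size quoted in the post


def gpu_bytes(n, nnz, vbits, block_nnz=DEFAULT_BLOCK_NNZ):
    """Estimate GPU memory (bytes) for an approximately n x n matrix with nnz nonzeros."""
    if block_nnz >= 2**31:
        raise ValueError("each block must have fewer than 2^31 nonzeros")
    # Assumption: blocks are filled up to block_nnz nonzeros each.
    num_blocks = max(1, math.ceil(nnz / block_nnz))
    row_offsets = 4 * (n + 1) * num_blocks  # one 4*(N+1)-byte array per block
    col_indices = 4 * nnz                   # 4 bytes per nonzero, summed over blocks
    matrix = 2 * (row_offsets + col_indices)  # normal matrix plus transpose
    vectors = 7 * n * vbits // 8
    return matrix + vectors


def fetch_utilization(vbits, fetch_bytes=FETCH_BYTES):
    """Fraction of one memory fetch that a single VBITS-wide vector entry uses."""
    return min(1.0, (vbits / 8) / fetch_bytes)


def pick_vbits(n, nnz, gpu_ram_bytes, candidates=(384, 256, 128)):
    """Prefer 384, then fall back to smaller values; None if nothing fits."""
    for vbits in candidates:
        if gpu_bytes(n, nnz, vbits) <= gpu_ram_bytes:
            return vbits
    return None


if __name__ == "__main__":
    # Hypothetical matrix, chosen only to exercise the arithmetic.
    n, nnz = 30_000_000, 3_000_000_000
    for vbits in (128, 256, 384, 512):
        gib = gpu_bytes(n, nnz, vbits) / 2**30
        print(f"VBITS={vbits}: ~{gib:.1f} GiB, fetch use {fetch_utilization(vbits):.0%}")
    print("pick for a 32 GiB card:", pick_vbits(n, nnz, 32 * 2**30))
```

The candidate order in pick_vbits just encodes the advice above: 384 when it fits, smaller values only when memory is tight.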