Sparse gpuArray indexing limitations

When trying to extract a sub-matrix of a sparse gpuArray, I get,
>> S=gpuArray(sprand(100,100,0.1));
>> S(1:5,1:5)
Sparse gpuArray matrices only support referencing whole rows or whole columns.
OK, but here, I am indexing a block of complete rows and get the same error,
>> S(1:5,:)
Sparse gpuArray matrices only support referencing whole rows or whole columns.
Aside from this, what is the internal hurdle that makes subsrefing a sparse matrix so hard on the GPU? This has been a limitation for quite a while now.

3 Commenti

I do not have easy access to a gpu system at this time.
I wonder if
S(1,:)
S(:,1)
would work? That is, perhaps the limit is on indexing more than one complete row or column ?
Yes, single row/column extraction works,
>> S=gpuArray.speye(3);
>> S(1,:)
ans =
1×3 sparse gpuArray double row vector (1 nonzero)
(1,1) 1
>> S(:,1)
ans =
3×1 sparse gpuArray double column vector (1 nonzero)
(1,1) 1
But I wonder why that would be the limit. If you can do one, surely you should be able to do more?
dpb
dpb circa 4 ore fa
Modificato: dpb circa un'ora fa
Theoretically, yes it could be done but creates the complexity issue noted before so the ability isn't directly supported. Reading the gpuArray doc on <Working With Sparse GPU Arrays> again and parsing the addressing discussion more carefully, it appears to be a restriction documented by example and the implication that index is a single value, not by an explicit statement. "Sparse GPU arrays only support referencing whole rows or columns by index. For example ..." uses 5 in that example but doesn't directly state that "index" cannot be a colon expression. Like you apparently, @Matt J, I initially presumed the index could be generalized as with addressing in-memory arrays, sparse or not.
If this operation is needed, it would have to be implemented by extracting each row/column individually and catenating them in local memory as assigning values to sparse GPU arrays by index is not supported. The result would then have to be moved to a GPU array from memory. It's not clear from the documentation whether the writing restrictions precludes something like
B=gpuArray(sparse([])); % empty sparse GPU array????
for i=1:5
B=[B;S(i,:)];
end
but I suspect they do/will.
Syntax issues aside, even if it were to work, constructing a new sparse matrix dynamically would require the GPU to rebuild the entire CSR/CSC data structure for every iteration. If it were toy-sized arrays this might be feasible, but presuming real cases would be large enought to actually need sparse instead of dense arrays, the reconstruction sequentially would undoubtedly turn out to be prohibitively slow.
It's hard to wrap one's head around, but the GPU CSC/CSR storage scheme is efficient for the given array it is constructed for, but it has to be recomputed from scratch to build a new array.

Accedi per commentare.

 Risposta accettata

dpb
dpb il 3 Lug 2026 alle 21:41
Modificato: dpb 32 minuti fa
EDIT 5Jul26 -- Initial conclusion it is a bug that the documentation doesn't restrict the index to a single value instead of allowing a general indexing expression was in error given later more careful reading. It is documented implicitly by the example using a specific index value (5) and the lack of an example illustrating the multiple-valued index expression. A more explicit statement might be a useful doc enhancement with, perhaps a note in the Tips section on why this is simply not feasible.
--dpb
Technically it appears to be a bug based on the doc at gpuArray doesn't have any explicit restrictions on addressing by either full rows or columns, agreed. However, I suspect this has uncovered limitations on actual useage owing to the unique nature of how the GPU implements sparse storage. Details are lacking in the MATLAB doc's, but I venture it's owing to how the GPU stores the sparse matrix; it uses Compressed Sparse Column (CSC) or Compressed Sparse Row (CSR) layouts and since zeros are not stored, storage is represented by three arrays:
  1. Values: The actual non-zero numbers.
  2. Inner Indices: The exact row indices (for CSC) or column indices (for CSR).
  3. Outer Pointers: A map specifying the physical offset index where each new column or row starts in the arrays above.
Because different rows/columns have a varying number of non-zero elements, the distance between elements is completely dynamic. there is no fixed "stride" to allow for direct addressing without the lookup.
Consequently, trying to pull a sub-matrix (e.g., A(2:5, 2:5)), requires thousands of parallel threads to fetch that data simultaneously. To get to the initial element A(2, 2) a GPU thread must:
  1. Look up the Outer Pointer for column 2.
  2. Read the start pointer for column 3.
  3. Linear-search through every non-zero index inside that block just to see if row 2 exists.
Because the data are tightly packed, a thread handling row 3 cannot know where its data begins until the threads handling rows 0, 1, and 2 have resolved how many elements they contain. This creates a strictly sequential data dependency and forcing a massively parallel architecture like a GPU to execute a sequential pointer-chasing search destroys performance, making arbitrary sub-array slicing structurally infeasible.
The reason MATLAB allows full rows or full columns (depending on if it is utilizing CSR or CSC under the hood) is because the starting and ending bounds of a full slice are explicitly defined by the Outer Pointers array. The GPU can read the single pointer for the requested column and instantly copy that entire contiguous chunk of memory into a new vector.
I don't know if the MATLAB implementation documents whether CSR or CSC is used or if it may depend upon the structure of the sparsity, perhaps. I could see that when creating a particular sparse GPU array, it must choose one or the other and it then could run into the indexing problem if one asks for the opposite slice direction.
I wonder if the example might be an illustration of that; if you try the other direction
S(:,1:5)
on the same array it would succeed?
In short, GPUs are good if you can load 'em up and then they can operate on the data in parallel; anything else is likely to be nothing but a bottleneck that totally defeats the purpose.

2 Commenti

It makes me wonder then, how sparse matrix multiplication A*B is done if submatrices are so difficult to extract. Seemingly, each thread multiplies a whole row of A by a whole column of B?
dpb
dpb circa 23 ore fa
Modificato: dpb circa 2 ore fa
It uses the NVIDIA libraries which have specialized code designed specifically to use the capabilities built into the NVIDIA hardware and algorithms that calculate partial results using outer products and merge them in shared memory. The architecture also uses internal structured formats which condense storage and that align smaller sparse sections with the GPU's dense array storage tensor core hardware. So, as I understand it, the sparse matrices eventually get mapped indirectly from sparse to dense(+) in a zillion little blocks. Then parallelism gives the performance.
<A paper that may provide more insight> than the summaries I've pulled from and edited down significantly from NVIDIA developer forum and just asking google for background info.
It's a real feat of engineering, indeed, inside there.
In going back to the original question and trying to confirm if the previous surmise was at all on track, does the full row index form work on the given array for you? If that did, I think it would confirm the particular internal sparse storage used; I've yet to find anything that indicates it it always uses one or the other inside, though.
If it worked that way and not the other, it would also be reason that the documentation should be updated to reflect that slices only work in one direction based on the internal storage orientation and perhaps restricted to only one or the other universally if there is always a preferred storage orientation. The error handling might also need some revisting.
(+) I presume but did not find it stated explicitly without more extensive digging that only subarrays with a nonzero element are mapped, not reconstructing the entire array since the others wouldn't add anything to the outer product being calculated.
ADDENDUM
I see in further more careful parsing of the MATLAB doc on sparse GPU array addressing by index that the single/row constraint is documented so the above making the assumption that the give colon indexing colon expression should work was in error and so the question of orientation affecting the result is also moot.

Accedi per commentare.

Più risposte (0)

Prodotti

Release

R2024b

Richiesto:

il 3 Lug 2026 alle 19:43

Modificato:

dpb
32 minuti fa

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by