NVidia Meeting 2015-3-4

== Agenda ==
* Weds, March 4th, 1-5pm: VTK-m design review (Ken Moreland)
* Thurs, March 5th, 8am-noon: updates from NVIDIA

== Design review ==
=== Issues raised in design review ===
==== How to handle multiple devices with one host? ====
We discussed the VTK-m strategy for supporting multiple devices from one host (i.e., a single node of Summit).
Options presented were:
* one MPI task for each device (i.e., multiple MPI tasks per node)
** minus: may be lots of MPI tasks
** minus: may be incongruent with the sim code's usage of MPI
** minus: hard boundaries between devices
** plus: easy to implement
* one MPI task per node, with (for example) threading to manage access to multiple devices (see the sketch after this list)
** plus: fewer MPI tasks
** plus: more likely to be congruent with the sim code's usage of MPI
** minus: hard boundaries between devices (??)
** plus/minus: implementation easier?  (depends on details)
* one MPI task per node, devices are treated as one giant device
** plus: fewer MPI tasks
** minus: could lend itself to inefficient patterns (reaching across device memories)
* one MPI task per node, devices are knowledgeable of other devices and can coordinate with each other
** plus: fewer MPI tasks
** plus: more likely to be congruent with the sim code's usage of MPI
** plus: no boundaries between devices
** minus: presumably a big implementation effort
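As a point of reference for the second option, here is a minimal sketch of one host thread per device; the kernel, names, and sizes are illustrative placeholders, not a proposed VTK-m API.

<syntaxhighlight lang="cpp">
// Illustrative sketch of option 2: one MPI rank per node, one host
// thread per GPU. Each thread binds to its own device and launches
// work independently. Names and sizes here are hypothetical.
#include <cuda_runtime.h>
#include <thread>
#include <vector>

__global__ void Process(float* data, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { data[i] *= 2.0f; }
}

void WorkOnDevice(int device, int n)
{
  cudaSetDevice(device); // binds this host thread to one GPU
  float* d = nullptr;
  cudaMalloc(&d, n * sizeof(float));
  Process<<<(n + 255) / 256, 256>>>(d, n);
  cudaDeviceSynchronize();
  cudaFree(d);
}

int main()
{
  int numDevices = 0;
  cudaGetDeviceCount(&numDevices);
  std::vector<std::thread> threads;
  for (int d = 0; d < numDevices; ++d)
  {
    threads.emplace_back(WorkOnDevice, d, 1 << 20);
  }
  for (std::thread& t : threads) { t.join(); }
  return 0;
}
</syntaxhighlight>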
==== What use case are we optimizing for? ====
We discussed which use case we are optimizing for.  It was observed that optimizing for one may be in tension with the other.
* All data located on the device, with memory never (or rarely) transferred from device to host.
** This would be consistent with in situ usage.
* Data located on the host (with big memory) and streamed to the device (which has small memory); see the sketch after this list.
** This is consistent with post-processing usage (at least, this could be argued).
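A minimal sketch of the second (post-processing) pattern, processing a host-resident array one device-sized chunk at a time; sizes and names are hypothetical.

<syntaxhighlight lang="cpp">
// Illustrative sketch: the host holds the full array; the device
// processes it one chunk at a time. Sizes and names are hypothetical.
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

__global__ void Process(float* chunk, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { chunk[i] += 1.0f; }
}

int main()
{
  const size_t total = 1 << 28;     // large host-resident array
  const size_t chunkSize = 1 << 20; // fits comfortably in device memory
  std::vector<float> host(total, 0.0f);

  float* d = nullptr;
  cudaMalloc(&d, chunkSize * sizeof(float));
  for (size_t offset = 0; offset < total; offset += chunkSize)
  {
    int n = static_cast<int>(std::min(chunkSize, total - offset));
    cudaMemcpy(d, host.data() + offset, n * sizeof(float), cudaMemcpyHostToDevice);
    Process<<<(n + 255) / 256, 256>>>(d, n);
    cudaMemcpy(host.data() + offset, d, n * sizeof(float), cudaMemcpyDeviceToHost);
  }
  cudaFree(d);
  return 0;
}
</syntaxhighlight>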
==== Asynchrony / streams ====
* using EAVL for comparison:
** in EAVL, you could queue up operations into a Plan, then execute the plan, which would do all the kernel launches
** in theory, we could split these kernels into streams based on their dependencies
** e.g., if we use arrays a, b, and c in the plan, but c isn't needed until kernel 5, we could send c down while kernels 1-4 execute
** and, if kernel 7 were the final stage of a reduction that used only 1 SM, then if kernel 8 were in a separate stream, the card could start executing kernel 8 before 7 finished
* if we could do some batching up of worklets (kernel launches) like this in VTK-m, i.e., put it all into a plan before executing it, then that may give us enough information to do this kind of asynchronous scheduling (see the sketch below)
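To make the stream idea concrete, a minimal CUDA sketch of overlapping a host-to-device copy with independent kernel launches via two streams; the kernels and names are placeholders, not what EAVL or VTK-m actually does.

<syntaxhighlight lang="cpp">
// Illustrative sketch: kernels 1-4 (using only array a) run in one
// stream while array c is copied down in another; an event makes
// kernel 5 wait for the copy. Kernel bodies are placeholders.
#include <cuda_runtime.h>

__global__ void Kernel(float* data, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) { data[i] += 1.0f; }
}

int main()
{
  const int n = 1 << 20;
  float *a, *c, *hostC;
  cudaMalloc(&a, n * sizeof(float));
  cudaMalloc(&c, n * sizeof(float));
  cudaMallocHost(&hostC, n * sizeof(float)); // pinned, so the copy can overlap
  for (int i = 0; i < n; ++i) { hostC[i] = 1.0f; }

  cudaStream_t compute, copy;
  cudaStreamCreate(&compute);
  cudaStreamCreate(&copy);
  cudaEvent_t cReady;
  cudaEventCreate(&cReady);

  // Send c down while kernels 1-4 execute in the compute stream.
  cudaMemcpyAsync(c, hostC, n * sizeof(float), cudaMemcpyHostToDevice, copy);
  cudaEventRecord(cReady, copy);
  for (int k = 1; k <= 4; ++k)
  {
    Kernel<<<(n + 255) / 256, 256, 0, compute>>>(a, n);
  }
  // Kernel 5 needs c: make the compute stream wait for the copy.
  cudaStreamWaitEvent(compute, cReady, 0);
  Kernel<<<(n + 255) / 256, 256, 0, compute>>>(c, n);

  cudaDeviceSynchronize();
  cudaEventDestroy(cReady);
  cudaStreamDestroy(compute);
  cudaStreamDestroy(copy);
  cudaFreeHost(hostC);
  cudaFree(a);
  cudaFree(c);
  return 0;
}
</syntaxhighlight>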
==== CUDA JIT for dynamic array handle ====
* Steve Parker mentions that we might be able to JIT a 3-line template instantiation to avoid pre-compiling all 3^numargs combinations of concrete types (e.g., 3 types for float/double/int); a sketch follows
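A minimal sketch of the idea using NVRTC runtime compilation (assuming NVRTC is available in the toolkit; the worklet source and names are hypothetical):

<syntaxhighlight lang="cpp">
// Illustrative sketch: compile a tiny template instantiation at
// runtime with NVRTC instead of pre-instantiating every type
// combination. The worklet source and names are hypothetical.
#include <nvrtc.h>
#include <cstdio>
#include <vector>

int main()
{
  // The "3-line" instantiation, generated with concrete types at runtime.
  const char* source =
    "template <typename T>\n"
    "__global__ void Worklet(T* data, int n)\n"
    "{ int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
    "  if (i < n) { data[i] *= T(2); } }\n"
    "template __global__ void Worklet<float>(float*, int);\n";

  nvrtcProgram prog;
  nvrtcCreateProgram(&prog, source, "worklet.cu", 0, nullptr, nullptr);
  nvrtcResult res = nvrtcCompileProgram(prog, 0, nullptr);
  if (res != NVRTC_SUCCESS) { std::printf("compile failed\n"); return 1; }

  size_t ptxSize;
  nvrtcGetPTXSize(prog, &ptxSize);
  std::vector<char> ptx(ptxSize);
  nvrtcGetPTX(prog, ptx.data());
  nvrtcDestroyProgram(&prog);

  // The PTX would then be loaded and launched through the CUDA driver
  // API (cuModuleLoadData / cuModuleGetFunction / cuLaunchKernel).
  std::printf("generated %zu bytes of PTX\n", ptxSize);
  return 0;
}
</syntaxhighlight>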
==== Virtual functions instead of combinatorial code to get to concrete types ====
* Steve: the cost of a function call on the GPU is very low (assuming all threads do the same thing) -- only one instruction (see the sketch below)
** but: the register cost may be high enough to cause some penalty
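A minimal sketch of a device-side virtual call in place of combinatorial template expansion; the types and names are illustrative (and here the compiler could devirtualize, since the dynamic type is visible; in real use the base reference would come from code that doesn't know the concrete type).

<syntaxhighlight lang="cpp">
// Illustrative sketch: virtual dispatch on the device instead of
// compiling one concrete code path per type. The object must be
// constructed on the device so its vtable pointer is device-resident.
#include <cuda_runtime.h>
#include <cstdio>

struct Portal
{
  __device__ virtual float Get(int i) const = 0;
};

struct FloatPortal : Portal
{
  float* Data;
  __device__ FloatPortal(float* data) : Data(data) {}
  __device__ float Get(int i) const override { return Data[i]; }
};

__global__ void Sum(float* data, int n, float* result)
{
  FloatPortal portal(data);   // constructed on the device
  const Portal& p = portal;   // calls below dispatch through the vtable
  if (threadIdx.x == 0 && blockIdx.x == 0)
  {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) { sum += p.Get(i); }
    *result = sum;
  }
}

int main()
{
  const int n = 8;
  float *data, *result;
  cudaMalloc(&data, n * sizeof(float));
  cudaMalloc(&result, sizeof(float));
  cudaMemset(data, 0, n * sizeof(float));
  Sum<<<1, 32>>>(data, n, result);
  float sum = 0.0f;
  cudaMemcpy(&sum, result, sizeof(float), cudaMemcpyDeviceToHost);
  std::printf("sum = %f\n", sum);
  cudaFree(data);
  cudaFree(result);
  return 0;
}
</syntaxhighlight>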
==== Which unstructured connectivity representation? ====
* note that sorting cells may hurt our locality; cells are probably sorted by locality to begin with
* doing separate launches on groups of cell types cuts down not just on divergence, but potentially on register usage and cache memory usage (e.g., iso table lookups for only one cell type)
* we could start with vtk-style and do an RLE search to group (see the sketch after this list), or we could start with grouped and allow multiplicity to get back to vtk-style
* as such, it's largely inconclusive (and maybe not critical) which internal style we use -- we still need a "single-functor-for-all-cell-types" version of a worklet (in the event that cell types are essentially random), and we can optimize to a "one-functor-for-each-cell-type" version if it's a big enough speedup
* it's hard to know whether the benefits are worth it
* that said, memory bandwidth is probably our critical parameter -- storing the reverse lookup more efficiently (e.g., cellGroup[ncells/1024] versus mapCellToIndex[ncells]) could be a significant savings
* we should probably sample actual sim codes to figure out which layout maps to most codes
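To illustrate the RLE-grouping direction, a minimal host-side sketch of run-length-encoding a VTK-style shapes array into per-type groups; the struct and names are hypothetical, not VTK-m's connectivity classes.

<syntaxhighlight lang="cpp">
// Illustrative sketch: run-length-encode a VTK-style shapes array into
// (cellType, start, count) groups so each group can get its own,
// divergence-free launch. Names and layout are hypothetical.
#include <cstdio>
#include <vector>

struct CellGroup
{
  unsigned char CellType;
  int Start; // first cell index in the group
  int Count; // number of consecutive cells of this type
};

std::vector<CellGroup> GroupCells(const std::vector<unsigned char>& shapes)
{
  std::vector<CellGroup> groups;
  for (int i = 0; i < static_cast<int>(shapes.size()); ++i)
  {
    if (groups.empty() || groups.back().CellType != shapes[i])
    {
      groups.push_back({shapes[i], i, 1});
    }
    else
    {
      ++groups.back().Count;
    }
  }
  return groups;
}

int main()
{
  // Example: tets followed by hexes (VTK cell type ids 10 and 12).
  std::vector<unsigned char> shapes = {10, 10, 10, 12, 12};
  for (const CellGroup& g : GroupCells(shapes))
  {
    // In VTK-m, this is where a per-cell-type worklet would be launched.
    std::printf("type %d: cells [%d, %d)\n", g.CellType, g.Start, g.Start + g.Count);
  }
  return 0;
}
</syntaxhighlight>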
== Next steps ==
We discussed having a near-term NVIDIA+Maynard meeting to optimize infrastructure.
We also discussed having a second meeting with NVIDIA, and having it double as our annual PI meeting.
Options discussed were:
* over the summer, in the Bay Area
* in mid-September, since this is a good time for Hank's Ph.D. students to travel
* at Vis (this wasn't discussed much, but Berk and I discussed it again, and it seems like it might be a match)
At the second meeting we wanted to dive in on a small set of algorithms.  We discussed a likely "Top 5" (there are actually 6):
# isosurfacing
# cell-data-to-point-data / point-data-to-cell-data
# surface rendering (via ray-tracing?)
# external faces
# volume rendering
# streamlines
