So, I haven't yet watched all of these, but when the program came out this was my hit-list of highlights for this year's free NVIDIA GPU Technology Conference (GTC), which ran April 12-16, 2021. I may have simplified the titles a touch...
Content is only available on-demand until May 11, 2021.
GPU and Dev Tools
- How GPU Computing Works (Modern-Day)
- Stephen Jones - CUDA Architect, NVIDIA
- We have lots of FLOPS - latency in accessing memory is now the common bottleneck
- Increasing the number of threads is a great way to hide this (and is how GPUs have been architected to work around latency)
- For operations where all data interacts with all other data (eg. matrix multiplication), Tensor Cores are designed specifically for large (> 400x400) matrix multiplications, raising the upper limit - not sensible for other operations, but one way to handle (some?) all-to-all operations.
- CUDA - New Features and Beyond
- Stephen Jones - CUDA Architect, NVIDIA
- Concurrent Algorithms on CUDA
- CUDA - Latest Developer Tools
- GPU Profiling with NVIDIA Nsight
- Debugging & Analyzing CUDA Correctness
- NVIDIA C++ Standard Library
- Bryce Lelbach - HPC Programming Models Architect, NVIDIA
- Currently, NVCC ships libcu++ (libcudacxx), using the cuda:: and cuda::std:: namespaces. Requires marking __host__ and __device__. Does not cover the whole standard library - the focus is on concurrency, clocks, syscalls, and items commonly re-implemented (type_traits, tuples)
- Adding NVC++ (libnv++) - to be released soon. Implements std:: (not ABI-compatible with gcc)
- Good support for atomic (avoid using 'volatile') and memcpy_async (asynchronous memory copying), and prepares for future C++20 arrive/wait functionality
- For examples of parallelized data structures (eg. hash maps), see cuCollections
- Future work covers c++2* features, including parallel algorithms for ranges, executors, and parallel linear algebra algorithms (based on BLAS)
- Future of Standard C++ and CUDA C++
- Bryce Lelbach - HPC Programming Models Architect, NVIDIA
- c++20 last year: modules, coroutines, concepts, ranges, scalable synchronization
- c++2* looking at: executors (managing and orchestrating execution), static reflection (eg. compile-time knowledge of parameters, member functions), and pattern matching (regex?); also extended float types - float16_t, etc.
- The goal is a language that easily harnesses parallelization (code that looks serial but is parallelized under the hood); eg. c++17 introduced std::execution::par for flagging parallel execution
- Panel FAQ covering a variety of specific questions on c++2* and nv*
Ray Tracing
- Vulkan Ray-Tracing Tutorial
- NVIDIA's Ray Tracing Developer Tools for eg. DirectX Raytracing (DXR) and Vulkan Ray Tracing
- Future of GPU Ray Tracing
Cloud
- Intro to Cloud XR and XR Streaming
- Cloud for VR Experiences
- Global Cloud Streaming for AR
- Azure Remote Rendering
Math
- New Developments in NVIDIA Math Libraries
- FFT on the GPU
- GPU Acceleration for Linear Algebra in Finite Element Software
- Tensor Core Accelerated Math Libraries for Dense and Sparse Linear Algebra in AI and HPC
- Graph Colouring on GPUs
- Fluid Dynamics on the GPU in C++; Parallelized