The GPU and You

We like to think of computers as being similar to the programming language models we're used to using. For example, if we're used to Python, we imagine that the computer works one line at a time, executing an instruction and moving on to the next one, updating objects' data in memory as it goes. If we are accustomed to C, we might be familiar with the idea of heap memory as being like a big array of values, with stack memory being a much smaller array. If you have done any work with parallel processing, you might generalize this to multiple threads of control all executing instruction streams simultaneously. While these types of models are roughly accurate for the CPU and RAM in your computer, every model is fundamentally a useful lie, and these are no different.

Maybe you've heard of the term GPU, or Graphics Processing Unit, which is responsible for feeding the display hardware at a rate of 60 or more images per second. Nowadays, nearly every computer has some kind of GPU—whether it takes up space in the physical die of the CPU or is a separate, discrete chip or card plugged into a PCIe slot. The GPU is in effect its own tiny computer inside your computer: it has its own RAM, its own instruction fetching and decoding logic, and everything else. From the most humble Raspberry Pi to the M1 Mac to the latest gaming PCs, our computers are in fact tiny distributed systems synchronizing work across several purpose-built units.

Both the operating system and individual applications can connect to the GPU and interact with it to draw stuff to the screen; we'll imagine for this class that both work in fundamentally the same way. When we want to send commands to the graphics card, we use a graphics API like Metal (on the Mac), DirectX (on Windows), or cross-platform alternatives like OpenGL or Vulkan. Generally, we obtain a surface from the windowing system (though the specific name varies from API to API) and then use the graphics API to tell the GPU to draw into that surface. Graphics APIs integrate with the operating system to transmit data to and from the specific GPU while presenting a uniform interface to applications that hides some of this complexity. The trend, however, has been to hide less and less complexity, shifting GPU-friendly work away from the CPU to save power and time.

GPUs vs CPUs

GPUs differ from CPUs in an important way: they are much less flexible. This tradeoff allows them to be shockingly more efficient, both in terms of raw speed and in terms of power consumption per unit of work. A multi-core CPU has a large variety of features—independent cache memory per core, independent instruction streams and decoding per core, often independent ALUs and memory buses and other components per core—which gives it extreme flexibility. Different cores can be doing totally different operations on completely different data simultaneously. These components both take up physical space and use energy, so there are limits to how many cores are practical. A typical multi-core CPU features between 2 and 8 cores, though consumer parts are now available with anywhere from 16 to 64 depending on price and architecture.

It's hard to compare the term core across radically different types of devices, but a typical modern GPU will have thousands to tens of thousands of cores. GPU cores share instruction fetching and decoding in large blocks; instead of having 8 separate cores fetching from and decoding and executing 8 different instruction streams, GPUs provide the same instruction stream and program counter for every core in the block. How do they get any work done, then?

The key is that each core is working on a different chunk of the same problem, and every chunk needs the same thing done to it. This way of operating is called single instruction, multiple data processing (SIMD; CPUs also have SIMD operations, but these are much narrower than what's available on a GPU). Let's imagine we want to double the value of every element of an array (say, to turn [1,2,3,4,5] into [2,4,6,8,10]). A conventional CPU would work one element at a time—first reading the 1, doubling it to obtain 2, and writing the 2; then incrementing the array index, reading the 2, doubling it into 4, writing the 4, and so on. A GPU would instead set up each of five processing units with an offset into the array (0, 1, 2, 3, 4) and step through the following program:

  1. Read the value of the array at my offset.
  2. Double the read value.
  3. Write the doubled value into the array at my offset.

The program itself is much shorter, but even more importantly the number of actual computation steps is smaller, since all five units are doing those three steps simultaneously (rather than waiting to double the 1, then the 2, then the 3, and so on). Even if there is some cost to sending the data to the GPU and retrieving it back from the GPU, this clearly points to some substantial opportunities for speeding up our programs. In many cases, we can also get energy savings this way—fewer steps means less time spent running at full power.
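
To make this concrete, here is a minimal sketch of that doubling program written as a compute shader in WGSL, the shading language we'll meet later in this lab. The names (double_main, data) and the workgroup size are illustrative choices, not something we'll actually run in class:

// Hypothetical doubling kernel: each invocation handles the element at its own offset.
@group(0) @binding(0)
var<storage, read_write> data: array<f32>;

@compute @workgroup_size(64)
fn double_main(@builtin(global_invocation_id) id: vec3<u32>) {
    // Guard against running off the end when the array length
    // isn't a multiple of the workgroup size.
    if (id.x < arrayLength(&data)) {
        data[id.x] = data[id.x] * 2.0;
    }
}

Each invocation plays the role of one of the five processing units above: the GPU launches one invocation per element (rounded up to a multiple of the workgroup size), and they all step through the same read-double-write program in lockstep.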

Why are GPUs like this? Why should dedicated hardware for doing 3D graphics end up optimized for doing massively parallel homogeneous computations over huge streams of data? The answer is a happy historical accident.

The Graphics Pipeline

The traditional purpose of a GPU has been to render 3D graphics (the name is a good hint!). Specifically, GPUs take streams of vertices and compose them into triangles, which are then rasterized into sets of fragments in screen space, which are each shaded, depth-ordered, and finally rendered as pixels on the display.

In more depth:

  1. The application tells the graphics API to reserve some memory in one or more buffers on the GPU (call these buffers A, B, and C for now).

The next four steps are performed by building up a queue of commands on the CPU, which the API handles:

  1. The application fills these memory buffers with data—maybe A is filled with vertex data defining polygons, B is filled with image data (a texture), and C is filled with vertex data saying which points in the texture correspond to which vertices.
  2. The application configures the GPU to draw to a particular output target (the color target, i.e. the screen; the buffer backing this target might be reserved by the API itself), along with how to map vertex coordinates onto screen coordinates (a vertex shader program) and how to colorize fragments (the fragment shader program).
  3. The application tells the GPU to bind buffer A as vertex positions, to sample textures from buffer B, and to progress through buffer C at the same rate as buffer A.
  4. The application tells the GPU to draw all the vertices in buffer A.

The following four steps are done in massively parallel fashion on the GPU, triggered by the API and driver implementations:

  1. The GPU runs the vertex shader on each vertex in buffer A to compute "final" vertices in screen coordinates.
  2. The GPU stitches together triangles from successive triples of vertices.
  3. The GPU figures out, for each pixel of screen space, which part of each triangle is overlapping that pixel. These parts are called fragments.
  4. Each fragment is shaded (colorized, in this case) by running the fragment shader on each fragment. The fragment shader is called with an interpolated position between three points given in buffer C, which is used to sample the texture; this sampled color is written to an output (in this case, the color target).

Finally:

  1. After all the fragments are shaded, the GPU and driver present the finished image while the application begins work on the next frame.

To sum up, the main stages that happen on the GPU are vertex shading, triangulation, rasterization, and fragment shading. The data and programs for these stages are fed in by the application code in advance of drawing, and drawing commands are issued and completed in batches. In simple applications, the program will draw what it needs to and then wait for the GPU to finish before proceeding with more work, or draw one thing, wait, draw another, wait, and so on. In more complex applications, multiple command queues can be in flight simultaneously, and while one image is being presented another one can be prepared. GPUs can be used not only to render graphics but also to compute data, and that computed data can be kept in GPU RAM and used for rendering later.

Lab: The Humble Textured Triangle

Phew. That's a lot of information. It turns out GPUs are pretty complicated, and it's hard to do anything without a few hundred lines of code. In this course we'll use a cross-platform graphics API called WGPU based on the WebGPU standard but compatible with a variety of web and native back-end drivers. Even though it has "web" in the name, it's inspired by Vulkan and makes similar tradeoffs of verbosity versus performance. We'll begin by making sure we can run a WGPU demo, then break that demo down into pieces and rearrange them.

Let's make a new Rust project: cargo new --bin triangle --edition 2021.

Getting WGPU Going

We'll add some dependencies we need:

[package]
name = "triangle"
version = "0.1.0"
authors = ["Joseph C. Osborn <joseph.osborn@pomona.edu>"]
edition = "2021"

[dependencies]
# Our graphics API
wgpu = "0.17"
# Opening windows in a cross-platform way
winit = "0.28"
# Organized logging output, WGPU uses this for errors and info
log = "0.4"
env_logger = "0.10"
# Pollster is a very simple async runtime. We can't ignore async since we want to be web-compatible.
pollster = "0.3.0"

Download the files main.rs and shader.wgsl from this URL and put them into triangle/src/, replacing main.rs as you go. Run it with cargo run and make sure you see a nice pretty triangle!

WGPU Triangle Demo

We'll start by showing the shaders used to render our red triangle. Don't be too intimidated by all the at-signs and angle brackets—we don't need to understand shaders deeply yet. For now, let's just try to identify how these pieces fit into the graphics pipeline from before.

@vertex
fn vs_main(@builtin(vertex_index) in_vertex_index: u32) -> @builtin(position) vec4<f32> {
    let x = f32(i32(in_vertex_index) - 1);
    let y = f32(i32(in_vertex_index & 1u) * 2 - 1);
    return vec4<f32>(x, y, 0.0, 1.0);
}

@fragment
fn fs_main() -> @location(0) vec4<f32> {
    return vec4<f32>(1.0, 0.0, 0.0, 1.0);
}

Take a minute with a buddy and figure out these pieces of information (write them down in your lab report!):

  1. Which is the vertex shader, and which is the fragment shader?
  2. This vertex shader doesn't take in vertex information from a vertex buffer, but only vertex indices. Which index produces the top middle point of the triangle?
  3. How would you change the color of the output triangle?
  4. Does this file also explain why the background is green?

So this shader program will run on the GPU, and its inputs are evidently just the numbers zero, one, and two in sequence. We have to look at the CPU source code in main.rs to see how that happens!

I've added a bunch of comments to the triangle example below, so we'll just go through it line by line.

use std::borrow::Cow;
use winit::{
    event::{Event, WindowEvent},
    event_loop::{ControlFlow, EventLoop},
    window::Window,
};

// In WGPU, we define an async function whose operation can be suspended and resumed.
// This is because on web, we can't take over the main event loop and must leave it to
// the browser.  On desktop, we'll just be running this function to completion.
async fn run(event_loop: EventLoop<()>, window: Window) {
    let size = window.inner_size();

    // An Instance is an instance of the graphics API.  It's the context in which other
    // WGPU values and operations take place; we only need one of them.
    // Its implementation of the Default trait automatically selects a driver backend.
    let instance = wgpu::Instance::default();

    // From the OS window (or web canvas) the graphics API can obtain a surface onto which
    // we can draw.  This operation is unsafe (the caller must guarantee that the window outlives the surface)
    // and it could fail (if the window can't provide a rendering destination).
    // The unsafe {} block allows us to call unsafe functions, and the unwrap will abort the program
    // if the operation fails.
    let surface = unsafe { instance.create_surface(&window) }.unwrap();

    // Next, we need to get a graphics adapter from the instance---this represents a physical
    // graphics card (GPU) or compute device.  Here we ask for a GPU that will be able to draw to the
    // surface we just obtained.
    let adapter = instance
        .request_adapter(&wgpu::RequestAdapterOptions {
            power_preference: wgpu::PowerPreference::default(),
            force_fallback_adapter: false,
            // Request an adapter which can render to our surface
            compatible_surface: Some(&surface),
        })
        // This operation can take some time, so we await the result. We can only await like this
        // in an async function.
        .await
        // And it can fail, so we panic with an error message if we can't get a GPU.
        .expect("Failed to find an appropriate adapter");

    // Create the logical device and command queue.  A logical device is like a connection to a GPU, and
    // we'll be issuing instructions to the GPU over the command queue.
    let (device, queue) = adapter
        .request_device(
            &wgpu::DeviceDescriptor {
                label: None,
                // We don't need to ask for any optional GPU features for our simple example
                features: wgpu::Features::empty(),
                // Make sure we use very broadly compatible limits for our driver,
                // and also use the texture resolution limits from the adapter.
                // This is important for supporting images as big as our swapchain.
                limits: wgpu::Limits::downlevel_webgl2_defaults()
                    .using_resolution(adapter.limits()),
            },
            None,
        )
        // request_device is also an async function, so we need to wait for the result.
        .await
        .expect("Failed to create device");

    // The swapchain is how we obtain images from the surface we're drawing onto.
    // This is so we can draw onto one image while a different one is being presented
    // to the user on-screen.
    let swapchain_capabilities = surface.get_capabilities(&adapter);
    // We'll just use the first supported format; we don't have any reason here to prefer
    // one format over another.
    let swapchain_format = swapchain_capabilities.formats[0];

    // Our surface config lets us set up our surface for drawing with the device
    // we're actually using.  It's mutable in case the window's size changes later on.
    let mut config = wgpu::SurfaceConfiguration {
        usage: wgpu::TextureUsages::RENDER_ATTACHMENT,
        format: swapchain_format,
        width: size.width,
        height: size.height,
        present_mode: wgpu::PresentMode::Fifo,
        alpha_mode: swapchain_capabilities.alpha_modes[0],
        view_formats: vec![],
    };
    surface.configure(&device, &config);

    // Load the shaders from disk.  Remember, shader programs are things we compile for
    // our GPU so that it can compute vertices and colorize fragments.
    let shader = device.create_shader_module(wgpu::ShaderModuleDescriptor {
        label: None,
        // Cow is a "copy on write" wrapper that abstracts over owned or borrowed memory.
        // Here we just need to use it since wgpu wants "some text" to compile a shader from.
        source: wgpu::ShaderSource::Wgsl(Cow::Borrowed(include_str!("shader.wgsl"))),
    });

    // A pipeline layout is sort of like the calling convention for a function: it defines
    // the shapes of the arguments (bind groups and push constants) that will be used for
    // draw calls.
    let pipeline_layout = device.create_pipeline_layout(&wgpu::PipelineLayoutDescriptor {
        label: None,
        bind_group_layouts: &[],
        push_constant_ranges: &[],
    });

    // Our specific "function" is going to be a draw call using our shaders. That's what we
    // set up here, calling the result a render pipeline.  It's not only what shaders to use,
    // but also how to interpret streams of vertices (e.g. as separate triangles or as a list of lines),
    // whether to draw both the fronts and backs of triangles, and how many times to run the pipeline for
    // things like multisampling antialiasing.
    let render_pipeline = device.create_render_pipeline(&wgpu::RenderPipelineDescriptor {
        label: None,
        layout: Some(&pipeline_layout),
        vertex: wgpu::VertexState {
            module: &shader,
            entry_point: "vs_main",
            buffers: &[],
        },
        fragment: Some(wgpu::FragmentState {
            module: &shader,
            entry_point: "fs_main",
            targets: &[Some(swapchain_format.into())],
        }),
        primitive: wgpu::PrimitiveState::default(),
        depth_stencil: None,
        multisample: wgpu::MultisampleState::default(),
        multiview: None,
    });

    // Now our setup is all done and we can kick off the windowing event loop.
    // This closure is a "move closure" that claims ownership over variables used within its scope.
    // It is called once per iteration of the event loop.
    event_loop.run(move |event, _, control_flow| {
        // By default, ask the windowing system to keep the event loop running
        // continuously (polling for new events), so we can keep drawing frames.
        *control_flow = ControlFlow::Poll;
        // Depending on the event, we'll need to do different things.
        // There is some pretty fancy pattern matching going on here,
        // so think back to CSCI054.
        match event {
            Event::WindowEvent {
                // For example, "if it's a window event and the specific window event is that
                // we have resized the window to a particular new size called `size`..."
                event: WindowEvent::Resized(size),
                // Ignoring the rest of the fields of Event::WindowEvent...
                ..
            } => {
                // Reconfigure the surface with the new size
                config.width = size.width;
                config.height = size.height;
                surface.configure(&device, &config);
                // On macOS the window needs to be redrawn manually after resizing
                window.request_redraw();
            }
            Event::MainEventsCleared => {
                // All pending events have been handled, so it's time to draw.
                // Let's get our next swapchain image
                let frame = surface
                    .get_current_texture()
                    .expect("Failed to acquire next swap chain texture");
                // And set up a texture view onto it, since the GPU needs a way to interpret those
                // image bytes for writing.
                let view = frame
                    .texture
                    .create_view(&wgpu::TextureViewDescriptor::default());
                // From the queue we obtain a command encoder that lets us issue GPU commands
                let mut encoder =
                    device.create_command_encoder(&wgpu::CommandEncoderDescriptor { label: None });
                {
                    // Now we begin a render pass.  The descriptor tells WGPU that
                    // we want to draw onto our swapchain texture view (that's where the colors will go)
                    // and that there's no depth buffer or stencil buffer.
                    let mut rpass = encoder.begin_render_pass(&wgpu::RenderPassDescriptor {
                        label: None,
                        color_attachments: &[Some(wgpu::RenderPassColorAttachment {
                            view: &view,
                            resolve_target: None,
                            ops: wgpu::Operations {
                                // When loading this texture for writing, the GPU should clear
                                // out all pixels to a lovely green color
                                load: wgpu::LoadOp::Clear(wgpu::Color::GREEN),
                                // The results of drawing should always be stored to persistent memory
                                store: true,
                            },
                        })],
                        depth_stencil_attachment: None,
                    });
                    // And this is where the magic happens: we tell the driver to set up the GPU
                    // with our drawing program (our render pipeline)...
                    rpass.set_pipeline(&render_pipeline);
                    // Then execute that program to draw vertices 0, 1, and 2 for a single instance
                    // (instancing lets the GPU draw the same vertices over and over again, but with
                    // different auxiliary instance data for each trip through the batch of vertices).
                    // If we had a vertex buffer bound, this would fetch vertex data from that buffer,
                    // but for this example we aren't doing that.
                    rpass.draw(0..3, 0..1);
                }
                // Once the commands have been scheduled, we send them over to the GPU via the queue.
                queue.submit(Some(encoder.finish()));
                // Then we tell the windowing system to present the swapchain image
                // once the GPU is done drawing into it.
                frame.present();
            }
            // If we're supposed to close the window, tell the event loop we're all done
            Event::WindowEvent {
                event: WindowEvent::CloseRequested,
                ..
            } => *control_flow = ControlFlow::Exit,
            // Ignore every other event for now.
            _ => {}
        }
    });
}

// Main is just going to configure an event loop, open a window, set up logging,
// and kick off our `run` function.
fn main() {
    let event_loop = EventLoop::new();
    let window = winit::window::Window::new(&event_loop).unwrap();
    #[cfg(not(target_arch = "wasm32"))]
    {
        env_logger::init();
        // On native, we just want to wait for `run` to finish.
        pollster::block_on(run(event_loop, window));
    }
    #[cfg(target_arch = "wasm32")]
    {
        // On web things are a little more complicated.
        std::panic::set_hook(Box::new(console_error_panic_hook::hook));
        console_log::init().expect("could not initialize logger");
        use winit::platform::web::WindowExtWebSys;
        // On wasm, append the canvas to the document body
        web_sys::window()
            .and_then(|win| win.document())
            .and_then(|doc| doc.body())
            .and_then(|body| {
                body.append_child(&web_sys::Element::from(window.canvas()))
                    .ok()
            })
            .expect("couldn't append canvas to document body");
        // Now we use the browser's runtime to spawn our async run function.
        wasm_bindgen_futures::spawn_local(run(event_loop, window));
    }
}

Phew! There are a ton of little steps to keep track of when working with GPUs. Most of the code is describing how to connect the CPU's state with the GPU's state, how to set up and retrieve the results of GPU operations, and so on. Some of that code is only needed once per program (e.g., initializing the adapter and swapchain), some of it is needed once for each sort of object being drawn (e.g., setting up render pipelines), some of it is needed once or more per frame (initializing a render pass), and some is necessary for every individual object being drawn (pipeline changes and draw calls).

Take a deep breath, shake out your arms and legs, and look for answers to these four questions in the code above. Write them in your lab report.

  1. Write down four jargon terms you didn't know before and briefly define them (if you knew them all, look through the WGPU documentation for more until you get to four).
  2. Can you think of a situation where you would want to use multiple render pipelines within a single render pass?
  3. Can you think of a situation where you would want to use multiple render passes during one frame?
  4. If we wanted to draw two triangles (a quadrilateral) instead of one, in what two places would we need to change the code?

Texturing our Triangle

Now we want to put a pretty picture onto our triangle. First, we need to load the image on the CPU side so we can later send it to the GPU. Make a new folder next to your src folder in the triangle project and call it content. Go ahead and right-click/save-as this image into that new folder.

47.png

Uploading an Image to the GPU

We're going to use a Rust crate called image for our image-loading needs. We can cargo add image to get started on that, but for this crate in particular I like to add a few lines of configuration to our Cargo.toml to make sure image and its key dependencies are compiled with optimizations (following the Cargo book's section on profiles). Add these lines at the end of your Cargo.toml:

[profile.dev.package.backtrace]
opt-level = 3
[profile.dev.package.image]
opt-level = 3
[profile.dev.package.png]
opt-level = 3
[profile.dev.package.adler]
opt-level = 3
[profile.dev.package.miniz_oxide]
opt-level = 3

The image crate provides a very convenient API for loading images from files. The next few code blocks are just informative; you don't need to put them into the file we're working on!

use std::path::Path;
let img = image::open(Path::new("content/47.png")).expect("Should be a valid image at path content/47.png");

We can then convert it into raw RGBA bytes quite simply:

let img = img.to_rgba8();

So that's our image on the CPU, ready to upload to the GPU. In GPU-land, images are represented as textures since in general they may be stretched or sliced across differently shaped objects. We already saw a little bit of texturing-related code above: the TextureView which wrapped around the swapchain image to describe its dimensions and format. In order to use an image as an input for texturing (a thing we can do in a fragment shader), we have to both put the image data on the GPU and define a sampler which describes how to pull color data from the image.

First, we'll show how to allocate a texture resource and copy our image data over to the GPU. This is an asynchronous process that we kick off using the GPU's command queue (we'll use the write_texture function).

let (img_w, img_h) = img.dimensions();
// How big is the texture in GPU memory?
let size = wgpu::Extent3d {
    width: img_w,
    height: img_h,
    depth_or_array_layers: 1,
};
// Let's make a texture now
let texture = device.create_texture(
    // Parameters for the texture
    &wgpu::TextureDescriptor {
        // An optional label
        label: Some("47 image"),
        // Its dimensions. This line is equivalent to size:size
        size,
        // Number of mipmapping levels (to show different pictures at different distances)
        mip_level_count: 1,
        // Number of samples per pixel in the texture. It'll be one for our whole class.
        sample_count: 1,
        // Is it a 1D, 2D, or 3D texture?
        dimension: wgpu::TextureDimension::D2,
        // 8 bits per component, four components per pixel; unsigned values 0-255, normalized to 0.0-1.0, interpreted as sRGB
        format: wgpu::TextureFormat::Rgba8UnormSrgb,
        // This texture will be bound for shaders and have stuff copied to it
        usage: wgpu::TextureUsages::TEXTURE_BINDING | wgpu::TextureUsages::COPY_DST,
        // What formats are allowed as views on this texture besides the native format
        view_formats: &[],
    }
);
// Now that we have a texture, we need to copy its data to the GPU:
queue.write_texture(
    // A description of where to write the image data.
    // We'll use this helper to say "the whole texture"
    texture.as_image_copy(),
    // The image data to write
    &img,
    // How the source image data is laid out in CPU memory
    wgpu::ImageDataLayout {
        // Where in img do the bytes to copy start?
        offset: 0,
        // How many bytes in each row of the image?
        bytes_per_row: Some(4*img_w),
        // We could pass None here and it would be alright,
        // since we're only uploading one image
        rows_per_image: Some(img_h),
    },
    // What portion of the texture we're writing into
    size
);

According to the documentation, this texture write is deferred until "the start of the next submit() call". That will definitely happen before the actual drawing commands are executed, so we're essentially done with setting up our data now.
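
We don't need the copy to happen any sooner than that in this lab, but if you ever did, submitting an empty batch of command buffers should be enough to flush pending queue writes—a minimal sketch, purely for illustration:

// Flush pending write_texture/write_buffer calls without recording any drawing commands.
// (Unnecessary here; our later queue.submit() for drawing will flush them anyway.)
queue.submit(std::iter::empty::<wgpu::CommandBuffer>());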

Pause for a second. That was a bunch of code, but where should it all go? This code requires that we have a device and a queue, but it's sort of separate from the code that sets up render pipelines. It might make sense to write ourselves a helper function to load a texture, given a device and queue and file name. Now start putting code in your file again:

// AsRef means we can take as parameters anything that cheaply converts into a Path,
// for example an &str.
fn load_texture(path:impl AsRef<std::path::Path>, label:Option<&str>,
                device:&wgpu::Device, queue:&wgpu::Queue
    ) -> Result<wgpu::Texture,image::ImageError> {
    // This ? operator will return the error if there is one, unwrapping the result otherwise.
    let img = image::open(path.as_ref())?.to_rgba8();
    let (width, height) = img.dimensions();
    let size = wgpu::Extent3d {
        width,
        height,
        depth_or_array_layers: 1,
    };
    let texture = device.create_texture(&wgpu::TextureDescriptor {
        label,
        size,
        mip_level_count: 1,
        sample_count: 1,
        dimension: wgpu::TextureDimension::D2,
        format: wgpu::TextureFormat::Rgba8UnormSrgb,
        usage: wgpu::TextureUsages::TEXTURE_BINDING | wgpu::TextureUsages::COPY_DST,
        view_formats: &[],
    });
    queue.write_texture(
        texture.as_image_copy(),
        &img,
        wgpu::ImageDataLayout {
            offset: 0,
            bytes_per_row: Some(4 * width),
            rows_per_image: Some(height),
        },
        size,
    );
    Ok(texture)
}

Now we can call load_texture("content/47.png", Some("47 image"), &device, &queue) from our run function, either just after creating the device and queue or just before kicking off the event loop:

let tex_47 = load_texture("content/47.png", Some("47 image"), &device, &queue)
    .expect("Couldn't load 47 img");
let view_47 = tex_47.create_view(&wgpu::TextureViewDescriptor::default());
let sampler_47 = device.create_sampler(&wgpu::SamplerDescriptor::default());
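
The default sampler is all we need for this lab, but the sampler is where decisions like filtering and wrapping get made. Just as an illustration (don't add this to your project), a customized descriptor might look something like the sketch below—the particular field values are hypothetical choices, not something our example requires:

// Hypothetical: repeat the texture outside the 0..1 coordinate range, and
// blend neighboring texels when the texture is magnified on screen.
let fancy_sampler = device.create_sampler(&wgpu::SamplerDescriptor {
    label: Some("example sampler"),
    address_mode_u: wgpu::AddressMode::Repeat,
    address_mode_v: wgpu::AddressMode::Repeat,
    mag_filter: wgpu::FilterMode::Linear,
    min_filter: wgpu::FilterMode::Nearest,
    ..Default::default()
});

The textureSample call we're about to write in the fragment shader consults these settings every time it reads a color from the texture.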

Using a Texture During Rendering

Now that we have our texture data on the GPU, we need to modify our rendering code to actually make use of it. This requires changes in three places:

  1. The render pipeline layout needs to be different: now it needs a "slot" for a texture and its sampler, in the form of a bind group layout.
  2. A bind group must be created with the texture and its sampler and bound within our render pass.
  3. Our shader programs need to make use of the texture and sampler.

Let's go to our shaders first, so that we know how the texture is to be used. We're going to define our vertices and texture coordinates directly in the shader code for this example, so that we don't have to introduce any more WGPU concepts than absolutely necessary. While we're here, we'll also render a quad instead of a triangle, so we need to change our rpass.draw call to use the range 0..6 instead of 0..3 (two triangles instead of one).

The notation for constant arrays of vertices is kind of horrible, but don't worry—this is just a placeholder for now.

// Where is the nth vertex in normalized device coordinates?
var<private> VERTICES:array<vec4<f32>,6> = array<vec4<f32>,6>(
    // In WGPU, the bottom left corner is -1,-1 and the top right is 1,1.
    vec4<f32>(-1., -1., 0., 1.),
    vec4<f32>(1., -1., 0., 1.),
    vec4<f32>(-1., 1., 0., 1.),
    vec4<f32>(-1., 1., 0., 1.),
    vec4<f32>(1., -1., 0., 1.),
    vec4<f32>(1., 1., 0., 1.)
);

// How does each vertex map onto the texture's corners?
var<private> TEX_COORDS:array<vec2<f32>,6> = array<vec2<f32>,6>(
    // Texture coordinates are a bit different---they go from 0,0 at the top left to 1,1 at the bottom right,
    // but if they are outside that bound they may clamp, or repeat the texture, or something else
    // depending on the sampler.
    vec2<f32>(0., 1.),
    vec2<f32>(1., 1.),
    vec2<f32>(0., 0.),
    vec2<f32>(0., 0.),
    vec2<f32>(1., 1.),
    vec2<f32>(1., 0.)
);

// Now we're outputting more than just a position,
// so we'll define a struct
struct VertexOutput {
    @builtin(position) clip_position: vec4<f32>,
    @location(0) tex_coords: vec2<f32>,
}

// vs_main now produces an instance of that struct...
@vertex
fn vs_main(@builtin(vertex_index) in_vertex_index: u32) -> VertexOutput {
    // We'll just look up the vertex data in those constant arrays
    return VertexOutput(
        VERTICES[in_vertex_index],
        TEX_COORDS[in_vertex_index]
    );
}

// Now our fragment shader needs two "global" inputs to be bound:
// A texture...
@group(0) @binding(0)
var t_diffuse: texture_2d<f32>;
// And a sampler.
@group(0) @binding(1)
var s_diffuse: sampler;
// Both are in the same binding group here since they go together naturally.

// Our fragment shader takes an interpolated `VertexOutput` as input now
@fragment
fn fs_main(in:VertexOutput) -> @location(0) vec4<f32> {
    // And we use the tex coords from the vertex output to sample from the texture.
    return textureSample(t_diffuse, s_diffuse, in.tex_coords);
}

So that's where we need to end up: we have to tell the render pass to use the texture we defined earlier along with a sampler during draw calls. For now, let's back up to our render pipeline layout—we need to set it up to take a single bind group, which means we need to define the shape of that group to match what the shader expects. This is similar to describing a type, a sort of contract between the application and the GPU.

let texture_bind_group_layout =
    device.create_bind_group_layout(&wgpu::BindGroupLayoutDescriptor {
        label: None,
        // This bind group's first entry is for the texture and the second is for the sampler.
        entries: &[
            // The texture binding
            wgpu::BindGroupLayoutEntry {
                // This matches the binding number in the shader
                binding: 0,
                // Only available in the fragment shader
                visibility: wgpu::ShaderStages::FRAGMENT,
                // It's a texture binding
                ty: wgpu::BindingType::Texture {
                    // We can use it with float samplers
                    sample_type: wgpu::TextureSampleType::Float { filterable: true },
                    // It's being used as a 2D texture
                    view_dimension: wgpu::TextureViewDimension::D2,
                    // This is not a multisampled texture
                    multisampled: false,
                },
                // This is not an array texture, so it has None for count
                count: None,
            },
            // The sampler binding
            wgpu::BindGroupLayoutEntry {
                // This matches the binding number in the shader
                binding: 1,
                // Only available in the fragment shader
                visibility: wgpu::ShaderStages::FRAGMENT,
                // It's a sampler
                ty: wgpu::BindingType::Sampler(wgpu::SamplerBindingType::Filtering),
                // No count
                count: None,
            },
        ],
    });
// Now we'll create our pipeline layout, specifying the shape of the execution environment (the bind group)
let pipeline_layout = device.create_pipeline_layout(&wgpu::PipelineLayoutDescriptor {
    label: None,
    bind_group_layouts: &[&texture_bind_group_layout],
    push_constant_ranges: &[],
});

The render pipeline creation is exactly the same, since it uses the pipeline layout we just defined. We still need to package up an execution environment—a bind group—with our texture view and sampler from before. We can do this anytime after defining the bind group layout; you can imagine that we're creating (on the GPU) an instance of a data structure.

let tex_47_bind_group = device.create_bind_group(&wgpu::BindGroupDescriptor {
    label: None,
    layout: &texture_bind_group_layout,
    entries: &[
        // One for the texture, one for the sampler
        wgpu::BindGroupEntry {
            binding: 0,
            resource: wgpu::BindingResource::TextureView(&view_47),
        },
        wgpu::BindGroupEntry {
            binding: 1,
            resource: wgpu::BindingResource::Sampler(&sampler_47),
        },
    ],
});

The last piece of code we need to change is just after we set our render pipeline on our render pass:

rpass.set_pipeline(&render_pipeline);
// Attach the bind group for group 0
rpass.set_bind_group(0, &tex_47_bind_group, &[]);
// Now draw two triangles!
rpass.draw(0..6, 0..1);

Wrapping Up

Phew.

A few hundred lines of code later, we can draw a rectangle on the screen with a picture mapped onto it. At this point, we could proceed in a variety of ways to make moving objects appear on the screen:

  1. Create a vertex buffer to tell the GPU what triangles to draw, rather than hard-coding them in the shader; we could update that buffer every frame.
  2. Leave the vertices hard coded, but modify the image data and call write_texture every frame to update what's being shown on the screen.
  3. Create a storage buffer to tell the GPU where to draw sprites, and have the shader figure out the vertices; we could update that buffer every frame.

Each of these has trade-offs and benefits. We'll dig into option (1) when we get into the 3D segment of our course. Next time we'll explore (2) and (3) as opposites in spirit: whereas (2) leaves all the drawing up to the CPU, (3) puts it all in the GPU's hands; while (2) is general enough to implement a 3D renderer purely in software, (3) is specialized to the use case of drawing quadrilateral sprites.

For now, go back over the code above and figure out where to put it into your old triangle code. Once you have the 47 example running, show it to me or a TA along with your lab report.