Sunday, May 18, 2014

[HLSL] Turning float4's into a float4x4

In one of my vertex shaders I need to turn a couple float4's into a float4x4. Specifically, I'm building a world matrix. (For those that are curious, it's instance data. The design I'm using is very similar to the one put forth by DICE on slide 29)

If the float4's are rows, then building a float4x4 is really easy:

float4x4 CreateMatrixFromRows(float4 r0, float4 r1, float4 r2, float4 r3) {
    return float4x4(r0, r1, r2, r3);
}

float4x4 has a constructor that takes vectors as arguments. However, as you can see above, it assumes these are rows. I find this a bit odd since, by default, HLSL stores matrices as column-major. Therefore, behind the scenes, it will have to do lots of swizzle copy-swaps.

If the float4's are columns, building the float4x4 becomes a bit more icky for our viewing, since we have to manually pick off each element and send it to the full float4x4 constructor. However, I suspect behind the scenes the compiler will know better.

float4x4 CreateMatrixFromCols(float4 c0, float4 c1, float4 c2, float4 c3) {
    return float4x4(c0.x, c1.x, c2.x, c3.x,
                    c0.y, c1.y, c2.y, c3.y,
                    c0.z, c1.z, c2.z, c3.z,
                    c0.w, c1.w, c2.w, c3.w);
}
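
To make the relationship between the two helpers concrete, here is a small CPU-side C++ sketch. It is not HLSL: `Float4`, `Float4x4`, `FromRows`, and `FromCols` are hypothetical stand-ins for the shader types and functions above. It only illustrates that building from columns gives the transpose of building from rows.

```cpp
#include <array>

// Hypothetical CPU-side stand-ins for HLSL's float4 and float4x4.
using Float4 = std::array<float, 4>;

struct Float4x4 {
    float m[4][4];  // m[row][col], like M[row][col] in HLSL
};

// Mirrors CreateMatrixFromRows: the i-th vector becomes row i.
Float4x4 FromRows(const Float4 &r0, const Float4 &r1, const Float4 &r2, const Float4 &r3) {
    const Float4 *rows[4] = {&r0, &r1, &r2, &r3};
    Float4x4 out{};
    for (int r = 0; r < 4; ++r) {
        for (int c = 0; c < 4; ++c) {
            out.m[r][c] = (*rows[r])[c];
        }
    }
    return out;
}

// Mirrors CreateMatrixFromCols: the j-th vector becomes column j.
Float4x4 FromCols(const Float4 &c0, const Float4 &c1, const Float4 &c2, const Float4 &c3) {
    const Float4 *cols[4] = {&c0, &c1, &c2, &c3};
    Float4x4 out{};
    for (int r = 0; r < 4; ++r) {
        for (int c = 0; c < 4; ++c) {
            out.m[r][c] = (*cols[c])[r];
        }
    }
    return out;
}
```

Feeding the same four vectors to both functions gives two matrices where `FromRows(...).m[r][c]` equals `FromCols(...).m[c][r]`.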


I wasn't happy with just guessing what the compiler would and wouldn't do, so I created a simple HLSL vertex shader to see how many OPs each function produced. hlsl_util.hlsli contains the two functions defined above. (Yes, I know the position isn't being transformed to clip space. It's just a trivial shader.)

#include "hlsl_util.hlsli"

cbuffer cbPerObject : register(b1) {
    uint gStartVector;
    uint gNumVectorsPerInstance;
};

StructuredBuffer<float4> gInstanceBuffer : register(t0);


float4 main(float3 pos : POSITION, uint instanceId : SV_INSTANCEID) : SV_POSITION {
    uint worldMatrixOffset = instanceId * gNumVectorsPerInstance + gStartVector;

    float4 c0 = gInstanceBuffer[worldMatrixOffset];
    float4 c1 = gInstanceBuffer[worldMatrixOffset + 1];
    float4 c2 = gInstanceBuffer[worldMatrixOffset + 2];
    float4 c3 = gInstanceBuffer[worldMatrixOffset + 3];

    float4x4 instanceWorldCol = CreateMatrixFromCols(c0, c1, c2, c3);
    //float4x4 instanceWorldRow = CreateMatrixFromRows(c0, c1, c2, c3);

    return mul(float4(pos, 1.0f), instanceWorldCol);    
}

I compiled the shader as normal and then used the following command to disassemble the compiled byte code:

fxc.exe /dumpbin /Fc <outputfile.txt> <compiledshader.cso> 


HUGE DISCLAIMER: This is the intermediate asm that fxc creates. The final number/form of OPs will depend on the final compile done by the graphics driver. However, I feel the intermediate asm will generally be close-ish to what is finally produced, and therefore, can be used as a rough gauge.


Here is the asm code for creating the matrix from columns. I'll include the register signature for this one. The other asm code samples use the same register signature.

// Resource Bindings:
//
// Name                                 Type  Format         Dim Slot Elements
// ------------------------------ ---------- ------- ----------- ---- --------
// gInstanceBuffer                   texture  struct         r/o    0        1
// cbPerObject                       cbuffer      NA          NA    1        1
//
//
// Input signature:
//
// Name                 Index   Mask Register SysValue  Format   Used
// -------------------- ----- ------ -------- -------- ------- ------
// POSITION                 0   xyz         0     NONE   float   xyz 
// SV_INSTANCEID            0   x           1   INSTID    uint   x   
//
//
// Output signature:
//
// Name                 Index   Mask Register SysValue  Format   Used
// -------------------- ----- ------ -------- -------- ------- ------
// SV_POSITION              0   xyzw        0      POS   float   xyzw
//

imad r0.x, v1.x, cb1[8].y, cb1[8].x
ld_structured_indexable(structured_buffer, stride=16)(mixed,mixed,mixed,mixed) r1.xyzw, r0.x, l(0), t0.xyzw
iadd r0.xyz, r0.xxxx, l(1, 2, 3, 0)
mov r2.xyz, v0.xyzx
mov r2.w, l(1.000000)
dp4 o0.x, r2.xyzw, r1.xyzw
ld_structured_indexable(structured_buffer, stride=16)(mixed,mixed,mixed,mixed) r1.xyzw, r0.x, l(0), t0.xyzw
dp4 o0.y, r2.xyzw, r1.xyzw
ld_structured_indexable(structured_buffer, stride=16)(mixed,mixed,mixed,mixed) r1.xyzw, r0.y, l(0), t0.xyzw
ld_structured_indexable(structured_buffer, stride=16)(mixed,mixed,mixed,mixed) r0.xyzw, r0.z, l(0), t0.xyzw
dp4 o0.w, r2.xyzw, r0.xyzw
dp4 o0.z, r2.xyzw, r1.xyzw
ret 
// Approximately 13 instruction slots used

Creating the matrix from columns was actually very clean. The compiler knew what we wanted and completely got rid of all the swizzles, and rather just directly copied each column and did a dot product to get the final position.

Here is the asm for creating the matrix from rows:

imad r0.x, v1.x, cb1[8].y, cb1[8].x
ld_structured_indexable(structured_buffer, stride=16)(mixed,mixed,mixed,mixed) r1.xyzw, r0.x, l(0), t0.xyzw
iadd r0.xyz, r0.xxxx, l(1, 2, 3, 0)
mov r2.x, r1.x
ld_structured_indexable(structured_buffer, stride=16)(mixed,mixed,mixed,mixed) r3.xyzw, r0.x, l(0), t0.xzyw
mov r2.y, r3.x
ld_structured_indexable(structured_buffer, stride=16)(mixed,mixed,mixed,mixed) r4.xyzw, r0.y, l(0), t0.xywz
ld_structured_indexable(structured_buffer, stride=16)(mixed,mixed,mixed,mixed) r0.xyzw, r0.z, l(0), t0.xyzw
mov r2.z, r4.x
mov r2.w, r0.x
mov r5.xyz, v0.xyzx
mov r5.w, l(1.000000)
dp4 o0.x, r5.xyzw, r2.xyzw
mov r2.y, r3.z
mov r2.z, r4.y
mov r2.w, r0.y
mov r2.x, r1.y
dp4 o0.y, r5.xyzw, r2.xyzw
mov r4.y, r3.w
mov r3.z, r4.w
mov r3.w, r0.z
mov r4.w, r0.w
mov r3.x, r1.z
mov r4.x, r1.w
dp4 o0.w, r5.xyzw, r4.xyzw
dp4 o0.z, r5.xyzw, r3.xyzw
ret 
// Approximately 27 instruction slots used

Wow! Look at all those mov OPs. So even though the HLSL constructor expects rows, giving it rows leads to a huge number of mov's, because the matrix is represented column-major and each column has to be gathered one element at a time.
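
One way to read the two listings: `mul(v, M)` with the vector on the left treats `v` as a row vector, so each output component is the dot product of `v` with one column of `M`. The C++ sketch below is a CPU-side illustration only, and `Float4`, `Dot`, `MulWithColumns`, and `MulWithRows` are hypothetical names. It shows why having the columns in hand maps to four plain dot products, while having only the rows means gathering each column first.

```cpp
#include <array>

// Hypothetical CPU-side stand-in for HLSL's float4.
using Float4 = std::array<float, 4>;

float Dot(const Float4 &a, const Float4 &b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

// Row-vector times matrix when the four columns are already available:
// each output component is a single dot product (one dp4 in the listing).
Float4 MulWithColumns(const Float4 &v, const std::array<Float4, 4> &cols) {
    return {Dot(v, cols[0]), Dot(v, cols[1]), Dot(v, cols[2]), Dot(v, cols[3])};
}

// The same product when only the four rows are available: each column has
// to be assembled component by component before the dot product can run
// (roughly the extra movs in the listing).
Float4 MulWithRows(const Float4 &v, const std::array<Float4, 4> &rows) {
    Float4 out{};
    for (int j = 0; j < 4; ++j) {
        Float4 column = {rows[0][j], rows[1][j], rows[2][j], rows[3][j]};
        out[j] = Dot(v, column);
    }
    return out;
}
```

Both functions return the same result for the same matrix. The only difference is how much shuffling happens before the dot products.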

I also tried manually specifying the swizzles to see if that would help:

float4x4 CreateMatrixFromRows(float4 r0, float4 r1, float4 r2, float4 r3) {
    return float4x4(r0.x, r0.y, r0.z, r0.w,
                    r1.x, r1.y, r1.z, r1.w,
                    r2.x, r2.y, r2.z, r2.w,
                    r3.x, r3.y, r3.z, r3.w);
}

However, the asm generated was identical to the constructor with 4 row vectors.

So I guess the lesson learned here today is to always try to construct matrices in the way that they are stored. I want to mention that you can tell the compiler to use Row-major matrices, but Col-major matrices are generally favored because it can simplify some matrix math.

My question to you all is: Are there better ways to do either of these two operations? As always, feel free to comment, ask questions, correct any errors I may have made, etc.

Happy coding
-RichieSams

Tuesday, May 6, 2014

How do you implement Geometry Instancing?

So at some point in the graphics pipeline, you have a list of models that need to be rendered. Normal scenario:
  1. Set the CBuffer variables
  2. Call DrawIndexed

Simple. Ok next scenario is if the models are instanced. Yes I can use DrawIndexedInstanced, but my question is: What's the best way to send the instance data to the GPU?

So far, I can think of 3 ways: 

Option 1 - Storing and using one Instance Buffer per model

The render loop would then be something like this:

for (model in scene) {
    if (model.hasInstances()) {
        if (model.isDynamic()) {
            model.UpdateInstanceBuffer(....);
        }
        model.DrawIndexedInstanced(....);
    } else {
        model.DrawIndexed(...);
    }
}

Option 2 - Using a single Instance buffer for the entire scene and updating it for each draw call

The render loop would then be something like this:

for (model in scene) {
    if (model.hasInstances()) {
        UpdateInstanceBuffer(&model.InstanceData, ....);

        model.DrawIndexedInstanced(....);
    } else {
        model.DrawIndexed(...);
    }
}

Option 3 - Caching instances into a buffer for an entire batch (or if memory requirements aren't a problem, the whole frame). Directly inspired by Battlefield 3 - slide 30.

The render loop would then be something like this:

std::vector<float4> instanceData;
std::vector<uint> offsets;

for (instancedModel in scene) {
    // Remember where this model's data starts in the shared buffer
    offsets.push_back(instanceData.size());
    for (float4 data in instancedModel.InstanceData) {
        instanceData.push_back(data);
    }
}

BindInstanceBufferAsVSResource(&instanceData);

for (uint i = 0; i < scene.size(); ++i) {
    // Tell the vertex shader where this model's instance data begins
    UpdateVSCBuffer(offsets[i], ....);

    scene[i].DrawIndexedInstanced(....);
}
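
For the CPU side of Option 3, here is a rough C++ sketch of just the packing step, with no D3D calls. `Float4`, `InstancedModel`, `DrawRange`, and `PackInstanceData` are all hypothetical names, and I'm assuming every model stores its instance data as a flat array of float4s with a non-zero number of vectors per instance.

```cpp
#include <cstdint>
#include <vector>

struct Float4 {
    float x, y, z, w;
};

// Hypothetical CPU-side model: only the per-instance float4 data matters here.
struct InstancedModel {
    std::vector<Float4> instanceData;  // all instances, stored back to back
    uint32_t vectorsPerInstance;       // e.g. 4 for a world matrix
};

// What each draw call needs to know about its slice of the shared buffer.
struct DrawRange {
    uint32_t startVector;         // index of the model's first float4
    uint32_t vectorsPerInstance;  // stride between instances, in float4s
    uint32_t instanceCount;
};

// Appends every model's instance data to `packed` and records where each
// model's slice begins, so the buffer can be uploaded once per batch.
std::vector<DrawRange> PackInstanceData(const std::vector<InstancedModel> &models,
                                        std::vector<Float4> &packed) {
    std::vector<DrawRange> ranges;
    ranges.reserve(models.size());

    for (const InstancedModel &model : models) {
        DrawRange range;
        range.startVector = static_cast<uint32_t>(packed.size());
        range.vectorsPerInstance = model.vectorsPerInstance;
        // Guard against a zero stride rather than dividing by it
        range.instanceCount = model.vectorsPerInstance == 0
            ? 0
            : static_cast<uint32_t>(model.instanceData.size() / model.vectorsPerInstance);
        ranges.push_back(range);

        packed.insert(packed.end(), model.instanceData.begin(), model.instanceData.end());
    }
    return ranges;
}
```

Each `DrawRange` would then feed the per-draw constant buffer update and the instance count passed to the instanced draw call.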

Pros and Cons: 

Option 1 - Individual InstanceBuffers per model
Pros:
  1. Static instancing is all cached, ie. you only have to map/update/unmap a buffer once.
Cons:
  1. A ton of instance buffers. I may be over thinking things, but this seems like a lot of memory. Especially since all buffers are static size. So you either have to define exactly how many instances of an object can exist, or include some extra memory for wiggle room.
Option 2 - Single InstanceBuffer for all models
Pros:
  1. Only one instance buffer. Potentially a much smaller memory footprint than Option 1. However, we need it to be as large as our largest number of instances.
Cons:
  1. Requires a map/update/unmap for every model that needs to be instanced. I have no idea if this is expensive or not.
Option 3 - CBuffer array with all the instances for a frame/batch
Pros:
  1. Much less map/update/unmap than Option 2
  2. Can support multiple types of instance buffers (as long as they are multiples of float4)
Cons:
  1. Static instances still need to be updated every frame.
  2. Indexes out of a cbuffer. (Can cause memory contention)

So those are my thoughts. What are your thoughts? Would you choose any of the three options? Or is there a better option? Let me know in the comments below or on Twitter or with a Pastebin/Gist.

Happy coding
-RichieSams

Monday, May 5, 2014

Loading more interesting scenes - Part 2: The Halfling Model File Format

        Well, it's been quite a long time since my last post. School is in the last week and I've been quite busy, but you don't want to hear about that. You're here to see what I've been working on.

        In my last post I finished by showing how I loaded obj models directly into the engine. I also complained that it was taking a horrendously long time to load (especially for Debug builds). I looked around for faster ways to load obj's, but there really weren't any... (sort of*) Why aren't there any obj loader libraries?
*There is assimp, but I'll get to that further down

        One answer would be that OBJs weren't designed for run-time model loading. Computers don't like parsing text. They would rather read binary; things have set sizes and can be read in chunks rather than single characters at a time. So next I looked around for a binary file format that would be faster to load. "Why re-invent the wheel?" I thought.

        The problem is that standardized run-time binary file formats don't really exist either. This is when it really dawned on me. For run-time, there's no point in storing things your engine doesn't need. And more than that, it would be great if the data you store is in the correct format for your engine. For example, you could store the raw vertex buffer data so you can directly cast it into DirectX vertex buffer data. Obviously, it would be extremely hard to get people to agree upon a set standard of what is "necessary", so it's common practice to have a specific binary file format for the engine that is specifically tailored to make loading the data as fast and easy as possible.

        Therefore, I set out to make my own binary file format, which, to stay with the Halfling theme, I dubbed the 'Halfling Model File'. Every indent represents a member variable of the level above it. 'String data' and 'Subset data' are arrays. (The format of the blog template makes the following table a bit hard to read. There is an ASCII version of the table here if that's easier to read)

Item                              Type               Required  Description
--------------------------------  -----------------  --------  -----------
File Id                           '\0FMH'            T         Little-endian "HMF\0"
File format version               byte               T         Version of the HMF format that this file uses
Flags                             uint64             T         Bitwise-OR of flags used in the file. See the flags below
String Table                                         F
    Num strings                   uint32             T         The number of strings in the table
    String data                                      T
        String length             uint16             T         Length of the string
        String                    char[]             T         The string characters. DOES NOT HAVE A NULL TERMINATION
Num Vertices                      uint32             T         The number of vertices in the file
Num Indices                       uint32             T         The number of indices in the file
NumVertexElements                 uint16             T         The number of elements in the vertex description
Vertex Buffer Desc                D3D11_BUFFER_DESC  T         A hard cast of the vertex buffer description
Index Buffer Desc                 D3D11_BUFFER_DESC  T         A hard cast of the index buffer description
Instance Buffer Desc              D3D11_BUFFER_DESC  F         A hard cast of the instance buffer description
Vertex data                       void[]             T         Will be read in a single block using VertexBufferDesc.ByteWidth
Index data                        void[]             T         Will be read in a single block using IndexBufferDesc.ByteWidth
Instance buffer data              void[]             F         Will be read in a single block using InstanceBufferDesc.ByteWidth
Num Subsets                       uint32             T         The number of subsets in the file
Subset data                       Subset[]           T         Will read in a single block to a Subset[]
    Vertex Start                  uint64             T         The index to the first vertex used by the subset
    Vertex Count                  uint64             T         The number of vertices used by the subset (All used vertices must be in the range VertexStart + VertexCount)
    Index Start                   uint64             T         The index to the first index used by the subset
    Index Count                   uint64             T         The number of indices used by the subset (All used indices must be in the range IndexStart + IndexCount)
    Material Ambient Color        float[3]           T         The RGB ambient color values of the material
    Material Specular Intensity   float              T         The Specular Intensity
    Material Diffuse Color        float[4]           T         The RGBA diffuse color values of the material
    Material Specular Color       float[3]           T         The RGB specular color values of the material
    Material Specular Power       float              T         The Specular Power
    Diffuse Color Map Filename    int32              T         An index to the string table. -1 if it doesn't exist.
    Specular Color Map Filename   int32              T         An index to the string table. -1 if it doesn't exist.
    Specular Power Map Filename   int32              T         An index to the string table. -1 if it doesn't exist.
    Alpha Map Filename            int32              T         An index to the string table. -1 if it doesn't exist.
    Bump Map Filename             int32              T         An index to the string table. -1 if it doesn't exist. Mutually exclusive with Normal Map
    Normal Map Filename           int32              T         An index to the string table. -1 if it doesn't exist. Mutually exclusive with Bump Map


        I designed the file format to make it as easy as possible to cast large chunks of memory directly from hard disk to arrays or usable engine structures. For example, the subset data is read in one giant chunk and cast directly to an array.

        There's only one problem: Binary is not really human-readable. It would be extremely arduous to create an HMF file manually, so I created a tool to automate the task. While my hand-written obj-parser fulfilled its purpose, it was pretty bare-bones and made quite a few assumptions. Rather than spend the time to beef it up to what was necessary, I leveraged the wonderful tool ASSIMP. ASSIMP is a C++ library for loading arbitrary model file formats into a standard internal representation. It also has a number of algorithms for optimizing the model data, for example, calculating normals, triangulating meshes, or removing duplicate vertices. Therefore, I use ASSIMP to load and optimize the model, then I output ASSIMP's mesh data to the HMF format. The source code is a bit too long to directly post here, so instead I'll link you to it on GitHub. I'll also point you to a pre-compiled binary of the tool here.

        As I was writing the code for the tool, it became apparent that I needed a way for the user to tell the tool certain parameters about the model. For example, what textures do you want to use? I could have passed these in with command line arguments, but that's not very readable. Therefore, I put all the possible arguments into an ini file and then have the user pass the path to the ini file in as a command line arg. Below is the ini file for the sponza.obj model:

[Post-Processing]
; If normals already exist, setting GenNormals to true will do nothing
GenNormals = true
; If tangents already exist, setting CalcTangents to true will do nothing
CalcTangents = true

; The booleans represent a high level override for these material properties.
; If the boolean is false, the property will be set to NULL, even if the property 
; exists within the input model file
; If the boolean is true, but the value doesn't exist within the input model file, 
; the property will be set to NULL
[MaterialPropertyOverrides]
AmbientColor = true
DiffuseColor = true
SpecColor = true
Opacity = true
SpecPower = true
SpecIntensity = true

; The booleans represent a high level override for these textures.
; If the boolean is false, the texture will be excluded, even if the texture
; exists within the input model file
; If the boolean is true, but the texture doesn't exist within the input model
; file properties, the texture will still be excluded
[TextureOverrides]
DiffuseColorMap = true
NormalMap = true
DisplacementMap = true
AlphaMap = true
SpecColorMap = true
SpecPowerMap = true

; Usages can be 'default', 'immutable', 'dynamic', or 'staging'
; In the case of a mis-spelling, immutable is assumed
[BufferDesc]
VertexBufferUsage = immutable
IndexBufferUsage = immutable

; TextureMapRedirects allow you to interpret certain textures as other kinds
; For example, OBJ doesn't directly support normal maps. Often, you will then see
; the normal map in the height (bump) map slot. These options allow you to specify
; what texture goes where.
;
; Any Maps that are excluded are treated as mapping to their own kind
; IE. excluding DiffuseColorMap is interpreted as:
;       DiffuseColorMap = diffuse
;
; The available kinds are: 'diffuse', 'normal', 'height', 'displacement', 'alpha', 
; 'specColor', and 'specPower'
[TextureMapRedirects]
DiffuseColorMap = diffuse
NormalMap = height
DisplacementMap = displacement
AlphaMap = alpha
SpecColorMap = specColor
SpecPowerMap = specPower
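
The "in the case of a mis-spelling, immutable is assumed" rule from the [BufferDesc] section could be handled along these lines. This is only a sketch: `BufferUsage` is a hypothetical stand-in for D3D11_USAGE so the snippet builds without the D3D headers, `ParseBufferUsage` is a made-up name, and whether the real tool matches case-sensitively is an assumption on my part.

```cpp
#include <string>

// Hypothetical stand-in for D3D11_USAGE, so the sketch has no D3D dependency.
enum class BufferUsage { Default, Immutable, Dynamic, Staging };

// Maps an ini value to a usage. Anything that isn't recognized, including a
// mis-spelling, falls back to Immutable. Matching is case-sensitive here.
BufferUsage ParseBufferUsage(const std::string &value) {
    if (value == "default") {
        return BufferUsage::Default;
    }
    if (value == "dynamic") {
        return BufferUsage::Dynamic;
    }
    if (value == "staging") {
        return BufferUsage::Staging;
    }
    // "immutable" and every unrecognized value end up here
    return BufferUsage::Immutable;
}
```

Falling back to a fixed value keeps the tool from failing on a typo, at the cost of silently ignoring what the user meant.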

        So with that we now have a fully functioning binary file format! And more than that, with a few changes in the engine code, we can load the scene cold in less than 2 seconds! (It's almost instant if your file cache is still hot). (Pre-compiled binaries here).

Well that's it for now. As always, feel free to ask questions and comment.

Happy coding
-RichieSams