
Multithreaded Rendering


Josh

After working out a thread manager class that stores a stack of C++ command buffers, I've got a pretty nice proof of concept working. I can call functions in the game thread, and the appropriate actions are pushed onto a command buffer that is then passed to the rendering thread when World::Render is called. The rendering thread is where all of the (currently OpenGL) rendering code is executed. When you create a context or load a shader, all the game thread does is create the appropriate structure and send a request over to the rendering thread to finish the job.
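
The engine's actual thread manager class isn't shown here, but the basic idea can be sketched in a few lines. In this minimal, hypothetical version (not the engine's real classes), the game thread pushes std::function commands into a buffer, and the rendering thread later swaps the buffer out and executes everything itself:

#include <functional>
#include <mutex>
#include <vector>

//Minimal sketch only: the game thread records commands, and the rendering
//thread swaps the pending batch out and executes it in one go.
class CommandBuffer
{
public:
	//Called from the game thread: queue up work for the renderer
	void Push(std::function<void()> command)
	{
		std::lock_guard<std::mutex> lock(mutex);
		commands.push_back(std::move(command));
	}

	//Called from the rendering thread: take ownership of the pending
	//commands and run them (the OpenGL calls) on this thread only
	void Execute()
	{
		std::vector<std::function<void()>> batch;
		{
			std::lock_guard<std::mutex> lock(mutex);
			batch.swap(commands);
		}
		for (auto& command : batch) command();
	}

private:
	std::mutex mutex;
	std::vector<std::function<void()>> commands;
};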


Consequently, there is currently no way of detecting whether OpenGL initialization fails, and in fact the game will still run along just fine without any graphics rendering! We obviously need a mechanism to detect this, but it is interesting that you can now load a map and run your game without ever creating a window or graphics context. The following code is perfectly legitimate in Leadwerks 5:

#include "Leadwerks.h"

using namespace Leadwerks;

int main(int argc, const char *argv[])
{
	auto world = CreateWorld();
	auto map = LoadMap(world,"Maps/start.map");

	while (true)
	{
		world->Update();
	}
	return 0;
}

The rendering thread is able to run at its own frame rate, independently from the game thread, and I have tested under some pretty extreme circumstances to make sure the threads never lock up. By default, I think the game loop will probably self-limit its speed to a maximum of 30 updates per second, giving you a whopping 33 milliseconds for your game logic. This frequency can be changed to any value, or removed entirely by setting it to zero (not recommended, as this can easily lock up the rendering thread with an infinite command buffer stack!). No matter the game frequency, the rendering thread runs at its own speed, which is either limited by the window refresh rate or an internal clock, or it can be left free to run as fast as possible for those triple-digit frame rates.
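
I haven't settled on the exact API for setting the game frequency, but the self-limiting loop itself is just a standard fixed-timestep pattern. A rough sketch (with made-up names, not the engine's implementation) looks like this:

#include <chrono>
#include <thread>

int main(int argc, const char *argv[])
{
	using clock = std::chrono::steady_clock;

	int frequency = 30; //updates per second; 0 would mean "unlimited" (not recommended)
	auto next = clock::now();

	while (true)
	{
		//world->Update() would go here; at 30 Hz the game logic gets ~33 milliseconds per tick

		if (frequency > 0)
		{
			next += std::chrono::duration_cast<clock::duration>(std::chrono::duration<double>(1.0 / frequency));
			std::this_thread::sleep_until(next);
		}
	}
	return 0;
}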

Shaders are now loaded from multiple files instead of being packed into a single .shader file. When you load a shader, the file extension will be stripped off (if it is present) and the engine will look for .vert, .frag, .geom, .eval, and .ctrl files for the different shader stages:

auto shader = LoadShader("Shaders/Model/diffuse");
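
I'm only guessing at the lookup logic here, but stripping the extension and probing for the per-stage files could be as simple as this hypothetical helper (not actual engine code):

#include <filesystem>
#include <string>
#include <vector>

//Hypothetical sketch: strip any extension from the supplied path, then
//check which of the stage files actually exist on disk.
std::vector<std::filesystem::path> FindShaderStageFiles(const std::string& path)
{
	std::filesystem::path base(path);
	base.replace_extension(); //"Shaders/Model/diffuse.shader" -> "Shaders/Model/diffuse"

	std::vector<std::filesystem::path> found;
	for (const char* ext : { ".vert", ".frag", ".geom", ".eval", ".ctrl" })
	{
		auto stage = base;
		stage.replace_extension(ext);
		if (std::filesystem::exists(stage)) found.push_back(stage);
	}
	return found;
}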

The asynchronous shader compiling in the engine could make our shader editor a little more difficult to handle, except that I don't plan on building a shader editor into the new editor at all! Instead I plan to rely on Visual Studio Code as the official IDE, and maybe add a plugin that tests whether shaders compile and link on your current hardware. I found that a pragma statement can be used to indicate include files (not implemented yet) and it won't trigger any errors in the VS Code IntelliSense.

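Since the include handling isn't implemented yet, the following is purely speculative, but a simple preprocessing pass in the shader loader could expand those pragma lines before the source ever reaches the compiler (the pragma syntax and the helper below are assumptions, not engine code):

#include <fstream>
#include <sstream>
#include <string>

//Speculative sketch: replace lines of the form
//    #pragma include "somefile.glsl"
//with the contents of that file before compiling the shader source.
std::string PreprocessShaderSource(const std::string& source)
{
	std::istringstream input(source);
	std::ostringstream output;
	std::string line;
	const std::string tag = "#pragma include";

	while (std::getline(input, line))
	{
		auto pos = line.find(tag);
		if (pos != std::string::npos)
		{
			//Pull out the quoted filename and paste that file's contents in
			auto first = line.find('"', pos);
			auto last = line.rfind('"');
			if (first != std::string::npos && last > first)
			{
				std::ifstream file(line.substr(first + 1, last - first - 1));
				if (file)
				{
					output << file.rdbuf() << '\n';
					continue;
				}
			}
		}
		output << line << '\n';
	}
	return output.str();
}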

Although restructuring the engine to work in this manner is a big task, I am making good progress. Smart pointers make this system really easy to work with. When the owning object in the game thread goes out of scope, its associated rendering object is also collected...unless it is still stored in a command buffer or otherwise in use! The relationships I have worked out function perfectly, and I have not run into any problems deciding what the ownership hierarchy should be. For example, a context has a shared pointer to the window it belongs to, but the window only has a weak pointer to the context. If the context handle is lost, the context is deleted, but if the window handle is lost, the context's shared pointer prevents the window from being deleted. The capabilities of modern C++ and modern hardware are making this development process a dream come true.
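
As a rough illustration of that relationship (simplified, hypothetical classes rather than the engine's real ones):

#include <memory>

class Context;

//The window only observes the context; it does not keep it alive.
class Window
{
public:
	std::weak_ptr<Context> context;
};

//The context keeps its window alive with a shared pointer.
class Context
{
public:
	explicit Context(std::shared_ptr<Window> window) : window(window) {}
	std::shared_ptr<Window> window;
};

int main()
{
	auto window = std::make_shared<Window>();
	auto context = std::make_shared<Context>(window);
	window->context = context;

	//Dropping our window handle does not destroy the window, because the
	//context still holds a shared pointer to it...
	window.reset();

	//...but dropping the context handle destroys the context, and with it
	//the last shared pointer to the window.
	context.reset();
	return 0;
}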

Of course, with forward rendering I am getting about 2000 FPS with a blank screen on Intel graphics, but the real test will be to see what happens when we start adding lots of lights to the scene. The only reason it might be possible to write a good forward renderer now is that graphics hardware has gotten a lot more flexible. Using a variable-length for loop, or using the results of one texture lookup as the coordinates of another, were big no-nos when we first switched to deferred rendering, but it looks like that situation has improved.

The increased restrictions on the renderer and the total separation of internal and user-exposed classes are actually making it a lot easier to write efficient code. Here is my code for the index array buffer object (OpenGLIndiceArray) that lives in the rendering thread:

#include "../../Leadwerks.h"

namespace Leadwerks
{
	OpenGLIndiceArray::OpenGLIndiceArray() :
		buffer(0),
		buffersize(0),
		lockmode(GL_STATIC_DRAW)
	{}

	OpenGLIndiceArray::~OpenGLIndiceArray() {
		if (buffer != 0) {
#ifdef DEBUG
			Assert(glIsBuffer(buffer),"Invalid indice buffer.");
#endif
			glDeleteBuffers(1, &buffer);
			buffer = 0;
		}
	}

	bool OpenGLIndiceArray::Modify(shared_ptr<Bank> data) {

		//Error checks
		if (data == nullptr) return false;
		if (data->GetSize() == 0) return false;

		//Generate buffer
		if (buffer == 0) glGenBuffers(1, &buffer);
		if (buffer == 0) return false; //shouldn't ever happen

		//Bind buffer
		glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, buffer);

		//Set data
		if (buffersize == data->GetSize() and lockmode == GL_DYNAMIC_DRAW) {
			glBufferSubData(GL_ELEMENT_ARRAY_BUFFER, 0, data->GetSize(), data->buf);
		}
		else {
			//If a buffer of the same size is modified a second time, promote it to GL_DYNAMIC_DRAW
			if (buffersize == data->GetSize()) lockmode = GL_DYNAMIC_DRAW;
			glBufferData(GL_ELEMENT_ARRAY_BUFFER, data->GetSize(), data->buf, lockmode);
		}

		buffersize = data->GetSize();
		return true;
	}
	
	bool OpenGLIndiceArray::Enable() {
		if (buffer == 0) return false;
		if (buffersize == 0) return false;
		glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, buffer);
		return true;
	}

	void OpenGLIndiceArray::Disable() {
		glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 0);
	}
}

From everything I have seen, my gut feeling tells me that the new engine is going to be ridiculously fast.

If you would like to be notified when Leadwerks 5 becomes available, be sure to sign up for the mailing list here.


16 Comments



The increased isolation and simplification of the OpenGL code also means it is now much, much easier to write a custom renderer. It would be pretty simple to create an OpenGL 1 or a DirectX renderer for the engine...or Vulkan support can be added without much trouble.


Would it be possible to simulate the world physics without OpenGL? Would be nice for multiplayer games that need to run on a server that doesn't have a gfx card.


Very cool, but this is still more rendering separately on a thread than multi-threaded rendering. No matter how you cut it, in GL the heavy work can't be spread across multiple threads, so your GPU is always bored waiting for the under-used CPU to send it work, although in GL this is as good as it's going to get, which is good enough. Still like your MoltenVK idea the best.

Either way, it is neat to be able to control the frame rate of physics and game logic separately from rendering.

3 hours ago, Crazycarpet said:

Very cool, but this is still more rendering separately on a thread than multi-threaded rendering. No matter how you cut it, in GL the heavy work can't be spread across multiple threads, so your GPU is always bored waiting for the under-used CPU to send it work, although in GL this is as good as it's going to get, which is good enough. Still like your MoltenVK idea the best.

Either way, it is neat to be able to control the frame rate of physics and game logic separately from rendering.

All the multithreaded graphics APIs actually just accumulate commands in a buffer and then execute them in a single thread. I plan to separate the culling and rendering threads so that there is absolutely no overhead in the rendering thread. The rendering thread may draw the same list of visible objects several times before the culling thread feeds it a new visibility list. This will involve a fair amount of latency in the system, but the VR head and controller orientations will be read each frame at the last possible moment, so the things you would actually notice latency on won't have any. I think it's going to be insanely fast.
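
Just to illustrate the idea (a sketch only, not the actual implementation): the culling thread publishes a new visibility list whenever it finishes a pass, and the rendering thread keeps drawing whatever the most recent list is:

#include <memory>
#include <mutex>
#include <vector>

struct VisibilityList
{
	std::vector<int> visibleObjects; //stand-in for visible surfaces/entities
};

class VisibilityExchange
{
public:
	//Culling thread: replace the shared list with a freshly built one
	void Publish(std::shared_ptr<const VisibilityList> list)
	{
		std::lock_guard<std::mutex> lock(mutex);
		latest = std::move(list);
	}

	//Rendering thread: grab the latest list; this may return the same list
	//several frames in a row if culling hasn't finished a new pass yet
	std::shared_ptr<const VisibilityList> Acquire()
	{
		std::lock_guard<std::mutex> lock(mutex);
		return latest;
	}

private:
	std::mutex mutex;
	std::shared_ptr<const VisibilityList> latest;
};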

On 4/14/2018 at 3:24 AM, Josh said:

All the multithreaded graphics APIs actually just accumulate commands in a buffer and then execute them in a single thread

The benefit of the multi-threaded APIs is that every thread has its own command pool, and each thread can write to a command buffer, so you can use any available threads to write to the command buffers. They are submitted together in the end, yes, but getting to the point where all the command buffers are good to go is way faster. That's why they designed them this way. In the end, less time is spent waiting for one CPU thread to write all the command buffers.

Nvidia has a great document about this: https://developer.nvidia.com/sites/default/files/akamai/gameworks/blog/munich/mschott_vulkan_multi_threading.pdf
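
Roughly, the pattern those slides describe looks like this (heavily abbreviated sketch; device/queue creation, synchronization, cleanup, and the actual draw recording are all omitted):

#include <vulkan/vulkan.h>
#include <thread>
#include <vector>

void RecordInParallel(VkDevice device, VkQueue queue, uint32_t queueFamily, unsigned threadCount)
{
	std::vector<VkCommandPool> pools(threadCount);
	std::vector<VkCommandBuffer> buffers(threadCount);
	std::vector<std::thread> workers;

	for (unsigned i = 0; i < threadCount; ++i)
	{
		//One command pool per thread: pools are not thread-safe, so each
		//recording thread gets its own
		VkCommandPoolCreateInfo poolInfo = {};
		poolInfo.sType = VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO;
		poolInfo.queueFamilyIndex = queueFamily;
		vkCreateCommandPool(device, &poolInfo, nullptr, &pools[i]);

		VkCommandBufferAllocateInfo allocInfo = {};
		allocInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO;
		allocInfo.commandPool = pools[i];
		allocInfo.level = VK_COMMAND_BUFFER_LEVEL_PRIMARY;
		allocInfo.commandBufferCount = 1;
		vkAllocateCommandBuffers(device, &allocInfo, &buffers[i]);

		//Each worker records its own command buffer independently
		workers.emplace_back([&, i]() {
			VkCommandBufferBeginInfo beginInfo = {};
			beginInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
			vkBeginCommandBuffer(buffers[i], &beginInfo);
			//...record this thread's share of the draw calls here...
			vkEndCommandBuffer(buffers[i]);
		});
	}
	for (auto& worker : workers) worker.join();

	//The finished command buffers are still submitted together on one thread
	VkSubmitInfo submitInfo = {};
	submitInfo.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
	submitInfo.commandBufferCount = static_cast<uint32_t>(buffers.size());
	submitInfo.pCommandBuffers = buffers.data();
	vkQueueSubmit(queue, 1, &submitInfo, VK_NULL_HANDLE);
}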


We will see. Doom 2016 ran about the same or even slower on Vulkan in the benchmarks I saw, on Nvidia hardware.

My final render stage is just a list of visible surfaces and that part could easily be split into a bunch of different threads.

But we will see when I actually test it out.


Doom doesn't use a multi-threaded renderer. Of course Vulkan isn't going to magically make things faster on its own; it gives you the ability to do it... On OpenGL you don't directly write to command buffers, so you can't split the work up between threads. Vulkan in itself does not do any multi-threading; this is something you have to implement. Vulkan just gives you the tools to design fast multi-threaded renderers that were not possible before.

I'm not saying this is necessary; your design will be great because the game loop does not have to wait for the renderer. I'm just saying that with Vulkan you could get maximum performance: you could still keep the rendering separate from the game loop, and then you would end up with rendering that is both faster and independent.

Just spit-balling ideas, because it sounds like you're trying to make LE as fast as possible, and this new API allows you to do what only DX12 could do, without worrying about being locked to Windows only.

This optimization would indisputably make LE's renderer way faster, which is perfect for VR. The only question is whether or not it is necessary: is LE fast enough without it in the situations it's designed for? No sense in writing a big, complex renderer if the engine is fast enough as is.

Edit:

Also keep in mind that Nvidia's OpenGL drivers are extremely fast and complex; AMD's are not. On AMD cards, Vulkan does "magically" make things faster just by implementing it, because their driver team went above and beyond on their Vulkan drivers.


Again, Doom doesn't do multi-threading... Why would it be faster than its OpenGL renderer? They've had years to optimize OpenGL drivers; of course it'll be at least as fast in a single-threaded environment.

It's not magic, it's physics at that point... Vulkan can use multiple threads to generate command buffers, more at a time; OpenGL can only do one at a time. It would indisputably be faster; that's just the reality of it.

As time goes on and GPUs get more powerful, a Vulkan renderer that generates command buffers on multiple threads would be even faster, because not only are you sending more work to the GPU thanks to the threaded command buffer generation, the GPU would also be able to handle any work you throw at it. With high-end cards today you will see big performance gains; where you wouldn't is with integrated cards... but that shouldn't be a priority.

Furthermore, in Vulkan you can physically send draw calls from multiple threads, and they are not sent to the main thread by the driver; this is one highlight of Vulkan that only DirectX 12 has. Metal is planning this too; I have not read whether this is already the case in Metal, or if it's just a future plan.


Doom's Vulkan results are probably better on AMD. It seems that Nvidia hardware is better on DX11 and OpenGL, the more traditional rendering methods.

You'll find AMD performing a bit better on Vulkan and DX12 in most cases, though.


I have my doubts about OpenGL commands being a significant bottleneck. That's kind of like saying you're going to make a sports car faster by removing the floor mats. Yes, it will be a little bit lighter, but I don't think you will see any difference. The number one performance bottleneck we run into is pixel fillrate.

My guess is you will see a massive performance increase with my new architecture, and then Vulkan will produce a small improvement on Nvidia cards, and perhaps a 20-30% improvement on AMD cards. But let's see what the actual numbers turn out to be.


The reason you'd want to multithread the command process is for situations where big, new, powerful GPUs are bored because the CPU's one thread can't send commands fast enough to utilize them to the fullest extent. That's not a fair analogy :P So long as your GPU can handle it, why would you not want to throw more work at it? Modern GPUs (10 series, etc.) can certainly handle it.

A great GPU can handle anything a single core on your CPU can throw at it with ease, so you want to throw more at it. This is the most common bottleneck in games these days, with how powerful GPUs are getting. The better your GPU, the more these optimizations will help. It's planning for the future, because as time goes on you'll see more and more improvements from this type of multi-threading; that's why DX12 and Vulkan moved towards it.

Anyways, like I said, it isn't usually necessary, but it would be optimal; just food for thought so you consider this design if you move towards a Vulkan renderer. It'd be a shame to use Vulkan and just move all the rendering to a thread, instead of using all available threads for command buffer generation.

On 4/19/2018 at 9:12 PM, Crazycarpet said:

The reason you'd want to multithread the command process is for situations where big, new, powerful GPUs are bored because the CPU's one thread can't send commands fast enough to utilize them to the fullest extent. That's not a fair analogy :P So long as your GPU can handle it, why would you not want to throw more work at it? Modern GPUs (10 series, etc.) can certainly handle it.

A great GPU can handle anything a single core on your CPU can throw at it with ease, so you want to throw more at it. This is the most common bottleneck in games these days, with how powerful GPUs are getting. The better your GPU, the more these optimizations will help. It's planning for the future, because as time goes on you'll see more and more improvements from this type of multi-threading; that's why DX12 and Vulkan moved towards it.

Anyways, like I said, it isn't usually necessary, but it would be optimal; just food for thought so you consider this design if you move towards a Vulkan renderer. It'd be a shame to use Vulkan and just move all the rendering to a thread, instead of using all available threads for command buffer generation.

Your general sentiment is correct, but the fact is, the idea of "commands" is kind of antiquated. It's more like "hi, I am the rendering thread, here is a block of bytes you will interpret a few times before the next block of bytes arrives, k thx bye".
