Posting here because I've tried several ways to optimize my NN inference runtime and I'm still getting very poor performance.
I run inference on the data in a first function (Run) and collect the result in a second one (GetOutput), so that the operation can proceed in the background while the rest of my Update loop executes.
I believe this implementation should be roughly equivalent to the one in the "Read output asynchronously" example.
This is a snippet of my code:
```csharp
private void Run()
{
    // Flatten the Vector3 inputs into the _inputs float array.
    FloatsFromVec3();
    DisposeTensors(_inputTensor, _outputTensor);
    _inputTensor = new TensorFloat(new TensorShape(1, 63), _inputs);
    _engine.Execute(_inputTensor);
    // Peek the value from Sentis, without taking ownership of the tensor (see PeekOutput docs for details).
    _outputTensor = _engine.PeekOutput() as TensorFloat;
    _outputTensor.AsyncReadbackRequest(ReadbackCallback);
}

void ReadbackCallback(bool completed)
{
    if (!completed)
    {
        DebugUtils.LogAvatarInput("ReadbackCallback failed: not completed", DebugLevelEnum.Debug);
        return;
    }
    // Put the downloaded tensor data into a readable tensor before indexing.
    _outputTensor.MakeReadable();
    DebugUtils.LogAvatarInput("Output tensor processed", DebugLevelEnum.Debug);
}

public int GetOutput()
{
    if (_skip)
        return _outPose;

    // Spin until the async readback has completed.
    while (!_outputTensor.IsAsyncReadbackRequestDone())
    {
        DebugUtils.LogAvatarInput("Waiting for async readback to complete ....", DebugLevelEnum.Debug);
    }

    float[] tensorVals = new float[N];
    int outVal;
    if (_outputTensor != null)
    {
        tensorVals = _outputTensor.ToReadOnlyArray();
        // Argmax: index of the largest output value (Max()/IndexOf() require System.Linq).
        outVal = tensorVals.ToList().IndexOf(tensorVals.Max());
    }
    // ...
}
```
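For context, this is roughly how the two calls are scheduled across a frame (a simplified sketch, not my exact code: the surrounding MonoBehaviour plumbing and `RunEveryNFrames` are placeholders):

```csharp
// Sketch of the intended call pattern: Run() kicks off inference and the
// async readback early in the frame, and GetOutput() collects the result
// after the rest of the Update work. RunEveryNFrames is a placeholder.
void Update()
{
    if (Time.frameCount % RunEveryNFrames == 0)
        Run(); // schedule inference + async readback

    // ... the rest of the frame's work happens here ...

    _outPose = GetOutput(); // only blocks if the readback hasn't finished
}
```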
Now there are some things that don't make sense to me:
- I am importing an extremely simple NN (2 layers × 16 neurons, with 64 inputs, ~1000 FLOPs in total). Processing it takes ~0.5-1 ms in Sentis, compared to ~3 µs for the same model on TFLite.
- I am trying to simulate the runtime cost of my main app with a single delayer class (DataDrivenDelayer.cs) that just counts to N (a minimal sketch of it follows this list). My expectation was that while the delayer burns time, the NN would run in a background thread, so the class running Sentis inference (DataDriven.cs) should not contribute to the PlayerLoop. Apparently that's not the case.
- If I try to run a set of models in parallel, their runtimes don't add up linearly: running 4-5 simple models is much less expensive than 5× the cost of a single one.
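For reference, the delayer is essentially just a counting loop in Update (a minimal sketch; the actual DataDrivenDelayer.cs may differ in details such as the value of N and the accumulator):

```csharp
using UnityEngine;

// Minimal sketch of DataDrivenDelayer: it simulates the per-frame cost of
// the main app by counting to N every Update. N and the _sink accumulator
// (which keeps the loop from being optimized away) are illustrative
// details, not the exact implementation.
public class DataDrivenDelayer : MonoBehaviour
{
    public int N = 1_000_000;
    private long _sink;

    void Update()
    {
        for (int i = 0; i < N; i++)
            _sink += i;
    }
}
```

The idea is that scaling N changes how long each frame spends outside Sentis, so any truly asynchronous inference should overlap with that time for free.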