Channel: AI Beta - Unity Discussions

Sentis inference is extremely slow


Posting here as I’ve been trying, unsuccessfully, to optimize my NN inference runtime in several ways, and I still get terrible performance.

I run inference on some data through a first function (Run) and collect the result through a second one (GetOutput), so that the operation can run in a thread while the rest of my Update loop executes.

I think this implementation should be roughly equivalent to the one in the "Read output asynchronously" example.

This is a snippet of my code:

private void Run()
{
    FloatsFromVec3();

    DisposeTensors(_inputTensor, _outputTensor);

    _inputTensor = new TensorFloat(new TensorShape(1, 63), _inputs);
    _engine.Execute(_inputTensor);

    // Peek the value from Sentis, without taking ownership of the Tensor (see PeekOutput docs for details).
    _outputTensor = _engine.PeekOutput() as TensorFloat;
    _outputTensor.AsyncReadbackRequest(ReadbackCallback);
}

void ReadbackCallback(bool completed)
{
    if (!completed)
    {
        DebugUtils.LogAvatarInput("ReadbackCallback failed: not completed", DebugLevelEnum.Debug);
        return;
    }

    // Put the downloaded tensor data into a readable tensor before indexing.
    _outputTensor.MakeReadable();

    DebugUtils.LogAvatarInput("Output tensor processed", DebugLevelEnum.Debug);
}

public int GetOutput()
{
    if (_skip)
        return _outPose;

    if (_outputTensor != null)
    {
        // Busy-wait until the async readback has finished.
        while (!_outputTensor.IsAsyncReadbackRequestDone())
        {
            DebugUtils.LogAvatarInput("Waiting for async readback to complete ....", DebugLevelEnum.Debug);
        }

        float[] tensorVals = _outputTensor.ToReadOnlyArray();

        // get argmax
        float outVal = tensorVals.ToList().IndexOf(tensorVals.Max());
    }
...
}
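As an aside, I know the LINQ argmax above allocates a new list on every call; a plain loop avoids that per-frame garbage. A minimal sketch (plain C#, nothing Sentis-specific):

```csharp
// Allocation-free argmax over a float array; returns the index of the
// largest value, or -1 for an empty input.
static int ArgMax(float[] values)
{
    int best = -1;
    float bestVal = float.NegativeInfinity;
    for (int i = 0; i < values.Length; i++)
    {
        if (values[i] > bestVal)
        {
            bestVal = values[i];
            best = i;
        }
    }
    return best;
}
```

This doesn't explain the millisecond-scale cost, though; it only removes some GC pressure.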

Now, there are some things that don’t make sense to me:

  1. I am importing an extremely simple NN (2 layers x 16 neurons, with 64 inputs, for a total of ~1000 FLOPs). Processing it takes ~0.5 ms to 1 ms, compared to ~3 µs running the same model on TFLite.
  2. I am trying to simulate the runtime cost of my main app with a single delayer class (DataDrivenDelayer.cs) that just counts to N. My expectation was that, as I wait longer, the NN would run in a thread, so the class running Sentis inference (DataDriven.cs) should contribute nothing to the PlayerLoop. Apparently that’s not the case.
  3. If I run a set of models in parallel, their runtimes don’t add up linearly: running 4-5 simple models seems much cheaper than 5 times the cost of a single one.
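To see where the ~0.5-1 ms goes, one option is to time the dispatch and the blocking readback separately with a Stopwatch. A rough sketch, reusing the `_engine` and `_inputTensor` fields from the snippet above (here `MakeReadable()` is used as the blocking step that waits for the result):

```csharp
using System.Diagnostics;

// Hypothetical instrumentation: time the Execute() dispatch separately from
// the blocking readback, so any fixed per-dispatch overhead becomes visible.
var sw = Stopwatch.StartNew();
_engine.Execute(_inputTensor);
long dispatchTicks = sw.ElapsedTicks;

var output = _engine.PeekOutput() as TensorFloat;
output.MakeReadable();   // blocks until the backend result is downloadable
long totalTicks = sw.ElapsedTicks;

UnityEngine.Debug.Log(
    $"dispatch: {dispatchTicks * 1000f / Stopwatch.Frequency} ms, " +
    $"dispatch+readback: {totalTicks * 1000f / Stopwatch.Frequency} ms");
```

If the dispatch alone already costs most of the time, that would point to fixed per-Execute overhead rather than the FLOPs themselves, which would also be consistent with point 3 above.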

9 posts - 3 participants


