For a metric like a floating-point operation count (flop_count_sp, for example), nvprof will display the minimum, maximum and average values across the n invocations of each kernel.

I would take the average value for a given kernel and multiply it by the number of times that kernel is run. I would then add up these products for all the kernels in question and divide that total by the total duration of all those kernels. All of this data is available from nvprof, but you will have to combine the per-kernel results yourself.

That should give you a fairly defensible number that you can call the average flops per second for your device code (or for those kernels in your device code).
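
As a rough sketch of the arithmetic described above, here is the calculation in Python. The `KernelStats` type, the kernel names, and all the numbers are made up for illustration; you would fill them in by hand from your own nvprof output (the average operation count and invocation count from the metrics run, the total time per kernel from the timing summary), and convert the durations to seconds first.

```python
from dataclasses import dataclass


@dataclass
class KernelStats:
    """Hypothetical container for values copied by hand from nvprof output."""
    name: str               # kernel name as reported by nvprof
    invocations: int        # number of times the kernel ran
    avg_flop_count: float   # average operation count per invocation
    total_time_s: float     # total duration of all invocations, in seconds


def average_flops_per_second(kernels):
    """Sum (average count * invocations) over kernels, divide by total time."""
    total_flop = sum(k.avg_flop_count * k.invocations for k in kernels)
    total_time = sum(k.total_time_s for k in kernels)
    if total_time <= 0:
        raise ValueError("total kernel duration must be positive")
    return total_flop / total_time


# Illustrative numbers only, not real measurements.
kernels = [
    KernelStats("kernel_a", invocations=10, avg_flop_count=2.0e9, total_time_s=0.5),
    KernelStats("kernel_b", invocations=4, avg_flop_count=5.0e8, total_time_s=0.1),
]

rate = average_flops_per_second(kernels)
print(f"{rate / 1e9:.2f} GFLOP/s")
```

Note that this weights each kernel by how much work it does and how long it runs, which is why it is not the same as averaging the per-kernel rates.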