Post by emulog on Jan 12, 2023 7:24:18 GMT
The test case is this:
A _newimage of 256*192*32 needs to be scaled to 1024*768*32.
To avoid slow PSETs I use an array with direct array addressing; after all operations on the image I PUT this array into the _newimage, which is quite fast.
Then I _putimage it with scaling, without _smooth, to the main screen and measure the speed in a loop of, say, 1000 iterations.
The test shows that these two operations, PUT and _putimage, run 412 times a second, i.e. 412 FPS.
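The measurement described above boils down to a timing loop like the one below. This is only a minimal sketch in FB syntax: NUM_FRAMES, FrameWork and MeasureFps are placeholder names, and the dummy loop inside FrameWork stands in for whatever is actually being timed (here it would be the PUT plus the scaled blit).

```freebasic
' Sketch of an FPS measurement loop. All names are placeholders.
Const NUM_FRAMES = 1000

Dim Shared As ULong sink   ' shared, so the dummy work is not optimized away

Sub FrameWork()
    ' Stand-in workload; replace with the real per-frame operation.
    For i As Integer = 0 To 9999
        sink += i
    Next
End Sub

Function MeasureFps(ByVal n As Integer) As Double
    Dim As Double t0 = Timer            ' Timer returns seconds as a Double
    For k As Integer = 1 To n
        FrameWork()
    Next
    Dim As Double elapsed = Timer - t0
    If elapsed <= 0 Then Return 0       ' too fast to measure at this n
    Return n / elapsed                  ' frames per second
End Function

Print Using "#######.# FPS"; MeasureFps(NUM_FRAMES)
```

With very fast operations the elapsed time gets close to the timer resolution, so raising the iteration count gives a steadier number.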
Okay, not slow but not fast either, so let's try scaling with my own code into a bigger array and then PUT that to the main screen. The optimized loop using uint64 gets only some 80 FPS.
So what is in the assembly then? With $Checking:Off set, looking into the EXE I see a lot of variable loads and stores, almost no use of registers to avoid indirect loading, and even the _shr and _shl operations are done with a CALL. If I write, say, *256 it does emit a shift; if I write _shr/_shl it makes a call. Oh my.
Also, for select/case no jump table is generated at all, just a row of jne. Compiler options are set to -O3 -arch native.
Expectations ruined, so I tried something else. Got FBC. Started anew, made some tests. Okay, let's write the scaler there: make two DIMs, use pointers, PUT the scaled result to the main screen.
The test shows 10354 FPS for just this scale op. Okay, look into the asm: the sequential writes to array elements become a row of simple indirect writes through the eax register. 16 writes of the same value per source pixel, making 256 -> 1024.
10354 vs 412 is a 25x difference, so _putimage in QB64 is 25x slower than a hand-made FOR..NEXT in FB. Both BASICs use the same GCC with -O3 -arch native on the AMD 5700G CPU.
QB64 is a very friendly IDE, alas not fast.
PS...
Ported all the Z80 emulator code written so far, not much of it though, to FB.
The most demanding part is a digital filter that blurs the image using the 8 surrounding pixels; the source image is 256*192 with a 16-colour palette.
First the 16-bit pairs (pixel+attr) are converted to a uint64 pack of 8*4bpp, then via LUT I get three RGB colours for the 4+4+1 pixel packs, sum them, then scale 4x to 1024*768.
Got 261 FPS on QB, and the same code ported to FB, with the new scaler instead of _putimage, gives 4277 FPS. This time a 16x speed gain.
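To make the first step of that filter concrete, here is a minimal sketch in FB of packing one 8-pixel cell into 4bpp nibbles. It assumes the ZX Spectrum attribute layout (bits 0-2 ink, bits 3-5 paper, bit 6 bright, bit 7 flash, ignored here), which the post does not spell out. PackCell and its parameters are made-up names, and the real packing order and the LUT stage that follows may well differ.

```freebasic
' Sketch: expand 8 pixel bits + 1 attribute byte into 8 colour indices of 4 bits.
' Assumed attribute layout: bits 0-2 ink, 3-5 paper, 6 bright, 7 flash (ignored).
Function PackCell(ByVal pixels As UByte, ByVal attr As UByte) As ULongInt
    Dim As ULongInt bright   = (attr Shr 3) And 8          ' bit 6 moved to bit 3
    Dim As ULongInt inkIdx   = (attr And 7) Or bright
    Dim As ULongInt paperIdx = ((attr Shr 3) And 7) Or bright
    Dim As ULongInt acc = 0
    For b As Integer = 7 To 0 Step -1                      ' leftmost pixel first
        acc Shl= 4
        If (pixels Shr b) And 1 Then acc Or= inkIdx Else acc Or= paperIdx
    Next
    Return acc                                             ' leftmost pixel in the top nibble
End Function

' Bright white ink on bright black paper, left half of the cell set:
Print Hex(PackCell(&hF0, &h47))   ' FFFF8888
```

Eight 4-bit indices only fill the low 32 bits of the result; the uint64 return type follows the wording of the post.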
QB scale-out sample:
Sub SCRSHOW (RR As Integer, IMG As Integer) Static
    _Dest RR: Put (0, 0), SS(), PSet   ' copy the work array SS() into image RR
    _PutImage , RR, IMG                ' no coordinates given: stretch RR over all of IMG
End Sub
FB scale out sample:
q3=1024-3 : q4=3072-1 ' q3: from end of a 4-pixel run to same column on next row; q4: back up 3 rows, on to next block
p1=@zx(5)
For y =0 To 191
    p2=@sca(5)+y*4096+3 ' each source row fills 4 dest rows of 1024 pixels
    For x=0 To 255
        q1=*p1 ' get SRC pix
        *p2=q1:p2+=1:*p2=q1:p2+=1:*p2=q1:p2+=1:*p2=q1:p2+=q3
        *p2=q1:p2+=1:*p2=q1:p2+=1:*p2=q1:p2+=1:*p2=q1:p2+=q3
        *p2=q1:p2+=1:*p2=q1:p2+=1:*p2=q1:p2+=1:*p2=q1:p2+=q3
        *p2=q1:p2+=1:*p2=q1:p2+=1:*p2=q1:p2+=1:*p2=q1:p2-=q4 ' write DEST pix
        p1+=1
    Next
Next
Put img,(0,0),sca,PSet
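For readability, the same 4x nearest-neighbour copy can also be written with the row loop rolled up. This is a sketch, not the code that was benchmarked: srcPix, dstPix and Scale4x are hypothetical names, plain ULong arrays stand in for zx and sca (so there is no image header to skip), and whether GCC unrolls it to match the hand-unrolled version has not been measured here.

```freebasic
' Sketch: 4x nearest-neighbour upscale, 256*192 -> 1024*768, 32-bit pixels.
Const SRC_W = 256, SRC_H = 192, FACTOR = 4
Const DST_W = SRC_W * FACTOR, DST_H = SRC_H * FACTOR

Dim Shared As ULong srcPix(0 To SRC_W * SRC_H - 1)
Dim Shared As ULong dstPix(0 To DST_W * DST_H - 1)

Sub Scale4x()
    Dim As ULong Ptr ps = @srcPix(0)
    For y As Integer = 0 To SRC_H - 1
        ' First destination pixel of the 4-row band for source row y
        Dim As ULong Ptr pd = @dstPix(y * FACTOR * DST_W)
        For x As Integer = 0 To SRC_W - 1
            Dim As ULong c = *ps              ' read the source pixel once
            ps += 1
            Dim As ULong Ptr row = pd
            For r As Integer = 0 To FACTOR - 1
                row[0] = c : row[1] = c : row[2] = c : row[3] = c
                row += DST_W                  ' same columns, next row down
            Next
            pd += FACTOR                      ' next 4-pixel-wide block
        Next
    Next
End Sub

' Usage: fill the source with a recognisable pattern, scale, spot-check.
For i As Integer = 0 To SRC_W * SRC_H - 1
    srcPix(i) = i
Next
Scale4x()
Print dstPix(0); dstPix(3); dstPix(4); dstPix(DST_W * FACTOR)   ' values 0, 0, 1, 256
```

The write pattern is the same 16 stores per source pixel as in the sample above; only the pointer bookkeeping is spelled out differently.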