QB64 unable to generate proper fast EXE

emulog
New Member

Posts: 10

Post by emulog on Jan 12, 2023 7:24:18 GMT

The test case is this:

A 256*192*32 _NEWIMAGE needs to be scaled to 1024*768*32.
To avoid slow PSETs I use an array with direct array addressing; after all operations on the image, I PUT this array into the _NEWIMAGE, which is quite fast.
Then I do a _PUTIMAGE with scaling, without _SMOOTH, to the main screen and measure the speed in a loop of, say, 1000 iterations.
The test shows that these two operations, PUT and _PUTIMAGE, run 412 times a second, i.e. 412 FPS.

Okay, not slow, but not fast either. Let's try scaling with my own code into a bigger array and then PUT it to the main screen. An optimized loop using uint64 gets just some 80 FPS.
So what is in the assembly then? With $Checking:Off, looking into the EXE I see a lot of variable loads and stores, almost no use of registers to avoid indirect loads, and even the _SHR and _SHL operations are done via CALL. If I write, say, *256 it emits a shift; if I write _SHR/_SHL it makes a call. Oh my.

Also, for SELECT/CASE no jump table is generated at all, just a row of JNE. Compiler options are set to -O3 -march=native.

Expectations ruined, so I tried something else. Got FBC. Started anew, made some tests. Okay, let's write the scaler there: make two DIMs, use pointers, PUT the scaled result to the main screen.
The test shows 10354 FPS for just this scale op. Okay, look into the asm. Sequential writes to array elements come out as a row of simple indirect writes through the eax register: 16 writes of the same value, making 256>1024.

10354 vs 412 is about a 25x difference, so _PUTIMAGE in QB64 is 25x slower than a hand-made FOR..NEXT in FB. Both BASICs use the same GCC with -O3 -march=native on an AMD 5700G CPU.

QB64 is a very friendly IDE, alas not fast.

PS...
I ported all the Z80 emulator code written so far (not much of it, though) to FB.
The most demanding part is a digital filter that blurs the image using the 8 surrounding pixels; the source image is 256*192 with a 16-colour palette.
First, 16-bit pairs (pixel+attr) are converted to uint64 packs of 8*4bpp; then, via LUT, I get three RGB colours for the 4+4+1 pixel packs, sum them, and scale 4x to 1024*768.
I got 261 FPS on QB, and the same code ported to FB, with the new scaler instead of _PUTIMAGE, gives 4277 FPS. This time a 16x speed gain.

QB scale-out sample:

Sub SCRSHOW (RR As Integer, IMG As Integer) Static

_Dest RR: Put (0, 0), SS(), PSet: _PutImage , RR, IMG

End Sub

FB scale out sample:

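q3=1024-3 : q4=3072-1 ' q3: step from the 4th pixel of a row to the next row; q4: step back from the 4th row to the next 4x4 block
p1=@zx(5)

For y =0 To 191
    p2=@sca(5)+y*4096+3 ' each source line fills 4 destination lines of 1024 pixels

    For x=0 To 255
        q1=*p1 ' get SRC pix

        ' write DEST pix: 4 rows of 4 identical pixels
        *p2=q1:p2+=1:*p2=q1:p2+=1:*p2=q1:p2+=1:*p2=q1:p2+=q3
        *p2=q1:p2+=1:*p2=q1:p2+=1:*p2=q1:p2+=1:*p2=q1:p2+=q3
        *p2=q1:p2+=1:*p2=q1:p2+=1:*p2=q1:p2+=1:*p2=q1:p2+=q3
        *p2=q1:p2+=1:*p2=q1:p2+=1:*p2=q1:p2+=1:*p2=q1:p2-=q4

        p1+=1
    Next
Next
Put img,(0,0),sca,PSet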
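
For readers who want to compare against plain C, here is a minimal sketch of the same integer nearest-neighbour scale (256*192 to 1024*768 when the factor is 4). It is not taken from either compiler's output, and the name scale_nearest and its parameters are made up for illustration. Unlike the unrolled pointer writes above, it expands one row and then duplicates it with memcpy; which variant is faster is not measured here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: nearest-neighbour upscale by an integer factor.
   Every source pixel becomes a factor x factor block in the destination.
   src holds w*h pixels; dst must hold (w*factor)*(h*factor) pixels. */
void scale_nearest(const uint32_t *src, uint32_t *dst,
                   size_t w, size_t h, size_t factor)
{
    size_t dst_w = w * factor;

    for (size_t y = 0; y < h; y++) {
        /* First destination row produced by source row y. */
        uint32_t *row = dst + y * factor * dst_w;

        /* Expand the source row horizontally. */
        for (size_t x = 0; x < w; x++) {
            uint32_t px = src[y * w + x];
            for (size_t i = 0; i < factor; i++)
                row[x * factor + i] = px;
        }

        /* Copy the expanded row into the remaining factor-1 rows. */
        for (size_t r = 1; r < factor; r++)
            memcpy(row + r * dst_w, row, dst_w * sizeof *row);
    }
}
```

A call such as scale_nearest(src, dst, 256, 192, 4) corresponds to the loop above, minus the image-header offsets that the FB arrays carry.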

Last Edit: Jan 12, 2023 11:13:35 GMT by emulog

tonylazuto
New Member

Posts: 25

Post by tonylazuto on Jan 12, 2023 16:40:25 GMT

If you know the ASM you need, as well as the bytecode interpretation of the ASM, I can help you just run those bytecodes directly in QB64. Very simple.

Last Edit: Jan 12, 2023 16:40:35 GMT by tonylazuto

emulog
New Member

Posts: 10

Post by emulog on Jan 13, 2023 0:01:55 GMT

ASM is not an easy thing either. I know Z80 asm at a homebrew level, like creating a tape loader, making screen sprites, doing some beep-beep sounds, etc.
X86 asm is just the same, but the LD mnemonic became MOV, the BC register became AH, and so on. I can read and write it, but I have never done much real programming in it.
To write some code I need an online manual and a lot of test runs.

basicme
New Member

Posts: 2

Post by basicme on Jan 14, 2023 13:25:23 GMT

If you go with hardware images, QB64 runs a few million frames per second on my machine.

DIM count AS _INTEGER64
displayScreen = _NEWIMAGE(1024, 768, 32)
drawScreen = _NEWIMAGE(256, 192, 32)

SCREEN displayScreen
_DEST drawScreen
FOR i = 1 TO 100 'draws something on the drawscreen
    LINE (RND * 256, RND * 192)-(RND * 256, RND * 192), _RGB32(RND * 256, RND * 256, RND * 256), BF
NEXT

hardwareScreen = _COPYIMAGE(drawScreen, 33)


time# = TIMER + 1
_DISPLAYORDER _HARDWARE
DO
    count = count + 1
    IF TIMER > time# THEN
        _TITLE "FPS:" + STR$(count)
        count = 0
        time# = TIMER + 1
    END IF
    _PUTIMAGE , hardwareScreen
    _DISPLAY
LOOP UNTIL _KEYHIT

emulog
New Member

Posts: 10

Post by emulog on Jan 17, 2023 9:10:36 GMT

Indeed.
I moved the draw and _COPYIMAGE steps inside the _PUTIMAGE loop and got some 9000 fps.
But LINE/BF is a dummy CPU load, just to see how it goes.
I use code that blurs the source "Spectrum 48K" image with a 3*3 sum calculation; it runs here on QB at some 260-270 fps, and at 722 fps without that slow _PUTIMAGE I used before.
Now I have a total rewrite of it on FB doing a 5*5 sum calculation, with some luma tweaks, and get some 2788 fps; all data is 256-bit aligned.
Porting that back to QB for a speed test is not really feasible, as it is all done via *ptr. The generated asm, as I have checked, is heavy on register usage; indirect loads are of course still present, but far fewer than in the QB build.

Here is a sample of the QB64 blur code and the FB version, just to see whether something is written badly... The speed difference is around 21x.

' - - - - BLUR IMAGE 261fps, VIA DIVD/S FROM QPX() VIA PQM(),PTOC() TO SS() VIA CLIM(), LUT PER PIXEL 3.5, OPT
Sub SCRBLUR () Static
$Checking:Off
Dim As _Unsigned Long TI, RY, VLX, C2, HOR ' some vars
SCRPACK ' a sub that converts the bitmap to a 4bpp bitmap
TI = 0: For RY = 0 To 191: ' loop over screen lines, top to bottom
V1 = QPX(TI): V4 = QPX(TI + 1): V2 = QPX(TI + 18): V5 = QPX(TI + 19): V3 = QPX(TI + 36): V6 = QPX(TI + 37) ' the uint64 vars each get a pack of 16 pixels, 6 reads in total

VLX = 16: C2 = 2: For HOR = 1 To 256 ' cycle pixel left to right

' cut 4-pixel groups out of the 3*3 matrix and get their pre-summed RGB colour via LUT: LUT + LUT + direct RGB, three terms summed; RGB is the 48-bit version
V7 = PQM(_SHR(V1, 48) And 65520 Or _SHR(V2, 60)) + PQM(_SHR(V3, 48) And 65520 Or (_SHR(V2, 52) And 15)) + PTOC(_SHR(V2, 56) And 15)

SS(HOR + RY * 256) = _SHL(CLIM(_SHR(V7, 32) And 65535), 16) Or _SHL(CLIM(_SHR(V7, 16) And 65535), 8) Or _SHL(CLIM(_SHR(V7, 0) And 65535), 0)' limit rgb lumas to 0-255 and write to virtual screen SS()

V1 = _SHL(V1, 4) Or _SHR(V4, 60): V2 = _SHL(V2, 4) Or _SHR(V5, 60): V3 = _SHL(V3, 4) Or _SHR(V6, 60) ' shift the source uint64 packs one pixel left, so the 3*3 window stays at the same bit positions

VLX = VLX - 1
If VLX = 0 Then ' 16 pixels done: reload the look-ahead uint64 packs, 3 reads in total
V4 = QPX(TI + C2): V5 = QPX(TI + 18 + C2): V6 = QPX(TI + 36 + C2): C2 = C2 + 1: VLX = 16
Else ' otherwise shift the look-ahead uint64 buffers one pixel left
V4 = _SHL(V4, 4): V5 = _SHL(V5, 4): V6 = _SHL(V6, 4)
End If
Next: TI = TI + 18: Next:
End Sub
' - - - - - - - - - -
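
The pixel-window shifting used in the loop above (shift the current pack left by one 4-bit pixel and pull the next pixel in from a look-ahead pack) can be sketched in C as follows. This is only an illustration of the technique, assuming 16 pixels per uint64 with the leftmost pixel in the top nibble; the type nibble_window and both function names are hypothetical and do not appear in the original code.

```c
#include <stdint.h>

/* Hypothetical two-word sliding window over 4-bit pixels,
   packed 16 per uint64_t with the leftmost pixel in the top nibble. */
typedef struct {
    uint64_t cur;   /* window word; its top nibble is the current pixel */
    uint64_t next;  /* look-ahead word that feeds nibbles into cur */
} nibble_window;

/* Return the current (leftmost) pixel of the window. */
static inline unsigned window_top(const nibble_window *w)
{
    return (unsigned)(w->cur >> 60);
}

/* Move the window one pixel to the right: drop the top nibble of cur,
   append the top nibble of next, then shift next as well. */
static inline void window_advance(nibble_window *w)
{
    w->cur = (w->cur << 4) | (w->next >> 60);
    w->next <<= 4;
}
```

After 16 calls to window_advance the look-ahead word is empty and has to be reloaded from the source array, which is what the VLX counter does in the BASIC code. Written as static inline functions like this, a C compiler at -O3 would normally turn each step into plain shift and OR instructions rather than calls.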

And this is the FB version of the same code; it does two pixel calculations per loop iteration instead of one as in the QB64 version.

' - - - - BLUR IMAGE, GOOD OPTIMIZED 5580 FPS, VIA SCRPACK/PQM/CLIM/PTOC

Sub SCRBLUR () Static

Dim As Unsigned Long TI, RY, VLX, C2, HOR,XHOR
Dim As Unsigned Long Ptr Q1
Dim As ULongInt Ptr Q2,Q3
Dim As ULongInt V1,V2,V3,V4,V5,V6,V7

SCRPACK:

Q2=PQPX:For RY = 0 To 191: ' cycle screen lines downwards
V1 = *Q2: V4 = *(Q2+1): V2 = *(Q2+18): V5 = *(Q2+19): V3 = *(Q2+36): V6 = *(Q2+37): ' 6 reads of uint64

VLX = 16:Q1=PSS+(7+RY * 256):Q3=Q2+2:For HOR = 1 To 256 Step 2 ' go left to right

V7 = *(PPQM+((V1 Shr 48) And 65520 Or (V2 Shr 60)))+ *(PPQM+((V3 Shr 48) And 65520 Or ((V2 Shr 52) And 15)))+ *(PPTOC+((V2 Shr 56) And 15)) ' combine 4pix+4pix+direct colour, all via lut get rgb 48bit

Q1+=1:*Q1 = (*(PCLIM+((V7 Shr 32) And 65535))Shl 16) + (*(PCLIM+((V7 Shr 16) And 65535))Shl 8) + *(PCLIM+(V7 And 65535)) ' write pixel to virtual screen, with luma limited to 0-255

V1 = (V1 Shl 4) Or (V4 Shr 60):V4 = (V4 Shl 4):V2 = (V2 Shl 4) + (V5 Shr 60):V5 = (V5 Shl 4):V3 = (V3 Shl 4) + (V6 Shr 60):V6 = (V6 Shl 4) ' shift source uint64 pixels

V7 = *(PPQM+((V1 Shr 48) And 65520 Or (V2 Shr 60)))+ *(PPQM+((V3 Shr 48) And 65520 Or ((V2 Shr 52) And 15)))+ *(PPTOC+((V2 Shr 56) And 15)) ' another combine

Q1+=1:*Q1 = (*(PCLIM+((V7 Shr 32) And 65535))Shl 16) + (*(PCLIM+((V7 Shr 16) And 65535))Shl 8) + *(PCLIM+(V7 And 65535)) ' write pixel

V1 = (V1 Shl 4) Or (V4 Shr 60):V4 = (V4 Shl 4):V2 = (V2 Shl 4) + (V5 Shr 60):V5 = (V5 Shl 4):V3 = (V3 Shl 4) + (V6 Shr 60):V6 = (V6 Shl 4)

VLX = VLX - 2:If VLX = 0 Then V4 = *(Q3): V5 = *(Q3+18): V6 = *(Q3+36):Q3+=1:VLX = 16 ' upload uint64 buffer with 3 reads if 16 pixel-ops done
Next:Q2+=18:Next:End Sub
' - - - - - - - - - -
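
The final step of both versions, splitting the 48-bit sum into three 16-bit channels, limiting each to 0-255 and packing them into one 32-bit pixel, can be sketched in C like this. It is a simplified illustration only: it assumes plain saturation at 255 and a top-to-bottom R, G, B channel order, whereas the real CLIM() table may also scale the sums. The names clamp255 and pack_rgb48 are hypothetical.

```c
#include <stdint.h>

/* Assumed behaviour of the limiter: saturate a 16-bit channel sum at 255. */
static inline uint32_t clamp255(uint32_t v)
{
    return v > 255u ? 255u : v;
}

/* Unpack a 48-bit value holding three 16-bit channel sums
   (assumed R in bits 47..32, G in bits 31..16, B in bits 15..0)
   and pack the limited channels as 0x00RRGGBB. */
static inline uint32_t pack_rgb48(uint64_t v)
{
    uint32_t r = clamp255((uint32_t)(v >> 32) & 0xFFFFu);
    uint32_t g = clamp255((uint32_t)(v >> 16) & 0xFFFFu);
    uint32_t b = clamp255((uint32_t)v & 0xFFFFu);
    return (r << 16) | (g << 8) | b;
}
```

The BASIC code does the limiting with a 65536-entry lookup table instead of a comparison; this sketch does not show which of the two is faster.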
Here is a picture of what it is all about: first the source, then the result of the 5*5 blur with pattern detection.

Last Edit: Jan 17, 2023 9:54:46 GMT by emulog

mohai
New Member

Posts: 7

Post by mohai on Jan 22, 2023 11:50:50 GMT

Hello emulog.

So, you are doing a ZX-Spectrum emulator with QB64. That sounds interesting...
I am looking forward to seeing it working.
I have an idea to make something similar, but with a different purpose.
I know about the problem of not-so-efficient EXE code in QB64. I think QB64 does not do real machine-code execution, but a mix of BASIC interpreting and real-time execution; for most applications, though, it is OK, and InForm can create nice GUIs.

FB means FreeBASIC? I do not know it. I will take a look at it.

pauloastro
New Member

Posts: 9

Post by pauloastro on Jan 25, 2023 9:52:48 GMT

Very nice speed tests, emulog.

I am just curious why you started an emulator using QB64 in the first place. ;-)

Just because QB64 is a C/C++ transpiler?

Just some thoughts: it would be very interesting to see one more emulator in QB64.

I don't know what minimum hardware requirements you imposed on your emulator project, but why give up on using QB64 for that emulator now? Maybe a simpler algorithm could do the trick?

Paulo.
