I am trying to win a bet with a friend of mine on whose "echo" recreation is more efficient. Executable size also matters, but raw speed is what counts most.

This isn't really a "help me" kind of question, but more of an "any ideas for improvement" kind of question.

He is writing his echo recreation in C, and I am writing mine in 64-bit assembly with NASM. We both want to know whether GCC makes programming in assembly pointless.

The code builds into a 1.9 KB executable and uses roughly 2 KB of RAM in total, which seems pretty small to me. I do call some C library functions (strcat and puts), but I checked with him that this would be acceptable.

Code:
extern strcat
extern puts

segment .text
	global main

main:
	;Set up stack
	push r12
	push rbp
	mov	rbp, rsi
	push rbx
	mov	ebx, edi
	sub	rsp, 48
	
	;If argc == 1, no arguments
	cmp	edi, 1
	je .done

	;Else continue
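.start:
	;Start with an empty string in the 40-byte buffer at [rsp+8]
	;(strcat does no bounds checking, so longer input overflows it)
	mov	byte [rsp+8], 0
	xor	r12d, r12d	;r12 = argument index, incremented before first use
	
	jmp	.print
	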
	
.loop:
	;Append argv[r12] to the buffer
	mov	rsi, [rbp+r12*8]
	lea	rdi, [rsp+8]
	call strcat
	
	;Append a separator space
	lea	rdi, [rsp+8]
	mov	esi, space
	call strcat

.print:
	inc	r12
	cmp	ebx, r12d
	jg	.loop
	
	;Print out result
	lea	rdi, [rsp+8]
	call puts
	
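.done:
	;End program: restore the stack and return 0 from main,
	;so the C runtime flushes stdout before exiting
	add	rsp, 48
	pop	rbx
	pop	rbp
	pop	r12
	xor	eax, eax
	ret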
	
section .data
	space db " ", 0
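
For comparison, here is a rough C sketch of the same approach (build one string, then print it once with puts). This is my own illustration, not my friend's actual entry. The helper name `join_args`, its signature, and the caller-supplied buffer are assumptions. It also differs from the assembly on purpose: it tracks the current length instead of rescanning the buffer with strcat, it refuses input that does not fit, and it leaves no trailing space.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: joins argv[1..argc-1] with single spaces into dst.
 * Returns the number of characters written (excluding the NUL),
 * or (size_t)-1 if the result would not fit in cap bytes. */
size_t join_args(char *dst, size_t cap, int argc, char **argv)
{
    size_t len = 0;

    if (cap == 0)
        return (size_t)-1;
    dst[0] = '\0';

    for (int i = 1; i < argc; i++) {
        size_t n = strlen(argv[i]);
        size_t sep = (i > 1) ? 1 : 0;  /* space before every argument but the first */

        if (len + sep + n + 1 > cap)   /* +1 for the terminating NUL */
            return (size_t)-1;

        if (sep)
            dst[len++] = ' ';
        memcpy(dst + len, argv[i], n); /* append at the known end, no rescan */
        len += n;
        dst[len] = '\0';
    }
    return len;
}
```

Calling this from main with argc and argv, then passing the buffer to puts, would give a complete program along the same lines as the assembly.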