[Arm-dev] Fwd: Golang Performance

Wed Jan 4 10:03:00 UTC 2017
Gordan Bobic <gordan at redsleeve.org>

The --cpuprofile option doesn't seem to generate a human-readable file and
my go-fu is weak, but I've attached it anyway.



On Tue, Jan 3, 2017 at 9:18 PM, Jeremy Linton <jlinton at redhat.com> wrote:

> Hi,
>
> On 01/01/2017 11:57 AM, Gordan Bobic wrote:
>
>> I'm not much of a Go user at the best of times, but I am noticing that
>> there seems to be a huge ( > 10x clock-for-clock) performance
>> discrepancy between x86-64 and aarch64 binaries.
>>
>> Specific example I am looking at is rclone, uploading encrypted backups
>> to Amazon Cloud Drive.
>>
>> When I run it on a Westmere class Xeon (3.6GHz), it is consuming about
>> 2% CPU to saturate a 20Mbit uplink. When I run it on an X-Gene (2.0GHz),
>> it is consuming about 50% CPU. Even adjusting for differences in clock
>> speeds, this seems to be a huge difference.
>>
>> Is the Go compiler known to produce very poor results on aarch64, or is
>> something else in play? I know that x86-64 has a much more powerful SIMD
>> unit, but I am not convinced that this is the explanation, and rclone
>> doesn't use AES AFAIK, so hardware implementation of that doesn't seem
>> to explain it either.
>>
>
> AFAIK, Amazon Drive uses SSL/TLS, so it's likely you are using AES.
> Further, Go implements its own TLS/etc. libraries rather than depending on
> OpenSSL/GnuTLS. So, while ARMv8 cores have hardware AES instructions, Go's
> AES implementation is currently only hardware accelerated on x86 and s390.
> MD5 (also used AFAIK) is also missing an ARM64 native implementation even
> though there is an ARM one. So, just those two issues could cause a large
> performance delta, but at those data rates it seems possible there is
> something else going on.
>
> So, a couple of questions: did you build rclone yourself or use one of the
> binaries from rclone.org?
>
> There is a --cpuprofile option to rclone. It might be helpful if you can
> post the output from that.
>
> Thanks,
>
>
>
>> At general purpose pointer chasing such as compiling, the X-Gene seems
>> to produce similar performance clock-for-clock as the Westmere Xeon, so
>> the large discrepancy with rclone seems odd.
>>
>> Has anyone got any insights? Both machines are running CentOS 7.
>>
>> Gordan
>>
>>
>> _______________________________________________
>> Arm-dev mailing list
>> Arm-dev at centos.org
>> https://lists.centos.org/mailman/listinfo/arm-dev
>>
>>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: profile.aarch64
Type: application/octet-stream
Size: 64 bytes
Desc: not available
URL: <http://lists.centos.org/pipermail/arm-dev/attachments/20170104/a467a9d0/attachment-0006.obj>