aboutsummaryrefslogtreecommitdiff
path: root/vendor/gmp-6.3.0/mpn/sparc64/README
diff options
context:
space:
mode:
authorThomas Voss <mail@thomasvoss.com> 2024-06-21 23:36:36 +0200
committerThomas Voss <mail@thomasvoss.com> 2024-06-21 23:42:26 +0200
commita89a14ef5da44684a16b204e7a70460cc8c4922a (patch)
treeb23b4c6b155977909ef508fdae2f48d33d802813 /vendor/gmp-6.3.0/mpn/sparc64/README
parent1db63fcedab0b288820d66e100b1877b1a5a8851 (diff)
Basic constant folding implementation
Diffstat (limited to 'vendor/gmp-6.3.0/mpn/sparc64/README')
-rw-r--r--vendor/gmp-6.3.0/mpn/sparc64/README125
1 files changed, 125 insertions, 0 deletions
diff --git a/vendor/gmp-6.3.0/mpn/sparc64/README b/vendor/gmp-6.3.0/mpn/sparc64/README
new file mode 100644
index 0000000..e2c051a
--- /dev/null
+++ b/vendor/gmp-6.3.0/mpn/sparc64/README
@@ -0,0 +1,125 @@
+Copyright 1997, 1999-2002 Free Software Foundation, Inc.
+
+This file is part of the GNU MP Library.
+
+The GNU MP Library is free software; you can redistribute it and/or modify
+it under the terms of either:
+
+ * the GNU Lesser General Public License as published by the Free
+ Software Foundation; either version 3 of the License, or (at your
+ option) any later version.
+
+or
+
+ * the GNU General Public License as published by the Free Software
+ Foundation; either version 2 of the License, or (at your option) any
+ later version.
+
+or both in parallel, as here.
+
+The GNU MP Library is distributed in the hope that it will be useful, but
+WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
+for more details.
+
+You should have received copies of the GNU General Public License and the
+GNU Lesser General Public License along with the GNU MP Library. If not,
+see https://www.gnu.org/licenses/.
+
+
+
+
+
+This directory contains mpn functions for 64-bit V9 SPARC
+
+RELEVANT OPTIMIZATION ISSUES
+
+Notation:
+ IANY = shift/add/sub/logical/sethi
+ IADDLOG = add/sub/logical/sethi
+ MEM = ld*/st*
+ FA = fadd*/fsub*/f*to*/fmov*
+ FM = fmul*
+
+UltraSPARC can issue four instructions per cycle, with these restrictions:
+* Two IANY instructions, but only one of these may be a shift. If there is a
+ shift and an IANY instruction, the shift must precede the IANY instruction.
+* One FA.
+* One FM.
+* One branch.
+* One MEM.
+* IANY/IADDLOG/MEM must be insn 1, 2, or 3 in an issue bundle. Taken branches
+ should not be in slot 4, since that makes the delay insn come from separate
+ bundle.
+* If two IANY/IADDLOG instructions are to be executed in the same cycle and one
+ of these is setting the condition codes, that instruction must be the second
+ one.
+
+To summarize, ignoring branches, these are the bundles that can reach the peak
+execution speed:
+
+insn1 iany iany mem iany iany mem iany iany mem
+insn2 iaddlog mem iany mem iaddlog iany mem iaddlog iany
+insn3 mem iaddlog iaddlog fa fa fa fm fm fm
+insn4 fa/fm fa/fm fa/fm fm fm fm fa fa fa
+
+The 64-bit integer multiply instruction mulx takes from 5 cycles to 35 cycles,
+depending on the position of the most significant bit of the first source
+operand. When used for 32x32->64 multiplication, it needs 20 cycles.
+Furthermore, it stalls the processor while executing. We stay away from that
+instruction, and instead use floating-point operations.
+
+Floating-point add and multiply units are fully pipelined. The latency for
+UltraSPARC-1/2 is 3 cycles and for UltraSPARC-3 it is 4 cycles.
+
+Integer conditional move instructions cannot dual-issue with other integer
+instructions. No conditional move can issue 1-5 cycles after a load. (This
+might have been fixed for UltraSPARC-3.)
+
+The UltraSPARC-3 pipeline is very simular to the one of UltraSPARC-1/2 , but is
+somewhat slower. Branches execute slower, and there may be other new stalls.
+But integer multiply doesn't stall the entire CPU and also has a much lower
+latency. But it's still not pipelined, and thus useless for our needs.
+
+STATUS
+
+* mpn_lshift, mpn_rshift: The current code runs at 2.0 cycles/limb on
+ UltraSPARC-1/2 and 2.65 on UltraSPARC-3. For UltraSPARC-1/2, the IEU0
+ functional unit is saturated with shifts.
+
+* mpn_add_n, mpn_sub_n: The current code runs at 4 cycles/limb on
+ UltraSPARC-1/2 and 4.5 cycles/limb on UltraSPARC-3. The 4 instruction
+ recurrency is the speed limiter.
+
+* mpn_addmul_1: The current code runs at 14 cycles/limb asymptotically on
+ UltraSPARC-1/2 and 17.5 cycles/limb on UltraSPARC-3. On UltraSPARC-1/2, the
+ code sustains 4 instructions/cycle. It might be possible to invent a better
+ way of summing the intermediate 49-bit operands, but it is unlikely that it
+ will save enough instructions to save an entire cycle.
+
+ The load-use of the u operand is not enough scheduled for good L2 cache
+ performance. The UltraSPARC-1/2 L1 cache is direct mapped, and since we use
+ temporary stack slots that will conflict with the u and r operands, we miss
+ to L2 very often. The load-use of the std/ldx pairs via the stack are
+ perhaps over-scheduled.
+
+ It would be possible to save two instructions: (1) The mov could be avoided
+ if the std/ldx were less scheduled. (2) The ldx of the r operand could be
+ split into two ld instructions, saving the shifts/masks.
+
+ It should be possible to reach 14 cycles/limb for UltraSPARC-3 if the fp
+ operations where rescheduled for this processor's 4-cycle latency.
+
+* mpn_mul_1: The current code is a straightforward edit of the mpn_addmul_1
+ code. It would be possible to shave one or two cycles from it, with some
+ labour.
+
+* mpn_submul_1: Simpleminded code just calling mpn_mul_1 + mpn_sub_n. This
+ means that it runs at 18 cycles/limb on UltraSPARC-1/2 and 23 cycles/limb on
+ UltraSPARC-3. It would be possible to either match the mpn_addmul_1
+ performance, or in the worst case use one more instruction group.
+
+* US1/US2 cache conflict resolving. The direct mapped L1 date cache of US1/US2
+ is a problem for mul_1, addmul_1 (and a prospective submul_1). We should
+ allocate a larger cache area, and put the stack temp area in a place that
+ doesn't cause cache conflicts.