
Origins of the SWP instruction

From: john@acorn.co.uk (John Bowler)
Subject: Re: Multiprocessing Archimedes??
Date: 16 Aug 91 11:10:50 GMT

torq@GNU.AI.MIT.EDU (Andrew Mell) writes:
>I notice that the Arm3 has a new instruction over the Arm2 which is
>SWP. It swaps a byte or a word between register and external memory.
>(uninterruptible between the read and write)
  ^^^^^^^^^^^^^^^

Indeed, but not necessarily not interleavable with other memory operations
(sorry about the double negative :-).  In particular, to fully support the
SWP on a system with multiple memory bus masters the memory control logic
which decides which bus master has access to the memory next would have to
force an interlock between the memory read and memory write of the SWP
instruction.  Now, the ARM3 has a LOCK pin for this, but to support
multi-processors you need to connect it to something :-).

>All very interesting you might say, but it intrigues me as this sort
>of instruction is usually only used in multiprocessor systems as a 
>software semaphore.
>
>Why did Acorn add this instruction to the Arm3?

Because a long time ago, when we were very young (;-) we tried to write a
multi-threaded OS (ARX) and we ``found'' (sic, thought) that it was
spending a lot of time going into supervisor mode and disabling interrupts
so that it could implement mutexes (for user mode code - including the OS,
which ran in user mode too).  In theory SWP allows user code to implement
mutexes efficiently.
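
The idea of a user-mode mutex built from an atomic swap can be sketched as follows. This is a minimal Python model, not ARM code and not the ARX implementation: the names `SwapCell`, `SwapMutex` and `demo` are hypothetical, and a `threading.Lock` stands in for the hardware interlock between the read and the write of SWP.

```python
import threading
import time


class SwapCell:
    """One memory word with an atomic swap, standing in for SWP (hypothetical model)."""

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()  # plays the role of the read/write interlock

    def swp(self, new):
        """Atomically store `new` and return the previous contents."""
        with self._guard:
            old, self._value = self._value, new
            return old


class SwapMutex:
    """Spin mutex built only from swaps: 0 means free, 1 means held."""

    def __init__(self):
        self._cell = SwapCell(0)

    def acquire(self):
        # Swap in 1; if the old value was already 1, someone else holds it.
        while self._cell.swp(1) != 0:
            time.sleep(0)  # yield to other threads instead of spinning hard

    def release(self):
        self._cell.swp(0)


def demo(threads=4, rounds=200):
    """Increment a shared counter under the mutex and return the final count."""
    mutex = SwapMutex()
    counter = [0]

    def worker():
        for _ in range(rounds):
            mutex.acquire()
            counter[0] += 1
            mutex.release()

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return counter[0]
```

The point of the sketch is that `acquire` and `release` never need a privileged operation: the only primitive used is the swap itself.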

As far as I am concerned the MP aspects of SWP are bonuses (clearly these
were considered at the same time - or the LOCK pin wouldn't be there).
Notice that SWP always bypasses the cache; again this is MP support. However,
there is an omission here in that it is impossible to do a (reliable) read
from external memory (you might get the cache contents instead!)

John Bowler (jbowler@acorn.co.uk)


From: john@acorn.co.uk (John Bowler)
Subject: Re: Multiprocessing Archimedes??
Date: 19 Aug 91 16:25:33 GMT

julian@bridge.welly.gen.nz writes:
>john@acorn.co.uk (John Bowler) writes:
>
>> Notice that SWP always bypasses the cache; again this is MP support. However,
>> there is an omission here in that it is impossible to do a (reliable) read
>> from external memory (you might get the cache contents instead!)
>
>If you're using it to implement semaphores, this is not a problem, as you'd
>never need to access the semaphore with any instruction other than SWP.

Yes; there is no problem with the semaphore, but the semaphore must be
protecting some state which is shared.  When a processor has claimed that
semaphore it probably needs to read the state and to obtain consistent
results when it reads it.  If the data is in cacheable memory the only way
it can do that is to use sequences of the form:-

             SWP rx, rx, [raddr]         ; read a value out
             STR rx, [raddr]             ; and put it back... :-(

The alternative is to allocate shared data in uncacheable memory.  This
requires some OS intervention (a user program cannot simply allocate
shareable data structures out of its own heap unless the whole heap is
uncacheable) and uncacheable data obviously has a performance hit.

>BTW. You wouldn't happen to know the instruction format for SWP, by any
>     chance? If a software emulator can be written for it for ARM2 machines
>     (like the FPE - or even add it to the FPE) then we can all start using
>     it.

RISC iX 1.2 emulates the SWP instruction on machines which do not support
it.  RISC OS doesn't.  The assembler syntax is:-

         SWP{cond}{B}      Rd, Rm, [Rn]

the semantics (except for the cache behaviour and so on) are:-

         MOV              <temp>, Rm
         LDR{cond}{B}     Rd, [Rn]
         STR{cond}{B}     <temp>, [Rn]

(ie the SWP Rx, Rx, [Raddr] example above *does* store the *old* Rx value
in [Raddr]... :-).
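
The ordering described above (take the source value first, then load, then store) can be checked with a small Python model. It is only a sketch under simplifying assumptions: registers and memory are plain dictionaries, each memory cell is as wide as the access, and cache behaviour, aborts and R15 rules are ignored. The function name `swp` and the dictionary layout are illustrative, not anything defined by the posts.

```python
WORD_MASK = 0xFFFFFFFF
BYTE_MASK = 0xFF


def swp(regs, mem, rd, rm, rn, byte=False):
    """Model SWP{B} Rd, Rm, [Rn] on dict-based registers and memory."""
    mask = BYTE_MASK if byte else WORD_MASK
    temp = regs[rm]                        # MOV  <temp>, Rm
    addr = regs[rn]
    regs[rd] = mem.get(addr, 0) & mask     # LDR{B} Rd, [Rn]
    mem[addr] = temp & mask                # STR{B} <temp>, [Rn]


# SWP R0, R0, [R1]: a pure swap through a single register.
regs = {"r0": 5, "r1": 0x1000}
mem = {0x1000: 9}
swp(regs, mem, "r0", "r0", "r1")
swapped = (regs["r0"], mem[0x1000])
```

Because the source register is read before the destination is overwritten, `swapped` holds the old memory value in the register and the old register value in memory, even when Rd and Rm are the same register.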

The instruction format is:-

     bit 31                                                        bit 0
       c.o.n.d.0.0.0.1 0.B.0.0.n.n.n.n d.d.d.d.0.0.0.0 1.0.0.1.m.m.m.m

       c.o.n.d - the condition
       B       - 0 = swap word
                 1 = swap byte
       n.n.n.n - Rn
       d.d.d.d - Rd
       m.m.m.m - Rm
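
As a cross-check of the layout above, here is a short Python sketch that assembles the 32-bit word from its fields. The function name `encode_swp` is hypothetical, and the default condition value 1110 (execute always) is the usual ARM encoding rather than something stated in this post.

```python
def encode_swp(rd, rm, rn, byte=False, cond=0b1110):
    """Build the SWP{B} Rd, Rm, [Rn] instruction word from its fields."""
    for reg in (rd, rm, rn):
        if not 0 <= reg <= 15:
            raise ValueError("register numbers must be 0..15")
    if not 0 <= cond <= 15:
        raise ValueError("condition must be 0..15")
    return (
        (cond << 28)          # c.o.n.d
        | (0b0001 << 24)      # fixed bits 27..24
        | (int(byte) << 22)   # B: 0 = word, 1 = byte
        | (rn << 16)          # n.n.n.n
        | (rd << 12)          # d.d.d.d
        | (0b1001 << 4)       # fixed bits 7..4
        | rm                  # m.m.m.m
    )


word_form = encode_swp(0, 1, 2)             # SWP  R0, R1, [R2]
byte_form = encode_swp(0, 1, 2, byte=True)  # SWPB R0, R1, [R2]
```

The word and byte forms differ only in bit 22.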

Data aborts (from the memory manager) leave Rd/Rm as they were before.
SWP bypasses the ARM3 cache, although the write operation still updates
the cache (if the address is cached).  I don't know whether the read
will cause the rest of that part of the cache to be updated (I assume
not, and the programmer should not care :-)

John Bowler (jbowler@acorn.co.uk)



From: dseal@armltd.co.uk (David Seal)
Subject: Re: ARM3 instructions.
Date: 4 Sep 92 15:01:12 GMT

In article <4422@gos.ukc.ac.uk> amsh1@ukc.ac.uk (Brian May#2) writes:

>  I don't have an Archie myself but have used them quite a lot in the past.
>I was recently mucking about with a friend's A5000, trying to find the new
>instructions that turned the cache on and off. I found them, they were
>co-processor instructions with the processor itself as (I think) number 0.

Coprocessor 15, in fact.

>  Anyway, as I was disassembling away I found a new instruction (well, I had
>never come across it before). It was 'SWP' and I imagine it swaps registers
>with registers, maybe with memory as well? I can't remember. If it does
>reg<->mem as well, and is uninterruptible, perhaps it is for use as a
>semaphore in multi-processor systems?

The SWP instruction was new to the ARM2as macrocell. I believe ARM3 was the
first full chip which contained it. More recent macrocells and chips like
ARM6, ARM60, ARM600 and ARM610 also contain it.

It only swaps a register with a memory location (either a byte or a word),
and not two registers. It can however read the new contents of the memory
location from one register, and write the old contents of the memory
location to another register - i.e. it doesn't have to do a pure swap. This
may be the source of your idea that it can swap two registers. It is indeed
uninterruptible, and yes, it is intended for semaphores.

>  Of course I won't be the first person to notice this so I wondered, could
>someone post some info on this, and also on the co-processor instructions
>relevant to the CPU itself?

The SWP instruction:
  Bits 31..28: Usual condition field
  Bits 27..23: 00010
  Bit 22:      0 for a word swap, 1 for a byte swap
  Bits 21..20: 00
  Bits 19..16: Base register (addresses the memory location involved)
  Bits 15..12: Destination register (where the old memory contents go)
  Bits 11..4:  00001001
  Bits 3..0:   Source register (where the new memory contents come from)

  Byte swaps use the bottom byte of the source and destination registers,
  and clear the top three bytes of the destination register. There are
  various rules about how R15 works in each register position, similar to
  those for LDR and STR instructions. The destination and source registers
  are allowed to be the same for a pure swap. I don't know offhand what
  would happen if the base register were equal to one or both of the others,
  but I don't think I'd recommend doing it!

  Assembler syntax is (using <> around optional sections):
    SWP<cond><B> Rdest,Rsrc,[Rbase]
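
For anyone disassembling by hand, the field list above can be turned into a small recogniser. This is a Python sketch only: the name `decode_swp` and the returned dictionary keys are illustrative, and the fixed-bit mask is derived directly from the bit positions listed above.

```python
# Fixed bits: 27..23 = 00010, 21..20 = 00, 11..4 = 00001001.
SWP_MASK = 0x0FB00FF0
SWP_PATTERN = 0x01000090


def decode_swp(word):
    """Return the SWP fields of a 32-bit instruction word, or None if it is not SWP."""
    if word & SWP_MASK != SWP_PATTERN:
        return None
    return {
        "cond": (word >> 28) & 0xF,
        "byte": bool((word >> 22) & 1),
        "base": (word >> 16) & 0xF,   # Rbase
        "dest": (word >> 12) & 0xF,   # Rdest
        "src": word & 0xF,            # Rsrc
    }


fields = decode_swp(0xE1420091)
not_swp = decode_swp(0xE1A00000)  # a MOV, so the fixed bits do not match
```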

The ARM3 cache control registers are all coprocessor 15 registers, accessed
by MRC and MCR instructions in non-user modes. (They will produce invalid
operation traps in user mode.)

Coprocessor 15 register 0 is read only and identifies the chip - e.g.:
  Bits 31..24: &41 - designer code for ARM Ltd.
  Bits 23..16: &56 - manufacturer code for VLSI Technology Inc.
  Bits 15..8:  &03 - identifies chip as an ARM3.
  Bits 7..0:   &00 - revision of chip.
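
The field split above can be expressed as a short Python sketch. The names `decode_cp15_id` and `is_arm3` are hypothetical, and treating "designer &41 and part &03" as the ARM3 test is an assumption made for illustration; the value itself would come from an MRC read of coprocessor 15 register 0, which is not modelled here.

```python
def decode_cp15_id(value):
    """Split a coprocessor 15 register 0 value into its four byte fields."""
    return {
        "designer": (value >> 24) & 0xFF,
        "manufacturer": (value >> 16) & 0xFF,
        "part": (value >> 8) & 0xFF,
        "revision": value & 0xFF,
    }


def is_arm3(value):
    """Assumed check: ARM Ltd. designer code and the ARM3 part number."""
    fields = decode_cp15_id(value)
    return fields["designer"] == 0x41 and fields["part"] == 0x03


example = decode_cp15_id(0x41560300)  # the example value given in the text
```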

Coprocessor 15 register 1 is simply a write-sensitive location - writing any
value to it flushes the cache.

Coprocessor 15 register 2: a miscellaneous control register.
  Bit 0 turns the cache on (if 1) or off (if 0).
  Bit 1 determines whether user mode and non-user modes use the same address
    mapping. Bit 1 is 1 if they do, 0 if they have separate address
    mappings. It should be 1 for use with MEMC.
  Bit 2 is 0 for normal operation, 1 for a special "monitor mode" in which
    the processor is always run at memory speed and all addresses and data
    are put on the external pins, even if the memory request was satisfied
    by the cache. This allows external hardware like a logic analyser to
    trace the program properly.
  Other bits are reserved for future expansion. Code which is trying to set
    the whole control register (e.g. at system initialisation time) should
    write these bits as zeros to ensure compatibility with any such future
    expansions. Code which is just trying to change one or two bits (e.g.
    turn the cache on or off) should read this register, modify the bits
    concerned and write it back: this ensures that it won't have unexpected
    side effects in the future like turning as-yet-undefined features off.
  This register is reset to all zeros when the ARM3 is reset.
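
The read-modify-write advice above can be sketched in Python. The register access is abstracted into two hypothetical callbacks, `read_reg` and `write_reg`, standing in for the MRC and MCR instructions; the constant names are likewise illustrative.

```python
CACHE_ON = 1 << 0        # bit 0: cache on/off
SHARED_MAPPING = 1 << 1  # bit 1: user and non-user modes share one mapping
MONITOR_MODE = 1 << 2    # bit 2: monitor mode


def set_cache_enabled(read_reg, write_reg, enabled):
    """Change only bit 0 of control register 2, preserving every other bit."""
    old = read_reg()
    new = (old | CACHE_ON) if enabled else (old & ~CACHE_ON)
    write_reg(new & 0xFFFFFFFF)
    return old


# Stand-in for the real register, including one as-yet-undefined bit (bit 3).
fake_register = [SHARED_MAPPING | (1 << 3)]
previous = set_cache_enabled(
    lambda: fake_register[0],
    lambda value: fake_register.__setitem__(0, value),
    True,
)
```

Only bit 0 changes, so a feature controlled by a future bit is not switched off as a side effect.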

Coprocessor 15 register 3: controls whether areas of memory are cacheable,
    in 2 megabyte chunks. All accesses to an uncacheable area of memory go
    to the real memory and not to the cache - this is a suitable setting
    e.g. for areas containing memory-mapped IO, or for doubly mapped areas
    of memory.
  Bit 0 is 1 if virtual addresses &0000000-&01FFFFF are cacheable, 0 if they
    are not.
  Bit 1 is 1 if virtual addresses &0200000-&03FFFFF are cacheable, 0 if they
    are not.
  :
  :
  Bit 31 is 1 if virtual addresses &3E00000-&3FFFFFF are cacheable, 0 if
    they are not.
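
The mapping between addresses and bits follows directly from the 2 megabyte chunk size, and can be sketched in Python as below. The function names are hypothetical; the same arithmetic applies to registers 4 and 5.

```python
CHUNK = 2 * 1024 * 1024   # 2 megabytes per bit
CHUNKS = 32               # 32 bits cover the 64 megabyte address space


def chunk_bit(address):
    """Return the register bit number that governs a virtual address."""
    if not 0 <= address < CHUNKS * CHUNK:
        raise ValueError("address outside &0000000-&3FFFFFF")
    return address // CHUNK


def chunk_range(bit):
    """Return the (first, last) virtual address governed by a bit."""
    if not 0 <= bit < CHUNKS:
        raise ValueError("bit must be 0..31")
    return bit * CHUNK, (bit + 1) * CHUNK - 1


def is_cacheable(register3, address):
    """True if the register 3 value marks this address as cacheable."""
    return bool((register3 >> chunk_bit(address)) & 1)
```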

Coprocessor 15 register 4: controls whether areas of memory are updateable,
    in 2 megabyte chunks. All write accesses to a non-updateable area of
    memory go to the real memory only, not to the cache - this is a suitable
    setting for areas of memory that contain ROMs, for instance, since you
    don't want the cached values to be altered by an attempt to write to the
    ROM. (Or, as in MEMC, by an attempt to write to write-only locations
    that share an address with the read-only ROMs.)
  Bit 0 is 1 if virtual addresses &0000000-&01FFFFF are updateable, 0 if
    they are not.
  Bit 1 is 1 if virtual addresses &0200000-&03FFFFF are updateable, 0 if
    they are not.
  :
  :
  Bit 31 is 1 if virtual addresses &3E00000-&3FFFFFF are updateable, 0 if
    they are not.

Coprocessor 15 register 5: controls whether areas of memory are disruptive,
    in 2 megabyte chunks. Any write access to a disruptive area of memory
    will cause the cache to be flushed. This is a suitable setting for areas
    of memory which if written, could cause cache contents to become invalid
    in some way. E.g. on MEMC, writing to the physically addressed memory at
    addresses &2000000-&2FFFFFF will also usually change a virtually
    addressed location's contents: if this location is in cache, a
    subsequent attempt to read it would read the old value. To avoid this
    problem, the physically addressed memory should be marked as disruptive
    in a MEMC system. Similarly, any remapping of memory on a MEMC or other
    memory controller should act disruptively, since the cache contents are
    liable to have become invalid.
  Bit 0 is 1 if virtual addresses &0000000-&01FFFFF are disruptive, 0 if
    they are not.
  Bit 1 is 1 if virtual addresses &0200000-&03FFFFF are disruptive, 0 if
    they are not.
  :
  :
  Bit 31 is 1 if virtual addresses &3E00000-&3FFFFFF are disruptive, 0 if
    they are not.

Coprocessor 15 registers 3-5 are in an undefined state after power-up: they
must be programmed correctly before the cache is turned on.

Note that you should check that the identity code in coprocessor 15 register
0 identifies the chip as an ARM3 before assuming that the other registers can
be used as stated above, unless you are absolutely certain your code can
only ever be run on an ARM3. Otherwise you are likely to run into problems
with other chips - e.g. an ARM600 uses the same coprocessor 15 registers to
control its cache and MMU, but in a completely different way. Just about the
only thing they do have in common is that coprocessor 15 register 0 contains
an identification code as described above.

David Seal
dseal@armltd.co.uk

All opinions are mine only...



From: mhardy@acorn.co.uk (Michael Hardy)
Subject: Re: Risc-OS Documentation
Date: 15 Aug 91 09:45:14 GMT
Organization: Acorn Computers Ltd, Cambridge, England


ARM3 SUPPORT
============


Introduction and Overview
=========================

The ARM3Support module provides commands to control the use of the ARM3
processor's cache, where one is fitted to a machine. The module will
immediately kill itself if you try to run it on a machine that only has an
ARM2 processor fitted.


Summary of facilities
---------------------

Two * Commands are provided: one to configure whether or not the cache is
enabled at a power-on or reset, and the other to independently turn the
cache on or off.

There is also a SWI to turn the cache on or off. A further SWI forces the
cache to be flushed. Finally, there is also a set of SWIs that control how
various areas of memory interact with the cache.

The default setup is such that all RISC OS programs should run unchanged
with the ARM3's cache enabled. Consequently, you are unlikely to need to
use the SWIs (beyond, possibly, turning the cache on or off).


Notes
-----

A few poorly-written programs may not work correctly with ARM3 processors, 
because they make assumptions about processor timing or clock rates.


Finding out more
----------------

For more details of the ARM3 processor, see the Acorn RISC Machine family
Data Manual. VLSI Technology Inc. (1990) Prentice-Hall, Englewood Cliffs,
NJ, USA: ISBN 0-13-781618-9.





SWI Calls
=========



Cache_Control (SWI &280)
========================

Turns the cache on or off


On entry
--------
R0 = EOR mask
R1 = AND mask


On exit
-------
R0 = old state (0 => cacheing was disabled, 1 => cacheing was enabled)


Interrupts
----------
Interrupts are disabled
Fast interrupts are enabled


Processor mode
--------------
Processor is in SVC mode


Re-entrancy
-----------
Not defined


Use
---
This call turns the cache on or off. Bit 0 of the ARM3's control register 2
is altered by being masked with R1 and then exclusive ORd with R0: ie new
value = ((old value AND R1) XOR R0). Bit 1 of the control register is also
set, forcing the memory controller to use the same translation table for
both User and Supervisor Modes (as indeed the MEMC chip should). Other bits
of the control register are set to zero.


Related SWIs
------------
None


Related vectors
---------------
None



Cache_Cacheable (SWI &281)
==========================

Controls which areas of memory may be cached


On entry
--------
R0 = EOR mask
R1 = AND mask


On exit
-------
R0 = old value (bit n set => 2MBytes starting at n*2MBytes are cacheable)


Interrupts
----------
Interrupts are disabled
Fast interrupts are enabled


Processor mode
--------------
Processor is in SVC mode


Re-entrancy
-----------
Not defined


Use
---
This call controls which areas of memory may be cached (ie are cacheable).
The ARM3's control register 3 is altered by being masked with R1 and then
exclusive ORd with R0: ie new value = ((old value AND R1) XOR R0). If bit n
of the control register is set, the 2MBytes starting at n*2MBytes are
cacheable.
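
The AND-then-EOR rule is easy to get backwards, so here is a minimal Python sketch of it. It models only the stated formula, new value = ((old value AND R1) XOR R0); the helper names `apply_masks` and `masks_to_change` are hypothetical and nothing here calls the real SWI.

```python
WORD = 0xFFFFFFFF


def apply_masks(old, eor_mask, and_mask):
    """new value = ((old value AND R1) XOR R0), truncated to 32 bits."""
    return ((old & and_mask) ^ eor_mask) & WORD


def masks_to_change(set_bits=0, clear_bits=0):
    """Return (R0, R1) that force set_bits to 1 and clear_bits to 0, leaving the rest."""
    if set_bits & clear_bits:
        raise ValueError("a bit cannot be both set and cleared")
    and_mask = WORD & ~(set_bits | clear_bits)  # knock out every bit being changed
    eor_mask = set_bits & WORD                  # then flip the wanted ones back on
    return eor_mask, and_mask


# Reading without changing anything: R0 = 0, R1 = &FFFFFFFF.
unchanged = apply_masks(0xFC007FFF, 0, WORD)

# Toggling a bit: leave R1 as all ones and put the bit in R0.
toggled = apply_masks(0x00000001, 0x00000001, WORD)
```

Passing R0 = 0 and R1 = &FFFFFFFF is therefore a way to read the old value back in R0 without altering the register.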

The default value stored is &FC007FFF, so ROM, the RAM disc and logical
non-screen RAM are cacheable, but I/O space, physical memory and logical
screen memory are not.

(You may find a value of &FC007CFF - which disables cacheing the RAM disc -
gives better performance.)


Related SWIs
------------
Cache_Updateable (SWI &282), Cache_Disruptive (SWI &283)


Related vectors
---------------
None



Cache_Updateable (SWI &282)
===========================

Controls which areas of memory will be automatically updated in the cache


On entry
--------
R0 = EOR mask
R1 = AND mask


On exit
-------
R0 = old value (bit n set => 2MBytes starting at n*2MBytes are updateable)


Interrupts
----------
Interrupts are disabled
Fast interrupts are enabled


Processor mode
--------------
Processor is in SVC mode


Re-entrancy
-----------
Not defined


Use
---
This call controls which areas of memory will be automatically updated in
the cache when the processor writes to that area (ie are updateable). The
ARM3's control register 4 is altered by being masked with R1 and then
exclusive ORd with R0: ie new value = ((old value AND R1) XOR R0). If bit n
of the control register is set, the 2MBytes starting at n*2MBytes are
updateable.


The default value stored is &00007FFF, so logical non-screen RAM is
updateable, but ROM/CAM/DAG, I/O space, physical memory and logical screen
memory are not.


Related SWIs
------------
Cache_Cacheable (SWI &281), Cache_Disruptive (SWI &283)


Related vectors
---------------
None



Cache_Disruptive (SWI &283)
===========================

Controls which areas of memory cause automatic flushing of the cache on a
write


On entry
--------
R0 = EOR mask
R1 = AND mask


On exit
-------
R0 = old value (bit n set => 2MBytes starting at n*2MBytes are disruptive)


Interrupts
----------
Interrupts are disabled
Fast interrupts are enabled


Processor mode
--------------
Processor is in SVC mode


Re-entrancy
-----------
Not defined


Use
---
This call controls which areas of memory cause automatic flushing of the
cache when the processor writes to that area (ie are disruptive). The
ARM3's control register 5 is altered by being masked with R1 and then
exclusive ORd with R0: ie new value = ((old value AND R1) XOR R0). If bit n
of the control register is set, the 2MBytes starting at n*2MBytes are
disruptive.

The default value stored is &F0000000, so the CAM map is disruptive, but
ROM/DAG, I/O space, physical memory and logical memory are not. This causes
automatic flushing whenever MEMC's page mapping is altered, which allows
programs written for the ARM2 (including RISC OS itself) to run unaltered,
but at the expense of unnecessary flushing on page swaps.


Related SWIs
------------
Cache_Cacheable (SWI &281), Cache_Updateable (SWI &282)


Related vectors
---------------
None



Cache_Flush (SWI &284)
======================

Flushes the cache


On entry
--------
-


On exit
-------
-


Interrupts
----------
Interrupts are disabled
Fast interrupts are enabled


Processor mode
--------------
Processor is in SVC mode


Re-entrancy
-----------
Not defined


Use
---
This call flushes the cache by writing to the ARM3's control register 1.


Related SWIs
------------
None


Related vectors
---------------
None





* Commands
==========



*Cache
======

Turns the cache on or off, or gives the cache's current state


Syntax
------
*Cache [On|Off]


Parameters
----------
On or Off


Use
---
*Cache turns the cache on or off. With no parameter, it gives the cache's
current state.


Example
-------
*Cache Off


Related commands
----------------
*Configure Cache


Related SWIs
------------
Cache_Control (SWI &280)


Related vectors
---------------
None



*Configure Cache
================

Sets the configured cache state to be on or off


Syntax
------
*Configure Cache On|Off


Parameters
----------
On or Off


Use
---
*Configure Cache sets the configured cache state to be on or off.


Example
-------
*Configure Cache On


Related commands
----------------
*Cache


Related SWIs
------------
Cache_Control (SWI &280)


Related vectors
---------------
None

******************************************************************************

I hope this helps.

- Michael J Hardy           Email:      mhardy@acorn.co.uk

  Acorn Computers Ltd       Telephone:  +44 223 214411
  Cambridge TechnoPark      Fax:        +44 223 214382
  645 Newmarket Road        Telex:      81152 ACNNMR G
  Cambridge CB5 8PB
  England                   Disclaimer: All opinions are my own, not Acorn's



From: osmith@acorn.co.uk (Owen Smith)
Subject: Re: Risc-OS Documentation
Date: 13 Aug 91 15:06:19 GMT

The ARM3 SWIs really aren't all that interesting, and I've just totally
failed to find a documentation file for them. However, as a tester, here
is a bit of BASIC (courtesy of Brian Brunswick) which marks the RAM disk
area as not cacheable. This in fact makes it go faster.

SYS "Cache_Cacheable", 0, &fffffcff
SYS "Cache_Updateable", 0, &fffffcff
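
What those two calls do to the registers can be worked through in Python. This is a sketch of the arithmetic only, assuming the registers hold the documented default values (&FC007FFF and &00007FFF) beforehand; the helper names are hypothetical and no SWI is actually called.

```python
WORD = 0xFFFFFFFF
CHUNK = 2 * 1024 * 1024  # each register bit covers 2 megabytes


def apply_masks(old, eor_mask, and_mask):
    """new value = ((old value AND R1) XOR R0), as the SWIs are documented."""
    return ((old & and_mask) ^ eor_mask) & WORD


def cleared_chunks(before, after):
    """Address ranges whose bit was 1 in `before` and is 0 in `after`."""
    gone = before & ~after
    return [
        (bit * CHUNK, (bit + 1) * CHUNK - 1)
        for bit in range(32)
        if (gone >> bit) & 1
    ]


# SYS "Cache_Cacheable", 0, &fffffcff  (R0 = 0, R1 = &FFFFFCFF)
new_cacheable = apply_masks(0xFC007FFF, 0, 0xFFFFFCFF)
# SYS "Cache_Updateable", 0, &fffffcff
new_updateable = apply_masks(0x00007FFF, 0, 0xFFFFFCFF)

affected = cleared_chunks(0xFC007FFF, new_cacheable)
```

With those starting values the calls clear bits 8 and 9 and leave everything else alone, which covers virtual addresses &1000000-&13FFFFF.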

The reason it goes faster is that because such large amounts of data are
being slurped around, the memory copy loop tends to get flushed out of
the cache, particularly since it is a long piece of loop unrolled code
(for speed on an ARM2). So you end up with a cache full of data, very little
of which is ever accessed again before it gets flushed out of the cache by
some more data. The loop does an LDM and STM 10 registers at a time in
RamFS, so in theory there are two words that get cached (ARM3 reads 4 words
at a time), but this saving is swallowed up by the cache synchronisation
delays.

You have to be careful though. Brian has his own re-sizing ram disk
which uses the system sprite area. Marking the system sprite area as not
cacheable makes it go slower. We (Brian and I) think this is because he
uses the C function memcpy(), in which the LDM and STM is 4 registers
at a time. Since this is a multiple of four, it hits the ARM bug where
it loads 5 words and then throws the fifth one away, which results in
loading 8 words on an ARM3 (it always reads 4 word chunks even with the
cache off). So with the cache off, you load 8 then throw 4 away, load the
next 8 (including the 4 you just threw away) and throw 4 away etc. So
you are effectively reading all the data twice. With the cache on this
goes down to once. Yes the code will probably get flushed out, but it
is a tight loop (not unrolled) so it is not very likely and the cost of
reloading the code is less than the saving on the data loads.

The moral of this is to be careful with the ARM3 SWIs: don't just assume
that it ought to go faster; do timings, in lots of different screen
modes.

Owen.
