Journal of «Almaz – Antey» Air and Space Defence Corporation

Network-on-a-chip of a new microprocessor generation with the Elbrus architecture

https://doi.org/10.38013/2542-0542-2021-1-103-109

Abstract

The paper describes the network-on-a-chip of a new microprocessor generation with the Elbrus architecture, taking into account the constraints of physical design. The network-on-a-chip plays a central role in scaling the microprocessor: it interconnects all the main components of the system and carries all types of packets between devices. Its characteristics determine the bandwidth and access time of the memory subsystem.

For citations:


Kozhin E.S., Kozhin A.S. Network-on-a-chip of a new microprocessor generation with the Elbrus architecture. Journal of «Almaz – Antey» Air and Space Defence Corporation. 2021;(1):103-109. https://doi.org/10.38013/2542-0542-2021-1-103-109

The current trend in the development of general-purpose high-performance processors is to increase the number of processor cores on a single chip. Although some developers now mount several small-area chips on a common substrate [1], most continue to design monolithic multicore chips containing 28 or more processor cores [2]. In addition to processor cores, a chip carries a growing number of last-level cache memory banks, random-access memory controllers (6–8 controllers per processor) and I/O devices. Thus, a modern high-performance processor is a complex distributed system comprising tens of devices, whose interconnections must meet requirements for latency and data transmission capacity.

In the previous generation of microprocessors with the Elbrus architecture [3], a chip carries 8 processor cores, 8 banks of shared level 3 cache memory, 4 random-access memory controllers and a set of I/O devices. Two switching environments connected the components: a bidirectional ring bus [4] linking the processor cores and the level 3 cache memory, and a centralized switch connecting the level 3 cache memory to the memory access devices and I/O devices [5].

For the new family of high-performance Elbrus processors, the task was to increase the number of processor cores by up to a factor of two, from 8 to 12–16. Balancing the system required a larger level 3 cache and a higher data transmission capacity for access to RAM and I/O. The number of devices in the new microprocessor was thus doubled, so the system of their interconnections had to be modified. As the key solution, a distributed network-on-a-chip was selected, in which subscribers are connected through network adapters to a common distributed switching environment of arbitrary topology [6]. This approach makes it possible to connect previously developed units, such as processor cores, cache memory banks, and memory and I/O access controllers, into a common system by changing only the network parameters and adapters.

Building a network-on-a-chip requires solving a range of problems, from developing a network protocol and a networking principle to designing individual switches. This paper addresses the selection of a network topology and a scheme of network node interconnections, and the development of network adapters for devices.

Network-on-a-chip structure

For the new generation of microprocessors with the Elbrus architecture, a tiled design was selected. With this approach, a core and the nearest level 3 cache memory bank are connected to a network switch through network adapters, as shown in Figure 1.


Fig. 1. Tile structure

The core, memory bank and switch of a tile share the same number. In such a structure, called the tile structure, only the network switch ports that connect tiles to the common network can serve as external interfaces.

All data exchange between devices in the network-on-a-chip of the 6th generation microprocessors with the Elbrus architecture is carried out via messages in accordance with a proprietary system protocol. There are five types of messages: initial requests, coherent snoop requests, replies with data, replies without data, and acknowledgements. Each message is transmitted as a packet of a particular size. Separate physical and virtual channels carry packets of different types between network nodes.

Each message contains information that identifies the exact type and number of the destination device. Based on this information, the network adapter determines the destination network address when the packet enters the network. The destination address depends on the routing table and the network topology. Once computed, the address is included in the packet and used at every router along the way. Some messages, such as coherent snoop requests and replies with data, may have several destination addresses. In this case, the message must be delivered along the routes that correspond to all of its destination addresses. To save network capacity, a message with several destination addresses travels through the network as a single packet and is split at a network node only where the delivery routes diverge. Once a packet reaches the tile with the required destination address, it is passed to the device through the network adapter. Since a tile contains 2 subscribers, the packet destination address carries information that unambiguously identifies the destination device within the tile.
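The splitting of a multi-destination packet can be illustrated with a minimal sketch. It is not the Elbrus protocol: the grid coordinates, the X-first next-hop rule and the names `next_hop_xy`, `multicast` and `unicast_links` are assumptions made only for illustration. The packet stays a single copy while all of its destinations share the next hop, and is duplicated only where the routes diverge.

```python
from collections import defaultdict

def next_hop_xy(node, dest):
    """Assumed routing rule: correct the X coordinate first, then Y."""
    x, y = node
    dx, dy = dest
    if x != dx:
        return (x + (1 if dx > x else -1), y)
    return (x, y + (1 if dy > y else -1))

def multicast(src, dests, next_hop=next_hop_xy):
    """Deliver one packet to several destinations.

    Returns (set of nodes that received the packet, number of link traversals).
    """
    delivered, links = set(), 0
    pending = [(src, frozenset(dests))]
    while pending:
        node, targets = pending.pop()
        if node in targets:
            delivered.add(node)          # hand the packet to the local device
            targets = targets - {node}
        groups = defaultdict(set)        # destinations grouped by shared next hop
        for dest in targets:
            groups[next_hop(node, dest)].add(dest)
        for hop, group in groups.items():
            links += 1                   # one copy per distinct next hop
            pending.append((hop, frozenset(group)))
    return delivered, links

def unicast_links(src, dests):
    """Link traversals if every destination received its own packet."""
    return sum(abs(src[0] - d[0]) + abs(src[1] - d[1]) for d in dests)

dests = [(1, 3), (1, 5), (0, 2)]
print(multicast((0, 0), dests)[1])   # 8 link traversals with splitting
print(unicast_links((0, 0), dests))  # 12 link traversals with separate packets
```

In this hypothetical example the two destinations in column 1 share the route up to node (1, 3), so the packet is copied only once at the source and the total link load drops from 12 to 8 traversals.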

Topology and addressing

A large number of network topologies exist, but unlike computer networks, where devices can be interconnected in any topology, networks-on-a-chip are subject to additional physical constraints. The main constraints are the planar placement of units and the impossibility of using diagonal wires. In addition, routing communication wires over processor units degrades the timing characteristics of those units.

Networks-on-a-chip with tens of nodes are mostly built with the 2d mesh topology. Figure 2a shows an example of such a network with 16 tiles. Every tile is assigned Cartesian coordinates, from 0 to 1 along the X-axis and from 0 to 7 along the Y-axis. The network address is a combination of the X- and Y-coordinates plus the number of the network adapter at the destination node. Such a topology allows very simple X-Y or Y-X routing without packet deadlock in the network. With X-Y routing, a packet first travels along the X-axis until it reaches the X-coordinate of the destination node, and then along the Y-axis. The 2d mesh topology scales well up to 256 nodes.
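X-Y routing on the 2×8 mesh described above can be sketched as follows. This is an illustrative model, not the switch logic of the processor: the `Address` tuple, the meaning of the `adapter` field and the function name `route_xy` are assumptions.

```python
from typing import NamedTuple

class Address(NamedTuple):
    x: int        # 0..1 in the 2x8 example
    y: int        # 0..7 in the 2x8 example
    adapter: int  # subscriber within the tile (numbering is an assumption)

def route_xy(src: Address, dst: Address) -> list[tuple[int, int]]:
    """Return the switches visited by a packet: first along X, then along Y."""
    x, y = src.x, src.y
    path = [(x, y)]
    while x != dst.x:                 # phase 1: reach the destination column
        x += 1 if dst.x > x else -1
        path.append((x, y))
    while y != dst.y:                 # phase 2: reach the destination row
        y += 1 if dst.y > y else -1
        path.append((x, y))
    return path

path = route_xy(Address(0, 1, 0), Address(1, 4, 1))
print(path)           # [(0, 1), (1, 1), (1, 2), (1, 3), (1, 4)]
print(len(path) - 1)  # 4 hops
```

Because a packet never turns from the Y direction back to the X direction, the channels cannot form a cyclic dependency, which is why this routing order avoids deadlock.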

If the 2d mesh topology is selected, it is important that the number of nodes be the same along both coordinates; otherwise the system may be unbalanced. Network capacity is estimated with the bisectional width parameter: the minimum number of links that have to be cut to split the network into two parts with an equal number of nodes, that is, the number of links at the network bottleneck [6].

As an example, let us analyse a system with 16 processor cores. A 4×4 square is a balanced system with a bisectional width of 4. The 2d mesh 2×8 topology, shown in Figure 2a, is unbalanced, with a bisectional width of 2. In other words, under random packet traffic the network capacity of the 2d mesh 2×8 topology is half that of the 2d mesh 4×4 topology.

In practice, a balanced 2d mesh topology cannot always be used. At the design stage of the 6th generation processors with the Elbrus architecture, the processor cores had to be arranged in two columns. This layout ruled out the 2d mesh 4×4 topology, while the 2d mesh 2×8 topology was ineffective because of its low bisectional width.

The bisectional width of a network can be increased by adding extra links. The 2d mesh 2×8 topology has 8 horizontal links, which gives a high packet transmission capacity along the X-axis, but only 2 vertical links at each Y-coordinate, which limits the network capacity. Therefore, vertical links need to be added to increase the bisectional width. To solve the problem, we selected the 2d torus-mesh topology, which, unlike the 2d mesh 2×8 topology, has 6 vertical links. Its structure is shown in Figure 2b. This topology is easier to analyse in three dimensions, where it looks like a 3d mesh 2×2×4. In this representation, the bisectional width is 4, the same as for the 2d mesh 4×4 topology. The network address is a combination of the X-, Y- and Z-coordinates and the number of the network adapter at the destination node. The simplest routing option is static X-Y-Z routing, which avoids packet deadlock within the network.
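The bisectional width values quoted for the three topologies can be checked by brute force. The sketch below treats each topology as a plain mesh graph and takes the 3d mesh 2×2×4 view of the torus-mesh from the text as given; it does not reproduce the exact wiring of Figure 2b. The function names `mesh_links` and `bisection_width` are illustrative.

```python
from itertools import combinations, product

def mesh_links(dims):
    """Links of an n-dimensional mesh: neighbours differ by 1 in one coordinate."""
    links = []
    for node in product(*(range(size) for size in dims)):
        for axis, size in enumerate(dims):
            if node[axis] + 1 < size:
                neighbour = node[:axis] + (node[axis] + 1,) + node[axis + 1:]
                links.append((node, neighbour))
    return links

def bisection_width(dims):
    """Minimum number of links cut by any split into two equal halves.

    Exhaustive search; only practical for small networks (16 nodes here).
    """
    nodes = list(product(*(range(size) for size in dims)))
    links = mesh_links(dims)
    first, rest = nodes[0], nodes[1:]
    half = len(nodes) // 2
    best = len(links)
    # Fixing the first node in one half avoids counting each split twice.
    for others in combinations(rest, half - 1):
        part = {first, *others}
        cut = sum((a in part) != (b in part) for a, b in links)
        best = min(best, cut)
    return best

print(bisection_width((4, 4)))     # 4
print(bisection_width((2, 8)))     # 2
print(bisection_width((2, 2, 4)))  # 4
```

The exhaustive search examines 6435 balanced splits per 16-node network, so it runs quickly and confirms that the three-dimensional arrangement recovers the bisectional width of the square mesh.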


Fig. 2. Network-on-a-chip topology diagrams: a – network with 2d mesh topology; b – network with 2d torus-mesh topology

At the network-on-a-chip design stage, we conducted a comparative analysis of various network-on-a-chip topologies for the 6th generation processors with the Elbrus architecture [7]. Topologies for 8-, 12- and 16-core processors were analysed. The characteristics compared were the network packet delivery latency and the attainable network capacity. According to the study, the 2d torus-mesh topology gives no advantage over the 2d mesh topology for 8 processor cores. For 12 processor cores, the 2d torus-mesh topology increases the network capacity from 55 % (6.66 packets per clock cycle for the whole system) to 92 % (10.99 packets per clock cycle for the whole system) of the maximum possible network capacity of 12 packets per clock cycle. For 16 processor cores, the 2d torus-mesh topology increases the network capacity almost 2 times, from 38 % (6.15 packets per clock cycle for the whole system) to 70 % (11.04 packets per clock cycle for the whole system) of the maximum possible network capacity of 16 packets per clock cycle.
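The percentages above follow from dividing the measured throughput by the maximum capacity, which the text gives as one packet per clock cycle per core. The short calculation below only re-derives them from the quoted figures; the names `MEASURED`, `share_of_maximum` and `torus_gain` are illustrative. The computed shares agree with the rounded values in the text to within about one percentage point.

```python
# Throughput figures quoted in the text, in packets per clock cycle (whole system).
MEASURED = {
    12: {"2d mesh": 6.66, "2d torus-mesh": 10.99},
    16: {"2d mesh": 6.15, "2d torus-mesh": 11.04},
}

def share_of_maximum(cores, packets_per_cycle):
    """Percentage of the maximum capacity (one packet per core per cycle)."""
    return 100.0 * packets_per_cycle / cores

def torus_gain(cores):
    """Ratio of torus-mesh throughput to mesh throughput for a core count."""
    row = MEASURED[cores]
    return row["2d torus-mesh"] / row["2d mesh"]

for cores, row in MEASURED.items():
    for topology, rate in row.items():
        share = share_of_maximum(cores, rate)
        print(f"{cores} cores, {topology}: {share:.1f} % of maximum")
    print(f"{cores} cores: torus-mesh / mesh = {torus_gain(cores):.2f}")
```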

While the 2d torus-mesh topology gives obvious advantages for a 16-core processor with cores arranged in 2 rows, for a 12-core processor the capacity gain has to be weighed against the cost of a more complex topology. With the 2d torus-mesh topology, each network switch must have four network ports, whereas three network ports are enough for the 2d mesh 2×6 topology, which reduces hardware complexity and power consumption. The choice of topology for a 12-core processor therefore depends mainly on the power budget and the required performance of the network-on-a-chip.

Network adapters and connection of devices

The set of network adapters is determined by the number of different types of devices to be connected. The 6th generation processors with the Elbrus architecture have 4 types of network subscribers: processor cores, level 3 cache memory banks, RAM access devices, and devices for access to I/O and interprocessor links. Level 3 cache memory banks and RAM access devices operate at the same frequency as the network-on-a-chip. Processor cores and the device for access to I/O and interprocessor links operate at independent frequencies and are not synchronized with the network-on-a-chip. Connecting such asynchronous devices requires a synchronizer between the network-on-a-chip frequency and the device frequency [9]. The network adapter for such a device is located between the network-on-a-chip and the synchronizer and operates at the network-on-a-chip frequency.

In general, a network adapter must transmit all types of packets to and from the connected device. However, the protocol rules out some packet transmissions; for example, an initial request cannot be sent to a processor core. Network subscribers therefore have only the required set of interfaces, and the network adapter matches this set to the network interfaces. As a packet passes from a device to a network adapter, it acquires a network destination address that depends on the type of packet. For initial requests, the network address is generated from the type of operation and the request address. Coherent snoop requests contain a bit vector of destination devices, which is generated by the packet source. Reply packets with data can be transmitted to two devices simultaneously; the number of recipients and their network addresses are determined by the extension of the operation code and by two unique identifiers carried within the packet. Short packets are always transmitted to a single device, whose address is determined by a unique packet identifier.

Network adapters for processor cores and level 3 cache memory banks are built into every switch. Network adapters for the memory access devices and for the device for access to I/O and interprocessor links are located in those devices themselves. To connect the memory access devices, the 16-core processor uses free ports of switches 0–3 and 12–15 to ensure the maximum access rate, as shown in Figure 3. The I/O access device is connected through auxiliary switches (COM) added at the network cut between coordinates 1 and 2 along the Z-axis.


Fig. 3. Network-on-a-chip topology diagram for microprocessor with 16 processor cores

Conclusion

The paper proposes a principle for building networks-on-a-chip for the 6th generation processors with the Elbrus architecture that takes into account physical constraints at the design stage. Network adapters have been developed for all applicable processor devices. Networks-on-a-chip with the 2d mesh and 2d torus-mesh topologies have been built for 8-, 12- and 16-core processors, and a comparative analysis of them has been conducted. According to the results obtained, the best solution for a 16-core processor is the 2d torus-mesh topology.

References

1. Suggs D., Subramony M., Bouvier D. The AMD “Zen 2” Processor // IEEE Micro. 2020. V. 40. № 2. P. 45–52.

2. Tam S. M. et al. Skylake-SP: A 14nm 28-core Xeon® processor // 2018 IEEE International Solid-State Circuits Conference (ISSCC). IEEE, 2018. P. 34–36.

3. Kostenko V. O. et al. Elbrus-8C: The Latest Yield from MCST and MIPT Collaboration // 2015 International Conference on Engineering and Telecommunication (EnT). IEEE, 2015. P. 67–68.

4. Kozhin A. S., Sakhin Yu. Kh. Switching of connections between processor cores and the shared level 3 cache of the “Elbrus-4C+” microprocessor // Voprosy radioelektroniki. 2013. No. 3. P. 5–14. (In Russian)

5. Alfonso D. M., Demenko R. V., Kozhin A. S. et al. Microarchitecture of the eight-core general-purpose microprocessor “Elbrus-8C” // Voprosy radioelektroniki. 2016. No. 3. P. 6–13. (In Russian)

6. Jerger N. E., Peh L.-S. On-chip networks // Synthesis Lectures on Computer Architecture. 2009. V. 4. No. 1. 141 p.

7. Kozhin A. S., Neiman-zade M. I., Tikhorsky V. V. Influence of the memory subsystem of the eight-core microprocessor “Elbrus-8C” on its performance // Voprosy radioelektroniki. 2017. No. 3. Ser. EVT. P. 13–21. (In Russian)

8. Kozhin A. S., Kozhin E. S., Shpagilev D. I. Study of network-on-a-chip topologies for multicore processors with the Elbrus architecture // Elektronika: NTB. 2020. No. 7. (In Russian)

9. Kozhin A. S. Problems of data transfer between asynchronous domains of a computing device // Voprosy radioelektroniki. 2011. No. 3. Ser. EVT. P. 130–141. (In Russian)


About the Authors

E. S. Kozhin
MCST JSC
Russian Federation

Kozhin Evgeny Sergeevich – Lead Engineer. Research interests: microprocessors and computing systems.

Moscow, Russian Federation



A. S. Kozhin
MCST JSC
Russian Federation

Kozhin Alexey Sergeevich – Sectoral Head. Research interests: microprocessors and computing systems.

Moscow, Russian Federation



Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2542-0542 (Print)