Implementing a flexible network stack

Lasse Bang Dalegaard

Kongens Lyngby 2013 IMM-B.Sc.-2013-0028


Technical University of Denmark
Informatics and Mathematical Modelling
Building 321, DK-2800 Kongens Lyngby, Denmark
Phone +45 45253351, Fax +45 45882673
reception@imm.dtu.dk
www.imm.dtu.dk
IMM-B.Sc.-2013-0028


Summary (English)

The diverse needs of modern networking technology have spawned flexible, high-performance packet processing engines like the network stacks in Linux and FreeBSD or the Click Modular Router. All of these provide high-performance infrastructures that can be used to create complex networked systems. The large footprint of these systems, however, limits their applicability to relatively capable systems.

In this project, we design and implement Trokis, a minimal set of components that together can be used to create any network stack, ranging in scope from small embedded systems to large general purpose systems, like servers and workstations. We also implement several protocols on top of our framework, to showcase the modularity of our approach.


Summary (Danish)

The diverse requirements of modern networking technology have spawned high-performance systems such as the network stacks in Linux and FreeBSD and the modular routing platform Click. Each of these provides a platform for building complex networked systems, but their size makes them unsuitable for systems with limited resources, e.g. embedded systems.

In this project we have designed and implemented Trokis, a minimal set of components that together can be used to create extremely flexible network stacks, both for embedded systems and for general-purpose systems such as servers or workstations. To demonstrate the modularity of our approach, we furthermore design and implement a set of protocols on top of Trokis.


Preface

This report was prepared at the Department of Informatics and Mathematical Modelling at the Technical University of Denmark, in partial fulfillment of the requirements of the degree of Bachelor of Science in Electrical Engineering.

This project focuses on topics in computer science; some experience with object-oriented programming, along with basic knowledge of networking systems, is assumed on the part of the reader.

2013-07-31 Lyngby

Lasse Bang Dalegaard


Acknowledgements

I would like to thank my supervisor Sven Karlsson for input during the project.


Contents

Summary (English)

Summary (Danish)

Preface

Acknowledgements

1 Introduction

2 Networks and protocols
  2.1 Prevailing technologies
  2.2 Packet network fundamentals
  2.3 Protocol stack architecture
    2.3.1 Concepts of layering
    2.3.2 OSI layering model
    2.3.3 TCP/IP model
    2.3.4 IEEE 802® layers
    2.3.5 End-to-end principle
  2.4 Modern network card architecture
    2.4.1 Ingress path
    2.4.2 Egress path

3 A basic framework for protocol implementations
  3.1 Prototype implementation basics
  3.2 Packet payload management
  3.3 Processing elements
  3.4 Handling concurrency
  3.5 Safe memory reclamation

4 Protocols
  4.1 Generic dispatch block
  4.2 IEEE 802.3 implementation
  4.3 IPv4 implementation
  4.4 IPv6 implementation
  4.5 Interfacing between external systems

5 Testing
  5.1 Functional testing of implementation
  5.2 Performance testing

6 Relation to existing solutions
  6.1 Linux and FreeBSD network stacks
  6.2 The Click Modular Router
  6.3 Lightweight ANA

7 Conclusion

8 Future work

A Test code and data
  A.1 sweeper.sh
  A.2 bootstrap_mean.r
  A.3 Performance benchmark initial data set
  A.4 Performance benchmark optimized data set

B Trokis test implementation

C Trokis test implementation with extra optimizations

Bibliography


Chapter 1

Introduction

In recent years, the internet has seen extremely rapid growth in terms of the number of connected devices. One paper by Cisco puts the total number of connected devices at more than 12 billion[1]. Each of these devices runs an entire protocol stack, allowing it to communicate with the outside world. Traditionally, protocol stacks have been built directly into operating system kernels, allowing relatively high performance. These stacks are however very advanced, internally tightly coupled, and allow limited modularity. In some cases, a general purpose stack, like the one implemented in Linux, is simply too heavyweight for a small microprocessor. Recent research in the field has yielded systems like Click[2] and Lightweight ANA[3]. These systems are extremely flexible, allowing high performance and great modularity, but are also quite heavyweight for smaller systems.

The goal of this project is to implement a lightweight, flexible userspace network protocol stack that can be adapted to a specific application with little or no change to individual components. The stack should be flexible enough to allow it to run completely in user mode on any given operating system, ranging from high performance servers to embedded devices. We emphasize that the stack should run in user mode, as this allows us to avoid many of the problems that plague kernel code and therefore most modern network stacks. It also gives us the opportunity to exploit modern operating system features like access to virtual memory and a complete standard library implementation, along with the newest advances in programming language technology. This approach allows a highly modular protocol stack architecture.

To demonstrate the flexibility gained with our approach, we present a relatively simple implementation of the IPv4 and IPv6 protocols and select transport layer protocols. The goal of this implementation is not to create a complete implementation of either of these internetworking protocols, but instead to focus on how the different components are connected, and what common infrastructure is required for them to work together. We will also show how this flexible scheme can be tied together with modern multi-queue enabled networking devices.

Our stack is implemented in C++ using the newest features from C++11. C++ was picked because it is a systems programming language that focuses on high performance and provides excellent abstractions for integration with low-level systems. It is supported on embedded and general-purpose platforms alike and is widely used.

During the course of this project, we have assumed x86_64/AMD64 as the processor architecture. When running on embedded platforms, this will of course not be a perfect approximation. While our stack tries to maintain portability where possible, several (mainly performance related) aspects will always be specific to a given processor.

We’ve chosen to name our stack Trokis, a loose transliteration of the Ancient Greek word τρόχις meaning courier or messenger[4].


Chapter 2

Networks and protocols

In this chapter we analyze the requirements of a modern ethernet-based network stack.

2.1 Prevailing technologies

The International Telecommunication Union estimates that 2.7 billion people will be using the internet in 2013[5], corresponding to about 39% of the world population. Underlying the global Internet is the TCP/IP protocol suite. The IP protocol, the basis of the TCP/IP protocol suite, allows global addressing and is suitable for internet-scale applications. It is able to run over any type of packet switched network. The most widely deployed version of the IP protocol, IPv4, uses a 32-bit address as a global identifier, while a newer version, IPv6, uses 128-bit addressing with a new allocation scheme in order to ensure better scalability for many years to come[6].

While IP is able to run over any packet switched network, at end hosts (i.e. servers, desktops and laptops) it is often transported over either wired ethernet (IEEE 802.3), Wi-Fi (IEEE 802.11) or WiMAX (IEEE 802.16) technologies. All of these technologies are IEEE 802® standards and therefore follow the IEEE 802® model. A protocol stack implementing support for all three would therefore be able to interoperate with a large number of available devices.

Cellular systems generally do not use the IEEE 802® family of standards for transporting IP, instead utilizing technologies like SNDCP (Sub Network Dependent Convergence Protocol) for GPRS (General Packet Radio Service, part of the Global System for Mobile Communications, GSM)[7] or PDCP (Packet Data Convergence Protocol) for UMTS (Universal Mobile Telecommunications System, a third generation (3G) cellular technology)[8].

2.2 Packet network fundamentals

When transporting data over a communication channel, two modes of communication are generally applicable: packet-switched mode and circuit-switched mode. In a circuit-switched system, a dedicated communication channel is established between nodes, thus the claimed resources are dedicated to this channel alone. This guarantees the communication channel all the allocated bandwidth, without the possibility of others interfering. This scheme is used in e.g. the Public Switched Telephone Network (PSTN)[9]. In the PSTN system, channels were originally set up physically by mechanical equipment moving wires around, but in modern times this is handled digitally by setting up a 64 kbps digital virtual circuit, typically over Synchronous Digital Hierarchy (SDH)[9].

In contrast to circuit switched networks are packet switched networks. In a packet switched network, data streams are split into multiple packets which are transported over the network as individual units and then processed at the end node, where they are combined or interpreted in some way. Packets can be either fixed-size (i.e. cell-based, e.g. Asynchronous Transfer Mode (ATM)) or variable-size (e.g. all IEEE 802® standards), depending on the underlying technology used. Furthermore, packet mode can be either connectionless (e.g. Ethernet) or connection-oriented (e.g. Frame Relay)[9]. The connection-oriented mode is also known as virtual-circuit switching.

2.3 Protocol stack architecture

Layering is a heavily used architectural viewpoint in the networking field. Here we will discuss some of the prevailing architectural viewpoints and how the technologies discussed above fit into this picture.


2.3.1 Concepts of layering

It is generally recognized that a large part of the success of the internet is to be found in the layered architecture popularized by especially the TCP/IP suite[10]. The idea of layering is to introduce a framework that describes how different protocols work together in order to provide a specific functionality for the application running on top of them. The following are examples of layer functionalities, taken from the IETF model, the model underlying the TCP/IP suite[11]:

Error Control This functionality provides reliability for the communication protocol. An example could be the retransmission scheme of TCP.

Flow Control Control of channel data rate, for example to avoid congestion.

Fragmentation Breaks large chunks of data into smaller pieces. Examples include IP fragmentation and TCP segmentation.

Multiplexing Several higher level connections multiplexed over a single lower level connection.

Connection Setup Peer handshake, for example the TCP three-way handshake or the SCTP four-way handshake.

Addressing/Naming Management of host and entity identifiers.

Any arbitrary functionality besides the above could of course be provided by a layer. For example, IPsec[12] provides authentication and/or confidentiality as well as the normal functionality of the IP protocol.

The idea in layering is that a protocol at a given layer will only provide services to protocols from higher layers, and will only use services from lower layers.

The benefit of this approach is clear: it allows us to "stack" protocols on top of each other, combining the functionalities from several protocols as required.

An example is the use of a TCP connection over IP over ethernet. While there are more subtleties to this example, we can think of the "resulting protocol" as having the properties of all these protocols together:

TCP Provides error control, flow control, fragmentation and connection setup.

IP Provides routing along with global addressing in the form of IP addresses.

Ethernet Provides local transport.

The resulting protocol will thus be reliable and be able to run over the global internet. This protocol would be great for web and mail traffic, since these require guaranteed delivery of data to function correctly. On the other hand, if we were to write a voice over IP application, we would not need the retransmission (error control) provided by TCP. Since the protocol architecture is layered, we can simply replace the TCP layer at the top with for example a UDP layer, or run directly on top of IP. The resulting protocol stack would retain the global addressing and local delivery properties of the former protocol stack, but would not include error control, flow control, connection setup, etc. The layered architecture thus provides great advantages for real world networking systems.

This protocol layering is present in (at least) two forms in network systems: as implementation-specific software layers and as precisely defined protocol specifications. This distinction is important. When working between hosts across a network, protocols need to be exactly specified, so they can be encoded and decoded correctly. Failure to specify a protocol exactly could result in e.g. endianness issues or seemingly invalid packets. Inside a host, however, the software managing the protocol layers is implementation defined, and does not specifically need to be layered at all; e.g. the OSI model is heavily focused on layering in all parts of the architecture, whereas the TCP/IP model in several cases discourages layering in the implementation. While one is "forced" to use the layering for the actual protocols, the software side could potentially be optimized in several ways if some architecture other than layering is used.

2.3.2 OSI layering model

The OSI model[13] is a generic layering model dividing the protocol stack into seven different layers, as shown in Table 2.1. Note that the model does not specify any layers above 7 or below 1.

#  Name          Functionality
7  Application   Provides services to the application
6  Presentation  Provides conversion between layers 7 and 5, e.g. encryption
5  Session       Manages sessions between applications
4  Transport     Provides end-to-end connections, reliability and flow control
3  Network       Provides logical addressing and routing
2  Data Link     Provides physical addressing
1  Physical      Provides transmission over the physical medium

Table 2.1: The seven layers of the OSI model

Within each of the seven OSI model layers are a number of entities, each of which is allowed to use the services of entities in the layer immediately below it. It is also allowed to interact with other entities at the same level, either directly or through a proxy in a lower layer. Every layer has a relatively fixed set of responsibilities. Communication between layers is performed through a conceptual port construct known as a Service Access Point (SAP).


The model also defines the terms Service Data Unit (SDU) and Protocol Data Unit (PDU), and one of each is defined for every layer. The PDU of layer N represents data that has been encapsulated by layer N. Similarly, the SDU of layer N represents the PDU of layer N+1. When moving down the layers, the PDU is passed to the layer below, converting the PDU to an SDU. The layer entity thereafter performs some form of encapsulation, typically inserting headers and/or footers, resulting in a new PDU. The reverse process happens when moving up through the layers.

This model of encapsulation and decapsulation matches the actual protocol behavior used in real protocols.

The OSI model provides a good conceptual model of network protocol layering, and is useful for describing many systems, but it breaks down with many real world protocols. For example, the TCP protocol delivers end-to-end connections via ports, but also manages inter-application sessions. This puts it at both layers 4 and 5 of the OSI model, which is not allowed when strictly following the OSI model. Another example is the ARP protocol (and the IPv6 equivalent, the Neighbor Discovery Protocol) for translating logical IPv4 addresses into physical link-layer (typically MAC) addresses. This means the ARP protocol performs parts of both OSI layers 2 and 3.

Furthermore, some concerns, such as security, are allowed to affect more than one layer, again breaking the strict layering.

Despite these shortcomings the OSI model is heavily used as a guidance tool when constructing networking systems in the real world.

2.3.3 TCP/IP model

The TCP/IP model is the underlying model for the Internet Protocol Suite[14].

Unlike the OSI model, which is a generic model of networking systems, the TCP/IP model can be considered an implementation of the OSI model. There are however major differences in the approach taken between the two models.

Whereas the OSI model focuses heavily on layering, the TCP/IP model favors architectural principles over layering[15]. The TCP/IP model does specify four layers but, unlike the OSI model, these are not numbered, only named. The layers can be approximately related to the OSI model as shown in Table 2.2.

As seen in Table 2.2, the TCP/IP model does not concern itself with the physical layer, and several of the upper layers are combined. The table is only approximate, however, as e.g. TCP provides several parts of both the Session and Transport OSI layers. Instead of focusing on layering, the TCP/IP model focuses on a few architectural design goals, especially simplicity, scalability, robustness and running code[15][14][11].


Table 2.2: Approximate relation between OSI and TCP/IP layers

TCP/IP layer   OSI layer #   OSI layer name
Application    7             Application
               6             Presentation
               5             Session
Transport      4             Transport
Internet       3             Network
Link           2             Data Link
-              1             Physical

The TCP/IP suite specifies that all protocols must use IP[16] as their Network layer protocol, in order to ensure interoperability between systems. The TCP/IP suite also favors a link-layer agnostic approach[14]: the TCP/IP suite must be able to run over any link-layer protocol, in order to decouple the two and allow independent development and evolution at either layer. Therefore, the TCP/IP suite itself specifies no link layer, only a set of requirements for the link layer protocols supporting it[16]. Some specifications for transporting IP over various link-layer protocols do however exist, notably RFC 894 (IP over Ethernet II networks)[17] and RFC 1042 (IP over 802 networks)[18].

2.3.4 IEEE 802® layers

Ethernet and WiFi are both defined by IEEE 802® standards, defined respectively by the working groups 802.3 and 802.11. All IEEE 802® standards share the same layering architecture[19], matching the OSI layering model with a Physical and a Data Link layer. The Physical layer of course defines the signaling and physical characteristics that devices must follow. This is technology-specific. The IEEE 802® standards split the Data Link layer into two sublayers, namely the lower-level Medium Access Control (MAC) sublayer and the higher level Logical Link Control (LLC) sublayer, connected via service access points.

Optionally, some IEEE 802® LAN standards provide an additional sublayer, the Ethernet sublayer, operating on top of the MAC layer in the same way as the LLC sublayer. The MAC sublayer is technology-specific, but the LLC and Ethernet sublayers are shared among all IEEE 802® standards.

The MAC layer is a purely frame-based, connectionless layer, responsible for data transfer between hosts on the local network. Frame-level functions are all handled at the MAC layer, including frame delimiting and recognition, physical addressing via a 48-bit MAC address, error protection (e.g. checksumming), and of course payload transfer. The MAC layer can also implement other features beyond this, for example flow control (e.g. PAUSE frames in 802.3x) or encryption (e.g. 802.11i, Wi-Fi Protected Access 2). The MAC layer is typically implemented as software in the host protocol stack, although it can of course make use of hardware offloads if the underlying network interface device provides them.

As noted, the MAC layer is responsible for performing encapsulation of LLC and optionally Ethernet frames. The choice between either of these is defined by the network standard in question, although most IEEE 802® networks always use LLC, with 802.3 (i.e. wired ethernet) being the exception, typically using the Ethernet sublayer directly[17].

The Ethernet sublayer is a relatively simple encapsulation, consisting of a two octet protocol number (known as the ethertype), followed by packet data. An ethertype value in the range 0-1500 (both inclusive)[20] indicates that the frame is using LLC encapsulation, and in this case the ethertype specifies the length of the payload of the frame (i.e. the LLC payload length). An ethertype at or above 0x0600 (1536) is used directly for dispatching to a higher level protocol.

The list of ethertypes is maintained by IANA[21]. The Ethernet sublayer thus provides unacknowledged connectionless packet transmission.
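As a rough illustration of this dispatch rule, the following sketch (a hypothetical helper, not part of the Trokis sources) classifies the two-octet type field:

    #include <cstdint>

    // Sketch of the IEEE 802.3 length/ethertype dispatch rule. Values 0..1500
    // give the LLC payload length; values >= 0x0600 name an upper-layer
    // protocol. The range 1501..1535 is left undefined here.
    enum class frame_kind { llc_length, ethertype, undefined };

    inline frame_kind classify_type_field(uint16_t type_field)
    {
        if (type_field <= 1500)
            return frame_kind::llc_length;  // field is the LLC payload length
        if (type_field >= 0x0600)
            return frame_kind::ethertype;   // field selects a higher-level protocol
        return frame_kind::undefined;       // 1501..1535: not valid on the wire
    }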

The LLC sublayer is responsible for providing several services on top of the MAC layer[22]. LLC provides three different service types on top of MAC, namely:

Type 1 Unacknowledged Connectionless-mode

Type 2 Connection-mode

Type 3 Acknowledged Connectionless-mode

Only Type 1 is required for standards compliance[22] and this type is used in the vast majority of cases, as session handling is typically done at the transport layer, for example in TCP. Furthermore, RFC 1042 recommends the use of Type 1 LLC when running IP and ARP over LLC[18].

The LLC layer also provides multiplexing capabilities by way of the DSAP (Destination Service Access Point) and SSAP (Source Service Access Point). The DSAP indicates the upper layer SAP (i.e. protocol) that should receive the encapsulated information contained in the LLC frame. The SSAP indicates the SAP (i.e. protocol) of the sending protocol. Both DSAP and SSAP are 8-bit fields, and when both have values of either 0xAA or 0xAB, the LLC frame is a Subnetwork Access Protocol (SNAP) packet[19]. In this case, the initial five octets of the LLC frame make up the SNAP header; the first three octets of this header constitute an OUI, while the final two octets are a vendor (OUI)-specific protocol number. In the case where the OUI value is all-zeros, the inner frame of the LLC/SNAP packet is an Ethernet sublayer packet. The SNAP protocol number in this case directly matches the Ethernet sublayer ethertype value.

2.3.5 End-to-end principle

The end-to-end principle specifies that some piece of functionality should be implemented in the end-nodes of a communication path, provided that the implementation can be done completely and correctly[23].

The end-to-end principle is extremely interesting in the context of a protocol stack, because of its implications for the placement of various components. For example, many modern operating system kernels provide the TCP and UDP protocols in the kernel[24, 25]. Following the end-to-end principle however, it could be argued that there is no reason for these protocols to live in the kernel. The end-node would in this case be the user-space application using the TCP or UDP protocol. There is no obvious reason preventing the user-space application from implementing (at least most of) the required functionality "completely and correctly", so it follows from the end-to-end principle that TCP and UDP should be provided in the user-space application. Of course, there is a small extra detail to this, namely the fact that the protocol stack will still need to be able to perform some inspection of TCP and UDP, in order to forward the correct ports to the correct user-space application. This is however only a small part of the functionality provided by at least TCP; this small part could be provided by the kernel, and the rest (e.g. connection setup, flow control, etc.) could be provided in the user-space application.

Applying the end-to-end principle can improve performance in several internet applications. In the example of TCP and UDP, the kernel maintains a buffer of received data (or packets in the case of UDP) which the application must then retrieve from the kernel. The buffer typically also has a fixed size. This introduces a feedback loop where the kernel receives data, notifies the application to pull out the data, whereafter the kernel can again receive new data.

Another example of performance improvements by moving functionality to the end-hosts is seen in IPv6, where fragmentation is never performed by intermediate routers, unlike in IPv4 where routers are supposed to fragment packets larger than the underlying media can support. IPv6 instead mandates that all implementations must support packets of at least 1280 bytes[26], and that the end-hosts must perform Path MTU Discovery[27] in order to determine if the path between them can support a larger MTU.

It is important to remember that the end-to-end principle is just that, a principle. It is not a rule that specifies exactly how things must be done, but it provides a good guideline when building network systems.

2.4 Modern network card architecture

Modern ethernet networks can achieve speeds in excess of 10 Gbps. The mini- mum length of an 802.3 ethernet frame is 64 octets, however the physical layer adds a preamble and interframe gap, resulting in a minimum allowed frame on a wired 10 Gbps ethernet connection of 84 octets. This results in a maximum frame rate of 14.88 Mpps(millions of packets per second), which the sender and receive must be able to process and possibly sustain for extended periods. The total time between each arriving frame is thus only 67.2 ns. This corresponds to only 201.6 cycles of a 3 GHz CPU. The CPU may need to perform route lookups, firewall lookups, consult a connection tracking system, pass the data to an application(which must interpret the data) and more, processing every frame in 200 cycles is not very realistic. Modern systems mitigate this by using multiple parallel processing cores, each working independently. As much of the packet processing is (close to) embarrassingly parallel, the task can be efficiently be split between CPU, allowing almost linear speedup. Separate solutions are implemented to accomplish this in both egress and ingress directions.
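For reference, these figures follow directly from the minimum frame spacing on the wire (a short derivation, restating the numbers already given):

    \[
      \frac{10 \times 10^{9}\ \mathrm{bit/s}}{84 \cdot 8\ \mathrm{bit/frame}} \approx 14.88\ \mathrm{Mpps},
      \qquad
      \frac{1}{14.88 \times 10^{6}\ \mathrm{frames/s}} \approx 67.2\ \mathrm{ns},
      \qquad
      67.2\ \mathrm{ns} \times 3\ \mathrm{GHz} \approx 201.6\ \mathrm{cycles}.
    \]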

Furthermore, modern network interface cards make extensive use of DMA in order to perform as much copying as possible without host CPU intervention.

It should be noted that different cards support different addressing widths, potentially limiting the amount of memory available for packet buffering. This should be taken into account in the design.

Most modern network interfaces further offer many different offloading functionalities, allowing the host CPU to use less time on packet processing. Examples of advanced offloading include header splitting (i.e. splitting the received frame into header and payload), stateless packet parsing (parsing VLANs, protocols, MPLS tags, etc.) and various stateful offloads, e.g. TCP offloading.

2.4.1 Ingress path

The ingress path of a modern network interface card[28] implements multiple independent receive queues that are used for frame reception. The number of queues available varies, but for the Intel 82599-based cards used for this project, 128 separate queues are available. Frames are generally moved directly to system memory using DMA, and in some cases the CPU caches are also primed with the data, allowing lower latency.

Upon reception of a frame, the network interface card will perform various processing steps on the frame, finally picking a queue that will receive the packet.

The queues are implemented as ring buffers, and as such they are of fixed capacity with FIFO ordering. When the host is ready to receive a packet, it inserts a so-called receive descriptor into the ring. This descriptor includes flags and addresses the network interface card will use when writing the packet to system memory. Upon reception, the network card will fetch the next descriptor in the ring, and if it encounters a receive descriptor it will use the data within to push the received packet directly to main memory using DMA. The network interface card then performs a so-called descriptor write-back, updating the receive descriptor with additional data. The packet has now been moved to main memory and is ready for processing by the host. The host may either busy wait (or poll) on the ring buffer, or may receive interrupts from the network interface. Which strategy is chosen depends on a number of factors, as interrupts and busy waiting have different trade-offs. Most modern network stacks allow both modes and adaptively switch between them when the rate of interrupts generated becomes too great.
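A highly simplified sketch of how such a ring could look from the host's point of view is given below; the field names are hypothetical and the real 82599 descriptor layout is considerably richer:

    #include <array>
    #include <cstddef>
    #include <cstdint>

    // Hypothetical, highly simplified receive descriptor and ring.
    struct rx_descriptor {
        uint64_t buffer_addr;  // physical address the NIC should DMA the frame to
        uint16_t length;       // filled in by the NIC on descriptor write-back
        uint16_t status;       // e.g. a "descriptor done" bit set on write-back
    };

    template <std::size_t N>
    struct rx_ring {
        std::array<rx_descriptor, N> descs;
        std::size_t next_to_use = 0;    // next slot the host arms with a fresh buffer
        std::size_t next_to_clean = 0;  // next slot the host expects the NIC to complete

        void arm(uint64_t phys_addr) {
            descs[next_to_use] = rx_descriptor{phys_addr, 0, 0};
            next_to_use = (next_to_use + 1) % N;  // FIFO order, fixed capacity
        }
    };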

Since there are multiple queues active, each one may be assigned to a separate processor, using a load-balancing scheme generally termed Receive-Side Scaling (RSS)[28]. In this mode, the network interface card generates a hash of some of the headers of the received packet. Generally a "level 4" hash is used, referring to the OSI transport level. In such a hash, level 4 and usually also level 3 data is used to generate a hash for the packet. For UDP/IP and TCP/IP packets, port numbers along with the IP source and destination are normally used to generate the hash. The resulting hash is then used to select a receive queue for reception of the packet. This is done to ensure that a single flow is always steered through the same queue, in order to ensure in-order delivery. While some protocols (e.g. TCP) can cope with out of order packets, performance is reduced in these cases. While this load-balancing scheme does not ensure a perfect split of traffic between receiving CPUs, it usually results in a relatively fair balancing.
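Conceptually, the queue selection amounts to hashing the flow identifiers and reducing the hash to a queue index. The sketch below uses std::hash as a stand-in for the Toeplitz hash and indirection table that real hardware typically uses:

    #include <cstddef>
    #include <cstdint>
    #include <functional>

    // Conceptual RSS queue selection; std::hash is only a stand-in for the
    // hardware hash function.
    struct flow_key {
        uint32_t src_ip, dst_ip;      // IPv4 source/destination ("level 3")
        uint16_t src_port, dst_port;  // TCP/UDP ports ("level 4")
    };

    inline unsigned select_rx_queue(const flow_key& k, unsigned num_queues)
    {
        uint64_t mixed = (uint64_t(k.src_ip) << 32) ^ k.dst_ip;
        mixed ^= (uint64_t(k.src_port) << 16) ^ k.dst_port;
        std::size_t h = std::hash<uint64_t>{}(mixed);
        return h % num_queues;  // the same flow always maps to the same queue
    }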

Several other queue-mapping strategies besides RSS are possible. The Intel 82599 for example supports using 8 queues (each with a different frame priority) per CPU, with up to 16 CPUs (128 queues total), and various modes for Data Center Bridging, special packets (e.g. TCP SYN packets) and more.

Assigning a queue to a single processor has the added benefit that locking at the queue level is no longer required between host CPUs, as only a single CPU will ever work with the queue. This removes all contention overhead.


2.4.2 Egress path

The egress path of a modern network interface card is in many ways conceptually similar to the ingress path. Frame transmission is done using multiple separate, independent transmit queues. As in the receive case, transmit queues are implemented as ring buffers, with a bounded number of slots. Each slot can be filled with (part of) a frame to be transmitted. The network interface card will fetch as many descriptors (and the data they point to) as required to assemble an entire frame. Upon completion of the frame, it is transmitted to the network.

On the ingress path, Receive-Side Scaling is used to load-balance between queues. On the egress path, a "reversed" scheme is used. The host is responsible for programming the network interface card, telling it which queues to prefer when sending. Several options are possible here, for example plain round-robin, or more advanced schemes based on e.g. weighted round-robin or strict priority. When sending, the network interface card will pull descriptors out of the transmit rings in the programmed order, thus achieving load balancing between sending CPU cores.

Once a packet has been transmitted, the network interface can use one of several mechanisms to signal completion, e.g. interrupts, writing a new ring buffer head pointer, or updating the descriptors it processed.

As was the case with mapping a single processor to each ingress queue, mapping a single processor to an egress queue removes locking requirements between cores, removing contention overhead.


Chapter 3

A basic framework for protocol implementations

In this chapter, we will discuss the various components required to support high performance packet processing in a network stack. The focus here is not on specific protocols, but rather on a general model of packet processing. We will start out by exploring how packet payloads can be managed and modified, and will then continue on to how processing elements can be defined to make use of this infrastructure. We will also look at how concurrency can be handled, as it poses a non-trivial problem but is vital for achieving high performance on modern multi-core machines.

3.1 Prototype implementation basics

The Trokis prototype framework is implemented as a pure header library in C++11. A pure header library was chosen because it is very easy to work with for the programmer, requiring no special linker options or other compile time options. This does increase compilation times considerably, as everything is pulled together in a single translation unit, forcing a complete recompile in all cases. The performance of the compiled program, assuming that link-time optimizations are available, should be similar in both cases.


3.2 Packet payload management

In order to process packets, we need a way to manage packet payloads. Since multiple different protocols will need to interoperate, we need a model that is generic enough to support all common patterns used in communication protocols. We also need the interface to be portable, so that the network stack can be used on big- and little-endian machines with no modifications.

At the lowest level, we can simply think of a packet payload as a series of bytes representing some data. Most protocols today are based on either prepending headers or appending footers to data that they encapsulate. Examples include all protocols in the IP suite (including IPv4, IPv6, TCP, UDP and ICMP), along with the IEEE 802.3 ethernet protocol. Our abstraction must efficiently support these functionalities.

For this purpose, we define packet_data. The packet_data interface encapsulates a sequence of bytes contained between two points known as Head and Tail. The Head point signifies the beginning of the packet payload, while the Tail point signifies the end of the payload. The area between these two points is known as the Payload area. The user can grow the payload area arbitrarily in both directions, and can of course also shrink the payload area to a minimum size of zero bytes (no payload). Resizing can occur independently at either end. All positions before the Head and after the Tail points are undefined, but may be present in memory. A visualization of the abstraction can be seen in Figure 3.1.

It is important to note that the Head and Tail points are not pointers, but are simply known indices for the programmer. The Head point by definition always has index zero, while the Tail point has index length − 1.

Figure 3.1: The packet_data abstraction: head room, the Payload area between the Head point (index 0) and the Tail point (index length − 1), and tail room. The Head and Tail points can be moved independently, growing or shrinking the payload area in the given direction.

The Head and Tail points can be moved using the functions push, pull, put, and trim. The push and pull functions move the Head point backwards and forwards respectively, i.e. extend and shrink the payload area. Likewise, the put and trim functions move the Tail point forwards and backwards, i.e. extend and shrink the payload area. This interface is very intuitive to use, and e.g. allows the programmer of an IPv4 processing block to call push(20), prepending a 20-byte header to the current payload. It should be noted that when expanding the payload area, the values of the bytes in the expanded area are undefined, and it is thus up to the programmer to fill them in. Zeroing out these bytes would waste time, as the programmer will most likely fill them in with new values in any case.
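A minimal sketch of the packet_data surface described above is given below. It follows the names used in the text, but assumes a single contiguous backing buffer with pre-reserved head and tail room and omits bounds checks; the actual Trokis implementation is in the appendices.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Simplified packet_data sketch: one contiguous buffer with reserved head
    // and tail room. push/pull move the Head point, put/trim move the Tail point.
    class packet_data_sketch {
        std::vector<uint8_t> buf_;
        std::size_t head_;  // index of the Head point within buf_
        std::size_t tail_;  // one past the Tail point within buf_
    public:
        packet_data_sketch(std::size_t headroom, std::size_t tailroom)
            : buf_(headroom + tailroom), head_(headroom), tail_(headroom) {}

        std::size_t length() const { return tail_ - head_; }

        uint8_t* push(std::size_t n) { head_ -= n; return &buf_[head_]; }  // grow at front
        void     pull(std::size_t n) { head_ += n; }                       // shrink at front
        uint8_t* put(std::size_t n)  { uint8_t* p = &buf_[tail_]; tail_ += n; return p; } // grow at back
        void     trim(std::size_t n) { tail_ -= n; }                       // shrink at back
    };

    // Example use: an IPv4 block prepending a 20-byte header.
    //   packet_data_sketch pd(64, 1500);
    //   pd.put(100);   // 100-byte payload (contents undefined until written)
    //   pd.push(20);   // make room for the IPv4 header in front of the payload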

The actual implementation backing packet_data is not specified by the interface, but it is specified that it must consist of a limited number of byte arrays. The interface provides a way to retrieve the list of byte arrays backing the packet_data, but these byte arrays should not be modified directly. This facility is provided for the few cases that need raw access to the underlying data, mainly when moving data outside the infrastructure we provide.

The advantage of this data management scheme is that it completely frees the programmer from having to worry about the underlying memory structure of the packet data, and likewise it gives complete freedom for implementing the backing data storage for the payload. For example, the implementation could reserve extra space at the head of the packet, allowing protocol headers to be prepended without having to move or copy the data in the payload. Allowing fragmented data buffers also becomes trivial, as no part of the upper level interface will have to change. This can potentially have huge performance benefits in some situations, because it removes or reduces the need for copying the entire backing buffer when segments are added.

On top of packet_data is a layer of helper functions for getting and setting bytes in the packet payload, along with a function for performing boundary checks. Letting the programmer explicitly handle boundary checks allows better performance, because boundaries can be verified just once and then simply assumed to be correct from there on out. The byte getting and setting helpers are implemented in C++ as a very small template library operating on pure iterators. This allows the compiler maximum freedom to optimize and inline the code when it deems this to be more efficient. Wrappers around this library are also present directly on the packet_data interface. This does have the drawback that the algorithms (the helpers) and the data (the payload contained in packet_data) are not entirely decoupled, but it feels more natural for the programmer, especially one coming from an object oriented background, to have the functions as members of the packet_data structure.

The helpers allow type safe reading and translation of integers in the byte stream to host endianness integers. This is important for portability, as it allows processing blocks to be written in portable code rather than having to assume some byte order, or having to perform checks in the code. This does come at a potential performance cost, but it was felt that this cost was generally very low and could easily be justified by the increased productivity gains. Most of the performance implications can be hidden or entirely removed using processor specific instructions for performing byte swapping. The current version also simply reads single bytes, rather than reading the largest possible type (for example, reading 32 bit numbers as one 4-byte word instead of as four 1-byte words). This version is however simpler to code, and should be easy to change if it proves to be a bottleneck.
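As an illustration of such byte-wise, endian-safe reads, helpers along the following lines (a sketch, not the exact Trokis helpers) assemble big-endian fields one byte at a time and therefore work unchanged on any host byte order:

    #include <cstdint>

    // Sketch of endian-safe read helpers operating on iterators: network byte
    // order (big endian) is assembled byte by byte, so the result is correct
    // regardless of the host byte order.
    template <typename InputIt>
    uint16_t read_be16(InputIt it)
    {
        uint16_t hi = static_cast<uint8_t>(*it++);
        uint16_t lo = static_cast<uint8_t>(*it);
        return static_cast<uint16_t>((hi << 8) | lo);
    }

    template <typename InputIt>
    uint32_t read_be32(InputIt it)
    {
        uint32_t v = 0;
        for (int i = 0; i < 4; ++i)
            v = (v << 8) | static_cast<uint8_t>(*it++);
        return v;
    }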

An early prototype of the helpers used a much more elaborate scheme, where entire on-wire structures could be described at bit resolution and decoded. This was based on heavy C++ template metaprogramming and would potentially have allowed fully type safe decoding of structures, rather than the non-portable use of bit fields and functions for translating between byte orders as is often seen in low-level C code. This approach however had even larger performance implications, as there was no obvious way for the compiler to perform optimizations across several reads from the structure. Allowing the programmer to perform these optimizations manually would have removed the type safety features, and essentially nullified the entire concept. After some deliberation, it was decided that the aforementioned approach of providing simple integer helpers would be a better fit for our implementation, as it does not hide when reads or writes are made and allows programmers to perform these optimizations themselves if they so choose. This also forces the programmer to think about when packet data is accessed, which we consider to be a useful effect.

Each live packet in the system is managed by an instance of the packet class. The packet class is really just a POD (plain-old-data) type encapsulating a pointer to the assigned packet_data object, as well as a few very basic pieces of meta data (a sketch of the resulting type follows the field list below). The meta data included is currently limited to:

worker The ID of the worker that is currently assigned to this packet. This field is used mainly for implementing concurrency and is explained further in Section 3.4.

dev The ID of the "current" device of the packet. The meaning of this field is dependent on which direction the packet is travelling through the stack. When a packet arrives on an interface, the dev field is filled with the interface ID of the device that received the packet. When the packet is getting transmitted, the dev field is filled with the device ID of the device that should transmit the packet. This field is in the packet type because it is so central to the function of the network stack, and the notion of the networking device (or interface) generally transcends the layers of the stack.
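In outline, the packet type is therefore little more than the following sketch (how packet_data ownership is managed is left out of the sketch):

    // Sketch of the packet type described above: a pointer to the assigned
    // packet_data plus the two metadata fields that transcend all layers.
    struct packet_data;  // payload abstraction from Section 3.2

    struct packet {
        packet_data* data;   // the payload buffer for this packet
        unsigned     worker; // ID of the worker currently assigned to the packet
        unsigned     dev;    // receiving device on ingress, transmitting device on egress
    };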

More fields could potentially be added to the packet type, but in most cases parameters should be moved between blocks using function call parameters rather than packet state. This improves flexibility because blocks only need very basic knowledge of the packet type. Generally, only fields that transcend all layers of the protocol stack should be added to the packet type. The device ID is an example of such a field. For example, the IPv6 protocol generates link-local addresses that are meaningless without knowing the physical interface, but the IPv6 addresses will never be important to the IPv4 module. In this example, IPv6 addresses should be passed via function call arguments, while the device could be transported via the packet type. Adding a field to the packet type reduces flexibility, but does give a potential performance increase, as it potentially frees up a register when performing calls between processing blocks.

Note that there are no pointers to block-specific data structures in the packet type. This is again to maximize flexibility. While moving pointers directly to the packet type would result in better performance when looking up configuration data, for example being able to immediately dereference the worker or dev fields as pointers, it introduces tighter coupling between the components of the stack. Instead, we force processing blocks to implement their own indirection tables. Several blocks that are designed to work together could of course share their underlying data structures in this regard, but our design tries to encourage each block to be independent if possible.

In order to allow high performance moving of packets between different parts of the stack, we always pass pointers to the encapsulating packet objects. We simply use the standard C++11 std::unique_ptr, because this ensures that the object is always released when the pointer goes out of scope. Using rvalue references we can safely move the packet data around the stack, ensuring that a packet never has multiple owners. This also means that we cannot simply copy packets, but there seems to be no need for this in most cases. Sniffers and a "tee" processing block would require the ability to copy a packet, but these could simply spawn a new packet of the correct length and copy the data from the original packet to the new packet. The extra safety gained from not sharing data outweighs this drawback.
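The ownership rule can be expressed directly in the function signatures; a minimal sketch, with a hypothetical transmit entry point standing in for a real lower layer:

    #include <memory>
    #include <utility>

    struct packet { unsigned dev = 0; };  // placeholder for the packet type above

    // Passing packets as std::unique_ptr makes single ownership explicit: the
    // caller must std::move the packet in and no longer owns it afterwards.
    void transmit(std::unique_ptr<packet> p)
    {
        // hypothetical lower-layer entry point; the packet is released when p
        // goes out of scope unless it is moved further down the stack
    }

    void forward(std::unique_ptr<packet> p)
    {
        p->dev = 1;              // e.g. pick an egress device
        transmit(std::move(p));  // ownership moves down the stack
        // p is now empty; dereferencing it here would be a bug
    }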

3.3 Processing elements

An important aspect of a protocol stack is the ability to compose various protocols to form a complete stack. For example, an application needs to be able to use TCP, which needs to be able to run on top of IPv4, which in turn needs to be able to run on top of Ethernet. The interfaces between each of these components vary, and in some cases more steps are needed in between them in order for correct protocol operation.

Instead of defining a common interface that all protocols at a given layer must use to communicate with surrounding layers, we define a set of rules on how a protocol should be implemented in order for other protocols to interact with it.

At the heart of our design are protocols, which we encapsulate in generic units called "processing elements" or "processing blocks". Each block is a C++ class, encapsulating all state information that the block needs in order to perform its task. We do not use dynamic dispatch (virtual function calls) with these classes, but instead treat them more as namespaces that we can spawn instances of. This has the benefit that functions and state are contained in a single unit. This is ideal for our project, as we want processing blocks to be maximally interchangeable with as little modification to processing blocks as possible.

Figure 3.2: Generic processing block with entry points (receive, transmit) and hooks (inform, request). Pointers to the block are function calls (entry points); pointers away from the block are hooks (function pointers). No functions or hooks are required; the named ones are simply those that will commonly be defined.

Each block has a number of input functions and a number of hooks, as illustrated in Figure 3.2. The input functions are the interface that other blocks (and external components) use when they need to use the services of a given processing block, or when they need to send configuration messages to the block. Each input has a clearly defined function signature, but each block can define exactly the functions needed for operation. The semantics of a given member function are also left up to the implementation of the block.

Perhaps more interestingly, blocks can also have a number of hooks. Hooks are just function pointers, implemented in our solution as C++11 std::function objects. Hooks are set during setup of the stack, and never changed after this. This means that we do not have to surround the function objects with locking, atomic pointers, etc.

An object can have any number of hooks, and will typically also have a number of setters for setting each of these hooks. Blocks will also typically provide a default implementation of a given hook. Since hooks are not plain pointers, we can bind not only plain functions but also member functions (both virtual and not), function objects (functors), and anonymous functions (lambdas/closures). We also get a compile time guarantee that the function that we are assigning to a hook is compatible with the hook signature. An alternative to this approach would be to simply use the object oriented features of C++, namely (multiple) inheritance and virtual function calls (runtime polymorphism). There are numerous reasons why we picked function pointers over virtual function calls:

• Functions, not objects, are the basic building blocks of data processing. Functions perform actions, while objects simply encapsulate functions and data.

• Function pointers can potentially be as efficient as virtual calls (since virtual calls are typically implemented as a table of function pointers).

• In some cases, we may wish to map the hook of one processing block to a method on another processing block, in which case the function call signature of the hook and method need to match. Were we using virtual functions, we would need to implement a class with a single method to perform this mapping. Using function pointers, the mapping can be done using an std::bind expression or an anonymous lambda function, directly at the location where we set the hook. This makes it easier to see what is going on.

• Function pointers can point to both member and non-member functions, allowing easier interfacing with C code if so desired.

There are some drawbacks to this approach. Performance wise, both virtual function calls and function pointers suffer compared to direct function calls. This is because direct function calls do not involve any kind of runtime lookup in order to get the address that must be called, resulting in both fewer instructions and better branch prediction. Furthermore, direct function calls allow the compiler to perform inlining, entirely eliminating both parameter passing and call setup costs. Direct calls are however simply unacceptable in our system, because they by their very nature cannot be changed. A hybrid approach could most likely be created, where the type of each hook was passed as a C++ template argument to a processing block. This would allow the compiler to treat calls as direct and thus allow inlining, at the cost of vastly increased compilation times and possible code bloat resulting from the large amount of template instantiations. It is doubtful that this would result in a net increase in performance, as the extra code size could potentially lead to more instruction cache misses, which would have to be contrasted with the already relatively low performance overhead of a virtual function call. Finally, it should be noted that the code pointed to by most hooks will simply be too large to inline, and that the lambda functions or std::bind expressions that are used to adapt the call from hook to function will be partially inlinable (the call to the adaptor will be virtual, but the call from the adaptor to the target can be direct). The relative overhead of performing a virtual call should therefore be very small, and unlikely to be a problem.

Relative to virtual function calls, std::bind expressions can quickly become hard to read, and current lambda functions in C++11 do not allow polymorphism, forcing us to type out complete types for all the hook parameters that we are mapping. This quickly becomes tedious, error prone and hard to read. With C++14, lambda functions will allow polymorphism, allowing types to be specified simply as "auto". We feel, however, that the advantages of function pointers by far outweigh these drawbacks compared to using virtual functions.

The processing blocks in our system are not designed to be exchanged at runtime. Once the system is started and initialized, the layout of the processing blocks should remain constant. This frees us from using atomics or locking when dereferencing hook function pointers.

In general, our approach does not force any specific interface for processing blocks, but many protocol blocks need ways to communicate with upper and lower layers, and for this reason we have selected a standard naming convention for these hooks and functions. For introducing data to a given processing block, we have two functions named as follows:

Receive A lower layer calls Receive to indicate that it wishes to communicate data to upper layers. With the call it passes the pointer to a packet along with any required data to fulfil the interface specified by the receiving processing block.

Transmit An upper layer calls Transmit to indicate that it wishes to communicate data to lower layers. A pointer to a packet is supplied, along with any additional arguments.

Furthermore we have two hooks, named loosely after the naming convention used in the IEEE 802® standards:

Request The Request hook is invoked when a processing block wishes to com- municate with a lower layer protocol. The packet pointer along with any parameters required by the lower level interface are passed on.

Inform When a processing block wishes to communicate data to upper layers, it invokes the Inform hook. Again, a packet pointer and additional data are passed.

Conceptually, we can think of the Inform hook of a layer N processing block being connected to the Receive member of a layer N+1 processing block, while the Request hook of a layer N processing block will be connected to the Transmit member of a layer N-1 processing block. In practice, adapters may be introduced between the processing blocks, for example other processing blocks or plain functions.
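To make the convention concrete, the sketch below wires two invented blocks together with a std::function hook and a lambda adapter; the block names, signatures and the ethertype argument are illustrative only, not the actual Trokis interfaces:

    #include <functional>
    #include <memory>
    #include <utility>

    struct packet { unsigned dev = 0; };  // placeholder packet type

    // Hypothetical layer N-1 block: accepts packets from above via transmit().
    struct link_block {
        void transmit(std::unique_ptr<packet> p, unsigned ethertype)
        {
            // ... encapsulate the frame and hand it to the device ...
        }
    };

    // Hypothetical layer N block: exposes a transmit() entry point and a
    // request hook used to reach whatever lower layer it is wired to.
    struct ipv4_block {
        std::function<void(std::unique_ptr<packet>, unsigned)> request; // hook downwards

        void transmit(std::unique_ptr<packet> p)
        {
            // ... build the IPv4 header ...
            request(std::move(p), 0x0800);  // invoke the hook set during setup
        }
    };

    int main()
    {
        link_block link;
        ipv4_block ip;
        // Wire the Request hook of the upper block to the Transmit entry point
        // of the lower block, adapting the signature with a lambda.
        ip.request = [&link](std::unique_ptr<packet> p, unsigned type) {
            link.transmit(std::move(p), type);
        };
        ip.transmit(std::unique_ptr<packet>(new packet()));
        return 0;
    }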

Configuration of a processing block is done outside of the normal flow of the processing block framework. This allows each processing block to handle configuration in the most efficient way possible. For example, the configuration of a generic block that simply dispatches based on a parameter would be very different from the configuration of a block that handles the IPv4 protocol. By leaving configuration up to the processing block implementer, we give them the flexibility to select the best possible configuration scheme. This also allows us to perform complicated setup actions when required. For example, when an interface is registered we may need to set up IPv4 and IPv6 for the interface. Forcing a particular flow would decrease the applicability of our processing block components.

3.4 Handling concurrency

An important problem in scaling a network stack is how concurrency will be handled. Modern network interface cards support multiple receive queues, allowing packets to be steered directly to a specific CPU on the receive path, and allowing each CPU to be allocated dedicated transmission queues on the transmit path. In order to take advantage of these, relevant parts of our network stack must be thread-safe.

Packets in our network stack are always handled by only a single thread at any given time. This is enforced by the use of std::unique_ptr, as long as we do not pass references to other threads. This frees us from having to use locks or even atomics around packet objects and the packet_data subobjects, allowing maximum performance when manipulating these objects.

When implementing thread-safe code, there are two overarching paradigms to consider: blocking and non-blocking algorithms. Blocking algorithms use locking to provide mutual exclusion between threads, making sure that only a single thread is ever in the locked (critical) section of code. This has obvious consequences for scalability, especially when scaling to large numbers of processors, because potentially hundreds of processors could be waiting for a single processor to leave the critical section. Blocking algorithms can generally not guarantee system wide progress, because the holder of the lock could be suspended indefinitely, for example by the operating system, waiting for an external resource, or by coding error (e.g. an infinite loop). On the other hand are non-blocking algorithms. These algorithms typically work by using atomic CPU instructions, such as compare-and-swap, fetch-and-add or test-and-set, in combination with full or partial memory barriers. These instructions ensure ordering between workers, allowing much better scalability than a lock-based solution. They are generally divided into wait-free and lock-free algorithms. Wait-free algorithms guarantee that all processors will make progress in a bounded number of steps, while lock-free algorithms only guarantee that the system as a whole will make progress.

While non-blocking algorithms guarantee progress and scalability, this says nothing about the actual performance of a given algorithm, and in general a blocking algorithm may be faster than a non-blocking algorithm. Furthermore, blocking algorithms are typically much easier to develop and reason about, and making them deadlock-safe is relatively easy, especially when using the C++11 lock library, utilizing RAII locks (automatically released at the end of a scope) and deadlock avoidance algorithms. Very short locks especially can potentially achieve very good performance, and the C++11 lock library includes std::mutex for performing this type of blocking. The performance of this mutex implementation depends on the operating system, but a modern operating system like Linux will typically provide a high-performance mutex implementation.

In Linux this is provided by the futex (Fast Userspace muTEX) construct in the Native POSIX Thread Library (NPTL). Futex performs as much work as possible in userspace, only calling into the kernel when the lock is contended. It is possible that a spinlock will give better performance, but this is hard to know ahead of time and will most likely depend on the load of the system, as higher loads will presumably lead to a higher chance of the current lock holder being preempted, stalling other threads. Measurements can show if a given lock is a bottleneck.
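As a minimal sketch, a processing block protecting a small shared table with a short critical section might look as follows; the table itself is illustrative and not part of the prototype.

#include <cstdint>
#include <map>
#include <mutex>

// Illustrative shared state guarded by a short critical section.
class counter_table {
public:
    void increment(std::uint32_t key) {
        // RAII lock: acquired here, released automatically at end of scope.
        std::lock_guard<std::mutex> guard(lock_);
        ++counts_[key];
    }

    std::uint64_t read(std::uint32_t key) {
        std::lock_guard<std::mutex> guard(lock_);
        auto it = counts_.find(key);
        return it != counts_.end() ? it->second : 0;
    }

private:
    std::mutex lock_;                               // backed by futex on Linux/NPTL
    std::map<std::uint32_t, std::uint64_t> counts_;
};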

Furthermore, reader-writer locking could be used to achieve parallelism between multiple readers. We have not used this scheme in our prototype (instead preferring optimistic concurrency control for these cases), but with the proposed std::shared_mutex this functionality may become part of the C++14 standard library, making it much more widely applicable.
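For completeness, the sketch below shows how such reader-writer locking could look, using std::shared_timed_mutex, the shared mutex type that ended up in C++14. The table is illustrative only; as noted, the prototype prefers lock-free structures for this kind of read-mostly data.

#include <cstdint>
#include <map>
#include <shared_mutex>

// Illustrative read-mostly table: many concurrent readers, occasional writers.
class prefix_table {
public:
    std::uint32_t lookup(std::uint32_t key) const {
        // Shared lock: multiple readers may hold it in parallel.
        std::shared_lock<std::shared_timed_mutex> guard(lock_);
        auto it = entries_.find(key);
        return it != entries_.end() ? it->second : 0;
    }

    void update(std::uint32_t key, std::uint32_t value) {
        // Exclusive lock: blocks all readers and other writers.
        std::unique_lock<std::shared_timed_mutex> guard(lock_);
        entries_[key] = value;
    }

private:
    mutable std::shared_timed_mutex lock_;
    std::map<std::uint32_t, std::uint32_t> entries_;
};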

Because of the difficulty of knowing ahead of time whether a given lock will be contended, combined with the fact that blocking algorithms will often perform better in the uncontended case, we have decided to use optimistic lock-free algorithms when working with seldom-changed data, and blocking data structures as an initial implementation in most parts of our system.


In our network stack we allow for both blocking and non-blocking approaches to be taken, leaving it up to the implementer of a given processing block to decide which solution is best for the given case, for example using a lock-free structure for the routing table (heavily read, seldom updated) and combining blocking and non-blocking structures for IPv4 packet reassembly. The basic framework does not require any type of thread-safety, as the packet objects are only operated on by a single thread and the processing blocks are connected before packet processing starts.

3.5 Safe memory reclamation

Non-blocking data structures give rise to a unique problem: when data is removed from a data structure, there may be other threads traversing the same data. If the data is simply deleted once removed from the data structure, the other threads will have the data disappear without their knowledge, resulting in undefined behavior and crashes, or worse yet, security problems. This problem is related to the so-called "ABA problem" which arises from the use of the compare-and-swap paradigm. If a thread reads the value A, performs some calculations and then checks that the value is unchanged (is still A), another thread can change the value from A to some other value ("B") and back to A without the first thread noticing. This is mainly a problem when using compare-and-swap with pointers, as it is possible for the object A to be replaced by some other object B, deallocated, and a new object reallocated at the same address (for performance reasons, modern memory allocators will try to cache memory blocks if possible). The result can be lost updates or worse (security issues, crashes, etc.).
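The problem can be illustrated with a sketch of the pop operation of a Treiber-style lock-free stack; the code is illustrative only and not part of the prototype.

#include <atomic>

// Illustrative node in a Treiber-style lock-free stack.
struct node {
    int value;
    node* next;
};

std::atomic<node*> top{nullptr};

// Pop the top node. Without safe memory reclamation, the window marked in
// the comment below is where the ABA problem can strike.
node* pop() {
    node* old_top = top.load(std::memory_order_acquire);
    while (old_top != nullptr) {
        node* next = old_top->next;
        // Between the load of old_top and the compare-and-swap below, another
        // thread may pop old_top, delete it, and push a new node that happens
        // to be allocated at the same address. The CAS then succeeds because
        // the pointer compares equal ("A" again), but 'next' is stale.
        if (top.compare_exchange_weak(old_top, next,
                                      std::memory_order_acq_rel,
                                      std::memory_order_acquire))
            return old_top;
        // On failure, compare_exchange_weak reloads old_top; retry.
    }
    return nullptr;
}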

Both of these problems can be eliminated if it can be guaranteed that data structures will never be deallocated while there are still threads traversing them. The easiest way to ensure this is by using a garbage collector. While garbage collectors allow very easy management of memory, there are performance costs associated with using them. When using reference counting, memory is reclaimed exactly when it is no longer in use (thus reference counting is accurate), but there is a potentially high performance penalty resulting from contention for the atomic reference counters used in this scheme. An alternative solution is the use of mark-and-sweep with e.g. incremental or generational collection, like the Boehm garbage collector[29]. Garbage collection can provide very good performance, but there will always be a cost associated with it, especially in a language like C++ where only conservative collectors, like the Boehm collector, are really viable.

An alternative scheme is to simply defer data structure destruction long enough for the data to no longer be in use. There are various ways to accomplish this that don't require holding extra reference counts, shadow stacks, hazard pointers, etc., but instead work "globally". Examples include Quiescent-State-Based Reclamation (QSBR), Read-Copy-Update (RCU) and Epoch-Based Reclamation (EBR)[30]. All of these work by in some way signaling that a given thread has reached a quiescent state, i.e. a state where the thread was not inside a critical section and therefore must have dropped its references to the data. Memory can then be reclaimed by registering callbacks to be called once all threads have progressed through at least one quiescent state, ensuring that all references have been dropped. The performance of all of these schemes is very good, RCU being used as one of the primary synchronization primitives in the Linux kernel. The biggest benefit of all of these is that there is no per-pointer overhead: all pointers are raw pointers and can be dereferenced as normal pointers, resulting in almost no performance penalty overall. RCU as used in the Linux kernel can take advantage of various kernel-only facilities, especially that it can disable preemption (in non-realtime kernels), disallow sleeping within the RCU critical section, and use kernel processor preemption as quiescent state signalling. EBR and QSBR are more general than RCU, but overall work on similar principles.

All three of these are relatively complex, but we can create our own scheme based on some of the same ideas. Rather than using quiescent states, we simply defer reclamation for a set amount of wall time, ten seconds in the current implementation. Once a piece of a data structure is ready for reclamation, we save the current time and enqueue the pointer onto a list of objects ready to be reclaimed. The list is a wait-free multiple producer-single consumer FIFO, allowing maximum performance for the producers (forwarding threads). A garbage reclamation thread continually pulls a single object from the list and waits until the garbage object is older than the timeout, at which point the object is deleted and the reclaiming thread loops again. The garbage enqueue operation is wrapped in a non-copyable, movable smart pointer, gc_ptr, made possible by C++11 move semantics, allowing safe RAII-style management of the object lifetime. The smart pointer wraps an atomic pointer, and thus there is no performance drop associated with managing the pointer. The pointer is non-copyable to enforce single-owner semantics, which is required to prevent double-free errors. An alternative would be to reference count the number of owners of an object, but this would result in extra overhead in cases that do not need shared ownership. Deletion has a small overhead resulting from having to enqueue onto the reclamation list, but this is wait-free and only takes a few instructions, though the atomic exchange instruction used may of course cause a slowdown when many threads are enqueuing at once. This could be amortized by using thread-local deletion queues. In any case, the overhead of enqueuing onto the list should be much lower than the actual call to delete, resulting in an extra performance boost. Our solution retains the ability to use raw pointers.
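A minimal sketch of the scheme is shown below. The gc_ptr name follows the prototype, but the details are simplified: a mutex-protected queue and condition variable stand in for the wait-free multiple producer-single consumer FIFO, and move assignment and other conveniences are omitted.

#include <atomic>
#include <chrono>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Simplified stand-in for the wait-free MPSC reclamation FIFO: a queue of
// (deleter, enqueue time) pairs drained by a single reclamation thread.
class garbage_queue {
public:
    using clock = std::chrono::steady_clock;

    void enqueue(std::function<void()> deleter) {
        {
            std::lock_guard<std::mutex> guard(lock_);
            items_.emplace_back(std::move(deleter), clock::now());
        }
        ready_.notify_one();
    }

    // Reclamation loop: delete each object once it is older than 'defer'
    // (ten seconds in the prototype). Runs forever in its own thread.
    void run(std::chrono::seconds defer) {
        for (;;) {
            std::unique_lock<std::mutex> guard(lock_);
            ready_.wait(guard, [this] { return !items_.empty(); });
            auto item = std::move(items_.front());
            items_.pop_front();
            guard.unlock();
            std::this_thread::sleep_until(item.second + defer);
            item.first(); // actually delete the garbage object
        }
    }

private:
    std::mutex lock_;
    std::condition_variable ready_;
    std::deque<std::pair<std::function<void()>, clock::time_point>> items_;
};

garbage_queue the_garbage_queue;

// Non-copyable, movable smart pointer: on destruction the object is not
// deleted immediately but enqueued for deferred reclamation.
template <typename T>
class gc_ptr {
public:
    explicit gc_ptr(T* raw = nullptr) : ptr_(raw) {}
    gc_ptr(const gc_ptr&) = delete;            // single-owner semantics
    gc_ptr& operator=(const gc_ptr&) = delete;
    gc_ptr(gc_ptr&& other) : ptr_(other.ptr_.exchange(nullptr)) {}
    ~gc_ptr() {
        if (T* raw = ptr_.exchange(nullptr))
            the_garbage_queue.enqueue([raw] { delete raw; });
    }
    T* get() const { return ptr_.load(std::memory_order_acquire); }
    T* operator->() const { return get(); }

private:
    std::atomic<T*> ptr_; // readers dereference the raw pointer with no extra cost
};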


Of course, this scheme will fail if a thread ever holds on to a deleted object for more than ten seconds, but given that the usage is network packet processing, this seems very easy to ensure. If a network packet is ever more than around a second old, interactivity will be completely destroyed, and generally it is better to drop traffic this old. As an example, the CoDel queuing strategy drops traffic when it has resided in a queue for more than five milliseconds[31]. For this reason, it is very seldom that a packet processing step will need to hold on to a data structure for more than a few seconds, let alone ten. Simply using the fixed wall clock deferral time results in a vastly simpler implementation that, for almost all purposes in a network stack, will be good enough.

Our memory reclamation scheme is not suited for extremely high garbage rates, because of the lag between enqueueing and actual reclamation, but for most data structures the lag should pose very low overhead. On the other hand, our scheme will always reclaim the data after the threshold. This is a benefit over a scheme like RCU, which can potentially retain memory forever if a bug prevents even a single processor from reaching a quiescent state. Currently there is one major fault with our reclamation scheme when reclaiming data structures with nested garbage-collected pointers: when a garbage-collected object is destroyed and its destructor called, new garbage-collected objects can potentially be enqueued for reclamation. These will not be reclaimed before their own timeouts run out, which could be a problem with deeply nested structures. The maximum reclamation time would in this case be n × T, where n is the depth of the data structure being reclaimed and T is the garbage timeout threshold, with T equal to 10 seconds in our prototype. A solution could be to delete garbage directly when executing in the reclamation thread, but this is currently unimplemented in favor of simpler code. If this turns out to be a problem, a fix could be implemented.

Our safe memory reclamation scheme is of course optional for the processing blocks, and a given processing block may use any reclamation scheme deemed fit by its designer. This includes hazard pointers, pass the buck, proxy collection (very similar to our approach, but using a reference count to a proxy object rather than a wall-clock-time-based queue), any of the above mentioned quiescent-state-based algorithms, or some entirely different scheme.

Note that packet and packet_data objects do not need garbage collection, as they are only operated on by a single thread at a time, hence there is no risk of another thread having a reference to any of these structures when they are freed.


Chapter 4

Protocols

Based on the framework discussed previously, we will here discuss how IEEE 802.3, IPv4, and IPv6 can be implemented on top of the framework. We will also discuss how the network stack could be connected to external drivers and applications, as would be the case when used in a general purpose operating system.

4.1 Generic dispatch block

In several cases, for example in IEEE 802.3 and IP, a protocol produces a numeric value that indicates the type of encapsulated data. For these cases, we provide a generic block that receives a pointer to a packet along with the produced protocol number and remaining parameters and, based on a lookup into a table, calls the upper layer protocol. Our implementation uses the std::map type (typically a red-black tree) from the Standard Template Library as the backing container. Since the STL is used, we can easily switch the container if required, for example if it proves to be a bottleneck. We encapsulate each available handler in a std::function object, allowing us to store any function or member with the given function prototype, including lambdas and std::bind expressions. We also provide the possibility of setting a fallback handler that will be called if no registered handler with the given protocol number is found.
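A sketch of such a dispatch block follows; the exact handler signature and parameter list in the prototype differ, but the std::map and std::function structure and the fallback handler behave as described.

#include <cstdint>
#include <functional>
#include <map>
#include <memory>

// Stand-in for the framework's packet class.
struct packet {
    // payload buffers and metadata would live here
};

// Generic dispatch block: looks up a handler by protocol number and calls it,
// falling back to a default handler when no entry is registered.
class dispatch_block {
public:
    using handler = std::function<void(std::unique_ptr<packet>)>;

    void set_handler(std::uint16_t proto, handler h) { table_[proto] = std::move(h); }
    void set_fallback(handler h) { fallback_ = std::move(h); }

    void receive(std::uint16_t proto, std::unique_ptr<packet> p) {
        auto it = table_.find(proto);
        if (it != table_.end())
            it->second(std::move(p));
        else if (fallback_)
            fallback_(std::move(p));
        // otherwise the packet is dropped when the unique_ptr goes out of scope
    }

private:
    std::map<std::uint16_t, handler> table_; // typically a red-black tree
    handler fallback_;
};

Registering, for example, the IPv4 handler for EtherType 0x0800 would then be a single set_handler call with a lambda or std::bind expression.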
