Currently, LinkedIn infrastructure is composed of hundreds of thousands of hosts across multiple data centers. Observability into our infrastructure makes it possible for us to focus on the health and performance of our critical services to provide the best experience to our members. With LinkedIn’s large infrastructure growth over the past few years, observability has become more critical to pinpoint the potential root causes for any infrastructure failure or anomaly. There are a few elegant in-house monitoring systems at LinkedIn that provide network switch level metrics, logs, and even flow-level visibility by sampling packets going through our network. However, all of these rely on sampling or some kind of periodic polling of data, which for any meaningful sampling rate generates a very large volume of data to be processed and analyzed.

This led us to look for better ways to extract this observability information. We realized that the most reliable source would be the servers themselves, through a lightweight, widely deployed agent. This is where we started leveraging eBPF for the agent, tapping into system events and reading system metrics directly to extract the desired information. With eBPF, instead of tracing each packet, we trace syscalls at a layer closer to the application layer and summarize the data in-kernel to keep tracing overhead minimal. The eBPF agent has been named Skyfall and will be referred to as such throughout this post.

In this post, we will talk about the architecture of the Skyfall agent that is deployed across the fleet of hosts in our data centers, the data pipeline involved with it, and where this data can be leveraged. We will also talk about the challenges involved with deploying an always-running agent at such a wide scale and what we have learned on our journey so far.

What is eBPF?

Classic BPF, which stands for Berkeley Packet Filter, originated in the early 1990s and was used in packet filtering tools, most popularly tcpdump. eBPF was built on the same principles but has features beyond just packet filtering.

eBPF is an in-kernel execution engine with its own instruction set and further infrastructure around it, like maps and built-in helper functions, as well as a pseudo file system for pinning maps. It was developed to allow user-supplied programs to be attached to code paths in the kernel in a secure and lightweight manner. For example, if we want to trace all force-kill events, we can simply add a hook into the kill syscall via eBPF to capture such events.

Currently in Linux, the most common probe types available are:

  • Kprobes: A debugging mechanism for the Linux kernel that makes it possible to break dynamically into almost any kernel routine and collect debugging and performance information non-disruptively.
  • Tracepoints: A tracepoint is a marker within the kernel source that, when enabled, can be used to hook into a running kernel at the point where the marker is located. Tracepoints are more stable than Kprobes.
  • XDP: This hook is at the earliest point possible in the networking driver and triggers a run of the eBPF program on packet reception before it goes through any kernel networking stack.
  • TC classifier: Similar to XDP, this hook is attached at the network interface, but will run after the networking stack has done initial processing of the packet.

An eBPF application is usually made up of two programs: the kernel-space program, which is the eBPF program that runs in the kernel in response to events, and the user-space program, which is responsible for loading the eBPF program into the kernel and setting up the associated maps.

How are we using eBPF?

The Skyfall agent runs on almost all servers within our datacenter fleet. Using eBPF, we are able to correlate kernel events with network flow data in real-time. In a traditional network monitoring setup, for instance, we would need to monitor all the network interfaces on a host. However, with eBPF we can simply look at the kernel state to get the necessary statistics directly, since the kernel already knows about the network traffic on the host. Looking at kernel state using eBPF helps in identifying which services, processes, and containers are participating in communication sessions objectively, with very low CPU overhead. This mapping of system events to network traffic provides our engineers with the multi-dimensional context necessary to reduce the entropy of monitored data.

Skyfall program details

The Skyfall agent hooks into the following protocol-specific (TCP, UDP) lifecycle functions in the kernel via kprobes and kretprobes to collect the desired data:

  • tcp_set_state: For tracing TCP state changes.
  • tcp_v4_connect, tcp_v6_connect: For all TCP connection attempts.
  • inet_csk_accept: For TCP accept events.
  • ip4_datagram_connect, ip6_datagram_connect: For UDP connect events.

The following TCP metrics are collected along with the traffic byte count:

  • Smoothed RTT (Round trip time): The RTT estimate obtained by applying a smoothing factor to successive RTT measurements, which is also used to adjust the RTO (Retransmission timeout) value. We are collecting this metric to measure the contribution of the network to overall performance.
  • RTT variance: An indication of path jitter. TCP uses this value, combined with SRTT, to compute the RTO. We are collecting this metric to detect transient network issues.
  • Packet loss and retransmits: These metrics are being collected to monitor network performance.
  • Sending congestion window size: Congestion window controls the number of packets a TCP flow may have in the network at any time.
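To make the relationship between SRTT, RTT variance, and RTO concrete, here is a minimal sketch of the standard estimator described in RFC 6298, written in plain user-space C with floating-point values. It is an illustration only, not Skyfall code and not the kernel's implementation (the kernel keeps these values as scaled integers). The struct and function names are hypothetical.

```c
#include <stdbool.h>

/* RTT estimator state, following RFC 6298. Units match the samples fed in. */
struct rtt_estimator {
    bool   has_sample;  /* false until the first measurement arrives */
    double srtt;        /* smoothed RTT */
    double rttvar;      /* RTT variance (mean deviation) */
};

/* Fold one RTT measurement into the estimator. */
void rtt_update(struct rtt_estimator *e, double sample)
{
    const double alpha = 1.0 / 8.0;  /* SRTT gain from RFC 6298 */
    const double beta  = 1.0 / 4.0;  /* RTTVAR gain from RFC 6298 */

    if (!e->has_sample) {
        /* First measurement: SRTT = R, RTTVAR = R / 2. */
        e->srtt = sample;
        e->rttvar = sample / 2.0;
        e->has_sample = true;
        return;
    }

    double deviation = e->srtt - sample;
    if (deviation < 0.0)
        deviation = -deviation;

    /* RTTVAR is updated first, using the SRTT from before this sample. */
    e->rttvar = (1.0 - beta) * e->rttvar + beta * deviation;
    e->srtt   = (1.0 - alpha) * e->srtt + alpha * sample;
}

/* RTO = SRTT + 4 * RTTVAR. Clock granularity and min/max clamps are omitted. */
double rtt_rto(const struct rtt_estimator *e)
{
    return e->srtt + 4.0 * e->rttvar;
}
```

For example, a first sample of 100 ms followed by a sample of 200 ms gives SRTT = 112.5 ms, RTTVAR = 62.5 ms, and an unclamped RTO of 362.5 ms. A jump in RTT variance with a stable SRTT is the kind of signal that points at transient path issues.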

We are able to extract the above TCP metrics via the following lines of eBPF code:

    /* View the generic socket as a TCP socket to reach the TCP-specific fields. */
    struct tcp_sock *tsk = (struct tcp_sock *) tcp_sk(sk);

    /* Copy each metric from kernel memory into the event struct. */
    bpf_probe_read(&event.byte_count, sizeof(event.byte_count), &tsk->bytes_acked);
    bpf_probe_read(&event.srtt, sizeof(event.srtt), &tsk->srtt_us);
    bpf_probe_read(&event.rtt_var, sizeof(event.rtt_var), &tsk->mdev_us);
    bpf_probe_read(&event.total_retrans, sizeof(event.total_retrans), &tsk->total_retrans);
    bpf_probe_read(&event.snd_cwd, sizeof(event.snd_cwd), &tsk->snd_cwnd);

Here we cast the Linux sock struct to a tcp_sock struct to access the TCP-specific fields, and copy them into the corresponding fields of our event struct. Note that we use the BPF helper bpf_probe_read, which safely reads a given number of bytes from kernel memory, to read the TCP fields. Several other BPF helper functions are available for eBPF programs to interact with the system or with the context in which they run.

Skyfall architecture overview

  • Diagram of Skyfall architecture

The collected data is ingested into a highly scalable and efficient data collection pipeline, provided through our in-house inFlow collectors. InFlow is our network flow collection, aggregation, and visualization platform. The Skyfall agent encodes collected data into sflow datagrams using our custom XDR schema and sends this data to the InFlow collectors.
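As a rough illustration of what XDR encoding involves, the sketch below serializes a flow record into XDR's wire form, where every integer is written big-endian in 4-byte units. The record layout, field names, and function names are all hypothetical: the actual Skyfall schema and the sflow datagram framing around it are not described in this post.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flow record; not the real Skyfall XDR schema. */
struct flow_record {
    uint32_t protocol;      /* e.g. 6 for TCP, 17 for UDP */
    uint32_t src_ipv4;      /* host byte order */
    uint32_t dst_ipv4;      /* host byte order */
    uint32_t service_port;
    uint64_t byte_count;
    uint32_t srtt_us;
};

/* Four 32-bit fields, one 64-bit field, one 32-bit field. */
#define FLOW_RECORD_XDR_SIZE 28

/* Write a 32-bit unsigned integer in XDR form (4 bytes, big-endian). */
size_t xdr_put_u32(uint8_t *out, uint32_t v)
{
    out[0] = (uint8_t)(v >> 24);
    out[1] = (uint8_t)(v >> 16);
    out[2] = (uint8_t)(v >> 8);
    out[3] = (uint8_t)v;
    return 4;
}

/* Write a 64-bit unsigned "hyper" in XDR form (8 bytes, big-endian). */
size_t xdr_put_u64(uint8_t *out, uint64_t v)
{
    size_t n = xdr_put_u32(out, (uint32_t)(v >> 32));
    return n + xdr_put_u32(out + n, (uint32_t)v);
}

/* Encode one record. Returns bytes written, or 0 if the buffer is too small. */
size_t flow_record_encode(const struct flow_record *r, uint8_t *out, size_t cap)
{
    size_t n = 0;

    if (cap < FLOW_RECORD_XDR_SIZE)
        return 0;
    n += xdr_put_u32(out + n, r->protocol);
    n += xdr_put_u32(out + n, r->src_ipv4);
    n += xdr_put_u32(out + n, r->dst_ipv4);
    n += xdr_put_u32(out + n, r->service_port);
    n += xdr_put_u64(out + n, r->byte_count);
    n += xdr_put_u32(out + n, r->srtt_us);
    return n;
}
```

Writing bytes explicitly, rather than copying the struct, keeps the output independent of the host's endianness and struct padding, which matters when agents and collectors run on different builds.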

The collectors then internally aggregate some of the data to remove any redundancies and send it to our highly scalable in-house Kafka cluster. From here the data follows two main paths. First, it is consumed by a job running on Samza, our stream processing system, which transforms the data and extracts service-to-service dependencies to determine the upstream and downstream dependencies for any given service, along with traffic attributes like Tx/Rx bytes, packet retransmission count, and RTT. Second, it is ETL-ed into our HDFS datastore, on which we run various analytics jobs.

Flow aggregation

In the current deployment, Skyfall agents across the fleet produce around 12M events per second to the InFlow collectors. After aggregation, this drops to 1.4M messages per second produced to Kafka. The aggregation logic keys each flow or event by a 4-tuple: protocol, source IP, destination IP, and service port. Either the source port or the destination port is identified as the service port, and the other is treated as an ephemeral port. The Skyfall agent itself decides which port is the service port, since it already maintains a list of listening ports on the host. On aggregation, we take the ephemeral port associated with the flow and append it to the ephemeral ports list of the aggregated event. This approach reduced the Kafka message count by almost 70% compared to the simpler 5-tuple (protocol, source IP, source port, destination IP, destination port) aggregation.
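The 4-tuple aggregation can be sketched in a few lines of C. This is a simplified illustration under several assumptions, not the InFlow collector code: all type and function names are hypothetical, the table uses a linear scan where a real collector would use a hash map, byte counts are summed as one plausible way to merge events, and the rule for picking the service port (prefer the local port if the host listens on it) is a guess at how a listening-ports list might be used.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_FLOWS      256  /* arbitrary capacity for this sketch */
#define MAX_EPHEMERAL  16   /* arbitrary cap on ports kept per flow */

/* Aggregation key: protocol, source IP, destination IP, service port. */
struct flow_key {
    uint8_t  protocol;
    uint32_t src_ip;
    uint32_t dst_ip;
    uint16_t service_port;
};

struct agg_flow {
    struct flow_key key;
    uint64_t bytes;
    uint16_t ephemeral_ports[MAX_EPHEMERAL];
    size_t   n_ephemeral;
};

struct agg_table {
    struct agg_flow flows[MAX_FLOWS];
    size_t n_flows;
};

/* The service port is the local port if the host listens on it, else the remote one. */
uint16_t pick_service_port(uint16_t local, uint16_t remote,
                           const uint16_t *listening, size_t n_listening)
{
    for (size_t i = 0; i < n_listening; i++)
        if (listening[i] == local)
            return local;
    return remote;
}

/* Compare field by field; memcmp would also compare struct padding. */
static int key_equal(const struct flow_key *a, const struct flow_key *b)
{
    return a->protocol == b->protocol && a->src_ip == b->src_ip &&
           a->dst_ip == b->dst_ip && a->service_port == b->service_port;
}

/* Merge one event into the table. Returns the entry, or NULL if the table is full. */
struct agg_flow *agg_add(struct agg_table *t, const struct flow_key *k,
                         uint16_t ephemeral_port, uint64_t bytes)
{
    struct agg_flow *f = NULL;

    for (size_t i = 0; i < t->n_flows; i++) {
        if (key_equal(&t->flows[i].key, k)) {
            f = &t->flows[i];
            break;
        }
    }
    if (f == NULL) {
        if (t->n_flows == MAX_FLOWS)
            return NULL;
        f = &t->flows[t->n_flows++];
        f->key = *k;
        f->bytes = 0;
        f->n_ephemeral = 0;
    }
    f->bytes += bytes;
    /* Append the ephemeral port; ports beyond the cap are dropped. */
    if (f->n_ephemeral < MAX_EPHEMERAL)
        f->ephemeral_ports[f->n_ephemeral++] = ephemeral_port;
    return f;
}
```

Two connections from the same client to the same service port differ only in their ephemeral port, so they collapse into one entry here, whereas a 5-tuple key would keep them as two separate messages.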

The following graph indicates the huge drop in the rate of Kafka messages produced after introducing the flow aggregation.

  • Kafka flow aggregation graph

On the x-axis we have timestamps. On the y-axis we have the number of messages produced to Kafka per second.

Challenges and learnings

Given the massive scale at which LinkedIn operates, deploying an always-running agent across hundreds of thousands of servers comes with a number of challenges. Below are the ones we have faced so far in our journey with Skyfall.

Operating systems with different kernel versions

To run the Skyfall agent across different kernel versions, we had to deal with various Linux kernel struct rearrangements and modifications across those versions. Unfortunately, since most of our machines don’t have BTF (BPF Type Format) support yet, we could not leverage the BPF CO-RE (Compile Once, Run Everywhere) approach. Fortunately, we only had to handle these modifications and struct rearrangements across a handful of kernel versions. Because eBPF is usually compiled from C, using clang, and linked into an ELF (Executable and Linkable Format) binary file, we were able to compile the kernel-space program against a small number of different kernel headers to generate kernel-specific ELF binary files. The user-space program only had to load the appropriate ELF binary during the agent’s startup, according to the kernel version of the host.
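The startup-time selection can be as simple as matching the running kernel's release string against a small table. The sketch below shows one way this might look. The kernel versions, file names, and function names are all made up for illustration and do not reflect the actual Skyfall build.

```c
#include <stddef.h>
#include <string.h>
#include <sys/utsname.h>

/* Hypothetical mapping from kernel release prefix to a prebuilt ELF object.
 * Neither the versions nor the file names come from the Skyfall agent. */
struct kernel_object {
    const char *release_prefix;
    const char *elf_path;
};

static const struct kernel_object KERNEL_OBJECTS[] = {
    { "4.19.", "skyfall_4_19.o" },
    { "5.4.",  "skyfall_5_4.o"  },
    { "5.10.", "skyfall_5_10.o" },
};

/* Return the object built for this kernel release, or NULL if unsupported.
 * The trailing dot in each prefix keeps "5.1." from matching "5.10.x". */
const char *select_elf_for_release(const char *release)
{
    size_t n = sizeof(KERNEL_OBJECTS) / sizeof(KERNEL_OBJECTS[0]);

    for (size_t i = 0; i < n; i++) {
        const char *prefix = KERNEL_OBJECTS[i].release_prefix;
        if (strncmp(release, prefix, strlen(prefix)) == 0)
            return KERNEL_OBJECTS[i].elf_path;
    }
    return NULL;
}

/* Look up the running kernel via uname(2) and pick the matching object. */
const char *select_elf_for_host(void)
{
    struct utsname u;

    if (uname(&u) != 0)
        return NULL;
    return select_elf_for_release(u.release);
}
```

Returning NULL for an unknown kernel lets the agent refuse to start rather than load an object whose struct offsets may not match the running kernel.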

Handling performance overhead

Our initial implementation traced protocol-specific sendmsg and recvmsg functions, so the CPU overhead was proportional to the rate at which these functions were called. On most servers with normal application workloads we saw very low CPU usage, under 3% of a single CPU core. On hosts running very high throughput applications and database services, however, we saw high CPU utilization, with over 70k function calls being traced per second. CPU usage on some of these hosts went above 50% of a single CPU core.

To solve this problem, we switched to tracing lifecycle functions like protocol-specific accept, connect, and close/disconnect functions, instead of the sendmsg and recvmsg functions, because these lifecycle functions are called far less often. This cut down the CPU overhead by almost 90%. On most of the hosts that were previously reporting higher CPU usage, the agent now runs well under 5% of a single CPU core. Along with this, we have also adjusted the nice and ionice settings for the agent process to lower its scheduling priority and reduce contention with application workloads, both in terms of CPU time and disk I/O load.
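For reference, a process can apply the equivalent of nice and ionice to itself at startup. The sketch below is one possible way to do that on Linux and is not taken from the Skyfall agent: the function names are hypothetical, and the choice of the best-effort I/O class is an assumption. Note that a higher nice value and a higher best-effort level both mean lower priority.

```c
#include <sys/resource.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Values from the kernel's ioprio interface; glibc provides no wrapper. */
#define IOPRIO_CLASS_SHIFT 13
#define IOPRIO_CLASS_BE    2   /* best-effort I/O scheduling class */
#define IOPRIO_WHO_PROCESS 1

/* Pack an I/O class and a level (0 = highest, 7 = lowest) into one value. */
int ioprio_value(int io_class, int level)
{
    return (io_class << IOPRIO_CLASS_SHIFT) | level;
}

/* Lower the calling process's CPU and I/O priority.
 * Returns 0 on success, -1 if either call fails (errno is set). */
int deprioritize_self(int nice_value, int io_level)
{
    /* Same effect as running the process under `nice -n <nice_value>`. */
    if (setpriority(PRIO_PROCESS, 0, nice_value) != 0)
        return -1;

    /* Same effect as `ionice -c 2 -n <io_level>` on this process. */
    if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
                ioprio_value(IOPRIO_CLASS_BE, io_level)) != 0)
        return -1;

    return 0;
}
```

An agent could call deprioritize_self(10, 7) early in main, for example. Raising the nice value needs no special privileges, so this works for an unprivileged process as well.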

Where can this data be used?

The collected network flows along with the TCP/IP statistics such as retransmits, RTTs, and Tx/Rx bytes will be ingested by our analytics jobs to maximize coverage and get more granular insights into our traffic metrics, to aid in the following use cases:

  • Pinpointing network bottlenecks with granular app/service level information. This helps in determining whether the reported issue is caused by the network or by other elements of the stack.
  • Analyzing top traffic-intensive services and traffic patterns for capacity planning. During major switch link failures this helps in avoiding disruption of critical services.
  • Constructing a service-to-service dependency graph to identify highly interconnected services, services interacting across security zone boundaries, unmanaged applications, and more.

What’s next?

We are also looking to leverage eBPF’s capabilities for tracing security events to capture suspicious and malicious activities on hosts, like privilege escalations and suspicious binary and module loads. We will also be tracing via uprobes, which allow us to intercept a userspace program dynamically, to trace shell command lines.

Conclusion

eBPF is a powerful tool for the observability space and helps us with collecting valuable data points with minimal overhead. It has been both a challenging and exciting journey working with eBPF at LinkedIn. With our initial efforts and learnings along the way, we have been able to establish a solid foundation for leveraging eBPF for a variety of use cases and are looking forward to building interesting tools on top of the rich dataset it provides.

Acknowledgements

Skyfall is the result of continued efforts from a lot of people within the Infrastructure Development team, both past and present. I would like to thank Varoun P, Haribabu Viswanathan, Ananya Shandilya, Manish Arora, and Sunil Thunuguntla.
